Daily Log for #alfresco IRC Channel

Alfresco discussion and collaboration. Stick around a few hours after asking a question.

Official support for Enterprise subscribers: support.alfresco.com.

Joining the Channel:

Join in the conversation by getting an IRC client and connecting to #alfresco at Freenode. Or you can use the IRC web chat.

More information about the channel is in the wiki.

Getting Help

More help is available in this list of resources.

Daily Log for #alfresco

2020-01-09 08:17:06 GMT <angelborroy> News for the Community

2020-01-09 08:17:35 GMT <angelborroy> Eddie May has joined Alfresco as Community Manager replacing Kristen

2020-01-09 10:14:20 GMT <alfresco-discord> <kumar> Hi all, I have a doubt: in share config we use the form id="" under the <config evaluator="aspect" condition="ms:metadata">

2020-01-09 10:14:58 GMT <alfresco-discord> <kumar> <forms> <form id="doclib-common-consumer-dashboard">

2020-01-09 10:18:07 GMT <alfresco-discord> <kumar> we are using the same aspect under different site presets, so we want to show the properties in a different order for each site

2020-01-09 10:19:47 GMT <alfresco-discord> <kumar> when I tried the config above, the properties did not show up on those sites, so I gather this will not work, but I want to confirm: is this the correct way or not?

2020-01-09 10:37:08 GMT <alfresco-discord> <monica> @kumar if you want to override this, then add replace="true" attribute to config like <config evaluator="aspect" condition="ms:metadata" replace="true">
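
A minimal sketch of what monica's suggestion might look like in a Share configuration file, combining the form id kumar posted with the replace="true" attribute. The field ids (ms:exampleFieldB, ms:exampleFieldA) are hypothetical placeholders, and the log does not confirm whether this resolves kumar's per-site ordering problem.

```xml
<!-- Sketch only: the field ids below are hypothetical and not taken from the log -->
<alfresco-config>
  <!-- replace="true" asks Share to replace, rather than merge with, earlier config for this aspect -->
  <config evaluator="aspect" condition="ms:metadata" replace="true">
    <forms>
      <!-- form id as posted by kumar -->
      <form id="doclib-common-consumer-dashboard">
        <field-visibility>
          <!-- fields are listed here in the order they should be shown -->
          <show id="ms:exampleFieldB" />
          <show id="ms:exampleFieldA" />
        </field-visibility>
      </form>
    </forms>
  </config>
</alfresco-config>
```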

2020-01-09 11:03:46 GMT <AFaust> Currently at a customer and preparing for a web session with Abbyy by trying out potential alternative (cloud-based) OCR services. Does anyone here have any good suggestions about OCR services (should include zonal OCR / configurable or trainable extraction + classification capabilities) that I/we should look at?

2020-01-09 11:03:59 GMT <AFaust> Currently playing around with a trial of docparser

2020-01-09 11:05:59 GMT <alfresco-discord> <yreg> There is tesseract, the obvious option

2020-01-09 11:06:06 GMT <alfresco-discord> <yreg> Textract from AWS

2020-01-09 11:06:32 GMT <alfresco-discord> <yreg> and in my experience, for Arabic locale, Abbyy wins by far both for recognition and for rendering

2020-01-09 11:07:21 GMT <alfresco-discord> <yreg> haven't tried other locales with it

2020-01-09 11:08:38 GMT <AFaust> tesseract is too low-level

2020-01-09 11:09:20 GMT <AFaust> ...that's why I mentioned zonal OCR + configurability / training, which you'd have to build custom with tesseract

2020-01-09 11:09:51 GMT <AFaust> Right - forgot about Textract, which would be the first time to try this...

2020-01-09 11:11:05 GMT <alfresco-discord> <yreg> I had to do some advanced tesseract manipulation during summer

2020-01-09 11:11:21 GMT <alfresco-discord> <yreg> and it wasn't that bad

2020-01-09 11:11:57 GMT <alfresco-discord> <yreg> check this project out, it's awesome, and has a quite extensive documentation : https://github.com/jbarlow83/OCRmyPDF/

2020-01-09 11:11:58 GMT <alfbot> Title:GitHub - jbarlow83/OCRmyPDF: OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched (at github.com)

2020-01-09 11:25:46 GMT <alfresco-discord> <binduwavell> Axel, Abbyy does a great job. Depending on what you're doing, Nuance may outperform it.

2020-01-09 13:44:42 GMT <angelborroy> @AFaust did you try Ephesoft?

2020-01-09 13:44:50 GMT <angelborroy> I worked with it some years ago

2020-01-09 13:45:08 GMT <angelborroy> Not a bad product, but oriented toward scanning huge volumes of documents

2020-01-09 14:09:48 GMT <AFaust> Based on my past experience with Ephesoft, this doesn't really fit what the customer wants to do. Ideally, Alfresco sends new documents to a service to automatically extract data without (much) user interaction and separate data storage (in a mailroom solution like Ephesoft)

2020-01-09 14:10:54 GMT <AFaust> That's why docparser (and textract) would match quite well. Abbyy I am not so sure about yet because of abysmal public information, but we have a web session tomorrow to address questions....

2020-01-09 14:11:38 GMT <AFaust> I / we also don't like the Windows-reliance of Abbyy products, and the legacy licensing model...

2020-01-09 14:12:33 GMT <alfresco-discord> <yreg> although I only used the desktop client of Abbyy, I think they listed a server component on their site as well

2020-01-09 14:14:03 GMT <alfresco-discord> <yreg> but indeed either good ol' tesseract (and believe me, it's not that hard to manage it, and continuously enhance models as well) or Textract (if you are looking for a managed solution with support) is better suited for this use case

2020-01-09 14:18:26 GMT <AFaust> Oh boy, of course AWS is extremely American-centric: "Amazon Textract can detect Latin-script characters from the standard English alphabet and ASCII symbols."

2020-01-09 14:20:08 GMT <AFaust> Already saw the kind of problems docparser has with German Umlauts, and am quite curious to see Abbyy, if we can get a no-up-front setup-hassle trial / test.

2020-01-09 14:21:13 GMT <AFaust> It still amazes me that OCR products have the same kind of problems they had 10 years ago (when I last took a more intensive look at them), despite all the AI and ML stickers marketing puts on these products nowadays...

2020-01-09 14:35:24 GMT <alfresco-discord> <yreg> For textract, don't take anything for granted, try it out first

2020-01-09 14:35:56 GMT <alfresco-discord> <yreg> It could actually support those weird characters upon the input of a valid hint for the language

2020-01-09 14:43:10 GMT <AFaust> Sure, trial will definitely occur before making a choice. It was just an input to manage the expectations of the customer...

2020-01-09 14:49:05 GMT <alfresco-discord> <dgradecak> AFaust: Abbyy works well with Croatian characters, so I guess umlauts should be fine too

2020-01-09 14:49:15 GMT <alfresco-discord> <dgradecak> ČĆŠ (if you can read that)

2020-01-09 14:59:00 GMT <alfresco-discord> <binduwavell> Axel, Nuance does have a Linux server version.

2020-01-09 15:03:32 GMT <AFaust> Can you give me a link to Nuance, because what I found via Google did not feel like an OCR focused product suite...

2020-01-09 15:04:18 GMT <AFaust> You mean OmniPage, right? The (cached) URLs I got in the search result always redirected to the more generic company web page...

2020-01-09 15:05:45 GMT <AFaust> Now I found a search result which redirected me to Kofax despite clearly showing a nuance.com URL in Google...

2020-01-09 15:09:43 GMT <AFaust> Ah, OmniPage was sold to Kofax in 2019, so that's part of the confusion and redirects, so looks like Nuance no longer offers the OCR product

2020-01-09 19:31:28 GMT <hi-ko> AFaust: My experience is: if money doesn't matter, Kofax and Abbyy have the best results in recognition quality, tooling and workflow. Also a very good (if not better) approach is to split OCR and zonal recognition.

2020-01-09 19:33:40 GMT <hi-ko> we have had good experience chaining plain OCR (I prefer Abbyy, which also has a very cheap Linux CLI package) with PDF-based extraction in pdfmdx, which is like Kofax but for PDF

2020-01-09 19:36:36 GMT <hi-ko> OmniPage is cheap but has very bad recognition.

End of Daily Log

The other logs are at http://esplins.org/hash_alfresco