Daily Log for #alfresco IRC Channel

Alfresco discussion and collaboration. Stick around a few hours after asking a question.

Official support for Enterprise subscribers: support.alfresco.com.

Joining the Channel:

Join in the conversation by getting an IRC client and connecting to #alfresco at Freenode. Or you can use the IRC web chat.

More information about the channel is in the wiki.

Getting Help

More help is available in this list of resources.

Daily Log for #alfresco

2020-01-13 15:32:10 GMT <fwu2018> hello all

2020-01-13 15:32:41 GMT <fwu2018> afaust, that problem I had yesterday about ADFS. It seems it was a client problem. Testing with Curl seems to work as expected.

2020-01-13 15:39:42 GMT <fwu2018> ppl, I realize that Alfresco can extract text from some documents, like pdf. Then it is possible to search for documents based on that text. What property is Alfresco using to index this data?

2020-01-13 15:39:57 GMT <angelborroy> No property

2020-01-13 15:40:03 GMT <angelborroy> The text is stored in SOLR

2020-01-13 15:40:23 GMT <fwu2018> angelborroy hello!

2020-01-13 15:41:28 GMT <fwu2018> my question is because we have some image PDFs from which we are getting text with OCR. When indexing that document with CMIS we would like to index that data, so that we can search it from a standard generic search

2020-01-13 15:41:58 GMT <angelborroy> You need to create a readable PDF

2020-01-13 15:42:00 GMT <fwu2018> I can set that data to a custom field, but then I must search explicitly for that field values

2020-01-13 15:42:21 GMT <angelborroy> Yes, you need to create a PDF with a text layer

2020-01-13 15:42:29 GMT <fwu2018> angelborroy, but I already have the text. so I would like to avoid PDF conversion

2020-01-13 15:42:34 GMT <angelborroy> So SOLR can extract that text and index it

2020-01-13 15:42:45 GMT <angelborroy> There is no way to do that

2020-01-13 15:42:51 GMT <fwu2018> ok, so I cant index it by myself?

2020-01-13 15:43:10 GMT <angelborroy> I don’t think soy

2020-01-13 15:43:16 GMT <angelborroy> soy > so

2020-01-13 15:44:04 GMT <fwu2018> the OCR we do can get some data, but some data will not be very good. If I convert the PDF, I will work inside Alfresco with a strange PDF

2020-01-13 15:44:29 GMT <fwu2018> or not very readable PDF

2020-01-13 15:44:49 GMT <angelborroy> I don’t get your point

2020-01-13 15:45:01 GMT <angelborroy> You have a PDF with images and text, right?

2020-01-13 15:45:10 GMT <angelborroy> Why not put a layer with the text in that PDF?

2020-01-13 15:45:18 GMT <angelborroy> Why would the PDF be “not very readable”?

2020-01-13 15:45:41 GMT <fwu2018> so, a layer, that is not visible?

2020-01-13 15:45:47 GMT <angelborroy> No

2020-01-13 15:46:26 GMT <angelborroy> This addon is producing something similar

2020-01-13 15:46:28 GMT <angelborroy> https://github.com/keensoft/alfresco-simple-ocr

2020-01-13 15:46:29 GMT <alfbot> Title:GitHub - keensoft/alfresco-simple-ocr: Simple OCR action for Alfresco (at github.com)

2020-01-13 15:47:01 GMT <fwu2018> well, if it was not visible that would be great. The problem is that OCR just gets some data. Most will have strange characters. that is why I don't want the users to see it.

2020-01-13 15:48:37 GMT <fwu2018> angelborroy, we would like to keep the OCR extracting outside Alfresco, even outside the Alfresco machine

2020-01-13 15:48:43 GMT <fwu2018> we are using kofax for this

2020-01-13 15:49:05 GMT <angelborroy> You can produce a Readable PDF with Kofax (I guess)

2020-01-13 15:49:12 GMT <fwu2018> but, if I can't index this data that comes from outside the way Alfresco/solr does, then I have a problem

2020-01-13 15:49:48 GMT <fwu2018> angelborroy, but that readable PDF will get stange PDF. Not all text will be recognized as expected.

2020-01-13 15:50:02 GMT <fwu2018> stange PDF = strange data

2020-01-13 15:50:12 GMT <angelborroy> Again, it’s an invisible layer on top of your images

2020-01-13 15:50:23 GMT <angelborroy> The user will not see that layer, but it is used when searching

2020-01-13 15:50:31 GMT <fwu2018> ok, if it is invisible, then ok!

2020-01-13 15:50:42 GMT <fwu2018> I thought you said it was not invisible

2020-01-13 15:50:49 GMT <fwu2018> sorry

2020-01-13 15:51:05 GMT <fwu2018> then I will give that a try ;)

2020-01-13 15:51:29 GMT <angelborroy> I think Kofax calls that “Searchable PDF”

2020-01-13 15:52:16 GMT <fwu2018> ok, I will try. the problem is that we will not be able to import PDF/A

2020-01-13 15:52:32 GMT <angelborroy> right

2020-01-13 15:52:40 GMT <fwu2018> that may be a problem

2020-01-13 15:52:44 GMT <angelborroy> You can always create a version

2020-01-13 15:52:54 GMT <angelborroy> So version 1.0 is PDF/A and version 1.1 is searchable

2020-01-13 15:55:12 GMT <fwu2018> I understand, but that is why I was thinking about receiving that data in a custom property and then try to index it like Alfresco/solr does :(

2020-01-13 15:55:44 GMT <fwu2018> well, Im seeing that pdf/a supports layers

2020-01-13 15:55:50 GMT <angelborroy> I guess the only way would be to override GetContent webscript in alfresco-remote-api

2020-01-13 15:56:33 GMT <angelborroy> https://github.com/Alfresco/alfresco-remote-api/blob/master/src/main/java/org/alfresco/repo/web/scripts/solr/NodeContentGet.java

2020-01-13 15:56:34 GMT <alfbot> Title:alfresco-remote-api/NodeContentGet.java at master · Alfresco/alfresco-remote-api · GitHub (at github.com)

2020-01-13 15:57:00 GMT <angelborroy> You can check your custom property at this point

2020-01-13 15:57:53 GMT <fwu2018> or in my workflow flow I could write that data as content of the node. But I just dont understand how

2020-01-13 15:58:33 GMT <angelborroy> I guess the approach can be

2020-01-13 15:58:35 GMT <fwu2018> that is my main question: how this is being done by Alfresco when the action to get the PDF data is executed

2020-01-13 15:58:58 GMT <angelborroy> 1. Create a custom property for an aspect / type (let’s say pdf:textContent)

2020-01-13 15:58:59 GMT <fwu2018> so I can replicate in my own workflow logic

2020-01-13 15:59:12 GMT <angelborroy> 2. Set that property from your CMIS invocation

2020-01-13 15:59:25 GMT <fwu2018> im doing that already

2020-01-13 15:59:42 GMT <angelborroy> 3. Override the web script NodeContentGet to return the property if it exists

2020-01-13 15:59:55 GMT <fwu2018> actually CMIS from kofax doesnt like properties from aspects, but that is another problem.

2020-01-13 15:59:57 GMT <angelborroy> (avoiding invoking the Transformation Service at this point)
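
[Editor's note] The three steps proposed above boil down to one decision: if the node carries OCR text in a custom property, hand that text to the indexer; otherwise fall back to the normal text extraction. The real NodeContentGet web script is Java code inside the repository, so the Python below is only a minimal sketch of that decision. The function name, the callback, and the dictionary of properties are hypothetical; only the property name pdf:textContent comes from the conversation.

```python
from typing import Callable, Mapping, Optional

# Property name suggested in the chat; the rest of this sketch is hypothetical.
OCR_TEXT_PROPERTY = "pdf:textContent"


def text_for_indexing(
    properties: Mapping[str, Optional[str]],
    transform_to_text: Callable[[], str],
) -> str:
    """Return the text that should be sent to the search index for a node.

    properties        -- the node's properties, keyed by prefixed name
    transform_to_text -- stand-in for the regular text extraction step;
                         it is only called when no OCR text is available
    """
    ocr_text = properties.get(OCR_TEXT_PROPERTY)
    if ocr_text is not None and ocr_text.strip():
        # OCR text was supplied from outside (e.g. via CMIS): use it as-is.
        return ocr_text
    # No usable OCR text: fall back to the normal extraction path.
    return transform_to_text()


def fake_transform() -> str:
    # Placeholder for the transform step, used only for this demonstration.
    return "text extracted by the transform step"


print(text_for_indexing({OCR_TEXT_PROPERTY: "INVOICE 2020-001"}, fake_transform))
print(text_for_indexing({}, fake_transform))
```

Passing the extraction step as a callback keeps it lazy, which mirrors the point made in the chat: when the property is present, the transformation is never invoked.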

2020-01-13 16:02:54 GMT <fwu2018> but how is the text indexed for those searchable PDFs? a property of the node? another associated node? as the node content?

2020-01-13 16:03:36 GMT <angelborroy> SOLR is requesting Alfresco Repository to get the Text that must be indexed for a Node

2020-01-13 16:03:54 GMT <angelborroy> The Web Script invoked is the one I pasted (NodeContentGet)

2020-01-13 16:04:26 GMT <angelborroy> Once the repo generates the Text (using the Transform Service), it’s stored in SOLR as a property

2020-01-13 16:04:34 GMT <angelborroy> But has no storage in the repository

2020-01-13 16:04:40 GMT <angelborroy> So the text only lives in SOLR
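
[Editor's note] The flow described above can be pictured with a toy model: the index asks the repository for a node's text, the repository generates that text on request without keeping it, and the index holds the only copy. This is a rough illustration under simplifying assumptions, not Alfresco or Solr code; the class and method names are hypothetical, and decoding bytes stands in for the real text extraction.

```python
from typing import Dict, List


class Repository:
    """Toy repository: keeps original files, never the extracted text."""

    def __init__(self) -> None:
        self.binaries: Dict[str, bytes] = {}

    def store(self, node_id: str, data: bytes) -> None:
        self.binaries[node_id] = data

    def text_for(self, node_id: str) -> str:
        # Stand-in for text extraction: the text is produced on request
        # and returned to the caller, but not saved in the repository.
        return self.binaries[node_id].decode("utf-8", errors="ignore")


class SearchIndex:
    """Toy index: pulls text from the repository and keeps its own copy."""

    def __init__(self, repo: Repository) -> None:
        self.repo = repo
        self.indexed_text: Dict[str, str] = {}

    def index_node(self, node_id: str) -> None:
        self.indexed_text[node_id] = self.repo.text_for(node_id)

    def search(self, term: str) -> List[str]:
        needle = term.lower()
        return [nid for nid, text in self.indexed_text.items()
                if needle in text.lower()]


repo = Repository()
repo.store("node-1", b"Invoice for January")
index = SearchIndex(repo)
index.index_node("node-1")
print(index.search("invoice"))
```

In this model the repository has no attribute holding extracted text at all, which is the point made in the chat: dropping the index means the text has to be regenerated from the stored files.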

2020-01-13 16:07:49 GMT <fwu2018> ok, I will have a look at that. Thank you angelborroy!

2020-01-13 16:07:56 GMT <fwu2018> brb

End of Daily Log

The other logs are at http://esplins.org/hash_alfresco