Alfresco discussion and collaboration. Stick around a few hours after asking a question.
Official support for Enterprise subscribers: support.alfresco.com.
Join in the conversation by getting an IRC client and connecting to #alfresco at Freenode. Or you can use the IRC web chat.
More information about the channel is in the wiki.
More help is available in this list of resources.
2020-01-13 15:32:10 GMT <fwu2018> hello all
2020-01-13 15:32:41 GMT <fwu2018> afaust, that problem I had yesterday about ADFS. It seems it was a client problem. Testing with Curl seems to work as expected.
2020-01-13 15:39:42 GMT <fwu2018> ppl, I realize that Alfresco can extract text from some documents, like pdf. Then it is possible to search for documents based on that text. What property is Alfresco using to index this data?
2020-01-13 15:39:57 GMT <angelborroy> No property
2020-01-13 15:40:03 GMT <angelborroy> The text is stored in SOLR
2020-01-13 15:40:23 GMT <fwu2018> angelborroy hello!
2020-01-13 15:41:28 GMT <fwu2018> my question is because we have some image pdf from which we are getting text from OCR. When indexing that document with CMIS we would like to index that data, so that we can search it from a standard generic search
2020-01-13 15:41:58 GMT <angelborroy> You need to create a readable PDF
2020-01-13 15:42:00 GMT <fwu2018> I can set that data to a custom field, but then I must search explicitly for that field values
2020-01-13 15:42:21 GMT <angelborroy> Yes, you need to create a PDF with a text layer
2020-01-13 15:42:29 GMT <fwu2018> angelborroy, but I already have the text. so would like to avoid PDF conversion
2020-01-13 15:42:34 GMT <angelborroy> So SOLR can extract that text and index it
2020-01-13 15:42:45 GMT <angelborroy> There is no way to do that
2020-01-13 15:42:51 GMT <fwu2018> ok, so I cant index it by myself?
2020-01-13 15:43:10 GMT <angelborroy> I don’t think soy
2020-01-13 15:43:16 GMT <angelborroy> soy > so
2020-01-13 15:44:04 GMT <fwu2018> the OCR we do can get some data, but some data will not be very good. If I convert the PDF, I will work inside Alfresco with a strange PDF
2020-01-13 15:44:29 GMT <fwu2018> or not very readable PDF
2020-01-13 15:44:49 GMT <angelborroy> I don’t get your point
2020-01-13 15:45:01 GMT <angelborroy> You have a PDF with images and text, right?
2020-01-13 15:45:10 GMT <angelborroy> Why don’t you put a layer with the text in that PDF?
2020-01-13 15:45:18 GMT <angelborroy> Why will the PDF be “not very readable”?
2020-01-13 15:45:41 GMT <fwu2018> so, a layer, that is not visible?
2020-01-13 15:45:47 GMT <angelborroy> No
2020-01-13 15:46:26 GMT <angelborroy> This addon is producing something similar
2020-01-13 15:46:28 GMT <angelborroy> https://github.com/keensoft/alfresco-simple-ocr
2020-01-13 15:46:29 GMT <alfbot> Title:GitHub - keensoft/alfresco-simple-ocr: Simple OCR action for Alfresco (at github.com)
2020-01-13 15:47:01 GMT <fwu2018> well, if it was not visible that would be great. The problem is that OCR just gets some data. Most will have strange characters. that is why I dont want the users to see it.
2020-01-13 15:48:37 GMT <fwu2018> angelborroy, we would like to keep the OCR extracting outside Alfresco, even outside the Alfresco machine
2020-01-13 15:48:43 GMT <fwu2018> we are using kofax for this
2020-01-13 15:49:05 GMT <angelborroy> You can produce a Readable PDF with Kofax (I guess)
2020-01-13 15:49:12 GMT <fwu2018> but, if I cant index this data that comes from outside like Alfresco/solr does, then I have a problem
2020-01-13 15:49:48 GMT <fwu2018> angelborroy, but that readable PDF will get stange PDF. Not all text will be recognized as expected.
2020-01-13 15:50:02 GMT <fwu2018> stange PDF = strange data
2020-01-13 15:50:12 GMT <angelborroy> Again, it’s an invisible layer on top of your images
2020-01-13 15:50:23 GMT <angelborroy> The user will not see that layer, but it is used when searching
2020-01-13 15:50:31 GMT <fwu2018> ok, if it is invisible, then ok!
2020-01-13 15:50:42 GMT <fwu2018> I thought you said it was not invisible
2020-01-13 15:50:49 GMT <fwu2018> sorry
2020-01-13 15:51:05 GMT <fwu2018> then I will give that a try ;)
2020-01-13 15:51:29 GMT <angelborroy> I think Kofax calls that “Searchable PDF”
2020-01-13 15:52:16 GMT <fwu2018> ok, I will try. the problem is that we will not be able to import PDF/A
2020-01-13 15:52:32 GMT <angelborroy> right
2020-01-13 15:52:40 GMT <fwu2018> that may be a problem
2020-01-13 15:52:44 GMT <angelborroy> You can always create a version
2020-01-13 15:52:54 GMT <angelborroy> So version 1.0 is PDF/A and version 1.1 is searchable
2020-01-13 15:55:12 GMT <fwu2018> I understand, but that is why I was thinking about receiving that data in a custom property and then try to index it like Alfresco/solr does :(
2020-01-13 15:55:44 GMT <fwu2018> well, Im seeing that pdf/a supports layers
2020-01-13 15:55:50 GMT <angelborroy> I guess the only way would be to override GetContent webscript in alfresco-remote-api
2020-01-13 15:56:33 GMT <angelborroy> https://github.com/Alfresco/alfresco-remote-api/blob/master/src/main/java/org/alfresco/repo/web/scripts/solr/NodeContentGet.java
2020-01-13 15:56:34 GMT <alfbot> Title:alfresco-remote-api/NodeContentGet.java at master · Alfresco/alfresco-remote-api · GitHub (at github.com)
2020-01-13 15:57:00 GMT <angelborroy> You can check your custom property at this point
2020-01-13 15:57:53 GMT <fwu2018> or in my workflow flow I could write that data as content of the node. But I just dont understand how
2020-01-13 15:58:33 GMT <angelborroy> I guess the approach can be
2020-01-13 15:58:35 GMT <fwu2018> that is my main question: how this is being done by Alfresco when the action to get the PDF data is executed
2020-01-13 15:58:58 GMT <angelborroy> 1. Create a custom property for an aspect / type (let’s say pdf:textContent)
2020-01-13 15:58:59 GMT <fwu2018> so I can replicate in my own workflow logic
2020-01-13 15:59:12 GMT <angelborroy> 2. Set that property from your CMIS invocation
2020-01-13 15:59:25 GMT <fwu2018> im doing that already
2020-01-13 15:59:42 GMT <angelborroy> 3. Override the web script NodeContentGet to return the property if it exists
2020-01-13 15:59:55 GMT <fwu2018> actually CMIS from kofax doesnt like properties from aspects, but that is another problem.
2020-01-13 15:59:57 GMT <angelborroy> (avoiding invoking the Transformation Service at this point)
2020-01-13 16:02:54 GMT <fwu2018> but how is the text indexed for those searchable PDFs? a property of the node? another associated node? as the node content?
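The three steps above can be sketched roughly as follows. This is only an illustration in Python of the decision described in step 3, not the actual NodeContentGet code (which is Java). The property name pdf:textContent comes from the example in step 1. The function text_for_indexing, the properties mapping, and the transform callback are hypothetical stand-ins for the node's properties and the text extraction call.

```python
from typing import Callable, Mapping, Optional

# Property name taken from the example in step 1; any custom property would do.
OCR_TEXT_PROPERTY = "pdf:textContent"


def text_for_indexing(
    properties: Mapping[str, Optional[str]],
    transform: Callable[[], str],
) -> str:
    """Return the text that should be handed to the indexer for one node.

    If the node already carries OCR text in the custom property, return it
    and skip the (hypothetical) transform call. Otherwise fall back to
    extracting text from the node content.
    """
    ocr_text = properties.get(OCR_TEXT_PROPERTY)
    if ocr_text:  # property is present and non-empty
        return ocr_text
    return transform()


if __name__ == "__main__":
    def fake_transform() -> str:
        # Stand-in for the real text extraction from the PDF content.
        return "text extracted from content"

    # Node with OCR text set via CMIS: the property wins, no transform needed.
    print(text_for_indexing({OCR_TEXT_PROPERTY: "invoice lisbon"}, fake_transform))
    # Node without the property: falls back to the transform.
    print(text_for_indexing({}, fake_transform))
```

An empty property is treated the same as a missing one here, so a node whose OCR produced nothing still gets the normal extraction path. That is a design assumption of this sketch, not something stated in the discussion.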
2020-01-13 16:03:36 GMT <angelborroy> SOLR is requesting Alfresco Repository to get the Text that must be indexed for a Node
2020-01-13 16:03:54 GMT <angelborroy> The Web Script invoked is the one I pasted (NodeContentGet)
2020-01-13 16:04:26 GMT <angelborroy> Once the repo generates the Text (using the Transform Service), it’s stored in SOLR as a property
2020-01-13 16:04:34 GMT <angelborroy> But has no storage in the repository
2020-01-13 16:04:40 GMT <angelborroy> So the text only lives in SOLR
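As a toy model of the flow just described: the index asks the repository for a node's text, keeps its own tokenized copy, and the repository stores nothing extra. The sketch below is plain Python with an in-memory dictionary. ToyIndex, track, search, and fetch_text are hypothetical names for illustration and do not correspond to the real Solr or Alfresco APIs.

```python
from collections import defaultdict
from typing import Callable, Dict, Set


class ToyIndex:
    """In-memory stand-in for the search index (not Solr's real API)."""

    def __init__(self) -> None:
        # token -> set of node ids whose text contains that token
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def track(self, node_id: str, fetch_text: Callable[[str], str]) -> None:
        # The index pulls the text from the repository on request and
        # keeps only its own tokenized copy of it.
        for token in fetch_text(node_id).lower().split():
            self._postings[token].add(node_id)

    def search(self, term: str) -> Set[str]:
        # Return a copy so callers cannot mutate the index.
        return set(self._postings.get(term.lower(), set()))


if __name__ == "__main__":
    def fetch_text(node_id: str) -> str:
        # Stand-in for the repository generating text for a node on demand.
        return {"node-1": "Invoice 2020 Lisbon"}[node_id]

    index = ToyIndex()
    index.track("node-1", fetch_text)
    print(index.search("lisbon"))  # {'node-1'}
```

In this model the only place the extracted text ends up is inside ToyIndex, which mirrors the point that the text lives in SOLR and has no storage in the repository.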
2020-01-13 16:07:49 GMT <fwu2018> ok, I will have a look at that. Thank you angelborroy!
2020-01-13 16:07:56 GMT <fwu2018> brb
The other logs are at http://esplins.org/hash_alfresco