Daily Log for #alfresco IRC Channel

Alfresco discussion and collaboration. Stick around a few hours after asking a question.

Official support for Enterprise subscribers: support.alfresco.com.

Joining the Channel:

Join in the conversation by getting an IRC client and connecting to #alfresco at Freenode. Or you can use the IRC web chat.

More information about the channel is in the wiki.

Getting Help

More help is available in this list of resources.

Daily Log for #alfresco

2018-02-01 08:39:46 GMT <yreg> ~later tell mbui What's that webscript supposed to return as a response ?

2018-02-01 08:39:46 GMT <alfbot> yreg: The operation succeeded.

2018-02-01 08:40:00 GMT * yreg bids you all a good morning!

2018-02-01 10:04:18 GMT <lisa__> hi all, i am trying to find a way to render all documents to pdf so that viewing in the browser is done by pdfjs

2018-02-01 10:05:41 GMT <angelborroy> lisa__ can you explain it further?

2018-02-01 10:06:32 GMT <lisa__> all i want is to preview say .txt document by pdfJs

2018-02-01 10:07:19 GMT <lisa__> preview as pdf

2018-02-01 10:07:32 GMT <angelborroy> inside Alfresco Share?

2018-02-01 10:07:41 GMT <angelborroy> or using an external application?

2018-02-01 10:08:09 GMT <lisa__> there is an add on

2018-02-01 10:08:16 GMT <lisa__> by parashift

2018-02-01 10:08:20 GMT <angelborroy> what Alfresco version are you using?

2018-02-01 10:08:32 GMT <lisa__> i am using alfresco CE 5.2

2018-02-01 10:08:58 GMT <angelborroy> so your problem is that TXT files are not previewed in Share, right?

2018-02-01 10:09:30 GMT <lisa__> no it is previewed as is

2018-02-01 10:09:39 GMT <lisa__> i want to get it previewed as pdf

2018-02-01 10:09:50 GMT <lisa__> i did try some transformation but

2018-02-01 10:10:44 GMT <lisa__> say acb.txt first gets saved in the doclib as acb.pdf, then when i click on view in browser

2018-02-01 10:10:53 GMT <lisa__> it shows me the desired result

2018-02-01 10:14:37 GMT <angelborroy> so the problem is that text/plain is not transformed to PDF

2018-02-01 10:15:25 GMT <lisa__> all i want is a print preview action on all documents

2018-02-01 10:15:49 GMT <lisa__> and then that can be viewed as pdf on next window

2018-02-01 10:16:43 GMT <angelborroy> like this one? https://addons.alfresco.com/addons/quick-print-para-module

2018-02-01 10:16:45 GMT <alfbot> Title: Quick Print - Para Module | Alfresco Add-ons - Alfresco Customizations (at addons.alfresco.com)

2018-02-01 10:17:19 GMT <angelborroy> you can buy it, it’s really cheap

2018-02-01 10:17:42 GMT <lisa__> yes

2018-02-01 10:19:25 GMT <lisa__> but if i have to develop this and contribute it to open source, what should be the development path

2018-02-01 10:27:33 GMT <fwu> ppl!

2018-02-01 11:34:34 GMT <fwu> ppl, is this something anyone already did, or do you think it is possible: upload more or less 2.2 million documents to Alfresco (community) in 24 hours?

2018-02-01 11:34:47 GMT <fwu> each month...

2018-02-01 11:36:35 GMT <fwu> I dont want/need to browse this information in share or something like that. But I will need to find documents using search.

2018-02-01 11:37:06 GMT <fwu> I may consider enterprise also.

2018-02-01 11:37:27 GMT <fwu> I dont know if Alfresco Enterprise may help on this or not.

2018-02-01 12:18:35 GMT <AFaust> fwu: There is no difference between Enterprise and Community with regards to how many documents can be handled. Whether it can be handled by Alfresco is purely a question of your infrastructure, system setup and approach to loading that many documents

2018-02-01 12:19:43 GMT <AFaust> The only minor difference is that Enterprise comes with clustering out-of-the-box so you could scale it out. With Community, you can only scale up to deal with higher resource demands (i.e. parallel operations for mass uploads)

2018-02-01 14:12:49 GMT <angelborroy> fwu 2.2 million seems a very high number to be processed in 24 hours

2018-02-01 14:12:58 GMT <angelborroy> fwu you will need a big infrastructure for that

2018-02-01 14:16:54 GMT <yreg> and probably a few days for digesting

2018-02-01 14:18:52 GMT <yreg> AFaust, fwu should actually be able to hook multiple instances of Alfresco CE to the same Database/Content store and parallelize even further for the ingestion as long as the extra nodes are guaranteed to only be used for ingesting documents .... and all consultation/edit goes through one single node

2018-02-01 14:33:21 GMT <fwu> afaust, angelborroy, yreg thank you.

2018-02-01 14:34:30 GMT <fwu> is there any good documentation related with Alfresco clustering?

2018-02-01 14:35:38 GMT <angelborroy> fwu Alfresco clustering for Community or Enterprise?

2018-02-01 14:35:42 GMT <fwu> also, is there any benchmark related with indexing millions of documents?

2018-02-01 14:35:45 GMT <fwu> Community

2018-02-01 14:35:52 GMT <angelborroy> no, there is not

2018-02-01 14:36:59 GMT <angelborroy> fwu this is the most detailed one I know of: beecon.buzz/2017/assets/files/EF09/EF09-Installing-Alfresco-components-1-by-1.pdf

2018-02-01 14:37:05 GMT <angelborroy> fwu but it’s not clustering

2018-02-01 14:37:16 GMT <yreg> fwu, a small correction : "No, there is no clustering in community at all, at least not out of the box, and not in any version post 4.0.x"

2018-02-01 14:37:17 GMT <fwu> the problem is the math: if it is possible to index 1 document per second, I will only be able to index 86,400 in 24 hours. So, I need parallel execution. Maybe 32 processes?
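
For reference, the arithmetic behind that estimate, assuming a sustained rate of one document per second per ingestion stream:

  2,200,000 documents / 86,400 seconds per day ≈ 25.5 documents per second
  25.5 documents per second / 1 document per second per stream ≈ 26 parallel streams

So about 26 parallel streams is the theoretical minimum at that rate; 32 leaves some headroom, and a higher per-stream rate (as the numbers later in this discussion suggest is achievable) reduces the count accordingly.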

2018-02-01 14:37:54 GMT <fwu> if I imagine 32 services running in parallel... how does this map to alfresco instances?

2018-02-01 14:38:28 GMT <yreg> fwu there are a lot of flaws in your logic..

2018-02-01 14:38:46 GMT <fwu> yreg, Im sure there are :)

2018-02-01 14:39:05 GMT <yreg> IF you have great infra with SSDs and low latency network ....

2018-02-01 14:39:08 GMT <angelborroy> fwu probably you can install 32 alfresco nodes (without SOLR, LibreOffice) sharing the same database and alf_Data

2018-02-01 14:39:22 GMT <angelborroy> but you also need a big database cluster and a fast filesystem

2018-02-01 14:39:23 GMT <yreg> You can get up to a few hundred documents ingested in the same second

2018-02-01 14:39:27 GMT <fwu> angelborroy, but 32 is a lot...

2018-02-01 14:39:33 GMT <yreg> having them indexed by solr is another issue

2018-02-01 14:39:42 GMT <angelborroy> no, SOLR has to be out of the process

2018-02-01 14:39:51 GMT <angelborroy> you can index all that stuff later
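
As a rough illustration of the setup angelborroy describes (multiple repository-only nodes sharing one database and one content store, with indexing switched off during ingestion), each node's alfresco-global.properties could look something like the sketch below; hostnames, paths and pool size are placeholders, not recommendations:

  # every ingestion node points at the same shared content store and database
  dir.root=/mnt/shared/alf_data
  db.driver=org.postgresql.Driver
  db.url=jdbc:postgresql://db-host:5432/alfresco
  db.username=alfresco
  db.password=secret
  db.pool.max=100
  # no indexing while ingesting; switch back to solr and reindex afterwards
  index.subsystem.name=noindex

Note that Community nodes sharing a database without clustering do not invalidate each other's caches, which is why yreg's earlier caveat about routing all reads/edits through a single node matters.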

2018-02-01 14:42:48 GMT <fwu> angelborroy, for SOLR to be out of the process, it means I need to disable it, right?

2018-02-01 14:42:51 GMT <yreg> using our in-house stack and with only 4 parallel processes in the background ingesting documents, on a decent but not top-tier infra with a single node, we got up to 100 documents per second as a sustainable rate

2018-02-01 14:43:31 GMT <yreg> fwu, it's better to disable it during the ingestion

2018-02-01 14:43:57 GMT <yreg> and also lower the tracking frequency outside of the ingestion

2018-02-01 14:44:05 GMT <fwu> yreg, your data example is a good start. thank you

2018-02-01 14:44:26 GMT <yreg> something like 10 or 15 minutes as a replacement for the out of the box 15 seconds
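
To make that concrete: the tracking frequency yreg refers to is the cron expression the SOLR trackers use to poll the repository, set per core in solrcore.properties. A hedged sketch of the change (exact file locations depend on how SOLR is installed):

  # solrcore.properties for the alfresco (and archive) core
  # default polls every 15 seconds:
  #   alfresco.cron=0/15 * * * * ? *
  # poll every 15 minutes instead while mass ingestion is running:
  alfresco.cron=0 0/15 * * * ? *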

2018-02-01 14:44:31 GMT <fwu> but 4 parallel processes means what? 4 alfresco instances?

2018-02-01 14:44:57 GMT <yreg> no, one single Alfresco instance, 4 threads

2018-02-01 14:45:50 GMT <yreg> well technically speaking 8 threads: 4 writing metadata and nodes, and 4 others writing content and linking it to the nodes

2018-02-01 14:45:52 GMT <fwu> one database? what kind of file system?

2018-02-01 14:46:16 GMT <yreg> but that was using our proprietary stack

2018-02-01 14:46:59 GMT <yreg> I am just sharing the numbers as a reference

2018-02-01 14:47:23 GMT <fwu> ok, nice

2018-02-01 14:47:46 GMT <yreg> if you are interested in the product I am talking about or in a testdrive, feel free to get in touch through : https://xenit.eu/alfred-inflow-content-migration-alfresco/

2018-02-01 14:47:48 GMT <alfbot> Title: Content migration Alfresco - Inflow high speed content migration (at xenit.eu)

2018-02-01 14:48:12 GMT <fwu> ok, I will look at it thank you!

2018-02-01 14:48:20 GMT <angelborroy> My numbers are 4 documents per second on a simple repository with 16 GB RAM

2018-02-01 14:48:32 GMT <angelborroy> In fact about 10,000 documents per hour

2018-02-01 14:48:41 GMT <yreg> angelborroy, how many threads ?

2018-02-01 14:48:45 GMT <angelborroy> one

2018-02-01 14:48:53 GMT <angelborroy> Alfresco out-of-the-box ;-)

2018-02-01 14:49:02 GMT <fwu> looks nice angelborroy

2018-02-01 14:49:04 GMT <angelborroy> only with the repo

2018-02-01 14:49:09 GMT <yreg> were you using CMIS ?

2018-02-01 14:49:17 GMT <yreg> that's pretty low IMHO

2018-02-01 14:49:20 GMT <angelborroy> no SOLR, no thumbs, no Libreoffice

2018-02-01 14:49:28 GMT <yreg> unless the files are relatively big

2018-02-01 14:49:31 GMT <angelborroy> using BULK export-import

2018-02-01 14:50:49 GMT <yreg> still, that's too slow; using inflow we get rates of ~20 documents per second only on crappy hardware or really bad config

2018-02-01 14:51:11 GMT <yreg> Do you have any idea on how many document were handled per transaction ?

2018-02-01 14:51:19 GMT <yreg> I am not familiar with the bulk import

2018-02-01 14:51:22 GMT <angelborroy> one

2018-02-01 14:51:33 GMT <angelborroy> because I modified the code to import one document per transaction :D

2018-02-01 14:51:35 GMT <yreg> aha, that is probably it

2018-02-01 14:52:17 GMT <yreg> In my experience, there is no magical number for it, and it is usually between 150 and 300
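
For anyone following along with the built-in Bulk Filesystem Import tool, the two knobs this exchange is about (threads and nodes per transaction) are exposed as repository properties. A hedged sketch with an illustrative batch size in the 150-300 range yreg mentions; property names and defaults may differ between Alfresco versions:

  # alfresco-global.properties - streaming bulk filesystem import tuning
  bulkImport.batch.numThreads=4
  bulkImport.batch.batchSize=200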

2018-02-01 14:53:03 GMT <yreg> we have a benchmarking tool that tries a lot of variations in the configuration, and gives back the best parameter orchestration ;-)

2018-02-01 14:53:30 GMT <angelborroy> you have a lot of time (or money) to invest :-P

2018-02-01 14:54:49 GMT <fwu> the bulk import is too slow...

2018-02-01 14:55:28 GMT <fwu> yreg does alfred run on the community version of Alfresco? Or only enterprise?

2018-02-01 14:55:35 GMT <yreg> angelborroy, it's a continuous effort. What we have is a product with many clients and a shared maintenance cost ;-) if one client needs an enhancement and sponsors it, all clients benefit from it

2018-02-01 14:56:08 GMT <yreg> fwu, on both

2018-02-01 14:56:37 GMT <fwu> hmm... I believe I will look at it and try it

2018-02-01 14:57:26 GMT <fwu> I need to index 2.2 million documents each month, with a base volume of 5T of documents

2018-02-01 15:00:05 GMT <fwu> yreg can you give me a clue about how Alfred cost is calculated?

2018-02-01 15:03:07 GMT <yreg> -= THIS MESSAGE NOT LOGGED =-

2018-02-01 15:15:05 GMT <fwu> ok. it seems my usecase is the second

2018-02-01 15:15:27 GMT <fwu> with a first 5T migration maybe

2018-02-01 15:16:18 GMT <fwu> the "get a demo" of Alfred means what? someone will show how it works remotely?

2018-02-01 15:22:36 GMT <AFaust> yreg, fwu, angelborroy: After my home-office fitness break, I'd like to chime in with some numbers I have from a customer PoC ~6 years ago on Alfresco 3.4 on a non-optimized, virtualized infrastructure (though the Alfresco code was optimised / custom): ~1-1.5 million nodes ingested (from MongoDB) in an hour (Lucene indexing out-of-transaction)

2018-02-01 15:25:56 GMT <fwu> 1 million in one hour? those are really great numbers AFaust.

2018-02-01 15:26:43 GMT <fwu> with Alfresco 3.4 it is even more amazing, because the impression (and feedback) I have is that the 3.x versions were slow

2018-02-01 15:27:31 GMT <AFaust> As yreg has mentioned multiple times, infrastructure is key. I have customers with horrendous virtualised storage concepts that could barely manage 10 documents per second, and others (in the past) with proper SSD where 100+ per second was the norm...

2018-02-01 15:29:05 GMT <AFaust> fwu: There are many ways to ingest data into Alfresco. Like yreg / Xenit, who have their own dedicated product for high speed ingestion, I have employed custom code / integrations at customers to reach reasonable performance...

2018-02-01 15:30:36 GMT <AFaust> That specific project was a PoC to showcase to the customer that the platform could handle such loads. In their use case they got deliveries of updated data sets twice a day containing a couple of million entries, which needed to be ingested and integrated / processed with regards to already existing data...

2018-02-01 15:31:12 GMT <fwu> AFaust, what infrastructure did you use?

2018-02-01 15:31:28 GMT <fwu> and was it with the community version?

2018-02-01 15:31:46 GMT <AFaust> Similarly to what yreg mentioned regarding optimised parameters (i.e. documents per transaction), you can also optimise by temporarily disabling Alfresco features / services you do not need...

2018-02-01 15:33:37 GMT <AFaust> As I said, it was a virtualised environment. No special storage (their standard, non-SSD one), 2 4-core CPUs, and about 16 GiB RAM. Simple MySQL DB on a similar host (self-managed, not a lot of optimisations). At that time I was working with an Alfresco partner, so it was Enterprise.

2018-02-01 15:34:11 GMT <AFaust> But again, Enterprise or Community does not matter. There is no magic code / silver bullet in Enterprise to make things go faster...

2018-02-01 15:36:13 GMT <AFaust> To be transparent, the documents we had to store on disk were primarily simple text-based ones (no large files) containing RTF-like contents.

2018-02-01 15:37:36 GMT <fwu> I think those 2.2 million are more or less 200kb files

2018-02-01 15:37:43 GMT <fwu> pdf

2018-02-01 15:38:18 GMT <fwu> how many instances did you use?

2018-02-01 15:38:23 GMT <fwu> alfresco instances

2018-02-01 15:41:34 GMT <AFaust> 1

2018-02-01 15:41:56 GMT <AFaust> and I believe 4 threads

2018-02-01 15:42:40 GMT <fwu> so you just made some configuration changes and some software changes?

2018-02-01 15:44:06 GMT <AFaust> It was a custom implemented ingestion process - through and through. Of course some configuration changes were also involved, but most of the impact was from that custom multi-step process...

2018-02-01 15:45:25 GMT <fwu> but after indexing the documents, was it possible to search and get them a little bit later?

2018-02-01 15:48:22 GMT <AFaust> Not sure what you mean, but indexing was done as a separate step after mass ingestion. You could get the documents immediately after ingestion with DB-bound queries, but indexing (with Lucene mind you) took a while.

2018-02-01 15:50:12 GMT <fwu> ok, so I could get the document right after ingestion, but not to search them using Alfresco standard search.

2018-02-01 15:52:35 GMT <AFaust> I actually just found the PoC report in my archive.... I have to correct myself - indexing was done during the ingestion step, but code was in place to avoid in-transaction indexing. Basically it was indexed immediately but asynchronously.

2018-02-01 15:53:16 GMT <AFaust> One test run imported ~4 million document entries (patents) in ~4 hours, which includes the index time

2018-02-01 15:54:35 GMT <fwu> those are very good numbers

2018-02-01 15:54:41 GMT <AFaust> The report lists the Lucene indexing as the limiting factor of that PoC - otherwise the whole process could have been 3-4x as fast (various phases caused indexing to occur, which could have been relegated to the end in an optimal scenario)

2018-02-01 16:10:19 GMT <fwu> thank you afaust, angelborroy and yreg for your help.

2018-02-01 16:38:06 GMT <yreg> AFaust, impressive indeed, but I guess in our case, we cannot always pause Behaviours and the like, as in most cases there is a lot that could/should be done through them upon uploads

2018-02-01 16:38:36 GMT <yreg> but I agree that it is possible to squeeze even more performance..

2018-02-01 16:38:51 GMT <yreg> for special cases that require special measures

2018-02-01 16:39:41 GMT <AFaust> Behaviours were actually enabled - the rule service was not though, which can be the most significant performance hog...

2018-02-01 16:40:19 GMT <AFaust> All those "I have to check the hierarchy if there are some rules somewhere" lookups are very expensive, and most of the time they find none...

2018-02-01 16:40:21 GMT <yreg> Good to know

2018-02-01 16:40:37 GMT <yreg> was it disabled system wide or within the thread context ?

2018-02-01 16:40:43 GMT <AFaust> Within thread only
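
A minimal Java sketch of what per-thread disabling could look like in a custom ingestion component, using the public RuleService and BehaviourFilter interfaces; the class, its wiring and ingestBatch() are hypothetical placeholders, not code from the PoC discussed above:

  import org.alfresco.repo.policy.BehaviourFilter;
  import org.alfresco.service.cmr.rule.RuleService;

  public class IngestionWorker implements Runnable
  {
      private final RuleService ruleService;         // injected, e.g. via Spring
      private final BehaviourFilter behaviourFilter; // injected, e.g. via Spring

      public IngestionWorker(RuleService ruleService, BehaviourFilter behaviourFilter)
      {
          this.ruleService = ruleService;
          this.behaviourFilter = behaviourFilter;
      }

      @Override
      public void run()
      {
          ruleService.disableRules();            // rule processing off for this thread only
          // behaviourFilter.disableBehaviour(); // optional; AFaust kept behaviours enabled
          try
          {
              ingestBatch();                     // placeholder for the actual node/content writes
          }
          finally
          {
              ruleService.enableRules();
          }
      }

      private void ingestBatch()
      {
          // hypothetical: create nodes and stream content here, inside retrying transactions
      }
  }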

2018-02-01 16:41:42 GMT <yreg> We probably should make a config entry for that in Inflow

2018-02-01 16:41:50 GMT <yreg> thanks for the tip

2018-02-01 16:42:51 GMT <AFaust> Oh - I have to be more clear: "the most significant performance hog" => once you remove synch indexing (since we were using Lucene)

2018-02-01 17:22:54 GMT <yreg> AFaust, for the sake of completeness, even though our solution supports custom parsers and custom data providers, in most use cases clients have two files per node: one for the content and another for the metadata, and the read/write from disk is usually the bottleneck... that and the report we give back to the tool about the uuid and status for each and every uploaded node + possibly some error/warning messages

2018-02-01 19:31:10 GMT <douglascrp> AFaust, yreg interesting discussions today

2018-02-01 23:30:07 GMT <harper> Would someone be willing to assist me in the configuration of my Alfresco Community Edition 201711 EA setup?

2018-02-01 23:32:15 GMT <harper> I cannot figure out how to activate the "passthru" authentication subsystem
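
For reference, activating passthru normally means adding it to the authentication chain and pointing it at one or more authentication servers in alfresco-global.properties. A hedged sketch based on the 5.x documentation, with placeholder domain and server names:

  # alfresco-global.properties - illustrative passthru authentication setup
  authentication.chain=passthru1:passthru,alfrescoNtlm1:alfrescoNtlm
  passthru.authentication.useLocalServer=false
  passthru.authentication.domain=
  passthru.authentication.servers=MYDOMAIN\\dc1.example.com,MYDOMAIN\\dc2.example.com
  passthru.authentication.authenticateCIFS=false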

End of Daily Log

The other logs are at http://esplins.org/hash_alfresco