2018-02-01 08:39:46 GMT <yreg> ~later tell mbui What's that webscript supposed to return as a response ?

2018-02-01 08:39:46 GMT <alfbot> yreg: The operation succeeded.

2018-02-01 08:40:00 GMT * yreg bids you all a good morning !

2018-02-01 10:04:18 GMT <lisa__> hi all, i am trying to find a way to render all documents to pdf, with viewing in the browser done by pdfjs

2018-02-01 10:05:41 GMT <angelborroy> lisa__ can you explain it further?

2018-02-01 10:06:32 GMT <lisa__> all i want is to preview say .txt document by pdfJs

2018-02-01 10:07:19 GMT <lisa__> preview as pdf

2018-02-01 10:07:32 GMT <angelborroy> inside Alfresco Share?

2018-02-01 10:07:41 GMT <angelborroy> or using an external application?

2018-02-01 10:08:09 GMT <lisa__> there is an add on

2018-02-01 10:08:16 GMT <lisa__> by parashift

2018-02-01 10:08:20 GMT <angelborroy> what Alfresco version are you using?

2018-02-01 10:08:32 GMT <lisa__> i am using alfresco CE 5.2

2018-02-01 10:08:58 GMT <angelborroy> so your problem is that TXT files are not previewed in Share, right?

2018-02-01 10:09:30 GMT <lisa__> no it is previewed as is

2018-02-01 10:09:39 GMT <lisa__> i want to get it previewed as pdf

2018-02-01 10:09:50 GMT <lisa__> i did try some transformation but

2018-02-01 10:10:44 GMT <lisa__> say acb.txt first gets saved in doclib as acb.pdf then when i click on view on browser

2018-02-01 10:10:53 GMT <lisa__> it shows me the desired result

2018-02-01 10:14:37 GMT <angelborroy> so the problem is that text/plain is not transformed to PDF

2018-02-01 10:15:25 GMT <lisa__> all i want is a print preview action on all documents

2018-02-01 10:15:49 GMT <lisa__> and then that can be viewed as pdf on next window

2018-02-01 10:16:43 GMT <angelborroy> like this one? https://addons.alfresco.com/addons/quick-print-para-module

2018-02-01 10:16:45 GMT <alfbot> Title: Quick Print - Para Module | Alfresco Add-ons - Alfresco Customizations (at addons.alfresco.com)

2018-02-01 10:17:19 GMT <angelborroy> you can buy it, it’s really cheap

2018-02-01 10:17:42 GMT <lisa__> yes

2018-02-01 10:19:25 GMT <lisa__> but if i have to develop this and contribute to opensource what should be the development path

2018-02-01 10:27:33 GMT <fwu> ppl!

2018-02-01 11:34:34 GMT <fwu> ppl, is this something anyone already did, or do you think it is possible: upload more or less 2,2 million documents to Alfresco (community) in 24 hours?

2018-02-01 11:34:47 GMT <fwu> each month...

2018-02-01 11:36:35 GMT <fwu> I dont want/need to browse this information in share or something like that. But I will need to find documents using search.

2018-02-01 11:37:06 GMT <fwu> I may consider enterprise also.

2018-02-01 11:37:27 GMT <fwu> I dont know if Alfresco Enterprise may help on this or not.

2018-02-01 12:18:35 GMT <AFaust> fwu: There is no difference between Enterprise or Community with regards to how many documents can be handled. It is purely a question of your infrastructure, system setup and approach to loading that many documents if it can be handled by Alfresco

2018-02-01 12:19:43 GMT <AFaust> The only minor difference is that Enterprise comes with clustering out-of-the-box so you could scale it out. With Community, you can only scale up to deal with higher resource demands (i.e. parallel operations for mass uploads)

2018-02-01 14:12:49 GMT <angelborroy> fwu 2,2 million seems a very high number to be processed in 24 hours

2018-02-01 14:12:58 GMT <angelborroy> fwu you will need a big infrastructure for that

2018-02-01 14:16:54 GMT <yreg> and probably a few days for digesting

2018-02-01 14:18:52 GMT <yreg> AFaust, fwu should actually be able to hook multiple instances of Alfresco CE to the same Database/Content store and parallelize even further for the ingestion as long as the extra nodes are guaranteed to only be used for ingesting documents .... and all consultation/edit goes through one single node

2018-02-01 14:33:21 GMT <fwu> afaust, angelborroy, yreg thank you.

2018-02-01 14:34:30 GMT <fwu> is there any good documentation related with Alfresco clustering?

2018-02-01 14:35:38 GMT <angelborroy> fwu Alfresco clustering for Community or Enterprise?

2018-02-01 14:35:42 GMT <fwu> also, is there any benchmark related with indexing millions of documents?

2018-02-01 14:35:45 GMT <fwu> Community

2018-02-01 14:35:52 GMT <angelborroy> no, there is not

2018-02-01 14:36:59 GMT <angelborroy> fwu this is the most detailed one I know: beecon.buzz/2017/assets/files/EF09/EF09-Installing-Alfresco-components-1-by-1.pdf

2018-02-01 14:37:05 GMT <angelborroy> fwu but it’s not clustering

2018-02-01 14:37:16 GMT <yreg> fwu, a small correction : "No, there is no clustering in community at all, at least not out of the box, and not in any version post 4.0.x"

2018-02-01 14:37:17 GMT <fwu> the problem is the math: if it is possible to index 1 document per second, I will only be able to index 86,400 in 24 hours. So I need parallel execution. Maybe 32 processes?

2018-02-01 14:37:54 GMT <fwu> if I imagine 32 services running in parallel... how does this map to alfresco instances?
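
The arithmetic behind fwu's estimate can be sketched as follows. This is only a back-of-the-envelope calculation: the 1 document per second per worker figure is fwu's own hypothetical, and the function names are made up for illustration.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in the 24-hour window


def required_rate(total_docs: int, window_seconds: int = SECONDS_PER_DAY) -> float:
    """Average documents per second needed to finish within the window."""
    return total_docs / window_seconds


def workers_needed(total_docs: int,
                   docs_per_second_per_worker: float,
                   window_seconds: int = SECONDS_PER_DAY) -> int:
    """Parallel workers needed, assuming throughput scales linearly
    with the number of workers (an optimistic assumption)."""
    rate = required_rate(total_docs, window_seconds)
    return math.ceil(rate / docs_per_second_per_worker)


if __name__ == "__main__":
    total = 2_200_000  # monthly volume mentioned by fwu
    print(f"required rate: {required_rate(total):.2f} docs/s")
    print("workers at 1 doc/s each:", workers_needed(total, 1.0))
```

Under these assumptions 26 workers would be enough, so 32 leaves some headroom. Real throughput rarely scales linearly, because the database and the content store are shared.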

2018-02-01 14:38:28 GMT <yreg> fwu there are a lot of flaws in your logic..

2018-02-01 14:38:46 GMT <fwu> yreg, Im sure there are :)

2018-02-01 14:39:05 GMT <yreg> IF you have great infra with SSDs and low latency network ....

2018-02-01 14:39:08 GMT <angelborroy> fwu probably you can install 32 alfresco nodes (without SOLR, LibreOffice) sharing the same database and alf_Data

2018-02-01 14:39:22 GMT <angelborroy> but you need also a big database cluster and a fast filesystem

2018-02-01 14:39:23 GMT <yreg> You can get up to few hundred documents ingested in the same second

2018-02-01 14:39:27 GMT <fwu> angelborroy, but 32 is alot...

2018-02-01 14:39:33 GMT <yreg> having them indexed by solr is another issue

2018-02-01 14:39:42 GMT <angelborroy> no, SOLR has to be out of the process

2018-02-01 14:39:51 GMT <angelborroy> you can index all that stuff later

2018-02-01 14:42:48 GMT <fwu> angelborroy, SOLR to be out of process it means I need to disable it, right?

2018-02-01 14:42:51 GMT <yreg> using our inhouse stack and with only 4 parallel processes in the background ingesting documents, on a decent, but not top infra with a single node, we got up to 100 documents per second as a sustainable rate

2018-02-01 14:43:31 GMT <yreg> fwu, it's better to disable it during the ingestion

2018-02-01 14:43:57 GMT <yreg> and also lower the tracking frequency outside of the ingestion

2018-02-01 14:44:05 GMT <fwu> yreg, your data example is a good start. thank you

2018-02-01 14:44:26 GMT <yreg> something like 10 or 15 minutes as a replacement for the out of the box 15 seconds
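
The tracking frequency yreg mentions is normally expressed as a Quartz-style cron expression. The helper below is a minimal sketch that builds such an expression for a given interval. The function name is hypothetical, and the assumption that the value goes into the `alfresco.cron` property of the Solr core's `solrcore.properties` should be checked against the documentation for the Alfresco version in use.

```python
def tracker_cron(interval_seconds: int) -> str:
    """Build a Quartz cron expression that fires every `interval_seconds`.

    Only intervals that divide a minute (e.g. 15 s) or whole minutes that
    divide an hour (e.g. 10 or 15 min) are supported in this sketch.
    """
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    if interval_seconds < 60:
        if 60 % interval_seconds != 0:
            raise ValueError("sub-minute interval must divide 60 seconds")
        # seconds field carries the step, every minute of every hour
        return f"0/{interval_seconds} * * * * ? *"
    minutes, rest = divmod(interval_seconds, 60)
    if rest or minutes >= 60 or 60 % minutes != 0:
        raise ValueError("interval must be whole minutes dividing one hour")
    # fire at second 0 of every `minutes`-th minute
    return f"0 0/{minutes} * * * ? *"


if __name__ == "__main__":
    print("every 15 seconds:", tracker_cron(15))
    print("every 15 minutes:", tracker_cron(15 * 60))
```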

2018-02-01 14:44:31 GMT <fwu> but 4 parallel processes means what? 4 alfresco instances?

2018-02-01 14:44:57 GMT <yreg> no, one single Alfresco instance, 4 threads

2018-02-01 14:45:50 GMT <yreg> well technically speaking 8 threads : 4 writing metadata and nodes, and 4 other writing content and linking it to the nodes

2018-02-01 14:45:52 GMT <fwu> one database? what kind of file system?

2018-02-01 14:46:16 GMT <yreg> but that was using our proprietary stack

2018-02-01 14:46:59 GMT <yreg> I am just sharing the numbers as a reference

2018-02-01 14:47:23 GMT <fwu> ok, nice

2018-02-01 14:47:46 GMT <yreg> if you are interested in the product I am talking about or in a testdrive, feel free to get in touch through : https://xenit.eu/alfred-inflow-content-migration-alfresco/

2018-02-01 14:47:48 GMT <alfbot> Title: Content migration Alfresco - Inflow high speed content migration (at xenit.eu)

2018-02-01 14:48:12 GMT <fwu> ok, I will look at it thank you!

2018-02-01 14:48:20 GMT <angelborroy> My numbers are 4 documents per second on a simple repository with 16 GB RAM

2018-02-01 14:48:32 GMT <angelborroy> In fact about 10,000 documents per hour

2018-02-01 14:48:41 GMT <yreg> angelborroy, how many threads ?

2018-02-01 14:48:45 GMT <angelborroy> one

2018-02-01 14:48:53 GMT <angelborroy> Alfresco out-of-the-box ;-)

2018-02-01 14:49:02 GMT <fwu> looks nice angelborroy

2018-02-01 14:49:04 GMT <angelborroy> only with the repo

2018-02-01 14:49:09 GMT <yreg> were you using CMIS ?

2018-02-01 14:49:17 GMT <yreg> that's pretty low IMHO

2018-02-01 14:49:20 GMT <angelborroy> no SOLR, no thumbs, no Libreoffice

2018-02-01 14:49:28 GMT <yreg> unless if the files are relatively big

2018-02-01 14:49:31 GMT <angelborroy> using BULK export-import

2018-02-01 14:50:49 GMT <yreg> still, that's too slow, using inflow we get rates of ~20 documents per second only on crappy hardware or really bad config

2018-02-01 14:51:11 GMT <yreg> Do you have any idea how many documents were handled per transaction ?

2018-02-01 14:51:19 GMT <yreg> I am not familiar with the bulk import

2018-02-01 14:51:22 GMT <angelborroy> one

2018-02-01 14:51:33 GMT <angelborroy> because I modified the code to import one document per transaction :D

2018-02-01 14:51:35 GMT <yreg> aha, that is probably it

2018-02-01 14:52:17 GMT <yreg> In my experience, there is no magical number for it, and it is usually between 150 and 300
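
The idea of handling many documents per transaction can be sketched generically as below. This is not Inflow or the Alfresco bulk import: `commit` is a hypothetical callback standing in for whatever writes one batch inside a single transaction, and the batch size of 150 is simply the lower end of the range yreg quotes.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def ingest(documents: Iterable[T], batch_size: int,
           commit: Callable[[List[T]], None]) -> int:
    """Call `commit` once per batch and return the number of transactions."""
    transactions = 0
    for batch in batches(documents, batch_size):
        commit(batch)  # hypothetical: write the whole batch in one transaction
        transactions += 1
    return transactions


if __name__ == "__main__":
    committed: List[List[int]] = []
    count = ingest(range(1000), 150, committed.append)
    print(count, "transactions, last batch size", len(committed[-1]))
```

With one document per transaction, as in angelborroy's modified import, the per-transaction overhead is paid for every single document, which is a plausible reason for the lower rate.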

2018-02-01 14:53:03 GMT <yreg> we have a benchmarking tool that tries a lot of variations in the configuration, and gives back the best parameter orchestration ;-)

2018-02-01 14:53:30 GMT <angelborroy> you have a lot of time (or money) to invest :-P

2018-02-01 14:54:49 GMT <fwu> the bulkimport is too slow...

2018-02-01 14:55:28 GMT <fwu> yreg does Alfred run on the community version of Alfresco? Or only enterprise?

2018-02-01 14:55:35 GMT <yreg> angelborroy, it's a continuous effort, What we have is a product with many clients and shared maintenance cost ;-) if one needs an enhancement, and sponsors it, all clients benefit from it

2018-02-01 14:56:08 GMT <yreg> fwu, on both

2018-02-01 14:56:37 GMT <fwu> hmm... I believe I will look at it and try it

2018-02-01 14:57:26 GMT <fwu> I need to index 2,2 millions each month. with a base volume of documents of 5T

2018-02-01 15:00:05 GMT <fwu> yreg can you give me a clue about how Alfred cost is calculated?

2018-02-01 15:15:05 GMT <fwu> ok. it seems my usecase is the second

2018-02-01 15:15:27 GMT <fwu> with a first 5T migration maybe

2018-02-01 15:16:18 GMT <fwu> the "get a demo" of Alfred means what? someone will show how it works remotely?

2018-02-01 15:22:36 GMT <AFaust> yreg, fwu, angelborroy: After my homeoffice fitness break, I'd like to chime in with some numbers I have from a customer PoC ~6 years ago on Alfresco 3.4 on a non-optimized, virtualized infrastructure (though the Alfresco code was optimised / custom): ~ 1-1.5 million nodes ingested (from MongoDB) in an hour (Lucene indexing out-of-transaction)

2018-02-01 15:25:56 GMT <fwu> 1 million in one hour? those are really great numbers AFaust.

2018-02-01 15:26:43 GMT <fwu> with Alfresco 3.4 is still more amazing because the idea (and feedback) I have is that the 3.x versions were slow

2018-02-01 15:27:31 GMT <AFaust> As yreg has mentioned multiple times, infrastructure is key. I have customers with horrendous virtualised storage concepts that could barely manage 10 documents per second, and others (in the past) with proper SSD where 100+ per second was the norm...

2018-02-01 15:29:05 GMT <AFaust> fwu: There are many ways to ingest data into Alfresco. Like yreg / Xenit, who have their own dedicated product for high speed ingestion, I have employed custom code / integrations at customers to reach reasonable performance...

2018-02-01 15:30:36 GMT <AFaust> That specific project was a PoC to showcase to the customer that the platform could handle such loads. In their use case they got deliveries of updated data sets twice a day containing a couple of million entries, which needed to be ingested and integrated / processed with regards to already existing data...

2018-02-01 15:31:12 GMT <fwu> AFaust, what infrastructure did you use?

2018-02-01 15:31:28 GMT <fwu> and was it with the community version?

2018-02-01 15:31:46 GMT <AFaust> Similarly to what yreg mentioned regarding optimised parameters (i.e. documents per transaction) you can also optimise by temporarily disabling Alfresco features / services you do not need...

2018-02-01 15:33:37 GMT <AFaust> As I said, it was a virtualised environment. No special storage (their standard, non-SSD one), 2 4-core CPUs, and about 16 GiB RAM. Simple MySQL DB on a similar host (self-managed, not a lot of optimisations). At that time I was working with an Alfresco partner, so it was Enterprise.

2018-02-01 15:34:11 GMT <AFaust> But again, Enterprise or Community does not matter. There is no magic code / silver bullet in Enterprise to make things go faster...

2018-02-01 15:36:13 GMT <AFaust> To be transparent, the documents we had to store on disk were primarily simple text-based ones (no large files) containing RTF-like contents.

2018-02-01 15:37:36 GMT <fwu> I think those 2 million are more or less 200kb files

2018-02-01 15:37:43 GMT <fwu> pdf

2018-02-01 15:38:18 GMT <fwu> how many instances did you use?

2018-02-01 15:38:23 GMT <fwu> alfresco instances

2018-02-01 15:41:34 GMT <AFaust> 1

2018-02-01 15:41:56 GMT <AFaust> and I believe 4 threads

2018-02-01 15:42:40 GMT <fwu> so you just make some configurations and make some software changes?

2018-02-01 15:44:06 GMT <AFaust> It was a custom implemented ingestion process - through and through. Of course some configuration changes were also involved, but most impact was from that custom multi-step process...

2018-02-01 15:45:25 GMT <fwu> but after indexing the documents, was it possible to search and get them a little bit later?

2018-02-01 15:48:22 GMT <AFaust> Not sure what you mean, but indexing was done as a separate step after mass ingestion. You could get the documents immediately after ingestion with DB-bound queries, but indexing (with Lucene mind you) took a while.

2018-02-01 15:50:12 GMT <fwu> ok, so I could get the document right after ingestion, but not to search them using Alfresco standard search.

2018-02-01 15:52:35 GMT <AFaust> I actually just found the PoC report in my archive.... I have to correct myself - indexing was done during the ingestion step, but code was in place to avoid in-transaction indexing. Basically it was indexed immediately but asynchronously.

2018-02-01 15:53:16 GMT <AFaust> One test run imported ~4 million document entries (patents) in ~4 hours, which includes the index time
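
To put the figures quoted in this conversation side by side, the sketch below converts each one to documents per second and estimates how long 2.2 million documents would take at that rate. The numbers are the rough ones reported above, from different hardware, versions and document types, so this is an illustration of the arithmetic and not a benchmark comparison.

```python
REQUIRED_DOCS = 2_200_000  # fwu's monthly volume

# Rates as reported in the conversation, converted to documents per second.
reported_rates = {
    "angelborroy (bulk import, 1 thread, 1 doc per transaction)": 10_000 / 3600,
    "yreg (Inflow, single node, 4 parallel processes)": 100.0,
    "AFaust (PoC, ~4 million entries in ~4 hours incl. indexing)": 4_000_000 / (4 * 3600),
}


def hours_to_ingest(total_docs: int, docs_per_second: float) -> float:
    """Hours needed to ingest `total_docs` at a constant rate."""
    return total_docs / docs_per_second / 3600


for label, rate in reported_rates.items():
    hours = hours_to_ingest(REQUIRED_DOCS, rate)
    print(f"{label}: {rate:.1f} docs/s, about {hours:.1f} h for 2.2 million")
```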

2018-02-01 15:54:35 GMT <fwu> those are very good numbers

2018-02-01 15:54:41 GMT <AFaust> The report lists the Lucene indexing as the limiting factor of that PoC - otherwise the whole process could have been 3-4x as fast (various phases caused indexing to occur, which could have been relegated to the end in an optimal scenario)

2018-02-01 16:10:19 GMT <fwu> thank you afaust, angelborroy and yreg for your help.

2018-02-01 16:38:06 GMT <yreg> AFaust, impressive indeed, but I guess in our case, we can not always pause Behaviours and alike as in most cases there is a lot that could/should be done through them upon uploads

2018-02-01 16:38:36 GMT <yreg> but I agree that it is possible to squeeze even more performance..

2018-02-01 16:38:51 GMT <yreg> for special cases that require special measures

2018-02-01 16:39:41 GMT <AFaust> Behaviours were actually enabled - rule service was not though, which can be the most significant performance hog...

2018-02-01 16:40:19 GMT <AFaust> All those "I have to check the hierarchy if there are some rules somewhere" is very expensive, and most of the time it finds none...

2018-02-01 16:40:21 GMT <yreg> Good to know

2018-02-01 16:40:37 GMT <yreg> was it disabled system wide or within the thread context ?

2018-02-01 16:40:43 GMT <AFaust> Within thread only

2018-02-01 16:41:42 GMT <yreg> We probably should make a config entry for that in Inflow

2018-02-01 16:41:50 GMT <yreg> thanks for the tip

2018-02-01 16:42:51 GMT <AFaust> Oh - I have to be more clear: "the most significant performance hog" => once you remove synch indexing (since we were using Lucene)

2018-02-01 17:22:54 GMT <yreg> AFaust, for the sake of completeness, even though our solution supports custom parsers and custom data providers, in most usecases clients have two files per node: one for the content and another for the metadata, and the read/write from disk is usually the bottleneck... that and the report we give back to the tool about the uuid and status for each and every uploaded node + eventually some error/warning messages

2018-02-01 19:31:10 GMT <douglascrp> AFaust, yreg interesting discussions today

2018-02-01 23:30:07 GMT <harper> Would someone be willing to assist me in the configuration of my Alfresco Community Edition 201711 EA setup?

2018-02-01 23:32:15 GMT <harper> I cannot figure out how to activate the "passthru" authentication subsystem

End of Daily Log

