Daily Log for #alfresco IRC Channel

Alfresco discussion and collaboration. Stick around a few hours after asking a question.

Official support for Enterprise subscribers: support.alfresco.com.

Joining the Channel:

Join in the conversation by getting an IRC client and connecting to #alfresco at Freenode. Or you can use the IRC web chat.

More information about the channel is in the wiki.

Getting Help

More help is available in this list of resources.

Daily Log for #alfresco

2019-10-18 09:48:37 GMT <alfresco-discord> <dgradecak> hi-ko: lately I am using spring cloud gateway, alfresco/share/ADF apps/activiti is configured as "external auth" and the gateway is in charge to keep the correct login info

2019-10-18 09:48:49 GMT <alfresco-discord> <dgradecak> and adds it to the requests when proxying

2019-10-18 09:49:05 GMT <alfresco-discord> <dgradecak> instead of apache for instance
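dgradecak's setup above can be sketched as a Spring Cloud Gateway route that injects the trusted-user header Alfresco expects in "external auth" mode. The route id, URI and the hard-coded user are illustrative assumptions; X-Alfresco-Remote-User is Alfresco's default external-auth header name.

```yaml
# Sketch only: a gateway route fronting an Alfresco repository configured
# for external authentication. In a real deployment the header value would
# come from the gateway's own session/login state, not a literal.
spring:
  cloud:
    gateway:
      routes:
        - id: alfresco          # assumed route id
          uri: http://alfresco:8080
          predicates:
            - Path=/alfresco/**
          filters:
            # Add the trusted user header on every proxied request
            - AddRequestHeader=X-Alfresco-Remote-User, jdoe
```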

2019-10-18 10:23:19 GMT <AFaust> Am I the only one who gets annoyed when customers ask "can we disallow document download, print and protect all possible download URLs, but still allow the user to view the document in the browser preview" and then don't understand that this is not how the web works? If you need to show a document to a user - in whatever form (even transformed to not show "the original") - then any minimally tech-savvy user will be able to download and

2019-10-18 10:23:19 GMT <AFaust> then do whatever with that data stream...

2019-10-18 10:25:09 GMT <AFaust> Only way to achieve this level of "data security" is to have an Alfresco system on a separate computer, fully air-gapped, in a separate room and only allow access to it without any mobile phone, pen + notepad, etc.

2019-10-18 10:29:39 GMT <alfresco-discord> <dgradecak> I guess you know which computer is the most secure one?

2019-10-18 10:29:58 GMT <alfresco-discord> <dgradecak> the one that is turned off

2019-10-18 10:31:54 GMT <AFaust> ... and shredded...

2019-10-18 10:49:03 GMT <alfresco-discord> <Douglas Paes (douglascrp)> @AFaust, I do

2019-10-18 10:49:21 GMT <alfresco-discord> <Douglas Paes (douglascrp)> And I have an add-on for that

2019-10-18 10:50:04 GMT <alfresco-discord> <Douglas Paes (douglascrp)> Hide actions, links, add print buttons, everything configurable by groups

2019-10-18 10:50:21 GMT <alfresco-discord> <Douglas Paes (douglascrp)> But users do not understand everything you mentioned

2019-10-18 10:50:34 GMT <alfresco-discord> <Douglas Paes (douglascrp)> It gives them the false idea of security

2019-10-18 11:01:17 GMT <alfresco-discord> <dgradecak> I try to explain that to them, and I have had no cases where they insisted in the end

2019-10-18 11:16:34 GMT <AFaust> angelborroy: Question concerning optimal ASS configuration: If I have just set up an Alfresco Repository system and mass-created 100 million documents in ~26h (50 docs/txn), what is the ideal configuration for ASS to index those (no content) in the shortest amount of time?

2019-10-18 11:17:33 GMT <AFaust> The default configuration is abysmal and will take 5 days - main issue is that regardless of number of threads + cron I configure for metadata tracker, I can only ever see a spike in Repository DB connection use (aka real indexing progress) every 20s or so...

2019-10-18 11:18:46 GMT <AFaust> Going to try the brute force option now, e.g. setting the global alfresco.cron to trigger tracking every 2s
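The "brute force option" AFaust describes, plus angelborroy's pool-size hint a few lines below, live in the core's solrcore.properties. The property names exist in Alfresco Search Services; the values here are illustrative assumptions, not recommendations.

```properties
# Sketch of solrcore.properties tracker tuning (values are assumptions).
# Fire the trackers every 2 seconds instead of the default cadence:
alfresco.cron=0/2 * * * * ? *
# More tracker worker threads:
alfresco.corePoolSize=8
```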

2019-10-18 11:24:48 GMT <angelborroy> Did you try to increase the number of trackers?

2019-10-18 11:25:25 GMT <AFaust> I increased the core pool size, yes

2019-10-18 11:25:44 GMT <angelborroy> alfresco.corePoolSize

2019-10-18 11:25:55 GMT <angelborroy> No more suggestions from my side

2019-10-18 11:25:56 GMT <angelborroy> Sorry

2019-10-18 11:26:27 GMT <AFaust> And I can see that spike in used DB connections on Repository. But the spike is short-lived, and then the entire system is idle until the next spike after 20s. According to repository thread dump, no request is blocking, e.g. no tracker is actively requesting data.

2019-10-18 11:26:41 GMT <AFaust> And top also does not show any CPU load

2019-10-18 11:28:42 GMT <AFaust> Unfortunately in this system I am having issues getting the thread dump from the SOLR process itself. Looks like jstack (and other tools) in OpenJDK has some issues connecting to running processes in recent versions (both Java 8 and 11)

2019-10-18 11:29:25 GMT <angelborroy> We are performing internal spikes to improve this process, but nothing relevant to share by now

2019-10-18 12:39:23 GMT <hi-ko> @dgradecak: we already had the SSO discussion. In this special case we are still evaluating handling and caching the Alfresco login tickets for every user, to avoid the external auth config / restrictions.

2019-10-18 12:47:46 GMT <hi-ko> AFaust: we have a module "view only" for alfresco <= 5.2 using Flash for preview and overriding any consumer read access on content and available renditions. To translate this to the PDF preview and the upcoming transformer mechanism, a solution could be to create special secured PDFs with pixelated pages plus an overlay.

2019-10-18 12:50:55 GMT <hi-ko> AFaust: even though we techies know that Alfresco has no out-of-the-box answer for this requirement, it was the reason for some (larger) companies not to use Alfresco

2019-10-18 12:53:26 GMT <AFaust> Regardless of whether Alfresco has or hasn't an answer out of the box. Such a requirement is just not feasible in any web-based system without sacrificing a lot of other features or quality of service (e.g. heavily pixelated images) while at the same time ensuring 100% security

2019-10-18 12:54:34 GMT <hi-ko> So customers choose other platforms for storing documents having this requirement.

2019-10-18 12:57:56 GMT <hi-ko> AFaust: concerning your load test with 100 mio docs: have you seen systems really having this amount of data in real life? We maintain 2 systems with more than 50 mio docs, but they are more or less at their conceptual limits, e.g. no longer able to change permissions on the upper hierarchy, time for transformation spawning threads ...

2019-10-18 13:00:02 GMT <hi-ko> On one of these systems we switched off ASS completely since the resource requirements exploded and the whole system availability was no longer acceptable.

2019-10-18 13:00:10 GMT <AFaust> I have seen one system in excess of 100 mio docs a few years ago, and indirectly know of a system bound to cross 100 mio after little more than 2 years of operation

2019-10-18 13:00:44 GMT <AFaust> Well, at some amount of xx mio documents, sharding has to come into play.

2019-10-18 13:00:52 GMT <hi-ko> but I guess not as standard share application

2019-10-18 13:01:00 GMT <AFaust> (also highly dependent on full text vs. metadata only index)

2019-10-18 13:01:32 GMT <hi-ko> even with sharding: repo is the bottleneck

2019-10-18 13:02:56 GMT <hi-ko> try to change a permission on a higher level or try to move a folder and wait until the system gets out of resources

2019-10-18 13:04:32 GMT <hi-ko> so my experience is: if the customer is using Share, metadata and ACLs, ~50 mio docs is the wall

2019-10-18 13:06:04 GMT <hi-ko> given that there are concurrent users (e.g. 500) creating and searching content.

2019-10-18 13:12:28 GMT <hi-ko> angelborroy: any outcome on OpenJDK analysis, tuning, best practices would be very welcome since the Oracle JDK's days are numbered ...

2019-10-18 13:42:52 GMT <AFaust> Well... any system expected to have xx mio of documents needs to have a proper ACL / permission scheme design up-front which must make sure to prevent such use cases as excessively cascading ACL updates.

2019-10-18 13:45:35 GMT <AFaust> And in the benchmarks I have done with test systems of 10 - 100 mio documents, the bottleneck so far has not been the repository (most of the time), but the general setup / configuration of resources (caches / memory) + SOLR integration (+ sharding, if relevant)

2019-10-18 13:47:40 GMT <hi-ko> I agree that it is not a problem to put > 50 mio docs into alfresco. I have seen the challenges when people work with them in the way they are used to

2019-10-18 13:51:35 GMT <hi-ko> So I would rephrase: The alfresco UI allows operations which could easily kill the system in larger environments.

2019-10-18 13:55:22 GMT <AFaust> As for my current issue: SOLR/ASS indexing of mass data (e.g. after a migration / large import) definitely leaves much to be desired. The scheduling + synchronization of concurrent trackers + index consolidation (commit) phase really slows down the process. Whatever I try to optimise (batch sizes, commit interval...), the overall balance stays the same.

2019-10-18 13:56:03 GMT <hi-ko> I see.

2019-10-18 13:56:05 GMT <AFaust> Can't accept that indexing would be 6x as costly (duration) as creating the content in the first place

2019-10-18 13:57:07 GMT <hi-ko> did you find out where the time gets lost?

2019-10-18 13:58:00 GMT <AFaust> So far, all indications are that most of the (effective) time is lost doing index commits / merges.

2019-10-18 13:58:51 GMT <AFaust> I cannot see much pressure on IO though, but have 1 CPU constantly used in SOLR, while there is no activity in Repo or DB (so no bottleneck in repository)

2019-10-18 13:59:52 GMT <AFaust> Unfortunately, I cannot get a result from jstack for this SOLR process to know in more detail. jstack fails with some obscure reason I have become all too familiar with in recent OpenJDK versions.

2019-10-18 14:00:33 GMT <hi-ko> and you could exclude storage / IOPS issues? I've also seen issues on ESX with overprovisioned CPUs, but that would not explain the difference to the initial load

2019-10-18 14:01:07 GMT <AFaust> What I tried to optimize so far is to reduce the occurrences of these commits / merges, so that the concurrent phase is longer / can get more done. As a result though, the commit / merge phase now takes significantly longer, and offsets all gains of the concurrent phase.
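The commit/merge trade-off being tuned here (and the "buffer/mergeFactor" params mentioned later in the log) maps to the indexConfig section of solrconfig.xml. The element names match Solr 6; the values below are assumptions for illustration, not recommendations.

```xml
<!-- Sketch of solrconfig.xml knobs for the flush/merge trade-off. -->
<indexConfig>
  <!-- Larger RAM buffer: fewer flushes, but bigger segments to merge later -->
  <ramBufferSizeMB>512</ramBufferSizeMB>
  <!-- Tiered merge policy: raising these tolerates more segments per tier,
       deferring (and enlarging) merge work -->
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">20</int>
    <int name="segmentsPerTier">20</int>
  </mergePolicyFactory>
</indexConfig>
```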

2019-10-18 14:03:22 GMT <hi-ko> and if then the gc wakes up ...

2019-10-18 14:04:04 GMT <hi-ko> have you compared running solr in oracle jvm?

2019-10-18 14:07:40 GMT <AFaust> Not in that system, as switching JVMs would be quite a hassle, involving Docker image rebuilds.

2019-10-18 14:08:14 GMT <AFaust> GC is also not a (likely) culprit, because the JVM has sufficient memory and the slowness is already apparent from the start (empty index)

2019-10-18 14:08:29 GMT <AFaust> ...when memory usage is extremely low

2019-10-18 14:10:30 GMT <hi-ko> I would compare running SOLR on a naked OS to make sure your issue is not related to kernel issues with cgroups / virtualisation

2019-10-18 14:11:38 GMT <AFaust> Judging from the "remaining time" estimation, my optimisations have actually resulted in some improvements (4-5 days vs 6 days total), but still quite high.

2019-10-18 14:12:20 GMT <AFaust> The repo + DB run on exactly the same virtualisation setup - somehow I doubt only SOLR would be affected by kernel issues.

2019-10-18 14:13:09 GMT <AFaust> Anyway - such comparisons are unfortunately not that easy to do / setup in a customer-provided environment.

2019-10-18 14:13:37 GMT <hi-ko> still far too much. We reindex ~30 mio docs in < 8 hrs in a virtualized system, but these numbers are from 5.2/solr4

2019-10-18 14:13:55 GMT <AFaust> And before I could suggest that, I would need to have some evidence / indications that this "might" be the case...

2019-10-18 14:16:06 GMT <hi-ko> so my guess is still the "runtime" but I have no scaling experience with solr6

2019-10-18 14:17:21 GMT <hi-ko> do you have exclusive cpus?

2019-10-18 14:18:14 GMT <hi-ko> and guaranteed memory?

2019-10-18 14:23:20 GMT <hi-ko> I've seen this in overprovisioned virtual environments since you always have to share the physical CPU cores and every context switch costs extra

2019-10-18 14:54:00 GMT <AFaust> Ah, finally got jstack to work... so far it looks like I suspected, index commit / merge blocking most of the processing (via Semaphore, synchronized methods and other means)

2019-10-18 14:56:42 GMT <AFaust> Hmm... Lucene has a method "maybeMerge" which ALWAYS calls the "merge" method of the merge scheduler, which in the case of ASS / SOLR 6 is a concurrent merge scheduler, where that method is synchronized.

2019-10-18 14:57:00 GMT <AFaust> I am totally missing the "maybe" part of the logic indicated by the name...
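The serialization AFaust sees in the thread dumps can be reproduced with a toy model. This is NOT Lucene's actual code, just an illustration of the mechanism: when a scheduler's merge() entry point is synchronized, calls from concurrent indexing threads can never overlap, so everything queues behind the running merge.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model (not Lucene source): a "merge scheduler" whose merge() method
// is synchronized, so concurrent callers are forced to run one at a time.
public class SyncMergeDemo {
    private final AtomicInteger active = new AtomicInteger();
    private volatile int maxObserved = 0;

    // synchronized: only one thread may be "merging" at any moment
    public synchronized void merge() throws InterruptedException {
        int now = active.incrementAndGet();
        maxObserved = Math.max(maxObserved, now);
        Thread.sleep(5); // stand-in for merge work
        active.decrementAndGet();
    }

    // Start several "indexing" threads that all trigger a merge and
    // report the highest number of merges ever observed in flight.
    public static int run() throws InterruptedException {
        SyncMergeDemo scheduler = new SyncMergeDemo();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                try { scheduler.merge(); } catch (InterruptedException ignored) { }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        return scheduler.maxObserved; // 1: merges never overlapped
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("max concurrent merges: " + run());
    }
}
```

Despite four threads requesting merges concurrently, at most one is ever inside merge(), which is exactly the "everything else blocks on that" behaviour described above.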

2019-10-18 15:48:40 GMT <hi-ko> btw. which tool do you use to analyse the thread dumps?

2019-10-18 16:02:44 GMT <AFaust> ~later tell hi-ko: Ehm, typically only "less" and my "mk-1 eye balls"

2019-10-18 16:02:44 GMT <alfbot> AFaust: The operation succeeded.

2019-10-18 16:12:22 GMT <AFaust> ~later tell hi-ko: From all the jstack traces so far, I never see any thread "stuck" or otherwise active in IO. It all looks to be CPU bound, e.g. the computational cost of doing index merges. Since that is exclusive, everything else blocks on that, when that occurs. I am currently trying to see how solrconfig.xml params (buffer/mergeFactor) impact this.

2019-10-18 16:12:22 GMT <alfbot> AFaust: The operation succeeded.

End of Daily Log

The other logs are at http://esplins.org/hash_alfresco