These are chat archives for inveniosoftware/invenio

19th Apr 2016
Alexander Wagner
@aw-bib
Apr 19 2016 06:40
good morning :) is there a known reason why (the never fast) webcoll on invenio1.x gets excessively slow even though we did not add new collections? and even better, are there any countermeasures?
Tibor Simko
@tiborsimko
Apr 19 2016 07:04
Do you have many UI languages defined? The time grows proportionally with the number of languages. Have you tried to profile it to see where it spends the most time in your particular collection setup? Run webcoll -u admin -f --profile cumulative and then look in the corresponding bibsched_task_123.log.
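As a minimal sketch of what such a cumulative profile listing looks like, here is the standard Python cProfile/pstats machinery sorted by cumulative time; the profiled function below is only an invented stand-in, not webcoll itself:

```python
import cProfile
import io
import pstats

def build_collection_cache(n_languages):
    """Hypothetical stand-in for per-language cache work repeated by webcoll."""
    return sum(sum(i * i for i in range(10_000)) for _ in range(n_languages))

profiler = cProfile.Profile()
profiler.enable()
build_collection_cache(5)
profiler.disable()

# Sort by cumulative time, as with --profile cumulative, and print the top 10.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()

# The expensive function shows up near the top of the cumulative listing.
print("build_collection_cache" in report)
```

In the real task log you would look for which webcoll-internal functions dominate the cumulative column in the same way.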
Alexander Wagner
@aw-bib
Apr 19 2016 07:48
@tiborsimko we did not change anything in the setup, neither in the languages nor in the collections. what hit us were the US grants that OpenAIRE exposed: on the box in question we (unintentionally) ingested some 500,000 records, but this should not affect bibsched, right? I'll check out the profile.
Tibor Simko
@tiborsimko
Apr 19 2016 07:55
Nope, 500K is not much. (1) The webcoll speed depends mostly on the number of collections and on the number of I18N languages the site has. (2) But you can also see a slow-down if your collections are defined via queries such as 100:%, because then the query to get the latest 10 records can take many seconds. You can check your regular webcoll task log to see whether some long-running collections are acting as bottlenecks. (3) Finally, note that you can also profile one single collection only, e.g. webcoll -u admin -c Books -f --profile cumulative, which may provide additional insight.
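To illustrate point (2): a bare wildcard such as 100:% matches every term, so the engine has to enumerate everything before it can pick the latest 10 records. A toy sketch with invented names (Invenio's real word index lives in MySQL tables, but the principle of a forced full scan is the same):

```python
import fnmatch

# Hypothetical in-memory term index: term -> set of record ids.
term_index = {f"author-{i:05d}": {i} for i in range(50_000)}

def recids_matching(pattern):
    """Collect record ids for all terms matching a wildcard pattern.

    A bare wildcard cannot use a point lookup, so every term must be
    tested: O(number of terms) per collection refresh.
    """
    hits = set()
    for term, recids in term_index.items():
        if fnmatch.fnmatch(term, pattern):
            hits |= recids
    return hits

# "Latest 10 records" on top of a match-everything pattern:
latest_10 = sorted(recids_matching("author-*"), reverse=True)[:10]
print(latest_10[0])  # → 49999
```

With an exact term instead of a wildcard, a single dictionary lookup would suffice, which is why such dbquery definitions matter for webcoll runtime.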
Alexander Wagner
@aw-bib
Apr 19 2016 09:02
As far as we can see right now, it seems to be the top of the tree that consumes most of the time. I do not yet understand why, though.
@tiborsimko could it be a source of the problem that we marked most of those records as DELETED? As mentioned, we got them by mistake and they just confused our users, so we threw them out again.
Tibor Simko
@tiborsimko
Apr 19 2016 09:14
Yes, webcoll removes deleted records by silently adding a term like -980:"DUMMY" -980:"DELETED" to the query. But that should be pretty fast to run... you can check in ipython. If the home collection is the problem, then I assume it does not have any dbquery defined and only collects hits from its daughter collections? Then that would not be it. You can try to run only the record-gathering part (without the I18N cache creation part), for example webcoll -u admin -f -c "Atlantis Institute of Fictive Science" -p 1 --profile cumulative
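The silent -980:"DELETED" filter effectively amounts to a set subtraction on record-id sets. A minimal sketch with plain Python sets and invented numbers (Invenio 1.x actually uses its C-backed intbitset type, but the subtraction semantics are the same):

```python
# Hypothetical record-id sets: every record on the site, and the ones
# that carry 980__c:DELETED (roughly the ~500k accidental ingests here).
all_recids = set(range(1, 1_000_001))
deleted_recids = set(range(500_001, 1_000_001))

# What the silent -980:"DELETED" filter boils down to:
visible_recids = all_recids - deleted_recids

print(len(visible_recids))  # → 500000
```

A single subtraction like this is cheap even for millions of ids; the question in this thread is whether it is being repeated once per collection.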
Alexander Wagner
@aw-bib
Apr 19 2016 09:22
I saw the -980... stuff in the code and yes, you're right, the collection does not have any dbquery defined. When I test the search for deleted records in ipython, I do not see any bad performance either. I'm checking the cache updates now.
Jan Åge Lavik
@jalavik
Apr 19 2016 09:23
Is it intended that current_user is not available in the API app? (Invenio 3)
we can do a new release, Invenio-Accounts v1.0.0a10
Alexander Wagner
@aw-bib
Apr 19 2016 09:31
@tiborsimko indeed inv $ib/webcoll -u admin -p 1 -f -c PUBDB gives me
2016-04-19 11:21:09 --> Task #166444 started.
2016-04-19 11:21:09 --> PUBDB / reclist cache update
2016-04-19 11:28:23 --> Task #166444 finished. [DONE]
so this seems to be the source of the problem.
Tibor Simko
@tiborsimko
Apr 19 2016 10:06
@aw-bib But does your PUBDB have daughter collections by any chance? If it does, then the time bottleneck may be with some of those, since PUBDB has to traverse all its children, grandchildren, etc. to collect the record information from its sub-collections. You could run without -c and observe the time spent on each daughter collection with the -p 1 mode switched on. The bottleneck really depends on your collection tree setup and your collection dbquery definitions, so I'm just posting some general thoughts on where to look...
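The traversal cost can be sketched as a recursive union over the subtree: a parent with ~300 children has to visit and merge every descendant's reclist. A toy example with invented collection names and reclists:

```python
def collect_recids(collection, tree, reclists):
    """Union the reclists of `collection` and all of its descendants.

    Cost grows with the number of descendants, which is why a parent
    with ~300 children can dominate the webcoll run time.
    """
    recids = set(reclists.get(collection, set()))
    for child in tree.get(collection, []):
        recids |= collect_recids(child, tree, reclists)
    return recids

# Hypothetical collection tree and per-collection reclists:
tree = {"PUBDB": ["Articles", "Theses"], "Articles": ["Preprints"]}
reclists = {"Articles": {1, 2}, "Theses": {3}, "Preprints": {4, 5}}

print(sorted(collect_recids("PUBDB", tree, reclists)))  # → [1, 2, 3, 4, 5]
```

If each child additionally re-runs an expensive dbquery (or a deleted-record subtraction) on every visit, the per-child cost multiplies across the whole tree.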
Alexander Wagner
@aw-bib
Apr 19 2016 10:16
@tiborsimko pubdb has ~300 children, though that has been the case all along. It really started to pose a problem once we got those 500k records and deleted them. Could it be that the 980__c:DELETED filter causes a lot of subtraction operations that consume the time?
Tibor Simko
@tiborsimko
Apr 19 2016 10:35
Not the query itself, that's usually fast, e.g. check http://bib-pubdb1.desy.de/search?ln=en&p=-980__c%3A%22DELETED%22 You may know more after inspecting webcoll -u admin -f -p 1 to see where the seconds are spent... it would probably be one of the daughter collections. I doubt it's the home collection.
Jan Åge Lavik
@jalavik
Apr 19 2016 11:08
@jirikuncar Thanks! Indeed with latest it works.
Javier Martin Montull
@jmartinm
Apr 19 2016 13:32
@jirikuncar @lnielsen Invenio-Search, Invenio-Records-REST and Invenio-Records-UI now stable again? I’ve seen the new releases
Lars Holm Nielsen
@lnielsen
Apr 19 2016 13:38
We’re still testing but it looks kind of OK now… also we’re postponing part of the refactoring for later: inveniosoftware/invenio-records-rest#64
Jacopo Notarstefano
@jacquerie
Apr 19 2016 13:40
If I understood it right, elasticsearch-dsl wraps elasticsearch-py. Is there an easy way to drop to the wrapped library and use that instead?
(Because we have written some code that uses it directly via its API.)
Put another way, is current_search_client still an Elasticsearch object?
Hm, reading the code, it appears that this is still true. :+1:
Alexander Wagner
@aw-bib
Apr 19 2016 13:46
@tiborsimko we see a lot of time spent in cursor.py (MySQL), but I'm not sure this is a helpful result.
Jiri Kuncar
@jirikuncar
Apr 19 2016 13:55
@jmartinm ALL Invenio-* packages are in ALPHA mode.
Javier Martin Montull
@jmartinm
Apr 19 2016 14:18
Thanks for the info @lnielsen
Tibor Simko
@tiborsimko
Apr 19 2016 14:46
@aw-bib That's SQL query time. For example, your collection queries may be defined in a way that does not use indexes efficiently; or your DB tables have not been OPTIMIZEd in a while; or you may simply have too many queries to run for your HW and the time is OK. It would be useful to look into your MySQL slow query log to see whether something is logged there. Then you can run EXPLAIN SELECT ... on your slow queries to see whether there is a bottleneck there. You can also run mysqltuner to see whether you can spot some potential for optimisation. E.g. have you introduced tmpfs for MySQL in-memory tables? They are used quite a lot for bibxxx queries... See http://tiborsimko.org/mysql-tmpfs.html for inspiration.
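As an illustration of what checking the query plan tells you: on MySQL you would run EXPLAIN SELECT ... directly on the slow query, but the sketch below uses SQLite's EXPLAIN QUERY PLAN so it can run anywhere; the table name and data are invented, loosely modelled on Invenio's bibxxx tag/value layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bib98x_like (id INTEGER, tag TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO bib98x_like VALUES (?, '980__c', ?)",
    [(i, "DELETED" if i % 2 else "PUBLISHED") for i in range(1000)],
)

def plan(query):
    """Return the planner's summary lines for `query` as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    return " ".join(row[-1] for row in rows)

query = "SELECT id FROM bib98x_like WHERE tag = '980__c' AND value = 'DELETED'"
plan_before = plan(query)  # without an index: a full table scan
conn.execute("CREATE INDEX idx_tag_value ON bib98x_like (tag, value)")
plan_after = plan(query)   # with the index: a direct index lookup

print(plan_before)  # e.g. "SCAN bib98x_like"
print(plan_after)   # e.g. "SEARCH bib98x_like USING INDEX idx_tag_value ..."
```

The same before/after comparison with MySQL's EXPLAIN output (a "type: ALL" full scan turning into an indexed "ref" lookup) is what to look for in the slow queries logged during the webcoll run.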