Moritz E. Beber
@Midnighter
Hi @elad.noor_gitlab, with regard to your e-mail: I have one open pull request on our fork (biosustain/component-contribution#27). I will update our branch and then check if it still makes sense to merge it.
I think in the future it would be better if we open PRs directly against your repository rather than our own fork first.
Looks like the commits you made are quite data-heavy.
remote: warning: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.
remote: warning: See http://git.io/iEPt8g for more information.
remote: warning: File cache/component_contribution.npz is 54.18 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB
remote: warning: File cache/component_contribution.npz is 54.27 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB
To github.com:biosustain/component-contribution.git
Moritz E. Beber
@Midnighter
That PR mostly changes training_data.py, so yes, quite a number of conflicts.
Elad Noor
@elad.noor_gitlab
Oh, yes. This large file is the cache of all the math done from the training data. I can try using LFS for this.
As for the merge conflicts, I can try to help. Which fork are you using?
Moritz E. Beber
@Midnighter
Let me re-open the PR directly to your repo.
Elad Noor
@elad.noor_gitlab
Also, what does CfB mean?
Moritz E. Beber
@Midnighter
Center for Biosustainability
Elad Noor
@elad.noor_gitlab
Well, I guess it's a bit old, but there are outstanding commits in your fork from January 2017 for a file called CfB_functions.py.
Moritz E. Beber
@Midnighter
Whoops, that's probably junk.
There was a consultant working on functionality for the institute here around that time; I guess that's what he did.
Elad Noor
@elad.noor_gitlab
Okay, so the refactoring branch is the relevant one that we need to merge?
Moritz E. Beber
@Midnighter
No, sorry. The refactoring branch mirrors your develop branch. I have just updated it with all of your recent commits.
I opened a PR to your develop branch now. It's from the refactor/data branch.
Elad Noor
@elad.noor_gitlab
I just did the merge. In the end, I don't think there was much use for the suggested code: it was mostly adding the column titles to the DataFrames, something I had already solved by adding a header row to the resource .csv files. I also started using pkg_resources.resource_stream(), so the other changes are probably redundant as well.
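A minimal sketch of that combination, assuming a packaged resource at component_contribution/data/compounds.csv (the file name is hypothetical):

    import pandas as pd
    import pkg_resources

    # resource_stream() returns a binary file-like object for a file
    # shipped inside the installed package, independent of the current
    # working directory.
    stream = pkg_resources.resource_stream(
        "component_contribution", "data/compounds.csv"
    )

    # The resource file now carries a header row, so pandas picks up the
    # column titles on its own.
    df = pd.read_csv(stream)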
Moritz E. Beber
@Midnighter
Yeah, it was more of a cosmetic PR.
So what's on your roadmap now?
Elad Noor
@elad.noor_gitlab
I need to finish my regression tests to verify that I get the same dG'0 estimates (more or less). Then I want to build a larger database of precomputed compounds that will include all KEGG, BiGG, and MetaCyc compounds, and push that to eQuilibrator.
Finally, I want to make it very easy to add new compounds (by providing an ID and an InChI). I think that all compounds added to component-contribution will follow the COCO:Cxxxxx pattern.
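As an illustration of that identifier scheme, here is a minimal sketch, assuming "COCO:Cxxxxx" stands for the literal prefix "COCO:C" followed by digits (the five-digit width is an assumption based on the shorthand):

    import re

    # Hypothetical identifier pattern for internally assigned compounds;
    # the five-digit width is an assumption from the "COCO:Cxxxxx" shorthand.
    COCO_ID = re.compile(r"^COCO:C\d{5}$")

    assert COCO_ID.match("COCO:C00042")
    assert not COCO_ID.match("KEGG:C00042")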
Moritz E. Beber
@Midnighter
  1. Do you want to make the regression test for dG'0 part of the test suite, i.e., maintain a table of known 'good' values and test against those every time? (A minimal sketch follows this list.)
  2. I can offer help with executing the computation for the extended set of compounds.
  3. While I agree that it should be easy to add new compounds, I'm not a fan of introducing yet another identifier especially if adding one requires an InChI already. What problem are you trying to overcome by providing your own identifier?
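On point 1, a minimal pytest sketch of such a regression test, assuming a version-controlled table at tests/data/dg0_reference.csv with reaction and dg0 columns, and a hypothetical predict_dg0() helper (all of these names are assumptions):

    import pandas as pd
    import pytest

    from component_contribution import predict_dg0  # hypothetical helper

    # Version-controlled table of known 'good' values: one row per
    # reaction, recording the dG'0 estimate that was considered correct
    # at the time it was added.
    REFERENCE = pd.read_csv("tests/data/dg0_reference.csv")

    @pytest.mark.parametrize("row", REFERENCE.itertuples(index=False))
    def test_dg0_regression(row):
        # "More or less" the same estimate: compare with a small absolute
        # tolerance (0.5 here is an arbitrary choice) instead of exact
        # equality.
        assert predict_dg0(row.reaction) == pytest.approx(row.dg0, abs=0.5)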
Moritz E. Beber
@Midnighter

I have created a commit that introduces git-lfs and addresses eladnoor/component-contribution#29. However, I cannot push this to our fork and make a normal PR:

Git LFS: (0 of 1 files) 0 B / 54.18 MB                                                    
batch response: @Midnighter can not upload new objects to public fork biosustain/component-contribution

so I created a patch with git format-patch, which you can simply apply to your repo with git apply.

Moritz E. Beber
@Midnighter
You can download it from here https://drive.google.com/file/d/16V9Vt3koGQk_YGnTsP3sLTtxOHAHePpl/view?usp=sharing. Please let me know when you have done so, then I will delete it again.
Elad Noor
@elad.noor_gitlab
I am trying to push your commits, but there is an error:
Uploading LFS objects:   0% (0/1), 0 B | 0 B/s
LFS upload failed:
(missing) cache/component_contribution.npz (a99de4f5e61df9fea005bb44cfd1419ca2733aba0042c46a08e6977ea633db49)
hint: Your push was rejected due to missing or corrupt local objects.
hint: You can disable this check with: 'git config lfs.allowincompletepush true'
error: failed to push some refs to 'git@github.com:eladnoor/component-contribution.git'
Moritz E. Beber
@Midnighter
:/
Can we briefly chat on meet.jit.si?
Elad Noor
@elad.noor_gitlab
I don't know what that means, but sure
Moritz E. Beber
@Midnighter
A free browser-based Skype alternative: https://meet.jit.si/component-contribution
Moritz E. Beber
@Midnighter
InChI > MetaNetX > ChEBI > ChEMBL
Maybe PubChem, KEGG, BiGG
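That line reads as a priority order for resolving compound identifiers to structures. A minimal sketch of such a resolver chain, with stub lookup functions standing in for the real database queries (all names here are hypothetical):

    # Stub resolvers; real implementations would query the respective
    # database. Each returns an InChI string, or None if the namespace
    # has no match for the query.
    def lookup_metanetx(query):
        return None

    def lookup_chebi(query):
        return None

    def lookup_chembl(query):
        return None

    RESOLVERS = [
        ("InChI", lambda q: q if q.startswith("InChI=") else None),
        ("MetaNetX", lookup_metanetx),
        ("ChEBI", lookup_chebi),
        ("ChEMBL", lookup_chembl),
    ]

    def resolve(query):
        # Try each namespace in the agreed priority order and stop at
        # the first one that yields an InChI.
        for name, resolver in RESOLVERS:
            inchi = resolver(query)
            if inchi is not None:
                return name, inchi
        raise KeyError(f"could not resolve {query!r}")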
Moritz E. Beber
@Midnighter
Some more questions and ideas:
  1. I think it would be great to add documentation (in general) and specifically a notebook that shows a common workflow with component-contribution. We could then also think about adding a command line interface to perform such a workflow.
  2. As we also briefly discussed, the data handling is a bit messy at the moment. There is partially duplicate data in cache/ and in component_contribution/data, and in some cases the same file exists in both gzip-compressed and uncompressed form. Since pandas can read compressed files out of the box (see the sketch after this list), I would suggest keeping only the compressed ones until we move to SQLite.
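On point 2, a minimal illustration of the pandas claim (the file name is hypothetical); pandas infers gzip compression from the suffix, so only the compressed copy needs to be kept:

    import pandas as pd

    # pandas detects the gzip compression from the ".gz" extension, so a
    # compressed file can be read exactly like the uncompressed one.
    df = pd.read_csv("component_contribution/data/kegg_compounds.csv.gz")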
Moritz E. Beber
@Midnighter
Oh, and regarding the licensing: I don't know whether you want to credit new code that you have written to ETH?
Elad Noor
@elad.noor_gitlab
Okay, I'm taking care of the redundant files and moving the cache folder into the component_contribution package directory. At the same time, I'll also try to start using SQLite, at least for the cached compounds. Hopefully, we will then have only .csv and .json files for the input data and .npz and .sqlite files for the cache.
Also, I'll add the licensing statement - good idea. As for the Jupyter notebook, I opened another issue for that.
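A minimal sketch of what an SQLite-backed compound cache could look like; the schema, file path, and function names are assumptions, not the actual implementation:

    import sqlite3

    # Hypothetical schema: one row per cached compound, keyed by identifier.
    conn = sqlite3.connect("component_contribution/cache/compounds.sqlite")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS compounds (id TEXT PRIMARY KEY, inchi TEXT)"
    )

    def get_compound(compound_id):
        # Return the cached InChI, or None on a cache miss.
        row = conn.execute(
            "SELECT inchi FROM compounds WHERE id = ?", (compound_id,)
        ).fetchone()
        return row[0] if row else None

    def cache_compound(compound_id, inchi):
        conn.execute(
            "INSERT OR REPLACE INTO compounds (id, inchi) VALUES (?, ?)",
            (compound_id, inchi),
        )
        conn.commit()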
Moritz E. Beber
@Midnighter
I'm just about to make another PR :)
So please hold off on moving files for just a bit.
Elad Noor
@elad.noor_gitlab
Sure
Elad Noor
@elad.noor_gitlab
Do you know if there is a web API for MetaNetX? I could easily find the FTP option, but not a web service option.
Moritz E. Beber
@Midnighter
I don't think there is one. Just the two flat files for compounds and reactions.
Elad Noor
@elad.noor_gitlab
I see. So if someone provides a new MetaNetX identifier, I'll have to download the entire 300 MB file every time?
Moritz E. Beber
@Midnighter
I was talking about these downloads: https://www.metanetx.org/mnxdoc/mnxref.html. I didn't realize they were so large.
Elad Noor
@elad.noor_gitlab
Yes, chem_prop.tsv is 307 MB, and it is the one that contains the InChIs. It has almost 700,000 entries.
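One way to look up a single identifier without holding the whole file in memory is to stream it; a minimal sketch, assuming the MNX identifier is the first tab-separated column and that MNXref comment lines start with '#' (the exact column layout depends on the release):

    import csv

    def find_compound(mnx_id, path="chem_prop.tsv"):
        # Stream the ~307 MB TSV line by line instead of loading all
        # ~700,000 entries at once.
        with open(path, newline="") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                if not row or row[0].startswith("#"):
                    continue  # skip the commented MNXref header lines
                if row[0] == mnx_id:
                    # The InChI sits in one of the later columns; which
                    # one is an assumption about the MNXref release.
                    return row
        return None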
Moritz E. Beber
@Midnighter
Ouch. Well, you are at the same university, so maybe you can figure out a better way with them :stuck_out_tongue: This is the only way I know of.
Elad Noor
@elad.noor_gitlab
I will ask them. Another option is to prioritize ChEBI over them. Their API is great, and they list 91,000 compounds, which is probably enough.
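A minimal sketch of querying ChEBI over HTTP, assuming the public getCompleteEntity endpoint (the URL and the XML element name are written from memory and should be double-checked):

    import re
    import requests

    def chebi_inchi(chebi_id):
        # The ChEBI web service returns an XML document describing the
        # entity; pull the <inchi> element out with a tolerant regex,
        # since namespace prefixes vary between response styles.
        url = "https://www.ebi.ac.uk/webservices/chebi/2.0/test/getCompleteEntity"
        response = requests.get(url, params={"chebiId": chebi_id}, timeout=30)
        response.raise_for_status()
        match = re.search(
            r"<(?:\w+:)?inchi>(.*?)</(?:\w+:)?inchi>", response.text
        )
        return match.group(1) if match else None

    print(chebi_inchi("CHEBI:15377"))  # water -> InChI=1S/H2O/h1H2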
Moritz E. Beber
@Midnighter
:+1:
Elad Noor
@elad.noor_gitlab
@midnighter - I forgot to ask about something. It seems that you or Joao tried to change the implementation of the compound_cache to use weak references. I guess it was only half implemented, and I couldn't figure out how to finish it, so I just changed it back to a standard dictionary. I also got rid of the Singleton, since it wasn't working well either. Now it's just an object that is initialized in the package file, so it's not exactly a singleton, but there is a single instance shared by every .py file that imports it. That should be good enough for most purposes. Please let me know if you think we should change it.
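A minimal sketch of that "initialized in the package file" pattern, with illustrative names: Python caches imported modules, so one module-level instance is shared by everything that imports it, without any Singleton machinery:

    class CompoundCache:
        """Dictionary-backed compound cache; no Singleton machinery needed."""

        def __init__(self):
            self._compounds = {}

        def get(self, compound_id):
            return self._compounds.get(compound_id)

        def add(self, compound_id, compound):
            self._compounds[compound_id] = compound

    # Module-level instance (in the real package this line would live in
    # the package's __init__.py). Python caches imported modules in
    # sys.modules, so every module that imports `ccache` shares this
    # one object.
    ccache = CompoundCache()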
Moritz E. Beber
@Midnighter
Hmm, yes, those are bits that @joaocardoso worked on.