These are chat archives for conda/kapsel

18th
Aug 2016
Alexey Strokach
@ostrokach
Aug 18 2016 20:30
Thanks, that solved the import problem!
Now, I am running conda kapsel init on an existing project, but it is taking ages...
pstree shows the following tree: conda───conda-kapsel───git. But this isn't a big project...
Havoc Pennington
@havocp
Aug 18 2016 21:05
which platform are you on? my guess is that it's permanently stuck for some reason
Alexey Strokach
@ostrokach
Aug 18 2016 21:06
Thanks for the reply! I'm on Linux Mint 16.04. It finished eventually, but took more than 5 minutes for sure.
I'm playing around trying to find a reason for this... Maybe it's because my repository has many ipython notebooks...
Also, is there a guide on creating a new service?
Havoc Pennington
@havocp
Aug 18 2016 21:07
I'm not sure to be honest why init would run git... clearly I'm forgetting something
the services feature is really pure placeholder right now, with just redis to work out some of the issues. there isn't a defined way to make new ones.
it's a todo item
Alexey Strokach
@ostrokach
Aug 18 2016 21:08
I have a repo that starts a MySQL database daemon from a conda MySQL (MariaDB) package: https://github.com/kimlaborg/datapkg.
Right now I have the code that starts the database in an IPython notebook, and it gets run by all other IPython notebooks that need the database.
But having this at the package level would be better.
Havoc Pennington
@havocp
Aug 18 2016 21:11
the idea is that conda kapsel will support things like that, but it hasn't been implemented yet :-/
so git is called from conda_kapsel/project.py, which calls archiver.py:_list_relative_paths_for_unignored_project_files - it's looking for .ipynb files that aren't inside a git-ignored directory, essentially
output = subprocess.check_output(
    ['git', 'ls-files', '--others', '--ignored', '--exclude-standard'],
    cwd=project_directory)
I don't know if typing that git command by hand is slow for you?
maybe you have an enormous number of ignored files, come to think of it that could cause trouble
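A minimal sketch of the filtering idea described above, not the actual conda-kapsel code: the helper names `parse_ls_files` and `unignored_notebooks` are hypothetical, and the sample git output is made up. It only illustrates turning the `git ls-files` output into a set and keeping the .ipynb paths that git does not report as ignored.

```python
def parse_ls_files(output):
    """Split raw `git ls-files` bytes into a set of relative paths."""
    return {line for line in output.decode('utf-8').splitlines() if line}


def unignored_notebooks(candidate_paths, ignored):
    """Keep the .ipynb paths that are not in the ignored set."""
    return sorted(path for path in candidate_paths
                  if path.endswith('.ipynb') and path not in ignored)


if __name__ == '__main__':
    # Made-up stand-in for what the git command would print.
    fake_output = b'notebooks/A/tmp1.dat\nnotebooks/A/scratch.ipynb\n'
    ignored = parse_ls_files(fake_output)
    found = ['notebooks/A.ipynb', 'notebooks/A/scratch.ipynb', 'README.md']
    print(unignored_notebooks(found, ignored))
```

With hundreds of thousands of ignored files, every one of those lines has to be produced by git, decoded, and stored before any notebook is checked.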
Alexey Strokach
@ostrokach
Aug 18 2016 21:14
Yes, I do have an "enormous" number of ignored files!
My IPython notebooks run jobs on a cluster, and each job saves its output to a file...
Havoc Pennington
@havocp
Aug 18 2016 21:15
yeah, I don't know how enormous is needed to make this slow ;-)
there's probably some better way to implement things here. I wonder if git-ls-files could somehow give us only the directories that are entirely ignored instead of everything in the directory, or something like that (though I don't know if that would help you, it depends on where your files are)
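One possible direction, sketched under assumptions: `git ls-files` has a `--directory` option that prints a directory name (with a trailing slash) instead of its whole contents when the entire directory qualifies. Whether git collapses directories for a given .gitignore pattern has not been checked here, and the function names `list_ignored_entries` and `is_ignored` are hypothetical, not part of conda-kapsel.

```python
import subprocess


def list_ignored_entries(project_directory):
    """Ask git for ignored entries; collapsed directories end with '/'."""
    output = subprocess.check_output(
        ['git', 'ls-files', '--others', '--ignored', '--exclude-standard',
         '--directory'],
        cwd=project_directory)
    return [line for line in output.decode('utf-8').splitlines() if line]


def is_ignored(path, entries):
    """True if path is listed itself or lies under a listed directory."""
    for entry in entries:
        if entry.endswith('/'):
            # Directory entry: anything below it counts as ignored.
            if path.startswith(entry):
                return True
        elif path == entry:
            return True
    return False
```

If git does collapse the directories, the list to process shrinks from one line per file to one line per ignored directory, at the cost of a prefix check instead of a plain set lookup.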
Alexey Strokach
@ostrokach
Aug 18 2016 21:18
That would definitely help me! My .gitignore has the line notebooks/*/**, which is supposed to ignore all temporary data produced by the notebooks.
My setup is to have IPython notebooks in the notebooks directory: notebooks/A.ipynb, notebooks/B.ipynb, etc..., and notebook temporary files in the corresponding subfolders: notebooks/A/tmpfiles, notebooks/B/tmpfiles...
Havoc Pennington
@havocp
Aug 18 2016 21:21
aha. so now to convince git-ls-files to compress the stuff some... not sure whether it can. or maybe there's a way to speed up our processing. how slow is doing the git command by hand >/dev/null ?
if git can output the list fast and we are just slow to analyze it we could probably optimize
if it outputs the list slowly then it's harder...
Alexey Strokach
@ostrokach
Aug 18 2016 21:26
$ time (git ls-files --others --ignored --exclude-standard > /dev/null)

real    1m29.819s
user    0m2.632s
sys    0m7.884s
But this repository is an exception. I tried conda kapsel init in several other repositories, and they all finished in < 10 seconds.
For this first repository, conda kapsel init was running for several minutes even after git finished.
$ git ls-files --others --ignored --exclude-standard | wc -l
708151
So this definitely is an edge case
Havoc Pennington
@havocp
Aug 18 2016 21:36
aha - I wonder if most of that time is simply building a set() from the entire list of ignored files - I was thinking "maybe we are checking whether each file is in a list to make an n-squared algorithm" but it does look like we make a set. but it's a big set.
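A rough way to test that guess on any machine, using synthetic paths rather than the real repository: the sketch below times building a set of 708151 made-up strings and compares membership checks against a set and against a list. The helper names are hypothetical and no timings are claimed here, since they depend on the machine.

```python
import timeit


def make_paths(n):
    """Generate n synthetic ignored-file paths (made up for this sketch)."""
    return ['notebooks/A/tmpfiles/job_%d.out' % i for i in range(n)]


def count_unignored(candidates, ignored):
    """Count candidates absent from `ignored` (works for a list or a set)."""
    return sum(1 for c in candidates if c not in ignored)


if __name__ == '__main__':
    paths = make_paths(708151)  # the count reported by `wc -l` above
    candidates = ['notebooks/%s.ipynb' % c for c in 'ABCDEFGHIJ']
    print('build set  :', timeit.timeit(lambda: set(paths), number=1))
    as_set = set(paths)
    print('set lookup :', timeit.timeit(
        lambda: count_unignored(candidates, as_set), number=1))
    print('list lookup:', timeit.timeit(
        lambda: count_unignored(candidates, paths), number=1))
```

The set lookup is constant time on average per candidate, while the list lookup scans the whole list for each candidate, which is the n-squared pattern mentioned above.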
Alexey Strokach
@ostrokach
Aug 18 2016 21:43
I could run a profiler locally and send you the result? Only I'm not sure how to run a profiler on something like conda kapsel init?
Havoc Pennington
@havocp
Aug 18 2016 22:05
I'm not sure offhand either, but if you can run it on a python function you could probably call the archiver.py:_list_relative_paths_for_unignored_project_files function directly
or I can reproduce and do it when I can, since I think I have the needed details here. but if you want to play with it some please do!
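A sketch of profiling a single Python function directly with the standard library's cProfile and pstats. The signature of the archiver function is not given in this conversation, so `build_ignored_set` below is a made-up stand-in workload and `profile_call` is a hypothetical helper; the real function would be passed in its place with whatever arguments it needs.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show the ten entries with the largest cumulative time.
    stats.sort_stats('cumulative').print_stats(10)
    return result, buffer.getvalue()


def build_ignored_set(n):
    """Stand-in workload: build a large set of made-up paths."""
    return set('notebooks/A/tmpfiles/job_%d.out' % i for i in range(n))


if __name__ == '__main__':
    result, report = profile_call(build_ignored_set, 100000)
    print(report)
```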
Alexey Strokach
@ostrokach
Aug 18 2016 22:11

Sure, thank you very much for your help!
I did a quick test with:

import sys
import conda.cli
import cProfile
sys.argv = ['conda', 'kapsel', 'init']
cProfile.run('conda.cli.main()')

but it just told me that all the time was spent in subprocess.py, which I don't think is right.

And conda-kapsel looks great! It's exactly what I was looking for to manage my data processing pipelines :).
Havoc Pennington
@havocp
Aug 18 2016 23:17
thanks, I hope we can get it to a polished state that works for you
the subprocess time is because conda runs conda-kapsel as a child process, so you are profiling the parent process there, which doesn't do anything except wait for the child
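The effect can be reproduced in a small sketch: the parent below hands all the work to a child Python interpreter, and the profile of the parent contains only subprocess frames. The workload and the names `run_child` and `profile_parent` are invented for illustration and are not how conda launches conda-kapsel internally.

```python
import cProfile
import io
import pstats
import subprocess
import sys


def run_child():
    """Do the real work in a child interpreter and return its output."""
    code = 'print(sum(i * i for i in range(200000)))'
    return subprocess.check_output([sys.executable, '-c', code])


def profile_parent():
    """Profile the parent only; the child's Python frames are not recorded."""
    profiler = cProfile.Profile()
    output = profiler.runcall(run_child)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats('cumulative').print_stats()
    return output, buffer.getvalue()


if __name__ == '__main__':
    output, report = profile_parent()
    print(report)
```

To see where the child spends its time, the profiler has to run inside the child. Assuming the conda-kapsel executable is an ordinary Python entry-point script, something like `python -m cProfile -o kapsel.prof /path/to/conda-kapsel init` may work; the script path and output file name here are placeholders.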