These are chat archives for ipython/ipython

16th Feb 2016
Jason Grout
@jasongrout
Feb 16 2016 02:52
FYI, @williamstein also posted a SageMathCloud project for the data and Jupyter notebooks too: https://cloud.sagemath.com/projects/4a5f0542-5873-4eed-a85c-a18c706e8bcd/files/support/2016-02-12-LIGO/GW150914_tutorial.html
saiyam1814
@saiyam1814
Feb 16 2016 08:34
Hi ... I am new to IPython... I would like to know if there are any specific prerequisites for it?
Min RK
@minrk
Feb 16 2016 08:36
There are several, but if you install with pip or conda, you will get them.
e.g. pip install ipython or conda install ipython
If you want the notebook, etc., that's now a different package, e.g. conda install notebook or pip install notebook
@davidgasquez IPython parallel doesn't need to know about how many threads or processes your individual tasks are going to use. As far as it's concerned, you are giving engines functions to call or code to run, and they may use whatever resources you like.
So if you want to send a function that's already going to use many threads (or processes) to a remote node, then you probably only want to start one IPython engine on that node.
If, however, you have existing multiprocessing code and you want to replace that multiprocessing with IPython parallel, then you may want one engine per core.
All that said, if you want to do distributed actions on large data frames, there's probably a better tool for that than IPython parallel: distributed
It has first-class support for distributed data frames, which may reduce the amount of parallel code that you have to write.
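A minimal sketch of the pattern described above (one engine per node, each task managing its own threads), assuming ipyparallel is installed and a cluster is already running; the function, chunk sizes, and thread count are illustrative:

```python
# Minimal sketch: one engine per node, each task runs its own thread pool.
# Assumes a running ipyparallel cluster (e.g. started with `ipcluster start`).
import ipyparallel as ipp


def heavy_task(chunk):
    # Import inside the function so the engine does not depend on our local
    # namespace; the task decides how many threads to use, IPython parallel
    # only delivers the function and its arguments to an engine.
    from concurrent.futures import ThreadPoolExecutor
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda x: x * x, chunk))


rc = ipp.Client()                 # connect to the running cluster
view = rc.load_balanced_view()

# One task per chunk; each task may use up to 8 threads on its engine.
results = view.map_sync(heavy_task, [range(0, 10), range(10, 20)])
```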
saiyam1814
@saiyam1814
Feb 16 2016 08:47
@minrk can you tell me if there are any specific prerequisites for IPython?
Min RK
@minrk
Feb 16 2016 08:47
@saiyam1814 yes, there are several, but if you install with pip or conda, you will get them.
saiyam1814
@saiyam1814
Feb 16 2016 08:49
ok thank you @minrk
David Gasquez
@davidgasquez
Feb 16 2016 08:59
@minrk My goal is to send a function that's already going to use many threads (or processes) to several remote nodes, and benefit from the threads of each node.
If I have 4 nodes with 8 threads each, I'll be running my code with 4*8 threads. At the moment I'm running with 8 threads on only one node. Using IPython Parallel would be similar to running it with 4 threads, am I right?
Min RK
@minrk
Feb 16 2016 09:01
If you have 4 IPython Parallel engines, each running an 8-thread task, that's 32 threads.
David Gasquez
@davidgasquez
Feb 16 2016 09:03
Yes, that's what I want to get.
Min RK
@minrk
Feb 16 2016 09:03
With SGE, you can use --cpus-per-task to let the scheduler assign one task for every N CPUs, so that your threads will be spread out properly.
You will probably want a custom batch script for the engines in order to specify your extra arguments.
Are you familiar with SGE scripts?
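Roughly what such a custom engine batch script could look like in ipcluster_config.py; the parallel environment name and slot count are assumptions and depend on how your grid is configured:

```python
# Hypothetical snippet for ipcluster_config.py: launch engines via SGE with
# a custom batch template so each engine job reserves several slots.
c = get_config()  # provided by IPython's configuration loader

c.IPClusterEngines.engine_launcher_class = 'SGE'

# The 'smp' parallel environment and the slot count (8) are assumptions;
# use whatever your grid administrator has set up.
c.SGEEngineSetLauncher.batch_template = """#!/bin/sh
#$ -V
#$ -cwd
#$ -N ipengine
#$ -pe smp 8
ipengine --profile-dir="{profile_dir}" --cluster-id="{cluster_id}"
"""
```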
David Gasquez
@davidgasquez
Feb 16 2016 09:04
So if I run a simple map(int, ['1', '2', '3']), will it run with 32 threads?
More or less, I've been using them for one or two months
@minrk I would like to thank you for your time ;)
Min RK
@minrk
Feb 16 2016 09:06
Happy to help!
David Gasquez
@davidgasquez
Feb 16 2016 09:07
I've been having trouble with proper parallelization and distributed systems lately
Min RK
@minrk
Feb 16 2016 09:07
No, sorry. IPython Parallel itself does not have any awareness of threads or processes per engine. Once you have engines running in various places, a map in IPython parallel will run one IPython task per engine at a time.
David Gasquez
@davidgasquez
Feb 16 2016 09:08
A more generic question: how do people scale a scikit-learn application nowadays?
Min RK
@minrk
Feb 16 2016 09:08
But it's up to you how many threads or processes each of those tasks may use. IPython won't be the one creating the threads; that's you.
joblib is common, though I think it's mostly local.
David Gasquez
@davidgasquez
Feb 16 2016 09:08
I thought it could do the same as when you run it locally (use threads) on each node
Min RK
@minrk
Feb 16 2016 09:08
I know Olivier Grisel has used IPython parallel as a backend for scikit-learn, though I haven't seen that in a while.
Yes, you can.
If you have a local function that uses threads, if you send that same function to the engine, it will use threads there.
David Gasquez
@davidgasquez
Feb 16 2016 09:09
But not a simple map or things like that
I've seen several talks from Olivier Grisel, but they are from 2013
Min RK
@minrk
Feb 16 2016 09:09
So if you did map(multithreaded_function, args, ...), it would create one task per engine and each engine would run your multithreaded task, using all cores.
If you are using simple map, and are mapping single-threaded functions, then it is appropriate to run one engine per core
(it's effectively remote multiprocessing)
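A sketch of that "remote multiprocessing" case, again assuming a running ipyparallel cluster; the mapped function is just a toy example:

```python
# Sketch: one engine per core, mapping a plain single-threaded function.
import ipyparallel as ipp

rc = ipp.Client()
dview = rc[:]  # direct view over all engines

# Each call is single-threaded, so one engine per core keeps every core busy.
squares = dview.map_sync(lambda x: x * x, range(100))
```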
David Gasquez
@davidgasquez
Feb 16 2016 09:10
Thanks for the explanation
I guess I'll continue with my search for the right tool
Min RK
@minrk
Feb 16 2016 09:11
So how you allocate engines on CPUs depends on what kind of functions you are going to send to the engine.
David Gasquez
@davidgasquez
Feb 16 2016 09:11
Now I see it. Thanks again.
Min RK
@minrk
Feb 16 2016 09:12
If the functions are plain single-threaded operations (e.g. you are replacing your current use of threads with IPython parallel), then one engine per CPU is right, and a simple map will use all your cores.
But if you want to call your existing multi-threaded function many times, that's when the one engine for every N cores becomes the way to go.
But if your general problem is operations on really big data frames, definitely have a look at distributed.
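For reference, a rough sketch of what the dask/distributed route can look like for data-frame work; the scheduler address is hypothetical and the aggregation is only an example:

```python
# Rough sketch of the dask/distributed alternative for big DataFrames.
# The scheduler address is hypothetical; point it at your own cluster.
import dask.dataframe as dd
from distributed import Client

client = Client('tcp://scheduler-host:8786')

# dask partitions the CSV and distributes the work across the cluster
sessions = dd.read_csv('sessions.csv')

# A per-user aggregation without writing any explicit parallel code
per_user = sessions.groupby('user_id')['secs_elapsed'].sum().compute()
```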
David Gasquez
@davidgasquez
Feb 16 2016 09:13
Is it okay to paste some code here?
Min RK
@minrk
Feb 16 2016 09:18
If you put it in markdown fenced code blocks:
    ```python
    def foo():
        ...
    ```
David Gasquez
@davidgasquez
Feb 16 2016 09:18
I have something like this:
import pandas as pd
import multiprocessing


def process(user):
    # Locate all the user sessions in the *global* sessions dataframe
    user_session = sessions.loc[sessions['user_id'] == user]
    user_session_data = pd.Series()

    # Make calculations and append to user_session_data

    return user_session_data


# The users DataFrame contains the ID and other info for each user
users = pd.read_csv('users.csv')

# Each row is the details of one user action.
# There are several rows with the same user ID
sessions = pd.read_csv('sessions.csv')

p = multiprocessing.Pool(4)
sessions_id = sessions['user_id'].unique()

# I'm passing just the ID argument to the process() function so
# there is no copy of the big sessions DataFrame
result = p.map(process, sessions_id)
Min RK
@minrk
Feb 16 2016 09:19
I don't suppose you have any sample data?
David Gasquez
@davidgasquez
Feb 16 2016 09:20
what sample do you need?
Min RK
@minrk
Feb 16 2016 09:20
just an example of the csvs
David Gasquez
@davidgasquez
Feb 16 2016 09:20
give me a second :)
Min RK
@minrk
Feb 16 2016 09:21
Do you have any experience with IPython parallel?
David Gasquez
@davidgasquez
Feb 16 2016 09:21
not a lot
Min RK
@minrk
Feb 16 2016 09:22
sending data around is the most expensive thing to do, and IPython doesn't do anything super clever about deduplicating communication if you ask it to send the same data around many times.
How big are the CSVs?
David Gasquez
@davidgasquez
Feb 16 2016 09:22
600mb
or so
        user_id                 action action_type  \
62   d1mm9tcy42                   show         NaN   
172  d1mm9tcy42  ajax_refresh_subtotal       click   
736  qtw88d9pbl                 lookup         NaN   
104  d1mm9tcy42   hosting_social_proof   -unknown-   
100  d1mm9tcy42                 lookup         NaN   
580  qtw88d9pbl       similar_listings        data   
95   d1mm9tcy42                   show         NaN   
572  qtw88d9pbl         search_results       click   
731  qtw88d9pbl       similar_listings        data   
885  ucgks2fyez                 lookup         NaN   

                   action_detail      device_type  secs_elapsed  
62                           NaN  Windows Desktop            77  
172  change_trip_characteristics  Windows Desktop          3522  
736                          NaN      Mac Desktop           382  
104                    -unknown-  Windows Desktop         73312  
100                          NaN  Windows Desktop            47  
580             similar_listings      Mac Desktop           188  
95                           NaN  Windows Desktop            38  
572          view_search_results      Mac Desktop         79946  
731             similar_listings      Mac Desktop            64  
885                          NaN      iPad Tablet          2407
this is the sessions.csv
several same user_ids
Right now I'm running process_users multithreaded, passing the ID and keeping this big DataFrame read-only
Min RK
@minrk
Feb 16 2016 09:25
just for testing, can you share actual csv-format snippets?
David Gasquez
@davidgasquez
Feb 16 2016 09:27
,user_id,action,action_type,action_detail,device_type,secs_elapsed
600,qtw88d9pbl,personalize,data,wishlist_content_update,Mac Desktop,626.0
852,ucgks2fyez,show,view,p3,iPad Tablet,557143.0
803,ucgks2fyez,search_results,click,view_search_results,iPad Tablet,1279.0
321,xwxei6hdk4,confirm_email,click,confirm_email_link,iPhone,46262.0
558,qtw88d9pbl,similar_listings,data,similar_listings,Mac Desktop,133.0
496,qtw88d9pbl,ajax_refresh_subtotal,click,change_trip_characteristics,Mac Desktop,507.0
449,qtw88d9pbl,qt2,view,message_thread,Mac Desktop,8507.0
107,d1mm9tcy42,show,view,p3,Windows Desktop,44446.0
849,ucgks2fyez,ajax_refresh_subtotal,click,change_trip_characteristics,iPad Tablet,1415.0
508,qtw88d9pbl,search_results,click,view_search_results,Mac Desktop,9149.0
Min RK
@minrk
Feb 16 2016 09:28
thanks
David Gasquez
@davidgasquez
Feb 16 2016 09:28
no problem, in fact, it's me who should thank you for being so helpful
Min RK
@minrk
Feb 16 2016 09:29
The main thing is that you will want to load the data on the engines, and refer to the data via the global namespace, rather than passing the data many times via closures or arguments.
David Gasquez
@davidgasquez
Feb 16 2016 09:29
users.csv contains some user_id's and their ages
I'm not sure how to do it right now, since each node will have its own copy at the beginning but won't know which users are being processed by other nodes :/
I think SGE is not the best distributed system tool
;)
I guess I could rent a big server and make this easily. But it's not fun!
Min RK
@minrk
Feb 16 2016 09:35
See this example for what it might look like
The main thing is the %%px cell, which tells all the engines to run a bit of code loading the data frame(s) before you start submitting work.
That populates the namespace that will be referred to when process looks up what sessions should refer to.
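A rough sketch of what that suggestion could look like for the snippet pasted above, assuming a running ipyparallel cluster; the calculation inside process() is a placeholder:

```python
# Rough ipyparallel translation of the multiprocessing snippet above.
import pandas as pd
import ipyparallel as ipp

rc = ipp.Client()
dview = rc[:]

# Equivalent of a %%px cell: every engine loads the sessions DataFrame once,
# so it lives in each engine's global namespace instead of travelling with
# every task.
dview.execute(
    "import pandas as pd\n"
    "sessions = pd.read_csv('sessions.csv')",
    block=True,
)


def process(user):
    # `sessions` is resolved in the engine's global namespace when the
    # function runs remotely
    user_session = sessions.loc[sessions['user_id'] == user]
    return user_session['secs_elapsed'].sum()  # placeholder calculation


# Read locally only to collect the unique IDs; only the small IDs are sent
# over the wire, not the big DataFrame.
user_ids = pd.read_csv('sessions.csv')['user_id'].unique()
results = rc.load_balanced_view().map_sync(process, user_ids)
```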
David Gasquez
@davidgasquez
Feb 16 2016 09:36
It seems a good solution ;)
Thanks!
saiyam1814
@saiyam1814
Feb 16 2016 09:37
@minrk I installed conda, then I executed conda install ipython and then conda install jupyter, so all this is done .... I need to clone the git repo for the notebook; can you help me a bit with that, or provide an existing link if there is one already that I'll go through? And does any other stuff need to be done?
Min RK
@minrk
Feb 16 2016 09:37
Why do you need to clone the notebook?
oh, do you mean a specific notebook, not the notebook package (sorry, the term 'notebook' on its own can be a bit ambiguous)?
What repo do you need to clone?
Typically that will be in a terminal: git clone https://github.com/user/project
Most hosted repos (GitHub, Bitbucket, etc.) will have a link that lets you copy the exact command for cloning the repo.
saiyam1814
@saiyam1814
Feb 16 2016 09:47
@minrk thank you .. yeah, I used a broader word though .... will work on it and get started :)
:)