scikit-learn: machine learning in Python. Please feel free to ask specific questions about scikit-learn. Please try to keep the discussion focused on scikit-learn usage and immediately related open source projects from the Python ecosystem.
Hello guys, maybe someone can help me out here. I am running the following validation code:
train_scores, valid_scores = validation_curve(estimator=pipeline, # estimator (pipeline)
X=features, # features matrix
y=target, # target vector
param_name='pca__n_components',
param_range=range(1,50), # test these k-values
cv=5, # 5-fold cross-validation
scoring='neg_mean_absolute_error') # use negative validation
in the same .py file on different machines, which I will call #1 localhost, #2 staging, #3 live, #4 live.
localhost and staging both have i7 CPUs; localhost needs around 40 seconds for the validation, staging around 13-14 seconds.
live (#3) and live (#4) need almost 10 minutes to execute the validation. Both of these servers have Intel CPUs with 48 threads.
To get more "trustworthy" numbers I dockerized the code and ran the images on the servers. Does anyone have an idea why the speed is so different?
from sklearn.linear_model import LinearRegression
model = LinearRegression()
from sklearn.preprocessing import PolynomialFeatures
poly_transformer = PolynomialFeatures(degree=2, include_bias=False)
from sklearn.pipeline import Pipeline
pipeline = Pipeline([('poly', poly_transformer), ('reg', model)])
After profiling, I saw this (slowest time on bottom, sorted by 3rd column):
4150 208.706 0.050 208.706 0.050 {built-in method numpy.dot}
245 13.112 0.054 13.360 0.055 decomp_svd.py:16(svd)
2170 142.567 0.066 143.360 0.066 decomp_lu.py:153(lu)
Just executed python -m cProfile validation.py
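For comparing runs across machines, it may be easier to sort the profile by total time instead of reading the default order. The following is a minimal sketch using only the standard library's cProfile and pstats; busy_work and profile_top are hypothetical names, with busy_work standing in for the validation_curve call.

```python
import cProfile
import io
import pstats


def busy_work(n):
    # Hypothetical stand-in for the validation_curve call being profiled.
    return sum(i * i for i in range(n))


def profile_top(func, *args, sort_key="tottime", limit=5):
    """Profile func(*args) and return (result, report text sorted by sort_key)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()

    # Write the report into a string buffer instead of stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats(sort_key).print_stats(limit)
    return result, buffer.getvalue()


result, report = profile_top(busy_work, 100_000)
print(report)
```

On the command line, python -m cProfile -s tottime validation.py should give the same ordering without changing the script.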
@amueller I don't know if this helps. I ran:
from scipy import linalg
import numpy as np
m, n = 9, 6
a = np.random.randn(m, n) + 1.j*np.random.randn(m, n)
U, s, Vh = linalg.svd(a)
print(U.shape, s.shape, Vh.shape)
cProfile says:
394 0.004 0.000 0.017 0.000 <frozen importlib._bootstrap_external>:1233(find_spec)
900 0.004 0.000 0.004 0.000 {built-in method posix.stat}
1 0.006 0.006 0.006 0.006 lil.py:23(lil_matrix)
81/24 0.007 0.000 0.011 0.000 sre_compile.py:64(_compile)
402/399 0.011 0.000 0.022 0.000 {built-in method builtins.__build_class__}
212/1 0.023 0.000 0.222 0.222 {built-in method builtins.exec}
190 0.024 0.000 0.024 0.000 {built-in method marshal.loads}
39/37 0.038 0.001 0.043 0.001 {built-in method _imp.create_dynamic}
(sorted by second column)
9 0.000 0.000 0.000 0.000 __future__.py:79(__init__)
9 0.000 0.000 0.000 0.000 _globals.py:77(__repr__)
9 0.000 0.000 0.000 0.000 {method 'encode' of 'str' objects}
9 0.000 0.000 0.000 0.000 {method 'keys' of 'dict' objects}
9 0.000 0.000 0.000 0.000 os.py:742(encode)
9 0.000 0.000 0.001 0.000 abc.py:151(register)
9 0.000 0.000 0.001 0.000 datetime.py:356(__new__)
900 0.001 0.000 0.005 0.000 <frozen importlib._bootstrap_external>:75(_path_stat)
900 0.004 0.000 0.004 0.000 {built-in method posix.stat}
936 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:321(<genexpr>)
96 0.000 0.000 0.000 0.000 enum.py:630(<lambda>)
39/37 0.038 0.001 0.043 0.001 {built-in method _imp.create_dynamic}
1 0.002 0.002 0.002 0.002 __init__.py:259(_reset_cache)
1 0.006 0.006 0.006 0.006 lil.py:23(lil_matrix)
(sorted by third column)
@amueller when I run this code:
train_scores, valid_scores = validation_curve(estimator=pipeline, # estimator (pipeline)
X=features, # features matrix
y=target, # target vector
param_name='pca__n_components',
param_range=range(1,50), # test these k-values
cv=5, # 5-fold cross-validation
scoring='neg_mean_absolute_error') # use negative validation
directly on the host (with 24 cores) I get ~30 seconds. When I run it directly on localhost (4 cores, 8 threads) I get around 30-40 seconds as well. When I run it inside Docker with a CPU limit of 6 cores and 6 GB RAM, it needs almost 10 minutes. Inside a VirtualBox VM with 2 cores it takes around 30 seconds. It seems scikit-learn does not play well with Docker CPU limits, which use the CFS scheduler: link
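One plausible explanation is that os.cpu_count() reports the host's cores even when a CFS quota restricts the container, so thread pools sized from it can oversubscribe the allowed CPU time. The sketch below estimates the CPUs a process may really use. It assumes a Linux container with cgroup v2, where the quota is exposed at /sys/fs/cgroup/cpu.max; cgroup v1 uses different files and is not handled here. parse_cpu_max and effective_cpu_count are hypothetical helper names.

```python
import os
from pathlib import Path


def parse_cpu_max(text):
    """Parse cgroup v2 cpu.max content: "<quota> <period>" or "max <period>".

    Returns the CPU limit as a float, or None when no limit is set.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def effective_cpu_count(cpu_max_path="/sys/fs/cgroup/cpu.max"):
    """Best-effort count of CPUs this process may actually use."""
    count = os.cpu_count() or 1
    # Respect CPU affinity where the platform exposes it (Linux).
    if hasattr(os, "sched_getaffinity"):
        count = len(os.sched_getaffinity(0))
    path = Path(cpu_max_path)  # assumed cgroup v2 location
    if path.is_file():
        try:
            limit = parse_cpu_max(path.read_text())
        except (OSError, ValueError):
            limit = None
        if limit is not None:
            # Round down, but never report fewer than one CPU.
            count = min(count, max(1, int(limit)))
    return count


print("os.cpu_count():", os.cpu_count())
print("effective CPUs:", effective_cpu_count())
```

If the two numbers differ inside the container (for example 48 versus 6), that would be consistent with the slowdown described above, although it does not prove it.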
When I change param_range to range(1,5), the code runs much faster (I am no data scientist).
validation_curve does not really profit from multithreading/multiprocessing. I get almost the same results on an Intel i7 (4 cores) and an Intel Xeon (24 cores). The problem is that when the validation curve runs on the Xeon machines, it uses all cores and the machine is overloaded, which makes no sense, really :)
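If the load on all cores comes from the linear algebra backend (numpy.dot dominates the profile above), capping the native thread pools may help. This is a minimal sketch, assuming NumPy is linked against a backend that honours the usual OMP_NUM_THREADS, OPENBLAS_NUM_THREADS or MKL_NUM_THREADS environment variables; cap_native_threads is a hypothetical helper, and the value 4 is only an example.

```python
import os

# Environment variables read by common numerical backends when they load.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def cap_native_threads(n_threads, environ=os.environ):
    """Set the thread-count variables; call before importing numpy or scipy."""
    if n_threads < 1:
        raise ValueError("n_threads must be >= 1")
    for name in THREAD_ENV_VARS:
        environ[name] = str(n_threads)
    return {name: environ[name] for name in THREAD_ENV_VARS}


settings = cap_native_threads(4)
print(settings)
# Import numpy / scikit-learn only after this point so the caps take effect.
```

The same variables can be set in the Docker image or on the command line instead, which avoids depending on import order inside the script.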
conda install numpy scipy cython matplotlib pytest flake8 sphinx sphinx-gallery
or something like that