scikit-learn: machine learning in Python. Please feel free to ask specific questions about scikit-learn. Please try to keep the discussion focused on scikit-learn usage and immediately related open source projects from the Python ecosystem.
Hello guys, maybe someone can help me out here. I am running the following validation code:
from sklearn.model_selection import validation_curve
train_scores, valid_scores = validation_curve(estimator=pipeline, # estimator (pipeline)
X=features, # feature matrix
y=target, # target vector
param_name='pca__n_components',
param_range=range(1,50), # test these n_components values
cv=5, # 5-fold cross-validation
scoring='neg_mean_absolute_error') # negated MAE, so higher is better
in the same .py file on four different machines, which I'll call #1 localhost, #2 staging, #3 live, #4 live.
localhost and staging both have i7 CPUs; localhost needs around 40 seconds for the validation, staging around 13-14 seconds.
live (#3) and live (#4) need almost 10 minutes to execute the validation, even though both of these servers have Intel CPUs with 48 threads.
In order to get more "trustworthy" numbers I dockerized the code and ran the images on the servers. Does anyone have an idea why the speed differs so much?
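One common cause of this pattern (a guess, not confirmed from the numbers above) is BLAS/OpenMP oversubscription: on a 48-thread box, every CV fit can spin up 48 BLAS threads that fight each other for cores. A minimal sketch for inspecting and capping the thread pools with threadpoolctl, which ships as a scikit-learn dependency:

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

# Show how many threads each native pool (OpenBLAS/MKL/OpenMP) will use.
for pool in threadpool_info():
    print(pool['internal_api'], pool.get('num_threads'))

with threadpool_limits(limits=1):
    # Heavy linear algebra now runs single-threaded in this process,
    # avoiding 48-way oversubscription inside each CV worker.
    a = np.random.randn(200, 200)
    result = np.dot(a, a)
```

If the thread counts printed on the slow servers are much higher than on localhost/staging, capping them (or setting OMP_NUM_THREADS=1 in the environment) is worth a try.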
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

model = LinearRegression()
poly_transformer = PolynomialFeatures(degree=2, include_bias=False)
pipeline = Pipeline([('poly', poly_transformer), ('reg', model)])
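Note that this pipeline only has steps named 'poly' and 'reg', so param_name='pca__n_components' would not resolve against it; presumably the real pipeline also contains a PCA step. A minimal runnable sketch on synthetic data (the PCA step, data shapes, and the small param_range are my assumptions, added so the call works end to end):

```python
import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 10))  # synthetic stand-in for the real data
target = features @ rng.normal(size=10) + rng.normal(scale=0.1, size=500)

pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('pca', PCA()),  # step name must match the 'pca__' prefix in param_name
    ('reg', LinearRegression()),
])

start = time.perf_counter()
train_scores, valid_scores = validation_curve(
    estimator=pipeline,
    X=features,
    y=target,
    param_name='pca__n_components',
    param_range=range(1, 10),  # small range to keep the sketch fast
    cv=5,
    scoring='neg_mean_absolute_error',
)
print(f"{time.perf_counter() - start:.1f}s",
      train_scores.shape, valid_scores.shape)
```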
After profiling, I saw this (slowest per call at the bottom, sorted by the 3rd column, percall):
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
4150  208.706    0.050  208.706    0.050 {built-in method numpy.dot}
 245   13.112    0.054   13.360    0.055 decomp_svd.py:16(svd)
2170  142.567    0.066  143.360    0.066 decomp_lu.py:153(lu)
Just executed python -m cProfile validation.py (cProfile also accepts -s tottime etc. to sort the output directly).
@amueller I don't know if this helps:
I ran
from scipy import linalg
import numpy as np
m, n = 9, 6
a = np.random.randn(m, n) + 1.j*np.random.randn(m, n)
U, s, Vh = linalg.svd(a)
print(U.shape, s.shape, Vh.shape)  # (9, 9) (6,) (6, 6)
cProfile says:
394 0.004 0.000 0.017 0.000 <frozen importlib._bootstrap_external>:1233(find_spec)
900 0.004 0.000 0.004 0.000 {built-in method posix.stat}
1 0.006 0.006 0.006 0.006 lil.py:23(lil_matrix)
81/24 0.007 0.000 0.011 0.000 sre_compile.py:64(_compile)
402/399 0.011 0.000 0.022 0.000 {built-in method builtins.__build_class__}
212/1 0.023 0.000 0.222 0.222 {built-in method builtins.exec}
190 0.024 0.000 0.024 0.000 {built-in method marshal.loads}
39/37 0.038 0.001 0.043 0.001 {built-in method _imp.create_dynamic}
(sorted by second column)
9 0.000 0.000 0.000 0.000 __future__.py:79(__init__)
9 0.000 0.000 0.000 0.000 _globals.py:77(__repr__)
9 0.000 0.000 0.000 0.000 {method 'encode' of 'str' objects}
9 0.000 0.000 0.000 0.000 {method 'keys' of 'dict' objects}
9 0.000 0.000 0.000 0.000 os.py:742(encode)
9 0.000 0.000 0.001 0.000 abc.py:151(register)
9 0.000 0.000 0.001 0.000 datetime.py:356(__new__)
900 0.001 0.000 0.005 0.000 <frozen importlib._bootstrap_external>:75(_path_stat)
900 0.004 0.000 0.004 0.000 {built-in method posix.stat}
936 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:321(<genexpr>)
96 0.000 0.000 0.000 0.000 enum.py:630(<lambda>)
39/37 0.038 0.001 0.043 0.001 {built-in method _imp.create_dynamic}
1 0.002 0.002 0.002 0.002 __init__.py:259(_reset_cache)
1 0.006 0.006 0.006 0.006 lil.py:23(lil_matrix)
(sorted by third column)
@amueller when I run this code:
train_scores, valid_scores = validation_curve(estimator=pipeline, # estimator (pipeline)
X=features, # feature matrix
y=target, # target vector
param_name='pca__n_components',
param_range=range(1,50), # test these n_components values
cv=5, # 5-fold cross-validation
scoring='neg_mean_absolute_error') # negated MAE, so higher is better
directly on the host (with 24 cores), I get ~30 seconds. When I run it directly on localhost (4 cores, 8 threads) I get around 30-40 seconds as well. When I run it inside Docker with a CPU limit of 6 cores and 6 GB RAM, it needs almost 10 minutes. Inside a VirtualBox VM with 2 cores it also takes around 30 seconds. It seems scikit-learn does not play well with Docker's CPU limits, which use the CFS scheduler: link
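This would fit the oversubscription theory: inside a container limited with --cpus, os.cpu_count() still reports the host's CPUs, so libraries that size their thread/process pools from it (as joblib does by default) can heavily oversubscribe the CFS quota. A quick sketch to check what the container actually sees (sched_getaffinity reflects a --cpuset-cpus pin, but not a --cpus quota):

```python
import os

# Reports the host's logical CPUs, even under a Docker --cpus limit.
print("cpu_count:", os.cpu_count())

# Reflects a --cpuset-cpus pin; a plain --cpus CFS quota is invisible here.
if hasattr(os, "sched_getaffinity"):
    print("affinity:", len(os.sched_getaffinity(0)))
```

If cpu_count inside the container is far above the quota, explicitly limiting thread counts (e.g. OMP_NUM_THREADS) or pinning CPUs with --cpuset-cpus may restore the expected runtime.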
If I set param_range to range(1, 5), the code runs much faster (I am no data scientist).
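That speedup is expected: each value in param_range costs cv full pipeline fits, so shrinking or coarsening the grid cuts the work roughly proportionally. A quick back-of-the-envelope (the step of 5 is just an illustration):

```python
# Number of pipeline fits validation_curve performs: len(param_range) * cv
cv = 5
full_grid = len(range(1, 50)) * cv       # original grid: 49 values * 5 folds
coarse_grid = len(range(1, 50, 5)) * cv  # step of 5: 10 values * 5 folds
print(full_grid, coarse_grid)
```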