Olivier Grisel
@ogrisel
Heads up: if you use conda and upgrade your env, you might get a crash when using n_jobs>=2. This is caused by an updated version of intel-openmp in the default channel of conda. I reported the issue upstream as ContinuumIO/anaconda-issues#11294 and the problem is tracked in this PR on the scikit-learn side: scikit-learn/scikit-learn#15020
The error message is OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361) reported by the dying worker process.
This in turn causes loky to raise: TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGABRT(-6)}.
Samesh Lakhotia
@sameshl
If someone is free to review, please take a look at scikit-learn/scikit-learn#14993 and scikit-learn/scikit-learn#15045.
Andreas Mueller
@amueller
hm is there a pandas gitter? Or is @jorisvandenbossche around lol? For a pandas dtype, how do I get the closest numpy dtype to cast to?
Joris Van den Bossche
@jorisvandenbossche
yep
there is pandas gitter actually (pydata/pandas)
I don't think there is a typical way to do it
If I remember correctly, there is an issue about it
Basically, you would like to know the dtype of np.asarray(obj).dtype right? (but without needing to do the actual conversion?)
Andreas Mueller
@amueller
indeed
it's for scikit-learn/scikit-learn#15094 which is currently failing because np.result_type(pd.CategoricalDtype) raises an error
Joris Van den Bossche
@jorisvandenbossche
the issue that I remembered is pandas-dev/pandas#22791
Andreas Mueller
@amueller
ok. so no solution :-/ is there a work-around?
like what does actually happen when you do the conversion?
is it from the pd.DataFrame.__array__ method or something?
Andreas Mueller
@amueller
yeah it is, no way to figure that one out :-/
Jesse Leigh Patsolic
@MrAE

Hello all (I am new to Cython),

I am currently working on adding an augmented version of Breiman's forest-RC algorithm (similar to RandomForest) to my fork of scikit-learn. In short, the algorithm takes linear combinations of features and projects them with weights randomly selected in {-1,1} to form a new feature to split on. The number of features combined at each split is a random variable.

The current SplitRecord only holds one feature, I need something to store a vector of features and a vector to hold weights.

  1. I tried initializing an np.ndarray and using memoryviews, but ran into GIL issues.
  2. I tried to make an ObliqueSplitRecord class, but that can't be passed as a pointer into functions because it is a Python object.
  3. I tried to augment the SplitRecord struct in _splitter.pxd but that didn't seem to work because vectors would then be of fixed length.
  4. I tried to use something similar to the tree/_utils:Stack but fell into the same problem as it was a class and couldn't be passed as a pointer into a function.

I am looking into using cppclass, but am not sure if that will solve my problem.

Does anyone have suggestions on how to best implement this in a Cythonic way? i.e. storing a vector of things while avoiding the GIL and not using python objects?
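
The projection step described above can be sketched in plain Python as follows. This is only an illustration of the idea, not the Cython code under discussion: the function name `random_projection_feature`, the `max_combined` parameter, and the uniform draw for the number of combined features are assumptions made for the example.

```python
import random


def random_projection_feature(X, max_combined, rng):
    """Build one candidate split feature as a random +/-1 combination.

    X is a list of rows (lists of floats). Hypothetical helper for
    illustration only; it is not part of scikit-learn.
    """
    n_features = len(X[0])
    # Assumption: the number of combined features is uniform on
    # 1..max_combined; the message only says it is a random variable.
    k = rng.randint(1, min(max_combined, n_features))
    # Pick k distinct feature indices and one weight in {-1, 1} for each.
    feature_idx = rng.sample(range(n_features), k)
    weights = [rng.choice((-1, 1)) for _ in feature_idx]
    # Project every sample onto the weighted combination.
    projected = [
        sum(w * row[j] for j, w in zip(feature_idx, weights)) for row in X
    ]
    return feature_idx, weights, projected


X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
idx, w, z = random_projection_feature(X, max_combined=2, rng=random.Random(0))
```

Here `idx` and `w` are the two variable-length vectors (feature indices and weights) that a split record would need to hold in place of a single feature.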

Adrin Jalali
@adrinjalali
@MrAE you can use a cpp vector in cython. But since you're changing the splitrecord struct, you'll need to change the code in quite a lot of places.
Mateusz Sokół
@mtsokol
Hi, I have a basic question about the local docs build for scikit-learn. I've been trying to modify the API docs for a file in sklearn/linear_model and followed the instructions in the Contributors Guide. But after a few attempts, the make command inside /docs does not seem to update the local docs build inside _build. In the browser, the API docs didn't change although I modified the sources. Am I missing something?
Nicolas Hug
@NicolasHug
@mtsokol it seems that you're doing it right... maybe double check that 1. you're actually changing the sources, i.e. not anything in the _build folder, 2. the doc that you're changing is about a public estimator/tool (private tools aren't rendered in the doc anyway) and 3. you're looking at the generated html in doc/_build/html/stable/
Nicolas Hug
@NicolasHug

@MrAE

re 1. you can't use (let alone allocate) numpy arrays when the GIL is released because these are Python objects. Is there a way for you to allocate the arrays somewhere where the GIL is held, and use memory views when the GIL is released? Memory views are safe to use without the GIL

re 2. is it still considered a Python object if you use a cdefed class and all the attributes are cdefed as well?

re 3. what vectors? can't you use a view as a field of the struct?

Nicolas Hug
@NicolasHug
Also @MrAE I happen to have been writing about Cython over the weekend... maybe that could help http://nicolas-hug.com/blog/cython_notes
Jan-Benedikt Jagusch
@janjagusch
could somebody share a good example for class docstrings in scikit-learn that we could use as a sort of template? thanks!
Alessandro Surace
@zioalex
Here is the issue search string "is:issue is:open examples class docs involves:adrinjalali"
Alessandro Surace
@zioalex
Hey guys, who is veerlosar on GitHub? I just want to talk about the OneVsRestClassifier example
Benjamin Bossan
@BenjaminBossan
@zioalex I can talk to veerlosar, we're at the same sprint
veerlosar
@veerlosar

Hey guys, who is veerlosar on GitHub? I just want to talk about the OneVsRestClassifier example

@zioalex what did you want to talk about?

rajnish1642
@rajnish1642
How can I learn scikit-learn thoroughly? Could you please point me to some resources?
Jérémie du Boisberranger
@jeremiedbb
Andreas Mueller's book, Introduction to Machine Learning with Python: A Guide for Data Scientists, is quite complete.
You can also look at the user guides: https://scikit-learn.org/stable/user_guide.html
Andreas Mueller
@amueller
there's also my lecture series: https://youtube.com/AndreasMueller The only complete resource is the user guide though
Ghost
@ghost~5a09ec4ed73408ce4f7e6c27
Hello there!
Jesse Leigh Patsolic
@MrAE

Hey guys, me again. Regarding my previous message (October 4, 2019 5:28 PM): I've gone through some more attempts that don't quite work.

@NicolasHug The blog post helped a bit with my understanding of memory-views, however I still have a few questions: Can a memory-view be initialized with nogil? And no, a struct member cannot be a memory view.

I tried to make my own class but then got yelled at because it's not of type Splitter, so that was a bust.

I augmented the SplitRecord with 2 cpp vectors, but that caused things to go wonky requiring cpp in files that I'm not willing to touch.

I ended up augmenting SplitRecord with 2 Cython vectors with hard-coded length, but then can't seem to initialize a memory-view into them inside of the node_split. I'm pretty much stuck (in my current view of things), because I'm trying to do as little modification as possible, but it seems that in order to accomplish my task I'll have to re-write a big chunk of ensemble methods. I'd have to add an input argument to the node_split method? That doesn't sound like a good idea.

Any ideas? Much appreciated.

Peter Hadlaw
@peterhadlaw

Hi all, I'm trying to help my team reduce creating new code when leveraging existing libraries might get the job done. Does anyone have thoughts on how the following can be accomplished? https://stackoverflow.com/q/58533004/1566074

Basically finding the optimal subgroups for a dataset to then feed into an estimator to reduce noise.

Guillaume Chevalier
@guillaume-chevalier
Hello the scikit-learn community! I'd like to have your thoughts on what I coded. It's a way to do automatic machine learning on scikit-learn pipelines. It allows for handling hyperparameter spaces as well as hyperparameters. Example: https://www.neuraxio.com/en/neuraxle/stable/examples/hyperparams.html#sphx-glr-examples-hyperparams-py
Adrin Jalali
@adrinjalali
any takers on scikit-learn-contrib/imbalanced-learn#616 it's a good first issue.
Guillaume Lemaitre
@glemaitre
a first good issue?
I find it a bit harsh :)
Adrin Jalali
@adrinjalali
lol, I'm just a messenger, Joel tagged it as such :D
Guillaume Lemaitre
@glemaitre
Basically, I was starting to solve the issue yesterday
While making master work with master is easy (just change the import path), the challenging part is making an out-of-date version work with a newer scikit-learn.
In the latter case, we need some try/except ImportError blocks, as you suggested I think
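
The try/except ImportError approach mentioned above might look like the sketch below. The helper name `import_first` is hypothetical, and standard-library modules stand in for the old and new import paths, since the actual scikit-learn paths involved are not named in this conversation.

```python
import importlib


def import_first(candidates):
    """Return the first importable attribute from (module, name) pairs.

    Hypothetical compatibility helper: each pair is tried in order, so
    the newest location can be listed first and older ones after it.
    """
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr_name)
        except (ImportError, AttributeError):
            # This location does not exist in the installed version;
            # fall through to the next candidate.
            continue
    raise ImportError(f"none of {candidates!r} could be imported")


# Stand-in example: the first path does not exist, so the second is used.
sqrt = import_first([("math_new_location_xyz", "sqrt"), ("math", "sqrt")])
```

Keeping the fallback logic in one helper means each moved import needs only one line at the call site, instead of a separate try/except block per import.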
Adrin Jalali
@adrinjalali
Yep. If you're already at it, please leave a comment so that others don't start working on it ;)
Guillaume Lemaitre
@glemaitre
Yep, I just cross-referenced my PR
Guillaume Lemaitre
@glemaitre
For the people joining the MAN-AHL sprint, you can find the instructions to install scikit-learn from source at the following documentation page: https://scikit-learn.org/dev/developers/advanced_installation.html#building-from-source
Guillaume Lemaitre
@glemaitre
In addition, you can find the contributing guide at the following address: https://scikit-learn.org/dev/developers/contributing.html
Finally, if you are searching for an issue to work on, several issues have been tagged specifically for sprints: https://github.com/scikit-learn/scikit-learn/issues?utf8=%E2%9C%93&q=is%3Aopen+is%3Aissue+label%3ASprint
You can set some other tags if you want ("good first issues", etc.). You are also free to search the issue tracker for any issue that you are interested in.