Jacob
@jacobmontiel

scikit-multiflow v0.5.3 is now available!

This release includes multiple features, improvements and bug fixes. Please refer to the changelog entry for a detailed list of changes.

Summary of changes

New features

Delayed labels for supervised learning

  • Add support for delayed labels in streams and evaluations. Two new methods are available for this purpose:

Regression

  • Adaptive Random Forest Regressor
    Note: This implementation is slightly different from the original algorithm. The Hoeffding Tree Regressor is used as
    the base learner, instead of the FIMT-DD. It also adds a new strategy to monitor the incoming data and check for concept
    drifts. For more information, see the notes in the documentation.
  • KNN implementation for regression.

Classification

Drift detection

  • HDDM_A, a drift detection method based on Hoeffding’s bounds with a moving average test.
  • HDDM_W, a drift detection method based on Hoeffding’s bounds with a moving weighted average test.
  • KSWIN (Kolmogorov-Smirnov Windowing) concept drift detector.

Data generation

Transformers

Efficiency and enhancements

  • Improve efficiency for metrics calculations (classification).
  • Add support for multi-class metrics: precision, recall, F1-score, and G-mean.
  • Substantially reduce the size of the package.
  • Enhance handling of categorical attributes in Hoeffding Trees.
  • Add bootstrap option in the Hoeffding Adaptive Tree classifier.

Bug fixes and API changes

This release includes a set of bug fixes and API changes. The most relevant API change is the renaming of multiple
methods following a more consistent and informative naming convention. The full list of bug fixes and details of
API changes is available in the changelog entry.

Jacob
@jacobmontiel

Python version

  • Add support for Python 3.8.
  • This is the last version that supports Python 3.5 as it is reaching its end-of-life date (2020-09-13).

Patch release note

New features and improvements were introduced in version 0.5.0. Version 0.5.3 only includes a fix for an issue that triggered an
error in the conda-forge distribution files for Windows.
Kush Varma
@kushvarma
Hello, I was working on porting ARF with resampling from MOA to scikit-multiflow. The implementation requires classValue, which in MOA comes from a Java package, com.yahoo.labs.samoa.instances. I also see that instances are implemented differently than in Java. Is there any function that calculates the class value in scikit-multiflow? It is needed for the implementation.
Adrien Luxey
@Adrien-Luxey
Hi! I'm just trying your glorious software. I ran EFDT and VFDT (HoeffdingTreeClassifier) on my own data (multi-label, nominal) using EvaluatePrequential. I'm surprised to see that VFDT has a kappa of 0, and only one leaf even after 9000 training samples. Since VFDT and EFDT basically have the same parameters, I'm surprised that only VFDT does not grow... Can you provide guidance please?
Thanks for your framework anyway, you're making me save hundreds of hours of development :)
Adrien Luxey
@Adrien-Luxey
Note that my data is very unbalanced in favor of class '7', which is the value of the root leaf of the VFDT. I am trying out other settings (increasing the split_confidence notably) with no luck.
Adrien Luxey
@Adrien-Luxey
It works with gini coefficient. I'll investigate myself. Your comments are still welcome!
Joseph Lucas
@JosephTLucas
Great ScipyConf talk, Jacob. Exploring the library today. I especially appreciate the feature map.
Joseph Lucas
@JosephTLucas
When creating a stream from a file, does it need to have a "class column"? I want to do unsupervised clustering/anomaly detection (so I don't have that column). After initializing a FileStream, stream.has_more_samples() immediately returns False.
Bennett Lambert
@lambertsbennett
I just watched the talk at the SciPy conference. I think this is a really cool project and would love to contribute in any way possible. I'd rate myself an intermediate python programmer and do applied machine learning research. In the talk we were directed here to connect with ways that we could help push this project forward!
Jacob
@jacobmontiel

Hello, I was working on porting ARF with resampling from MOA to scikit-multiflow. The implementation requires classValue, which in MOA comes from a Java package, com.yahoo.labs.samoa.instances. I also see that instances are implemented differently than in Java. Is there any function that calculates the class value in scikit-multiflow? It is needed for the implementation.

I am not sure I follow the question.

Jacob
@jacobmontiel

Note that my data is very unbalanced in favor of class '7', which is the value of the root leaf of the VFDT. I am trying out other settings (increasing the split_confidence notably) with no luck.

Sorry for the delay, @Adrien-Luxey. If your data is highly unbalanced, then it is likely that the (root) node hasn’t seen enough information to split. You can play around with the parameters, but this will likely need some preprocessing of the data, as the Hoeffding Tree (like many algorithms) is not designed to handle imbalanced data. On the other hand, EFDT is rather lax in terms of splits, as its goal is to grow the tree fast.

Great ScipyConf talk, Jacob. Exploring the library today. I especially appreciate the feature map.

Thanks, and glad to hear the feature map is useful.

Jacob
@jacobmontiel

When creating a stream from a file, does it need to have a "class column"? I want to do unsupervised clustering/anomaly detection (so I don't have that column). After initializing a FileStream, stream.has_more_samples() immediately returns False.

This is a soft requirement that we need to improve. The class column is not a requirement since it won’t be used for training, although it can be used for validation. In this case you can extend your data with a dummy column, for example in pandas, and then use DataStream to load it. On the other hand, it is strange that you are getting False when calling stream.has_more_samples(). Are you getting any warning when calling FileStream? Can you provide an MWE so we can see what could be happening?
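A minimal sketch of the dummy-column workaround described above, assuming pandas is installed. The helper `add_dummy_target`, the column name `dummy_target`, and the sample values are hypothetical and only for illustration; the hand-off to `DataStream` is shown as a comment, so the snippet runs without scikit-multiflow.

```python
import pandas as pd


def add_dummy_target(df: pd.DataFrame, name: str = "dummy_target") -> pd.DataFrame:
    """Return a copy of df with a constant placeholder column appended last."""
    out = df.copy()
    out[name] = 0  # placeholder label; it carries no information
    return out


# Hypothetical unlabeled data with two features.
features = pd.DataFrame({"f1": [0.1, 0.4, 0.9], "f2": [1.0, 0.7, 0.2]})
labeled = add_dummy_target(features)
print(labeled.columns.tolist())

# The extended frame can then be handed to scikit-multiflow, for example:
#   from skmultiflow.data import DataStream
#   stream = DataStream(labeled)  # by default the last column is the target
```

The placeholder is appended as the last column so that the default target position can be used; an unsupervised method would simply ignore the returned y.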

Kush Varma
@kushvarma
@jacobmontiel sorry for not being clear. I understand ARF is already implemented, but I am working on implementing ARF_RE (ARF with Resampling, from this paper: https://www.researchgate.net/publication/336152223_Adaptive_Random_Forests_with_Resampling_for_Imbalanced_data_Streams). I contacted Mr. Heitor Gomes as he is the original author of ARF and ARF_RE. Normally in MOA, an Instance object (instanc in the code) provides the class value through instanc.classValue(). I have the MOA code here: https://github.com/kushvarma/moa/blob/arf_re/moa/src/main/java/moa/classifiers/meta/AdaptiveRandomForestRE.java (lines 358 to 383 contain the main logic behind ARF_RE). There, classValue is implemented in com.yahoo.labs.samoa.instances, so I was looking for an equivalent function in scikit-multiflow. This is my implementation in scikit-multiflow: https://github.com/kushvarma/scikit-multiflow/blob/dm_arf/src/skmultiflow/meta/adaptive_random_forest_re.py (lines 530 to 562).
The implementation is incomplete as I am still working on it.
Jacob
@jacobmontiel
I looked into the current state of your implementation. My suggestion is to extend the existing AdaptiveRandomForestClassifier class. This way you can reuse the existing code and only need to implement the details specific to this variant of the original method.
My previous message got lost:
Now I understand what you are doing. You can pass the class values via the classes parameter in the partial_fit method. If you are using a stream object, you can access the class values through the corresponding property, stream.target_values. Notice that the user is not required to use a stream object and can manually set the list of class values to pass.
Bennett Lambert
@lambertsbennett

Just a note the "Community" hyperlink on the landing README page is broken.

Open source
Distributed under the BSD 3-Clause, scikit-multiflow is developed and maintained by an active, diverse and growing community.

tlfields
@tlfields
@jacobmontiel does scikit-multiflow support drift detection on a spam dataset? I would like to experiment with comparing naive bayes, svm, and HT on a spam dataset then add drift and compare the EDDM, DDM and ADWIN detection methods. Is this something that I can do with multiflow? Thank you for your time
Juan Cardona
@Juancard
Hi @jacobmontiel, I have an issue that I think is a bug, related to training a model with a multi-label dataset. EvaluatePrequential is forcing me to pre-train the model with at least one sample for every label; otherwise it throws an error. The error is:
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/_stochastic_gradient.py in _partial_fit(self, X, y, alpha, C, loss, learning_rate, max_iter, classes, sample_weight, coef_init, intercept_init)(...)

ValueError: The number of classes has to be greater than one; got 1 class
Juan Cardona
@Juancard
And here is a code sample
from skmultilearn.dataset import load_dataset
from sklearn.linear_model import Perceptron
from skmultiflow.metrics import hamming_score, exact_match, j_index
from skmultiflow.meta.multi_output_learner import MultiOutputLearner
from skmultiflow.data.data_stream import DataStream
from skmultiflow.evaluation.evaluate_prequential import EvaluatePrequential

X_stream, y_stream, feature_names, label_names = load_dataset('enron', 'undivided')
stream_original = DataStream(data=X_stream.todense(), y=y_stream.todense(), name="enron")
pretrain_samples = round(stream_original.n_remaining_samples() * 0.1)
classifier_br = MultiOutputLearner(
    Perceptron()
)
evaluator = EvaluatePrequential(
    show_plot=True, 
    pretrain_size=pretrain_samples, 
    metrics=["exact_match", "hamming_score", "hamming_loss", "running_time", "model_size"],
)
evaluator.evaluate(stream=stream_original, model=classifier_br)
asad1907
@asad1907
Hi, I am getting an error while I am trying to import some regressors and classifiers like below
image.png
How can I import them?
Saulo Martiello Mastelini
@smastelini
Hi @asad1907, could you please tell us which version of skmultiflow you are using?
asad1907
@asad1907
Hi @smastelini, I am using version 0.4.1. How can I download the new version?
asad1907
@asad1907
thanks @smastelini. I upgraded to 0.5.3, now it works
madprogramer
@madprogramer
Hi everyone, I'm new here 👋 Really excited to work with stream data
I actually discovered multiflow while looking for something that could help me with a research project
madprogramer
@madprogramer
Namely, I want to be able to process serial input from an Arduino
Looking through skmultiflow.data, I'm not sure if I should be using DataStream or TemporalDataStream. It's audio data I'll be reading so I guess TemporalDataStream might work better?
ilhem salah
@ilhem_salah_twitter
Hello everyone, I would like to use stacking and cross-validation with streaming, but I didn't find those methods in skmultiflow. Could you help me, please?
Ankit Kumar
@ankitk2109
Hi @jacobmontiel , I have a use case where I am using HDDM_A for drift detection but rather than adding elements one by one to the detector, I want to add them in batches. Is there a way to add batches and check for warning or drifts after that? Any sort of input is appreciated. Thanks!
Rosie Beeston
@Roonspoon
Hello, I'm just having a bit of trouble running the MultiOutputLearner moving_squares example. When I do the following:
stream = FileStream("https://raw.githubusercontent.com/scikit-multiflow/streaming-datasets/master/moving_squares.csv", 0, 6)
X, y = stream.next_sample(150)
I get X is
array([], shape=(150, 0), dtype=float64)
which means I can't run the rest of the example. Can anyone help?
Saulo Martiello Mastelini
@smastelini

Hi everyone, I'm new here 👋 Really excited to work with stream data

nice to hear that :D

Looking through skmultiflow.data, I'm not sure if I should be using DataStream or TemporalDataStream. It's audio data I'll be reading so I guess TemporalDataStream might work better?

TemporalDataStream relates each input to a timestamp. It is intended for evaluating scenarios where we expect an arbitrary delay in the label arrival (in the supervised learning setting).

Probably the DataStream class should be enough in your case

Hello everyone, I would like to use stacking and cross-validation with streaming, but I didn't find those methods in skmultiflow. Could you help me, please?

Hi @ilhem_salah_twitter. Currently we do not have stacking methods implemented in skmultiflow. The vanilla cross-validation scheme is intended for batch scenarios. I am not familiar with adaptations of it to streaming scenarios.

Saulo Martiello Mastelini
@smastelini

Hi @jacobmontiel , I have a use case where I am using HDDM_A for drift detection but rather than adding elements one by one to the detector, I want to add them in batches. Is there a way to add batches and check for warning or drifts after that? Any sort of input is appreciated. Thanks!

Hi @ankitk2109, I hope to be able to help you on behalf of @jacobmontiel. skmultiflow's drift detection methods do not support the insertion of elements in (mini) batches. The usual usage is to add each element sequentially and check for drifts (or warnings) after each insertion
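The sequential pattern described above can be wrapped in a small helper that takes a batch and feeds it one element at a time. This is only a sketch: `feed_batch` and `MeanShiftDetector` are hypothetical names, and the stand-in detector is a toy rule, not HDDM_A. It mimics the `add_element`, `detected_warning_zone` and `detected_change` methods that the scikit-multiflow detectors expose, so an `HDDM_A` instance should be usable in its place.

```python
from typing import Iterable, List, Tuple


class MeanShiftDetector:
    """Hypothetical stand-in detector (NOT part of scikit-multiflow).

    It flags a value that lies far from the running mean, and only exists
    so that the loop below can run without the library installed.
    """

    def __init__(self, warning: float = 0.2, drift: float = 0.4, min_samples: int = 5):
        self.warning, self.drift, self.min_samples = warning, drift, min_samples
        self._reset()

    def _reset(self) -> None:
        self.n, self.mean = 0, 0.0
        self._warning = self._change = False

    def add_element(self, value: float) -> None:
        if self._change:  # start over after a reported drift
            self._reset()
        self._warning = self._change = False
        if self.n >= self.min_samples:
            gap = abs(value - self.mean)
            self._change = gap > self.drift
            self._warning = self.warning < gap <= self.drift
        self.n += 1
        self.mean += (value - self.mean) / self.n  # incremental mean update

    def detected_warning_zone(self) -> bool:
        return self._warning

    def detected_change(self) -> bool:
        return self._change


def feed_batch(detector, batch: Iterable[float]) -> Tuple[List[int], List[int]]:
    """Add a batch one element at a time; return (warning, drift) positions."""
    warnings, drifts = [], []
    for i, value in enumerate(batch):
        detector.add_element(value)
        if detector.detected_change():
            drifts.append(i)
        elif detector.detected_warning_zone():
            warnings.append(i)
    return warnings, drifts


batch = [0.0] * 10 + [1.0] * 10
warnings, drifts = feed_batch(MeanShiftDetector(), batch)
print(warnings, drifts)
```

Checking after every insertion keeps the position of each warning or drift inside the batch, which would be lost if the detector were queried only once at the end.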

Hello, I'm just having a bit of trouble running the MultiOutputLearner moving_squares example. When I do the following:
stream = FileStream("https://raw.githubusercontent.com/scikit-multiflow/streaming-datasets/master/moving_squares.csv", 0, 6)
X, y = stream.next_sample(150)
I get X is
array([], shape=(150, 0), dtype=float64)
which means I can't run the rest of the example. Can anyone help?

Hi @Roonspoon, the problem in your example is probably related to the parameters you are passing to FileStream. The moving_squares dataset has two features and one target. In your call target_idx=0 and n_targets=6. Please, refer to the FileStream documentation for more details
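To see why those parameters leave X without columns, here is a simplified sketch of how a `target_idx` / `n_targets` pair can partition the columns of a file. It assumes the targets are the `n_targets` columns starting at `target_idx` and that everything else is a feature; this is an assumption used for illustration, not a copy of the FileStream code. The function `split_columns` and the column names are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_columns(
    header: Sequence[str], target_idx: int = -1, n_targets: int = 1
) -> Tuple[List[str], List[str]]:
    """Split column names into (features, targets).

    Assumption: targets are the n_targets columns starting at target_idx,
    and every remaining column is a feature.
    """
    start = target_idx % len(header)  # allow negative indices such as -1
    targets = list(header[start:start + n_targets])
    features = [name for name in header if name not in targets]
    return features, targets


# Hypothetical header for a file with two features and one target.
header = ["x1", "x2", "label"]

# target_idx=0 with n_targets=6 claims every column as a target.
print(split_columns(header, target_idx=0, n_targets=6))

# The defaults treat only the last column as the target.
print(split_columns(header))
```

Under this reading, the first call leaves no feature columns at all, which matches the empty `shape=(150, 0)` array in the question.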

laoheico3
@laoheico3
@jacobmontiel, hi! I would like to ask whether the FIMT-DD method is available in skmultiflow.
Miso-K
@Miso-K
Hi, I am working on visualizing concept drifts in stream data by calculating the Gini index per streaming window, e.g. 500 examples. It works great with datasets from https://github.com/scikit-multiflow/streaming-datasets or from other sources. But I have a problem with the generators LEDGeneratorDrift or RandomRBFGeneratorDrift, which do not produce any visible concept drift. Are there any settings to enable concept drift, e.g. every 20 000 instances, or what is the default value at which drift is expected to occur?
foodiehack
@foodiehack
Hey, I am trying to do incremental multilabel classification. Initially I have 10 classes, but the number of classes may grow in the near future. I am using a multilabel binarizer to convert the classes into binary format and tf-idf for the features. I initially trained a ClassifierChain model on 4 classes and now want to train it again with the remaining 6 classes. How can I do this, and how should I build a proper pipeline?
Romain Picard
@MainRo

Hello @jacobmontiel , following our discussion this summer I started integrating scikit-multiflow in maki-nage.

As expected, integrating concept-drift detection as an operator was trivial. I have not tested on real data yet, but the unit tests run well. However, I have one question on ADWIN: in a unit test similar to the sample code from the documentation, 3 changes are detected while I expected only one. I will read the associated paper to understand the algorithm better, but maybe you have a quick explanation for this.

Concerning the other algorithms (classification, regression, anomaly detection), it seems that the simplest and most flexible way to integrate them is to integrate the different evaluation methods: they take scikit-multiflow model objects as input, so any algorithm of scikit-multiflow will be directly usable without explicit integration code. However, this will be trickier because the evaluators do a lot of different things. I started with the prequential one and it does "many" things: it acts as a runner to consume the stream, does the actual prequential operations, maintains a running loss, plots the stream, supports multiple models... All these together make it not directly usable as a reactivex operator. For now I have started implementing a prequential operator that does only the test/train operation. A consequence is that I cannot use the scikit-multiflow code and need to rewrite it. I will continue investigating this and get more familiar with the inner workings of scikit-multiflow, but it is highly probable that the evaluators must be split into smaller parts so that I can reuse them. I think that this would also be true for integrating with other streaming frameworks.

An example of concept drift detection is here:
https://github.com/maki-nage/rxsci-multiflow/blob/master/tests/test_detect_drift.py

and the initial prequential implementation is here:
https://github.com/maki-nage/rxsci-multiflow/blob/master/rxsci_multiflow/evaluate/prequential.py

This is still preliminary work, but I am confident that I can feed scikit-multiflow models with maki-nage data quite easily.
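For readers unfamiliar with the test/train operation mentioned above, a minimal plain-Python sketch of a prequential (test-then-train) step follows. It is not the code from rxsci-multiflow or scikit-multiflow. The names `prequential_step`, `prequential_accuracy` and `MajorityClassModel` are hypothetical, and the stand-in model only mimics the `predict` / `partial_fit` methods that scikit-multiflow models expose.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


class MajorityClassModel:
    """Hypothetical stand-in learner exposing predict / partial_fit."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def predict(self, X: Sequence[Sequence[float]]) -> List[int]:
        if not self.counts:
            return [0 for _ in X]  # arbitrary default before any training
        label = self.counts.most_common(1)[0][0]
        return [label for _ in X]

    def partial_fit(self, X, y: Sequence[int], classes=None) -> "MajorityClassModel":
        self.counts.update(y)  # this toy model only counts labels
        return self


def prequential_step(model, X, y, classes=None) -> int:
    """Test on the incoming samples first, then train on them.

    Returns the number of correct predictions for this step.
    """
    y_pred = model.predict(X)
    correct = sum(int(p == t) for p, t in zip(y_pred, y))
    model.partial_fit(X, y, classes=classes)
    return correct


def prequential_accuracy(model, stream: Iterable[Tuple[list, list]], classes=None) -> float:
    """Run prequential_step over (X, y) pairs and return overall accuracy."""
    correct = total = 0
    for X, y in stream:
        correct += prequential_step(model, X, y, classes=classes)
        total += len(y)
    return correct / total if total else 0.0


# Hypothetical stream of single-sample (X, y) pairs.
stream = [([[0.0]], [1]), ([[0.1]], [1]), ([[0.2]], [0]), ([[0.3]], [1])]
print(prequential_accuracy(MajorityClassModel(), stream, classes=[0, 1]))
```

Keeping the step separate from the loop that consumes the stream is the kind of split discussed in the message: the step can be reused by any runner, while plotting and metric bookkeeping stay outside it.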

Mariam Benllarch
@benllarch
Hello @jacobmontiel, is there a mechanism for hyperparameter tuning? I'm working with the EFDT algorithm, and I want to evaluate its performance by testing some values for the tie_threshold, grace_period, and Delta parameters.
farnaz
@farnaz2018
Hi everyone,
I got this error when I am using Hoeffding adaptive tree, any idea?
TypeError: _new_learning_node() got an unexpected keyword argument 'is_active_node'
the base is from python3.8/site-packages/skmultiflow/trees/hoeffding_tree.py", line 805, in _deactivate_learning_node
new_leaf = self._new_learning_node(
Jacob Pfeil
@jpfeil
Hello! I have been using XGBoost, but in production, it makes more sense to use an online algorithm. I am trying several different ensemble approaches in scikit-multiflow, but I'm a little overwhelmed with all of the options. I was wondering if anyone could recommend a few recipes to try out on my data.
Claudio
@claudiobrandy
Greetings to all! I'm new to the channel, got the link on the scikit multiflow page. I'm trying to implement an online learning model to detect anomalies on security logs using multiflow algorithms, but I'm finding performance issues when comparing with similar methods in scikit learn. It gets better when I accumulate enough samples before predicting and partial fitting, but that approach isn't always feasible. Did you have any similar experience? Am I asking too much and shouldn't expect comparable performance?