asad1907
@asad1907
@smastelini thanks a lot sir for your help. I solved it :)
barnettjv
@barnettjv
@automater0 I'm guessing the Kappa T stands for temporal. Bifet refers to it as Kper. see pg. 91 Bifet, A., Gavaldá, R., Holmes, G., & Pfahringer, B. (2017). Machine learning for data streams: with practical examples in MOA (Adaptive computation and machine learning series). MIT Press.
Jacob
@jacobmontiel

@smastelini thanks a lot sir for your help. I solved it :)

@asad1907 Can you share your solution? Support for dynamic plots in Jupyter Lab has not improved much since its release.

@jacobmontiel I think I figured it out by taking your advice to set up one of the Adaptive Random Forest models as AdaptiveRandomForest(drift_detection_method=None). Thank you.

Glad to help.

@jacobmontiel Hi Jacob, is there a way that I can get access to the actual values predicted per data segment during the evaluations? I have 1 million SEAGen data points and need to perform McNemar's statistical significance test, which requires me to know which labels classifier A got incorrect vs. classifier B, and so on. As such, I need to record the actual values predicted by each classifier.

If you are using an evaluator you can add true_vs_predicted to metrics to get predicted values. In this case you also need to set n_wait=1. As a suggestion, in this case deactivate the plot as n_wait=1 implies a high refresh rate in the plot which is a lot of overhead.
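
With per-sample predictions recorded for both classifiers, the disagreement counts that McNemar's test needs can be tallied directly. A minimal sketch in plain Python, assuming the true labels and both prediction lists are already loaded; the function names and the continuity-corrected form of the statistic are illustrative choices, not part of scikit-multiflow:

```python
def mcnemar_counts(y_true, pred_a, pred_b):
    """Count the two kinds of disagreement between classifiers A and B.

    b: samples A got right and B got wrong
    c: samples A got wrong and B got right
    """
    b = c = 0
    for truth, a, p in zip(y_true, pred_a, pred_b):
        a_ok, b_ok = (a == truth), (p == truth)
        if a_ok and not b_ok:
            b += 1
        elif b_ok and not a_ok:
            c += 1
    return b, c


def mcnemar_statistic(b, c):
    """Chi-squared statistic with continuity correction (1 degree of freedom)."""
    if b + c == 0:
        return 0.0  # the classifiers never disagree
    return (abs(b - c) - 1) ** 2 / (b + c)


# Toy data (hypothetical), only to show the call pattern.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
pred_a = [0, 1, 0, 0, 1, 0, 1, 1]
pred_b = [0, 0, 1, 1, 0, 0, 1, 0]
b, c = mcnemar_counts(y_true, pred_a, pred_b)
statistic = mcnemar_statistic(b, c)
```

The statistic is then compared against a chi-squared distribution with one degree of freedom; samples where both classifiers are right or both are wrong do not enter the test.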

@automater0 I'm guessing the Kappa T stands for temporal. Bifet refers to it as Kper. see pg. 91 Bifet, A., Gavaldá, R., Holmes, G., & Pfahringer, B. (2017). Machine learning for data streams: with practical examples in MOA (Adaptive computation and machine learning series). MIT Press.

That is correct.

Jacob
@jacobmontiel

@jacobmontiel how do we add LSTM and MLP deep learning algorithms to scikit-multiflow?

those are open questions still, since those methods are usually trained on batches

@tlfields scikit-multiflow does not include any implementation (yet). If batch-incremental learning instead of instance-incremental learning is fine for your use case, then you could do something similar to the BatchIncremental model. This is a simple class that shows how to do batch-incremental learning using batch methods from scikit-learn. But you are not restricted to models from that library.
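
The batch-incremental idea can be sketched in a few lines: buffer incoming samples, train a fresh batch model each time the buffer fills, and predict by majority vote over the most recent models. This is a rough illustration only, not the library's BatchIncremental class; `BatchIncrementalSketch`, `MajorityClassModel`, and the parameter names are hypothetical, and the toy model stands in for any batch learner with `fit` and `predict`:

```python
from collections import Counter, deque


class MajorityClassModel:
    """Toy batch learner (hypothetical stand-in for an MLP, LSTM, tree, ...)."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class BatchIncrementalSketch:
    """Train one batch model per full window; keep only the newest models."""

    def __init__(self, make_model, window_size=100, n_estimators=10):
        self.make_model = make_model
        self.window_size = window_size
        self.ensemble = deque(maxlen=n_estimators)  # oldest model is dropped
        self._X, self._y = [], []

    def partial_fit(self, X, y):
        for xi, yi in zip(X, y):
            self._X.append(xi)
            self._y.append(yi)
            if len(self._X) == self.window_size:
                # Window is full: fit a new batch model and reset the buffer.
                self.ensemble.append(self.make_model().fit(self._X, self._y))
                self._X, self._y = [], []
        return self

    def predict(self, X):
        if not self.ensemble:
            return [None for _ in X]  # nothing trained yet
        votes = [model.predict(X) for model in self.ensemble]
        return [Counter(col).most_common(1)[0][0] for col in zip(*votes)]
```

Any batch model could be passed as `make_model`; the trade-off is that predictions only change once per window rather than after every sample.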
tlfields
@tlfields
@jacobmontiel thank you so much for your response
barnettjv
@barnettjv
Jacob, I added 'true_vs_predicted' and set the pretrain size to 50 on a data set of 200, along with n_wait=1, and I am not getting any predicted values.
I'm just getting the Accuracy, which is the only other metric I'm sending.
oh never mind. figured it out :D
asad1907
@asad1907
@barnettjv You can see the true and predicted values in results.csv by using true_vs_predicted in metrics and output_file='results.csv'.
asad1907
@asad1907
@jacobmontiel I have solved that in Jupyter Notebook using %matplotlib notebook. Now I am trying to use it in JupyterLab. If I can, I will gladly share.
Jacob
@jacobmontiel

@jacobmontiel I have solved that in Jupyter Notebook using %matplotlib notebook. Now I am trying to use it in JupyterLab. If I can, I will gladly share.

Thanks for letting us know

tlfields
@tlfields
@jacobmontiel thank you for the video from AnacondaCON. I have watched it several times and I am learning so much from you. I have run the notebook you provided and I have a question about how to use Page-Hinkley or the other drift detectors with "agr_a_20k.csv" instead of the stream dataset. Is this something you can help me with? I want to see which of the detectors pick up the drift specifically in agr_a_20k.csv.
tlfields
@tlfields
@jacobmontiel .... I think I got it to work... I am so HAPPY!!
Jacob
@jacobmontiel

@jacobmontiel .... I think I got it to work... I am so HAPPY!!

Glad to hear that

Just as a comment: be careful when testing different drift detectors. For example, DDM and EDDM expect input data (error) encoded in the opposite way to ADWIN.
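
To make the encoding point concrete, here is a minimal one-sided Page-Hinkley sketch in plain Python. It is a simplified illustration, not scikit-multiflow's implementation; the class name, the `to_errors` helper, and the parameter values are assumptions. The same prediction outcomes raise an alarm when encoded as errors (1 = misclassified) and raise none when encoded as correctness flags, because this variant only watches for an increase in the monitored value:

```python
class PageHinkleySketch:
    """One-sided Page-Hinkley test that signals an increase in the mean."""

    def __init__(self, min_instances=30, delta=0.005, threshold=50.0):
        self.min_instances = min_instances
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # alarm level for the test statistic
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0      # cumulative deviation from the running mean
        self.min_cum = 0.0  # smallest cumulative deviation seen so far

    def update(self, x):
        """Add one value; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return (self.n >= self.min_instances
                and self.cum - self.min_cum > self.threshold)


def to_errors(correct_flags):
    """Flip 1 = correct / 0 = wrong into 1 = error / 0 = no error."""
    return [1 - flag for flag in correct_flags]


def first_alarm(values):
    detector = PageHinkleySketch()
    return next((i for i, v in enumerate(values) if detector.update(v)), None)


# Hypothetical outcome stream: correct for 500 samples, then wrong for 500.
correct = [1] * 500 + [0] * 500
alarm_on_errors = first_alarm(to_errors(correct))  # error rate goes up
alarm_on_correct = first_alarm(correct)            # monitored value goes down
```

Which encoding a given detector expects is best confirmed in that detector's documentation before comparing results across detectors.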
tlfields
@tlfields
@jacobmontiel when I ran PageHinkley on agr_a_20k.csv, it picked up only two drifts, one at index 5165 and the other at 15408. It did not pick up any drift in the 1000-1100 range. ADWIN picked up 5 drifts between indices 5535-5855, 9 drifts in the range 10463-11007, and 8 in the range 15679-17407. So that tells me that, for this particular dataset, ADWIN was a bit more sensitive to the drift. For some reason I did not get the input data error that you mentioned.
tlfields
@tlfields
@jacobmontiel, in reference to agr_a_20k.csv, did you add the drifts at certain points or were they already there? I am trying to figure out how to find the actual points where drift was inserted. Thank you.
Jacob
@jacobmontiel
Those are 3 synthetic abrupt drifts, every 7500 samples
tlfields
@tlfields
@jacobmontiel thank you Sir!
Santhosh Sahini
@santoshsahini19
Hi, When I’m implementing “from skmultiflow.data import AnomalySineGenerator”, there is an error popping up saying cannot import name 'AnomalySineGenerator' from 'skmultiflow.data'. Is there any way where I can resolve this issue. Thanks!
Jacob
@jacobmontiel
AnomalySineGenerator is only available in the development version. You must install it from GitHub
$ pip install -U git+https://github.com/scikit-multiflow/scikit-multiflow
Santhosh Sahini
@santoshsahini19
Thank you so much @jacobmontiel
Jacob
@jacobmontiel

Those are 3 synthetic abrupt drifts, every 7500 samples

@tlfields , I was reviewing this and drifts are actually placed every 5000 samples

sorry about that
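
With drifts at 5000, 10000, and 15000, the detector indices reported earlier in this conversation can be turned into detection delays. A small sketch in plain Python; the function name is hypothetical, the 20000-sample horizon is inferred from the file name, and for ADWIN only the start of each reported range is used:

```python
def detection_delays(true_drifts, alarms, horizon):
    """Map each true drift position to the delay of the first alarm raised
    before the next drift, or None if no alarm falls in that span."""
    span_ends = list(true_drifts[1:]) + [horizon]
    delays = {}
    for start, end in zip(true_drifts, span_ends):
        hits = [alarm for alarm in alarms if start <= alarm < end]
        delays[start] = (min(hits) - start) if hits else None
    return delays


true_drifts = [5000, 10000, 15000]          # drift positions as corrected above
page_hinkley_alarms = [5165, 15408]         # indices reported for Page-Hinkley
adwin_first_alarms = [5535, 10463, 15679]   # start of each reported ADWIN range

ph_delays = detection_delays(true_drifts, page_hinkley_alarms, horizon=20000)
adwin_delays = detection_delays(true_drifts, adwin_first_alarms, horizon=20000)
```

Read this way, Page-Hinkley reacted sooner to the first and third drifts but missed the second one, while ADWIN flagged all three.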
tlfields
@tlfields
@jacobmontiel thank you for clarifying. When I looked at your demo, it did say 5, 10 and 15K. Sir, can you please help me understand how to add different amounts (magnitudes) of drift to the agr_a_20 ? Additionally, how would I add it so that it is considered gradual drift rather than abrupt drift? Has a guide been created for this?
Juan Cardona
@Juancard

Hi everyone, I have a problem related to plot_show. I set plot_show = True but it doesn't work. What should I do? Do you have any idea?

I have the same issue in Codelab, and could not solve it using %matplotlib inline nor %matplotlib notebook. Anyone tried this in codelab before?

Juan Cardona
@Juancard
# -*- coding: utf-8 -*-

!pip install scikit-multiflow

%matplotlib inline
from skmultiflow.data import WaveformGenerator
from skmultiflow.trees import HoeffdingTree
from skmultiflow.evaluation import EvaluatePrequential

# 1. Create a stream
stream = WaveformGenerator()
stream.prepare_for_use()

# 2. Instantiate the HoeffdingTree classifier
ht = HoeffdingTree()

# 3. Setup the evaluator
evaluator = EvaluatePrequential(show_plot=True,
                                pretrain_size=200,
                                max_samples=20000)

# 4. Run evaluation
evaluator.evaluate(stream=stream, model=ht)
Santhosh Sahini
@santoshsahini19
Hi everyone, does anyone have an idea of how the data (in a .csv file) should be structured when using the 'skmultiflow.data.file_stream' module to run HSTrees for anomaly detection? Thanks!
Saulo Martiello Mastelini
@smastelini

@jacobmontiel thank you for clarifying. When I looked at your demo, it did say 5, 10 and 15K. Sir, can you please help me understand how to add different amounts (magnitudes) of drift to the agr_a_20 ? Additionally, how would I add it so that it is considered gradual drift rather than abrupt drift? Has a guide been created for this?

Hi @tlfields, I think I can help you with that. The data you mention were most probably generated by using the AGRAWALGenerator. This generator supports 10 different generation functions. The most traditional way of adding (abrupt) drifts to this synthetic learning problem is to call its generate_drift method. It will change the generation function randomly. So, you can create abrupt drifts anytime.

Regarding your second question, the ConceptDriftStream class was designed to merge two concepts gradually and I think it would fit your needs perfectly :) Please check the documentation page of ConceptDriftStream for more details. Luckily, the default behaviour of this class is to merge two different concepts of AGRAWALGenerator.

Saulo Martiello Mastelini
@smastelini

Hi everyone, does anyone have an idea of how the data (in a .csv file) should be structured when using the 'skmultiflow.data.file_stream' module to run HSTrees for anomaly detection? Thanks!

HSTrees do not require labelled data to work. The anomaly labels are primarily used to evaluate performance. If you have labelled anomaly data, you can use the FileStream class to load your stream. You only have to indicate which column has the labels, which can be done by changing the target_idx parameter. By default FileStream assumes the labels are in the last column.
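
The column layout described here can be illustrated without the library. A minimal sketch using only the standard library, assuming a header row and purely numeric columns; `split_features_target` is a hypothetical helper that mirrors the idea behind target_idx (label column position, last column by default), not FileStream itself:

```python
import csv
import io


def split_features_target(csv_text, target_idx=-1):
    """Split CSV rows into feature vectors and labels.

    target_idx is the position of the label column; -1 means the last one.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)                    # first row holds column names
    target_idx = target_idx % len(header)    # turn -1 into a real position
    X, y = [], []
    for row in reader:
        values = [float(v) for v in row]
        y.append(values.pop(target_idx))     # pull the label out of the row
        X.append(values)                     # what is left are the features
    return X, y


# Hypothetical data: two features followed by a 0/1 anomaly label.
sample = "f1,f2,is_anomaly\n0.1,0.2,0\n5.0,9.0,1\n"
X, y = split_features_target(sample)
```

If the label sat in the first column instead, `target_idx=0` would select it.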

Jacob
@jacobmontiel

Regarding your second question, the ConceptDriftStream class was designed to merge two concepts gradually and I think it would fit your needs perfectly :)

@tlfields just for extra information, in ConceptDriftStream you can define the position of drift, and the type of transition, either abrupt change or gradually changing over a number of samples
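
One common way to realise such a transition is a sigmoid centred on the drift position, where the width controls how many samples the change takes. The sketch below assumes that MOA-style form; the function name is hypothetical, and the exact formula ConceptDriftStream applies should be confirmed in its documentation:

```python
import math


def drift_probability(sample_idx, position, width):
    """Probability that a sample is drawn from the new concept.

    position: index where both concepts are equally likely
    width: number of samples over which the transition takes place
    """
    x = -4.0 * (sample_idx - position) / width
    x = max(min(x, 700.0), -700.0)  # keep math.exp within float range
    return 1.0 / (1.0 + math.exp(x))


# Gradual drift: centred at sample 5000 and spread over 1000 samples.
gradual = [drift_probability(i, position=5000, width=1000)
           for i in (4000, 5000, 6000)]

# Near-abrupt drift: same position, width of a single sample.
abrupt = [drift_probability(i, position=5000, width=1)
          for i in (4999, 5000, 5001)]
```

A wide transition mixes samples from both concepts for a while, whereas a width of 1 switches almost entirely within a couple of samples.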

Jacob
@jacobmontiel

I have the same issue in Codelab, and could not solve it using %matplotlib inline nor %matplotlib notebook. Anyone tried this in codelab before?

@Juancard do you mean Google’s colab? If that is the case, that is a limitation of colab itself, it does not support matplotlib’s dynamic plots.

Jacob
@jacobmontiel

scikit-multiflow v0.5.3 is now available!

This release includes multiple features, improvements and bug fixes. Please refer to the changelog entry for a detailed list of changes.

Summary of changes

New features

Delayed labels for supervised learning

  • Add support for delayed labels in streams and evaluations. Two new methods are available for this purpose:

Regression

  • Adaptive Random Forest Regressor
    Note: This implementation is slightly different from the original algorithm. The Hoeffding Tree Regressor is used as
    the base learner, instead of the FIMT-DD. It also adds a new strategy to monitor the incoming data and check for concept
    drifts. For more information, see the notes in the documentation.
  • KNN implementation for regression.

Classification

Drift detection

  • HDDM_A, a drift detection method based on the Hoeffding’s bounds with moving average-test.
  • HDDM_W, a drift detection method based on the Hoeffding’s bounds with moving weighted-average-test.
  • KSWIN (Kolmogorov-Smirnov Windowing) concept drift detector.

Data generation

Transformers

Efficiency and enhancements

  • Improve efficiency for metrics calculations (classification).
  • Add support for multi-class metrics: precision, recall, F1-score, and G-mean.
  • Reduce substantially the size of the package.
  • Enhance handling of categorical attributes in Hoeffding Trees.
  • Add bootstrap option in the Hoeffding Adaptive Tree classifier.

Bug fixes and API changes

This release includes a set of bug fixes and API changes. The most relevant API change is the renaming of multiple
methods following a more consistent and informative naming convention. The full list of bug fixes and details of
API changes is available in the changelog entry.

Jacob
@jacobmontiel

Python version

  • Add support for Python 3.8.
  • This is the last version that supports Python 3.5 as it is reaching its end-of-life date (2020-09-13).

Patch release note

New features and improvements were introduced in version 0.5.0. Version 0.5.3 only includes a fix for an error triggered in the conda-forge distribution files for Windows.
Kush Varma
@kushvarma
Hello, I was working on porting ARF with resampling from MOA to scikit-multiflow. The implementation requires classValue, which in MOA comes from a Java package, com.yahoo.labs.samoa.instances. I also see that the implementation of an instance is different from Java. Is there any function that calculates the classValue in scikit-multiflow? It is needed for the implementation.
Adrien Luxey
@Adrien-Luxey
Hi! I'm just trying your glorious software. I ran EFDT and VFDT (HoeffdingTreeClassifier) on my own data (multi-label, nominal) using EvaluatePrequential. I'm surprised to see that VFDT has a kappa of 0, and only one leaf even after 9000 training samples. Since VFDT and EFDT basically have the same parameters, I'm surprised that only VFDT does not grow... Can you provide guidance please?
Thanks for your framework anyway, you're saving me hundreds of hours of development :)
Adrien Luxey
@Adrien-Luxey
Note that my data is very unbalanced in favor of class '7', which is the value of the root leaf of the VFDT. I am trying out other settings (increasing the split_confidence notably) with no luck.
Adrien Luxey
@Adrien-Luxey
It works with gini coefficient. I'll investigate myself. Your comments are still welcome!
Joseph Lucas
@JosephTLucas
Great ScipyConf talk, Jacob. Exploring the library today. I especially appreciate the feature map.
Joseph Lucas
@JosephTLucas
When creating a stream from a file, does it need to have a "class column"? I want to do unsupervised clustering/anomaly detection (so I don't have that column). After initializing a FileStream, stream.has_more_samples() immediately returns false.
Bennett Lambert
@lambertsbennett
I just watched the talk at the SciPy conference. I think this is a really cool project and would love to contribute in any way possible. I'd rate myself an intermediate python programmer and do applied machine learning research. In the talk we were directed here to connect with ways that we could help push this project forward!
Jacob
@jacobmontiel

Hello, I was working on porting ARF with resampling from MOA to scikit-multiflow. The implementation requires classValue, which in MOA comes from a Java package, com.yahoo.labs.samoa.instances. I also see that the implementation of an instance is different from Java. Is there any function that calculates the classValue in scikit-multiflow? It is needed for the implementation.

I am not sure I follow the question.

Jacob
@jacobmontiel

Note that my data is very unbalanced in favor of class '7', which is the value of the root leaf of the VFDT. I am trying out other settings (increasing the split_confidence notably) with no luck.

Sorry for the delay, @Adrien-Luxey. If your data is highly unbalanced then it is likely that the (root) node hasn’t seen enough information to split. You can play around with the parameters, but this is likely going to need some preprocessing of the data, as the Hoeffding Tree (like many algorithms) is not designed to handle imbalanced data. On the other hand, EFDT is rather lax in terms of splits, as its goal is to grow the tree fast.
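
To see why a root node may lack the evidence to split, it can help to look at the Hoeffding bound that the gain difference between the two best candidate attributes is compared against. A rough sketch, assuming information gain as the split criterion (so the range is log2 of the number of classes) and ignoring tie-breaking; the class count and confidence value are hypothetical placeholders, and only the 9000 samples come from this conversation:

```python
import math


def hoeffding_bound(value_range, confidence, n_samples):
    """epsilon = sqrt(R^2 * ln(1 / delta) / (2 * n))."""
    return math.sqrt(
        value_range ** 2 * math.log(1.0 / confidence) / (2.0 * n_samples)
    )


def should_split(best_gain, second_best_gain, epsilon):
    """Split only when the best attribute leads by more than epsilon."""
    return (best_gain - second_best_gain) > epsilon


n_classes = 8                        # hypothetical number of classes
value_range = math.log2(n_classes)   # range of information gain
epsilon = hoeffding_bound(value_range, confidence=1e-7, n_samples=9000)
```

When one class dominates, the entropy at the node is low, so every candidate gain (and the gap between the best two) tends to be small and can stay below epsilon for a long time.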