tlfields
@tlfields
@jacobmontiel .... I think I got it to work... I am so HAPPY!!
Jacob
@jacobmontiel

@jacobmontiel .... I think I got it to work... I am so HAPPY!!

Glad to hear that

Just as a comment: be careful when testing different drift detectors. For example, DDM and EDDM expect the input data (error) encoded in the opposite way to ADWIN.
tlfields
@tlfields
@jacobmontiel when I ran PageHinkley on agr_a_20k.csv, it picked up only two drifts, one at index 5165 and the other at 15408. It did not pick up any drift in the 1000-1100 range. ADWIN picked up 5 drifts between indices 5535-5855, 9 drifts in the range 10463-11007, and 8 in the range 15679-17407. So that tells me that, for this particular dataset, ADWIN was a bit more sensitive to the drift. For some reason I did not get the input data error that you mentioned.
tlfields
@tlfields
@jacobmontiel, in reference to agr_a_20k.csv, did you add the drifts at certain points or were they already there? I am trying to figure out how to find the actual points where drift was inserted. Thank you
Jacob
@jacobmontiel
Those are 3 synthetic abrupt drifts, every 7500 samples
tlfields
@tlfields
@jacobmontiel thank you Sir!
Santhosh Sahini
@santoshsahini19
Hi, when I run “from skmultiflow.data import AnomalySineGenerator”, an error pops up saying: cannot import name 'AnomalySineGenerator' from 'skmultiflow.data'. Is there any way I can resolve this issue? Thanks!
Jacob
@jacobmontiel
AnomalySineGenerator is only available in the development version. You must install it from GitHub
$ pip install -U git+https://github.com/scikit-multiflow/scikit-multiflow
Santhosh Sahini
@santoshsahini19
Thank you so much @jacobmontiel
Jacob
@jacobmontiel

Those are 3 synthetic abrupt drifts, every 7500 samples

@tlfields , I was reviewing this and drifts are actually placed every 5000 samples

sorry about that
tlfields
@tlfields
@jacobmontiel thank you for clarifying, when I looked at your demo, it did say 5, 10 and 15K. Sir, can you please help me to understand how to add different amounts (magnitudes) of drift to the agr_a_20 ? Additionally, how would I add it so that it is considered gradual drift rather than abrupt drift? Has there been a guide created to do this?
Juan Cardona
@Juancard

Hi everyone, i have a problem related with plot_show. I set plot_show = True but it doesn't work. What should i do? Do you have any idea?

I have the same issue in Codelab, and could not solve it using %matplotlib inline nor %matplotlib notebook. Anyone tried this in codelab before?

Juan Cardona
@Juancard
# -*- coding: utf-8 -*-

!pip install scikit-multiflow

%matplotlib inline
from skmultiflow.data import WaveformGenerator
from skmultiflow.trees import HoeffdingTree
from skmultiflow.evaluation import EvaluatePrequential

# 1. Create a stream
stream = WaveformGenerator()
stream.prepare_for_use()

# 2. Instantiate the HoeffdingTree classifier
ht = HoeffdingTree()

# 3. Setup the evaluator
evaluator = EvaluatePrequential(show_plot=True,
                                pretrain_size=200,
                                max_samples=20000)

# 4. Run evaluation
evaluator.evaluate(stream=stream, model=ht)
Santhosh Sahini
@santoshsahini19
Hi everyone, does anyone have an idea of how the data (in a .csv file) should be structured when using the 'skmultiflow.data.file_stream' module to run HSTrees for anomaly detection? Thanks!
Saulo Martiello Mastelini
@smastelini

@jacobmontiel thank you for clarifying, when I looked at your demo, it did say 5, 10 and 15K. Sir, can you please help me to understand how to add different amounts (magnitudes) of drift to the agr_a_20 ? Additionally, how would I add it so that it is considered gradual drift rather than abrupt drift? Has there been a guide created to do this?

Hi @tlfields, I think I can help you with that. The data you mention were most probably generated by using the AGRAWALGenerator. This generator supports 10 different generation functions. The most traditional way of adding (abrupt) drifts to this synthetic learning problem is to call its generate_drift method. It will change the generation function randomly. So, you can create abrupt drifts anytime.
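In case it helps to see the idea, here is a minimal sketch of "switch the generation function at a chosen point". It uses a hypothetical toy generator (ToyDriftGenerator, with one feature and three made-up labelling functions), not AGRAWALGenerator itself, and the drift positions are just example values.

```python
import random


class ToyDriftGenerator:
    """Hypothetical toy generator: one feature, several labelling functions."""

    # Made-up labelling functions standing in for the generator's concepts.
    FUNCTIONS = [
        lambda x: int(x > 0.5),
        lambda x: int(x <= 0.5),
        lambda x: int(0.25 < x < 0.75),
    ]

    def __init__(self, function_idx=0, seed=1):
        self.function_idx = function_idx
        self._rng = random.Random(seed)

    def next_sample(self):
        """Return one (feature, label) pair under the current concept."""
        x = self._rng.random()
        return x, self.FUNCTIONS[self.function_idx](x)

    def generate_drift(self):
        """Switch to a different labelling function, chosen at random."""
        choices = [i for i in range(len(self.FUNCTIONS)) if i != self.function_idx]
        self.function_idx = self._rng.choice(choices)


gen = ToyDriftGenerator()
stream = []
for t in range(15000):
    if t in (5000, 10000):  # abrupt drifts at example positions
        gen.generate_drift()
    stream.append(gen.next_sample())
```

Because the labelling function changes from one sample to the next, the drift is abrupt; nothing is blended between the old and the new concept.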

Regarding your second question, the ConceptDriftStream class was designed to merge two concepts gradually and I think it would fit your needs perfectly :) Please check the documentation page of ConceptDriftStream for more details. Luckily, the default behaviour of this class is to merge two different concepts of AGRAWALGenerator.

Saulo Martiello Mastelini
@smastelini

Hi everyone, does anyone have an idea of how the data (in a .csv file) should be structured when using the 'skmultiflow.data.file_stream' module to run HSTrees for anomaly detection? Thanks!

HSTrees do not require labelled data to work. The anomaly labels are primarily used to evaluate performance. If you have labelled anomaly data, you can use the FileStream class to load your stream. You only have to indicate which column has the labels, which can be done by changing the target_idx parameter. By default FileStream assumes the labels are in the last column.
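To make the layout concrete, here is a small standard-library sketch of such a file: a header row, numeric feature columns, and the label in the last column. The column names, the values and the split_features_and_target helper are made up for illustration, and FileStream itself is not used here.

```python
import csv
import io

# Hypothetical data: two numeric features, anomaly label (0/1) in the last column.
CSV_TEXT = """feat_1,feat_2,is_anomaly
0.10,1.20,0
0.12,1.18,0
5.70,9.90,1
"""


def split_features_and_target(csv_text, target_idx=-1):
    """Split rows into features X and labels y; target_idx=-1 means last column."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    idx = target_idx % len(header)  # normalise negative indices
    X = [[float(v) for i, v in enumerate(row) if i != idx] for row in data]
    y = [int(row[idx]) for row in data]
    return X, y


X, y = split_features_and_target(CSV_TEXT)
```

If the labels sat in another column, the only thing that would change in this sketch is the target_idx value.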

Jacob
@jacobmontiel

Regarding your second question, the ConceptDriftStream class was designed to merge two concepts gradually and I think it would fit your needs perfectly :)

@tlfields just for extra information: in ConceptDriftStream you can define the position of the drift and the type of transition, either an abrupt change or a gradual change over a number of samples.
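As a rough illustration of what "position" and "width" mean for such a transition, here is a pure-Python sketch. It assumes a sigmoid mixing rule where the probability of drawing a sample from the new concept is 1 / (1 + exp(-4 (t - position) / width)). The function names are hypothetical and this is not the library's code.

```python
import math
import random


def new_concept_probability(t, position, width):
    """Probability that sample t comes from the new concept (sigmoid rule)."""
    x = -4.0 * (t - position) / width
    # Guard against overflow in math.exp for samples far before the drift.
    if x > 700:
        return 0.0
    return 1.0 / (1.0 + math.exp(x))


def choose_concepts(n_samples, position, width, seed=1):
    """Return a list of 'old'/'new' tags, one per sample, drawn at random."""
    rng = random.Random(seed)
    return [
        "new" if rng.random() < new_concept_probability(t, position, width) else "old"
        for t in range(n_samples)
    ]


# Gradual drift centred at sample 5000, spread over roughly 1000 samples.
gradual = choose_concepts(n_samples=10000, position=5000, width=1000)
# A width of 1 makes the change practically abrupt.
abrupt = choose_concepts(n_samples=10000, position=5000, width=1)
```

With a large width the two concepts are interleaved around the drift position, while a width of 1 switches concepts almost at once.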

Jacob
@jacobmontiel

I have the same issue in Codelab, and could not solve it using %matplotlib inline nor %matplotlib notebook. Anyone tried this in codelab before?

@Juancard do you mean Google’s Colab? If that is the case, that is a limitation of Colab itself: it does not support matplotlib’s dynamic plots.

Jacob
@jacobmontiel

scikit-multiflow v0.5.3 is now available!

This release includes multiple features, improvements and bug fixes. Please refer to the changelog entry for a detailed list of changes.

Summary of changes

New features

Delayed labels for supervised learning

  • Add support for delayed labels in streams and evaluations. Two new methods are available for this purpose:

Regression

  • Adaptive Random Forest Regressor
    Note: This implementation is slightly different from the original algorithm. The Hoeffding Tree Regressor is used as
    the base learner, instead of the FIMT-DD. It also adds a new strategy to monitor the incoming data and check for concept
    drifts. For more information, see the notes in the documentation.
  • KNN implementation for regression.

Classification

Drift detection

  • HDDM_A, a drift detection method based on the Hoeffding’s bounds with moving average-test.
  • HDDM_W, a drift detection method based on the Hoeffding’s bounds with moving weighted-average-test.
  • KSWIN (Kolmogorov-Smirnov Windowing) concept drift detector.

Data generation

Transformers

Efficiency and enhancements

  • Improve efficiency for metrics calculations (classification).
  • Add support for multi-class metrics: precision, recall, F1-score, and G-mean.
  • Reduce substantially the size of the package.
  • Enhance handling of categorical attributes in Hoeffding Trees.
  • Add bootstrap option in the Hoeffding Adaptive Tree classifier.

Bug fixes and API changes

This release includes a set of bug fixes and API changes. The most relevant API change is the renaming of multiple
methods following a more consistent and informative naming convention. The full list of bug fixes and details of
API changes is available in the changelog entry.

Jacob
@jacobmontiel

Python version

  • Add support for Python 3.8.
  • This is the last version that supports Python 3.5 as it is reaching its end-of-life date (2020-09-13).

Patch release note

New features and improvements were introduced in version 0.5.0. Version 0.5.3 only includes a fix for an issue that triggered an error in the conda-forge distribution files for Windows.
Kush Varma
@kushvarma
Hello, I was working on porting ARF with resampling from MOA to scikit-multiflow. The implementation requires classValue. In MOA, it comes from a Java package, com.yahoo.labs.samoa.instances. I also see that the implementation of instances is different from Java. Is there any function that calculates the class value in scikit-multiflow? It is needed for the implementation.
Adrien Luxey
@Adrien-Luxey
Hi! I'm just trying your glorious software. I ran EFDT and VFDT (HoeffdingTreeClassifier) on my own data (multi-label, nominal) using EvaluatePrequential. I'm surprised to see that VFDT has a kappa of 0, and only one leaf even after 9000 training samples. Since VFDT and EFDT basically have the same parameters, I'm surprised that only VFDT does not grow... Can you provide guidance please?
Thanks for your framework anyway, you're making me save hundreds of hours of development :)
Adrien Luxey
@Adrien-Luxey
Note that my data is very unbalanced in favor of class '7', which is the value of the root leaf of the VFDT. I am trying out other settings (increasing the split_confidence notably) with no luck.
Adrien Luxey
@Adrien-Luxey
It works with gini coefficient. I'll investigate myself. Your comments are still welcome!
Joseph Lucas
@JosephTLucas
Great ScipyConf talk, Jacob. Exploring the library today. I especially appreciate the feature map.
Joseph Lucas
@JosephTLucas
When creating a stream from a file, does it need to have a "class column"? I want to do unsupervised clustering/anomaly detection (so I don't have that column). After initializing a FileStream, stream.has_more_samples() immediately returns False.
Bennett Lambert
@lambertsbennett
I just watched the talk at the SciPy conference. I think this is a really cool project and would love to contribute in any way possible. I'd rate myself an intermediate python programmer and do applied machine learning research. In the talk we were directed here to connect with ways that we could help push this project forward!
Jacob
@jacobmontiel

Hello, I was working on porting ARF with resampling from MOA to scikit-multiflow. The implementation requires classValue. In MOA, it comes from a Java package, com.yahoo.labs.samoa.instances. I also see that the implementation of instances is different from Java. Is there any function that calculates the class value in scikit-multiflow? It is needed for the implementation.

I am not sure I follow the question.

Jacob
@jacobmontiel

Note that my data is very unbalanced in favor of class '7', which is the value of the root leaf of the VFDT. I am trying out other settings (increasing the split_confidence notably) with no luck.

Sorry for the delay @Adrien-Luxey. If your data is highly unbalanced then it is likely that the (root) node hasn’t seen enough information to split. You can play around with the parameters, but this is likely going to need some preprocessing of the data, as the Hoeffding Tree (like many algorithms) is not designed to handle imbalanced data. On the other hand, EFDT is rather lax in terms of splits, as its goal is to grow the tree fast.
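For context, a Hoeffding Tree splits a leaf only when the gap between the two best split candidates exceeds the Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), or when epsilon falls below a tie threshold. The sketch below just evaluates that formula. The ranges used for the two criteria (R = 1 for Gini, R = log2 of the number of classes for information gain), the delta value and the 8-class setting are assumptions for illustration, not values checked against your data.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound for n observations with confidence 1 - delta."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))


n_samples = 9000
delta = 1e-7  # assumed split confidence

eps_gini = hoeffding_bound(1.0, delta, n_samples)
eps_info_gain = hoeffding_bound(math.log2(8), delta, n_samples)  # assumed 8 classes

print(f"gini: {eps_gini:.4f}, info gain: {eps_info_gain:.4f}")
```

A larger range gives a larger bound, so the same number of samples may be enough to split under one criterion and not under the other. That might be related to the Gini observation, but it is only a guess.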

Great ScipyConf talk, Jacob. Exploring the library today. I especially appreciate the feature map.

Thanks, and glad to hear the feature map is useful

Jacob
@jacobmontiel

When creating a stream from a file, does it need to have a "class column"? I want to do unsupervised clustering/anomaly detection (so I don't have that column). After initializing a FileStream, stream.has_more_samples() immediately returns False.

This is a soft requirement that we need to improve. The class column is not strictly required since it won’t be used for training, although it can be used for validation. In this case you can extend your data with a dummy column, for example in pandas, and then use DataStream to load it. On the other hand, it is strange that you are getting False when calling stream.has_more_samples(). Are you getting any warning when calling FileStream? Can you provide a MWE to see what could be happening?

Kush Varma
@kushvarma
@jacobmontiel sorry for not being clear. I understand ARF is already implemented, but I am working on implementing ARF_RE (ARF with Resampling, from this paper https://www.researchgate.net/publication/336152223_Adaptive_Random_Forests_with_Resampling_for_Imbalanced_data_Streams). I contacted Mr. Heitor Gomes, as he is the original author of ARF and ARF_RE. Normally in MOA, an Instance object instanc can provide the class value by calling instanc.classValue(). I have the MOA code here https://github.com/kushvarma/moa/blob/arf_re/moa/src/main/java/moa/classifiers/meta/AdaptiveRandomForestRE.java and lines 358 to 383 are the main logic behind ARF_RE. There, classValue is implemented in com.yahoo.labs.samoa.instances, so I was looking for a similar function in scikit-multiflow. This is my implementation in scikit-multiflow: https://github.com/kushvarma/scikit-multiflow/blob/dm_arf/src/skmultiflow/meta/adaptive_random_forest_re.py lines 530 to 562.
The implementation is incomplete as I am still working on it.
Jacob
@jacobmontiel
I looked into the current state of your implementation. My suggestion is to extend the existing AdaptiveRandomForestClassifier class. This way you can use the existing code and you just need to implement the specific details for this variant of the original method.
My previous message got lost:
Now I understand what you are doing. You can pass the class values via the classes parameter in the partial_fit method. If you are using a stream object then you can access the class values in the corresponding property stream.target_values. Notice that the user is not required to use a stream object and can manually set the list of class values to pass.
Bennett Lambert
@lambertsbennett

Just a note the "Community" hyperlink on the landing README page is broken.

Open source
Distributed under the BSD 3-Clause, scikit-multiflow is developed and maintained by an active, diverse and growing community.

tlfields
@tlfields
@jacobmontiel does scikit-multiflow support drift detection on a spam dataset? I would like to experiment with comparing naive bayes, svm, and HT on a spam dataset then add drift and compare the EDDM, DDM and ADWIN detection methods. Is this something that I can do with multiflow? Thank you for your time
Juan Cardona
@Juancard
Hi @jacobmontiel, I have an issue that I think is a bug, related to training a model with a multi-label dataset. The method EvaluatePrequential is forcing me to pre-train the model with at least one sample for every label, otherwise it throws an error. The error is:
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/_stochastic_gradient.py in _partial_fit(self, X, y, alpha, C, loss, learning_rate, max_iter, classes, sample_weight, coef_init, intercept_init)(...)

ValueError: The number of classes has to be greater than one; got 1 class
Juan Cardona
@Juancard
And here is a code sample
from skmultilearn.dataset import load_dataset
from sklearn.linear_model import Perceptron
from skmultiflow.metrics import hamming_score, exact_match, j_index
from skmultiflow.meta.multi_output_learner import MultiOutputLearner
from skmultiflow.data.data_stream import DataStream
from skmultiflow.evaluation.evaluate_prequential import EvaluatePrequential

X_stream, y_stream, feature_names, label_names = load_dataset('enron', 'undivided')
stream_original = DataStream(data=X_stream.todense(), y=y_stream.todense(), name="enron")
pretrain_samples = round(stream_original.n_remaining_samples() * 0.1)
classifier_br = MultiOutputLearner(
    Perceptron()
)
evaluator = EvaluatePrequential(
    show_plot=True, 
    pretrain_size=pretrain_samples, 
    metrics=["exact_match", "hamming_score", "hamming_loss", "running_time", "model_size"],
)
evaluator.evaluate(stream=stream_original, model=classifier_br)
asad1907
@asad1907
Hi, I am getting an error while I am trying to import some regressors and classifiers like below
image.png
How can I import them?
Saulo Martiello Mastelini
@smastelini
Hi @asad1907, could you please tell us which version of skmultiflow you are using?
asad1907
@asad1907
Hi @smastelini, I am using version 0.4.1. How can I download the new version?