Saulo Martiello Mastelini

Hello everyone, I would like to use stacking and cross-validation with streaming data, but I couldn't find these methods in skmultiflow. Could you help me, please?

Hi @ilhem_salah_twitter. Currently we do not have stacking methods implemented in skmultiflow. The vanilla cross-validation scheme is intended for batch scenarios. I am not familiar with adaptations of it to streaming scenarios.

Hi @jacobmontiel , I have a use case where I am using HDDM_A for drift detection but rather than adding elements one by one to the detector, I want to add them in batches. Is there a way to add batches and check for warning or drifts after that? Any sort of input is appreciated. Thanks!

Hi @ankitk2109, I hope to be able to help you on behalf of @jacobmontiel. skmultiflow's drift detection methods do not support the insertion of elements in (mini) batches. The usual usage is to add each element sequentially and check for drifts (or warnings) after each insertion
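A possible workaround for batches, sketched here under assumptions, is to loop over the batch yourself and query the detector after every element. `StubDetector` and `add_batch` are hypothetical names; the stub only mimics the `add_element`, `detected_warning_zone` and `detected_change` methods that skmultiflow drift detectors expose, so the snippet runs without the library and is not HDDM_A itself.

```python
class StubDetector:
    """Hypothetical stand-in for a drift detector (not HDDM_A itself).

    Signals a warning after 2 consecutive errors and a drift after 3.
    """

    def __init__(self):
        self.streak = 0

    def add_element(self, value):
        # value: 1 means the sample was misclassified, 0 means it was correct
        self.streak = self.streak + 1 if value == 1 else 0

    def detected_warning_zone(self):
        return self.streak == 2

    def detected_change(self):
        return self.streak >= 3


def add_batch(detector, batch):
    """Feed a batch element by element; return warning and drift positions."""
    warning_idx, drift_idx = [], []
    for i, value in enumerate(batch):
        detector.add_element(value)
        if detector.detected_change():
            drift_idx.append(i)
        elif detector.detected_warning_zone():
            warning_idx.append(i)
    return warning_idx, drift_idx


warning_idx, drift_idx = add_batch(StubDetector(), [0, 0, 1, 0, 1, 1, 1, 1, 0])
print(warning_idx, drift_idx)
```

With a real detector, the same `add_batch` helper should work as long as the object offers those three methods. The positions returned are offsets inside the batch.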

Saulo Martiello Mastelini

Hello, I'm just having a bit of trouble running the MultiOutputLearner moving_squares example. When I do the following:
stream = FileStream("https://raw.githubusercontent.com/scikit-multiflow/streaming-datasets/master/moving_squares.csv", 0, 6)
X, y = stream.next_sample(150)
I get X is
array([], shape=(150, 0), dtype=float64)
which means I can't run the rest of the example. Can anyone help?

Hi @Roonspoon, the problem in your example is probably related to the parameters you are passing to FileStream. The moving_squares dataset has two features and one target, but your call sets target_idx=0 and n_targets=6. Please refer to the FileStream documentation for more details.
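To see why such a call can leave no feature columns, here is a simplified illustration. It is not FileStream's actual code: `split_columns` is a hypothetical helper that assumes the targets are `n_targets` consecutive columns starting at `target_idx`, and that every remaining column is a feature.

```python
def split_columns(n_columns, target_idx, n_targets):
    """Return (feature_columns, target_columns) as lists of column indices.

    Assumption for illustration only: targets are n_targets consecutive
    columns starting at target_idx; all other columns are features.
    """
    start = target_idx % n_columns  # lets -1 mean "last column"
    targets = [c for c in range(start, start + n_targets) if c < n_columns]
    features = [c for c in range(n_columns) if c not in targets]
    return features, targets


# Three columns in total (two features + one target), as described above.
print(split_columns(3, target_idx=0, n_targets=6))   # every column becomes a target
print(split_columns(3, target_idx=-1, n_targets=1))  # last column is the target
```

Under these assumptions, the first call leaves an empty feature list, which would match the `shape=(150, 0)` array reported above.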

@jacobmontiel, hi! I would like to ask whether the FIMT-DD method is available in skmultiflow.
Hi, I am working on visualizing concept drift in stream data by calculating the Gini index per streaming window (e.g. 500 examples). It works great with datasets from https://github.com/scikit-multiflow/streaming-datasets or from other sources. But I have a problem with the generators LEDGeneratorDrift and RandomRBFGeneratorDrift, which do not produce any visible concept drift. Are there any settings to enable concept drift, e.g. every 20,000 instances, or what is the default point at which to expect a drift occurrence?
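The windowed computation described in this question could look roughly like the sketch below. It assumes the "Gini index" means the Gini impurity of the class labels in each non-overlapping window; the function names and the toy label stream are illustrative only.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def windowed_gini(labels, window_size):
    """Gini impurity of each consecutive, non-overlapping window."""
    return [
        gini_impurity(labels[start:start + window_size])
        for start in range(0, len(labels), window_size)
    ]


# Toy label stream: balanced classes first, then a shift to a single class.
labels = [0, 1] * 4 + [1] * 8
print(windowed_gini(labels, window_size=8))
```

A label-based measure like this only reacts when the class distribution changes. A drift that alters the feature distribution while keeping the class balance stable would not show up in it.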
Hey, I am trying to do incremental multilabel classification. Initially I have 10 classes, but the number of classes may grow in the near future. I am using MultiLabelBinarizer to convert the classes into binary format and TF-IDF for the features. I have initially trained a ClassifierChain model on 4 classes and now want to continue training with the remaining 6 classes. How can I do this, and how do I build a proper pipeline?
Romain Picard

Hello @jacobmontiel , following our discussion this summer I started integrating scikit-multiflow in maki-nage.

As expected, integrating concept-drift detection as an operator was trivial. I have not tested on real data yet, but the unit tests run well. However, I have one question on ADWIN: in a unit test similar to the sample code from the documentation, 3 changes are detected while I expected only one. I will read the associated paper to understand the algorithm better, but maybe you have a quick explanation for this.

Concerning the other algorithms (classification, regression, anomaly detection), it seems that the simplest and most flexible way to integrate them is to integrate the different evaluation methods: they take scikit-multiflow model objects as input, so any scikit-multiflow algorithm will be directly usable without explicit integration code. However, this will be trickier because the evaluators do a lot of different things. I started with the prequential one, and it does "many" things: it acts as a runner to consume the stream, does the actual prequential operations, maintains a running loss, plots the stream, supports multiple models... All of these together make it not directly usable as a ReactiveX operator. For now I have started implementing a prequential operator that does only the test/train operation. A consequence is that I cannot use the scikit-multiflow code and need to rewrite it. I will continue investigating this and get more familiar with the inner workings of scikit-multiflow, but it is highly probable that the evaluators must be split into smaller parts so that I can reuse them. I think this would also be true for integration with other streaming frameworks.

An example of concept drift detection is here:

and the initial prequential implementation is here:

This is still preliminary work, but I am confident that I can feed scikit-multiflow models with maki-nage data quite easily.
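The "test/train only" operation mentioned above can be sketched as a small generator. This is a minimal illustration, not the maki-nage operator or scikit-multiflow's evaluator: `MajorityClass` is a hypothetical toy model that takes one sample at a time, whereas scikit-multiflow models expect arrays.

```python
class MajorityClass:
    """Hypothetical toy model exposing predict / partial_fit on single samples."""

    def __init__(self):
        self.counts = {}

    def predict(self, x):
        if not self.counts:
            return None  # nothing learned yet
        return max(self.counts, key=self.counts.get)

    def partial_fit(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


def prequential(model, stream):
    """Test-then-train: predict each sample first, then learn from it.

    Yields the running accuracy after every sample.
    """
    correct = 0
    for n, (x, y) in enumerate(stream, start=1):
        correct += int(model.predict(x) == y)  # test
        model.partial_fit(x, y)                # train
        yield correct / n


stream = [((0.1,), 1), ((0.4,), 1), ((0.3,), 0), ((0.9,), 1)]
accuracy = list(prequential(MajorityClass(), stream))
print(accuracy)
```

Keeping the loop as a generator leaves stream consumption, plotting and multi-model handling to the caller, which is the kind of split discussed in the message.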

Mariam Benllarch
Hello @jacobmontiel, is there a mechanism for hyperparameter tuning? I'm working with the EFDT algorithm, and I want to evaluate its performance by testing some values for the tie_threshold, grace_period, and delta parameters.
Hi everyone,
I get this error when using the Hoeffding Adaptive Tree. Any idea?
TypeError: _new_learning_node() got an unexpected keyword argument 'is_active_node'
The traceback points to python3.8/site-packages/skmultiflow/trees/hoeffding_tree.py, line 805, in _deactivate_learning_node:
new_leaf = self._new_learning_node(
Jacob Pfeil
Hello! I have been using XGBoost, but in production, it makes more sense to use an online algorithm. I am trying several different ensemble approaches in scikit-multiflow, but I'm a little overwhelmed with all of the options. I was wondering if anyone could recommend a few recipes to try out on my data.
Greetings to all! I'm new to the channel, got the link on the scikit multiflow page. I'm trying to implement an online learning model to detect anomalies on security logs using multiflow algorithms, but I'm finding performance issues when comparing with similar methods in scikit learn. It gets better when I accumulate enough samples before predicting and partial fitting, but that approach isn't always feasible. Did you have any similar experience? Am I asking too much and shouldn't expect comparable performance?
@benllarch hi, did you find any method for hyperparameter tuning? I am interested in the same issue but can't find a solution yet.
I have one more question: is there any feature selection method in scikit-multiflow?
Hi @jacobmontiel, is there any way to use Hellinger Distance as split criterion with ExtremelyFastDecisionTreeClassifier? (I actually want to use ExtremelyFastDecisionTreeClassifier for imbalanced dataset.)
@jacobmontiel, hi, is there any example of calculating ROC and AUC for HalfSpaceTrees in skmultiflow? I looked at the 'metrics' module, but nothing is specified there.
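If per-sample anomaly scores and true labels have been collected during the run, the AUC can be computed outside the library. The sketch below is a plain-Python version of the rank-based definition; the scores are made-up values, and how to obtain scores from the model is left as an assumption. With scikit-learn installed, `sklearn.metrics.roc_auc_score(y_true, scores)` covers the same computation.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half. O(n_pos * n_neg), fine for small evaluation windows.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8]  # hypothetical anomaly scores
print(roc_auc(y_true, scores))
```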

Hi everyone,

In my project, I'm using scikit-multiflow 0.5.3 with Python 3.6. I tried to read the KDD Cup 99 dataset with FileStream, but because the label column is text-based (it includes "Normal", "Teardrop", etc.), scikit-multiflow gives me the error

"dtype: objectscikit-multiflow only supports numeric data.". I tried setting the label column as target_idx, and also tried declaring that column as categorical, but neither worked. How can I make scikit-multiflow work with this dataset?

The full error looks like this:

 File "<projdir>/site-packages/skmultiflow/data/file_stream.py", line 99, in __init__
  File "<projdir>/site-packages/skmultiflow/data/file_stream.py", line 177, in _prepare_for_use
  File "<projdir>/site-packages/skmultiflow/data/file_stream.py", line 185, in _load_data
    check_data_consistency(raw_data, self.allow_nan)
  File "<projdir>/site-packages/skmultiflow/data/data_stream.py", line 443, in check_data_consistency
ValueError: Non-numeric data found:
 duration                         int64
src_bytes                        int64
dst_bytes                        int64
wrong_fragment                   int64
urgent                           int64
hot                              int64
num_failed_logins                int64
num_compromised                  int64
root_shell                       int64
su_attempted                     int64
num_root                         int64
num_file_creations               int64
num_shells                       int64
num_access_files                 int64
num_outbound_cmds                int64
count                            int64
srv_count                        int64
serror_rate                    float64
srv_serror_rate                float64
rerror_rate                    float64
srv_rerror_rate                float64
same_srv_rate                  float64
diff_srv_rate                  float64
srv_diff_host_rate             float64
dst_host_count                   int64
dst_host_srv_count               int64
dst_host_same_srv_rate         float64
dst_host_diff_srv_rate         float64
dst_host_same_src_port_rate    float64
dst_host_srv_diff_host_rate    float64
dst_host_serror_rate           float64
dst_host_srv_serror_rate       float64
dst_host_rerror_rate           float64
dst_host_srv_rerror_rate       float64
label                           object
dtype: objectscikit-multiflow only supports numeric data.


Hello @ecehansavas, you can use my function read_kdd_data_multilable from https://github.com/Miso-K/DDCW/blob/master/utils/data_preprocesing.py and then:
from skmultiflow.data import DataStream
data, X, y = read_kdd_data_multilable('./data/kddcup.data_10_percent_corrected.csv')
stream = DataStream(X, y)
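Independently of that helper, the underlying idea is to turn the text labels into integer codes before building the stream, since the error says only numeric data is supported. A minimal sketch follows; `encode_labels` and the sample labels are illustrative assumptions, not part of scikit-multiflow. With pandas or scikit-learn at hand, `pandas.factorize` or `sklearn.preprocessing.LabelEncoder` serve the same purpose.

```python
def encode_labels(labels):
    """Map each distinct text label to an integer code, in order of first appearance."""
    mapping = {}
    codes = []
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
        codes.append(mapping[label])
    return codes, mapping


raw_labels = ["normal.", "teardrop.", "normal.", "smurf."]  # illustrative values
codes, mapping = encode_labels(raw_labels)
print(codes)
print(mapping)
```

Keeping `mapping` around makes it possible to translate predictions back to the original label names.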
@jacobmontiel QQ: For Anomaly Detection with HalfSpace Trees, how can we load streaming data point by point (i.e. read one datapoint every minute) and feed that into the model instead of loading batch data from a csv?
Adrien Luxey
Hi people :)
I'm wondering how you make random_state work as expected in the online learning context.
The difference between skmultiflow and sklearn is that you don't reset all parameters every time fit is called, right?
I'm writing my own online classifier, and I'd like to keep their BaseEstimator structure like you do (super practical for cross validation), but I'm struggling with an ensemble estimator (that should use its random_state to fix the base estimators'). Any advice? Thanks, great library :)
Arg. Your HoeffdingTreeClassifier does not have a random_state parameter, for instance. So how do you ensure experiments' reproducibility? Is HoeffdingTreeClassifier deterministic?
Adrien Luxey
AdaptiveRandomForestClassifier will be a better example. I'll dig into the code. Answers still appreciated :)
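One common pattern for this, sketched under assumptions, is to store `random_state` untouched in `__init__` (the scikit-learn convention) and to derive the members' seeds from it once, on the first `partial_fit`, instead of on every call. `ToyEnsemble` is a hypothetical class for illustration and does not reflect how AdaptiveRandomForestClassifier is implemented.

```python
import random


class ToyEnsemble:
    """Hypothetical ensemble: one parent seed fixes every member's seed."""

    def __init__(self, n_estimators=3, random_state=None):
        self.n_estimators = n_estimators
        self.random_state = random_state  # stored as given, never mutated
        self._rng = None
        self.member_seeds = None

    def _init_members(self):
        # Runs once, so incremental training does not restart the random sequence.
        self._rng = random.Random(self.random_state)
        self.member_seeds = [
            self._rng.randrange(2**32) for _ in range(self.n_estimators)
        ]

    def partial_fit(self, x, y):
        if self.member_seeds is None:
            self._init_members()
        # ... train each member with its own seeded generator here ...
        return self


a = ToyEnsemble(random_state=42).partial_fit([0.0], 1).partial_fit([1.0], 0)
b = ToyEnsemble(random_state=42).partial_fit([0.0], 1)
print(a.member_seeds == b.member_seeds)
```

Two ensembles built with the same `random_state` get identical member seeds, regardless of how many times `partial_fit` is called afterwards.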
Hassan Mehmood
In KSWIN, does add_element accept only 0/1 values, or can it be any floating-point value?
Saulo Martiello Mastelini
Hi everyone, as stated on the GitHub page, skmultiflow has merged with creme to become https://riverml.xyz/latest/

So, there's no active development in skmultiflow anymore. We invite all the users to check river :D

It's more than the sum of the two projects

We are waiting for the name river to be made available on pip, so that we can make an official release and statement
but since I now see a lot of messages in this community channel, I think it is worth making an unofficial announcement here
for new users, the interface of river might change a little bit, but we have made a lot of improvements regarding API consistency, model speedups and so on
besides that, river has a lot of extra tools and methods that are not available in the legacy skmultiflow
for instance, preprocessing techniques (e.g., incremental StandardScaler)
Saulo Martiello Mastelini

sorry for my absence on this channel :/

@jacobmontiel and I have been focusing all our time on preparing the merge and polishing things

Welcome to river! If you have any questions, you can use GitHub Discussions to ask them and get feedback
Saulo Martiello Mastelini
We understand that some ongoing projects rely on skmultiflow for their functioning. For that reason, we will keep skmultiflow in its current state (stable release) and might apply bug fixes as needed
Nasrin Eshraghi Ivari
Hi all, can anyone help me, please? I have implemented data stream clustering and used scikit-multiflow to simulate the stream. However, I want a sliding time window model to capture my most recent data, and I do not know how to implement or use one. Does scikit-multiflow support windowing? I could not find anything related.
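A sliding time window can be kept outside the library with the standard library alone. The sketch below is one possible approach under assumptions: `TimeWindow` is a hypothetical helper, timestamps are plain numbers in seconds, and samples arrive in time order. For a count-based window, `collections.deque(maxlen=n)` is enough on its own.

```python
from collections import deque


class TimeWindow:
    """Keep only the samples whose timestamp lies within the last `span` seconds."""

    def __init__(self, span):
        self.span = span
        self.items = deque()  # (timestamp, sample) pairs, oldest first

    def add(self, timestamp, sample):
        self.items.append((timestamp, sample))
        # Evict everything that is `span` seconds old or older.
        while self.items and self.items[0][0] <= timestamp - self.span:
            self.items.popleft()

    def samples(self):
        return [sample for _, sample in self.items]


window = TimeWindow(span=10)
for t, x in [(0, "a"), (4, "b"), (9, "c"), (12, "d"), (21, "e")]:
    window.add(t, x)
print(window.samples())
```

The samples returned by `samples()` can then be passed to the clustering step each time the window is updated.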
Michael Forde
Hey everyone, I have been wanting to do some multivariate forecasting using the HoeffdingTreeRegressor for streamed data. scikit-multiflow seems to suit what I want to do very well, but I noticed there isn't built-in forecasting support. Is there any workaround I could try to achieve multi-step forecasting with this library?
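One generic workaround is recursive forecasting: train a one-step-ahead regressor on lagged values, then feed each prediction back in as the newest lag. The sketch below shows only that loop for a single series; `recursive_forecast` and `toy_model` are hypothetical, and in practice `predict_one` would wrap a regressor's `predict` call on the lag vector.

```python
def recursive_forecast(predict_one, history, n_lags, horizon):
    """Multi-step forecast by feeding each prediction back as the newest lag.

    predict_one: callable mapping a list of the last n_lags values to the next value.
    """
    lags = list(history[-n_lags:])
    forecast = []
    for _ in range(horizon):
        y_next = predict_one(lags)
        forecast.append(y_next)
        lags = lags[1:] + [y_next]  # slide the lag window forward
    return forecast


def toy_model(lags):
    # Hypothetical one-step model: continue the last observed difference.
    return lags[-1] + (lags[-1] - lags[-2])


print(recursive_forecast(toy_model, history=[1, 2, 4, 6], n_lags=2, horizon=3))
```

Errors compound over the horizon with this scheme, so the later steps are usually less reliable than the first one.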
Wannabe Maker
Hi all, is there any way to install scikit-multiflow on the new M1 Macs? I use PyCharm as my IDE, and every time I try to install scikit-multiflow I get an error because it tries to install for the x64 architecture. Has anyone had a similar problem?
Hi @jacobmontiel, how can concept drift detectors (for example ADWIN, DDM, KSWIN) be used in regression problems? Note: I don't want to use the error.
from skmultiflow.data import AnomalySineGenerator
from skmultiflow.anomaly_detection import HalfSpaceTrees
import numpy as np
import pandas as pd

stream = AnomalySineGenerator(random_state=42, n_samples=10000, n_anomalies=250)
hs_tree = HalfSpaceTrees(n_estimators=10, depth=8)
true_positive = 0
anomalies = 0
predictions = []
y_test = []
max_samples = 10000
n_samples = 0

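while n_samples < max_samples and stream.has_more_samples():
    X, y = stream.next_sample()
    y_pred = hs_tree.predict(X)
    # Record labels and predictions so classification_report has data to score
    y_test.append(y[0])
    predictions.append(y_pred[0])
    if y[0] == 1.0:
        # true_positive counts the real anomalies seen so far
        true_positive += 1
        if y_pred[0] == 1.0:
            # anomalies counts the real anomalies that were also predicted
            anomalies += 1

    hs_tree.partial_fit(X, y)
    n_samples += 1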

print('The data has {} anomalies'.format(true_positive))
print('Half Space Trees predicted {} anomalies'.format(anomalies))
Running this code produces some interesting output: it predicts 0s for some time and then predicts 1s later on.
Below is the classification report, whose results are quite poor.
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))
Hassan Mehmood
@Marilia-Nayara, could you please elaborate on your problem a little?
@jacobmontiel @all Could anyone please tell me how to use EvaluatePrequentialDelayed with the Extremely Fast Decision Tree? I want to use the incremental learning part of the Extremely Fast Decision Tree while doing delayed evaluation. Please help!
Saulo Martiello Mastelini
Hi everyone, just as a reminder, there is no active development in skmultiflow anymore. Skmultiflow and Creme have merged to become River. Users can now also install River via pip
We encourage skmultiflow users to make the leap to River. Feel free to open a discussion with your question or to ask for any assistance
Jacob Montiel and I are both maintainers of River too
Saulo Martiello Mastelini
I'll talk with Jacob about the possibility of creating a quick guide, maybe something like: "from skmultiflow to river"
How do I save the results after the test to a file, instead of saving the evaluation as a file?
@smastelini @jacobmontiel @all
I am doing multi-label classification. I have a total of 14 labels. Where do I set that?
Hello @jacobmontiel @smastelini, I'm new to scikit-multiflow. I would like to train my classification model on one data stream (training data) and make predictions on another data stream (test data). Could you please help me with a short piece of code that demonstrates this? Thank you.