Jacob
@jacobmontiel

Tree classifiers often give me class probabilities rather than discrete class outputs, and the actual recall (and precision) is of course threshold-dependent. Is there a way to control the threshold for which I am evaluating the recall?

You can get probabilities via the predict_proba method of the HoeffdingTree, however there is currently no support for this in the EvaluatePrequential class. In this case you might want to implement the prequential evaluation loop yourself. Something like this:

# Imports
from skmultiflow.data import SEAGenerator
from skmultiflow.trees import HoeffdingTreeClassifier
# Setting up a data stream
stream = SEAGenerator(random_state=1)
# Setup Hoeffding Tree estimator
ht = HoeffdingTreeClassifier()
# Setup variables to control loop and track performance
n_samples = 0
correct_cnt = 0
max_samples = 200
# Train the estimator with the samples provided by the data stream
while n_samples < max_samples and stream.has_more_samples():
    X, y = stream.next_sample()
    y_pred = ht.predict(X)
    if y[0] == y_pred[0]:
        correct_cnt += 1
    ht = ht.partial_fit(X, y)
    n_samples += 1

The metrics can be calculated using the ClassificationPerformanceEvaluator and WindowClassificationPerformanceEvaluator in the development branch.
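To illustrate the thresholding idea itself (a plain numpy sketch with a made-up probability array and a hypothetical threshold value, not an skmultiflow API): once you have class probabilities from predict_proba, you can apply your own decision threshold before computing recall or precision:

```python
import numpy as np

# proba mimics the output of predict_proba for a binary problem:
# one row per sample, columns are P(class 0) and P(class 1)
proba = np.array([[0.80, 0.20],
                  [0.40, 0.60],
                  [0.55, 0.45]])

threshold = 0.3  # hypothetical decision threshold for the positive class
y_pred = (proba[:, 1] >= threshold).astype(int)
print(y_pred.tolist())  # [0, 1, 1]
```

Sweeping `threshold` over a grid and recomputing recall/precision at each value gives the threshold-dependent curves mentioned in the question.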

nuwangunasekara
@nuwangunasekara


Is there a way to achieve something similar to sklearn.preprocessing.OneHotEncoder() in scikit-multiflow in a streaming setting?

The StreamTransform class can be used as a base class to implement it. There are two main scenarios (with challenges) that I can see:

  1. If you know the number of distinct values in a nominal attribute. Then it should be as simple as mapping each value to the corresponding binary attribute (a dict would help).
  2. If you don't know the distinct values in the nominal attribute. This is more challenging: first, the mapping must be maintained dynamically; second, if a new value appears, the length of the sample will change as a new binary attribute is added. This is complex, as it is not guaranteed that methods will support “emerging” attributes in this fashion.
    I would explore the first scenario first, as the second one seems more like a corner case.
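For the first scenario, a minimal sketch (hypothetical class and method names, not an existing skmultiflow API) of the value-to-binary-attribute mapping could look like:

```python
class StreamingOneHot:
    """One-hot encoder for a nominal attribute whose distinct values
    are known up front (scenario 1 above)."""

    def __init__(self, categories):
        # fixed mapping: category value -> position in the binary vector
        self.index = {c: i for i, c in enumerate(categories)}

    def transform_one(self, value):
        vec = [0] * len(self.index)
        vec[self.index[value]] = 1
        return vec


enc = StreamingOneHot(['red', 'green', 'blue'])
print(enc.transform_one('green'))  # [0, 1, 0]
```

Each incoming sample's nominal value is replaced by the fixed-length binary vector, so downstream estimators always see the same number of attributes.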

Thanks for the tip @jacobmontiel !

Jacob
@jacobmontiel

I am looking for some example code for Anomaly Detection.

Sorry about that, here is an example for HalfSpaceTrees:

# Imports
from skmultiflow.data import SEAGenerator
from skmultiflow.anomaly_detection import HalfSpaceTrees
# Setup a data stream
stream = SEAGenerator(random_state=1)
# Setup Half-Space Trees estimator
half_space_trees = HalfSpaceTrees(random_state=1, n_estimators=5)
# Pre-train the model with one sample
X, y = stream.next_sample()
half_space_trees.partial_fit(X, y)
# Setup variables to control loop and track performance
n_samples = 0
max_samples = 5000
anomaly_cnt = 0
# Train the estimator(s) with the samples provided by the data stream
while n_samples < max_samples and stream.has_more_samples():
    X, y = stream.next_sample()
    y_pred = half_space_trees.predict(X)
    if y_pred[0] == 1:
        anomaly_cnt += 1
    half_space_trees = half_space_trees.partial_fit(X, y)
    n_samples += 1
# Display results
print('{} samples analyzed.'.format(n_samples))
print('Half-Space Trees anomalies detected: {}'.format(anomaly_cnt))
Some comments:
  1. The pre-training step is needed in this case to avoid an error when predicting while the model is still empty.
  2. The SEA generator does not really provide data with actual anomalies; we just use it to show how the detector interacts with a Stream object. You must replace the generator with your actual data.
  3. This example corresponds to the development version, where the parameter n_features has been removed from the signature.
Emanuel Rodrigues
@emanueldosreis_twitter
Some comments:
  1. The pre-training step is needed in this case to avoid an error when predicting while the model is still empty.
  2. The SEA generator does not really provide data with actual anomalies; we just use it to show how the detector interacts with a Stream object. You must replace the generator with your actual data.
  3. This example corresponds to the development version, where the parameter n_features has been removed from the signature.
    Thank you, appreciate it.
I had to add stream.prepare_for_use() and n_features=2 in order to run it, as I am not using the development version.
Emanuel Rodrigues
@emanueldosreis_twitter
I have a dataset that contains mostly 0s and 1s across 300+ features. I am currently using sklearn's Isolation Forest, but I really want to swap to an online model, as it is now impossible to re-train the model in a timely manner. My data is currently stored in a pandas DataFrame; as I said, mostly 0s and 1s, and each observation is a pandas DataFrame shaped as (1, 339). It looks like the training function only allows numpy arrays. The data looks like this: {'x': 0, 'y': 1, 'z': 0 . . . }, where the letters represent the features/columns. Of course, I am not an experienced programmer, and I am wondering how I could fit my dataset into scikit-multiflow. Thank you again.
# Imports
from skmultiflow.data import SEAGenerator
from skmultiflow.anomaly_detection import HalfSpaceTrees
# Setup a data stream
stream = SEAGenerator(random_state=1)
stream.prepare_for_use()
# Setup Half-Space Trees estimator
half_space_trees = HalfSpaceTrees(random_state=1, n_estimators=5, n_features=2)
# Pre-train the model with one sample
X, y = stream.next_sample()
half_space_trees.partial_fit(X, y)
# Setup variables to control loop and track performance
n_samples = 0
max_samples = 5000
anomaly_cnt = 0
# Train the estimator(s) with the samples provided by the data stream
while n_samples < max_samples and stream.has_more_samples():
    X, y = stream.next_sample()
    y_pred = half_space_trees.predict(X)
    if y_pred[0] == 1:
        anomaly_cnt += 1
    half_space_trees = half_space_trees.partial_fit(X, y)
    n_samples += 1
# Display results
print('{} samples analyzed.'.format(n_samples))
print('Half-Space Trees anomalies detected: {}'.format(anomaly_cnt))
Emanuel Rodrigues
@emanueldosreis_twitter
This is the running code for version 0.4.1, currently available via pip. Reference: HalfSpaceTrees example for anomaly detection by @jacobmontiel
Jacob
@jacobmontiel
Hi Emanuel, does the DataFrame contain the ground truth (1 if the sample is an anomaly)?
Emanuel Rodrigues
@emanueldosreis_twitter
Hi Jacob, no it does not. So, the performance would be measured using synthetic data which works most of the time.
Jacob
@jacobmontiel
Here is a workaround to get the data from a DataFrame:
# Imports
from skmultiflow.anomaly_detection import HalfSpaceTrees
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(30, 3), columns=['x', 'y', 'z'])
# Access raw numpy array inside the dataframe
X_array = df.values
# Setup Half-Space Trees estimator
half_space_trees = HalfSpaceTrees(random_state=1, n_estimators=5) #, n_features=2)
# Pre-train the model with one sample
# the sample is a 1D array and we must pass a 2D array, thus np.asarray([X_array[0]])
half_space_trees.partial_fit(np.asarray([X_array[0]]), [0])

anomaly_cnt = 0
# Train the estimator(s) with the samples provided by the data stream
for X in X_array[1:]:
    y_pred = half_space_trees.predict([X])
    if y_pred[0] == 1:
        anomaly_cnt += 1
    half_space_trees = half_space_trees.partial_fit(np.asarray([X]), [0])
# Display results
print('Half-Space Trees anomalies detected: {}'.format(anomaly_cnt))
nuwangunasekara
@nuwangunasekara
Hi guys!
Does anyone know of a way that I could save a Stream into a file in scikit-multiflow?
ideally to a csv file
Jacob
@jacobmontiel
there you go
from skmultiflow.data import SEAGenerator

import pandas as pd
import numpy as np


X, y = SEAGenerator(random_state=12345).next_sample(1000)
df = pd.DataFrame(np.hstack((X, y.reshape(-1,1))),
                  columns=['attr_{}'.format(i) for i in range(X.shape[1])] + ['target'])
df.target = df.target.astype(int)
df.to_csv('stream.csv')
nuwangunasekara
@nuwangunasekara
Cool... thanks @jacobmontiel !
tlfields
@tlfields
Hello @jacobmontiel, can you please help me access random forest using scikit-multiflow? I am trying to compare the performance of Random Forest with and without ADWIN. I see the Adaptive Random Forest is already implemented, but I don't see how to bring in a Random Forest. Please and thank you.
Jacob
@jacobmontiel
RandomForest is the batch version based on decision trees. AdaptiveRandomForest is the stream version based on Hoeffding Trees. AdaptiveRandomForest can be used with or without drift detection. If you want to use AdaptiveRandomForest without drift detection, you must initialize it as AdaptiveRandomForest(drift_detection_method=None).
Emanuel Rodrigues
@emanueldosreis_twitter
Thank you so much @jacobmontiel
asad1907
@asad1907
Hi everyone, I have a problem related to show_plot. I set show_plot=True but it doesn't work. What should I do? Do you have any ideas?
Saulo Martiello Mastelini
@smastelini
Hi @asad1907, could you provide a MWE to help us figure out your problem?
asad1907
@asad1907
@smastelini
from skmultiflow.data.data_stream import DataStream
from skmultiflow.evaluation import EvaluatePrequential
from skmultiflow.trees import HoeffdingTree

stream = DataStream(X_train, y=y_train)
stream.prepare_for_use()

ht = HoeffdingTree()
evaluator = EvaluatePrequential(show_plot=True,
                                pretrain_size=5000,
                                max_samples=20000,
                                metrics=['accuracy', 'running_time', 'model_size'],
                                output_file='results.csv')

evaluator.evaluate(stream=stream, model=ht);
[screenshot]
This is all I got.
Saulo Martiello Mastelini
@smastelini

How many instances does your dataset have?

Did you try decreasing the pretrain_size to, let's say, pretrain_size=200?

asad1907
@asad1907

@smastelini

X_train shape : (20631, 16)
y_train shape : (20631,)

@smastelini yes, I did, but it doesn't change anything
Saulo Martiello Mastelini
@smastelini
Are you using jupyter notebooks? You might need to change your matplotlib backend
asad1907
@asad1907
@smastelini I am using JupyterLab. I tried %matplotlib widget and then I got the following problem:
[screenshot]
Saulo Martiello Mastelini
@smastelini
That's indeed strange. I am assuming that by setting show_plot=False your code runs normally (is that correct?). It seems that your problem is related to the matplotlib backend used in Jupyter. Probably the solution is to set a proper backend for your interactive plot.
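One possible direction (an assumption about the environment, not a confirmed fix for this exact error): the %matplotlib widget backend in JupyterLab is provided by the ipympl package, so if it is missing the interactive plot cannot render:

```shell
# hypothetical fix: install the package that provides the widget backend
pip install ipympl
# then restart the JupyterLab kernel before trying %matplotlib widget again
```

If this does not help, falling back to the default inline backend with show_plot=False and plotting the saved output_file afterwards is a workable alternative.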
tlfields
@tlfields
@jacobmontiel thank you so much. I am very new to scikit-multiflow; could you direct me to tutorials that explain how to compare the performance of algorithms?
tlfields
@tlfields
@jacobmontiel so I am trying to compare the results of a Random Forest with no drift detection and an Adaptive Random Forest
tlfields
@tlfields
@jacobmontiel .. I think I figured it out by taking your advice to set one of the Adaptive Random Forests to AdaptiveRandomForest(drift_detection_method=None). Thank you
barnettjv
@barnettjv
@jacobmontiel Hi Jacob, is there a way that I can get access to the actual values predicted per data segment during the evaluations? I have 1 million SEAGen data points and need to perform McNemar's statistical significance test, which requires me to know which labels classifier A got incorrect vs. classifier B, etc. As such, I need to record the actual values predicted by each classifier.
barnettjv
@barnettjv
@jacobmontiel I'm assuming that I'll need to use the predict(X) function, but honestly I was hoping for a quick solution.
tlfields
@tlfields
@jacobmontiel how do we add LSTM and MLP deep learning algorithms to scikit-multiflow?
asad1907
@asad1907
@smastelini thanks a lot sir for your help. I solved it :)
barnettjv
@barnettjv
@automater0 I'm guessing the Kappa T stands for temporal. Bifet refers to it as Kper. see pg. 91 Bifet, A., Gavaldá, R., Holmes, G., & Pfahringer, B. (2017). Machine learning for data streams: with practical examples in MOA (Adaptive computation and machine learning series). MIT Press.
Jacob
@jacobmontiel

@smastelini thanks a lot sir for your help. I solved it :)

@asad1907 Can you share your solution? Support for dynamic plots in Jupyter Lab has not improved much since its release.

@jacobmontiel .. I think I figured it out by taking your advice to set one of the Adaptive Random Forests to AdaptiveRandomForest(drift_detection_method=None). Thank you

Glad to help.

@jacobmontiel Hi Jacob, is there a way that I can get access to the actual values predicted per data segment during the evaluations? I have 1 million SEAGen data points and need to perform McNemar's statistical significance test, which requires me to know which labels classifier A got incorrect vs. classifier B, etc. As such, I need to record the actual values predicted by each classifier.

If you are using an evaluator, you can add true_vs_predicted to metrics to get predicted values. In this case you also need to set n_wait=1. As a suggestion, deactivate the plot in this case, as n_wait=1 implies a high refresh rate in the plot, which adds a lot of overhead.
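Once both classifiers' predictions are recorded, the McNemar statistic only needs the two discordant counts; a sketch in plain Python (hypothetical function name, with a continuity correction, and toy data for illustration):

```python
def mcnemar_statistic(y_true, pred_a, pred_b):
    # b: samples classifier A got right and classifier B got wrong
    # c: samples classifier A got wrong and classifier B got right
    b = sum(a == t and p != t for t, a, p in zip(y_true, pred_a, pred_b))
    c = sum(a != t and p == t for t, a, p in zip(y_true, pred_a, pred_b))
    if b + c == 0:
        return 0.0
    # continuity-corrected McNemar statistic; compare against a
    # chi-squared distribution with 1 degree of freedom
    return (abs(b - c) - 1) ** 2 / (b + c)


y_true = [0, 0, 0, 0, 1, 1, 1, 1]
pred_a = [0, 0, 0, 1, 1, 1, 1, 1]  # wrong only on sample 3
pred_b = [1, 1, 1, 0, 1, 1, 1, 1]  # wrong on samples 0-2
print(mcnemar_statistic(y_true, pred_a, pred_b))  # 0.25
```

Samples both classifiers get right (or both get wrong) do not enter the statistic, which is why only the per-sample predictions, not the aggregate accuracies, are needed.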

@automater0 I'm guessing the Kappa T stands for temporal. Bifet refers to it as Kper. see pg. 91 Bifet, A., Gavaldá, R., Holmes, G., & Pfahringer, B. (2017). Machine learning for data streams: with practical examples in MOA (Adaptive computation and machine learning series). MIT Press.

That is correct.
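For completeness, Kappa-T normalizes the classifier's accuracy against a no-change classifier that always predicts the previous label; a minimal sketch (hypothetical function name, operating on label lists):

```python
def kappa_t(y_true, y_pred):
    # p0: accuracy of the evaluated classifier
    p0 = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # pe: accuracy of the no-change classifier, which predicts
    # that each label equals the previous true label
    pe = sum(cur == prev for cur, prev in
             zip(y_true[1:], y_true[:-1])) / (len(y_true) - 1)
    return (p0 - pe) / (1 - pe)


print(kappa_t([0, 0, 1, 1, 0, 0], [0, 1, 1, 1, 0, 0]))
```

A value near 1 means the classifier does much better than simply repeating the last label, which is the relevant baseline on streams with temporal dependence.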

Jacob
@jacobmontiel

@jacobmontiel how do we add LSTM and MLP deep learning algorithms to scikit-multiflow?

those are open questions still, since those methods are usually trained on batches

@tlfields scikit-multiflow does not include any implementation (yet). If for your use case batch-incremental instead of instance-incremental learning is fine, then you could do something similar to the BatchIncremental model. This is a simple class that shows how you can do batch-incremental learning using batch methods from scikit-learn, but you are not restricted to models from that library.
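To illustrate the batch-incremental idea (a pure-Python sketch with a toy majority-class learner standing in for a scikit-learn model; the class names are hypothetical and this is not skmultiflow's actual BatchIncremental implementation):

```python
from collections import Counter, deque


class MajorityClassModel:
    # toy stand-in for a scikit-learn batch learner: it simply
    # predicts the most frequent label seen in its training batch
    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label] * len(X)


class BatchIncrementalSketch:
    # buffer incoming samples; every `window_size` samples, train a fresh
    # batch model and keep only the most recent `n_estimators` models
    def __init__(self, window_size=100, n_estimators=10):
        self.window_size = window_size
        self.models = deque(maxlen=n_estimators)
        self.X_buf, self.y_buf = [], []

    def partial_fit(self, X, y):
        self.X_buf.extend(X)
        self.y_buf.extend(y)
        while len(self.X_buf) >= self.window_size:
            batch_X = self.X_buf[:self.window_size]
            batch_y = self.y_buf[:self.window_size]
            self.models.append(MajorityClassModel().fit(batch_X, batch_y))
            del self.X_buf[:self.window_size]
            del self.y_buf[:self.window_size]
        return self

    def predict(self, X):
        if not self.models:
            return [0] * len(X)  # default before any model is trained
        # majority vote across the ensemble members
        votes = [m.predict(X) for m in self.models]
        return [Counter(col).most_common(1)[0][0] for col in zip(*votes)]


model = BatchIncrementalSketch(window_size=2, n_estimators=3)
model.partial_fit([[0], [1], [2], [3], [4], [5]], [1, 1, 1, 1, 0, 0])
print(model.predict([[9]]))  # [1]
```

Swapping MajorityClassModel for any scikit-learn classifier with fit/predict gives the batch-incremental pattern the BatchIncremental model demonstrates; the bounded deque is what makes the ensemble adapt as old batches are discarded.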
tlfields
@tlfields
@jacobmontiel thank you so much for your response
barnettjv
@barnettjv
Jacob, I added 'true_vs_predicted' and set the pretrain size to 50 on a data set of 200, along with n_wait=1, and I'm not getting any predicted values.
I'm just getting the accuracy, which is the only other metric I'm sending.