Emanuel Rodrigues
@emanueldosreis_twitter
I had to add stream.prepare_for_use() and n_features=2 in order to run it, as I am not using the development version.
Emanuel Rodrigues
@emanueldosreis_twitter
I have a dataset that contains mostly 0s and 1s across 300+ features. I am currently using scikit-learn's Isolation Forest but really want to switch to an online model, as it is no longer possible to re-train the model in a timely manner. My data currently sits in a pandas DataFrame (as I said, mostly 0s and 1s), and each observation is a DataFrame of shape (1, 339). It looks like the training function only accepts numpy arrays. The data looks like this: {'x': 0, 'y': 1, 'z': 0, ...}, where the letters represent the features/columns. I am not an experienced programmer, and I am wondering how I could fit my dataset into skmultiflow. Thank you again.
# Imports
from skmultiflow.data import SEAGenerator
from skmultiflow.anomaly_detection import HalfSpaceTrees
# Setup a data stream
stream = SEAGenerator(random_state=1)
stream.prepare_for_use()
# Setup Half-Space Trees estimator
half_space_trees = HalfSpaceTrees(random_state=1, n_estimators=5, n_features=2)
# Pre-train the model with one sample
X, y = stream.next_sample()
half_space_trees.partial_fit(X, y)
# Setup variables to control loop and track performance
n_samples = 0
max_samples = 5000
anomaly_cnt = 0
# Train the estimator(s) with the samples provided by the data stream
while n_samples < max_samples and stream.has_more_samples():
   X, y = stream.next_sample()
   y_pred = half_space_trees.predict(X)
   if y_pred[0] == 1:
       anomaly_cnt += 1
   half_space_trees = half_space_trees.partial_fit(X, y)
   n_samples += 1
# Display results
print('{} samples analyzed.'.format(n_samples))
print('Half-Space Trees anomalies detected: {}'.format(anomaly_cnt))
Emanuel Rodrigues
@emanueldosreis_twitter
This is the working code for version 0.4.1, the version currently installable with pip. Reference: HalfSpaceTrees Example for Anomaly detection by @jacobmontiel
Jacob
@jacobmontiel
Hi Emanuel, does the DataFrame contain the ground truth (1 if the sample is an anomaly)?
Emanuel Rodrigues
@emanueldosreis_twitter
Hi Jacob, no it does not. So, the performance would be measured using synthetic data which works most of the time.
Jacob
@jacobmontiel
Here is a workaround to get the data from a DataFrame
# Imports
from skmultiflow.anomaly_detection import HalfSpaceTrees
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(30, 3), columns=['x', 'y', 'z'])
# Access raw numpy array inside the dataframe
X_array = df.values
# Setup Half-Space Trees estimator
half_space_trees = HalfSpaceTrees(random_state=1, n_estimators=5) #, n_features=2)
# Pre-train the model with one sample
# the sample is a 1D array and we must pass a 2D array, thus np.asarray([X_array[0]])
half_space_trees.partial_fit(np.asarray([X_array[0]]), [0])

anomaly_cnt = 0
# Train the estimator(s) with the samples provided by the data stream
for X in X_array[1:]:
   y_pred = half_space_trees.predict([X])
   if y_pred[0] == 1:
       anomaly_cnt += 1
   half_space_trees = half_space_trees.partial_fit(np.asarray([X]), [0])
# Display results
print('Half-Space Trees anomalies detected: {}'.format(anomaly_cnt))
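Since the observations in the question arrive as dicts such as {'x': 0, 'y': 1, 'z': 0}, the following is a minimal sketch of turning one such dict into the 2D array shape that partial_fit and predict are given in the snippet above. The names FEATURE_NAMES and observation_to_array are hypothetical, and filling missing keys with 0 is an assumption for illustration, not library behaviour.

```python
import numpy as np

# Hypothetical, fixed feature order. In practice this would list all column names.
FEATURE_NAMES = ['x', 'y', 'z']


def observation_to_array(obs, feature_names=FEATURE_NAMES):
    """Convert one dict observation into a 2D array of shape (1, n_features)."""
    # Assumption: a key that is missing from the dict is treated as 0.
    row = [obs.get(name, 0) for name in feature_names]
    return np.asarray([row], dtype=float)


sample = observation_to_array({'x': 0, 'y': 1, 'z': 0})
print(sample.shape)
```

Keeping the feature order fixed matters because the model only sees positions in the array, not column names.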
nuwangunasekara
@nuwangunasekara
Hi guys!
Does anyone know of a way that I could save a Stream into a file in scikit-multiflow?
Ideally to a CSV file.
Jacob
@jacobmontiel
there you go
from skmultiflow.data import SEAGenerator

import pandas as pd
import numpy as np


X, y = SEAGenerator(random_state=12345).next_sample(1000)
df = pd.DataFrame(np.hstack((X, y.reshape(-1,1))),
                  columns=['attr_{}'.format(i) for i in range(X.shape[1])] + ['target'])
df.target = df.target.astype(int)
df.to_csv('stream.csv')
nuwangunasekara
@nuwangunasekara
Cool... thanks @jacobmontiel !
tlfields
@tlfields
Hello @jacobmontiel, can you please help me access Random Forest using scikit-multiflow? I am trying to compare the performance of Random Forest with and without ADWIN. I see that Adaptive Random Forest is already implemented, but I don't see how to bring in a Random Forest. Please and thank you.
Jacob
@jacobmontiel
RandomForest is the batch version based on Decision Trees. AdaptiveRandomForest is the stream version based on Hoeffding Trees. AdaptiveRandomForest can be used with or without the drift detection. If you want to use AdaptiveRandomForest without drift detection you must initialize it as AdaptiveRandomForest(drift_detection_method=None)
Emanuel Rodrigues
@emanueldosreis_twitter
Thank you so much @jacobmontiel
asad1907
@asad1907
Hi everyone, I have a problem with show_plot. I set show_plot=True but it doesn't work. What should I do? Do you have any idea?
Saulo Martiello Mastelini
@smastelini
Hi @asad1907, could you provide a MWE to help us figure out your problem?
asad1907
@asad1907
@smastelini
from skmultiflow.data.data_stream import DataStream
from skmultiflow.evaluation import EvaluatePrequential
from skmultiflow.trees import HoeffdingTree

stream = DataStream(X_train, y=y_train)
stream.prepare_for_use()

ht = HoeffdingTree()
evaluator = EvaluatePrequential(show_plot=True,
                                pretrain_size=5000,
                                max_samples=20000,
                                metrics=['accuracy', 'running_time', 'model_size'],
                                output_file='results.csv')

evaluator.evaluate(stream=stream, model=ht);
image.png
This is all I got.
Saulo Martiello Mastelini
@smastelini

How many instances does your dataset have?

Did you try decreasing the pretrain_size to, let's say, pretrain_size=200?

asad1907
@asad1907

@smastelini

X_train shape : (20631, 16)
y_train shape : (20631,)

@smastelini yes, I did, but nothing changes.
Saulo Martiello Mastelini
@smastelini
Are you using Jupyter notebooks? You might need to change your matplotlib backend.
asad1907
@asad1907
@smastelini I am using JupyterLab. I tried %matplotlib widget and then I got the following problem:
image.png
Saulo Martiello Mastelini
@smastelini
That's indeed strange. I am assuming that with show_plot=False your code runs normally (is that correct?). It seems that your problem is related to the matplotlib backend used in Jupyter. The solution is probably to set a proper backend for your interactive plot.
tlfields
@tlfields
@jacobmontiel thank you so much. I am very new to scikit-multiflow, would you direct me to tutorials that have been compiled to explain how to compare the performance of algorithms?
tlfields
@tlfields
@jacobmontiel so I am trying to see the results of a Random Forest with no drift detection and an Adaptive Random Forest
tlfields
@tlfields
@jacobmontiel I think I figured it out by taking your advice to set up one of the Adaptive Random Forests as AdaptiveRandomForest(drift_detection_method=None). Thank you.
barnettjv
@barnettjv
@jacobmontiel Hi Jacob, is there a way to get access to the actual values predicted per data segment during the evaluations? I have 1 million SEAGen data points and need to perform McNemar's test of statistical significance, which requires me to know which labels classifier A got wrong vs. classifier B, and so on. As such, I need to record the actual values predicted by each classifier.
barnettjv
@barnettjv
@jacobmontiel I'm assuming that I'll need to use the predict(X) function, but honestly I was hoping for a quick solution.
tlfields
@tlfields
@jacobmontiel how do we add LSTM and MLP deep learning algorithms to scikit-multiflow?
asad1907
@asad1907
@smastelini thanks a lot sir for your help. I solved it :)
barnettjv
@barnettjv
@automater0 I'm guessing the T in Kappa T stands for temporal. Bifet refers to it as Kper. See p. 91 of Bifet, A., Gavaldà, R., Holmes, G., & Pfahringer, B. (2017). Machine learning for data streams: with practical examples in MOA (Adaptive computation and machine learning series). MIT Press.
Jacob
@jacobmontiel

@smastelini thanks a lot sir for your help. I solved it :)

@asad1907 Can you share your solution? Support for dynamic plots in Jupyter Lab has not improved much since its release.

@jacobmontiel I think I figured it out by taking your advice to set up one of the Adaptive Random Forests as AdaptiveRandomForest(drift_detection_method=None). Thank you.

Glad to help.

@jacobmontiel Hi Jacob, is there a way to get access to the actual values predicted per data segment during the evaluations? I have 1 million SEAGen data points and need to perform McNemar's test of statistical significance, which requires me to know which labels classifier A got wrong vs. classifier B, and so on. As such, I need to record the actual values predicted by each classifier.

If you are using an evaluator, you can add true_vs_predicted to metrics to get the predicted values. In that case you also need to set n_wait=1. I would also suggest deactivating the plot, since n_wait=1 implies a high refresh rate in the plot, which adds a lot of overhead.
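Once the true labels and the predictions of both classifiers have been recorded, McNemar's test only needs the two disagreement counts. The following is a minimal sketch using the standard library only; the function name mcnemar_test is hypothetical, and the continuity-corrected chi-square form is one common variant of the test, chosen here as an assumption.

```python
import math


def mcnemar_test(y_true, pred_a, pred_b):
    """Return (b, c, statistic, p_value) for McNemar's test.

    b counts instances where A is correct and B is wrong,
    c counts instances where A is wrong and B is correct.
    """
    b = c = 0
    for truth, a, b_pred in zip(y_true, pred_a, pred_b):
        a_ok = a == truth
        b_ok = b_pred == truth
        if a_ok and not b_ok:
            b += 1
        elif b_ok and not a_ok:
            c += 1
    if b + c == 0:
        # The classifiers never disagree, so there is no evidence of a difference.
        return b, c, 0.0, 1.0
    # Chi-square statistic with continuity correction, 1 degree of freedom.
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return b, c, statistic, p_value


y_true = [1] * 10
pred_a = [1] * 10
pred_b = [0] * 8 + [1] * 2
b, c, statistic, p_value = mcnemar_test(y_true, pred_a, pred_b)
print(b, c, statistic, p_value)
```

For small disagreement counts an exact binomial version of the test is often preferred over the chi-square approximation.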

@automater0 I'm guessing the T in Kappa T stands for temporal. Bifet refers to it as Kper. See p. 91 of Bifet, A., Gavaldà, R., Holmes, G., & Pfahringer, B. (2017). Machine learning for data streams: with practical examples in MOA (Adaptive computation and machine learning series). MIT Press.

That is correct.
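For readers unfamiliar with the metric, here is a minimal sketch of the temporal Kappa statistic: the classifier's accuracy is compared against a persistent (no-change) classifier that always predicts the previous true label. The function name kappa_temporal is hypothetical, and this is an illustration of the definition, not the evaluator's own code.

```python
def kappa_temporal(y_true, y_pred):
    """Kappa temporal: (p0 - p_per) / (1 - p_per).

    p0 is the accuracy of the classifier and p_per the accuracy of a
    no-change classifier that predicts the previous true label.
    The first instance is skipped because it has no previous label.
    """
    n = len(y_true) - 1
    if n < 1:
        raise ValueError("need at least two instances")
    correct = sum(p == t for p, t in zip(y_pred[1:], y_true[1:]))
    persistent_correct = sum(prev == cur for prev, cur in zip(y_true[:-1], y_true[1:]))
    p0 = correct / n
    p_per = persistent_correct / n
    if p_per == 1.0:
        # Undefined when the labels never change.
        return float("nan")
    return (p0 - p_per) / (1 - p_per)


labels = [0, 0, 1, 1, 0, 0, 1, 1]
perfect = kappa_temporal(labels, labels)
# A classifier that simply repeats the previous label scores 0.
no_change = kappa_temporal(labels, [0] + labels[:-1])
print(perfect, no_change)
```

A value at or below 0 means the classifier does no better than repeating the last seen label, which is easy to miss when looking at accuracy alone on temporally correlated streams.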

Jacob
@jacobmontiel

@jacobmontiel how do we add LSTM and MLP deep learning algorithms to scikit-multiflow?

Those are still open questions, since those methods are usually trained on batches.

@tlfields scikit-multiflow does not include any implementation (yet). If batch-incremental learning instead of instance-incremental learning is fine for your use case, then you could do something similar to the BatchIncremental model. This is a simple class that shows how to do batch-incremental learning using batch methods from scikit-learn. But you are not restricted to models from that library.
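To make the batch-incremental idea concrete, here is a minimal sketch under stated assumptions: instances are buffered until a window is full, a fresh batch model is trained on that window, and the most recent models vote on predictions. BatchIncrementalSketch and MajorityClassModel are hypothetical names; the stand-in model only illustrates the fit/predict interface and this is not the library's BatchIncremental class.

```python
from collections import Counter, deque


class MajorityClassModel:
    """Hypothetical stand-in for any batch learner exposing fit/predict."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class BatchIncrementalSketch:
    def __init__(self, make_model, window_size=100, n_estimators=5):
        self.make_model = make_model
        self.window_size = window_size
        # Once n_estimators models exist, the oldest one is dropped automatically.
        self.models = deque(maxlen=n_estimators)
        self.X_buffer, self.y_buffer = [], []

    def partial_fit(self, X, y):
        for x_i, y_i in zip(X, y):
            self.X_buffer.append(x_i)
            self.y_buffer.append(y_i)
            if len(self.X_buffer) == self.window_size:
                # Train a new batch model on the full window, then start a new window.
                self.models.append(self.make_model().fit(self.X_buffer, self.y_buffer))
                self.X_buffer, self.y_buffer = [], []
        return self

    def predict(self, X):
        if not self.models:
            # No window has been completed yet, so no model is available.
            return [None for _ in X]
        votes = [model.predict(X) for model in self.models]
        # Majority vote across the models for each instance.
        return [Counter(column).most_common(1)[0][0] for column in zip(*votes)]


learner = BatchIncrementalSketch(MajorityClassModel, window_size=3, n_estimators=2)
before = learner.predict([[5]])
learner.partial_fit([[0], [1], [2]], [1, 1, 0])
after = learner.predict([[5]])
print(before, after)
```

Any batch model with the same fit/predict interface, including a neural network trained per window, could be passed as make_model in place of the stand-in.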
tlfields
@tlfields
@jacobmontiel thank you so much for your response
barnettjv
@barnettjv
Jacob, I added 'true_vs_predicted' and set the pretrain size to 50 on a data set of 200, along with n_wait=1, and I am not getting any predicted values.
I'm just getting the Accuracy, which is the only other metric I'm sending.
oh never mind. figured it out :D
asad1907
@asad1907
@barnettjv You can see the true and predicted values in results.csv by using true_vs_predicted in metrics together with output_file='results.csv'.
asad1907
@asad1907
@jacobmontiel I solved it in Jupyter Notebook using %matplotlib notebook. Now I am trying to get it working in JupyterLab. If I manage to, I will gladly share.
Jacob
@jacobmontiel

@jacobmontiel I solved it in Jupyter Notebook using %matplotlib notebook. Now I am trying to get it working in JupyterLab. If I manage to, I will gladly share.

Thanks for letting us know

tlfields
@tlfields
@jacobmontiel thank you for the video from anaconda con. I have watched it several times and I am learning so much from you. I have run the notebook you provided, and I have a question about how to use Page-Hinkley or the other drift detectors with "agr_a_20k.csv" instead of the stream dataset. Is this something you can help me with? I want to see which of the detectors pick up the drift specifically in agr_a_20k.csv.
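As background for the question, a drift detector is fed one value at a time (for example a column read from a CSV file, or a stream of 0/1 prediction errors) and signals when the values shift. The following is a minimal sketch of the textbook Page-Hinkley test for detecting an increase in the mean. The class name PageHinkleySketch, its parameter defaults, and the synthetic values are assumptions for illustration; this is not the library's detector and no real file is read.

```python
class PageHinkleySketch:
    """Textbook Page-Hinkley test for an increase in the mean of a sequence."""

    def __init__(self, delta=0.005, threshold=50.0, min_instances=30):
        self.delta = delta              # tolerated magnitude of change
        self.threshold = threshold      # alarm threshold (lambda)
        self.min_instances = min_instances
        self.n = 0
        self.mean = 0.0
        self.cumulative = 0.0
        self.minimum = 0.0

    def update(self, value):
        """Add one value and return True if a change is detected."""
        self.n += 1
        self.mean += (value - self.mean) / self.n
        self.cumulative += value - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        if self.n < self.min_instances:
            return False
        return (self.cumulative - self.minimum) > self.threshold


# Synthetic sequence with an abrupt change at index 200.
values = [0.0] * 200 + [1.0] * 200

detector = PageHinkleySketch()
first_alarm = None
for index, value in enumerate(values):
    if detector.update(value):
        first_alarm = index
        break
print(first_alarm)
```

The alarm comes some instances after the true change point, and the delay grows with the threshold, which is the usual trade-off between false alarms and detection speed.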
tlfields
@tlfields
@jacobmontiel .... I think I got it to work... I am so HAPPY!!