Jacob
@jacobmontiel
EvaluatePrequential requires as input a model that implements the predict and partial_fit methods. For each sample, the predict method returns the predicted value, whereas the partial_fit method is used to introduce new data into the model.
The important element is to process one sample at a time.
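As a rough sketch of that test-then-train loop: the MajorityClassModel below is a hypothetical toy model (not part of scikit-multiflow) that only illustrates the two required methods, and prequential_accuracy is an illustrative helper, not the EvaluatePrequential implementation.

```python
from collections import Counter


class MajorityClassModel:
    """Hypothetical toy model: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, X):
        # One prediction per row in X; default to 0 before any training
        label = self.counts.most_common(1)[0][0] if self.counts else 0
        return [label for _ in X]

    def partial_fit(self, X, y):
        # Incrementally update the label counts with the new data
        self.counts.update(y)
        return self


def prequential_accuracy(model, samples):
    """Test-then-train: predict each sample first, then learn from it."""
    correct = 0
    total = 0
    for x, label in samples:
        y_pred = model.predict([x])       # 1) test on the new sample
        correct += int(y_pred[0] == label)
        model.partial_fit([x], [label])   # 2) then train on it
        total += 1
    return correct / total if total else 0.0


samples = [([0.1], 1), ([0.2], 1), ([0.3], 0), ([0.4], 1)]
accuracy = prequential_accuracy(MajorityClassModel(), samples)
```

Any model exposing predict and partial_fit in this way could be dropped into the same loop.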
Jacob
@jacobmontiel
In this case, as your data does not seem to be labeled, you can consider anomaly detection methods or clustering (depending on the distribution of the data and how often you expect to see “different” samples).
Jacob
@jacobmontiel
Ignore my previous comment, as I got confused. Your data is indeed labeled. In this case, you can follow a structure similar to the one in the KNNClassifier.
farzane
@farzane30788946_twitter
i appreciate you indeed, i was confused in these concepts. have good happenings dear @jacobmontiel
Dossy Shiobara
@dossy
Is there a working example of how to use skmultiflow.data.DataStream?
Jacob
@jacobmontiel

Hi @dossy , here is one

from skmultiflow.data import DataStream
import numpy as np

n_features = 10
n_samples = 50 
X = np.random.random(size=(n_samples, n_features))
y = np.random.randint(2, size=n_samples)
stream = DataStream(data=X, y=y)
# stream.prepare_for_use() # if using the stable version (0.4.1)
stream.n_remaining_samples()

The last line returns 50.

This applies, for example, when you have your data in separate numpy.ndarray objects.
You can also use a single np.ndarray as long as you define the index of the target column (last column by default). pandas.DataFrame is also supported, following the same indications.
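For the single-array case, a minimal sketch of the layout. Only NumPy is used here; the DataStream call is left as a comment and assumes the default where the target is the last column.

```python
import numpy as np

n_features = 10
n_samples = 50
rng = np.random.RandomState(1)
X = rng.random_sample(size=(n_samples, n_features))
y = rng.randint(2, size=n_samples)

# Stack the features and the target into one array, target as the last column
data = np.hstack((X, y.reshape(-1, 1)))

# With scikit-multiflow installed, this single array could then be wrapped as:
# stream = DataStream(data=data)  # target column defaults to the last one
```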
Dossy Shiobara
@dossy
I know this is a n00b question, but if I’m working with strings, I have to vectorize them first? I can’t just pass in a pandas.DataFrame containing strings - only real/int values?
Dossy Shiobara
@dossy
How do you actually implement this as a stream processor? How do you operationalize this so it implements online learning, and periodically checkpoints to disk and reloads when restarted, etc.? I want to stream log data at it and have it classify events in the stream ... looks like scikit-multiflow is only a small part of the puzzle and there’s a lot of stuff you have to develop yourself around it?
Jacob
@jacobmontiel

I know this is a n00b question, but if I’m working with strings, I have to vectorize them first? I can’t just pass in a pandas.DataFrame containing strings - only real/int values?

All questions are welcomed. Currently, we only support numerical data. Your data must be pre-processed. As you mention, scikit-multiflow is focused on the learning part. The idea is that you can take it and integrate it as part of your workflow.

Jacob
@jacobmontiel
scikit-multiflow processes data one sample at a time. We provide the FileStream and DataStream classes for the case when you have data in a file or in memory. Both are extensions of the Stream class. If you want to read from a log file, you could process each log entry as it arrives (convert it to numerical values) and then pass it to a model. The operation to receive and process the log entry can be wrapped in an extension of the Stream class. The most relevant method is next_sample.
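A rough sketch of that idea, under some assumptions: LogStream is a hypothetical class that only mimics the next_sample / has_more_samples interface (it does not inherit from the actual Stream class), it returns features only since the log lines here are unlabeled, and the hashing-based vectorization is just one possible way to turn strings into numbers.

```python
import zlib


class LogStream:
    """Hypothetical stream-like wrapper around an iterable of log lines."""

    def __init__(self, lines, n_features=8):
        self._lines = iter(lines)
        self._n_features = n_features
        # Look one line ahead so has_more_samples can answer without consuming
        self._next_line = next(self._lines, None)

    def has_more_samples(self):
        return self._next_line is not None

    def _vectorize(self, line):
        # Hashing trick: count tokens into a fixed number of buckets.
        # crc32 is used because it is deterministic across runs.
        features = [0.0] * self._n_features
        for token in line.lower().split():
            bucket = zlib.crc32(token.encode("utf-8")) % self._n_features
            features[bucket] += 1.0
        return features

    def next_sample(self):
        # Return one sample as a 2D structure (1 row, n_features columns)
        line = self._next_line
        self._next_line = next(self._lines, None)
        return [self._vectorize(line)]


stream = LogStream(["ERROR disk full", "INFO user login ok"])
samples = []
while stream.has_more_samples():
    samples.append(stream.next_sample())
```

Each element of samples could then be passed to a model's predict or partial_fit, one log entry at a time.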
tlfields
@tlfields
Hello, does anyone know how to add a random forest to scikit-multiflow? Not the adaptive random forest, but a random forest with no drift adaptation. Or is this something that has already been done? Thank you for your help
nuwangunasekara
@nuwangunasekara
Is there a way to achieve something similar to sklearn.preprocessing.OneHotEncoder() in scikit-multiflow in a streaming setting?
Any tip is highly appreciated. Thanks!
Niek Tax
@TaXxER

Hi, I am building a HoeffdingTree classifier on a heavily imbalanced data stream (only ~1 in 1000 data points are of the positive class). Using the EvaluatePrequential evaluator I am able to plot the precision and recall, however, the recall is extremely low as the model learns to predict the negative class almost always (only 50 positive predictions in my stream of 10 million data points).

Tree classifiers often give me class probabilities rather than discrete class outputs, and the actual recall (and precision) is of course threshold-dependent. Is there a way to control the threshold for which I am evaluating the recall?

Emanuel Rodrigues
@emanueldosreis_twitter
Hello guys, I am looking for some example code for Anomaly Detection. I did not find anything in the documentation or on the internet. Any code example would be great.
Jacob
@jacobmontiel

@tlfields

does anyone know how to add a random forest to scikit-multiflow?

I am not sure I understand, do you mean that you want to run the batch version of Random Forest? If you want to run the stream version without concept drift then set drift_detection_method=None

Jacob
@jacobmontiel

@nuwangunasekara

Is there a way to achieve something similar to sklearn.preprocessing.OneHotEncoder() in scikit-multiflow in a streaming setting?

The StreamTransform class can be used as base class to implement it. There are two main scenarios (with challenges) that I can see:

  1. If you know the number of distinct values in a nominal attribute, then it should be as simple as mapping each value to the corresponding binary attribute (a dict would help).
  2. If you don’t know the distinct values in the nominal attribute. This is more challenging: first, the mapping must be maintained dynamically. Second, if a new value appears, the length of the sample will change, as a new binary attribute will be added. This is complex, as it is not guaranteed that methods are going to support “emerging” attributes in this fashion.
    I would explore the first scenario first, as the second one seems more like a corner case.
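A minimal sketch of the first scenario. KnownValuesOneHotEncoder is a hypothetical plain-Python class (it does not extend StreamTransform), and the attribute values are made up for illustration.

```python
class KnownValuesOneHotEncoder:
    """Hypothetical one-hot encoder for a nominal attribute with known values."""

    def __init__(self, attribute_idx, values):
        self.attribute_idx = attribute_idx
        # dict mapping each known value to the position of its binary attribute
        self.positions = {value: i for i, value in enumerate(values)}

    def transform(self, sample):
        # sample is a flat list holding one observation
        one_hot = [0] * len(self.positions)
        one_hot[self.positions[sample[self.attribute_idx]]] = 1
        # Replace the nominal attribute by its binary attributes
        idx = self.attribute_idx
        return sample[:idx] + one_hot + sample[idx + 1:]


encoder = KnownValuesOneHotEncoder(attribute_idx=1, values=["red", "green", "blue"])
encoded = encoder.transform([0.5, "green", 3.0])
```

An unseen value raises a KeyError here, which is exactly where the second scenario would need extra handling.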
Jacob
@jacobmontiel

Tree classifiers often give me class probabilities rather than discrete class outputs, and the actual recall (and precision) is of course threshold-dependent. Is there a way to control the threshold for which I am evaluating the recall?

You can get probabilities via the predict_proba method of the HoeffdingTree; however, there is currently no support for this in the EvaluatePrequential class. In this case, you might want to try implementing the prequential evaluation process yourself. Something like this:

# Imports
from skmultiflow.data import SEAGenerator
from skmultiflow.trees import HoeffdingTreeClassifier
# Setting up a data stream
stream = SEAGenerator(random_state=1)
# Setup Hoeffding Tree estimator
ht = HoeffdingTreeClassifier()
# Setup variables to control loop and track performance
n_samples = 0
correct_cnt = 0
max_samples = 200
# Train the estimator with the samples provided by the data stream
while n_samples < max_samples and stream.has_more_samples():
    X, y = stream.next_sample()
    y_pred = ht.predict(X)
    if y[0] == y_pred[0]:
        correct_cnt += 1
    ht = ht.partial_fit(X, y)
    n_samples += 1

The metrics can be calculated using the ClassificationPerformanceEvaluator and WindowClassificationPerformanceEvaluator in the development branch.
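To make the threshold explicit, here is a sketch of a small helper that tracks precision and recall for a chosen threshold. ThresholdedBinaryMetrics is a hypothetical class, not part of scikit-multiflow; inside the loop above, the probability of the positive class would come from ht.predict_proba(X) instead of the hard-coded values used here.

```python
class ThresholdedBinaryMetrics:
    """Hypothetical helper: precision/recall for a chosen decision threshold."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.tp = self.fp = self.fn = 0

    def update(self, y_true, proba_positive):
        # Turn the probability of the positive class into a discrete prediction
        y_pred = 1 if proba_positive >= self.threshold else 0
        if y_pred == 1 and y_true == 1:
            self.tp += 1
        elif y_pred == 1 and y_true == 0:
            self.fp += 1
        elif y_pred == 0 and y_true == 1:
            self.fn += 1

    def precision(self):
        predicted_pos = self.tp + self.fp
        return self.tp / predicted_pos if predicted_pos else 0.0

    def recall(self):
        actual_pos = self.tp + self.fn
        return self.tp / actual_pos if actual_pos else 0.0


metrics = ThresholdedBinaryMetrics(threshold=0.2)
# (true label, probability of the positive class), illustrative values only
for y_true, p in [(1, 0.3), (0, 0.1), (1, 0.15), (0, 0.25)]:
    metrics.update(y_true, p)
```

Lowering the threshold trades precision for recall, which is usually what is wanted on a heavily imbalanced stream.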

nuwangunasekara
@nuwangunasekara

Thanks for the tip @jacobmontiel !

Jacob
@jacobmontiel

I am looking for some example code for Anomaly Detection.

Sorry about that, here is an example for HalfSpaceTrees:

# Imports
from skmultiflow.data import SEAGenerator
from skmultiflow.anomaly_detection import HalfSpaceTrees
# Setup a data stream
stream = SEAGenerator(random_state=1)
# Setup Half-Space Trees estimator
half_space_trees = HalfSpaceTrees(random_state=1, n_estimators=5)
# Pre-train the model with one sample
X, y = stream.next_sample()
half_space_trees.partial_fit(X, y)
# Setup variables to control loop and track performance
n_samples = 0
max_samples= 5000
anomaly_cnt = 0
# Train the estimator(s) with the samples provided by the data stream
while n_samples < max_samples and stream.has_more_samples():
   X, y = stream.next_sample()
   y_pred = half_space_trees.predict(X)
   if y_pred[0] == 1:
       anomaly_cnt += 1
   half_space_trees = half_space_trees.partial_fit(X, y)
   n_samples += 1
# Display results
print('{} samples analyzed.'.format(n_samples))
print('Half-Space Trees anomalies detected: {}'.format(anomaly_cnt))
Some comments:
  1. The pre-train phase is needed in this case to avoid an error when predicting and the model is empty
  2. The SEA generator does not really provide data with actual anomalies; we just use it to show how the detector interacts with a Stream object. You must replace the generator with the proper data.
  3. This example corresponds to the development version where the parameter n_features has been removed from the signature.
Emanuel Rodrigues
@emanueldosreis_twitter
Thank you, appreciate it.
I had to add stream.prepare_for_use() and n_features=2 in order to run it, as I am not using the development version.
Emanuel Rodrigues
@emanueldosreis_twitter
I have a dataset that contains mostly 0 and 1 across 300+ features. I am currently using the sklearn Isolation Forest but really want to swap it for an online model, as it is now impossible to re-train the model in a timely manner. My data is currently stored in a pandas DataFrame, as I said mostly 0 and 1; each observation is a pandas DataFrame shaped as (1, 339). It looks like the training function only allows numpy arrays. The data looks like this: {'x': 0, 'y': 1, 'z': 0, ...} where the letters represent the features/columns. Of course I am not an experienced programmer, and I am wondering how I could fit my dataset into skmultiflow. Thank you again.
# Imports
from skmultiflow.data import SEAGenerator
from skmultiflow.anomaly_detection import HalfSpaceTrees
# Setup a data stream
stream = SEAGenerator(random_state=1)
stream.prepare_for_use()
# Setup Half-Space Trees estimator
half_space_trees = HalfSpaceTrees(random_state=1, n_estimators=5, n_features=2)
# Pre-train the model with one sample
X, y = stream.next_sample()
half_space_trees.partial_fit(X, y)
# Setup variables to control loop and track performance
n_samples = 0
max_samples= 5000
anomaly_cnt = 0
# Train the estimator(s) with the samples provided by the data stream
while n_samples < max_samples and stream.has_more_samples():
   X, y = stream.next_sample()
   y_pred = half_space_trees.predict(X)
   if y_pred[0] == 1:
       anomaly_cnt += 1
   half_space_trees = half_space_trees.partial_fit(X, y)
   n_samples += 1
# Display results
print('{} samples analyzed.'.format(n_samples))
print('Half-Space Trees anomalies detected: {}'.format(anomaly_cnt))
Emanuel Rodrigues
@emanueldosreis_twitter
This is the running code for the version 0.4.1 currently available to be installed using pip. Reference: HalfSpaceTrees Example for Anomaly detection by @jacobmontiel
Jacob
@jacobmontiel
Hi Emanuel, does the DataFrame contain the ground truth (1 if the sample is an anomaly)?
Emanuel Rodrigues
@emanueldosreis_twitter
Hi Jacob, no it does not. So, the performance would be measured using synthetic data which works most of the time.
Jacob
@jacobmontiel
Here is a workaround to get the data from a DataFrame
# Imports
from skmultiflow.anomaly_detection import HalfSpaceTrees
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(30, 3), columns=['x', 'y', 'z'])
# Access raw numpy array inside the dataframe
X_array = df.values
# Setup Half-Space Trees estimator
half_space_trees = HalfSpaceTrees(random_state=1, n_estimators=5) #, n_features=2)
# Pre-train the model with one sample
# the sample is a 1D array and we must pass a 2D array, thus np.asarray([X_array[0]])
half_space_trees.partial_fit(np.asarray([X_array[0]]), [0])

anomaly_cnt = 0
# Train the estimator(s) with the samples provided by the data stream
for X in X_array[1:]:
   y_pred = half_space_trees.predict([X])
   if y_pred[0] == 1:
       anomaly_cnt += 1
   half_space_trees = half_space_trees.partial_fit(np.asarray([X]), [0])
# Display results
print('Half-Space Trees anomalies detected: {}'.format(anomaly_cnt))
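If each observation arrives as a dict such as {'x': 0, 'y': 1, 'z': 0} rather than as a DataFrame row, a small helper along these lines could build the 2D input. record_to_row is a hypothetical name, and defaulting missing features to 0 is an assumption made for this sketch.

```python
def record_to_row(record, feature_names):
    """Convert one dict observation into a 2D row (1 sample, n_features).

    Missing features default to 0, which is an assumption for this sketch.
    """
    return [[record.get(name, 0) for name in feature_names]]


feature_names = ['x', 'y', 'z']  # fix the column order once
row = record_to_row({'x': 0, 'y': 1, 'z': 0}, feature_names)
# np.asarray(row) would give an array of shape (1, 3) for partial_fit/predict
```

Fixing the column order up front matters, since the model sees features only by position.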
nuwangunasekara
@nuwangunasekara
Hi guys!
Does anyone know of a way that I could save a Stream into a file in scikit-multiflow?
ideally to a csv file
Jacob
@jacobmontiel
there you go
from skmultiflow.data import SEAGenerator

import pandas as pd
import numpy as np


X, y = SEAGenerator(random_state=12345).next_sample(1000)
df = pd.DataFrame(np.hstack((X, y.reshape(-1,1))),
                  columns=['attr_{}'.format(i) for i in range(X.shape[1])] + ['target'])
df.target = df.target.astype(int)
df.to_csv('stream.csv')
nuwangunasekara
@nuwangunasekara
Cool... thanks @jacobmontiel !
tlfields
@tlfields
Hello @jacobmontiel, can you please help me to access random forest using scikit-multiflow? I am trying to compare the performance of Random Forest with and without ADWIN. I see the Adaptive Random Forest is already implemented, but I don't see how to bring in a Random Forest. Please and thank you
Jacob
@jacobmontiel
RandomForest is the batch version based on Decision Trees. AdaptiveRandomForest is the stream version based on Hoeffding Trees. AdaptiveRandomForest can be used with or without the drift detection. If you want to use AdaptiveRandomForest without drift detection you must initialize it as AdaptiveRandomForest(drift_detection_method=None)
Emanuel Rodrigues
@emanueldosreis_twitter
Thank you so much @jacobmontiel
asad1907
@asad1907
Hi everyone, I have a problem related to show_plot. I set show_plot=True but it doesn't work. What should I do? Do you have any idea?
Saulo Martiello Mastelini
@smastelini
Hi @asad1907, could you provide a MWE to help us figure out your problem?
asad1907
@asad1907
@smastelini
from skmultiflow.data.data_stream import DataStream
from skmultiflow.evaluation import EvaluatePrequential
from skmultiflow.trees import HoeffdingTree

stream  = DataStream(X_train, y = y_train)
stream.prepare_for_use()

ht = HoeffdingTree()
evaluator = EvaluatePrequential(show_plot=True,
                                pretrain_size=5000,
                                max_samples=20000,
                                metrics = ['accuracy', 'running_time','model_size'],
                                output_file='results.csv')

evaluator.evaluate(stream=stream, model=ht);
[Screenshot: image.png]
This is all I got
Saulo Martiello Mastelini
@smastelini

How many instances does your dataset have?

Did you try to decrease the pretrain_size to, let's say, pretrain_size=200?

asad1907
@asad1907

@smastelini

X_train shape : (20631, 16)
y_train shape : (20631,)

@smastelini yes, i did it but it doesn't change
Saulo Martiello Mastelini
@smastelini
Are you using jupyter notebooks? You might need to change your matplotlib backend
asad1907
@asad1907
@smastelini I am using JupyterLab. I tried %matplotlib widget and then I got the following problem