Jiao Yin
@JoanYinCQ
@barnettjv https://github.com/scikit-multiflow/scikit-multiflow/tree/master/src/skmultiflow/data/datasets I downloaded the data and test scripts from the GitHub repo of scikit-multiflow.
Jacob
@jacobmontiel

Thanks @jacobmontiel . I ran exactly the same code as you, but the error persisted. However, when I re-downloaded the data, the error was fixed. So maybe the data was not saved properly the first time. Thanks for your reply.

Glad to hear that it is working. It is strange that it just went away; in any case, we will keep it in mind in case somebody else gets the same error.

Jiao Yin
@JoanYinCQ

Hi @jacobmontiel I want to know if all the methods for concept drift detection included in skmultiflow.drift_detection support only 1-D data streams. For example, when using a 2-D (size=[2000, 5]) data_stream in the following code, an error arises.

# Imports
import numpy as np
from skmultiflow.drift_detection import PageHinkley
ph = PageHinkley()

# Simulating a data stream as a normal distribution of 1's and 0's
data_stream = np.random.randint(2, size=[2000, 5])

# Changing the data concept from index 999 to 2000
for i in range(999, 2000):
    data_stream[i] = np.random.randint(4, high=8, size=5)

# Adding stream elements to the PageHinkley drift detector and verifying if drift occurred
for i in range(2000):
    ph.add_element(data_stream[i])
    if ph.detected_change():
        print('Change has been detected in data: ' + str(data_stream[i]) + ' - of index: ' + str(i))

Jacob
@jacobmontiel
Hi @JoanYinCQ , yes, all drift detectors support only 1-D data. This is not defined by scikit-multiflow but by the algorithms themselves.
Jacob
@jacobmontiel
your code should look like this:
import numpy as np
from skmultiflow.drift_detection import PageHinkley

data_stream = np.concatenate((np.random.randint(2, size=1000), np.random.randint(4, size=1000)))

ph = PageHinkley()
for i, val in enumerate(data_stream):
    ph.add_element(val)
    if ph.detected_change():
        print('Change has been detected in data: ' + str(data_stream[i]) + ' - of index: ' + str(i))
        ph.reset()
Jiao Yin
@JoanYinCQ
@jacobmontiel Thanks for your reply. That solves my problem.
nuwangunasekara
@nuwangunasekara
Hi guys, does scikit-multiflow support the "arff" file format?
Jacob
@jacobmontiel
Short answer, no. However, there is a loader in scipy (scipy.io.arff.loadarff), so you can load the data and then convert it using skmultiflow.data.DataStream
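For illustration, a minimal sketch of that route (the tiny inline .arff file here is made up just for the example):

```python
from io import StringIO
from scipy.io import arff
import pandas as pd

# A tiny made-up ARFF file, inlined for the example
arff_text = """@relation demo
@attribute x numeric
@attribute y numeric
@attribute class {0,1}
@data
1.0,2.0,0
3.0,4.0,1
"""

data, meta = arff.loadarff(StringIO(arff_text))
df = pd.DataFrame(data)
# Nominal attributes are loaded as bytes; decode them to get usable labels
df['class'] = df['class'].str.decode('utf-8').astype(int)
X = df[['x', 'y']].values
y = df['class'].values
# X and y can now be passed to skmultiflow.data.DataStream(data=X, y=y)
```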
nuwangunasekara
@nuwangunasekara
Thanks for the tip @jacobmontiel !
farzane
@farzane30788946_twitter
hi, I have a problem. I want to compute a kernel density estimate and I use sklearn.neighbors.KernelDensity; multivariate data is coming in as a stream and I want to update my estimated density as streaming data arrives. So I chose scikit-multiflow, since it supports streaming data and has window-based drift detection methods. But I want to find out if the multiflow core can be combined with sklearn.neighbors.KernelDensity to update densities as data comes in. If it cannot, how should I use ADWIN for multivariate data to keep part of the data in a window, replace the saved part every time drift is detected, and recompute the kernel density estimate on the new part of the data? Any other advice will be useful, thanks in advance!
Jacob
@jacobmontiel
Hello @farzane30788946_twitter . You can use sklearn.neighbors.KernelDensity alongside scikit-multiflow. As you mention, the sklearn implementation works in batches of data; if you want to update the densities, you have to define a data update strategy. This is very similar to how the KNNClassifier is implemented: you will see there that the data is stored in a sliding window. Regarding drift detection, ADWIN, like all other drift detectors, takes 1-dimensional data as input. You can check the KNNADWINClassifier, which uses ADWIN to monitor the classification performance of the basic KNN model. If ADWIN detects a change in classification performance, the model is reset.
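A rough sketch of that sliding-window idea (not the actual KNNClassifier internals; the window size and bandwidth are placeholders, and in practice a detector such as ADWIN would supply the drift signal):

```python
from collections import deque
import numpy as np
from sklearn.neighbors import KernelDensity

window = deque(maxlen=500)   # sliding window of recent multivariate samples
kde = None

def update(sample, drift_detected=False):
    """Add one sample; refit the KDE on the current window contents."""
    global kde
    if drift_detected:
        window.clear()       # drop pre-drift data, mirroring a detector-triggered reset
    window.append(sample)
    # Refitting on every sample is wasteful; a real setup would refit periodically
    kde = KernelDensity(bandwidth=0.5).fit(np.asarray(window))

rng = np.random.default_rng(42)
for x in rng.normal(size=(100, 5)):
    update(x)
print(kde.score_samples(rng.normal(size=(1, 5))))  # log-density of a new sample
```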
farzane
@farzane30788946_twitter
much obliged @jacobmontiel . I will try it
farzane
@farzane30788946_twitter
hi guys, one question: I have the distribution of a data set, and I want to measure the probability that a new instance belongs to this distribution. It is the same thing generative adversarial networks do to discriminate real data from fake data, but I am wondering how I can do that in Python. I am not sure whether there are methods for this in scikit, but if there are I would prefer to use those instead of other libraries.
Any ideas, even from other libraries such as stats, will be useful! Thanks in advance.
farzane
@farzane30788946_twitter
I found the answer to my question myself! A binary cross-entropy loss function is useful here, but I am wondering if I can use this function as a learner model in skmultiflow.evaluation.evaluate_prequential.EvaluatePrequential.
@jacobmontiel you answered my last question; much obliged for your answer. Do you have any idea here too? If you have some experience, let me know; thanks in advance.
Jacob
@jacobmontiel
EvaluatePrequential requires as input a model that implements the predict and partial_fit methods. For each sample, the predict method returns the predicted value, whereas the partial_fit method is used to introduce new data into the model.
The important element is to process one sample at a time.
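For illustration, a minimal (hypothetical) model satisfying that interface, here just a majority-class classifier:

```python
import numpy as np

class MajorityClassClassifier:
    """Toy incremental model exposing the predict / partial_fit interface."""
    def __init__(self):
        self.counts = {}

    def partial_fit(self, X, y, classes=None):
        # Introduce new samples into the model, one batch (often one sample) at a time
        for label in y:
            self.counts[label] = self.counts.get(label, 0) + 1
        return self

    def predict(self, X):
        # Predict the most frequent label seen so far for every sample
        if not self.counts:
            return np.zeros(len(X), dtype=int)
        majority = max(self.counts, key=self.counts.get)
        return np.full(len(X), majority)

model = MajorityClassClassifier()
model.partial_fit([[0.1], [0.2], [0.3]], [1, 1, 0])
print(model.predict([[0.5]]))  # most frequent label seen so far
```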
Jacob
@jacobmontiel
in this case, as your data seems to be unlabeled, you can consider anomaly detection methods or clustering (depending on the distribution of the data and how often you expect to see “different” samples)
Jacob
@jacobmontiel
ignore my previous comment, I got confused. Your data is indeed labeled. In this case, you can follow a structure similar to the one in the KNNClassifier.
farzane
@farzane30788946_twitter
I appreciate it, indeed; I was confused about these concepts. All the best, dear @jacobmontiel
Dossy Shiobara
@dossy
Is there a working example of how to use skmultiflow.data.DataStream?
Jacob
@jacobmontiel

Hi @dossy , here is one

from skmultiflow.data import DataStream
import numpy as np

n_features = 10
n_samples = 50
X = np.random.random(size=(n_samples, n_features))
y = np.random.randint(2, size=n_samples)
stream = DataStream(data=X, y=y)
# stream.prepare_for_use() # if using the stable version (0.4.1)
stream.n_remaining_samples()

The last line returns 50.

This is for the case where you have your data in separate numpy.ndarrays.
You can also use a single np.ndarray, as long as you define the index of the target column (last column by default). pandas.DataFrame is also supported, following the same indications.
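For the single-array case, the split that DataStream performs (with the default convention of the target in the last column, which I am assuming here) is equivalent to this slicing:

```python
import numpy as np

# One array where the last column is the target (the assumed default convention)
data = np.hstack((np.random.random(size=(50, 10)),
                  np.random.randint(2, size=(50, 1))))

X = data[:, :-1]      # features: all but the last column
y = data[:, -1]       # target: the last column
```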
Dossy Shiobara
@dossy
I know this is a n00b question, but if I’m working with strings, I have to vectorize them first? I can’t just pass in a pandas.DataFrame containing strings - only real/int values?
Dossy Shiobara
@dossy
How do you actually implement this as a stream processor? How do you operationalize this so it implements online learning, and periodically checkpoints to disk and reloads when restarted, etc.? I want to stream log data at it and have it classify events in the stream ... looks like scikit-multiflow is only a small part of the puzzle and there’s a lot of stuff you have to develop yourself around it?
Jacob
@jacobmontiel

I know this is a n00b question, but if I’m working with strings, I have to vectorize them first? I can’t just pass in a pandas.DataFrame containing strings - only real/int values?

All questions are welcome. Currently, we only support numerical data, so your data must be pre-processed first. As you mention, scikit-multiflow is focused on the learning part; the idea is that you can take it and integrate it as part of your workflow.

Jacob
@jacobmontiel
scikit-multiflow processes data one sample at a time. We provide the FileStream and DataStream classes for the case when you have data in a file or in memory. Both are extensions of the Stream class. If you want to read from a log file, you could process each log entry as it arrives (convert it to numerical values) and then pass it to a model. The operation to receive and process the log entry can be wrapped in an extension of the Stream class. The most relevant method is next_sample.
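A simplified sketch of that idea (a plain Python class rather than an actual Stream subclass; the log format, field names, and labeling rule are all made up for illustration):

```python
import numpy as np

class LogStream:
    """Toy stream that turns text log lines into numeric samples, one at a time."""
    LEVELS = {'INFO': 0, 'WARN': 1, 'ERROR': 2}

    def __init__(self, lines):
        self._lines = iter(lines)

    def next_sample(self):
        # Mirrors the role of Stream.next_sample(): one (X, y) pair per call
        line = next(self._lines, None)
        if line is None:
            return None, None
        level, latency_ms = line.split(',')
        X = np.array([[self.LEVELS[level], float(latency_ms)]])
        y = np.array([1 if level == 'ERROR' else 0])
        return X, y

stream = LogStream(['INFO,12.5', 'ERROR,250.0'])
X, y = stream.next_sample()
```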
tlfields
@tlfields
Hello, does anyone know how to add a random forest to scikit-multiflow? Not the adaptive random forest, but a random forest with no drift adaptation. Or is this something that has already been done? Thank you for your help
nuwangunasekara
@nuwangunasekara
Is there a way to achieve something similar to sklearn.preprocessing.OneHotEncoder() in scikit-multiflow in a streaming setting?
Any tip is highly appreciated. Thanks!
Niek Tax
@TaXxER

Hi, I am building a HoeffdingTree classifier on a heavily imbalanced data stream (only ~1 in 1000 data points are of the positive class). Using the EvaluatePrequential evaluator I am able to plot the precision and recall, however, the recall is extremely low as the model learns to predict the negative class almost always (only 50 positive predictions in my stream of 10 million data points).

Tree classifiers often give me class probabilities rather than discrete class outputs, and the actual recall (and precision) is of course threshold-dependent. Is there a way to control the threshold for which I am evaluating the recall?

Emanuel Rodrigues
@emanueldosreis_twitter
Hello guys, I am looking for some example code for anomaly detection. I did not find anything in the documentation or on the internet. Any code example would be great.
Jacob
@jacobmontiel

@tlfields

does anyone know how to add a random forest to scikit-multiflow?

I am not sure I understand, do you mean that you want to run the batch version of Random Forest? If you want to run the stream version without concept drift adaptation, then set drift_detection_method=None

Jacob
@jacobmontiel

@nuwangunasekara

Is there a way to achieve something similar to sklearn.preprocessing.OneHotEncoder() in scikit-multiflow in a streaming setting?

The StreamTransform class can be used as a base class to implement it. There are two main scenarios (with challenges) that I can see:

  1. You know the number of distinct values in the nominal attribute. Then it should be as simple as mapping each value to the corresponding binary attribute (a dict would help).
  2. You don’t know the distinct values in the nominal attribute. This is more challenging: first, the mapping must be maintained dynamically; second, if a new value appears, the length of the sample will change as a new binary attribute is added. This is complex, as it is not guaranteed that methods will support “emerging” attributes in this fashion.

I would explore the first scenario first, as the second one seems more like a corner case.
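A sketch of the first scenario, with a made-up attribute whose distinct values are known up front:

```python
import numpy as np

# Known distinct values of a nominal attribute (made-up example)
categories = ['red', 'green', 'blue']
index = {value: i for i, value in enumerate(categories)}

def one_hot(value):
    """Map a nominal value to its fixed-length binary encoding."""
    encoded = np.zeros(len(categories), dtype=int)
    encoded[index[value]] = 1
    return encoded

# Encode values one at a time, as they arrive on the stream
print(one_hot('green'))  # [0 1 0]
```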
Jacob
@jacobmontiel

Tree classifiers often give me class probabilities rather than discrete class outputs, and the actual recall (and precision) is of course threshold-dependent. Is there a way to control the threshold for which I am evaluating the recall?

You can get probabilities via the predict_proba method of the HoeffdingTree; however, there is currently no support for this in the EvaluatePrequential class. In this case, you might want to implement the prequential evaluation process yourself. Something like this:

# Imports
from skmultiflow.data import SEAGenerator
from skmultiflow.trees import HoeffdingTreeClassifier
# Setting up a data stream
stream = SEAGenerator(random_state=1)
# Setup Hoeffding Tree estimator
ht = HoeffdingTreeClassifier()
# Setup variables to control loop and track performance
n_samples = 0
correct_cnt = 0
max_samples = 200
# Train the estimator with the samples provided by the data stream
while n_samples < max_samples and stream.has_more_samples():
    X, y = stream.next_sample()
    y_pred = ht.predict(X)
    if y[0] == y_pred[0]:
        correct_cnt += 1
    ht = ht.partial_fit(X, y)
    n_samples += 1

The metrics can be calculated using the ClassificationPerformanceEvaluator and WindowClassificationPerformanceEvaluator in the development branch.
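Once you have the probabilities, applying your own decision threshold is straightforward; a sketch with made-up numbers, where proba stands in for the output of predict_proba:

```python
import numpy as np

# Stand-in for predict_proba output: P(class 0), P(class 1) per sample
proba = np.array([[0.95, 0.05],
                  [0.70, 0.30],
                  [0.40, 0.60]])
y_true = np.array([0, 1, 1])

threshold = 0.25          # lower than 0.5 to favour recall on the rare class
y_pred = (proba[:, 1] >= threshold).astype(int)

# Recall: fraction of true positives that were predicted positive
recall = (y_pred[y_true == 1] == 1).mean()
print(recall)  # both positives are caught at this threshold
```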

nuwangunasekara
@nuwangunasekara

@nuwangunasekara

Is there a way to achieve something similar to sklearn.preprocessing.OneHotEncoder() in scikit-multiflow in a streaming setting?


Thanks for the tip @jacobmontiel !

Jacob
@jacobmontiel

I am looking for some example code for Anomaly Detection.

Sorry about that, here is an example for HalfSpaceTrees:

# Imports
from skmultiflow.data import SEAGenerator
from skmultiflow.anomaly_detection import HalfSpaceTrees
# Setup a data stream
stream = SEAGenerator(random_state=1)
# Setup Half-Space Trees estimator
half_space_trees = HalfSpaceTrees(random_state=1, n_estimators=5)
# Pre-train the model with one sample
X, y = stream.next_sample()
half_space_trees.partial_fit(X, y)
# Setup variables to control loop and track performance
n_samples = 0
max_samples = 5000
anomaly_cnt = 0
# Train the estimator(s) with the samples provided by the data stream
while n_samples < max_samples and stream.has_more_samples():
    X, y = stream.next_sample()
    y_pred = half_space_trees.predict(X)
    if y_pred[0] == 1:
        anomaly_cnt += 1
    half_space_trees = half_space_trees.partial_fit(X, y)
    n_samples += 1
# Display results
print('{} samples analyzed.'.format(n_samples))
print('Half-Space Trees anomalies detected: {}'.format(anomaly_cnt))
Some comments:
  1. The pre-train phase is needed in this case to avoid an error when predicting while the model is still empty.
  2. The SEA generator does not really provide data with actual anomalies; we just use it to show how the detector interacts with a Stream object. You must replace the generator with your actual data.
  3. This example corresponds to the development version, where the parameter n_features has been removed from the signature.
Emanuel Rodrigues
@emanueldosreis_twitter
Thank you, appreciate it.
I had to add stream.prepare_for_use() and n_features=2 in order to run it, as I am not using the development version.
Emanuel Rodrigues
@emanueldosreis_twitter
I have a dataset that contains mostly 0s and 1s across 300+ features. I am currently using sklearn's Isolation Forest but really want to swap to an online model, as it is now impossible to re-train the model in a timely manner. My data is currently in a pandas dataframe; as I said, mostly 0s and 1s. Each observation is a pandas df shaped as (1, 339). It looks like the training function only allows numpy arrays. The data looks like this: {'x': 0, 'y': 1,'z':0 . . . } where the letters represent the features/columns. Of course I am not an experienced programmer, and I am wondering how I could fit my dataset into skmultiflow. Thank you again.
# Imports
from skmultiflow.data import SEAGenerator
from skmultiflow.anomaly_detection import HalfSpaceTrees
# Setup a data stream
stream = SEAGenerator(random_state=1)
stream.prepare_for_use()
# Setup Half-Space Trees estimator
half_space_trees = HalfSpaceTrees(random_state=1, n_estimators=5, n_features=2)
# Pre-train the model with one sample
X, y = stream.next_sample()
half_space_trees.partial_fit(X, y)
# Setup variables to control loop and track performance
n_samples = 0
max_samples = 5000
anomaly_cnt = 0
# Train the estimator(s) with the samples provided by the data stream
while n_samples < max_samples and stream.has_more_samples():
    X, y = stream.next_sample()
    y_pred = half_space_trees.predict(X)
    if y_pred[0] == 1:
        anomaly_cnt += 1
    half_space_trees = half_space_trees.partial_fit(X, y)
    n_samples += 1
# Display results
print('{} samples analyzed.'.format(n_samples))
print('Half-Space Trees anomalies detected: {}'.format(anomaly_cnt))
Emanuel Rodrigues
@emanueldosreis_twitter
This is the running code for version 0.4.1, currently available to install via pip. Reference: the HalfSpaceTrees anomaly detection example by @jacobmontiel
Jacob
@jacobmontiel
Hi Emanuel, does the DataFrame contain the ground truth? 1 if the sample is an anomaly.
Emanuel Rodrigues
@emanueldosreis_twitter
Hi Jacob, no, it does not. So the performance would be measured using synthetic data, which works most of the time.
Jacob
@jacobmontiel
Here is a workaround to get the data from a DataFrame
# Imports
from skmultiflow.anomaly_detection import HalfSpaceTrees
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(30, 3), columns=['x', 'y', 'z'])
# Access raw numpy array inside the dataframe
X_array = df.values
# Setup Half-Space Trees estimator
half_space_trees = HalfSpaceTrees(random_state=1, n_estimators=5)
# Pre-train the model with one sample
# the sample is a 1D array and we must pass a 2D array, thus np.asarray([X_array[0]])
half_space_trees.partial_fit(np.asarray([X_array[0]]), [0])

anomaly_cnt = 0
# Train the estimator(s) with the samples provided by the data stream
for X in X_array[1:]:
    y_pred = half_space_trees.predict([X])
    if y_pred[0] == 1:
        anomaly_cnt += 1
    half_space_trees = half_space_trees.partial_fit(np.asarray([X]), [0])
# Display results
print('Half-Space Trees anomalies detected: {}'.format(anomaly_cnt))
nuwangunasekara
@nuwangunasekara
Hi guys!
Does anyone know of a way to save a Stream to a file in scikit-multiflow?
Ideally to a CSV file.
Jacob
@jacobmontiel
there you go
from skmultiflow.data import SEAGenerator

import pandas as pd
import numpy as np


X, y = SEAGenerator(random_state=12345).next_sample(1000)
df = pd.DataFrame(np.hstack((X, y.reshape(-1,1))),
                  columns=['attr_{}'.format(i) for i in range(X.shape[1])] + ['target'])
df.target = df.target.astype(int)
df.to_csv('stream.csv')
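One caveat worth noting: to_csv writes the row index as an extra first column by default, which a file reader such as FileStream would then treat as a feature; passing index=False avoids this. A quick round-trip check (using a temp file, with stand-in random data rather than the generator output):

```python
import os
import tempfile
import numpy as np
import pandas as pd

# Stand-in for the generated stream data
X = np.random.random(size=(100, 3))
y = np.random.randint(2, size=100)
df = pd.DataFrame(np.hstack((X, y.reshape(-1, 1))),
                  columns=['attr_{}'.format(i) for i in range(X.shape[1])] + ['target'])
df.target = df.target.astype(int)

# index=False keeps the row index out of the file, so every CSV column is a real attribute
path = os.path.join(tempfile.mkdtemp(), 'stream.csv')
df.to_csv(path, index=False)

df_back = pd.read_csv(path)
```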