from skmultiflow.data import FileStream
stream = FileStream("./src/skmultiflow/data/datasets/covtype.csv")
stream.prepare_for_use()
stream.n_classes # Output: 7
stream.target_values # Output: [1, 2, 3, 4, 5, 6, 7]
Thanks @jacobmontiel. I ran exactly the same code as you, but the error still occurred. However, after I re-downloaded the data, the error went away, so maybe the data file was not saved properly the first time. Thanks for your reply.
Glad to hear that it is working. It is strange that the error just went away; in any case, we will keep it in mind in case somebody else gets the same error.
Hi @jacobmontiel, I want to know whether all the Concept Drift Detection methods in skmultiflow.drift_detection only support 1-D data streams. For example, using a 2-D data stream (size=[2000,5]) in the following code raises an error.
import numpy as np
from skmultiflow.drift_detection import PageHinkley
ph = PageHinkley()
data_stream = np.random.randint(2, size=[2000,5])
for i in range(999, 2000):
    data_stream[i] = np.random.randint(4, high=8,size=5)
for i in range(2000):
    ph.add_element(data_stream[i])
    if ph.detected_change():
        print('Change has been detected in data: ' + str(data_stream[i]) + ' - of index: ' + str(i))
import numpy as np
from skmultiflow.drift_detection import PageHinkley
# 1-D stream: 1000 values drawn from {0, 1}, then 1000 values drawn from {0, 1, 2, 3}
data_stream = np.concatenate((np.random.randint(2, size=1000), np.random.randint(4, size=1000)))
ph = PageHinkley()
for i, val in enumerate(data_stream):
    ph.add_element(val)
    if ph.detected_change():
        print('Change has been detected in data: ' + str(data_stream[i]) + ' - of index: ' + str(i))
        ph.reset()
skmultiflow.data.DataStream
sklearn.neighbors.KernelDensity alongside scikit-multiflow. As you mention, the sklearn implementation works on batches of data, so if you want to update the densities you have to define a data update strategy. This is very similar to how the KNNClassifier is implemented: you will see there that the data is stored in a sliding window. Regarding drift detection, ADWIN, like all other drift detectors, takes 1-dimensional data as input. You can check the KNNADWINClassifier, which uses ADWIN to monitor the classification performance of the basic KNN model. If ADWIN detects a change in classification performance, the model is reset.
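Since the detectors take one value at a time, one way to handle a multi-dimensional stream is to keep a separate detector per feature. The sketch below illustrates that idea in plain Python. `SimplePageHinkley` is a simplified stand-in written for this example (it only watches for increases in the mean) and is not the scikit-multiflow implementation; its parameter names and defaults, as well as `detect_per_feature`, are assumptions made for illustration.

```python
import random


class SimplePageHinkley:
    """Simplified Page-Hinkley test for an increase in the mean (illustrative only)."""

    def __init__(self, min_instances=30, delta=0.005, threshold=50.0):
        self.min_instances = min_instances
        self.delta = delta
        self.threshold = threshold
        self.reset()

    def reset(self):
        self.n = 0
        self.mean = 0.0
        self.cum_sum = 0.0
        self.min_cum_sum = 0.0

    def update(self, x):
        """Consume one scalar value; return True if a change is flagged."""
        self.n += 1
        self.mean += (x - self.mean) / self.n          # running mean
        self.cum_sum += x - self.mean - self.delta     # cumulative deviation
        self.min_cum_sum = min(self.min_cum_sum, self.cum_sum)
        if self.n < self.min_instances:
            return False
        return self.cum_sum - self.min_cum_sum > self.threshold


def detect_per_feature(rows, n_features):
    """Run one detector per column; return (row_index, feature_index) alarms."""
    detectors = [SimplePageHinkley() for _ in range(n_features)]
    alarms = []
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if detectors[j].update(value):
                alarms.append((i, j))
                detectors[j].reset()
    return alarms


# Same shape as the 2-D example in the question: the distribution shifts at row 1000.
rng = random.Random(0)
rows = [[rng.randint(0, 1) for _ in range(5)] for _ in range(1000)]
rows += [[rng.randint(4, 7) for _ in range(5)] for _ in range(1000)]
alarms = detect_per_feature(rows, n_features=5)
print(alarms[:5])
```

With the library itself, the same pattern would mean creating one PageHinkley (or ADWIN) instance per column and feeding each of them a single value per sample.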
Hi @dossy , here is one
from skmultiflow.data import DataStream
import numpy as np
n_features = 10
n_samples = 50
X = np.random.random(size=(n_samples, n_features))
y = np.random.randint(2, size=n_samples)
stream = DataStream(data=X, y=y)
# stream.prepare_for_use() # if using the stable version (0.4.1)
stream.n_remaining_samples()
The last line returns 50.
numpy.ndarray
np.ndarray
as long as you define the index of the target column (last column by default). pandas.DataFrame objects are also supported, following the same indications.
scikit-multiflow is only a small part of the puzzle and there’s a lot of stuff you have to develop yourself around it?
I know this is a n00b question, but if I’m working with strings, I have to vectorize them first? I can’t just pass in a pandas.DataFrame containing strings - only real/int values?
All questions are welcomed. Currently, we only support numerical data. Your data must be pre-processed. As you mention, scikit-multiflow is focused on the learning part. The idea is that you can take it and integrate it as part of your workflow.
scikit-multiflow processes data one sample at a time. We provide the FileStream and DataStream classes for the case when you have data in a file or in memory. Both are extensions of the Stream class. If you want to read from a log file, you could process each log entry as it arrives (convert it to numerical values) and then pass it to a model. The operation to receive and process the log entry can be wrapped in an extension of the Stream class. The most relevant method is next_sample.
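As a rough illustration of the log-processing idea, here is a plain-Python sketch that converts each log entry to numerical values and hands out one sample at a time. It does not subclass the real Stream class, which has more methods than shown here. The log format ("LEVEL response_ms status_code"), the `LEVELS` mapping, and the names `parse_log_line` and `LogSampleSource` are all hypothetical.

```python
# Hypothetical mapping from log level to a numeric code.
LEVELS = {"DEBUG": 0, "INFO": 1, "WARNING": 2, "ERROR": 3}


def parse_log_line(line):
    """Turn a 'LEVEL response_ms status_code' line into (features, label)."""
    level, response_ms, status = line.split()
    features = [float(LEVELS[level]), float(response_ms)]
    label = int(int(status) >= 500)  # 1 if the request failed server-side
    return features, label


class LogSampleSource:
    """Hands out one numeric sample per log line, loosely mirroring next_sample."""

    def __init__(self, lines):
        self._iterator = iter(lines)
        self._buffered = next(self._iterator, None)

    def has_more_samples(self):
        return self._buffered is not None

    def next_sample(self):
        if self._buffered is None:
            raise RuntimeError("no more log entries")
        sample = parse_log_line(self._buffered)
        self._buffered = next(self._iterator, None)
        return sample


source = LogSampleSource(["INFO 120 200", "ERROR 950 503"])
while source.has_more_samples():
    X, y = source.next_sample()
    print(X, y)
```

In a real integration the lines would come from a file handle or a queue rather than a list, and each sample would likely need to be converted to the array shapes the model expects before being passed to partial_fit.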
Hi, I am building a HoeffdingTree classifier on a heavily imbalanced data stream (only ~1 in 1000 data points are of the positive class). Using the EvaluatePrequential
evaluator I am able to plot the precision and recall, however, the recall is extremely low as the model learns to predict the negative class almost always (only 50 positive predictions in my stream of 10 million data points).
Tree classifiers often give me class probabilities rather than discrete class outputs, and the actual recall (and precision) is of course threshold-dependent. Is there a way to control the threshold for which I am evaluating the recall?
You can get probabilities via the predict_proba method from the HoeffdingTree; however, there is currently no support for this in the EvaluatePrequential class. In this case you might want to try implementing the prequential evaluation process yourself. Something like this:
# Imports
from skmultiflow.data import SEAGenerator
from skmultiflow.trees import HoeffdingTreeClassifier
# Setting up a data stream
stream = SEAGenerator(random_state=1)
# Setup Hoeffding Tree estimator
ht = HoeffdingTreeClassifier()
# Setup variables to control loop and track performance
n_samples = 0
correct_cnt = 0
max_samples = 200
# Train the estimator with the samples provided by the data stream
while n_samples < max_samples and stream.has_more_samples():
    X, y = stream.next_sample()
    y_pred = ht.predict(X)
    if y[0] == y_pred[0]:
        correct_cnt += 1
    ht = ht.partial_fit(X, y)
    n_samples += 1
The metrics can be calculated using the ClassificationPerformanceEvaluator and WindowClassificationPerformanceEvaluator in the development branch.
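For the threshold question specifically, a hand-written loop makes it possible to turn the positive-class probability into a prediction with any cut-off and to track precision and recall incrementally. The sketch below shows only that bookkeeping in plain Python. `ThresholdedBinaryMetrics` is a hypothetical helper, not a scikit-multiflow class, and the probabilities are made-up values standing in for what predict_proba would return for the positive class.

```python
class ThresholdedBinaryMetrics:
    """Incremental precision/recall for a chosen probability threshold."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.tp = self.fp = self.fn = self.tn = 0

    def update(self, y_true, positive_proba):
        """Record one sample: true label (0/1) and P(class = 1)."""
        y_pred = 1 if positive_proba >= self.threshold else 0
        if y_pred == 1 and y_true == 1:
            self.tp += 1
        elif y_pred == 1 and y_true == 0:
            self.fp += 1
        elif y_pred == 0 and y_true == 1:
            self.fn += 1
        else:
            self.tn += 1

    def precision(self):
        predicted_pos = self.tp + self.fp
        return self.tp / predicted_pos if predicted_pos else 0.0

    def recall(self):
        actual_pos = self.tp + self.fn
        return self.tp / actual_pos if actual_pos else 0.0


# Made-up (label, probability) pairs standing in for model output.
observations = [(0, 0.02), (1, 0.30), (0, 0.10), (1, 0.08), (0, 0.25), (1, 0.60)]

default_metrics = ThresholdedBinaryMetrics(threshold=0.5)
lenient_metrics = ThresholdedBinaryMetrics(threshold=0.2)
for y_true, proba in observations:
    default_metrics.update(y_true, proba)
    lenient_metrics.update(y_true, proba)

print(default_metrics.precision(), default_metrics.recall())
print(lenient_metrics.precision(), lenient_metrics.recall())
```

Inside the prequential loop, `update` would be called before partial_fit, using the true label and the positive-class entry of the predict_proba output. Which column holds the positive class depends on the model's class ordering, so that index is an assumption to verify.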
@nuwangunasekara
Is there a way to achieve something similar to sklearn.preprocessing.OneHotEncoder() in scikit-multiflow in a streaming setting?
The StreamTransform class can be used as a base class to implement it. There are two main scenarios (with challenges) that I can see:
- If you know the number of distinct values in a nominal attribute, then it should be as simple as mapping each value to the corresponding binary attribute (a dict would help).
- If you don’t know the distinct values in the nominal attribute, this is more challenging. First, the mapping must be maintained dynamically. Second, if a new value appears, the length of the sample will change as a new binary attribute is added. This is complex, as it is not guaranteed that methods will support “emerging” attributes in this fashion.
I would explore the first scenario first as the second one seems more like a corner case.
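A minimal sketch of the first scenario, assuming the distinct values are known up front: a dict maps each value to its position in the block of binary attributes, and each sample is expanded one at a time. This is plain Python and does not subclass StreamTransform; the class name `KnownCategoryOneHotEncoder` and the method `transform_one` are hypothetical.

```python
class KnownCategoryOneHotEncoder:
    """One-hot encodes a single nominal column whose values are known in advance."""

    def __init__(self, column, categories):
        self.column = column
        # dict from nominal value to the index of its binary attribute
        self.index_of = {value: i for i, value in enumerate(categories)}

    def transform_one(self, sample):
        """Return a new list with the nominal column replaced by binary attributes."""
        one_hot = [0.0] * len(self.index_of)
        # An unseen value raises KeyError: the second scenario is not handled here.
        one_hot[self.index_of[sample[self.column]]] = 1.0
        return list(sample[:self.column]) + one_hot + list(sample[self.column + 1:])


encoder = KnownCategoryOneHotEncoder(column=1, categories=["red", "green", "blue"])
encoded = encoder.transform_one([0.5, "green", 3.0])
print(encoded)
```

Because the set of categories is fixed, every encoded sample has the same length, which avoids the "emerging" attribute problem described in the second scenario.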
Thanks for the tip @jacobmontiel !
# Imports
from skmultiflow.data import SEAGenerator
from skmultiflow.anomaly_detection import HalfSpaceTrees
# Setup a data stream
stream = SEAGenerator(random_state=1)
# Setup Half-Space Trees estimator
half_space_trees = HalfSpaceTrees(random_state=1, n_estimators=5)
# Pre-train the model with one sample
X, y = stream.next_sample()
half_space_trees.partial_fit(X, y)
# Setup variables to control loop and track performance
n_samples = 0
max_samples = 5000
anomaly_cnt = 0
# Train the estimator(s) with the samples provided by the data stream
while n_samples < max_samples and stream.has_more_samples():
    X, y = stream.next_sample()
    y_pred = half_space_trees.predict(X)
    if y_pred[0] == 1:
        anomaly_cnt += 1
    half_space_trees = half_space_trees.partial_fit(X, y)
    n_samples += 1
# Display results
print('{} samples analyzed.'.format(n_samples))
print('Half-Space Trees anomalies detected: {}'.format(anomaly_cnt))