Note that my data is very unbalanced in favor of class '7', which is the value of the root leaf of the VFDT. I am trying out other settings (increasing the split_confidence notably) with no luck.
Sorry for the delay @Adrien-Luxey. If your data is highly unbalanced then it is likely that the (root) node hasn’t seen enough information to split. You can play around with the parameters, but this is likely going to need some preprocessing of the data, as the Hoeffding Tree (like many algorithms) is not designed to handle imbalanced data. On the other hand, EFDT is rather lax in terms of splits, as its goal is to grow the tree fast.
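For context on why the root may never split: a Hoeffding Tree splits a node only when the gap between its two best split candidates exceeds the Hoeffding bound. The snippet below is a minimal sketch of that bound, assuming split_confidence plays the role of delta; the function name and the value range of 1.0 are illustrative choices, not taken from the library.

```python
import math


def hoeffding_bound(value_range, confidence, n_samples):
    """Hoeffding bound: epsilon = sqrt(R^2 * ln(1 / delta) / (2 * n)).

    value_range -- R, the range of the split metric (assumed 1.0 below)
    confidence  -- delta, the allowed probability of a wrong decision
    n_samples   -- n, the number of samples seen at the node
    """
    return math.sqrt(
        (value_range ** 2) * math.log(1.0 / confidence) / (2.0 * n_samples)
    )


# The bound shrinks slowly as the node sees more samples.
for n in (200, 2000, 20000):
    print(n, hoeffding_bound(1.0, 1e-7, n))

# A larger delta (i.e. a larger split_confidence) gives a smaller bound,
# so splits become easier to trigger.
print(hoeffding_bound(1.0, 1e-2, 200))
```

If almost all samples at the root belong to one class, the candidate splits tend to score very close to each other, so the gap can stay below the bound for a long time even with a relaxed confidence.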
Great ScipyConf talk, Jacob. Exploring the library today. I especially appreciate the feature map.
Thanks, and glad to hear the feature map is useful.
When creating a stream from a file, does it need to have a "class column"? I want to do unsupervised clustering/anomaly detection (so I don't have that column). After initializing a FileStream, stream.has_more_samples() immediately returns False.
This is a soft requirement that we need to improve. The class column is not a requirement since it won’t be used for training, although it can be used for validation. In this case you can extend your data with a dummy column, for example in pandas, and then use DataStream to load it. On the other hand, it is strange that you are getting False when calling stream.has_more_samples(). Are you getting any warning when calling FileStream? Can you provide an MWE to see what could be happening?
Instance instanc class implementation can provide the classValue when calling instanc.classValue(). I have the MOA code here: https://github.com/kushvarma/moa/blob/arf_re/moa/src/main/java/moa/classifiers/meta/AdaptiveRandomForestRE.java Lines 358 to 383 are the main logic behind ARF_RE. Here the value of classValue is implemented in com.yahoo.labs.samoa.instances. So I was looking for an alternative function, in case a similar kind of function is implemented in scikit_multiflow. This is my implementation in scikit: https://github.com/kushvarma/scikit-multiflow/blob/dm_arf/src/skmultiflow/meta/adaptive_random_forest_re.py lines 530 to 562.
classes parameter in the partial_fit method. If you are using a stream object then you can access the class values in the corresponding property stream.target_values. Notice that the user is not required to use a stream object and can manually set the list of class values to pass.
/usr/local/lib/python3.6/dist-packages/sklearn/linear_model/_stochastic_gradient.py in _partial_fit(self, X, y, alpha, C, loss, learning_rate, max_iter, classes, sample_weight, coef_init, intercept_init)(...) ValueError: The number of classes has to be greater than one; got 1 class
from skmultilearn.dataset import load_dataset
from sklearn.linear_model import Perceptron
from skmultiflow.metrics import hamming_score, exact_match, j_index
from skmultiflow.meta.multi_output_learner import MultiOutputLearner
from skmultiflow.data.data_stream import DataStream
from skmultiflow.evaluation.evaluate_prequential import EvaluatePrequential

X_stream, y_stream, feature_names, label_names = load_dataset('enron', 'undivided')
stream_original = DataStream(data=X_stream.todense(), y=y_stream.todense(), name="enron")
pretrain_samples = round(stream_original.n_remaining_samples() * 0.1)

classifier_br = MultiOutputLearner(Perceptron())
evaluator = EvaluatePrequential(
    show_plot=True,
    pretrain_size=pretrain_samples,
    metrics=["exact_match", "hamming_score", "hamming_loss", "running_time", "model_size"],
)
evaluator.evaluate(stream=stream_original, model=classifier_br)
Looking through skmultiflow.data, I'm not sure if I should be using DataStream or TemporalDataStream. It's audio data I'll be reading so I guess TemporalDataStream might work better?
TemporalDataStream relates each input to a timestamp. It is intended for evaluating scenarios where we expect an arbitrary delay in the label arrival (in the supervised learning setting).
The DataStream class should be enough in your case.
Hello everyone, I would like to use stacking and cross-validation with streaming, but I didn't find those methods in skmultiflow. Could you help me, please?
Hi @ilhem_salah_twitter. Currently we do not have stacking methods implemented in skmultiflow. The vanilla cross-validation scheme is intended for batch scenarios. I am not familiar with adaptations of it to streaming scenarios.
Hi @jacobmontiel , I have a use case where I am using HDDM_A for drift detection but rather than adding elements one by one to the detector, I want to add them in batches. Is there a way to add batches and check for warning or drifts after that? Any sort of input is appreciated. Thanks!
Hi @ankitk2109, I hope to be able to help you on behalf of @jacobmontiel.
skmultiflow's drift detection methods do not support the insertion of elements in (mini) batches. The usual usage is to add each element sequentially and check for drifts (or warnings) after each insertion
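As a workaround on the user side, a batch can simply be fed element by element while recording where warnings and drifts fire. The helper below is a minimal sketch; feed_batch and ToyDetector are hypothetical names written for this illustration. ToyDetector is only a stand-in that mimics the add_element / detected_change / detected_warning_zone methods of skmultiflow's detectors so the snippet runs without the library.

```python
def feed_batch(detector, batch):
    """Add the batch to the detector one element at a time.

    Returns two lists with the positions (within the batch) at which the
    detector reported a warning or a drift.
    """
    warnings, drifts = [], []
    for i, value in enumerate(batch):
        detector.add_element(value)
        if detector.detected_change():
            drifts.append(i)
        elif detector.detected_warning_zone():
            warnings.append(i)
    return warnings, drifts


class ToyDetector:
    """Hypothetical stand-in, NOT a skmultiflow class.

    Flags a change when a value is further than `threshold` from the
    running mean, then restarts its statistics. It never warns.
    """

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self._mean = 0.0
        self._n = 0
        self._change = False

    def add_element(self, value):
        self._change = self._n > 0 and abs(value - self._mean) > self.threshold
        if self._change:
            self._mean, self._n = value, 1  # restart after a drift
        else:
            self._n += 1
            self._mean += (value - self._mean) / self._n

    def detected_change(self):
        return self._change

    def detected_warning_zone(self):
        return False


batch = [0, 0, 0, 0, 1, 1, 1, 1]
warnings, drifts = feed_batch(ToyDetector(), batch)
print(warnings, drifts)
```

With the library installed, the same helper should accept an HDDM_A instance from skmultiflow.drift_detection in place of ToyDetector, since it only relies on those three methods.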
Hello, I'm just having a bit of trouble running the MultiOutputLearner moving_squares example. When I do the following:
stream = FileStream("https://raw.githubusercontent.com/scikit-multiflow/streaming-datasets/master/moving_squares.csv", 0, 6)
X, y = stream.next_sample(150)
I get X is
array([], shape=(150, 0), dtype=float64)
which means I can't run the rest of the example. Can anyone help?
Hi @Roonspoon, the problem in your example is probably related to the parameters you are passing to FileStream. The moving_squares dataset has two features and one target. In your call, n_targets=6. Please refer to the FileStream documentation for more details.
Hello @jacobmontiel , following our discussion this summer I started integrating scikit-multiflow in maki-nage.
As expected, integrating concept-drift detection as an operator was trivial. I did not test on real data yet, but unit tests run well. However, I have one question on ADWIN: on a unit test similar to the sample code from the documentation, 3 changes are detected while I expected only one. I will read the associated paper to understand the algorithm better, but maybe you have a quick explanation for this.
Concerning the other algorithms (classification, regression, anomaly detection), it seems that the simplest and most flexible way to integrate them is to integrate the different evaluation methods: they take scikit-multiflow model objects as input, so any algorithm of scikit-multiflow will be directly usable without explicit integration code. However, this will be more tricky because the evaluators do a lot of different things. I started with the prequential one and it does "many" things: it acts as a runner to consume the stream, does the actual prequential operations, maintains a running loss, plots the stream, supports multiple models... All these together make it not directly usable as a reactivex operator. For now I started implementing a prequential operator that does only the test/train operation. A consequence is that I cannot use the scikit-multiflow code; I need to rewrite it. I will continue investigating this and get more familiar with the inner workings of scikit-multiflow, but it is highly probable that the evaluators must be split into smaller parts so that I can reuse them. I think that this would also be true for integrating with other streaming frameworks.
An example of concept drift detection is here:
and the initial prequential implementation is here:
This is still preliminary work, but I am confident that I can feed scikit-multiflow models from maki-nage data quite easily.
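To make the "test/train only" idea concrete, here is a minimal sketch of a prequential (test-then-train) loop written as a plain generator, independent of any runner, plotting, or metric code. It is not the maki-nage or scikit-multiflow implementation. The prequential function and the MajorityClass model are hypothetical names for this illustration; the model only mimics the predict / partial_fit interface so the snippet runs on its own.

```python
def prequential(model, samples, classes):
    """Test-then-train: predict on each sample first, then train on it.

    Yields (true_label, predicted_label) pairs so that metrics, plotting
    or any other consumer can be plugged in downstream.
    """
    for x, y in samples:
        y_pred = model.predict([x])[0]
        model.partial_fit([x], [y], classes=classes)
        yield y, y_pred


class MajorityClass:
    """Hypothetical toy model: always predicts the most frequent label."""

    def __init__(self):
        self.counts = {}

    def partial_fit(self, X, y, classes=None):
        for label in y:
            self.counts[label] = self.counts.get(label, 0) + 1
        return self

    def predict(self, X):
        if not self.counts:
            return [0 for _ in X]  # arbitrary default before any training
        best = max(self.counts, key=self.counts.get)
        return [best for _ in X]


samples = [([0.1], 1), ([0.2], 1), ([0.3], 0), ([0.4], 1)]
results = list(prequential(MajorityClass(), samples, classes=[0, 1]))
accuracy = sum(y == p for y, p in results) / len(results)
print(results, accuracy)
```

Real scikit-multiflow models expect NumPy arrays rather than Python lists, so the inputs would need converting before being passed to predict and partial_fit.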
In my project, I'm using Scikit-Multiflow 0.5.3 with Python 3.6. I tried to read the KDD Cup99 dataset with the FileStream method. But for the label column, which is text-based (includes "Normal", "Teardrop", etc.), scikit-multiflow gives me the error "dtype: objectscikit-multiflow only supports numeric data.". I tried to set the label column as target_idx, and also tried setting that column as categories, but none of them worked. How can I make scikit-multiflow work for this dataset?
The full error looks like this:
File "<projdir>/site-packages/skmultiflow/data/file_stream.py", line 99, in __init__
  self._prepare_for_use()
File "<projdir>/site-packages/skmultiflow/data/file_stream.py", line 177, in _prepare_for_use
  self._load_data()
File "<projdir>/site-packages/skmultiflow/data/file_stream.py", line 185, in _load_data
  check_data_consistency(raw_data, self.allow_nan)
File "<projdir>/site-packages/skmultiflow/data/data_stream.py", line 443, in check_data_consistency
  .format(raw_data_frame.dtypes))
ValueError: Non-numeric data found:
duration int64
src_bytes int64
dst_bytes int64
wrong_fragment int64
urgent int64
hot int64
num_failed_logins int64
num_compromised int64
root_shell int64
su_attempted int64
num_root int64
num_file_creations int64
num_shells int64
num_access_files int64
num_outbound_cmds int64
count int64
srv_count int64
serror_rate float64
srv_serror_rate float64
rerror_rate float64
srv_rerror_rate float64
same_srv_rate float64
diff_srv_rate float64
srv_diff_host_rate float64
dst_host_count int64
dst_host_srv_count int64
dst_host_same_srv_rate float64
dst_host_diff_srv_rate float64
dst_host_same_src_port_rate float64
dst_host_srv_diff_host_rate float64
dst_host_serror_rate float64
dst_host_srv_serror_rate float64
dst_host_rerror_rate float64
dst_host_srv_rerror_rate float64
label object
dtype: objectscikit-multiflow only supports numeric data.
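Since the error says only numeric data is supported, one possible workaround (not confirmed by the maintainers in this thread) is to convert the text labels to integer codes before building the stream. The sketch below shows that encoding step in plain Python. encode_labels is a hypothetical helper and the sample labels are made up for illustration.

```python
def encode_labels(labels):
    """Map each distinct text label to an integer code.

    Codes are assigned in order of first appearance. Returns the encoded
    list and the label-to-code mapping (needed to decode predictions).
    """
    mapping = {}
    codes = []
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
        codes.append(mapping[label])
    return codes, mapping


# Made-up sample of a text label column.
labels = ["Normal", "Teardrop", "Normal", "Smurf"]
codes, mapping = encode_labels(labels)
print(codes)
print(mapping)
```

The encoded column could then replace the text column in the CSV, or in the in-memory data, before the stream is created.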
random_state work as expected in the online learning context.
fit, isn't it?
BaseEstimator structure like you do (super practical for cross validation), but I'm struggling with an ensemble estimator (that should use its random_state to fix the base estimators'). Any advice? Thanks, great library :)
HoeffdingTreeClassifier does not have a random_state parameter, for instance. So how do you ensure experiments' reproducibility? Is