barnettjv
@barnettjv

@laoheico3 Hi, I'm no expert (yet), but perhaps it would help if you retrieve the attributes of the FileStream after the assignment statement. .... Attributes

cat_features_idx: Get the list of the categorical features' indices.

feature_names: Retrieve the names of the features.

n_cat_features: Retrieve the number of categorical features.

n_features: Retrieve the number of features.

n_num_features: Retrieve the number of numerical features.

n_targets: Get the number of targets.

target_idx: Get the index of the column where y begins.

target_names: Retrieve the names of the targets.

target_values: Retrieve all target_values in the stream for each target.

Jacob
@jacobmontiel

eval = EvaluatePrequential(max_samples=1000000, pretrain_size=0, n_wait=1000, output_file="VFDT_v1_20.txt")

Short answer: no. The n_wait parameter sets the window size used to measure current performance. We track global performance (from the beginning of the stream) and current performance (over the last n_wait samples seen)

Does cat_features mean to set the index of the feature? But it didn't work, and I was a little confused about the property.

Yes, cat_features should set the property in your stream. What does it return when you run stream_train.cat_features_idx?

@laoheico3 Hi, I'm no expert (yet), but perhaps it would help if you retrieve the attributes of the FileStream after the assignment statement. .... Attributes

Both options should work ;-)

laoheico3
@laoheico3
@jacobmontiel Wow, the run returns [1, 2], which is the value of the cat_features I set. When I run stream_train.feature_names, all the features in the data are returned. I see. Thank you!!!
barnettjv
@barnettjv
Hi Jacob, would you have any idea why ARF is taking 1 hour and 20 minutes to classify 1 million records? MOA takes about 7 minutes over the same data for ARF.
Jacob
@jacobmontiel
There are two things to consider:
  1. The default parameters were changed with respect to MOA, so the two implementations may not be running with the same settings
  2. Python code is slower than Java code. This comes down to core differences between the languages.
We are planning to re-implement parts of the code in Cython at some point in the near future.
barnettjv
@barnettjv
@jacobmontiel Hi Jacob. I'm trying to understand how EvaluatePrequential is impacted when SEAGen (or any other data generator) inserts noise into the stream. I'm doing a comparison of different classifiers and need to show the impact of noise. So I have two files each with 1 million SEAGen data points and the associated label. One with 0% noise and the other with 20% noise. Do you have any recommendations on how I can show the impact of introducing noise on the accuracy of ARF (or any other classifier)?
Jacob
@jacobmontiel
EvaluatePrequential is only in charge of managing the flow of data from the stream to the estimator and getting predictions from the estimator. Regarding measuring the impact of noise, you can run experiments with your 0% noise data (baseline) and with noisy data. Then you can compare the performance of models trained on noisy data vs the baseline. Notice that some models like ARF (ensemble methods) are known to be robust against noise, so it might be hard to notice the impact if only one data set is used. Also notice that the SEA stream is a rather simple stream with only 3 features (2 relevant, 1 not relevant), so noise in the not-relevant attribute should not impact the performance. I would recommend using multiple data sets and levels of noise.
barnettjv
@barnettjv
@jacobmontiel I actually have 8 other sources of data (Hyperplane, LED, Electricity, etc.). So here are the results of 0% noise vs 20% noise with ARF......
@jacobmontiel #################### [100%] [5213.42s]
Processed samples: 999996
Mean performance:
AdaptiveRandomForest - Accuracy : 0.9954
AdaptiveRandomForest - Training time (s) : 4273.26
AdaptiveRandomForest - Testing time (s) : 804.97
AdaptiveRandomForest - Total time (s) : 5078.23
AdaptiveRandomForest - Size (kB) : 10381.8223
and now the 20% noise....
Processed samples: 999996
Mean performance:
AdaptiveRandomForest - Accuracy : 0.8028
AdaptiveRandomForest - Training time (s) : 14063.29
AdaptiveRandomForest - Testing time (s) : 1177.81
AdaptiveRandomForest - Total time (s) : 15241.10
AdaptiveRandomForest - Size (kB) : 126046.4863
Jacob
@jacobmontiel

@jacobmontiel I actually have 8 other sources of data (Hyperplane, LED, Electricity, etc.). So here are the results of 0% noise vs 20% noise with ARF......

:+1:

Those results are interesting
notice also that noisy data results in longer training time and larger models (in memory), which is expected :-)
barnettjv
@barnettjv
(two screenshots attached: image.png, image.png)
@jacobmontiel sorry, not entirely used to Gitter
Jacob
@jacobmontiel
looks good in my opinion, although I would consider intermediate noise levels (5, 10, 20) to potentially show the impact on performance (decrement)
although that would depend on the amount of resources you have and the time to run all the experiments
barnettjv
@barnettjv
@jacobmontiel Thank you. I think your suggestion is a good one. I'll try and get them in and will post on here the results.
Jacob
@jacobmontiel
Nice :+1:
barnettjv
@barnettjv
@jacobmontiel Hi Jacob, it appears that the accuracy is reduced by nearly the exact amount of the noise (i.e. 5% noise leads to ~95% accuracy, 10% equates to ~90% accuracy, and so on). Does this seem right? Also, can you help me understand better how exactly the evaluator uses the stream? I know the docs say that it serves two purposes (i.e. test and train), but how is it testing? And then how is it training?
Jacob
@jacobmontiel

it appears that the accuracy is reduced by nearly the exact amount of the noise (i.e. 5% noise leads to ~95% accuracy, 10% equates to ~90% accuracy and so on)

This might change depending on the estimator used; other than that it seems reasonable (I assume the noise is generated from a normal distribution)
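A quick sanity check on that observation, as a toy simulation in plain Python (not skmultiflow code; the function name is made up for illustration): if binary labels are flipped with probability p and the classifier would otherwise be perfect, the measured accuracy lands near 1 - p, which matches the ~95%/~90% figures above.

```python
import random

def noisy_accuracy(noise_rate, n_samples=10_000, seed=42):
    """Score a hypothetical perfect classifier against labels that were
    flipped with probability `noise_rate` (binary labels)."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        true_label = rng.randint(0, 1)
        # Noise flips the observed label with probability noise_rate
        observed = 1 - true_label if rng.random() < noise_rate else true_label
        # The perfect classifier predicts the true label, but it is scored
        # against the (possibly flipped) observed label
        correct += (true_label == observed)
    return correct / n_samples

for p in (0.05, 0.10, 0.20):
    print(p, round(noisy_accuracy(p), 3))  # each accuracy is close to 1 - p
```

Real classifiers sit below this ceiling, which is why the estimator choice still matters.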

but how is it testing? and then how is it training?

The data from the stream is first used to get a prediction (test(X)) and the predicted value is compared against the true value to track the performance of the estimator. Then the same sample is used to train the estimator (train(X, y))

we must perform the test before using the data for training :-)
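The test-then-train order described above can be sketched in plain Python. The MajorityClass learner and prequential helper below are toy stand-ins for illustration, not skmultiflow API:

```python
from collections import Counter

class MajorityClass:
    """Toy incremental classifier: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Before seeing any data, fall back to label 0
        return self.counts.most_common(1)[0][0] if self.counts else 0

    def partial_fit(self, x, y):
        self.counts[y] += 1

def prequential(stream, model):
    """Interleaved test-then-train: every sample is used for testing first."""
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:   # 1) test on the incoming sample
            correct += 1
        model.partial_fit(x, y)     # 2) only then train on that same sample
        total += 1
    return correct / total

# A simple stream: label 1 on every third sample, 0 otherwise
stream = [([i], 1 if i % 3 == 0 else 0) for i in range(100)]
print(prequential(stream, MajorityClass()))
```

Because the prediction happens before the update, every sample acts as unseen test data exactly once.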
barnettjv
@barnettjv
@jacobmontiel The noise parameter was used with the SEAGenerator when building the stream. You mention "This might change depending on the estimator used". I didn't realize that we had a choice of picking estimators. I thought it was done internally by the EvaluatePrequential class? Also, with respect to the Testing/Training time metrics in my post a couple of days earlier, the EvaluatePrequential class performed the test first, taking 1177.81 seconds, and then trained for 14063.29 seconds. Do I have this correct? After training, can I send it another stream to be processed using the "trained" classifier? So many questions, my apologies for having this many. It's just that my dissertation is due in a few weeks....
Jacob
@jacobmontiel
When you run evaluator.evaluate() you can pass the estimator (classifier) you want to use; it could be ARF, Naive Bayes, Hoeffding Tree, etc. As mentioned earlier, EvaluatePrequential is just in charge of managing the flow of data and the order of the test and train steps.
Yes, the times are correct, as expected training takes more time since it is updating the model

After training, can I send it another stream to be processed using the "trained" classifier?

Yes, there is a way to run another evaluation task with a previously trained model. You must first make sure to set the parameter restart_stream=False in EvaluatePrequential. This way the first evaluate call will train the model without restarting it at the end. If you call evaluate again with a new stream and the same model, the model will continue learning.

Jacob
@jacobmontiel
In this case you must use different streams. Using the same stream is incorrect since the model has already "seen" and learned from that data.
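The idea of one model continuing to learn across two evaluation runs can be mimicked with a toy incremental model (plain Python, illustrative only; real skmultiflow models take X and y in partial_fit):

```python
class RunningMean:
    """Toy incremental model: its state (a running mean of the targets)
    persists across streams unless it is explicitly re-created."""
    def __init__(self):
        self.total = 0.0
        self.n = 0

    def partial_fit(self, y):
        self.total += y
        self.n += 1

    def predict(self):
        return self.total / self.n if self.n else 0.0

model = RunningMean()

for y in [1.0] * 100:        # first "evaluation": a stream of ones
    model.partial_fit(y)
print(model.predict())       # 1.0 after the first stream

for y in [0.0] * 100:        # second evaluation, same model, new stream
    model.partial_fit(y)
print(model.predict())       # 0.5: the model kept learning
```

Feeding the first stream in again would just double-count data the model has already absorbed, which is why a fresh stream is required.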
barnettjv
@barnettjv
@jacobmontiel Jacob, something looks odd with my results. I wasn't expecting to see training time for the unseen stream evaluation run.... here is the code.
(screenshots attached: ARFnoise20.png, result.png)
@barnettjv and here is the result...
@jacobmontiel My intention was not to have the classifier retrain on the second stream. Did I code this correctly?
barnettjv
@barnettjv
@jacobmontiel Jacob, if I don't want the model to continue learning, should I set restart_stream=True or False?
barnettjv
@barnettjv
@jacobmontiel I noticed that regardless of whether I set restart_stream=False or restart_stream=True, both evals show training time.
barnettjv
@barnettjv
@jacobmontiel it just dawned on me that perhaps all of your classifiers continuously learn by their very design? Is this the same with the classifiers in MOA?
laoheico3
@laoheico3
@jacobmontiel Hi Jacob, I had another problem. I wanted to predict all the target values in the 20 time steps after time t, and when I used EvaluatePrequential, the parameters in it didn't seem to do the job. Can I only do this by changing the source code? I hope you can give me some help with this problem. Thank you
Jacob
@jacobmontiel
@barnettjv Sorry for the delay. The evaluators will always perform training, regardless of whether the model is new or has been pre-trained. There is no mechanism to disable the training phase.

@jacobmontiel Jacob, if I don't want the model to continue learning, should I set restart_stream=True or False?

This parameter is not intended to be used like that. This parameter indicates if the stream should be restarted (True) or not (False) at the end of the evaluation. Either way, after the evaluation the model instance will remain in the last state from the evaluation. You can either continue training or use it only to get predictions. That is up to you to define (and code). However, as mentioned earlier, EvaluatePrequential always performs both testing and training.
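Since the evaluators always train, the only way to use a trained model purely for predictions is to drive it yourself: call predict and simply never call partial_fit. A toy sketch of that idea in plain Python (not skmultiflow API):

```python
from collections import Counter

class CountingModel:
    """Toy incremental classifier whose entire state is a label counter."""
    def __init__(self):
        self.counts = Counter()

    def partial_fit(self, y):
        self.counts[y] += 1

    def predict(self):
        return self.counts.most_common(1)[0][0] if self.counts else None

model = CountingModel()
for y in [0, 0, 1, 0]:       # training phase: update the model normally
    model.partial_fit(y)

before = dict(model.counts)
# Prediction-only phase: partial_fit is never called, so nothing is learned
preds = [model.predict() for _ in range(5)]
assert dict(model.counts) == before   # model state is untouched
print(preds)                          # [0, 0, 0, 0, 0]
```

A hand-rolled predict loop like this sidesteps the evaluator entirely, at the cost of computing any performance metrics yourself.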

Jacob
@jacobmontiel

@jacobmontiel Hi Jacob, I had another problem. I wanted to predict all the target values in the 20 time steps after time t, and when I used EvaluatePrequential, the parameters in it didn't seem to do the job. Can I only do this by changing the source code? I hope you can give me some help with this problem. Thank you

This is not currently supported by EvaluatePrequential; we are working on a new feature for this case, but it might take some time until it is available. In the meantime the best option is, as you mention, to implement it manually. You can take a look into PR #222 for reference :-)

barnettjv
@barnettjv
@jacobmontiel Hi again Jacob, I've made a lot of progress since my last post. Quick question (hopefully) regarding the VFDT, ARF, DWM, and LevBag classifiers: the default training is immediate, right? Is there a way to put in a delay before the classifier gets access to the label during training?
Jacob
@jacobmontiel
not in the release version, we are currently working on the development of such functionality
nuwangunasekara
@nuwangunasekara
Hi guys,
can someone please explain to me what is meant by 'model_size' in the skmultiflow.evaluation.EvaluatePrequential() 'metrics' parameter?
Is it something similar to MOA's model cost (RAM-Hours)?
Jacob
@jacobmontiel
Yes, in the sense that it is a way to track the amount of memory used.
In our case it refers to the size of the model in memory.
nuwangunasekara
@nuwangunasekara
cool! Thanks @jacobmontiel !