Jacob
@jacobmontiel
Delayed samples are not yet supported (although we are working on it)
In this case, you would need to implement the prequential evaluation loop so you have control over the time when labels are “available”
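A minimal sketch of such a manual loop, assuming labels arrive a fixed number of samples late (the stream, model, delay, and sample count below are illustrative choices, not part of the original discussion):

```python
from collections import deque

from skmultiflow.data import SEAGenerator
from skmultiflow.trees import HoeffdingTreeClassifier

stream = SEAGenerator(random_state=1)
model = HoeffdingTreeClassifier()

DELAY = 1000       # hypothetical: labels become "available" 1000 samples late
pending = deque()  # samples waiting for their label to arrive
correct = seen = 0

while seen < 100000 and stream.has_more_samples():
    X, y = stream.next_sample()
    # Test first: predict while the true label is still unavailable
    y_pred = model.predict(X)
    correct += int(y_pred[0] == y[0])
    seen += 1
    pending.append((X, y))
    # Train only once the label has arrived, DELAY samples later
    if len(pending) > DELAY:
        X_old, y_old = pending.popleft()
        model.partial_fit(X_old, y_old, classes=stream.target_values)

print(f"Prequential accuracy under delay: {correct / seen:.4f}")
```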

I'm not sure what the author means by window size.....

If the authors used MOA, my guess is that they used that window to measure “current” performance, although I can’t tell for sure

barnettjv
@barnettjv
Thank you. Yes he uses MOA
barnettjv
@barnettjv
Hi again Jacob, I'm hoping you can tell me if the following EvaluatePrequential statement will delay the labels by 1000 timesteps....
eval = EvaluatePrequential(max_samples=1000000, pretrain_size=0, output_file="VFDT_v1_20.txt")
sorry, forgot the n_wait=1000
laoheico3
@laoheico3
Hi again Jacob, when I use FileStream, it's initialized like this: stream_train = FileStream(r"E:\Fracturing forecasting_scikit multiflow\data\Methods_2\train2.csv", cat_features=[1, 2])
Does cat_features mean setting the indexes of the features? It didn't seem to work, and I'm a little confused about this property.
barnettjv
@barnettjv
eval = EvaluatePrequential(n_wait=1000, max_samples=1000000, pretrain_size=0, output_file="VFDT_v1_20.txt")

@laoheico3 Hi, I'm no expert (yet), but perhaps it would help to retrieve the attributes of the FileStream after the assignment statement (see the sketch after this list). From the docs:

cat_features_idx: Get the list of categorical feature indexes.
feature_names: Retrieve the names of the features.
n_cat_features: Retrieve the number of categorical features.
n_features: Retrieve the total number of features.
n_num_features: Retrieve the number of numerical features.
n_targets: Get the number of targets.
target_idx: Get the index of the column where the target (y) begins.
target_names: Retrieve the names of the targets.
target_values: Retrieve all target_values in the stream for each target.
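For instance, a quick way to inspect those attributes right after the assignment (the CSV path is a placeholder; in older scikit-multiflow versions you may also need to call stream.prepare_for_use() first):

```python
from skmultiflow.data import FileStream

# Placeholder path; cat_features takes the column indexes of the
# categorical features (here columns 1 and 2, as in the question above)
stream = FileStream('train2.csv', cat_features=[1, 2])

print(stream.cat_features_idx)  # list of categorical feature indexes
print(stream.feature_names)     # names of all features
print(stream.n_cat_features)    # number of categorical features
print(stream.n_features)        # total number of features
print(stream.target_idx)        # column where the target begins
```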

Jacob
@jacobmontiel

eval = EvaluatePrequential(max_samples=1000000, pretrain_size=0, n_wait=1000, output_file="VFDT_v1_20.txt")

Short answer: no. The n_wait parameter sets the window size used to measure current performance. We track global performance (from the beginning of the stream) and current performance (over the last n_wait samples seen)
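So a run like the one above reports both metrics; a minimal sketch (the stream and model here are placeholder choices):

```python
from skmultiflow.data import SEAGenerator
from skmultiflow.evaluation import EvaluatePrequential
from skmultiflow.trees import HoeffdingTreeClassifier

stream = SEAGenerator(random_state=1)
model = HoeffdingTreeClassifier()

# n_wait only sets the sliding-window size for the "current" metrics;
# it does not delay label availability
evaluator = EvaluatePrequential(max_samples=1000000,
                                pretrain_size=0,
                                n_wait=1000,
                                metrics=['accuracy'],
                                output_file='VFDT_v1_20.txt')
evaluator.evaluate(stream=stream, model=model)
```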

Does cat_features mean setting the indexes of the features? It didn't seem to work, and I'm a little confused about this property.

Yes, cat_features should set the property in your stream. What does it return when you run stream_train.cat_features_idx?

@laoheico3 Hi, I'm no expert (yet), but perhaps it would help to retrieve the attributes of the FileStream after the assignment statement.

Both options should work ;-)

laoheico3
@laoheico3
@jacobmontiel Wow, the run returns [1, 2], which is the value of the cat_features I set. When I run stream_train.feature_names, all the features in the data are returned. I see. Thank you!!!
barnettjv
@barnettjv
Hi Jacob, would you have any idea why ARF is taking 1 hour and 20 minutes to classify 1 million records? MOA takes about 7 minutes over the same data for ARF.
Jacob
@jacobmontiel
There are two things to consider:
  1. The default parameters in MOA were changed, so there might be differences between the two sets of default parameters.
  2. Python code is slower than Java code. This comes down to core differences between the languages.
We are planning to re-implement parts of the code in Cython at some point in the near future.
barnettjv
@barnettjv
@jacobmontiel Hi Jacob. I'm trying to understand how EvaluatePrequential is impacted when SEAGen (or any other data generator) inserts noise into the stream. I'm doing a comparison of different classifiers and need to show the impact of noise. So I have two files each with 1 million SEAGen data points and the associated label. One with 0% noise and the other with 20% noise. Do you have any recommendations on how I can show the impact of introducing noise on the accuracy of ARF (or any other classifier)?
Jacob
@jacobmontiel
EvaluatePrequential is only in charge of managing the flow of data from the stream to the estimator and getting predictions from the estimator. Regarding measuring the impact of noise, you can run experiments with your 0% noise data (baseline) and with the noisy data. Then you can compare the performance of models trained on noisy data against the baseline. Notice that some models like ARF (ensemble methods) are known to be robust against noise, so it might be hard to notice the impact if only one data set is used. Notice also that the SEA stream is a rather simple stream with only 3 features (2 relevant, 1 irrelevant), so noise in the irrelevant attribute should not impact performance. I would recommend using multiple data sets and levels of noise.
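A sketch of that baseline-vs-noise comparison, using SEAGenerator's noise_percentage parameter (sample size, seeds, and noise levels are arbitrary; AdaptiveRandomForestClassifier is the class name in recent scikit-multiflow versions):

```python
from skmultiflow.data import SEAGenerator
from skmultiflow.evaluation import EvaluatePrequential
from skmultiflow.meta import AdaptiveRandomForestClassifier

# Run the same learner on a clean baseline and on noisy variants,
# then compare the reported accuracies
for noise in (0.0, 0.1, 0.2):
    stream = SEAGenerator(noise_percentage=noise, random_state=1)
    model = AdaptiveRandomForestClassifier()
    evaluator = EvaluatePrequential(max_samples=100000,
                                    pretrain_size=0,
                                    n_wait=1000,
                                    metrics=['accuracy'])
    print(f"--- noise = {noise:.0%} ---")
    evaluator.evaluate(stream=stream, model=model, model_names=['ARF'])
```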
barnettjv
@barnettjv
@jacobmontiel I actually have 8 other sources of data (Hyperplane, LED, Electricity, etc.), so here are the results of 0% noise vs 20% noise with ARF...
@jacobmontiel #################### [100%] [5213.42s]
Processed samples: 999996
Mean performance:
AdaptiveRandomForest - Accuracy : 0.9954
AdaptiveRandomForest - Training time (s) : 4273.26
AdaptiveRandomForest - Testing time (s) : 804.97
AdaptiveRandomForest - Total time (s) : 5078.23
AdaptiveRandomForest - Size (kB) : 10381.8223
and now the 20% noise....
Processed samples: 999996
Mean performance:
AdaptiveRandomForest - Accuracy : 0.8028
AdaptiveRandomForest - Training time (s) : 14063.29
AdaptiveRandomForest - Testing time (s) : 1177.81
AdaptiveRandomForest - Total time (s) : 15241.10
AdaptiveRandomForest - Size (kB) : 126046.4863
Jacob
@jacobmontiel

@jacobmontiel I actually have 8 other sources of data (Hyperplane, LED, Electricity, etc.), so here are the results of 0% noise vs 20% noise with ARF...

:+1:

Those results are interesting
Notice also that noisy data results in longer training time and larger models (in memory), which is expected :-)
barnettjv
@barnettjv
(two image attachments)
@jacobmontiel sorry, not entirely used to Gitter
Jacob
@jacobmontiel
Looks good in my opinion, although I would consider intermediate noise levels (5%, 10%, 20%) to show the impact on performance (the decrease)
although that would depend on the amount of resources you have and the time to run all the experiments
barnettjv
@barnettjv
@jacobmontiel Thank you. I think your suggestion is a good one. I'll try to get them in and will post the results here.
Jacob
@jacobmontiel
Nice :+1:
barnettjv
@barnettjv
@jacobmontiel Hi Jacob, it appears that the accuracy is reduced by nearly the exact amount of the noise (i.e. 5% noise leads to ~95% accuracy, 10% equates to ~90% accuracy, and so on). Does this seem right? Also, can you help me understand better how exactly the evaluator uses the stream? I know the docs say that samples serve two purposes (i.e. test and train), but how is it testing? And then how is it training?
Jacob
@jacobmontiel

it appears that the accuracy is reduced by nearly the exact amount of the noise (i.e. 5% noise leads to ~95% accuracy, 10% equates to ~90% accuracy and so on)

This might change depending on the estimator used; other than that it seems reasonable (I assume the noise is generated from a normal distribution)
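As a back-of-the-envelope check (assuming the generator's noise flips class labels with probability p, which is how SEA-style noise is usually implemented): if the model still predicts the underlying clean concept almost perfectly, the measured accuracy is roughly (1 - p) * 1 + p * 0 = 1 - p, i.e. ~95% at 5% noise and ~90% at 10% noise, matching the pattern described above.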

But how is it testing? And then how is it training?

The data from the stream is first used to get a prediction (test(X)), and the predicted value is compared against the true value to track the performance of the estimator. Then the same sample is used to train the estimator (train(X, y))

we must perform the test before using the data for training :-)
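In loop form, that test-then-train order amounts to the following sketch (generator, model, and sample count are arbitrary):

```python
from skmultiflow.data import SEAGenerator
from skmultiflow.trees import HoeffdingTreeClassifier

stream = SEAGenerator(random_state=1)
model = HoeffdingTreeClassifier()
correct = seen = 0

for _ in range(10000):
    X, y = stream.next_sample()
    y_pred = model.predict(X)  # 1) test: predict before seeing the label
    correct += int(y_pred[0] == y[0])
    seen += 1
    model.partial_fit(X, y, classes=stream.target_values)  # 2) then train on the same sample

print(f"Prequential accuracy: {correct / seen:.4f}")
```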
barnettjv
@barnettjv
@jacobmontiel The noise parameter was used with the SEAGenerator when building the stream. You mention "This might change depending on the estimator used". I didn't realize that we had a choice of picking estimators; I thought it was done internally by the EvaluatePrequential class. Also, regarding the testing/training time metrics in my post a couple of days earlier, the EvaluatePrequential class performed the test first, taking 1177.81 seconds, and then trained for 14063.29 seconds. Do I have this correct? After training, can I send it another stream to be processed using the "trained" classifier? So many questions, my apologies for having this many. It's just that my dissertation is due in a few weeks....
Jacob
@jacobmontiel
When you run evaluator.evaluate() you can pass the estimator (classifier) you want to use; it could be ARF, Naive Bayes, Hoeffding Tree, etc. As mentioned earlier, EvaluatePrequential is just in charge of managing the flow of data and the order of the test and train steps.
Yes, the times are correct; as expected, training takes more time since it is updating the model

After training, can I send it another stream to be processed using the "trained" classifier?

Yes, there is a way to run another evaluation task with a previously trained model. You must first make sure to set the parameter restart_stream=False in EvaluatePrequential. This way the first evaluate call will train the model without restarting it at the end. If you then call evaluate again with a new stream and the same model, the model will continue learning.
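A sketch of that two-stage setup (streams and sizes are placeholders; the key detail is passing the same model object to both evaluate calls):

```python
from skmultiflow.data import SEAGenerator
from skmultiflow.evaluation import EvaluatePrequential
from skmultiflow.meta import AdaptiveRandomForestClassifier

model = AdaptiveRandomForestClassifier()

# Stage 1: train/evaluate on the first stream without resetting at the end
stream_a = SEAGenerator(noise_percentage=0.0, random_state=1)
eval_a = EvaluatePrequential(max_samples=100000, restart_stream=False)
eval_a.evaluate(stream=stream_a, model=model)

# Stage 2: evaluate the same, already-trained model on a new stream;
# being prequential, it will also keep learning from the new data
stream_b = SEAGenerator(noise_percentage=0.2, random_state=2)
eval_b = EvaluatePrequential(max_samples=100000)
eval_b.evaluate(stream=stream_b, model=model)
```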

Jacob
@jacobmontiel
In this case you must use different streams. Using the same stream is incorrect since the model has already “seen” and learned from that data.
barnettjv
@barnettjv
@jacobmontiel Jacob, something looks odd with my results. I wasn't expecting to see training time for the unseen-stream evaluation run.... here is the code.
(image attachments: ARFnoise20.png, result.png)
and here is the result...
@jacobmontiel My intention was not to have the classifier retrain on the second stream. Did I code this correctly?
barnettjv
@barnettjv
@jacobmontiel Jacob, if I don't want the model to continue learning, should I set restart_stream=True or False?
barnettjv
@barnettjv
@jacobmontiel I noticed that regardless of whether I set restart_stream=False or restart_stream=True, both evals show training time.