barnettjv
@barnettjv
data file link above
barnettjv
@barnettjv
It appears that the issue is either how SEAWrite is saving the stream to the csv file or how ARFClassifier is reading from the file and assigning it to the stream for the classifier. What is interesting is that I wrote an MWE that took out the file I/O portion, which does not give me the warning, but I was surprised to see that both versions (file I/O vs. direct) give the same accuracy results. Also, the file I/O version seems to be faster. Would you be able to take a look at it and see what I'm doing wrong?
barnettjv
@barnettjv
The example above uses the VFDTClassifier; there is no difference between it and the ARFClassifier example aside from which classifier is being used.
barnettjv
@barnettjv
After much debugging, it appears that the extra newline at the end of the csv file (required by POSIX) is throwing EvalHoldout off at the end.
It does not affect the result
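A hedged workaround sketch for the trailing-line issue described above: normalize the CSV so it ends with exactly one newline before handing it to the evaluator. The file contents below are made up for demonstration, not the actual SEA stream file.

```python
import os
import tempfile

# Create a demo CSV ending with a spurious extra blank line
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w") as f:
    f.write("attr0,attr1,class\n1,2,0\n3,4,1\n\n")

# Rewrite the file so it ends with exactly one trailing newline
with open(path) as f:
    text = f.read()
with open(path, "w") as f:
    f.write(text.rstrip("\n") + "\n")

with open(path) as f:
    cleaned = f.read()
print(cleaned.endswith("\n\n"))  # False
os.remove(path)
```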
Jacob
@jacobmontiel
Hi @barnettjv
thanks for the examples and the further analysis
I agree that the results seem fine; the warning seems to be coming from a mishandled corner case
sorry for the delay in my answer. While I was reviewing the code I actually found a bug that impacts (as far as I can see) some variants of the ARF. I will create an issue with a clear explanation.
barnettjv
@barnettjv
Thank you Jacob. I'm happy to be of service.
barnettjv
@barnettjv
@jacobmontiel Hi Jacob, I was wondering (hoping, actually) whether there is a way for me to use my graphics cards to speed up the data processing of the scikit-multiflow functions (i.e. EvaluatePrequential). NumbaPro? CUDA?
Jacob
@jacobmontiel
Unfortunately, that is not possible. Most stream algorithms are sequential in nature, which makes them very challenging to parallelize. An alternative is to launch multiple jobs in parallel, depending on the amount of resources you have
barnettjv
@barnettjv
I see. Well, my machine has 512GB and 2 CPUs. To run in parallel I've been using & for each python job and have been getting about 40% CPU load but 180°F temps. Is there a better way?
Jacob
@jacobmontiel
Not that I am aware of (any suggestion on this is welcome). What you describe is what we usually do on our servers
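For completeness, the shell-& pattern discussed above can also be driven from Python. This is a hedged sketch with stub worker commands; substitute your real experiment scripts for the `python -c` stubs.

```python
import subprocess
import sys

# Launch independent worker processes in parallel (stub commands here)
procs = [
    subprocess.Popen([sys.executable, "-c", f"print('job {seed} done')"])
    for seed in (1, 2, 3)
]

# Wait for all of them, like `wait` in a shell
exit_codes = [p.wait() for p in procs]
print(exit_codes)  # [0, 0, 0]
```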
barnettjv
@barnettjv
After I finish my dissertation report in May, I'm planning on exploring GPU parallelization possibilities with scikit-multiflow. I think I've just about maxed out the 72 cores on my server with the current scikit-multiflow architecture. My naive thought is that perhaps NVIDIA or Radeon have parallelization libraries that can be easily imported into my python projects (based on scikit-multiflow).
Jacob
@jacobmontiel
That is a very interesting topic and it would be really nice to see how it can be applied to stream learning. I am going to drop a reference here to https://dask.org/ which seems promising, but we do not have enough resources to explore it at the moment.
Jiao Yin
@JoanYinCQ
New to skmultiflow. When I load 'covtype.csv' using FileStream, after stream.prepare_for_use() it is interpreted as a regression problem, with y=1.0/2.0/3.0/4.0/5.0. But it is actually a multi-class classification problem, and y=1/2/3/4/5 in 'covtype.csv'. How can I make the FileStream interpret the data in the right way?
Jacob
@jacobmontiel
Hi @JoanYinCQ. I am not able to reproduce this error, can you share a MWE?
This is what I used:
from skmultiflow.data import FileStream
stream = FileStream("./src/skmultiflow/data/datasets/covtype.csv")
stream.prepare_for_use()
stream.n_classes    # Output: 7
stream.target_values    # Output: [1, 2, 3, 4, 5, 6, 7]
Jiao Yin
@JoanYinCQ
Thanks @jacobmontiel. I ran exactly the same code as you, but the error still existed. However, when I re-downloaded the data, the error was fixed. So maybe the data was not saved properly at first. Thanks for your reply.
barnettjv
@barnettjv
@JoanYinCQ Hi Joan, would you mind telling me where you got the data? Was it the UCI repository?
Jiao Yin
@JoanYinCQ
@barnettjv https://github.com/scikit-multiflow/scikit-multiflow/tree/master/src/skmultiflow/data/datasets I downloaded the data and test scripts from the GitHub repo of scikit-multiflow.
Jacob
@jacobmontiel

Thanks @jacobmontiel. I ran exactly the same code as you, but the error still existed. However, when I re-downloaded the data, the error was fixed. So maybe the data was not saved properly at first. Thanks for your reply.

Glad to hear that it is working. It is strange that it just went away; in any case, we will keep it in mind in case somebody else gets the same error.

Jiao Yin
@JoanYinCQ

Hi @jacobmontiel I want to know if all the methods for concept drift detection included in skmultiflow.drift_detection only support 1-D data streams. For example, when using a 2-D (size=[2000, 5]) data_stream in the following code, an error arises.

# Imports
import numpy as np
from skmultiflow.drift_detection import PageHinkley
ph = PageHinkley()

# Simulating a data stream as a normal distribution of 1's and 0's
data_stream = np.random.randint(2, size=[2000, 5])

# Changing the data concept from index 999 to 2000
for i in range(999, 2000):
    data_stream[i] = np.random.randint(4, high=8, size=5)

# Adding stream elements to the PageHinkley drift detector and verifying if drift occurred
for i in range(2000):
    ph.add_element(data_stream[i])
    if ph.detected_change():
        print('Change has been detected in data: ' + str(data_stream[i]) + ' - of index: ' + str(i))

Jacob
@jacobmontiel
Hi @JoanYinCQ, yes, all drift detectors only support 1-D data. This is not defined by scikit-multiflow but by the algorithms themselves.
Jacob
@jacobmontiel
your code should look like this:
import numpy as np
from skmultiflow.drift_detection import PageHinkley

data_stream = np.concatenate((np.random.randint(2, size=1000), np.random.randint(4, size=1000)))

ph = PageHinkley()
for i, val in enumerate(data_stream):
    ph.add_element(val)
    if ph.detected_change():
        print('Change has been detected in data: ' + str(data_stream[i]) + ' - of index: ' + str(i))
        ph.reset()
Jiao Yin
@JoanYinCQ
@jacobmontiel Thanks for your reply. That solves my problem.
nuwangunasekara
@nuwangunasekara
Hi guys, does scikit-multiflow support the "arff" file format?
Jacob
@jacobmontiel
Short answer: no. However, there is a loader in scipy (scipy.io.arff.loadarff), so you can load the data and then convert it using skmultiflow.data.DataStream
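A hedged sketch of the load-then-convert route just described. The inline ARFF document and its attribute names are made up for demonstration; the DataStream call is shown commented out since this sketch only covers the loading step.

```python
import io

import pandas as pd
from scipy.io import arff

# A tiny inline ARFF document for demonstration (attributes are made up)
arff_text = """@relation demo
@attribute x1 numeric
@attribute x2 numeric
@attribute class {0,1}
@data
1.0,2.0,0
3.0,4.0,1
"""
data, meta = arff.loadarff(io.StringIO(arff_text))
df = pd.DataFrame(data)

# Nominal attributes are loaded as bytes; decode them to integer labels
df["class"] = df["class"].str.decode("utf-8").astype(int)

# The resulting frame can then be wrapped for streaming, e.g.:
# from skmultiflow.data import DataStream
# stream = DataStream(df)  # last column is the target by default
print(df.shape)  # (2, 3)
```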
nuwangunasekara
@nuwangunasekara
Thanks for the tip @jacobmontiel !
farzane
@farzane30788946_twitter
Hi, I have a problem. I want to compute a kernel density estimate and I use sklearn.neighbors.KernelDensity; multivariate data is coming in as a stream and I want to update my estimated density as the streaming data comes in. So I chose scikit-multiflow, since it supports streaming data and has drift detection methods over a window. But I want to find out whether the multiflow core can be combined with sklearn.neighbors.KernelDensity to update the densities as data comes in. If it cannot work with this, how should I use ADWIN() for multivariate data, to save part of the data in a window, change the saved part every time drift is detected, and compute the kernel density estimate again on the new part of the data? Any other advice would be useful. Thanks in advance!
Jacob
@jacobmontiel
Hello @farzane30788946_twitter. You can use sklearn.neighbors.KernelDensity alongside scikit-multiflow. As you mention, the sklearn implementation works on batches of data, so if you want to update the densities you have to define a data-update strategy. This is very similar to how the KNNClassifier is implemented: you will see there that the data is stored in a sliding window. Regarding drift detection, ADWIN, like all other drift detectors, takes 1-dimensional data as input. You can check the KNNADWINClassifier, which uses ADWIN to monitor the classification performance of the basic KNN model. If ADWIN detects a change in classification performance, the model is reset.
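A hedged sketch of the sliding-window idea mentioned above, using sklearn's KernelDensity directly. The window size, bandwidth, refit interval, and the simulated stream are all arbitrary assumptions, not values from skmultiflow.

```python
from collections import deque

import numpy as np
from sklearn.neighbors import KernelDensity

WINDOW_SIZE = 200   # arbitrary choice for this sketch
REFIT_EVERY = 50    # refitting on every single sample would be wasteful

window = deque(maxlen=WINDOW_SIZE)  # sliding window of recent samples
kde = KernelDensity(bandwidth=0.5)

rng = np.random.default_rng(42)
stream = rng.normal(size=(500, 3))  # simulated multivariate stream

for i, x in enumerate(stream):
    window.append(x)
    if (i + 1) % REFIT_EVERY == 0:  # periodically refit on the window
        kde.fit(np.asarray(window))

# Score a few new instances under the current density estimate
scores = kde.score_samples(rng.normal(size=(5, 3)))
print(scores.shape)  # (5,)
```

A drift detector could trigger the refit instead of the fixed interval, e.g. by feeding ADWIN a 1-D summary such as the per-sample log-density.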
farzane
@farzane30788946_twitter
much obliged @jacobmontiel . i will try it
farzane
@farzane30788946_twitter
Hi guys, one question: I have the distribution of a data set, and I want to measure the probability that a new instance belongs to this distribution. It is the same thing generative adversarial networks do to discriminate real data from fake data, but I am wondering how I can do that in python. I am not sure whether there are methods for this in scikit-multiflow, but if there are I would prefer to use them instead of other libraries.
Any ideas, even from other libraries such as stats, would be useful! Thanks in advance.
farzane
@farzane30788946_twitter
I found the answer to my question myself! The binary cross-entropy loss function is useful here, but I wonder if I can use it as a learner model in skmultiflow.evaluation.evaluate_prequential.EvaluatePrequential.
@jacobmontiel you answered my last question; much obliged for your answer. Do you have any idea here too? If you have some experience, let me know; thanks in advance.
Jacob
@jacobmontiel
EvaluatePrequential requires as input a model that implements the predict and partial_fit methods. For each sample, the predict method returns the predicted value, whereas the partial_fit method is used to introduce new data into the model
the important element is to process one sample at a time
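The interface just described can be sketched with a toy model. MajorityClassLearner is illustrative only, not part of skmultiflow; it merely shows the duck-typed partial_fit/predict shape EvaluatePrequential expects.

```python
from collections import Counter

import numpy as np

class MajorityClassLearner:
    """Toy incremental model exposing the partial_fit/predict interface."""

    def __init__(self):
        self.counts = Counter()

    def partial_fit(self, X, y, classes=None, sample_weight=None):
        # Incorporate new labels one batch (or one sample) at a time
        for label in y:
            self.counts[int(label)] += 1
        return self

    def predict(self, X):
        # Predict the most frequent label seen so far
        if not self.counts:
            return np.zeros(len(X), dtype=int)
        majority = self.counts.most_common(1)[0][0]
        return np.full(len(X), majority)

model = MajorityClassLearner()
model.partial_fit(np.array([[0.1], [0.2], [0.3]]), np.array([1, 1, 0]))
print(model.predict(np.array([[0.4]])))  # [1]
```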
Jacob
@jacobmontiel
in this case, as your data seems to be unlabeled, you can consider anomaly detection methods or clustering (depending on the distribution of the data and how often you expect to see "different" samples)
Jacob
@jacobmontiel
ignore my previous comment as I got confused. Your data is indeed labeled. In this case, you can follow a structure similar to the one in the KNNClassifier.
farzane
@farzane30788946_twitter
I appreciate it, indeed; I was confused about these concepts. Have a good day, dear @jacobmontiel
Dossy Shiobara
@dossy
Is there a working example of how to use skmultiflow.data.DataStream?
Jacob
@jacobmontiel

Hi @dossy , here is one

from skmultiflow.data import DataStream
import numpy as np

n_features = 10
n_samples = 50
X = np.random.random(size=(n_samples, n_features))
y = np.random.randint(2, size=n_samples)
stream = DataStream(data=X, y=y)
# stream.prepare_for_use() # if using the stable version (0.4.1)
stream.n_remaining_samples()

The last line returns 50

This is for the case when you have your data in separate numpy.ndarray objects.
You can also pass a single np.ndarray, as long as you define the index of the target column (last column by default). pandas.DataFrame is also supported, following the same indications.
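A hedged sketch of the single-array form just mentioned. Only the array construction runs here; the DataStream calls are commented out, and `target_idx` reflects my reading of the skmultiflow DataStream API (last column is the default target).

```python
import numpy as np

# Features plus a final integer target column in one array
X = np.random.random(size=(50, 10))
y = np.random.randint(2, size=(50, 1))
data = np.hstack([X, y])

# from skmultiflow.data import DataStream
# stream = DataStream(data)                # last column used as the target
# stream = DataStream(data, target_idx=0)  # or pick a column explicitly
print(data.shape)  # (50, 11)
```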