henry senyondo
@henrykironde
Is there any specific reason for keeping the quiet=False and force=True parameters in such a way? The quiet parameter is there to suppress warnings, and force is there to override the previous data, making sure we grab the new data and use that.
@Kadam-Tushar, what IDE are you using for development?
Tushar Kadam
@Kadam-Tushar
VS code
Tushar Kadam
@Kadam-Tushar
@henrykironde Can you brief me on what changes need to be made in this function?
henry senyondo
@henrykironde
You finished the first part, which was to create a script for downloading from Kaggle. We are going to use the same concept and make sure we create the correct url to request the data using the kaggle function in the retriever. You will have to download the sample kaggle scripts from any of the PRs here: https://github.com/weecology/retriever-recipes/search?q=kaggle&type=issues. I have not merged the PRs yet, so the kaggle scripts are not in the retriever yet.
Tushar Kadam
@Kadam-Tushar
Okay, I will look into it.
Tushar Kadam
@Kadam-Tushar
@henrykironde For Kaggle scripts, what should the value of the "url" field in the script be? In the scripts you shared, some have the format "author/dataset_name" while others have a complete link to download the dataset.
For the kaggle api we need the "author/dataset_name" format, right?
henry senyondo
@henrykironde
The ones that have a complete url are datasets that do not require users to log in. They just happen to be public datasets on the kaggle platform. The ones with the format "author/dataset_name" are the ones that require the api call, and we should focus on these.
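A minimal sketch of the distinction described above. The helper name `needs_kaggle_api` and the example url values are hypothetical and are not part of the retriever; the retriever may make this decision differently.

```python
from urllib.parse import urlparse


def needs_kaggle_api(url: str) -> bool:
    """Return True when `url` looks like "author/dataset_name" rather than a full link.

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(url)
    if parsed.scheme in ("http", "https"):
        # A complete link can be downloaded directly, no login required.
        return False
    parts = url.strip("/").split("/")
    # Exactly two non-empty segments: the author and the dataset name.
    return len(parts) == 2 and all(parts)


print(needs_kaggle_api("some-author/some-dataset"))           # True
print(needs_kaggle_api("https://example.com/data/file.zip"))  # False
```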
Tushar Kadam
@Kadam-Tushar
One more thing: I see the data_source parameter of the download_from_kaggle function gets its value from the kaggle script's data_source field, right? But I see only one script has this key-value pair.
Tushar Kadam
@Kadam-Tushar
@henrykironde For this script, even though the url has the format "author/dataset_name", "kaggle:true" is not there in the script.
henry senyondo
@henrykironde
Yes, thanks for catching that. It should have that keyword.
Tushar Kadam
@Kadam-Tushar

One more thing: I see the data_source parameter of the download_from_kaggle function gets its value from the kaggle script's data_source field, right? But I see only one script has this key-value pair.

@henrykironde Let me know what is supposed to be done with this parameter, as it was introduced to distinguish between competition datasets and the usual datasets from Kaggle.

And finally, what changes are you expecting in this function?

henry senyondo
@henrykironde
Like you said, the keyword data_source was introduced to distinguish between the forms of data provided by kaggle. Some datasets are from competitions and others are not, so the urls for getting these datasets differ. We should have included data_source in the scripts, or we should have included a default value for data_source in the function definition: download_from_kaggle(... data_source=[the assumed default])
what changes are you expecting in this function? There may be no changes. However, start by testing whether the script works with the changes you have found so far. Then we can move on to look for other errors.
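The default-value idea could be sketched roughly as follows. The helper `get_data_source`, the choice of "dataset" as the default, and the sample script dictionaries are all assumptions made for illustration; they are not taken from the retriever code.

```python
VALID_SOURCES = ("dataset", "competition")


def get_data_source(script: dict, default: str = "dataset") -> str:
    """Read data_source from a script's metadata, falling back to a default.

    The default of "dataset" is an assumption for this sketch.
    """
    source = script.get("data_source", default)
    if source not in VALID_SOURCES:
        raise ValueError(f"unknown data_source: {source!r}")
    return source


# A script without the key falls back to the assumed default.
print(get_data_source({"kaggle": True, "url": "some-author/some-dataset"}))  # dataset
print(get_data_source({"kaggle": True, "data_source": "competition"}))      # competition
```

With a default in place, scripts that omit data_source keep working, while competition scripts only need to state the one extra key.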
Tushar Kadam
@Kadam-Tushar
Ok
Ethan White
@ethanwhite
[Henry wec office, weecology] Yes that was one thing that we have to improve.
Tushar Kadam
@Kadam-Tushar
@henrykironde I have tested the kaggle function and it's working fine. There are some changes required in the given scripts.
Tushar Kadam
@Kadam-Tushar

Changes I can make:

  • Correct the scripts
  • Modify the error message shown when a user tries to install contest data from Kaggle and hasn't registered for the contest.
  • archive_full_path = archive_full_path.replace("kaggle:competition:", "") + ".zip" Remove "kaggle" from the string and keep only "competition", as it will always be kaggle data because we are already checking "kaggle":"True".

Shall I work on these changes and send a PR?
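The third change might look like the sketch below. It assumes the garbled call in the message is `str.replace`, and the function name `archive_name` and the example paths are hypothetical.

```python
def archive_name(archive_full_path: str, prefix: str = "competition:") -> str:
    """Drop the source prefix from the path and append ".zip".

    Hypothetical helper; the shorter "competition:" prefix is the proposed form.
    """
    return archive_full_path.replace(prefix, "") + ".zip"


# Proposed shorter prefix.
print(archive_name("/tmp/competition:some-contest"))  # /tmp/some-contest.zip
# Prefix as quoted in the message.
print(archive_name("/tmp/kaggle:competition:some-contest", prefix="kaggle:competition:"))  # /tmp/some-contest.zip
```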

Ethan White
@ethanwhite
[Henry wec office, weecology] That sounds good
Tushar Kadam
@Kadam-Tushar
@henrykironde When will #1528 be merged, so that I can pull the latest changes?
Tushar Kadam
@Kadam-Tushar
Created PR #1531. Let me know if something is missing.
Tushar Kadam
@Kadam-Tushar
@henrykironde Thanks for merging #1531.
I can see the build is failing, but earlier in the PR it was showing as passing. Any ideas on how my changes are making those tests fail? :thinking:
henry senyondo
@henrykironde
Checking right now
henry senyondo
@henrykironde
Just checked; it has nothing to do with the PR you created. It is a Postgres failure on Travis. I will rerun it and see.
Tushar Kadam
@Kadam-Tushar
Ok :smile:
henry senyondo
@henrykironde
Fixed
Tushar Kadam
@Kadam-Tushar

@henrykironde I have sent PR weecology/retriever-recipes#89 in retriever-recipes as discussed, but I guess there is some problem with Travis because my build is failing with this error:

ERROR: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit The command "docker-compose run -e IN_TRAVIS=true --service-ports python_recipes pytest -v -k "not sqlite"" exited with 1.

Tushar Kadam
@Kadam-Tushar
@henrykironde Just a note, feel free to ignore it if you already know this:
You have added the testing label to [weecology/retriever-recipes#88] instead of [weecology/retriever-recipes#89].
I was going to close the first PR, as discussed in the comments.
henry senyondo
@henrykironde
Nice catch, I will update that. I am actually testing the correct one, so you can close #88.
Madhu Charan
@madhucharan
Hello everyone, my name is Madhu. I am an undergrad student from India. I am very interested in contributing to the retriever. Kindly guide me through the setup and development environment, and point me to some beginner-friendly issues.
Tushar Kadam
@Kadam-Tushar
Hi @madhucharan, you can set up the dev environment with the help of https://retriever.readthedocs.io/en/latest/developer.html
henry senyondo
@henrykironde
Thanks @Kadam-Tushar. Hi @madhucharan, happy to hear that you are interested in the Data Retriever project. @Kadam-Tushar has shown you the best place to start. Also let me (@henrykironde) know about any ideas you have for the Data Retriever.
Madhu Charan
@madhucharan
Hi @henrykironde @Kadam-Tushar, thank you for the reply. Could you please guide me to some beginner-friendly issues as well?
henry senyondo
@henrykironde
The next issue I would recommend requires you to have the development environment ready. Let me know once you are ready. In case you find some problems let me know.
Madhu Charan
@madhucharan
I am currently going through the codebase. I will set up the env and ping you in a while.
Tushar Kadam
@Kadam-Tushar
@henrykironde Can we get image datasets like MNIST from the retriever?
Or rather, my question is: does the retriever support datasets which contain multimedia like images or sound files?
henry senyondo
@henrykironde
We can add datasets with images, I have not tried with sound files.
Tushar Kadam
@Kadam-Tushar

@henrykironde Sorry for the late reply. I do have some ideas for GSoC. Let me know your thoughts on these.

Adding Kaggle-like quick visualisation of datasets.

Before downloading any dataset using the retriever, one may want to know more about it: number of columns, column names and datatypes, distribution of values (kaggle shows a histogram), unique values, a pie chart if columns are categorical, number of missing values, mean, median, standard deviation, quantiles.
All these statistical measures help in understanding the dataset and are useful for further processing.

I was thinking, can we add a CLI option to quickly get these visualisations?
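A rough sketch of the kind of summary this idea describes, using only the Python standard library. The function `summarize` and the sample data are hypothetical; this is not an existing retriever command, and a real version would likely read from the installed dataset rather than a string.

```python
import csv
import io
import statistics


def summarize(csv_text: str) -> dict:
    """Return per-column summary statistics for a small CSV given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    summary = {}
    for column in reader.fieldnames or []:
        values = [row[column] for row in rows]
        present = [v for v in values if v not in ("", None)]
        info = {"missing": len(values) - len(present), "unique": len(set(present))}
        try:
            numbers = [float(v) for v in present]
        except ValueError:
            numbers = []  # non-numeric column: treat as categorical
        if numbers:
            info["mean"] = statistics.mean(numbers)
            info["median"] = statistics.median(numbers)
        summary[column] = info
    return summary


# Illustrative data with one missing weight.
sample = "species,weight\nDM,40\nDM,\nPP,20\n"
for column, info in summarize(sample).items():
    print(column, info)
```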

Adding commands in the retriever for dataset manipulation.

If I understand correctly (correct me if I am wrong):
for cleaned datasets in the retriever, there are scripts written in Python which use a JSON file to clean the dataset, for example by replacing missing values.

But I guess there are no options in the retriever to manipulate or edit a dataset. For example:

I may want to take only a subset of columns and normalize the values of a particular column.

What if I want to join two or more tables from the given datasets?

I want to replace missing values with the mean/median value of a column. How can I do it?

I know we can do these things using other libraries like pandas,
but that would require users to have scripting skills, and many researchers may not have them.

PS. I was looking at Frictionless Data, because they have packages like goodtables which validate tables and do many other things. That is where I got this idea.
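As an illustration of one of the manipulations mentioned above, here is a minimal sketch of replacing missing values with the column mean. The function `fill_missing_with_mean` and the sample rows are hypothetical; the retriever does not provide this today, per the message.

```python
import statistics


def fill_missing_with_mean(rows: list, column: str) -> list:
    """Return copies of `rows` with None in `column` replaced by the column mean."""
    known = [row[column] for row in rows if row[column] is not None]
    if not known:
        return [dict(row) for row in rows]  # nothing to compute a mean from
    mean = statistics.mean(known)
    return [{**row, column: mean if row[column] is None else row[column]} for row in rows]


rows = [{"id": 1, "weight": 40.0}, {"id": 2, "weight": None}, {"id": 3, "weight": 20.0}]
print(fill_missing_with_mean(rows, "weight"))
```

The input rows are left untouched, so the original data stays available for comparison.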

A feature for sharing committed datasets in the retriever with others

Suppose I made some changes to my dataset and want to share the new dataset with others. Instead of sharing the whole dataset (which may be very large), can we maintain a file which contains the changes I have made to it? (Just like git has commits which show changes.)

Then I can share this file (which may be a human-readable YAML file) with a friend, and it shows the changes I have made to the dataset.
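The change-file idea could be sketched like this. The operation names, the `apply_changes` function, and the sample data are all hypothetical, and JSON stands in for YAML here only so that the sketch needs nothing beyond the standard library.

```python
import json


def apply_changes(rows: list, changes: list) -> list:
    """Replay a list of recorded changes on top of the original rows."""
    result = [dict(row) for row in rows]
    for change in changes:
        if change["op"] == "drop_column":
            for row in result:
                row.pop(change["column"], None)
        elif change["op"] == "rename_column":
            for row in result:
                row[change["to"]] = row.pop(change["from"])
        else:
            raise ValueError(f"unknown operation: {change['op']}")
    return result


# The change log is small and human readable, so it can be shared instead of the data.
change_log = json.dumps([
    {"op": "rename_column", "from": "wgt", "to": "weight"},
    {"op": "drop_column", "column": "notes"},
])

original = [{"id": 1, "wgt": 40, "notes": "ok"}]
print(apply_changes(original, json.loads(change_log)))  # [{'id': 1, 'weight': 40}]
```

Anyone who already has the original dataset can replay the log to reproduce the modified version.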

I have read the docs about Provenance. I guess this feature is not mentioned there.

Issue #1409 weecology/retriever#1409

Let me know your thoughts on this. Is this feasible / required?
Tushar Kadam
@Kadam-Tushar
@ethanwhite Any thoughts on the above ideas?
henry senyondo
@henrykironde
@Kadam-Tushar That is a great idea. I am trying to work that into the last project we are presenting for GSoC.
Tushar Kadam
@Kadam-Tushar
@henrykironde I see test coverage for the retriever is 60%. How can we improve that?
Or is there anything more important that I can work on? I would be happy to contribute.
henry senyondo
@henrykironde
Yes, let's increase test coverage where we can.
Kush Kothari
@kkothari2001

Hello @henrykironde and @ethanwhite, I am Kush, a passionate web developer with experience in front-end development in React and back-end development in Express, Flask and Django. I found the Data Retriever to be a very interesting project that will motivate me to learn something new!

Could someone please help me get started with the codebase? I wish to start contributing right away!