Anuar Ustayev
@anuveyatsu
@StephenAbbott :+1: and I’ve just updated the installation docs mentioning your issue :smile:
Zane Selvans
@zaneselvans
Does anyone here have favorite references for how one goes about testing data processing pipelines? We're using pytest now to run the entire ETL process and then check a bunch of outputs... but it seems like a kludge: it takes a long time to run the entire dataset, and it mixes together testing the code and testing the data, two tasks that seem like they should be isolated from each other to the extent possible.
Rufus Pollock
@rufuspollock
@zaneselvans great question. If you are using dataflows you can test each step using pytest etc.
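A minimal sketch of what that could look like, with made-up step and field names: a dataflows row processor is just a Python function, so it can be fed a tiny in-memory resource and asserted on directly from pytest.

```python
# Sketch only: step and field names are invented. A dataflows row processor
# is a plain Python function, so pytest can run it against a tiny in-memory
# resource without touching the real data.
from dataflows import Flow

def fahrenheit_to_celsius(row):
    # hypothetical transform step under test
    row['temp_c'] = round((row['temp_f'] - 32) * 5 / 9, 2)

def test_fahrenheit_to_celsius_step():
    rows = [{'temp_f': 32.0}, {'temp_f': 212.0}]
    results, _, _ = Flow(rows, fahrenheit_to_celsius).results()
    assert [r['temp_c'] for r in results[0]] == [0.0, 100.0]
```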
David Cottrell
@david-cottrell_gitlab
@zaneselvans are you testing pipeline frameworks or pipeline instances (data)?
I would say do not use anything from the usual testing world, except if you are running spot checks/sanity checks; that might be useful. Otherwise you need something more like data versioning and diff summaries.
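A toy illustration of that data-versioning / diff-summary idea, with hypothetical file names and threshold: compare summary statistics between two versions of an output rather than asserting exact values.

```python
# Toy diff summary between two versions of a pipeline output.
# File names and the 5% threshold are purely illustrative.
import pandas as pd

old = pd.read_csv("output_v1.csv")
new = pd.read_csv("output_v2.csv")

summary = pd.DataFrame({
    "old_mean": old.mean(numeric_only=True),
    "new_mean": new.mean(numeric_only=True),
})
summary["rel_change"] = (summary["new_mean"] - summary["old_mean"]) / summary["old_mean"].abs()

# Flag columns whose mean drifted by more than 5% between versions.
print(summary[summary["rel_change"].abs() > 0.05])
```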
Rufus Pollock
@rufuspollock
@david-cottrell_gitlab great suggestions - any thoughts on how you do it?
Zane Selvans
@zaneselvans
@david-cottrell_gitlab Well, we need to do both of those things. When we're refactoring or debugging the code, or adding functionality, we need to be able to test the pipeline framework, i.e. the software that's doing the processing. For that it seems like having a minimal test dataset that exhibits most of the attributes of the data to be processed is good enough (and can be used for CI as well). For this, using pytest seems reasonable. But we also need to be able to run all of the data through the pipeline, to identify issues that arise because of problems with the data (fixes for which we may need to integrate into the code). This involves many gigabytes of data and can take from ten minutes to a few hours, depending on the data source (on a laptop). Right now these things are mixed together. To test the pipeline, we'll typically pull in just 1 year of data (instead of the 10 years available), or one state instead of the whole country. But this is kind of a hack.
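One possible way to make that "1 year instead of 10" switch explicit rather than a hack, sketched with pytest; run_etl and the year range are placeholders, not the project's real entry points.

```python
# conftest.py -- sketch of a --full-data switch: the default is the small
# CI-friendly sample, and the full validation run is opt-in.
import pytest

def pytest_addoption(parser):
    parser.addoption("--full-data", action="store_true",
                     help="run against all years instead of the one-year sample")

@pytest.fixture(scope="session")
def years(request):
    if request.config.getoption("--full-data"):
        return list(range(2009, 2019))   # placeholder full range
    return [2017]                        # placeholder sample year

# test_etl.py -- illustrative; run_etl stands in for the real ETL entry point.
def test_etl_produces_rows(years):
    dataframes = run_etl(years=years)
    assert all(len(df) > 0 for df in dataframes.values())
```

Plain `pytest` then stays fast enough for CI, while `pytest --full-data` becomes the long data-validation pass.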
Rufus Pollock
@rufuspollock
@zaneselvans what pipeline framework do you use btw? And have you seen dataflows: https://github.com/datahq/dataflows
Zane Selvans
@zaneselvans
I just watched most of @akariv's tutorial video on dataflows, and the overall idea seems great! The row-by-row processing seems like it will be quite slow on larger data though. We don't really have "Big Data" but it's big enough that we need to be efficient. Depending on the data source it's between ~1 million and ~1 billion records. For the smaller datasets it can all be done in memory and we use pandas. For the bigger ones we're starting to play with dask, which extends pandas dataframes to work with larger-than-memory data -- and it's still practical on a laptop.
@rufuspollock I wish we had "a framework" -- right now we're passing around dictionaries of pandas dataframes, with an Extract module, a Transform module, and a Load module for each data source.
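A small illustration of the dask pattern mentioned above, with invented file paths and column names: dask.dataframe mirrors the pandas API but partitions the data and computes lazily, so larger-than-memory datasets stay workable on a laptop.

```python
# Invented paths/columns; the point is that this pandas-style code runs
# out-of-core: dask reads the CSVs in partitions and only does the work
# when .compute() is called.
import dask.dataframe as dd

df = dd.read_csv("data/hourly_emissions_*.csv",
                 parse_dates=["operating_datetime"])

monthly_co2 = (
    df.groupby(df.operating_datetime.dt.month)["co2_mass_tons"]
      .sum()
      .compute()          # triggers the actual chunked, parallel computation
)
print(monthly_co2)
```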
Rufus Pollock
@rufuspollock
@zaneselvans right, and note that dataflows is not like pandas - it's for running the steps, and you can use pandas in one of the steps and do everything in memory. You don't have to do row-by-row processing. I think of dataflows as more of a pattern (convention over configuration) approach to doing a data pipeline ...

@rufuspollock I wish we had "a framework" -- right now we're passing around dictionaries of pandas dataframes, with an Extract module, a Transform module, and a Load module for each data source.

In that case dataflows might be nice to try - the aim is a framework that is lightweight and runs with Python. It's a bit like the way a WSGI app works with Flask: a useful structure and a few useful pieces …

you might even put it like this, in Flask terms …

Werkzeug = Pandas (the core library that does the work)
Flask = DataFlows (a convention and runner for structuring your pipeline …)
WSGI = data packages / table schema (a convention for your “API” on the data)
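A minimal sketch of what that structure looks like in practice, with a placeholder URL and field name: dataflows supplies the runner and the data-package convention, while each step is an ordinary Python function that could just as well call pandas internally.

```python
# Placeholder URL and field name; the shape is what matters:
# load() = Extract, a plain function = Transform, dump_to_path() = Load,
# with the output following the data package / table schema convention.
from dataflows import Flow, load, dump_to_path

def clean_row(row):
    # per-row transform; heavier steps could operate on whole resources
    # (e.g. via pandas) instead
    row['value'] = float(row['value'])

Flow(
    load('https://example.com/raw.csv'),
    clean_row,
    dump_to_path('out/datapackage'),
).process()
```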

Zane Selvans
@zaneselvans
I am (blissfully, but probably not permanently) ignorant of the web development side of things.
Rufus Pollock
@rufuspollock
:smile:
My point is that dataflows and pandas are complements and dataflows is a lightweight structure for organizing your pipelines oriented around tabular data (packages)
Zane Selvans
@zaneselvans
I'll definitely check out the dataflows framework and see how it might work for our purposes. Maybe we end up using that for the smaller datasets, and something else for the 100 GB datasets.
But to get back to my original question... I'd love a more in-depth (book-ish) general reference on how to think about designing & testing data processing pipelines more generally, to be robust, efficient, re-usable, etc. Is that a thing which exists?
David Cottrell
@david-cottrell_gitlab
@zaneselvans https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43146.pdf ... and I think dynamicwebpaige (twitter) writes some stuff about data "devops", provenance etc.
Tito Jankowski
@titojankowski
hey @rufuspollock we would love to get Earth’s carbon dioxide data hosted on datahub! What’s the next step?
Andrew Cantino
@cantino
@titojankowski Looks like it's already here? https://datahub.io/collections/climate-change
Tito Jankowski
@titojankowski
that would be great, i’ll check it out!
looking through right now, only see monthly data points at the moment
a great foundation to add in daily data
Anuar Ustayev
@anuveyatsu
@titojankowski are you looking at https://datahub.io/core/co2-ppm ?
Tito Jankowski
@titojankowski
@anuveyatsu yes, that’s it
and the most recent datapoint is September 1st
would LOVE it if this was daily data as it became available, MLO posts a new datapoint every day
Anuar Ustayev
@anuveyatsu
Yes, there is only monthly data.. and yes, daily data would be cool :smile: @rufuspollock any suggestions?
Tito Jankowski
@titojankowski
what part of datahub do you work on @anuveyatsu?
Anuar Ustayev
@anuveyatsu
@titojankowski both frontend (mostly) and backend
Tito Jankowski
@titojankowski
what a great project it is!
Anuar Ustayev
@anuveyatsu
@titojankowski :smile: several months ago we tried to implement something like dashboards, but it was never published. Probably we could use daily data in that dashboard as well.

:earth_asia: :bar_chart: Climate Change Dashboards

[feedback is needed]

Today we’re introducing the Climate Change Dashboard on DataHub, where you can explore the State of the World and the Impacts of Global Climate Change. We’d appreciate it if you could give us any feedback:

https://datahub.io/awesome/dashboards/climate-change

The goal was to combine everything related to climate change on DataHub. It would be great to improve it and publish/promote it. E.g., we could add another tab called “CO2 today” where we have a chart showing the latest data.
We’ve also received feedback that the dashboard’s design/layout can be improved.
Tito Jankowski
@titojankowski
checking it out now, @anuveyatsu
looks great, amazing that co2 emissions are still climbing
Rufus Pollock
@rufuspollock
@titojankowski so we’d love to collaborate on migrating the climate doomsday stuff over and integrating with what’s there. A suggestion would be:
  • All data on DataHub as standard datasets
  • Migrating existing dashboard over (or integrating / reworking existing ones)
Tito Jankowski
@titojankowski
terrific!
how will you get the daily CO2 data onto DataHub?
@rufuspollock hope you’re as excited as I am about this! DataHub is a terrific home for daily carbon dioxide levels
Rufus Pollock
@rufuspollock
Very excited :smile:
@titojankowski what is the source for daily co2 data? if you have that source we can build a dataflow to auto import it each day.
Tito Jankowski
@titojankowski
@rufuspollock terrific!
Here’s the daily record where new data is pulled from, latest data point is 12/19/2018, 409.54 ppm: https://www.esrl.noaa.gov/gmd/webdata/ccgg/trends/co2_mlo_weekly.csv
However, that’s only the most recent 2 years of data or so
The remaining daily data from 1958 - 2017 is here: ftp://aftp.cmdl.noaa.gov/data/trace_gases/co2/in-situ/surface/mlo/co2_mlo_surface-insitu_1_ccgg_DailyData.txt
Rufus Pollock
@rufuspollock

OK, and we want this in one single dataset right?

ATM we have this dataset http://datahub.io/core/co2-ppm with code for this here https://github.com/datasets/co2-ppm

But this is only annual and monthly from 1980

I suggest we create a new dataset, e.g. co2-ppm-daily, and create the scraper for this. Do you know Python at all? If not, we can help do this as we already have the scrapers for this kind of file in co2-ppm ...
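A rough sketch, not the actual datasets/co2-ppm code, of what a co2-ppm-daily flow could look like using the NOAA weekly CSV linked above as an example source; the exact field handling would depend on the real layout of the NOAA files, and the flow would be scheduled to run each day.

```python
# Rough sketch only -- real field names and cleanup depend on the NOAA file layout.
from dataflows import Flow, load, update_package, dump_to_path

Flow(
    update_package(name='co2-ppm-daily',
                   title='Daily CO2 PPM at Mauna Loa Observatory'),
    load('https://www.esrl.noaa.gov/gmd/webdata/ccgg/trends/co2_mlo_weekly.csv',
         name='co2-mlo-weekly', format='csv'),
    dump_to_path('data/co2-ppm-daily'),
).process()
```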

Tito Jankowski
@titojankowski
@rufuspollock yes, one single dataset of daily data from 1958 - present
how can I help without knowing any python?