Irakli Mchedlishvili
@zelima
:+1:
Shrif Rai
@joyryder
hello brothers
fabirubiru
@fabirubiru
Hi everyone
I'm new to this and I want to learn about Datahub, could someone help me or share any documentation about it?
Anuar Ustayev
@anuveyatsu
@joyryder Hi there!
@fabirubiru Hi! Sure, you can start here - http://datahub.io/docs
David Cottrell
@david-cottrell_gitlab
Is there a way to delete a datapackage? I ended up pushing a package called "datapackage", renamed it and repushed so now I have two. Have searched a lot but do not yet see how to delete.
Stephen Abbott Pugh
@StephenAbbott
Hi. I've been trying to install version 0.4.5 of the Data publishing app for MacOS but keep getting an error message. Is there a different version I should try? My laptop OS is MacOS High Sierra (version 10.13.6)
Rufus Pollock
@rufuspollock

@david-cottrell_gitlab

Is there a way to delete a datapackage? I ended up pushing a package called "datapackage", renamed it and repushed so now I have two. Have searched a lot but do not yet see how to delete.

You can make it unpublished atm so no one can see it - we are working on a purge-type command but for now making it unpublished is the way to go ...

Hi. I've been trying to install version 0.4.5 of the Data publishing app for MacOS but keep getting an error message. Is there a different version I should try? My laptop OS is MacOS High Sierra (version 10.13.6)

Can you give a bit more detail on the error message - and we can check that build :smile:

Stephen Abbott Pugh
@StephenAbbott

Can you give a bit more detail on the error message - and we can check that build :smile:

I've downloaded version 0.4.5. When I open the application, it says 'Please wait, we are installing the CLI tool on this machine'. The install reaches 100% and then I get asked to update permissions on the downloaded CLI. I grant these permissions and then see an error message which just says 'Something went wrong while CLI tool update. We will try again automatically in 1 minute'. I've tried installing this version a few times now.

Rufus Pollock
@rufuspollock
@StephenAbbott ok - can you open an issue in github and we’ll look. In the meantime do you want to try installing the cli tool directly?
Stephen Abbott Pugh
@StephenAbbott
@rufuspollock I've opened an issue now datahq/datahub-qa#246 I'll try installing the CLI tool directly
Stephen Abbott Pugh
@StephenAbbott
Would anyone from the Datahub team be available tomorrow (Tuesday 18th December) for a conversation to help me resolve an issue I'm having with installing data as a command line tool? I'm hoping to use datahub.io to publish some data relating to an academic paper which is due for publication on Wednesday 19th December or shortly afterwards. Thanks
Rufus Pollock
@rufuspollock
@anuveyatsu could you connect with @StephenAbbott tomorrow (tuesday)?
Anuar Ustayev
@anuveyatsu
Hi @StephenAbbott I’m around today so let me know when you’re online :smile:
Stephen Abbott Pugh
@StephenAbbott
@anuveyatsu Thanks! Will DM you now
Stephen Abbott Pugh
@StephenAbbott
Thanks to @anuveyatsu, we've resolved the issue :thumbsup:
Anuar Ustayev
@anuveyatsu
@StephenAbbott :+1: and I’ve just updated the installation docs mentioning your issue :smile:
Zane Selvans
@zaneselvans
Does anyone here have favorite references for how one goes about testing data processing pipelines? We're using PyTest now to run the entire ETL process and then check a bunch of outputs... but it seems like a kludge, and it takes a long time to do the entire dataset, and it's mixing together testing the code and testing the data, and those tasks seem like they should be isolated from each other to the extent possible.
Rufus Pollock
@rufuspollock
@zaneselvans great question. If you are using dataflows you can test each step using pytest etc.
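(As a sketch of what that can look like: a dataflows step is just a Python function, so it can be unit-tested with pytest in isolation from the full pipeline. The add_ratio step below is hypothetical, not from either project.)

    # test_steps.py -- a minimal sketch of unit-testing one pipeline step
    # with pytest. `add_ratio` is a hypothetical transform step.

    def add_ratio(row):
        """Transform step: annotate a row with a derived column."""
        row['ratio'] = row['numerator'] / row['denominator']
        return row

    def test_add_ratio():
        row = {'numerator': 10, 'denominator': 4}
        assert add_ratio(dict(row))['ratio'] == 2.5

    def test_add_ratio_preserves_inputs():
        row = {'numerator': 10, 'denominator': 4}
        result = add_ratio(dict(row))
        assert result['numerator'] == 10 and result['denominator'] == 4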
David Cottrell
@david-cottrell_gitlab
@zaneselvans are you testing the pipeline framework or pipeline instances (data)?
I would say do not use anything from the usual testing world
except if you are running spot checks/sanity checks. That might be useful. Otherwise you need something more like data-versioning and diff summaries.
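(A sketch of the diff-summary idea: compare summary statistics of two versions of a dataset rather than asserting exact equality. The file names and columns are hypothetical.)

    import pandas as pd

    # Two versions of the "same" dataset, e.g. before and after a pipeline change.
    old = pd.read_csv('dataset_v1.csv')
    new = pd.read_csv('dataset_v2.csv')

    # Summarize numeric columns in each version and report the drift
    # between them, largest first, for a human to eyeball.
    summary = pd.DataFrame({
        'old_mean': old.mean(numeric_only=True),
        'new_mean': new.mean(numeric_only=True),
    })
    summary['drift'] = (summary['new_mean'] - summary['old_mean']).abs()
    print(summary.sort_values('drift', ascending=False))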
Rufus Pollock
@rufuspollock
@david-cottrell_gitlab great suggestions - any thoughts on how you do it?
Zane Selvans
@zaneselvans
@david-cottrell_gitlab Well, we need to do both of those things. When we're refactoring or debugging the code, or adding functionality, we need to be able to test the pipeline framework i.e. the software that's doing the processing. For that it seems like having a minimal test dataset that exhibits most of the attributes of the data to be processed is good enough (and can be used for CI as well). For this using PyTest seems reasonable. But we also need to be able to run all of the data through the pipeline, to identify issues that arise because of problems with the data (fixes for which we may need to integrate into the code). This involves many gigabytes of data, and can take from ten minutes to a few hours, depending on the data source (on a laptop). Right now these things are mixed together. To test the pipeline, we'll typically pull in just 1 year of data (instead of 10 years available), or one state, instead of the whole country. But this is kind of a hack.
Rufus Pollock
@rufuspollock
@zaneselvans what pipeline framework do you use btw? And have you seen dataflows https://github.com/datahq/dataflows
Zane Selvans
@zaneselvans
I just watched most of @akariv's tutorial video on dataflows, and the overall idea seems great! The row-by-row processing seems like it will be quite slow on larger data though. We don't really have "Big Data" but it's big enough that we need to be efficient. Depending on the data source it's between ~1 million and ~1 billion records. For the smaller datasets it can all be done in memory and we use pandas. For the bigger ones we're starting to play with dask, which extends pandas dataframes to work with larger-than-memory data -- and it's still practical on a laptop.
@rufuspollock I wish we had "a framework" -- right now we're passing around dictionaries of pandas dataframes, with an Extract module, a Transform module, and a Load module for each data source.
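(For context, the dask pattern described above looks roughly like this - a sketch with a hypothetical file glob and column names, not PUDL's actual schema:)

    # pandas-like operations on larger-than-memory data; everything is
    # evaluated lazily until .compute() is called.
    import dask.dataframe as dd

    # Each matching CSV becomes one or more partitions; nothing loads yet.
    df = dd.read_csv('generation_*.csv')

    # Same API as pandas, but this builds a task graph instead of
    # computing immediately.
    annual_totals = df.groupby('plant_id')['net_generation_mwh'].sum()

    # Triggers out-of-core execution, processing partitions in chunks.
    print(annual_totals.compute())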
Rufus Pollock
@rufuspollock
@zaneselvans right and note that dataflows is not like pandas - it’s for running the steps and you can use pandas in one of the steps and do everything in memory. You don’t have to do row-by-row processing. I think of dataflows more as a pattern (convention over configuration) approach to doing a data pipeline ...

@rufuspollock I wish we had "a framework" -- right now we're passing around dictionaries of pandas dataframes, with an Extract module, a Transform module, and a Load module for each data source.

In that case dataflows might be nice to try - the aim is a framework that is lightweight and runs with Python. It’s a bit like the way a WSGI app works with Flask: a useful structure and a few useful pieces …

you might even say it’s like Flask …

Werkzeug = Pandas (core library that does the work)
Flask = DataFlows (a convention and runner for structuring your pipeline …)
WSGI = data packages / table schema (a convention for your “API” on the data)
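(To make the shape of a dataflows pipeline concrete, here is a minimal sketch using the dataflows API - Flow, load, and dump_to_path; the source file and derived column are hypothetical:)

    # Minimal dataflows pipeline: load a CSV, run a row-wise step, and
    # save the result as a data package.
    from dataflows import Flow, load, dump_to_path

    def add_ratio(row):
        # Steps can be plain functions applied to each row in turn.
        row['ratio'] = row['numerator'] / row['denominator']

    Flow(
        load('source.csv'),
        add_ratio,
        dump_to_path('output'),  # writes datapackage.json + resources
    ).process()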

Zane Selvans
@zaneselvans
I am (blissfully, but probably not permanently) ignorant of the web development side of things.
Rufus Pollock
@rufuspollock
:smile:
My point is that dataflows and pandas are complements and dataflows is a lightweight structure for organizing your pipelines oriented around tabular data (packages)
Zane Selvans
@zaneselvans
I'll definitely check out the dataflows framework and see how it might work for our purposes. Maybe we end up using that for the smaller datasets, and something else for the 100 GB datasets.
But to get back to my original question... I'd love a more in-depth (book-ish) reference on how to think about designing & testing data processing pipelines more generally - to be robust, efficient, re-usable, etc. Is that a thing which exists?
David Cottrell
@david-cottrell_gitlab
@zaneselvans https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43146.pdf ... and I think dynamicwebpaige (twitter) writes some stuff about data "devops", provenance etc.
Tito Jankowski
@titojankowski
hey @rufuspollock we would love to get Earth’s carbon dioxide data hosted on datahub! What’s the next step?
Andrew Cantino
@cantino
@titojankowski Looks like it's already here? https://datahub.io/collections/climate-change
Tito Jankowski
@titojankowski
that would be great, i’ll check it out!
looking through right now, only see monthly data points at the moment
a great foundation to add in daily data
Anuar Ustayev
@anuveyatsu
@titojankowski are you looking at https://datahub.io/core/co2-ppm ?
Tito Jankowski
@titojankowski
@anuveyatsu yes, that’s it
and the most recent datapoint is September 1st
would LOVE it if this was daily data as it became available, MLO posts a new datapoint every day
Anuar Ustayev
@anuveyatsu
Yes, there is only monthly data.. and yes, daily data would be cool :smile: @rufuspollock any suggestions?
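(For anyone wanting to pull that dataset programmatically, the usual pattern with the datapackage-py library looks roughly like this - the available resource names come from the descriptor, so list them first:)

    # A sketch of reading the co2-ppm data package with datapackage-py.
    from datapackage import Package

    package = Package('https://datahub.io/core/co2-ppm/datapackage.json')
    print([r.name for r in package.resources])  # discover the resources

    # Read one resource as a list of dicts keyed by column name.
    rows = package.resources[0].read(keyed=True)
    print(rows[:3])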
Tito Jankowski
@titojankowski
what part of datahub do you work on @anuveyatsu?
Anuar Ustayev
@anuveyatsu
@titojankowski both frontend (mostly) and backend
Tito Jankowski
@titojankowski
what a great project it is!