Akshay Prashant Shende
@akshayshende129
I'm stuck with the recommender system part
Anuar Ustayev
@anuveyatsu
@akshayshende129 can you clarify what you mean by a recommender system?
Akshay Prashant Shende
@akshayshende129
@anuveyatsu a recommender system is an AI algorithm (usually Machine Learning) that utilizes data to suggest additional products to consumers based on a variety of reasons.
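(For illustration only - not tied to anything in the DataHub toolchain - here is a minimal item-based collaborative filtering sketch in Python; the ratings matrix, item count and choice of cosine similarity are assumptions made up for this example.)

```python
# Minimal item-based collaborative filtering sketch (illustrative only;
# the ratings matrix and item count are made up for this example).
import numpy as np

# Rows = users, columns = items; 0 means "not yet rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between two item rating vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

n_items = ratings.shape[1]
# Item-item similarity matrix.
sim = np.array([[cosine_sim(ratings[:, i], ratings[:, j])
                 for j in range(n_items)] for i in range(n_items)])

def recommend(user, k=2):
    """Rank unrated items for a user by similarity-weighted ratings."""
    scores = sim @ ratings[user]
    scores[ratings[user] > 0] = -np.inf   # exclude items already rated
    return np.argsort(scores)[::-1][:k]

print(recommend(0))   # indices of suggested items for user 0
```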
Anuar Ustayev
@anuveyatsu
@akshayshende129 ok, I haven’t studied AI algorithms so I cannot help with it. Maybe somebody from the community can help you - let’s wait for a response
Zane Selvans
@zaneselvans
Hi there. I'm part of a project that's integrating public data about electric utilities in the US for use by non-profits, activists, journalists, academics. We're exploring options for publication of the data after we've cleaned it up and integrated it, and datahub looks like a great option! Very curious about what the datafactory setup looks like, and also about the scale of datasets which are appropriate for the platform. Some of the data we're integrating gets as large as 100s of GB. Allowing users to pull down modest subsets of the data rather than the whole pile would definitely be useful. https://github.com/catalyst-cooperative/pudl
Rufus Pollock
@rufuspollock

@zaneselvans great to hear from you and DataHub definitely sounds like a fit for what you are doing (which sounds exciting!)

We can definitely tell you more about Data Factory - which is the open source framework we've developed.

Regarding space: 100s of GBs would be fine (we back onto s3 etc).

In terms of pulling subsets can you explain more about what you'd mean - e.g. do you want people just to download a single file or even a sample of a file?

Zane Selvans
@zaneselvans
@rufuspollock On subsets, I mean, if we were to publish a dataset covering the entire US and all 20 years of time and it was 100GB, but someone just wanted the most recent year's worth of data for CA, it would be nice if they could specify that and just download the portion of the big dataset that's relevant to their work.
Zane Selvans
@zaneselvans
Relatedly, right now our analysis and development is mostly happening through Jupyter Notebooks, and we're excited about being able to separate the data storage and access from the analytical applications, so people can use something like Binder to publicize an analysis that uses our code and the compiled data, and others can easily replicate and build upon that analysis without having to manage their own 100GB pile-o-data locally.
Zane Selvans
@zaneselvans
Looking at the CKAN docs, it sounds like what I'm imagining might be implemented by the datastore extension?
Rufus Pollock
@rufuspollock

@rufuspollock On subsets, I mean, if we were to publish a dataset covering the entire US and all 20 years of time and it was 100GB, but someone just wanted the most recent year's worth of data for CA, it would be nice if they could specify that and just download the portion of the big dataset that's relevant to their work.

Got you, yes. There are a variety of ways to do that. The most natural would be to partition the initial dataset by year or similar, but there are other approaches.

Relatedly, right now our analysis and development is mostly happening through Jupyter Notebooks, and we're excited about being able to separate the data storage and access from the analytical applications, so people can use something like Binder to publicize an analysis that uses our code and the compiled data, and others can easily replicate and build upon that analysis without having to manage their own 100GB pile-o-data locally.

Yes, exactly. That sounds great.
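For concreteness, a minimal sketch of what that separation could look like from a notebook, using datapackage-py and pandas; the dataset URL below is hypothetical and stands in for any package published on DataHub.io:

```python
# Sketch: pull a published Data Package straight into an analysis session
# instead of managing the raw files locally. The URL is hypothetical.
from datapackage import Package
import pandas as pd

PKG_URL = 'https://datahub.io/examples/some-dataset/datapackage.json'  # hypothetical

package = Package(PKG_URL)
resource = package.resources[0]        # e.g. just the slice you need
rows = resource.read(keyed=True)       # typed rows, per the Table Schema
df = pd.DataFrame(rows)
print(df.head())
```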

Looking at the CKAN docs, it sounds like what I'm imagining might be implemented by the datastore extension?

The new DataHub.io is not built on CKAN but on a new stack built natively around Data Packages and the Frictionless Data toolchain. Data storage is native and implemented by default. If you want to play around now, you can just download the data command line tool and try "pushing" some data - https://datahub.io/download

Zane Selvans
@zaneselvans
Ahh, I see, so datahub has diverged from CKAN. Was getting the sense playing with both that they were on different pages. We're also looking at how this kind of data could be hosted within a computational cloud context (AWS Open Data) and partitioning by e.g. state & year would make that much more efficient and cost-effective as well. I guess a single Data Package could contain a bunch of individual resources, each of which is a year of data from a state. Is that a common approach? Is there a preferred method for wrapping larger datasets in Data Packages? Not immediately seeing the documentation on multipart resources. I'll get the python package and command line tools installed and play with them.
Rufus Pollock
@rufuspollock

@zaneselvans - great questions and keep 'em coming.

Ahh, I see, so datahub has diverged from CKAN. Was getting the sense playing with both that they were on different pages.

Yes, DataHub is CKAN "next gen" if you like. It has a modular, service-oriented architecture and is built around data packages from the ground up. A lot of the components can be (and some are) used with CKAN. We're also dedicated to running datahub.io as a reliable, community-oriented data hub.

We're also looking at how this kind of data could be hosted within a computational cloud context (AWS Open Data) and partitioning by e.g. state & year would make that much more efficient and cost-effective as well.

The way DataHub is designed makes it very easy to add additional pipelines to push data to wherever you need (or to pull from where you want).

I guess a single Data Package could contain a bunch of individual resources, each of which is a year of data from a state. Is that a common approach?

Yes, exactly. That's a very sensible way to do it. The way DataHub is designed also means this could be automated - so you upload in bulk and then process out slices like this. Also with tools like PrestoDB (and AWS Athena) ad-hoc querying against the CSV is getting very easy.
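As a sketch of how such a partitioned package might be assembled with datapackage-py (the file layout, names and state/year lists here are assumptions for illustration):

```python
# Sketch: one Data Package, one resource per (state, year) slice.
# File layout, names and the state/year lists are assumptions.
from datapackage import Package

package = Package({'name': 'utility-data'})
for state in ['ca', 'ny', 'tx']:
    for year in [2016, 2017, 2018]:
        package.add_resource({
            'name': f'generation-{state}-{year}',
            'path': f'data/{state}/{year}/generation.csv',
        })
package.infer()                    # fill in schemas/dialects from the files
package.save('datapackage.json')   # write the descriptor alongside the data
```

Splitting the resources this way is also what lets the data tool push only the slices that changed, as noted below.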

Is there a preferred method for wrapping larger datasets in Data Packages? Not immediately seeing the documentation on multipart resources. I'll get the python package and command line tools installed and play with them.

Good question - basically what you are asking for is a pattern for doing this. I do think partitioning datasets into sensible resources helps with local and remote management using the tools you already have, and also with updates - rather than syncing a 1GB file because of a change in one year, you can just push the year and state that changed (the data tool will do this automatically for you - i.e. avoid pushing unchanged resources).

Zane Selvans
@zaneselvans
Up until now our workflow has been to pull the raw data, clean it up and integrate it using Python/pandas, populate a local postgres DB, and then output various data products as requested by users, or by creating a few stock interesting tabular outputs. The database was the primary product, and the data outputs an application that sits on top of it. But now (especially as we get into needing to use other platforms) after exploring a bit here, I'm wondering if we shouldn't invert that, and think of the cleaned and integrated data as the primary product, which we and others use to populate various analytical platforms -- maybe a database, maybe just some pandas dataframes, maybe something on AWS.
This seems like a beautifully designed system. Thank you for working on it!
Rufus Pollock
@rufuspollock

Up until now our workflow has been to pull the raw data, clean it up and integrate it using Python/pandas, populate a local postgres DB, and then output various data products as requested by users, or by creating a few stock interesting tabular outputs. The database was the primary product, and the data outputs an application that sits on top of it. But now (especially as we get into needing to use other platforms) after exploring a bit here, I'm wondering if we shouldn't invert that, and think of the cleaned and integrated data as the primary product, which we and others use to populate various analytical platforms -- maybe a database, maybe just some pandas dataframes, maybe something on AWS.

Yes, big :thumbsup: - I think that is probably more sustainable (i.e. scalable, adaptable, debuggable, modularizable).

It is also really interesting to hear about your approach and what you've been doing.

In this context I want to mention the Data Factory:

Data Factory

We have an open source pattern / framework called the Data Factory for doing data processing and integration. Currently in very active development / documentation.

https://datahub.io/data-factory

This system is how we power the datahub and our own work prepping core datasets (https://datahub.io/docs/core-data) but is usable completely separately from the DataHub.io SaaS platform (of course you can publish to DataHub.io at the end and use DataHub.io for some integration / processing but you don't need to).


This seems like a beautifully designed system. Thank you for working on it!

:smile:

Zane Selvans
@zaneselvans
Very curious to understand more about what's really inside the data factory. That workflow is pretty close to what we've been doing, though our structural data validation has been taking place upon loading into the postgres DB -- but it looks like goodtables and/or the data validate tool could do the same kind of thing. We need to test the data for both syntactic and semantic correctness (or at least sanity...) is that a functionality that exists within these tools? Right now we're using pytest cases to not just test our code, but also to load the entire database and thus (incidentally) validate the data structure, but especially with the larger datasets that... just doesn't work. It was already taking 20 minutes to run the tests, and with the addition of our first large dataset, it bumped to 8 hours. We need to separate these different kinds of testing.
The data we're extracting from government sources is in some cases extremely dirty and poorly organized, so we're doing a lot of processing in Python before loading it, including starting to use a little bit of machine learning and automated classification to try and tie together datasets which have no shared keys, and to verify that we've done so correctly. The commercial data providers outsource this work to cheap labor overseas, who do it manually, but we don't have those resources, so we're trying to automate it, which seems like a valuable set of tools to develop for other purposes too!
Rufus Pollock
@rufuspollock

Very curious to understand more about what's really inside the data factory.

@zaneselvans more about the conceptual ideas for the data factory in this recent post by our Technical Lead @akariv

http://okfnlabs.org/blog/2018/08/29/data-factory-data-flows-introduction.html

There's a follow up tutorial here:

http://okfnlabs.org/blog/2018/08/30/data-factory-data-flows-tutorial.html
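Roughly in the spirit of those posts, a minimal dataflows sketch; the source URL and the row processor are placeholders, not part of any existing pipeline:

```python
# Minimal dataflows sketch in the spirit of the posts above.
# The source URL and the row processor are placeholders.
from dataflows import Flow, load, dump_to_path, printer

def normalise(row):
    # Example row processor: trim whitespace in every string field.
    for key, value in row.items():
        if isinstance(value, str):
            row[key] = value.strip()

Flow(
    load('https://example.com/source.csv'),  # placeholder source
    normalise,
    dump_to_path('out'),    # writes the processed data plus a datapackage.json
    printer(),              # show a sample of the resulting rows
).process()
```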

That workflow is pretty close to what we've been doing, though our structural data validation has been taking place upon loading into the postgres DB -- but it looks like goodtables and/or the data validate tool could do the same kind of thing. We need to test the data for both syntactic and semantic correctness (or at least sanity...) is that a functionality that exists within these tools? Right now we're using pytest cases to not just test our code, but also to load the entire database and thus (incidentally) validate the data structure, but especially with the larger datasets that... just doesn't work. It was already taking 20 minutes to run the tests, and with the addition of our first large dataset, it bumped to 8 hours. We need to separate these different kinds of testing.

Yes, basically everyone who works with data has to do some version of ETL / ELT plus testing.

As you mention, as your data grows doing this implicitly (or explicitly) with a traditional DB can be painful - both from performance and because of centralization of the process.

Just like you, we use goodtables within a traditional testing framework. Goodtables is also built into DataHub.io, so we'll do goodtables validation for you automatically every time you push and then show the results - we find this very useful and liken it to "continuous integration", calling it "continuous validation".

We need to test the data for both syntactic and semantic correctness (or at least sanity...) is that a functionality that exists within these tools?

Yes: syntactic support is good and semantic support is reasonable (e.g. is this field value > 0, etc.). Fancier semantic testing might need some custom work (I believe you can plug this into goodtables, but I'd need to check with @roll).
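For example, a sketch of that kind of check with goodtables-py, where the file name and schema are hypothetical and the constraint stands in for a simple semantic rule:

```python
# Sketch: structural plus simple semantic validation with goodtables.
# The file name and schema are hypothetical.
from goodtables import validate

schema = {
    'fields': [
        {'name': 'plant_id', 'type': 'integer'},
        {'name': 'net_generation_mwh', 'type': 'number',
         'constraints': {'minimum': 0}},   # simple semantic rule: values must be >= 0
    ]
}

report = validate('generation.csv', schema=schema)
print(report['valid'])
```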

The data we're extracting from government sources is in some cases extremely dirty and poorly organized, so we're doing a lot of processing in Python before loading it, including starting to use a little bit of machine learning / and automated classification to try and tie together datasets which have no shared keys, and verify that we've done so correctly. The commercial data providers outsource this work to cheap labor overseas, who do it manually, but we don't have those resources so we're trying to automate it ,which seems like a valuable set of tools to develop for other purposes too!

This is very exciting - we've been doing this kind of work personally and with the community for many years cf https://datahub.io/docs/core-data

We know exactly how painful it is - and I agree with you that automation, machine learning plus community can make a big dent in this.

One of the reasons for working on the data factory - which is more a pattern/framework than a solution - is to try to have a common way to do this kind of stuff, so that the community can have their own way to do things but collaborate and share more effectively.

We liken it to LAMP for web development back in the day: it was a standard set of patterns for how to do web dev that allowed for reuse, scale etc.

Similarly we want a pattern for "data development" / DataOps / data engineering.

Zane Selvans
@zaneselvans
The commercial data providers are very expensive platform monopolies -- S&P Global Market Intelligence seems to just buy out any company that does this work, while increasing their subscription rates by double digit percentages every year. Even the public utility commissions often don't have access to the data, and end up relying on what is provided in proceedings from the utilities they are supposed to be regulating. We've seen that even a small amount of independent quantitative analysis can be enough to get a PUC to call the utility's bluff and demand more rigorous or different analyses.
This kind of integrated programmatic pipeline for open data seems way overdue. I hope it takes off. So much wasted time goes into scraping this stuff over and over again in a way that doesn't produce any cumulative value.
Zane Selvans
@zaneselvans
I noticed that you've got a few EIA datasets in the core data collection, which I imagine you're pulling using the EIA API. Interestingly, they do not provide the most fine grained, useful, economically actionable information in their API! For instance, in the Excel spreadsheets they publicize for EIA Form 923, there's fuel consumption at the individual boiler level, and net generation at the individual generator level, but in the API, they only give plant level data, and even in the spreadsheets, they don't publicize a usable set of boiler-generator associations, though one can be inferred from the other information provided. We've heard similar things reported from folks that work with Oil and Gas production data.
Rufus Pollock
@rufuspollock

The commercial data providers are very expensive platform monopolies -- S&P Global Market Intelligence seems to just buy out any company that does this work, while increasing their subscription rates by double digit percentages every year. Even the public utility commissions often don't have access to the data, and end up relying on what is provided in proceedings from the utilities they are supposed to be regulating. We've seen that even a small amount of independent quantitative analysis can be enough to get a PUC to call the utility's bluff and demand more rigorous or different analyses.

Yes!

This kind of integrated programmatic pipeline for open data seems way overdue. I hope it takes off. So much wasted time goes into scraping this stuff over and over again in a way that doesn't produce any cumulative value.

Yes, there are two parts here: the approach / pattern, plus a critical mass of collaborators maintaining this stuff. The two inter-connect: standard patterns help distributed communities collaborate ...

I noticed that you've got a few EIA datasets in the core data collection, which I imagine you're pulling using the EIA API. Interestingly, they do not provide the most fine grained, useful, economically actionable information in their API! For instance, in the Excel spreadsheets they publicize for EIA Form 923, there's fuel consumption at the individual boiler level, and net generation at the individual generator level, but in the API, they only give plant level data, and even in the spreadsheets, they don't publicize a usable set of boiler-generator associations, though one can be inferred from the other information provided. We've heard similar things reported from folks that work with Oil and Gas production data.

:thumbsup: this is super useful info and we'd love to pull more data from them (or collaborate with others doing that) and get it on the datahub (plus a github / gitlab repo for comments etc)

Zane Selvans
@zaneselvans
I don't know where y'all are located, but I've just become a wanderer. I'm working from near Zürich for the next two months, and will be in Berlin from Sep. 4 to 11th, partly for a meeting of European open energy data folks. I would love to connect with people working on this stuff in person if there are people in either of those cities that you could recommend.
Our (admittedly long term) goal is to liberate all the public utility data that the US collects -- FERC, EIA, EPA, the regional grid operators, and other agencies -- and integrate it together for anyone to use. We've been looking for the right way to redistribute it outside of the analytical platform, and data packages seem like the best option I've seen so far. https://github.com/catalyst-cooperative/pudl
Rufus Pollock
@rufuspollock

This is great - I'm based between (near) Paris and London. It's really great to be in contact :smile:

I've just added a datahub awesome-data issue: datahq/awesome-data#35 - please add to it and we can turn it into a page at https://datahub.io/awesome/ (this way people know what others are up to and we have a growing listing of material)

afkb for a bit ...
I also really like https://catalyst.coop/ and think your values and approach are very aligned with ours :smile:
Now really afkb!
Zane Selvans
@zaneselvans
:)
Zane Selvans
@zaneselvans
When wrangling data for publication to datahub.io is there a preference for adapting the packaging to reflect the raw data as provided, or modifying the data to reflect common conventions? E.g. if a government agency provides a data file with a strange text encoding or delimiter or date format, should the packaging simply reflect those choices, or is it better to change the underlying data to reflect more common choices (utf-8, commas as delimiters, ISO dates), and if the latter, is there a preferred universe of python packages that the data processing scripts should be limited to using?
Anuar Ustayev
@anuveyatsu

Hi @zaneselvans thanks for asking great questions :+1:

Generally, you don’t need to change the raw data - just provide all this information in the metadata (the datapackage.json file). If you’re using our data CLI tool, it should guess things like encoding, delimiters and date formats and reflect them in the generated descriptor file. I would suggest reading this blog post on initializing data packages - https://datahub.io/blog/how-to-initialize-a-data-package-using-data-tool - and I’d use interactive mode to control the process.
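As a sketch of that approach with datapackage-py - the file name, encoding, delimiter and date field here are hypothetical - the raw file stays untouched and its quirks are recorded in the descriptor:

```python
# Sketch: leave the raw file untouched and record its quirks in datapackage.json.
# The file name, encoding, delimiter and date field here are hypothetical.
from datapackage import Package

package = Package({'name': 'example-raw-data'})
package.add_resource({
    'name': 'mines',
    'path': 'raw/Mines.txt',
    'encoding': 'latin-1',           # the raw file's own encoding, recorded as-is
    'dialect': {'delimiter': '|'},   # the raw file's own delimiter, recorded as-is
    # unusual date formats can likewise be declared per-field in the schema,
    # e.g. {'name': 'inspection_date', 'type': 'date', 'format': '%m/%d/%Y'}
})
package.infer()                      # fill in the rest of the schema from the data
package.save('datapackage.json')
```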

@zaneselvans here are some useful examples:
Rufus Pollock
@rufuspollock
@zaneselvans hope that helped you. Also to clarify:
  • if you are doing your own wrangling, I'd suggest going with the raw Python libs like datapackage-py and dataflows and then just calling the data CLI tool at the end of it (from Python if you want - see the sketch after this list)
  • if it is pretty straightforward and you just want to publish, you can use data
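A minimal sketch of that shape - wrangle with the Python libraries, then hand the finished package to the data CLI - where the paths are hypothetical and the push command assumes the data CLI from https://datahub.io/download is installed:

```python
# Sketch of "wrangle with the Python libs, then call the data CLI at the end".
# Paths are hypothetical; the push step assumes the data CLI is installed.
import subprocess
from dataflows import Flow, load, dump_to_path

Flow(
    load('raw/source.csv'),     # placeholder raw input
    dump_to_path('dataset'),    # writes cleaned data plus datapackage.json
).process()

# Publish the finished package with the data CLI.
subprocess.run(['data', 'push', 'dataset'], check=True)
```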
Zane Selvans
@zaneselvans
@rufuspollock @anuveyatsu Yes, definitely helpful. Will continue working with the Python package, and see whether I can get the data package where I want it to be without having to modify any of the underlying data from MSHA or wrangle it in pandas.
Rufus Pollock
@rufuspollock
@zaneselvans yes - the key point is that you do the wrangling in your tool of choice, perhaps using the patterns developed in dataflows, and then to "push" to datahub you use data or the API directly.
Johan Richer
@johanricher
Hey guys, FYI following discussions we had with them a while back, our ex-colleagues at Etalab (the French open data agency) are starting to work with data packages:
https://github.com/opendatateam/datapackage-pipelines-udata
https://twitter.com/taniki/status/1035110812011126785
(uData is the software project powering data.gouv.fr)
Paul Walsh
@pwalsh
@johanricher great to hear!
Chris Hale
@chrispomeroyhale
Hi. I'm having trouble loading https://datahub.io/core -- I get a 502 from Cloudflare
Irakli Mchedlishvili
@zelima
@slythfox thanks for reporting. We are working on it
@slythfox should be fine now
Zane Selvans
@zaneselvans
@johanricher @pwalsh The French electricity grid operator (RTE France) has been enthusiastic about providing open data, and someone from their open data portal (named Hoang Nguyen) got an earful yesterday in support of Data Packages from the folks at the Open Power System Data project. It might be useful to connect someone from etalab with Nguyen at RTE, if they aren't in touch already.
Johan Richer
@johanricher
@zaneselvans sure, I'd be glad to help them navigate! Can you put me in contact?
Anuar Ustayev
@anuveyatsu

📰📢 Check out a list of core datasets that are updated on a regular basis:

https://datahub.io/blog/automatically-updated-core-datasets-on-datahub
Vaibhav Maheshwari
@vaibhavgeek
I have a CSV file that has been renamed because of a competition. I want to know which dataset that file actually belongs to.
Does anyone know a place where I can upload the CSV file and it will show me relevant results?
Rufus Pollock
@rufuspollock

@vaibhavgeek can you give a bit more detail on the issue with the file rename?

To upload a file: just follow the instructions here https://datahub.io/docs/getting-started/publishing-data

Stephen Abbott Pugh
@StephenAbbott

Hi there. Just been testing out Google's new Dataset Search and found some spam datasets uploaded to the old datahub.io around 2013.

Example = https://toolbox.google.com/datasetsearch/search?query=black%20site%3Adatahub.io&docid=hSZEp7J5ZDHbBETSAAAAAA%3D%3D

Where could/should I raise an issue to look at removing spam? Thanks

Rufus Pollock
@rufuspollock
@StephenAbbott flag here is perfect or on https://github.com/datahq/datahub-qa/issues