@akshayshende129 it is great that you’ve decided to learn Data Science :+1:
I’m not sure if we can help you with the learning path. However, you can find datasets to practice on at our website - https://datahub.io/machine-learning
We have also posted a short blog about machine learning here - https://datahub.io/blog/machine-learning-datasets
Hope all this will be helpful!
@zaneselvans great to hear from you and DataHub definitely sounds like a fit for what you are doing (which sounds exciting!)
We can definitely tell you more about Data Factory - which is the open source framework we've developed.
Regarding space: 100s of GBs would be fine (we back onto s3 etc).
In terms of pulling subsets, can you explain more about what you mean - e.g. do you want people to download just a single file, or even a sample of a file?
@rufuspollock On subsets, I mean, if we were to publish a dataset covering the entire US and all 20 years of time and it was 100GB, but someone just wanted the most recent year's worth of data for CA, it would be nice if they could specify that and just download the portion of the big dataset that's relevant to their work.
Got you, yes. There are a variety of ways to do that. The most natural would be to partition the initial dataset by year or similar, but there are other approaches.
Relatedly, right now our analysis and development is mostly happening through Jupyter Notebooks, and we're excited about being able to separate the data storage and access from the analytical applications, so people can use something like Binder to publicize an analysis that uses our code and the compiled data, and others can easily replicate and build upon that analysis without having to manage their own 100GB pile-o-data locally.
Yes, exactly. That sounds great.
Looking at the CKAN docs, it sounds like what I'm imagining might be implemented by the datastore extension?
The new DataHub.io is not built on CKAN but on a new stack built natively around Data Packages and the Frictionless Data toolchain. Data storage is native and implemented by default. If you want to play around now, you can just download the data command line tool and try "pushing" some data - https://datahub.io/download
@zaneselvans - great questions and keep 'em coming.
Ahh, I see, so datahub has diverged from CKAN. Was getting the sense playing with both that they were on different pages.
Yes, DataHub is CKAN "next gen" if you like. It is a modular, service-oriented architecture built around data packages from the ground up. A lot of the components can be (and some are) used with CKAN. We're also dedicated to running datahub.io as a reliable community-oriented data hub.
We're also looking at how this kind of data could be hosted within a computational cloud context (AWS Open Data) and partitioning by e.g. state & year would make that much more efficient and cost-effective as well.
The way DataHub is designed makes it very easy to add additional pipelines to push data to wherever you need (or to pull from where you want).
I guess a single Data Package could contain a bunch of individual resources, each of which is a year of data from a state. Is that a common approach?
Yes, exactly. That's a very sensible way to do it. The way DataHub is designed also means this could be automated - so you upload in bulk and then process out slices like this. Also with tools like PrestoDB (and AWS Athena) ad-hoc querying against the CSV is getting very easy.
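To make the "one resource per state and year" idea concrete, here is a minimal sketch using only the Python standard library. It is not the data tool or the datapackage library; the column names, the package name, and the `data/<state>-<year>.csv` paths are made up for illustration. It groups rows by state and year and builds a `datapackage.json`-style descriptor listing one resource per slice.

```python
import csv
import io
import json
from collections import defaultdict


def partition_rows(rows, keys=("state", "year")):
    """Group row dicts by the values of the given key columns."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return dict(groups)


def build_descriptor(name, groups):
    """Build a datapackage.json-style dict with one resource per partition."""
    resources = []
    for key in sorted(groups):
        slug = "-".join(str(part).lower() for part in key)
        resources.append({
            "name": slug,
            "path": f"data/{slug}.csv",  # hypothetical layout
            "format": "csv",
        })
    return {"name": name, "resources": resources}


def rows_to_csv(rows):
    """Serialize a list of row dicts to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up sample rows standing in for the big dataset.
sample = [
    {"state": "CA", "year": "2016", "net_generation_mwh": "120"},
    {"state": "CA", "year": "2017", "net_generation_mwh": "135"},
    {"state": "TX", "year": "2017", "net_generation_mwh": "210"},
]
groups = partition_rows(sample)
descriptor = build_descriptor("example-generation", groups)
print(json.dumps(descriptor, indent=2))
```

Someone who only wants CA for the most recent year would then fetch just the `ca-2017` resource instead of the whole package.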
Is there a preferred method for wrapping larger datasets in Data Packages? Not immediately seeing the documentation on multipart resources. I'll get the python package and command line tools installed and play with them.
Good question, and basically what you are asking for is a pattern for doing this. I do think partitioning datasets into sensible resources helps with local and remote management using the tools you already have, and also with updates - rather than syncing a 1GB file because of a change in one year, you can just push the year and state that changed (the data tool will do this automatically for you - i.e. avoid pushing unchanged resources).
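As a rough illustration of why partitioning helps with updates, the sketch below compares a content hash per resource against the hashes recorded at the last push and reports only the resources that differ. This is an assumption about how such a check could work, not a description of what the data tool does internally; the resource names and the manifest format are hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a stable content hash for one resource."""
    return hashlib.sha256(content).hexdigest()


def changed_resources(current, previous_manifest):
    """Return names of resources that are new or whose content changed.

    current: dict of resource name -> bytes
    previous_manifest: dict of resource name -> hex digest from the last push
    """
    return sorted(
        name for name, content in current.items()
        if previous_manifest.get(name) != fingerprint(content)
    )


# State at the last push (made-up data).
old = {
    "ca-2016": b"state,year\nCA,2016\n",
    "ca-2017": b"state,year\nCA,2017\n",
}
manifest = {name: fingerprint(data) for name, data in old.items()}

# Current state: one partition edited, one partition added.
new = dict(old)
new["ca-2017"] = b"state,year\nCA,2017\nCA,2017\n"
new["tx-2017"] = b"state,year\nTX,2017\n"

print(changed_resources(new, manifest))
```

Only `ca-2017` and `tx-2017` would need to be pushed; `ca-2016` is untouched.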
Up until now our workflow has been to pull the raw data, clean it up and integrate it using Python/pandas, populate a local postgres DB, and then output various data products as requested by users, or by creating a few stock interesting tabular outputs. The database was the primary product, and the data outputs an application that sits on top of it. But now (especially as we get into needing to use other platforms) after exploring a bit here, I'm wondering if we shouldn't invert that, and think of the cleaned and integrated data as the primary product, which we and others use to populate various analytical platforms -- maybe a database, maybe just some pandas dataframes, maybe something on AWS.
Yes, big :thumbsup: - i think that is probably more sustainable (i.e. scalable, adaptable, debuggable, modularizable).
It is also really interesting to hear about your approach and what you have been doing.
In this context I want to mention the Data Factory:
We have an open source pattern / framework called the Data Factory for doing data processing and integration. Currently in very active development / documentation.
This system is how we power the datahub and our own work prepping core datasets (https://datahub.io/docs/core-data) but is usable completely separately from the DataHub.io SaaS platform (of course you can publish to DataHub.io at the end and use DataHub.io for some integration / processing but you don't need to).
This seems like a beautifully designed system. Thank you for working on it!
Very curious to understand more about what's really inside the data factory.
@zaneselvans more about the conceptual ideas for the data factory in this recent post by our Technical Lead @akariv
There's a follow up tutorial here:
That workflow is pretty close to what we've been doing, though our structural data validation has been taking place upon loading into the postgres DB -- but it looks like goodtables and/or the data validate tool could do the same kind of thing. We need to test the data for both syntactic and semantic correctness (or at least sanity...) is that a functionality that exists within these tools? Right now we're using pytest cases to not just test our code, but also to load the entire database and thus (incidentally) validate the data structure, but especially with the larger datasets that... just doesn't work. It was already taking 20 minutes to run the tests, and with the addition of our first large dataset, it bumped to 8 hours. We need to separate these different kinds of testing.
Yes, basically everyone who works with data has to do some version of ETL / ELT + testing.
As you mention, as your data grows, doing this implicitly (or explicitly) with a traditional DB can be painful - both for performance reasons and because of centralization of the process.
Just like you we use goodtables within a traditional testing framework. Goodtables is also built into DataHub.io so we'll do goodtables validation for you automatically every time you push and then show the results - we find this very useful and liken it to "continuous integration" by calling it "continuous validation"
We need to test the data for both syntactic and semantic correctness (or at least sanity...) is that a functionality that exists within these tools?
Yes: syntactic support is good and semantic is reasonable (e.g. is this field value > 0 etc). Fancier semantic testing might need some custom work (I believe you can plug this into goodtables, but I'd need to check with @roll)
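To show the difference between the two kinds of check, here is a small standard-library sketch. It does not call goodtables; it only mimics the idea with a Table Schema style field list, where a failed type cast is a syntactic error and a violated `minimum` constraint is a simple semantic one. The field names and sample rows are hypothetical.

```python
from datetime import datetime

# Hypothetical schema in the style of Table Schema field descriptors.
SCHEMA = {
    "fields": [
        {"name": "plant_id", "type": "integer"},
        {"name": "report_date", "type": "date"},
        {"name": "net_generation_mwh", "type": "number",
         "constraints": {"minimum": 0}},
    ]
}

# Cast functions for the few types this sketch supports.
CASTERS = {
    "integer": int,
    "number": float,
    "date": lambda v: datetime.strptime(v, "%Y-%m-%d").date(),
}


def validate_rows(rows, schema):
    """Return a list of (row_number, field_name, error_code) tuples."""
    errors = []
    for row_number, row in enumerate(rows, start=1):
        for field in schema["fields"]:
            name = field["name"]
            try:
                value = CASTERS[field["type"]](row[name])
            except (KeyError, TypeError, ValueError):
                # Syntactic problem: missing column or value of the wrong type.
                errors.append((row_number, name, "type-or-missing"))
                continue
            minimum = field.get("constraints", {}).get("minimum")
            if minimum is not None and value < minimum:
                # Semantic problem: well-formed value that fails a constraint.
                errors.append((row_number, name, "below-minimum"))
    return errors


rows = [
    {"plant_id": "3", "report_date": "2017-01-31", "net_generation_mwh": "120.5"},
    {"plant_id": "x7", "report_date": "2017-02-28", "net_generation_mwh": "-4"},
    {"plant_id": "9", "report_date": "31/03/2017", "net_generation_mwh": "88"},
]
for error in validate_rows(rows, SCHEMA):
    print(error)
```

Checks like these run against the data files directly, so they can live outside the pytest suite that tests the code.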
The data we're extracting from government sources is in some cases extremely dirty and poorly organized, so we're doing a lot of processing in Python before loading it, including starting to use a little bit of machine learning and automated classification to try and tie together datasets which have no shared keys, and verify that we've done so correctly. The commercial data providers outsource this work to cheap labor overseas, who do it manually, but we don't have those resources so we're trying to automate it, which seems like a valuable set of tools to develop for other purposes too!
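A very simplified sketch of tying together records that have no shared key: normalize the names and pick the closest candidate by string similarity, rejecting anything below a threshold. This uses `difflib` from the standard library as a stand-in for the machine learning approach mentioned above, and the plant names and the 0.8 threshold are invented for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, strip basic punctuation, and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").replace(",", " ").split())


def best_match(name, candidates, threshold=0.8):
    """Return the most similar candidate, or None if nothing is close enough."""
    target = normalize(name)
    scored = [
        (SequenceMatcher(None, target, normalize(candidate)).ratio(), candidate)
        for candidate in candidates
    ]
    score, candidate = max(scored)
    return candidate if score >= threshold else None


# Hypothetical plant names from two sources with no common identifier.
source_b = ["Riverbend Generating Station", "Lakeside Station"]

print(best_match("Riverbend Generating Stn.", source_b))
print(best_match("Oakfield", source_b))
```

Returning `None` for weak matches leaves those records for manual review, which is one way to verify the automated links.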
This is very exciting - we've been doing this kind of work personally and with the community for many years cf https://datahub.io/docs/core-data
We know exactly how painful it is - and I agree with you that automation, machine learning plus community can make a big dent in this.
One of the reasons for working on the data factory - which is more a pattern/framework than a solution - is to try to have a common way to do this kind of stuff, so that the community can have their own way to do things but collaborate and share more effectively.
We liken it to LAMP for web development back in the day: it was a standard set of patterns for how to do web dev that allowed for reuse, scale etc.
Similarly we want a pattern for "data development" / DataOps / data engineering.
The commercial data providers are very expensive platform monopolies -- S&P Global Market Intelligence seems to just buy out any company that does this work, while increasing their subscription rates by double digit percentages every year. Even the public utility commissions often don't have access to the data, and end up relying on what is provided in proceedings from the utilities they are supposed to be regulating. We've seen that even a small amount of independent quantitative analysis can be enough to get a PUC to call the utility's bluff and demand more rigorous or different analyses.
This kind of integrated programmatic pipeline for open data seems way overdue. I hope it takes off. So much wasted time goes into scraping this stuff over and over again in a way that doesn't produce any cumulative value.
Yes, there are two parts here: the approach / pattern, plus a critical mass of collaborators maintaining this stuff. The two inter-connect: standard patterns help distributed communities collaborate ...
I noticed that you've got a few EIA datasets in the core data collection, which I imagine you're pulling using the EIA API. Interestingly, they do not provide the most fine grained, useful, economically actionable information in their API! For instance, in the Excel spreadsheets they publicize for EIA Form 923, there's fuel consumption at the individual boiler level, and net generation at the individual generator level, but in the API, they only give plant level data, and even in the spreadsheets, they don't publicize a usable set of boiler-generator associations, though one can be inferred from the other information provided. We've heard similar things reported from folks that work with Oil and Gas production data.
:thumbsup: this is super useful info and we'd love to pull more data from them (or collaborate with others doing that) and get it on the datahub (plus a github / gitlab repo for comments etc)
This is great - i'm based between (near) Paris and London. It's really great to be in contact :smile:
I've just added a datahub awesome-data issue item: datahq/awesome-data#35 - please add to it and we can turn it into a page at https://datahub.io/awesome/ (this way people know what others are up to and we have a growing listing of material)
Hi @zaneselvans thanks for asking great questions :+1:
Generally, you don’t need to change the raw data but provide all this information in the metadata (datapackage.json file). If you’re using our data CLI tool, it should guess things like encoding, delimiters and date formats and reflect them in the generated descriptor file. I would suggest reading this blog post re initializing data packages - https://datahub.io/blog/how-to-initialize-a-data-package-using-data-tool and I’d use interactive mode to control the process.
datapackage.json file to see how the data files are described
|delimited data on datahub - https://datahub.io/anuveyatsu/pipe-delimited
data CLI tool at the end of it (from python if you want)
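As an illustration of describing the raw file in metadata rather than editing it, here is a sketch of a descriptor for pipe-delimited data, plus a toy reader that honors it. The `encoding`, `dialect.delimiter` and date `format` properties follow the Data Package, CSV Dialect and Table Schema specs as I understand them; the resource name, field names and sample data are made up, and the reader is a stand-in, not the data tool.

```python
import csv
import io
from datetime import datetime

# Hypothetical descriptor: the raw file stays pipe-delimited with
# day-first dates, and the metadata records how to read it.
descriptor = {
    "name": "pipe-delimited-example",
    "resources": [{
        "name": "readings",
        "path": "readings.csv",
        "encoding": "utf-8",
        "dialect": {"delimiter": "|"},
        "schema": {"fields": [
            {"name": "station", "type": "string"},
            {"name": "date", "type": "date", "format": "%d/%m/%Y"},
            {"name": "value", "type": "number"},
        ]},
    }],
}

# Made-up raw content, left exactly as published.
raw = "station|date|value\nA1|31/01/2017|4.5\n"


def read_resource(text, resource):
    """Parse CSV text using the delimiter and field types from the descriptor."""
    delimiter = resource["dialect"]["delimiter"]
    fields = resource["schema"]["fields"]
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        parsed = dict(row)
        for field in fields:
            name = field["name"]
            if field["type"] == "date":
                parsed[name] = datetime.strptime(row[name], field["format"]).date()
            elif field["type"] == "number":
                parsed[name] = float(row[name])
        rows.append(parsed)
    return rows


rows = read_resource(raw, descriptor["resources"][0])
print(rows)
```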