@zaneselvans - great questions and keep 'em coming.
Ahh, I see, so datahub has diverged from CKAN. Was getting the sense playing with both that they were on different pages.
Yes, DataHub is CKAN "next gen" if you like. It has a modular, service-oriented architecture and is built around data packages from the ground up. A lot of the components can be (and some are) used with CKAN. We're also dedicated to running datahub.io as a reliable community-oriented data hub.
We're also looking at how this kind of data could be hosted within a computational cloud context (AWS Open Data) and partitioning by e.g. state & year would make that much more efficient and cost-effective as well.
The way DataHub is designed makes it very easy to add additional pipelines to push data to wherever you need (or to pull from where you want).
I guess a single Data Package could contain a bunch of individual resources, each of which is a year of data from a state. Is that a common approach?
Yes, exactly. That's a very sensible way to do it. The way DataHub is designed also means this could be automated - so you upload in bulk and then process out slices like this. Also with tools like PrestoDB (and AWS Athena) ad-hoc querying against the CSV is getting very easy.
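As a rough illustration of the layout discussed here, the sketch below builds a Data Package descriptor with one CSV resource per state and year. The package name, the state and year values, and the `data/<state>-<year>.csv` layout are assumptions made up for the example, not anything prescribed by DataHub.

```python
import json
from itertools import product


def build_descriptor(name, states, years):
    """Return a Data Package descriptor with one CSV resource per (state, year)."""
    resources = []
    for state, year in product(states, years):
        slug = f"{state.lower()}-{year}"      # resource names are lowercase slugs
        resources.append({
            "name": slug,
            "path": f"data/{slug}.csv",       # hypothetical file layout
            "format": "csv",
        })
    return {"name": name, "resources": resources}


# Hypothetical package name, states and years.
descriptor = build_descriptor("utility-data", ["CO", "WY"], [2016, 2017])
print(json.dumps(descriptor, indent=2))
```

Writing the printed JSON to `datapackage.json` next to the `data/` directory would give a package in which each slice can be updated on its own.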
Is there a preferred method for wrapping larger datasets in Data Packages? Not immediately seeing the documentation on multipart resources. I'll get the python package and command line tools installed and play with them.
Good question; basically what you are asking for is a pattern for doing this. I do think partitioning datasets into sensible resources helps with local and remote management using the tools you already have, and also with updates: rather than syncing a 1GB file because of a change in one year, you can just push the year and state that changed (the data tool will do this automatically for you, i.e. avoid pushing unchanged resources).
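The idea of skipping unchanged resources can be sketched with content hashes. This is only an illustration of the general technique, not how the data tool is implemented; the function names, file names and the in-memory manifest are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path):
    """Hash a file in chunks so large resources need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_resources(paths, previous_hashes):
    """Return (paths whose hash differs from the previous manifest, new manifest)."""
    current = {str(p): file_sha256(p) for p in paths}
    changed = [p for p, h in current.items() if previous_hashes.get(p) != h]
    return changed, current


# Demo with two throwaway files standing in for state/year resources.
with tempfile.TemporaryDirectory() as tmp:
    co = Path(tmp) / "co-2016.csv"
    wy = Path(tmp) / "wy-2016.csv"
    co.write_text("plant,mwh\n1,10\n")
    wy.write_text("plant,mwh\n2,20\n")

    first_push, manifest = changed_resources([co, wy], {})
    wy.write_text("plant,mwh\n2,25\n")  # only the Wyoming slice changes
    second_push, manifest = changed_resources([co, wy], manifest)

print(len(first_push), [Path(p).name for p in second_push])
```

On the first run everything counts as changed; on the second only the edited slice does, which is the saving that partitioning by state and year buys.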
Up until now our workflow has been to pull the raw data, clean it up and integrate it using Python/pandas, populate a local postgres DB, and then output various data products as requested by users, or by creating a few stock interesting tabular outputs. The database was the primary product, and the data outputs an application that sits on top of it. But now (especially as we get into needing to use other platforms) after exploring a bit here, I'm wondering if we shouldn't invert that, and think of the cleaned and integrated data as the primary product, which we and others use to populate various analytical platforms -- maybe a database, maybe just some pandas dataframes, maybe something on AWS.
Yes, big :thumbsup: - i think that is probably more sustainable (i.e. scalable, adaptable, debuggable, modularizable).
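To make the "data first, database second" idea concrete, here is a minimal sketch that treats a data package as the primary product and populates a throwaway SQLite database from it. It uses only the standard library, stores every column as text, and the package contents are invented for the demo; a real pipeline would more likely use the Frictionless tooling or pandas.

```python
import csv
import json
import sqlite3
import tempfile
from pathlib import Path


def load_package(descriptor_path, connection):
    """Load each CSV resource listed in a datapackage.json into its own table."""
    descriptor_path = Path(descriptor_path)
    descriptor = json.loads(descriptor_path.read_text(encoding="utf-8"))
    for resource in descriptor["resources"]:
        table = resource["name"].replace("-", "_").replace(".", "_")
        csv_path = descriptor_path.parent / resource["path"]
        with open(csv_path, newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader)
            columns = ", ".join(f'"{name}"' for name in header)
            placeholders = ", ".join("?" for _ in header)
            connection.execute(f'CREATE TABLE "{table}" ({columns})')
            connection.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})', reader
            )
    connection.commit()


# Demo: a one-resource package written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "co-2016.csv").write_text("plant_id,mwh\n1,10.5\n2,7\n", encoding="utf-8")
    (root / "datapackage.json").write_text(
        json.dumps({"name": "demo",
                    "resources": [{"name": "co-2016", "path": "co-2016.csv"}]}),
        encoding="utf-8",
    )
    connection = sqlite3.connect(":memory:")
    load_package(root / "datapackage.json", connection)
    rows = connection.execute(
        'SELECT plant_id, mwh FROM "co_2016" ORDER BY rowid'
    ).fetchall()

print(rows)
```

The same package could just as well feed pandas dataframes or a cloud query engine; the database becomes one of several disposable targets.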
It is also really interesting to hear about your approach and what you have been doing.
In this context I want to mention the Data Factory:
We have an open source pattern / framework called the Data Factory for doing data processing and integration. Currently in very active development / documentation.
This system is how we power the datahub and our own work prepping core datasets (https://datahub.io/docs/core-data) but is usable completely separately from the DataHub.io SaaS platform (of course you can publish to DataHub.io at the end and use DataHub.io for some integration / processing but you don't need to).
This seems like a beautifully designed system. Thank you for working on it!
Very curious to understand more about what's really inside the data factory.
@zaneselvans more about the conceptual ideas for the data factory in this recent post by our Technical Lead @akariv
There's a follow up tutorial here:
That workflow is pretty close to what we've been doing, though our structural data validation has been taking place upon loading into the postgres DB -- but it looks like goodtables and/or the data validate tool could do the same kind of thing. We need to test the data for both syntactic and semantic correctness (or at least sanity...) is that a functionality that exists within these tools? Right now we're using pytest cases to not just test our code, but also to load the entire database and thus (incidentally) validate the data structure, but especially with the larger datasets that... just doesn't work. It was already taking 20 minutes to run the tests, and with the addition of our first large dataset, it bumped to 8 hours. We need to separate these different kinds of testing.
Yes, basically everyone who works with data has to do some version of ETL / ELT + testing.
As you mention, as your data grows doing this implicitly (or explicitly) with a traditional DB can be painful - both from performance and because of centralization of the process.
Just like you, we use goodtables within a traditional testing framework. Goodtables is also built into DataHub.io, so we'll do goodtables validation for you automatically every time you push and then show the results. We find this very useful and, by analogy with "continuous integration", call it "continuous validation".
> We need to test the data for both syntactic and semantic correctness (or at least sanity...) is that a functionality that exists within these tools?
Yes: syntactic support is good and semantic support is reasonable (e.g. is this field value > 0). Fancier semantic testing might need some custom work (I believe you can plug this into goodtables, but I'd need to check with @roll).
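To illustrate the distinction between the two kinds of check, here is a small standalone sketch. It is not goodtables and does not use its API; it only mimics Table Schema-style field definitions, where `type` drives a syntactic check and a `minimum` constraint drives a simple semantic one. The field names and rows are invented.

```python
def check_rows(rows, fields):
    """Validate dict rows against a small Table Schema-like field list.

    Syntactic check: the raw value can be cast to the declared type.
    Semantic check: the cast value satisfies the `minimum` constraint, if any.
    """
    casts = {"integer": int, "number": float, "string": str}
    errors = []
    for row_number, row in enumerate(rows, start=1):
        for field in fields:
            raw = row[field["name"]]
            try:
                value = casts[field["type"]](raw)
            except ValueError:
                errors.append((row_number, field["name"], "type"))
                continue
            minimum = field.get("constraints", {}).get("minimum")
            if minimum is not None and value < minimum:
                errors.append((row_number, field["name"], "minimum"))
    return errors


# Hypothetical fields and rows.
fields = [
    {"name": "plant_id", "type": "integer"},
    {"name": "net_generation_mwh", "type": "number", "constraints": {"minimum": 0}},
]
rows = [
    {"plant_id": "3", "net_generation_mwh": "1200.5"},
    {"plant_id": "x7", "net_generation_mwh": "88"},   # not an integer
    {"plant_id": "9", "net_generation_mwh": "-4"},    # below the minimum
]
errors = check_rows(rows, fields)
print(errors)
```

Because checks like these run against the data files directly, they can be kept out of the code test suite and run only when the data changes.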
The data we're extracting from government sources is in some cases extremely dirty and poorly organized, so we're doing a lot of processing in Python before loading it, including starting to use a little bit of machine learning and automated classification to try and tie together datasets which have no shared keys, and verify that we've done so correctly. The commercial data providers outsource this work to cheap labor overseas, who do it manually, but we don't have those resources so we're trying to automate it, which seems like a valuable set of tools to develop for other purposes too!
This is very exciting - we've been doing this kind of work personally and with the community for many years cf https://datahub.io/docs/core-data
We know exactly how painful it is - and I agree with you that automation, machine learning plus community can make a big dent in this.
One of the reasons for working on the data factory, which is more a pattern/framework than a solution, is to try to have a common way to do this kind of stuff so that the community can have their own way to do things but collaborate and share more effectively.
We liken it to LAMP for web development back in the day: it was a standard set of patterns for how to do web dev that allowed for reuse, scale etc.
Similarly we want a pattern for "data development" / DataOps / data engineering.
The commercial data providers are very expensive platform monopolies -- S&P Global Market Intelligence seems to just buy out any company that does this work, while increasing their subscription rates by double digit percentages every year. Even the public utility commissions often don't have access to the data, and end up relying on what is provided in proceedings from the utilities they are supposed to be regulating. We've seen that even a small amount of independent quantitative analysis can be enough to get a PUC to call the utility's bluff and demand more rigorous or different analyses.
This kind of integrated programmatic pipeline for open data seems way overdue. I hope it takes off. So much wasted time goes into scraping this stuff over and over again in a way that doesn't produce any cumulative value.
Yes, there are two parts here: the approach / pattern, plus a critical mass of collaborators maintaining this stuff. The two inter-connect: standard patterns help distributed communities collaborate ...
I noticed that you've got a few EIA datasets in the core data collection, which I imagine you're pulling using the EIA API. Interestingly, they do not provide the most fine grained, useful, economically actionable information in their API! For instance, in the Excel spreadsheets they publicize for EIA Form 923, there's fuel consumption at the individual boiler level, and net generation at the individual generator level, but in the API, they only give plant level data, and even in the spreadsheets, they don't publicize a usable set of boiler-generator associations, though one can be inferred from the other information provided. We've heard similar things reported from folks that work with Oil and Gas production data.
:thumbsup: this is super useful info and we'd love to pull more data from them (or collaborate with others doing that) and get it on the datahub (plus a github / gitlab repo for comments etc)
This is great - i'm based between (near) Paris and London. It's really great to be in contact :smile:
I've just added a datahub awesome-data issue item: datahq/awesome-data#35 - please add to it and we can turn it into a page at https://datahub.io/awesome/ (this way people know what others are up to and we have a growing listing of material)
Hi @zaneselvans thanks for asking great questions :+1:
Generally, you don’t need to change the raw data but can provide all this information in the metadata (datapackage.json file). If you’re using our data CLI tool, it should guess things like encoding, delimiters and date formats and reflect them in the generated descriptor file. I would suggest reading this blog post on initializing data packages - https://datahub.io/blog/how-to-initialize-a-data-package-using-data-tool - and I’d use interactive mode to control the process.
datapackage.json file to see how the data files are described
| delimited data on datahub - https://datahub.io/anuveyatsu/pipe-delimited
data CLI tool at the end of it (from python if you want)
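As a sketch of describing quirks in the metadata rather than editing the raw file, the resource entry below declares a pipe delimiter, an encoding and a non-default date pattern, and a few lines of standard-library Python read the data accordingly. The resource name, path, field names and sample data are invented; the `dialect`, `encoding` and field `format` properties follow the Data Package and Table Schema specs as I understand them, so check the generated descriptor from the data tool for the exact shape it produces.

```python
import csv
import io
from datetime import datetime

# Hypothetical resource entry from a datapackage.json.
resource = {
    "name": "pipe-delimited",
    "path": "data.csv",
    "format": "csv",
    "encoding": "utf-8",
    "dialect": {"delimiter": "|"},
    "schema": {
        "fields": [
            {"name": "id", "type": "integer"},
            {"name": "reported", "type": "date", "format": "%d/%m/%Y"},
        ]
    },
}


def read_rows(text, resource):
    """Parse delimited text using the delimiter declared in the resource."""
    delimiter = resource.get("dialect", {}).get("delimiter", ",")
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


raw = "id|reported\n1|31/12/2017\n"   # stands in for the untouched raw file
rows = read_rows(raw, resource)

# Use the declared date pattern instead of rewriting the source data.
date_format = resource["schema"]["fields"][1]["format"]
parsed = datetime.strptime(rows[0]["reported"], date_format).date()
print(rows, parsed)
```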
@vaibhavgeek can you give a bit more detail on the issue with file rename.
To upload a file: just follow the instructions here https://datahub.io/docs/getting-started/publishing-data
Hi there. Just been testing out Google's new Dataset Search and found some spam datasets uploaded to the old datahub.io around 2013.
Where could/should I raise an issue to look at removing spam? Thanks
See screenshot above and visit the page:
Do folks have a favorite easy to use package for visualizing and filtering data that's accessible via data packages? Something that a relative layperson could use?
The perfect thing would be something that already ingests tabular data but is made Data Package aware. Right now you can fall back to anything that can ingest CSV (which is pretty much all tools). I can suggest some tools for playing with data that would suit (and we could think about how to plug in Data Package support, as we have with e.g. pandas).
Is there a recommended maximum file size for use with tabular data resources? When running
No there is no limit for tabular data packages. This is a bug with data validate - can you open an issue on https://github.com/datahq/data-cli
I think you can use either route and for bigger packages goodtables may be better (and is used internally).
My other question here is whether any of the files can be chunked/partitioned - frictionlessdata/specs#620
I wanted to update our datasets on datahub.io/johnsnowlabs
When pushing the dataset this is what I got:
> Error! Max storage for user exceeded plan limit (5000MB)
However the total size of the data that has been uploaded is ~200MB