Zane Selvans
@zaneselvans
This kind of integrated programmatic pipeline for open data seems way overdue. I hope it takes off. So much wasted time goes into scraping this stuff over and over again in a way that doesn't produce any cumulative value.
Zane Selvans
@zaneselvans
I noticed that you've got a few EIA datasets in the core data collection, which I imagine you're pulling using the EIA API. Interestingly, they do not provide the most fine-grained, useful, economically actionable information in their API! For instance, in the Excel spreadsheets they publish for EIA Form 923, there's fuel consumption at the individual boiler level and net generation at the individual generator level, but in the API they only give plant-level data, and even in the spreadsheets they don't publish a usable set of boiler-generator associations, though one can be inferred from the other information provided. We've heard similar things reported from folks who work with oil and gas production data.
Rufus Pollock
@rufuspollock

> The commercial data providers are very expensive platform monopolies -- S&P Global Market Intelligence seems to just buy out any company that does this work, while increasing their subscription rates by double-digit percentages every year. Even the public utility commissions often don't have access to the data, and end up relying on what is provided in proceedings by the utilities they are supposed to be regulating. We've seen that even a small amount of independent quantitative analysis can be enough to get a PUC to call the utility's bluff and demand more rigorous or different analyses.

Yes!

> This kind of integrated programmatic pipeline for open data seems way overdue. I hope it takes off. So much wasted time goes into scraping this stuff over and over again in a way that doesn't produce any cumulative value.

Yes, there are two parts here: the approach / pattern, plus a critical mass of collaborators maintaining this stuff. The two inter-connect: standard patterns help distributed communities collaborate ...

> I noticed that you've got a few EIA datasets in the core data collection, which I imagine you're pulling using the EIA API. Interestingly, they do not provide the most fine-grained, useful, economically actionable information in their API! For instance, in the Excel spreadsheets they publish for EIA Form 923, there's fuel consumption at the individual boiler level and net generation at the individual generator level, but in the API they only give plant-level data, and even in the spreadsheets they don't publish a usable set of boiler-generator associations, though one can be inferred from the other information provided. We've heard similar things reported from folks who work with oil and gas production data.

:thumbsup: this is super useful info and we'd love to pull more data from them (or collaborate with others doing that) and get it on the datahub (plus a github / gitlab repo for comments etc)

Zane Selvans
@zaneselvans
I don't know where y'all are located, but I've just become a wanderer. I'm working from near Zürich for the next two months, and will be in Berlin from Sep. 4 to 11, partly for a meeting of European open energy data folks. I would love to connect in person with anyone working on this stuff, if there are people in either of those cities you could recommend.
Our (admittedly long-term) goal is to liberate all the public utility data that the US collects -- FERC, EIA, EPA, the regional grid operators, and other agencies -- and integrate it for anyone to use. We've been looking for the right way to redistribute it outside of the analytical platform, and data packages seem like the best option I've seen so far. https://github.com/catalyst-cooperative/pudl
Rufus Pollock
@rufuspollock

This is great - i'm based between (near) Paris and London. It's really great to be in contact :smile:

I've just added a datahub awesome-data issue: datahq/awesome-data#35 - please add to it and we can turn it into a page at https://datahub.io/awesome/ (this way people know what others are up to and we have a growing list of material)

afkb for a bit ...
I also really like https://catalyst.coop/ and think your values and approach are very much aligned :smile:
Now really afkb!
Zane Selvans
@zaneselvans
:)
Zane Selvans
@zaneselvans
When wrangling data for publication to datahub.io is there a preference for adapting the packaging to reflect the raw data as provided, or modifying the data to reflect common conventions? E.g. if a government agency provides a data file with a strange text encoding or delimiter or date format, should the packaging simply reflect those choices, or is it better to change the underlying data to reflect more common choices (utf-8, commas as delimiters, ISO dates), and if the latter, is there a preferred universe of python packages that the data processing scripts should be limited to using?
Anuar Ustayev
@anuveyatsu

Hi @zaneselvans thanks for asking great questions :+1:

Generally, you don't need to change the raw data; instead, provide all of this information in the metadata (the datapackage.json file). If you're using our data CLI tool, it should guess things like encoding, delimiters and date formats and reflect them in the generated descriptor file. I would suggest reading this blog post on initializing data packages - https://datahub.io/blog/how-to-initialize-a-data-package-using-data-tool - and I'd use interactive mode to control the process.

@zaneselvans here are some useful examples:
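
A minimal sketch of the approach described above, assuming the datapackage-py library; the file name and the specific encoding/delimiter values are hypothetical:

```python
# Infer a descriptor from the raw file instead of rewriting the file itself,
# then record its quirks (encoding, delimiter) in datapackage.json.
from datapackage import Package

package = Package()
package.infer('raw/msha_mines.csv')          # hypothetical file; guesses schema and dialect

resource = package.descriptor['resources'][0]
resource['encoding'] = 'windows-1252'        # declare an unusual encoding rather than converting the file
resource.setdefault('dialect', {})['delimiter'] = ';'

package.commit()                             # apply the descriptor edits
package.save('datapackage.json')
```
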
Rufus Pollock
@rufuspollock
@zaneselvans hope that helped you. Also to clarify:
  • if you are doing your own wrangling I'd suggest going with the raw Python libs like datapackage-py and dataflows and then just calling the data CLI tool at the end (from Python if you want) - see the sketch after this list
  • if it is pretty straightforward and you just want to publish, you can use data
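
A rough sketch of the first route, assuming the dataflows library mentioned above; the file, field and format values are hypothetical:

```python
# Minimal dataflows pipeline: load a raw CSV, declare the correct types in the
# metadata, and write a data package (datapackage.json + data) to a directory.
from dataflows import Flow, load, set_type, dump_to_path

Flow(
    load('raw/generation.csv'),                                # hypothetical source file
    set_type('report_date', type='date', format='%m/%d/%Y'),   # the agency's date format
    dump_to_path('generation-package'),                        # writes datapackage.json alongside the data
).process()

# From here you could publish with the data CLI (e.g. running `data push`
# in the output directory) - see the datahub publishing docs for details.
```
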
Zane Selvans
@zaneselvans
@rufuspollock @anuveyatsu Yes, definitely helpful. Will continue working with the Python package, and see whether I can get the data package where I want it to be without having to modify any of the underlying data from MSHA or wrangle it in pandas.
Rufus Pollock
@rufuspollock
@zaneselvans yes - the key point is that you do the wrangling in your tool of choice, perhaps using the patterns developed in dataflows, and then to "push" to datahub you use data or the API directly.
Johan Richer
@johanricher
Hey guys, FYI following discussions we had with them a while back, our ex-colleagues at Etalab (the French open data agency) are starting to work with data packages:
https://github.com/opendatateam/datapackage-pipelines-udata
https://twitter.com/taniki/status/1035110812011126785
(uData is the software project powering data.gouv.fr)
Paul Walsh
@pwalsh
@johanricher great to hear!
Chris Hale
@chrispomeroyhale
Hi. I'm having trouble loading https://datahub.io/core -- I get a 502 from Cloudflare
Irakli Mchedlishvili
@zelima
@slythfox thanks for reporting. We are working on it
@slythfox should be fine now
Zane Selvans
@zaneselvans
@johanricher @pwalsh The French electricity grid operator (RTE France) has been enthusiastic about providing open data, and someone from their open data portal (named Hoang Nguyen) got an earful yesterday in support of Data Packages from the folks at the Open Power System Data project. It might be useful to connect someone from etalab with Nguyen at RTE, if they aren't in touch already.
Johan Richer
@johanricher
@zaneselvans sure, I'd be glad to help them navigate! Can you put me in contact?
Anuar Ustayev
@anuveyatsu

📰📢 Check out a list of core datasets that are updated on a regular basis:

https://datahub.io/blog/automatically-updated-core-datasets-on-datahub
Vaibhav Maheshwari
@vaibhavgeek
I have a CSV file that has been renamed because of a competition, and I want to know which dataset the file actually belongs to.
Does anyone know a place where I can upload the CSV file and have it show me relevant results?
Rufus Pollock
@rufuspollock

@vaibhavgeek can you give a bit more detail on the issue with the file rename?

To upload a file: just follow the instructions here https://datahub.io/docs/getting-started/publishing-data

Stephen Abbott Pugh
@StephenAbbott

Hi there. Just been testing out Google's new Dataset Search and found some spam datasets uploaded to the old datahub.io around 2013.

Example = https://toolbox.google.com/datasetsearch/search?query=black%20site%3Adatahub.io&docid=hSZEp7J5ZDHbBETSAAAAAA%3D%3D

Where could/should I raise an issue to look at removing spam? Thanks

Rufus Pollock
@rufuspollock
@StephenAbbott flagging it here is perfect, or you can open an issue at https://github.com/datahq/datahub-qa/issues
Rufus Pollock
@rufuspollock
@StephenAbbott did you manage to flag this?
Rufus Pollock
@rufuspollock
(screenshot)
Rufus Pollock
@rufuspollock

:newspaper: "Awesome" page renamed to collections and made beautiful

See screenshot above and visit the page:

https://datahub.io/collections

Zane Selvans
@zaneselvans
Do folks have a favorite easy to use package for visualizing and filtering data that's accessible via data packages? Something that a relative layperson could use?
Zane Selvans
@zaneselvans
Is there a recommended maximum file size for use with tabular data resources? When running data validate I get a warning about a memory leak. On a 30MB resource it seems to work fine, but on a 160MB resource I eventually get a core dump, with JavaScript running out of memory on a machine with 24GB of RAM. For larger data packages, does it make more sense to use the Python goodtables package for validation instead of the Node.js-based command-line tool?
Rufus Pollock
@rufuspollock

> Do folks have a favorite easy to use package for visualizing and filtering data that's accessible via data packages? Something that a relative layperson could use?

The perfect thing would be something that already ingests tabular data but is made Data Package aware. Right now you can fall back to anything that can ingest CSV (which is pretty much all tools). I can suggest some tools for playing with data that would suit (and we could think about how to plug in Data Package support, as we have with e.g. pandas, etc.).

> Is there a recommended maximum file size for use with tabular data resources? When running data validate I get a warning about a memory leak. On a 30MB resource it seems to work fine, but on a 160MB resource I eventually get a core dump, with JavaScript running out of memory on a machine with 24GB of RAM. For larger data packages, does it make more sense to use the Python goodtables package for validation instead of the Node.js-based command-line tool?

No, there is no limit for tabular data packages. This is a bug with data validate - can you open an issue on https://github.com/datahq/data-cli

I think you can use either route and for bigger packages goodtables may be better (and is used internally).

My other question here is whether any of the files can be chunked/partitioned - frictionlessdata/specs#620
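
For the larger files, here is a sketch of the goodtables route mentioned above, assuming the goodtables-py package:

```python
# Validate every tabular resource listed in datapackage.json with goodtables-py.
from goodtables import validate

report = validate('datapackage.json')   # the data package preset is inferred from the descriptor
print('valid:', report['valid'])
for table in report.get('tables', []):
    for error in table.get('errors', []):
        print(table.get('source'), error.get('code'), error.get('message'))

# Note: goodtables checks a limited sample of rows by default; its row limit
# option can be raised if you want to scan more of a large resource.
```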

M. Ali Naqvi
@MAliNaqvi

Hi folks,

I wanted to update our datasets on datahub.io/johnsnowlabs

When pushing the dataset this is what I got:

> Error! Max storage for user exceeded plan limit (5000MB)

However the total size of the data that has been uploaded is ~200MB

Rufus Pollock
@rufuspollock
@MAliNaqvi we'll need to fix that - probably the total size of the other datasets exceeds 5GB
Rufus Pollock
@rufuspollock
@MAliNaqvi fixed
M. Ali Naqvi
@MAliNaqvi
@rufuspollock The problem is still there. Is there any further information I can provide?
Irakli Mchedlishvili
@zelima
This message was deleted
Irakli Mchedlishvili
@zelima
@MAliNaqvi is it still saying 5000?
Irakli Mchedlishvili
@zelima
@MAliNaqvi never mind, found the problem
Rufus Pollock
@rufuspollock
@zelima is it fixed?
Irakli Mchedlishvili
@zelima
@rufuspollock I've sent instructions to fix this privately.
@MAliNaqvi can you confirm it's fixed now?
M. Ali Naqvi
@MAliNaqvi
Not yet. Still need some support around using the updated config.
M. Ali Naqvi
@MAliNaqvi
@zelima sent you a private message
M. Ali Naqvi
@MAliNaqvi
@zelima the issue has been resolved. Thank you!