These are chat archives for frictionlessdata/chat

3rd Sep 2018
Zane Selvans
@zaneselvans
Sep 03 2018 08:06
To get familiar with the data package tools and infrastructure, I'm packaging some US federal government data for datahub.io using Python, and am wondering where I ought to offer comments/feedback/questions. For example, the data I'm packaging is available for download from MSHA as (sigh) zipped, pipe (|) delimited, ISO-8859-1 encoded files, which seems to mean that the command line dataflows tools can't deal with it, but pandas' read_csv() is happy to gobble up compressed files and lets you set the encoding and an alternate delimiter, and it would be nice to have the same kind of functionality here. Or to be able to initialize a package from a DataFrame rather than having to write it to a CSV first and then use that. Or... they also provide a file with descriptions of each of the fields included in the data, keyed by the name of the field/column (which I imagine is pretty common?), and it would be great if there were a way to take that information and automatically populate a 'description' attribute for each of the corresponding fields based on their 'name' in the tabular data resource.
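For concreteness, something roughly like this is what I have in mind with pandas plus datapackage-py (just a sketch; the MSHA file names and the definition file's column names below are guesses at the actual layout):

import pandas as pd
from datapackage import Package, infer

# pandas reads the zipped, pipe-delimited, ISO-8859-1 file in one call
mines = pd.read_csv('Mines.zip', sep='|', encoding='ISO-8859-1', compression='zip')

# the MSHA field-definition file maps column names to descriptions;
# 'COLUMN_NAME' and 'FIELD_DESCRIPTION' are assumed column names
defs = pd.read_csv('Mines_Definition_File.txt', sep='|', encoding='ISO-8859-1')
descriptions = dict(zip(defs['COLUMN_NAME'], defs['FIELD_DESCRIPTION']))

# today you'd still have to round-trip through a CSV to build a package
mines.to_csv('mines.csv', index=False)

# after inferring a package from the CSV, copy each description into the
# table schema, matching on the field name
package = Package(infer('mines.csv'))
for field in package.descriptor['resources'][0]['schema']['fields']:
    if field['name'] in descriptions:
        field['description'] = descriptions[field['name']]
package.commit()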
roll
@roll
Sep 03 2018 08:12
@zaneselvans I'm happy to help - can you please share some example datasets? That will help me propose better ideas.
Also, as you have already discovered, https://gitter.im/datahubio/chat is an awesome source for help too (esp. for datahub.io and dataset questions).
Zane Selvans
@zaneselvans
Sep 03 2018 08:15
I almost asked over there, but I figured this was more related to working with the Data Package standards and software than to datahub.io in particular; if you think that's a more appropriate forum, though, I'm happy to jump over there.
I picked the MSHA data since it seems to conform most closely to the expectations of the datahub resources. It's tabular, small-to-medium sized, pretty clean already, and it's basically a collection of database dumps that can be re-assembled into a complete database given the appropriate schema in a data package. That, and we haven't integrated it into our postgres DB yet, so it's also useful work for us.
roll
@roll
Sep 03 2018 08:33

@zaneselvans In this case we can use the datapackage-py library directly to prepare the data for publishing. There is rich documentation with real-world examples. E.g. in our case I would do something like this:

from pprint import pprint
from datapackage import Package, infer

# infer a data package descriptor (resources + table schema) from the CSV
package = Package(infer('tmp/mines/mines.csv'))

# the file is pipe-delimited and ISO-8859-1 encoded, so record that
# on the resource descriptor
package.descriptor['resources'][0].update({
    'encoding': 'ISO-8859-1',
    'dialect': {
        'delimiter': '|',
    }
})
package.commit()

# save the package (datapackage.json + data) and sanity-check the result
package.save('mines.zip')
pprint(package.descriptor)
pprint(package.get_resource('mines').read(keyed=True, limit=10))

I extracted mines.zip and updated the file extension first, but that step can also be automated with Python (sketch below). After running the script we have a datapackage with the data file and the inferred metadata (incl. a table schema).
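A minimal sketch of that extraction/rename step with the standard library (the archive path and extracted file name here are just placeholders):

import os
import zipfile

# pull the data file out of the MSHA zip archive
with zipfile.ZipFile('tmp/mines/Mines.zip') as archive:
    member = archive.namelist()[0]  # e.g. 'Mines.txt'
    archive.extract(member, path='tmp/mines')

# rename it to .csv so the inferred resource gets a sensible path/format
os.rename(os.path.join('tmp/mines', member), 'tmp/mines/mines.csv')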