These are chat archives for frictionlessdata/chat

18th
Oct 2017
Stephen Gates
@Stephen-Gates
Oct 18 2017 05:50
Hi @bruth provenance in a data package is usually provided in the readme.md. Here's a sample I'm using. Of course you could write anything in the markdown file. I'm intrigued about how you can derive provenance from a data resource. How could you determine what processing has been done by just looking at the end result?
The readme.md is a file included in the datapackage.zip, see http://specs.frictionlessdata.io/data-package/#illustrative-structure
Byron Ruth
@bruth
Oct 18 2017 10:53
Thanks @Stephen-Gates. All that can be derived are changes from one version to the next (more rows, changed values, etc.). You are correct in that the intent/cause of the change is not known unless you have the context. For my use case, I will have this information since new revisions of a dataset will prompt the user (committing the new version) for a reason.
I am evaluating FD for the specs and tooling as the basis for a "data sharing platform" within my org. I have come across other specs in the past, but FD feels the most nimble and active. Extensibility is important since we may need to add additional metadata specific to my org. It appears that this is allowed within the specs.
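A rough sketch of the kind of version-to-version comparison described above (more rows, changed values), assuming both versions are CSV text that share a unique key column. The function name `diff_versions`, the column names, and the sample data are hypothetical, and this is plain Python rather than any Frictionless Data tooling:

```python
import csv
import io


def diff_versions(old_csv, new_csv, key):
    """Compare two CSV texts keyed on `key` and report what changed.

    Assumes `key` is present and unique in both versions (hypothetical setup).
    """
    old = {row[key]: row for row in csv.DictReader(io.StringIO(old_csv))}
    new = {row[key]: row for row in csv.DictReader(io.StringIO(new_csv))}
    return {
        # Keys that only exist in the new version
        "added": sorted(k for k in new if k not in old),
        # Keys that only exist in the old version
        "removed": sorted(k for k in old if k not in new),
        # Keys present in both versions whose row values differ
        "changed": sorted(k for k in new if k in old and new[k] != old[k]),
    }


# Hypothetical sample data: row 2 changes a value, row 3 is new
v1 = "id,name,value\n1,a,10\n2,b,20\n"
v2 = "id,name,value\n1,a,10\n2,b,25\n3,c,30\n"
result = diff_versions(v1, v2, "id")
print(result)
```

This only reports what changed. The reason for the change still has to come from the person committing the new version.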
Rufus Pollock
@rufuspollock
Oct 18 2017 10:56

Good afternoon. I am reviewing the various specifications and had two questions. First have you come across a use case where a "query" is being represented as a data resource?

@bruth yes we've definitely thought about this use case. You could definitely use it this way.

And second, are there any support/examples for including and/or deriving provenance (PROV or otherwise) from data resources?

You can use the sources attribute. Would you want more than that?

@bruth and welcome - great to have your questions and interest :-)
@bruth and if you are interested in a data package oriented "data sharing platform" you can check out https://datahub.io/
Byron Ruth
@bruth
Oct 18 2017 11:02
@rufuspollock Thanks! Treating a query as a data resource is sort of weird, and may be more appropriate as provenance itself for the dataset being produced. The sources and contributors attributes are a good start for provenance. I need to evaluate to what extent I need/want to embed all provenance information in the datapackage.json or if I would reference a changelog/PROV graph of sorts.
datahub looks very nice. I like how a dataset is presented. For my use case, this would be internal to the organization, so unfortunately I can't use this hosted version.
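For reference, a minimal sketch of what embedding provenance in the datapackage.json could look like, using the sources and contributors attributes mentioned above plus one org-specific property. The dataset name, paths, people, and the `revisions` property are all hypothetical; `revisions` is not part of the Data Package spec and only illustrates adding custom metadata:

```python
import json

# Hypothetical descriptor; every name, path and URL below is a placeholder.
descriptor = {
    "name": "example-dataset",
    "resources": [
        {"name": "records", "path": "data/records.csv"},
    ],
    # Where the data came from
    "sources": [
        {"title": "Internal warehouse export", "path": "https://example.org/warehouse"},
    ],
    # Who worked on the package
    "contributors": [
        {"title": "Jane Doe", "role": "maintainer"},
    ],
    # Custom, org-specific property (an assumption, not defined by the spec):
    # the reason a user gives when committing a new version.
    "revisions": [
        {"version": "1.1.0", "reason": "Corrected values for row 2"},
    ],
}

text = json.dumps(descriptor, indent=2)
print(text)
```

The alternative would be to keep the descriptor small and have it point at an external changelog or PROV graph instead of carrying the full history inline.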
Stephen Gates
@Stephen-Gates
Oct 18 2017 11:22
@bruth if you need an internal solution to create data packages, you may be interested in a project I'm leading http://data-curator.io - work in progress - v1.0.0 due before Christmas
Byron Ruth
@bruth
Oct 18 2017 11:27
@Stephen-Gates This looks very promising. I am going to try it out. My org will be hiring a few library scientists to help in the data curation/documentation process of datasets. This could be a useful tool for them to assist in this process.
Stephen Gates
@Stephen-Gates
Oct 18 2017 11:31
@bruth The current release can open, edit, and save data, guess or set column properties, and validate data. These milestones describe this year's plan and we're seeking funding for version 2.