Stephen Gates
@Stephen-Gates
Logged at frictionlessdata/specs#541
Byron Ruth
@bruth
Good afternoon. I am reviewing the various specifications and had two questions. First have you come across a use case where a "query" is being represented as a data resource? The assumption being that the dataset is a function of the query at the time it is executed. And second, are there any support/examples for including and/or deriving provenance (PROV or otherwise) from data resources?
Stephen Gates
@Stephen-Gates
Hi @bruth provenance in a data package is usually provided in the readme.md. Here's a sample I'm using. Of course you could write anything in the markdown file. I'm intrigued about how you can derive provenance from a data resource. How could you determine what processing has been done by just looking at the end result?
The readme.md is a file included in the datapackage.zip http://specs.frictionlessdata.io/data-package/#illustrative-structure
Byron Ruth
@bruth
Thanks @Stephen-Gates. All that can be derived are changes from one version to the next (more rows, changed values, etc.). You are correct in that the intent/cause of the change is not known unless you have the context. For my use case, I will have this information since new revisions of a dataset will prompt the user (committing the new version) for a reason.
I am evaluating FD for the specs and tooling as the basis for a "data sharing platform" within my org. I have come across other specs in the past, but FD feels the most nimble and active. Extensibility is important since we may need to add additional metadata specific to my org. It appears that this is allowed within the specs.
Rufus Pollock
@rufuspollock

Good afternoon. I am reviewing the various specifications and had two questions. First have you come across a use case where a "query" is being represented as a data resource?

@bruth yes we've definitely thought about this use case. You could definitely use it this way.

And second, are there any support/examples for including and/or deriving provenance (PROV or otherwise) from data resources?

You can use the sources attribute. Would you want more than that?

@bruth and welcome - great to have your questions and interest :-)
@bruth and if you are interested in a data package oriented "data sharing platform" you can check out https://datahub.io/
Byron Ruth
@bruth
@rufuspollock Thanks! Treating a query as a data resource is sort of weird, and may be more appropriate as provenance itself for the dataset being produced. The sources and contributors attributes are a good start for provenance. I need to evaluate to what extent I need/want to embed all provenance information in the datapackage.json, or whether I would reference a changelog/PROV graph of sorts.
Datahub looks very nice. I like how a dataset is presented. For my use case, this would be internal to the organization, so unfortunately I can't use the hosted version.
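For reference, the sources and contributors attributes discussed here are part of the Data Package spec; a minimal sketch of how they might sit in a datapackage.json (the package name, titles, and URL are illustrative):

```json
{
  "name": "example-package",
  "sources": [
    {
      "title": "Example Statistics Office",
      "path": "https://example.org/statistics"
    }
  ],
  "contributors": [
    {
      "title": "Jane Doe",
      "role": "author"
    }
  ]
}
```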
Stephen Gates
@Stephen-Gates
@bruth if you need an internal solution to create data packages, you may be interested in a project I'm leading http://data-curator.io - work in progress - v1.0.0 due before Christmas
Byron Ruth
@bruth
@Stephen-Gates This looks very promising. I am going to try it out. My org will be hiring a few library scientists to help in the data curation/documentation process of datasets. This could be a useful tool for them to assist in this process.
Stephen Gates
@Stephen-Gates
@bruth current release can open, edit, save data, guess or set column properties, validate data. These milestones describe this year's plan and we're seeking funding for version 2.
Stephen Gates
@Stephen-Gates
I'm interested in implementing the pattern allowing a foreign key to reference another data package. Does anyone have examples of this? Also, what level of adoption is required for work to commence on including this in the table schema standard?
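A hedged sketch of the pattern in question, expressed as a foreignKeys entry in a Table Schema (the URL, resource, and field names are illustrative; the exact property name for the external package is part of what the thread goes on to discuss):

```json
{
  "foreignKeys": [
    {
      "fields": "country",
      "reference": {
        "package": "https://example.com/country-codes/datapackage.json",
        "resource": "countries",
        "fields": "code"
      }
    }
  ]
}
```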
Stephen Gates
@Stephen-Gates
I did find the concept was in the spec at one stage.
Rufus Pollock
@rufuspollock
@Stephen-Gates i've played around with implementing and we're thinking about this in datahub.io so if you were working on this you'd definitely have someone to chat with ... (and experiment with)
Stephen Gates
@Stephen-Gates
Great @rufuspollock just starting to spec Data Curator v2 for next year and the ability to reference an external table for validation is a feature desired by our sponsor.
Rufus Pollock
@rufuspollock
@Stephen-Gates :+1: - i agree it is a really useful feature
Stephen Gates
@Stephen-Gates

If I have a Data Package that contains a Data Resource that is shared under Public Domain, can someone please confirm that the licenses properties should be:

name : other-pd
path :
title : Other (Public Domain)

Based on http://licenses.opendefinition.org/licenses/other-pd.json from http://licenses.opendefinition.org/

Thanks
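Expressed as JSON, the properties above might sit in a datapackage.json like this (a sketch; the package name is illustrative, and path is omitted since other-pd has no single canonical URL):

```json
{
  "name": "example-package",
  "licenses": [
    {
      "name": "other-pd",
      "title": "Other (Public Domain)"
    }
  ]
}
```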

roll
@roll
@Stephen-Gates Based on the current datapackage-js state, referencing external data packages is a relatively simple feature to add (integrity check + dereferencing). It should cost a few hours of work.
But I think we should add proper Data Package Identifier support first.
Stephen Gates
@Stephen-Gates
Thanks @roll but I'm not sure I understand. Are you saying that the data package URL, data resource location, and foreign key field names in the current specification are inadequate to locate the data to perform an integrity check? Are you suggesting that a data package identifier would help with the fact that the data at that location could change? Do you think the use of an identifier would be mandatory or a best practice?
Paul Walsh
@pwalsh
@Stephen-Gates I think that @roll is saying that actually implementing foreign keys / references across data packages is quite simple in the JavaScript Data Package library that we maintain. However, he is reluctant to just go ahead and do so without some other things in place first.
roll
@roll
@pwalsh @Stephen-Gates I meant that implementing support for external referencing by a descriptor URL could be literally a few lines' addition to https://github.com/frictionlessdata/datapackage-js/blob/master/src/resource.js#L396 (load the external DP there instead of the current one). No problem to add it if it's needed.

My second thought was just that I suppose end users will much prefer the ability to reference by package name, not by URLs, like:

{fields: "country", reference: {package: "country-codes", resource: "countries", fields: "code"}}

But support for the identifiers spec (which is also easy) could surely be the next step after basic external referencing support. So I wasn't clear here - it's not any kind of requirement or preference of mine about the implementation order.

Stephen Gates
@Stephen-Gates
Thanks @roll. Reference by URL is all I was seeking. I assume that a change to the Table Schema standard would be required before a change to the code could be made?
roll
@roll
@Stephen-Gates No, I think we're good to go because for now only one mention of external referencing is in patterns - http://specs.frictionlessdata.io/patterns/#table-schema:-foreign-keys-to-data-packages. And it allows url referencing. Other question - should it be a datapackage property or package property? cc @rufuspollock
Serah Njambi Rono
@serahrono
New Frictionless Data pilot case study on eLife's use of goodtables for data validation of scientific research data http://frictionlessdata.io/case-studies/elife/
Rufus Pollock
@rufuspollock

And it allows url referencing. Other question - should it be a datapackage property or package property? cc @rufuspollock

I guess we could switch to simple package - do you have a preference or suggestion?

roll
@roll
@rufuspollock I think our intention on all levels is to use a consistent package/resource OR datapackage/dataresource. So having fk.reference.resource already suggests using package.
Also, the pattern says: "The foreignKey MAY have a property datapackage. This property is a string being a url pointing to a Data Package or is the name of a datapackage." If there is an intention to use the Data Package Identifier spec across the main specs, I think we need to mention it here instead of just a url/name.
Paul Walsh
@pwalsh

@roll @rufuspollock I just don't know why we would tie this to Data Package. We have Data Resource as a distinct spec now - there is no reason why publishers could not publish distinct Data Resources.

So, I don't see why fk.reference.resource needs to suggest usage of a data package. This again ties back to the JSON Pointer thing - I still do not see the benefit of us having custom DP identifiers, assumptions about FKs to packages, etc. etc. when we can just reuse existing specifications that are quite simple and designed for this type of referencing.

I'd really like to see the argument for why a custom approach is better, rather than me continuously jumping in on these conversations and saying "but .... json pointer" :)

roll
@roll
I think we need to compare things on real examples (cc @Stephen-Gates). I suppose there were a lot of ideas on different referencing approaches, e.g. @akariv's one with resource referencing (though if I remember correctly it was not directly JSON Pointers).
There is also Metatab's approach - https://github.com/Metatab/appurl
roll
@roll
I have a feeling that the Data Package Identifier spec (which is barely implemented anywhere so far) could evolve into a more generic referencing specification that supports versioning (datahub.io), resource referencing, row referencing, cell referencing, etc. All this stuff has been in the air lately, appearing in many issue discussions.

So instead of package property it could be something like this:

{fields: "country", reference: {resource: "country-codes#countries", fields: "code"}}

Not saying that I like something like this more (e.g. using json-pointer) but it could be an option.

roll
@roll
On the other hand, the ecosystem is very Data Package centric (e.g. datahub.io stores packages, not resources), so even having the Data Resource specification use a package property feels kinda natural.
Stephen Gates
@Stephen-Gates
My real-world example: every year, every government department must publish the same data tables in its annual report. These are also published as open data. An Excel spreadsheet is sent around to all departments to enter the data. The spreadsheet has validation rules to control the data entered (think enum constraint or FK lookup). With Frictionless Data, a template Data Package with a Table Schema is circulated to aid in collecting data. This references a table on CKAN that contains the FK lookup values. The data is validated by each department, and the data is consistent across every department. The following year there may be a change to the FK lookup table - I assume this change would result in a new URL for the dataset in CKAN (I could be wrong here). The process is repeated to collect and validate the data. Hope that helps.
Kenji
@Kenji-K
Just poppin in to say that I love what you guys are doing
Kenji
@Kenji-K
Well maybe I should also use this opportunity to ask a question, given that I am just recently getting my feet wet with respect to this subject. What is the relationship between the FD specs and Dublin Core? I have an idea in my head but I'd rather not muddy the waters with my misconceptions and will refrain from putting it here.
Stephen Gates
@Stephen-Gates

Hi, I'm updating the list of open licenses and its data package in https://github.com/okfn/licenses.

I'm validating my changes with Goodtables.io at http://goodtables.io/github/Stephen-Gates/licenses

In my okfn/licenses#57, I accidentally added an error by mis-spelling "superceded". Goodtables.io didn't fail the data despite my enum constraint.

I'm wondering if there's an error in my table schema or if GoodTables has a bug?
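For context, an enum constraint on a Table Schema field takes roughly this shape (the field name and values are illustrative):

```json
{
  "name": "status",
  "type": "string",
  "constraints": {
    "enum": ["active", "superseded", "retired"]
  }
}
```

With such a constraint in place, a validator should flag any cell whose value (e.g. the misspelled "superceded") is not in the list.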

roll
@roll
@Stephen-Gates Thanks. I'm looking into it
roll
@roll
@Stephen-Gates It should work now
Stephen Gates
@Stephen-Gates
Thanks @roll, I just pushed a change to licenses.csv with one error remaining ("superseded" L71), but it still passed the GoodTables test.
Martín n
@martinszy
Hello, I'm testing datapackage-pipelines
Rufus Pollock
@rufuspollock
@martinszy hey there, that's great!
Martín n
@martinszy
I have a couple of questions:
1) I have changing filenames, is there any way to use wildcards in add_resource?
2) Is it a bad idea to have a dump.to_ckan action?
3) Does ckan handle datapackages yet?