jobarratt
@jobarratt
looks great @loleg ! if there is something you are working on that is FD related always let us (especially @callmealien ) know here so we can give it a bit of an extra promotion push! And you may already be in touch with the OKI comms team but if not and you want to pitch a blog we'll always be happy to support you with it
Meiran Zhiyenbayev
@Mikanebu

Data Package v1 Specifications. What has Changed and how to Upgrade

This post walks you through the major changes in the Data Package v1 specs compared to pre-v1. It covers changes in the full suite of Data Package specifications including Data Resources and Table Schema. It is particularly valuable if:

  • you were using Data Packages pre-v1 and want to know how to upgrade your datasets
  • you are implementing Data Package related tooling and want to know how to upgrade your tools, or want to support or auto-upgrade pre-v1 Data Packages for backwards compatibility

You can find the entire blogpost here http://datahub.io/blog/upgrade-to-data-package-specs-v1

Stephen Gates
@Stephen-Gates
What's the difference between sources in the Data Resource spec and sources in a Data Package? Sources in the resource don't explicitly inherit from the package like licences do. So why have both?
Meiran Zhiyenbayev
@Mikanebu
@Stephen-Gates Thanks for asking this question. We will provide an answer soon, reading through the discussion in specs.
Rufus Pollock
@rufuspollock
@Mikanebu you can have sources in both and there is no specific semantic on inheritance. sources in a data package can be taken as sources for the whole data package, whilst for a given resource they are just for that resource ...
Meiran Zhiyenbayev
@Mikanebu
@rufuspollock Thanks for clarifying this
Stephen Gates
@Stephen-Gates
@rufuspollock if that's the case, why not have a statement similar to licenses, licenses: as for Data Package metadata. If not specified the resource inherits from the data package.
Rufus Pollock
@rufuspollock

@Stephen-Gates because i don't think the specific resource inherits in a defined sense like licenses. sources are less specific in that sense - whereas licenses obviously filter down, the sources you specify may apply to some resources but not others etc.

I guess my question is more to you: what semantics do you want and why :-) ?

Stephen Gates
@Stephen-Gates
@rufuspollock From a convenience perspective, I think you should be able to define a licence or sources once at the package level and explicitly say resources inherit. If sources vary at the resource level, specify at that level and don't specify at the package level. Given licence compatibility issues, you could specify different licences at the resource level and not have a licence at the package level. The Specs support this apart from explicit inheritance of sources from the package. This could be fixed in the data resource spec by source: as for Data Package metadata. If not specified the resource inherits from the data package.
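The inheritance being proposed here can be sketched in a few lines of Python. This is only an illustration of the suggested semantics, not behaviour defined by the specs: the function name `effective_sources` and the example descriptor are hypothetical.

```python
def effective_sources(package, resource):
    """Return the resource's own sources, falling back to the package's.

    Proposed semantics only: the specs under discussion do not define
    this inheritance for sources.
    """
    if "sources" in resource:
        return resource["sources"]
    return package.get("sources", [])


# Hypothetical descriptor with sources at both levels
package = {
    "name": "example-package",
    "sources": [{"title": "Package-level source"}],
    "resources": [
        {"name": "a", "path": "a.csv"},
        {
            "name": "b",
            "path": "b.csv",
            "sources": [{"title": "Resource-level source"}],
        },
    ],
}

for resource in package["resources"]:
    print(resource["name"], effective_sources(package, resource))
```

Under this sketch, resource "a" picks up the package-level source while resource "b" keeps its own.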
Stephen Gates
@Stephen-Gates
Logged at frictionlessdata/specs#541
Byron Ruth
@bruth
Good afternoon. I am reviewing the various specifications and had two questions. First have you come across a use case where a "query" is being represented as a data resource? The assumption being that the dataset is a function of the query at the time it is executed. And second, are there any support/examples for including and/or deriving provenance (PROV or otherwise) from data resources?
Stephen Gates
@Stephen-Gates
Hi @bruth provenance in a data package is usually provided in the readme.md. Here's a sample I'm using. Of course you could write anything in the markdown file. I'm intrigued about how you can derive provenance from a data resource. How could you determine what processing has been done by just looking at the end result?
The readme.md is a file included in the datapackage.zip http://specs.frictionlessdata.io/data-package/#illustrative-structure
Byron Ruth
@bruth
Thanks @Stephen-Gates. All that can be derived are changes from one version to the next (more rows, changed values, etc.). You are correct in that the intent/cause of the change is not known unless you have the context. For my use case, I will have this information since new revisions of a dataset will prompt the user (committing the new version) for a reason.
I am evaluating FD for the specs and tooling as the basis for a "data sharing platform" within my org. I have come across other specs in the past, but FD feels the most nimble and active. Extensibility is important since we may need to add additional metadata specific to my org. It appears that this is allowed within the specs.
Rufus Pollock
@rufuspollock

Good afternoon. I am reviewing the various specifications and had two questions. First have you come across a use case where a "query" is being represented as a data resource?

@bruth yes we've definitely thought about this use case. You could definitely use it this way.

And second, are there any support/examples for including and/or deriving provenance (PROV or otherwise) from data resources?

You can use the sources attribute. Would you want more than that?

@bruth and welcome - great to have your questions and interest :-)
@bruth and if you are interested in a data package oriented "data sharing platform" you can check out https://datahub.io/
Byron Ruth
@bruth
@rufuspollock Thanks! Treating a query as a data resource is sort of weird, and may be more appropriate as provenance itself for the dataset being produced. The sources and contributors attributes are a good start for provenance. I need to evaluate to what extent I need/want to embed all provenance information in the datapackage.json or if I would reference a changelog/PROV graph of sorts
datahub looks very nice. i like how a dataset is presented. for my use case, this would be internal to the organization, so unfortunately I can't use this hosted version.
Stephen Gates
@Stephen-Gates
@bruth if you need an internal solution to create data packages, you may be interested in a project I'm leading http://data-curator.io - work in progress - v1.0.0 due before Christmas
Byron Ruth
@bruth
@Stephen-Gates This looks very promising. I am going to try it out. My org will be hiring a few library scientists to help in the data curation/documentation process of datasets. This could be a useful tool for them to assist in this process.
Stephen Gates
@Stephen-Gates
@bruth current release can open, edit, save data, guess or set column properties, validate data. These milestones describe this year's plan and we're seeking funding for version 2.
Stephen Gates
@Stephen-Gates
I'm interested in implementing the pattern allowing a foreign key to reference another data package. Does anyone have examples of this? Also, what level of adoption is required for work to commence on including this in the table schema standard?
Stephen Gates
@Stephen-Gates
I did find the concept was in the spec at one stage.
Rufus Pollock
@rufuspollock
@Stephen-Gates i've played around with implementing and we're thinking about this in datahub.io so if you were working on this you'd definitely have someone to chat with ... (and experiment with)
Stephen Gates
@Stephen-Gates
Great @rufuspollock just starting to spec Data Curator v2 for next year and the ability to reference an external table for validation is a feature desired by our sponsor.
Rufus Pollock
@rufuspollock
@Stephen-Gates :+1: - i agree it is a really useful feature
Stephen Gates
@Stephen-Gates

If I have a Data Package that contains a Data Resource that is shared under Public Domain, can someone please confirm that the licenses properties should be:

name : other-pd
path :
title : Other (Public Domain)

Based on http://licenses.opendefinition.org/licenses/other-pd.json from http://licenses.opendefinition.org/

Thanks

roll
@roll
@Stephen-Gates Based on the current datapackage-js state, referencing external data packages is a relatively simple feature to add (integrity check + dereferencing). Should cost a few hours of work.
But I think we should add proper Data Package Identifier support first
Stephen Gates
@Stephen-Gates
Thanks @roll but I'm not sure I understand. Are you saying that the data package url, data resource location and the foreign key field names in the current specification are inadequate to locate the data to perform an integrity check? Are you suggesting that a data package identifier would assist with the fact that the data at that location could change? Do you think the use of an identifier would be mandatory or a best practice?
Paul Walsh
@pwalsh
@Stephen-Gates I think that @roll is saying that actually implementing foreign keys / references across data packages is quite simple in the JavaScript Data Package library that we maintain. However, he is reluctant to just go ahead and do so without some other things in place first.
roll
@roll
@pwalsh @Stephen-Gates I meant that implementing support for external referencing by a descriptor URL could be literally just a few lines added to https://github.com/frictionlessdata/datapackage-js/blob/master/src/resource.js#L396 (load the external DP there instead of the current one). No problem to add it if it's needed.

My second thought was just that I suppose end-users would much prefer being able to reference by package names rather than urls, like:

{fields: "country", reference: {package: "country-codes", resource: "countries", fields: "code"}}

But support for the identifiers spec (which is also easy) surely could be just the next step after basic external referencing support. So here I wasn't clear. It's not any kind of requirement or preference of mine about the implementation order.
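A rough Python sketch of the "integrity check + dereferencing" idea, using the reference shape from the message above. This is not the datapackage-js implementation: `check_foreign_key` and the `load_rows` callback are hypothetical names, the external package is an in-memory stand-in rather than a fetched descriptor, and only single-field keys are handled.

```python
def check_foreign_key(rows, foreign_key, load_rows):
    """Return the rows whose key value is missing from the referenced resource."""
    ref = foreign_key["reference"]
    # load_rows is a hypothetical callback: (package, resource) -> list of dict rows
    referenced = load_rows(ref["package"], ref["resource"])
    allowed = {row[ref["fields"]] for row in referenced}
    return [row for row in rows if row[foreign_key["fields"]] not in allowed]


# In-memory stand-in for loading an external data package by name
PACKAGES = {("country-codes", "countries"): [{"code": "GB"}, {"code": "KZ"}]}


def load_rows(package, resource):
    return PACKAGES[(package, resource)]


fk = {
    "fields": "country",
    "reference": {
        "package": "country-codes",
        "resource": "countries",
        "fields": "code",
    },
}
rows = [{"country": "GB"}, {"country": "XX"}]
violations = check_foreign_key(rows, fk, load_rows)
print(violations)
```

Swapping `load_rows` for a function that loads a descriptor from a URL would give the reference-by-URL variant without changing the check itself.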

Stephen Gates
@Stephen-Gates
Thanks @roll. Reference by URL is all I was seeking. I assume that a change to the Table Schema standard would be required before a change to the code could be made?
roll
@roll
@Stephen-Gates No, I think we're good to go because for now only one mention of external referencing is in patterns - http://specs.frictionlessdata.io/patterns/#table-schema:-foreign-keys-to-data-packages. And it allows url referencing. Other question - should it be a datapackage property or package property? cc @rufuspollock
Serah Njambi Rono
@serahrono
New Frictionless Data pilot case study on eLife's use of goodtables for data validation of scientific research data http://frictionlessdata.io/case-studies/elife/
Rufus Pollock
@rufuspollock

And it allows url referencing. Other question - should it be a datapackage property or package property? cc @rufuspollock

I guess we could switch to simple package - do you have a preference or suggestion?

roll
@roll
@rufuspollock I think our intention on all levels is to use consistent package/resource OR datapackage/dataresource. So having fk.reference.resource already suggests using package.
Also: "The foreignKey MAY have a property datapackage. This property is a string being a url pointing to a Data Package or is the name of a datapackage." If there is an intention to use the Data Package Identifier spec across the main specs, I think we need to mention it here instead of just a url/name.
Paul Walsh
@pwalsh

@roll @rufuspollock I just don't know why we would tie this to Data Package. We have Data Resource as a distinct spec now - there is no reason why publishers could not publish distinct Data Resources.

So, I don't see why fk.reference.resource needs to suggest usage of a data package. This again ties back to the JSON Pointer thing - I still do not see the benefit of us having custom DP identifiers, assumptions about FKs to packages, etc. etc. when we can just reuse existing specifications that are quite simple and designed for this type of referencing.

I'd really like to see the argument for why a custom approach is better, rather than me continuously jumping in on these conversations and saying "but .... json pointer" :)

roll
@roll
I think we need to compare things on real examples (cc @Stephen-Gates). I suppose there were a lot of ideas on different referencing approaches. E.g. @akariv's one with resource referencing (but if I remember correctly it was not directly json-pointers).
There is a Metatab's one - https://github.com/Metatab/appurl
roll
@roll
I have a feeling that the Data Package Identifier spec (which is almost not implemented anywhere for now) could evolve into a more generic referencing specification which supports versioning (datahub.io), resource referencing, row referencing, cell referencing etc. All this stuff is in the air lately, appearing in many issue discussions.

So instead of package property it could be something like this:

{fields: "country", reference: {resource: "country-codes#countries", fields: "code"}}

Not saying that I like something like this more (e.g. using json-pointer) but it could be an option.
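For illustration only, splitting such a combined reference could look like the Python below. The `package#resource` syntax is just the option floated above, not part of any spec and not JSON Pointer; `split_reference` is a hypothetical helper.

```python
def split_reference(reference):
    """Split 'package#resource' into (package, resource).

    Returns (None, resource) when there is no '#', i.e. when the
    reference points at a resource in the current package.
    """
    package, sep, resource = reference.rpartition("#")
    if not sep:
        return None, reference
    return package, resource


print(split_reference("country-codes#countries"))
print(split_reference("countries"))
```

Splitting on the last `#` means the package part could also be a URL, as long as the resource name itself contains no `#`.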

roll
@roll
On the other hand the ecosystem is very Data Package centric (e.g. datahub.io stores packages, not resources), so even with the Data Resource specification, using a package property feels kinda natural.
Stephen Gates
@Stephen-Gates
My real world example: every year every government department must publish the same data tables in their annual reports. These are also published as open data. An Excel spreadsheet is sent around to all departments to enter the data. The spreadsheet has validation rules to control the data entered (think enum constraint or FK lookup). With Frictionless Data, a template Data Package with a Table Schema is circulated to aid in collecting data. This references a table on CKAN that contains the FK lookup values. The data is validated by each department and the data is consistent for every department. The following year there may be a change to the FK lookup table - I assume this change would result in a new URL for the dataset in CKAN (I could be wrong here). The process is repeated to collect and validate the data. Hope that helps.
Kenji
@Kenji-K
Just poppin in to say that I love what you guys are doing