These are chat archives for frictionlessdata/chat

22nd
Sep 2018
Andrea Borruso
@aborruso
Sep 22 2018 08:56 UTC

Hi @roll thank you very much.

I have tried to use it, but I cannot find any information about the inferred CSV delimiter.
How can I read it?

I have this CSV:

city;location
london;"51.50,-0.11"
paris;"48.85,2.30"
rome;"41.89,12.51"

If I run tableschema infer input.csv, I get

{u'fields': [{u'type': u'string', u'name': u'city', u'format': u'default'}, {u'type': u'geopoint', u'name': u'location', u'format': u'default'}], u'missingValues': [u'']}

If I run datapackage infer input.csv, I get

{
  "profile": "tabular-data-package",
  "resources": [
    {
      "profile": "tabular-data-resource",
      "name": "input",
      "encoding": "utf-8",
      "format": "csv",
      "mediatype": "text/csv",
      "path": "input.csv",
      "schema": {
        "fields": [
          {
            "type": "string",
            "name": "city",
            "format": "default"
          },
          {
            "type": "geopoint",
            "name": "location",
            "format": "default"
          }
        ],
        "missingValues": [
          ""
        ]
      }
    }
  ]
}
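For a file like the one above, the delimiter can also be guessed independently of the frictionlessdata tools. The following is a minimal sketch using csv.Sniffer from the Python standard library; the helper name sniff_delimiter and the list of candidate delimiters are assumptions made for illustration, and this is not necessarily how tableschema or datapackage detect the separator internally.

```python
import csv

# The sample CSV from the message above, inlined so the sketch is self-contained.
SAMPLE = '''city;location
london;"51.50,-0.11"
paris;"48.85,2.30"
rome;"41.89,12.51"
'''


def sniff_delimiter(text, candidates=";,\t|"):
    """Guess the delimiter of CSV text with csv.Sniffer (hypothetical helper)."""
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    return dialect.delimiter


delimiter = sniff_delimiter(SAMPLE)

# Parse the rows again using the guessed delimiter.
rows = list(csv.reader(SAMPLE.splitlines(), delimiter=delimiter))
print(repr(delimiter))
print(rows[1])
```

The quoted location values contain commas, which is why the candidate list and the quoting both matter for the guess.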
David Cottrell
@cottrell
Sep 22 2018 11:38 UTC
What is the right place for URL-based sources? It feels like this should maybe be a pipeline with a requests.get, a transform and a save step, but I am not sure. There doesn't seem to be much in the way of extractors or data-pipeline builders. How are people doing this? Basically, I want to run something like dp-create-from-url <url> and get a base starter template ... not too hard, but after reading for a while I cannot see an obvious place to include this in the projects.
For example, the datapackage infer method is largely centred around local paths. I started to modify it to take a URL and do a cached pull, but then thought pipelines should be the way to go.
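One way to picture the "starter template from a URL" idea is sketched below, using only the Python standard library. The function names fetch_text and starter_descriptor, the example.com URL, and the choice to type every field as "string" are all assumptions for illustration; this is not an existing datapackage or pipelines feature, and real type inference would still be left to the frictionlessdata tools.

```python
import csv
import io
import posixpath
import urllib.parse
import urllib.request


def fetch_text(url, encoding="utf-8"):
    """Download a remote file and decode it (hypothetical; not called below)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding)


def starter_descriptor(url, text, delimiter=","):
    """Build a bare-bones descriptor for a remote CSV (hypothetical helper).

    Only the header row is read, and every field is typed as "string".
    """
    filename = posixpath.basename(urllib.parse.urlparse(url).path)
    name = posixpath.splitext(filename)[0].lower()
    header = next(csv.reader(io.StringIO(text), delimiter=delimiter))
    return {
        "profile": "tabular-data-package",
        "resources": [{
            "profile": "tabular-data-resource",
            "name": name,
            "path": url,
            "format": "csv",
            "dialect": {"delimiter": delimiter},
            "schema": {
                "fields": [{"name": col, "type": "string"} for col in header],
            },
        }],
    }


# Demo with inline text and a placeholder URL, so no network access is needed.
descriptor = starter_descriptor(
    "https://example.com/data/input.csv",
    'city;location\nlondon;"51.50,-0.11"\n',
    delimiter=";",
)
```

In real use, the text argument would come from fetch_text(url), possibly with a local cache in front of it.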
The pattern that emerges with extractors is that you start with singletons whose .get() takes no args, but then you have some that do take args.
And then you end up with a collection of convenience arg generators that give you things like date ranges, today, etc.
So in summary, my question is: is this pattern a pipeline or a datapackage?
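The extractor pattern described in these messages could look roughly like the sketch below. The Extractor class, the today and date_range argument generators, and the example.com URLs are all hypothetical names invented for illustration, and the actual download step is stubbed out. It does not settle whether this belongs in a pipeline or in a datapackage.

```python
import datetime


class Extractor:
    """A named source with a get() method; fetching is stubbed out here."""

    def __init__(self, name, url_template):
        self.name = name
        self.url_template = url_template

    def url(self, **params):
        # Fill any {placeholders} in the template with the given arguments.
        return self.url_template.format(**params)

    def get(self, **params):
        # A real implementation would download and parse self.url(**params).
        return {"source": self.name, "url": self.url(**params)}


def today(now=None):
    """Argument generator: a single argument dict for the current day."""
    now = now or datetime.date.today()
    return {"date": now.isoformat()}


def date_range(start, end):
    """Argument generator: one argument dict per day, start to end inclusive."""
    day = start
    while day <= end:
        yield {"date": day.isoformat()}
        day += datetime.timedelta(days=1)


# A "singleton" source: get() needs no arguments.
static = Extractor("cities", "https://example.com/cities.csv")
single = static.get()

# A parameterised source: get() is driven by an argument generator.
daily = Extractor("prices", "https://example.com/prices/{date}.csv")
batch = [
    daily.get(**args)
    for args in date_range(datetime.date(2018, 9, 20), datetime.date(2018, 9, 22))
]
```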
Zane Selvans
@zaneselvans
Sep 22 2018 14:54 UTC
@roll @aborruso I think the delimiter is defined in the dialect element of the tabular data resource descriptor, so I would imagine that it gets inferred if/when you infer the resource, rather than the table schema. Does it also get stored somewhere else? https://frictionlessdata.io/specs/tabular-data-resource/
Andrea Borruso
@aborruso
Sep 22 2018 15:30 UTC

@zaneselvans when I use all these great tools (goodtables, tableschema and datapackage) directly on the example CSV above, they are able to infer the right separator, the ;. My question is: how do I extract this precious info that these tools are able to infer?

Thank you

Zane Selvans
@zaneselvans
Sep 22 2018 15:33 UTC
Something like....
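import datapackage
tdr = datapackage.Resource({'path': 'input.csv'})  # Resource expects a descriptor, not a bare CSV path
tdr.infer()
tdr.descriptor['dialect']['delimiter']  # raises KeyError if the descriptor has no 'dialect' entry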
Andrea Borruso
@aborruso
Sep 22 2018 15:33 UTC
wow, thank you
I'll try it
Zane Selvans
@zaneselvans
Sep 22 2018 15:34 UTC
I think -- anyway it should be available via the descriptor dictionary on the resource that gets inferred.
Andrea Borruso
@aborruso
Sep 22 2018 15:36 UTC
@zaneselvans I do not have a json file with a dialect
I have only the CSV
Andrea Borruso
@aborruso
Sep 22 2018 15:43 UTC

@zaneselvans

My command:

from datapackage import Package
package = Package()
package.infer('input.csv')

Then I run package.descriptor['dialect'] and I get

package.descriptor['dialect']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-30-773590eeb596> in <module>()
----> 1 package.descriptor['dialect']

KeyError: 'dialect'
Andrea Borruso
@aborruso
Sep 22 2018 15:52 UTC
@zaneselvans there is no dialect inside the inferred resource info, and so I get an error. But the delimiter is read correctly by datapackage-py
Zane Selvans
@zaneselvans
Sep 22 2018 17:20 UTC
It should be part of the resource, not the package (each package can contain many different resources)
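To make the package-versus-resource distinction concrete, here is a minimal sketch that works on plain descriptor dictionaries. The descriptor below is hand-written for illustration (the inferred output shown earlier in this conversation contains no dialect entry), resource_delimiters is a hypothetical helper, and the fallback to "," follows the default delimiter in the CSV Dialect specification.

```python
DEFAULT_DELIMITER = ","  # default delimiter in the CSV Dialect specification


def resource_delimiters(package_descriptor):
    """Map each resource name to its delimiter, falling back to the default."""
    return {
        resource.get("name"): resource.get("dialect", {}).get(
            "delimiter", DEFAULT_DELIMITER
        )
        for resource in package_descriptor.get("resources", [])
    }


# Hand-written example descriptor: the dialect sits on each resource,
# not on the package itself.
package_descriptor = {
    "profile": "tabular-data-package",
    "resources": [
        {"name": "input", "path": "input.csv", "dialect": {"delimiter": ";"}},
        {"name": "other", "path": "other.csv"},  # no dialect recorded
    ],
}

delimiters = resource_delimiters(package_descriptor)
print(delimiters)
```

Using .get() with a default avoids the KeyError seen above when a descriptor has no dialect entry at all.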