These are chat archives for frictionlessdata/chat

24th
Sep 2018
Rufus Pollock
@rufuspollock
Sep 24 2018 07:35 UTC

@zaneselvans @aborruso you seem to have found a bug here with the lack of dialect information.

Generally, I want to ask a question about how the packages fit together. I've got a preference for having more of a toolkit where you use the tools to create your data package as needed.

I'm mentioning this here because I would guess we've either got a bug in the infer tool or in the combining of that output into the data package. At the moment it's tough to tell which.

I think it would be more transparent to the user to have a setup where you do:

schema = infer('my.csv')   // a simple dictionary
resource = new Resource()
resource.schema = schema
dataset = new Dataset()
dataset.addResource(resource)

More on this here http://okfnlabs.org/blog/2018/02/15/design-pattern-for-a-core-data-library.html
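
For concreteness, the same flow with the current Python libraries might look roughly like this (a sketch assuming tableschema's infer() and datapackage-py's Package, so exact names may differ between versions):

from tableschema import infer
from datapackage import Package

schema = infer('my.csv')                      # a plain schema dict, as above
package = Package()
package.add_resource({'path': 'my.csv', 'schema': schema})
package.save('datapackage.json')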

@aborruso Sorry, I have checked the codebase and it seems there is no dialect inference implemented for now. I mean, the underlying libraries guess and use a dialect internally but don't expose this information to higher levels. So we probably have to implement it starting from tabulator. For now, I could recommend using Python's built-in csv.Sniffer as a workaround - https://github.com/frictionlessdata/tabulator-py/blob/master/tabulator/parsers/csv.py#L102. E.g. dialect = csv.Sniffer().sniff(resource.read(limit=100)). Also, as an option (it was the intended behavior for the infer function in this iteration), there could be a manual step of adding resource.dialect by hand.
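
For the manual step mentioned above, a rough sketch (assuming a 'dialect' key following the CSV Dialect spec in the resource descriptor - the library does not add it for you) could be:

from datapackage import Resource

resource = Resource({'path': 'my.csv'})
# add the dialect by hand; 'delimiter' follows the CSV Dialect spec
resource.descriptor['dialect'] = {'delimiter': ';'}
resource.commit()   # re-validate the updated descriptor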
Andrea Borruso
@aborruso
Sep 24 2018 09:03 UTC

@roll thank you. In my opinion, knowing and also declaring the CSV separator is very important. It's like encoding: if the end user does not know it, they will lose time. Sometimes a lot of time.
I think that the separator (for CSV files) should always be in the datapackage info.

It's a feature request

Thank you again to all of you

Andrea Borruso
@aborruso
Sep 24 2018 09:33 UTC

@roll if I run

import csv
from datapackage import Resource
resource = Resource({u'path': 'input.csv'})
dialect = csv.Sniffer().sniff(resource.read(limit=100))

I have

TypeError                                 Traceback (most recent call last)
<ipython-input-34-735638ac6f72> in <module>()
----> 1 dialect = csv.Sniffer().sniff(resource.read(limit=100))

/usr/lib/python2.7/csv.pyc in sniff(self, sample, delimiters)
    180
    181         quotechar, doublequote, delimiter, skipinitialspace = \
--> 182                    self._guess_quote_and_delimiter(sample, delimiters)
    183         if not delimiter:
    184             delimiter, skipinitialspace = self._guess_delimiter(sample,

/usr/lib/python2.7/csv.pyc in _guess_quote_and_delimiter(self, data, delimiters)
    221                       '(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?:$|\n)'):                            #  ".*?" (no delim, no space)
    222             regexp = re.compile(restr, re.DOTALL | re.MULTILINE)
--> 223             matches = regexp.findall(data)
    224             if matches:
    225                 break

TypeError: expected string or buffer

What's wrong with my code?

A temporary workaround has been the code below, but I would like to use your approach

import csv
# sample the first two lines and sniff the delimiter (Python 2: 'rb' yields str)
with open('input.csv', 'rb') as csvfile:
    temp_lines = csvfile.readline() + '\n' + csvfile.readline()
    dialect = csv.Sniffer().sniff(temp_lines, delimiters=',\t;|')
dialect.delimiter
jobarratt
@jobarratt
Sep 24 2018 10:12 UTC
Save the Date! csv,conf,v4 is happening! We will be heading back to the Eliot Centre in Portland on May 8-9 next year for more talks about data sharing and data analysis from science, journalism, government, and open source. More announcements in the next few weeks: https://csvconf.com/
Sign up to Slack for the latest updates at https://csvconf-slackin.herokuapp.com/
@aborruso Ahh my bad it should be resource.raw_read (without the limit argument I think)
Andrea Borruso
@aborruso
Sep 24 2018 12:27 UTC
Ok @roll I will try, thank you
Andrea Borruso
@aborruso
Sep 24 2018 12:53 UTC

@aborruso Ahh my bad it should be resource.raw_read (without the limit argument I think)

It works perfectly, thank you

import csv
from datapackage import Resource
resource = Resource({u'path': 'input.csv'})
# sniff the CSV dialect from the raw file contents
dialect = csv.Sniffer().sniff(resource.raw_read())
dialect.delimiter
:+1:
Andrea Borruso
@aborruso
Sep 24 2018 13:02 UTC
@roll what about adding the delimiter as a feature to the managed info as well?
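
A rough end-to-end sketch of that idea, combining the Sniffer workaround above with a hand-written dialect entry (assumptions: raw_read() returns text as in the session above, and Package.commit()/save() behave as in datapackage-py 1.x):

import csv
from datapackage import Package

package = Package({'resources': [{'path': 'input.csv'}]})
# sniff the delimiter from the raw contents (may need .decode() on Python 3)
delimiter = csv.Sniffer().sniff(package.resources[0].raw_read()).delimiter
# record it by hand under the resource's dialect, then save the descriptor
package.descriptor['resources'][0]['dialect'] = {'delimiter': delimiter}
package.commit()
package.save('datapackage.json')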
Robert Gieseke
@rgieseke
Sep 24 2018 14:42 UTC
Hi all, the latest release of the Pandas Datapackage reader can now also read GeoJSON into GeoPandas DataFrames: https://github.com/rgieseke/pandas-datapackage-reader
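If anyone wants to try it, usage is roughly as follows (assuming the read_datapackage entry point from the project's README; GeoJSON resources should come back as GeoPandas GeoDataFrames):

from pandas_datapackage_reader import read_datapackage

# returns a DataFrame, or a dict of DataFrames for multi-resource packages
data = read_datapackage('path/to/datapackage')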
David Cottrell
@cottrell
Sep 24 2018 14:49 UTC
Asking a question again, sorry if I missed an answer in the rolling chat. Where should datapackage "getters" go for non-local data? Is this a data pipeline, or is there some feature of data packages themselves I have missed? For example, wget <url> plus a schema, delimiter, etc.