These are chat archives for data-8/datascience

17th Dec 2015
Chris Holdgraf
@choldgraf
Dec 17 2015 00:15
@cboettig that read_table error looks to be related to string encoding rather than something internal to Table. It looks like Table is actually calling Pandas under the hood.
It looks like all the *args and **kwargs passed to read_table are in turn passed to the pandas read function
encoding : string, default None
    Encoding to use for UTF when reading/writing (ex. 'utf-8'). `List of Python
    standard encodings
    <https://docs.python.org/3/library/codecs.html#standard-encodings>`_
that's a keyword argument in the pandas read_csv function... I wonder if passing a different encoding would get you past the error
so you'd do ds.Table.read_table(my_file, encoding='magical_bugfree_encoding')
Carl Boettiger
@cboettig
Dec 17 2015 00:19
Hmm, wonder why it's not necessary to mess with that coming from the web version (and @deculler's examples in TableDemos also read from file without manually handling encoding..) Maybe those .csvs are funny somehow, but nothing obvious to me.
Can you replicate the error I'm getting there?
Chris Holdgraf
@choldgraf
Dec 17 2015 00:24
hmm, lemme see what I can figure out
It looks like the file is encoded in latin-1 rather than utf-8
extinct = ds.Table.read_table("data/extinct.csv", encoding='latin-1')
Try that
it tells pandas how to decode the raw bytes of the file, depending on the encoding you supply
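To make that concrete, here's a minimal sketch (not from the original chat) showing the same bytes decoding fine as latin-1 but blowing up as utf-8:

```python
# 0xe9 is 'é' in latin-1, but on its own it's an invalid byte sequence in utf-8
raw = "café".encode("latin-1")   # b'caf\xe9'

print(raw.decode("latin-1"))     # decodes cleanly: café

try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("utf-8 can't decode it:", err)
```

This is exactly the failure mode read_table hits when the file's bytes don't match the assumed encoding.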
Chris Holdgraf
@choldgraf
Dec 17 2015 00:29
when you read from a URL, there must be some intelligent stuff happening under the hood that infers this
I feel like these are the kind of problems that make people hate coding :P
Carl Boettiger
@cboettig
Dec 17 2015 00:29
ha, thanks! yeah, that works.
I suspect the html version is actually somehow getting converted to utf-8
Chris Holdgraf
@choldgraf
Dec 17 2015 00:30
yeah that could be
but in general if you get an error along these lines, it's often an encoding problem
Carl Boettiger
@cboettig
Dec 17 2015 00:32
wonder if I can do some operation on the file itself (e.g. outside of python) to fix the encoding of that file? Never quite had a good grasp of where encodings are set; I always thought they were more a property of assumptions made by the parser about the file than a filetype property....
Chris Holdgraf
@choldgraf
Dec 17 2015 00:32
well, in python I believe you can change the string encodings manually
though it's something that has generally confused me over the years
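One way to do it (a sketch; the filenames are stand-ins, not the actual course files) is to read the file with the codec it actually uses and write it back out as utf-8:

```python
# stand-in filenames for illustration
src, dst = "extinct_latin1.csv", "extinct_utf8.csv"

# create a small latin-1 file so the example is self-contained
with open(src, "wb") as f:
    f.write("Kingdom,Species\nAnimalia,Mégalodon\n".encode("latin-1"))

# read with the encoding the file actually uses...
with open(src, encoding="latin-1") as f:
    text = f.read()

# ...and write it back out as utf-8
with open(dst, "w", encoding="utf-8") as f:
    f.write(text)

print(open(dst, encoding="utf-8").read())
```

After this round-trip the file can be read without any encoding flag.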
Carl Boettiger
@cboettig
Dec 17 2015 00:35
Right. I guess there must be a character somewhere in that csv file that is unique to latin-1, but I don't spot it
Carl Boettiger
@cboettig
Dec 17 2015 00:41
okay, well I can have vim rewrite the encoding... interesting to look at the git diff to see what changed: dsten/ecology-connector@aed261b Makes the non-UTF8 characters obvious...
Chris Holdgraf
@choldgraf
Dec 17 2015 00:42
interesting
looks like this could be used to detect the character encodings: https://pypi.python.org/pypi/chardet
e.g.: df['Kingdom'].str.encode('utf-8')
so you could loop through your columns, and if it's a column of strings change the encoding to utf-8
then write back to disk
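The chardet idea above could look like this (a sketch; assumes chardet is installed via pip, and the byte string here is just an example):

```python
import chardet

# bytes that are valid latin-1 but not valid utf-8
raw = "Kingdom,Species\nAnimalia,Mégalodon\n".encode("latin-1")

# chardet guesses the encoding from the raw bytes
guess = chardet.detect(raw)
print(guess["encoding"], guess["confidence"])
```

You'd then pass the detected encoding straight into read_table's encoding argument.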
Carl Boettiger
@cboettig
Dec 17 2015 00:44
Cool. Though for the students I think it's best if I can give them utf-8 encoded data whenever possible; like you say, it's kind of the dark underbelly, I'm sure no one actually enjoys battling string encodings...
Chris Holdgraf
@choldgraf
Dec 17 2015 00:45
oh definitely
I meant just for you/me
Carl Boettiger
@cboettig
Dec 17 2015 00:45
right
Chris Holdgraf
@choldgraf
Dec 17 2015 00:47
e.g. that seems to work properly
and if I write that dataframe to file, I can now read it back in like before
but w/o the extra 'encoding' flag
df = extinct.to_df()
for col, vals in df.iteritems():
    try:
        # re-encode string columns as utf-8
        df.loc[:, col] = vals.str.encode('utf-8')
    except AttributeError:
        # non-string columns have no .str accessor; skip them
        print(col)
df.to_csv('./test.csv')
ds.Table.read_table('./test.csv')
there's the code to do it