These are chat archives for frictionlessdata/chat

17th
Mar 2017
Rufus Pollock
@rufuspollock
Mar 17 2017 06:24

Tidy Data by Hadley Wickham http://vita.had.co.nz/papers/tidy-data.pdf +link

A huge amount of effort is spent cleaning data to get it ready for analysis, but there
has been little research on how to make data cleaning as easy and effective as possible.
This paper tackles a small, but important, component of data cleaning: data tidying.
Tidy datasets are easy to manipulate, model and visualise, and have a specific structure:
each variable is a column, each observation is a row, and each type of observational unit
is a table. This framework makes it easy to tidy messy datasets because only a small
set of tools are needed to deal with a wide range of un-tidy datasets. This structure
also makes it easier to develop tidy tools for data analysis, tools that both input and
output tidy datasets. The advantages of a consistent data structure and matching tools
are demonstrated with a case study free from mundane data manipulation chores.
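
As a rough illustration of the "each variable is a column, each observation is a row" structure described in the abstract, here is a minimal sketch in plain Python. The table, the column names, and the `tidy` helper are all hypothetical and made up for this example; none of them come from the paper or from any Frictionless Data tool.

```python
# Hypothetical "messy" table: one row per city, one column per year.
# The year columns hold values of a variable (year), not separate variables.
wide_rows = [
    {"city": "Lagos", "2015": 10, "2016": 12},
    {"city": "Berlin", "2015": 7, "2016": 9},
]


def tidy(rows, id_column, variable_name, value_name):
    """Turn each (row, non-id column) pair into its own observation row."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every output row
            tidy_rows.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return tidy_rows


tidy_rows = tidy(wide_rows, id_column="city", variable_name="year", value_name="count")
for r in tidy_rows:
    print(r)
```

Each output row is now a single observation (one city in one year), with `city`, `year`, and `count` as the three variables.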

Comments:

  • Worth a read
  • Would it be worth submitting a "paper" to journal of statistical software about frictionless data (principles and specs)? :-)
Stefan Urbanek
@Stiivi
Mar 17 2017 06:35
Interesting paper and interesting perspective on free-form data.
I think it might be worth it once integrated with some of the tools people in the field are using such as Pandas.
Stefan Urbanek
@Stiivi
Mar 17 2017 06:47
@rufuspollock I think submitting a paper or some note about frictionless data might be useful. Not sure whether it would be widely understood, but at least it would make people think about the problem differently.
On the other hand, I find “there has been surprisingly little research on how to clean data well” a bit misleading. It might be true within a certain domain, or perhaps within the scientific community. There has been plenty of literature about data preparation in the field of data warehousing/BI/data mining. Unfortunately, a significant amount of industry practice was not public and was kept only within the consultancies offering such services.
Stefan Urbanek
@Stiivi
Mar 17 2017 06:59

I mean, the specs go far beyond the paper, which seems to solve a very basic problem that the specs expect to be solved already. I’m inferring this from what the author writes:

The principles of tidy data are closely tied to those of relational databases and Codd’s relational algebra

Then he mentions SQL and touches on some other approaches or tools (all of which are just frameworks or methods). However, the rest of the paper is more about putting free-form data into a table or aggregated pivot tables than about actual data cleaning (as it is understood in the field).

Therefore yes, publishing about frictionless data is worth it :-)
Paul Walsh
@pwalsh
Mar 17 2017 11:14
@rufuspollock worth syncing with @danfowler on this. something he wrote will be getting a journal publication soon, not sure of the details myself.
Sven Willner
@swillner
Mar 17 2017 16:06

hi. i'm interested in applying for the frictionless data tool fund with a c++ implementation, as that is generally my language of choice. i have a few questions regarding that:

  • are you fine with the c++11 standard?
  • which systems should be supported?
  • how should third-party libraries (e.g. for parsing/writing JSON) be licensed? or shall everything be implemented from scratch?
  • what are your requirements regarding test coverage?

thanks!

Ekpe Samuel
@geniusgeek
Mar 17 2017 16:17
Good day, I just applied for the frictionless data tool fund using Java. Any pointers?
Daniel Fowler
@danfowler
Mar 17 2017 22:44
@geniusgeek But Java doesn’t have any pointers! :smile:
@geniusgeek there’s a JavaFX app that reads and writes Table Schema, so you can take a look there: https://github.com/frosch95/SmartCSV.fx
@/all I did my best at reorganizing the /tools page to show a clear sense of what has been built to date: http://frictionlessdata.io/tools/ Comments/fixes/additions welcome!