These are chat archives for FreeCodeCamp/DataScience

16th
Feb 2017
Albert Jonathan
@albert2309
Feb 16 2017 16:42
@apottr so you are scraping data from PDFs? I can't imagine doing that if the data isn't machine-readable
Hèlen Grives
@mesmoiron
Feb 16 2017 16:45
@albert2309 if the PDF is an image or its makeup does not comply with the PDF standard, it can't be processed. It simply means that a program needs to OCR the image PDF and then output it as a text-based PDF. The ABBYY FineReader online service does not process PDFs that are not properly structured as text-based. Just learned that the hard way.
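A rough sketch of the check described above, deciding which pages are image-only and need OCR first. It assumes the text of each page has already been extracted with some PDF library; the function name `pages_needing_ocr` and the `min_chars` threshold are hypothetical, not part of any tool mentioned here.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return indices of pages whose extracted text is empty or nearly empty.

    page_texts: one string (or None) per page, as returned by whatever
    PDF text extractor is in use. min_chars is an arbitrary cutoff.
    """
    flagged = []
    for index, text in enumerate(page_texts):
        # Count only visible characters; image-only pages usually yield
        # nothing, or just whitespace.
        visible = sum(1 for ch in (text or "") if not ch.isspace())
        if visible < min_chars:
            flagged.append(index)
    return flagged


sample_pages = [
    "Quarterly report: revenue grew in all three regions.",
    "",
    "  \n ",
]
print(pages_needing_ocr(sample_pages))
```

Pages flagged this way would go through OCR before any scraping; pages with real text can be parsed directly.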
Albert Jonathan
@albert2309
Feb 16 2017 16:47
@mesmoiron Ah okay yeah. I experienced that when I scanned a book once. Some of the text in the PDF couldn't be processed by the computer unless I used a specific program.
Hèlen Grives
@mesmoiron
Feb 16 2017 16:50
@albert2309 yes; I have set up a workflow of 3 main software programs that does the job I want. As I said, it turned out that not all PDF outputs were usable in other programs. Patience is key :-)
Albert Jonathan
@albert2309
Feb 16 2017 16:58
@mesmoiron Haha. No wonder many data scientists say cleaning data is time-consuming
Hèlen Grives
@mesmoiron
Feb 16 2017 17:46
@albert2309 yeah, everybody is talking about the juicy ML and other nice visualizations/applications. But I'm a long way from that; still cleaning the dirt in order to do that later.
Amelia
@apottr
Feb 16 2017 20:25
the PDF content wasn't exactly organized in the best of ways: the end of a line would show up at the start of the line, there were no spaces between columns, etc.
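For the "no spaces between columns" case, one possible approach is to cut each extracted line at known character offsets. This is only a sketch: it assumes every column has a fixed width that is known in advance, and the sample row, the widths, and the name `split_fixed_width` are made up for illustration. It does not address the misplaced line endings.

```python
def split_fixed_width(line, widths):
    """Split a line into fields using a list of fixed column widths."""
    fields = []
    start = 0
    for width in widths:
        # Slice out the next column and trim any padding around it.
        fields.append(line[start:start + width].strip())
        start += width
    return fields


# Hypothetical row: a 10-character date, a 6-character name, a 4-digit code.
row = "2017-02-16WIDGET0042"
print(split_fixed_width(row, [10, 6, 4]))
```

If the column widths vary from row to row, fixed offsets won't work and the split would need some other signal, such as a change from letters to digits.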