These are chat archives for FreeCodeCamp/DataScience

2nd
Mar 2017
Jake Waitze
@jwaitze
Mar 02 2017 13:56
i'm not a big fan of pdfs either
currently working on scraping the data from http://www.znu.ac.ir/data/members/rasoulifard_mohammad/crc.pdf
it's difficult
Amelia
@apottr
Mar 02 2017 16:07
@jwaitze what data are you trying to pull from this pdf?
Jake Waitze
@jwaitze
Mar 02 2017 16:20
eventually, many of those tables
but
currently i am working on specifically this one
ie. the standard thermodynamic entropy/enthalpy/gibbs for many chemical substances
trying to figure out a clean way to scrape it off while maintaining the structure
Amelia
@apottr
Mar 02 2017 16:25
isn't there another source for that information?
Jake Waitze
@jwaitze
Mar 02 2017 16:26
surprisingly not
not in a way i can utilize easily programmatically, at least
Amelia
@apottr
Mar 02 2017 16:30
that is surprising
one would think someone would have already pulled that information
Jake Waitze
@jwaitze
Mar 02 2017 16:30
i know, right
Jake Waitze
@jwaitze
Mar 02 2017 16:33
not big enough
there are some small thermodynamics tables for a select few substances
but the big CRC handbook one is the data set i am after
Amelia
@apottr
Mar 02 2017 16:34
interesting
Jake Waitze
@jwaitze
Mar 02 2017 16:34
or at least, a comparable one
Amelia
@apottr
Mar 02 2017 16:34
where did they pull the data from?
do they cite sources?
Jake Waitze
@jwaitze
Mar 02 2017 16:34
Who? The CRC folks?
or the one you linked?
Amelia
@apottr
Mar 02 2017 16:35
the CRC folks
Jake Waitze
@jwaitze
Mar 02 2017 16:36
it appears they've got the data from a number of other books
i'll have to search those up and see...
found the first one
Amelia
@apottr
Mar 02 2017 16:38
interesting
yeah, usually it's a good idea to avoid pdfs if at all possible
so if you can figure out where a pdf is getting its data from you might be able to find the data in a better format
Jake Waitze
@jwaitze
Mar 02 2017 16:39
someone in here, maybe it was you, said "PDFs are where data goes to die"
true facts
found the second reference
definitely not a book available in non-pdf form
that's the problem with these references
they're all in just books and nothing else
Amelia
@apottr
Mar 02 2017 16:43
Yeah
Hèlen Grives
@mesmoiron
Mar 02 2017 18:12
@erictleung yes it will; I have postponed variable design a bit. The best way to go about it is to start with a lot of variables and scale them back as knowledge grows. Sometimes the data is so cluttered that one data source could have many sub-tables. The tidy data thing will be next.
@erictleung thanx for the bullshit link. Love that haha that will be so fun :fire: :clap:
CamperBot
@camperbot
Mar 02 2017 18:15
mesmoiron sends brownie points to @erictleung :sparkles: :thumbsup: :sparkles:
:cookie: 467 | @erictleung |http://www.freecodecamp.com/erictleung
Hèlen Grives
@mesmoiron
Mar 02 2017 18:35
@jwaitze Hi Jake, I looked into your file; depending on what you need (maybe a few pages), you could consider OCRing those parts. I used gImageReader (I'm on Ubuntu) and that worked reasonably well. The tables are a bit more work. You could try extracting the table page and converting it to a proper PDF if it is not text-based. Try ABBYY FineReader Online. Otherwise there's free software on GitHub that can help with getting the table out. Not perfect, but for free it's just enough to get ahead. If you are interested, I think it is Tabula. But you have to switch between some programs to fix it.
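A rough sketch of the "keep the structure" part, in case it helps once the table text is out of the PDF. It assumes the extracted text keeps its column alignment and that each value starts under its header (real extractor output may right-align numbers and need adjusted offsets). The function names, column names, and sample values are hypothetical placeholders, not CRC data, and nothing here calls Tabula or any OCR tool.

```python
import csv
import io
import re

# A header cell is a run of words separated by single spaces;
# two or more spaces are treated as a gap between columns.
HEADER_CELL = re.compile(r"\S+(?: \S+)*")


def parse_fixed_width(text):
    """Turn column-aligned text into a list of dicts keyed by header name."""
    lines = [line.rstrip() for line in text.splitlines() if line.strip()]
    header, body = lines[0], lines[1:]
    matches = list(HEADER_CELL.finditer(header))
    names = [m.group() for m in matches]
    starts = [m.start() for m in matches]
    ends = starts[1:] + [None]  # the last column runs to the end of the line
    rows = []
    for line in body:
        # Slicing by header offsets keeps empty cells in their own column.
        cells = [line[start:end].strip() for start, end in zip(starts, ends)]
        rows.append(dict(zip(names, cells)))
    return rows


def to_csv(rows):
    """Serialize the parsed rows as CSV text."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def make_line(cells):
    """Build one aligned line of the placeholder sample (assumed column widths)."""
    return "{:<14}{:<9}{:<9}{:<9}{}".format(*cells).rstrip()


# Placeholder sample standing in for extracted table text (made-up values).
SAMPLE_TEXT = "\n".join(
    make_line(cells)
    for cells in [
        ("Name", "State", "dHf", "dGf", "S"),
        ("Substance A", "cr", "-1.0", "-2.0", "3.0"),
        ("Substance B", "g", "4.5", "", "6.5"),  # missing dGf stays empty
    ]
)

if __name__ == "__main__":
    print(to_csv(parse_fixed_width(SAMPLE_TEXT)))
```

Slicing by the header's column offsets, rather than splitting on whitespace, is what keeps a blank cell from shifting the later values one column to the left.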