These are chat archives for FreeCodeCamp/DataScience

4th May 2016
Joey Buczek
@joeybuczek
May 04 2016 00:04
I understand that part ... I guess what I'm asking is how the data is linked between the two parts. In other words, if someone answered both parts of the survey, how do I know which row in the second part goes with row "x" in the first part?
or is that irrelevant? It's been a while since I took the survey
Eric Leung
@erictleung
May 04 2016 00:11
@joeybuczek Mmm, now that I think about it, the second part should have all the data. It's kind of weird. The "2nd" part data has all the answers from the first part. If you look at the column names in the second part, there are some that are not in the form of a question, e.g. already_working. That data will also be in the first part. However, columns in the 2nd part that are in the form of a question, e.g. "What's your gender?", are exclusive to the 2nd dataset. @evaristoc let me know if my thinking is correct. If so, I can stop focusing on the 1st dataset and just work on the 2nd one.
Joey Buczek
@joeybuczek
May 04 2016 00:18

@erictleung The only headers that appear in both .csv files are as follows:

#, Other, None, Start Date (UTC), Submit Date (UTC), Network ID

I'm getting the .csv files from the github repo
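For reference, a minimal pandas sketch of how one could check which headers the two files share. The file names part1.csv and part2.csv are placeholders for the raw CSVs in the repo:

```python
import pandas as pd

# Placeholder file names; point these at the raw part-1 and part-2 survey CSVs.
part1 = pd.read_csv("part1.csv", low_memory=False)
part2 = pd.read_csv("part2.csv", low_memory=False)

# Headers that appear in both files
shared = sorted(set(part1.columns) & set(part2.columns))
print(shared)
```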
Eric Leung
@erictleung
May 04 2016 00:20
@joeybuczek If you look closely at the header row, #,"How old are you?",..., the hash sign # marks the first column. That first column holds ID strings like fdf0ad5463e8912add20cd0e8cde06d7a.
Joey Buczek
@joeybuczek
May 04 2016 00:24
I have all the headers and data correctly loading underneath them
there are more entries in the part-one CSV file than in part two ... so what I'm trying to establish is which column links the two, so I know which answers in part one pair up with which in part two
that makes me think that more people answered part one than part two
Eric Leung
@erictleung
May 04 2016 00:27
@joeybuczek never mind, the second part doesn't have all the data; I just remembered. The second dataset will have the most complete data, i.e. it covers the people who filled out both parts. The first dataset will have everyone who finished the first part of the survey, i.e. it will contain answers from everyone, independent of whether they finished the second part.
Joey Buczek
@joeybuczek
May 04 2016 00:28
that's what I was thinking ... however, for those who did fill out both parts ... do we associate them by Network ID ... or perhaps by submit timestamp? hmm
Eric Leung
@erictleung
May 04 2016 00:29
@evaristoc never mind about checking my thinking. You need both datasets to do a "complete" analysis. Been a long day...
Joey Buczek
@joeybuczek
May 04 2016 00:32
Okay ... so it appears that the "Submit Date (UTC)" timestamp on part one matches the "Start Date (UTC)" timestamp on part two, and that links up nicely with the Network ID as the identifier for combining both sets of answers ... which makes sense: part one is submitted within a few seconds of starting part two
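A rough pandas sketch of that linking idea, again using placeholder file names part1.csv and part2.csv: join on Network ID, then look at the gap between part one's submit time and part two's start time.

```python
import pandas as pd

part1 = pd.read_csv("part1.csv", low_memory=False)  # placeholder paths
part2 = pd.read_csv("part2.csv", low_memory=False)

# Keep only respondents who appear in both parts.
combined = part1.merge(part2, on="Network ID", how="inner",
                       suffixes=("_part1", "_part2"))

# How long between submitting part one and starting part two?
gap = (pd.to_datetime(combined["Start Date (UTC)_part2"])
       - pd.to_datetime(combined["Submit Date (UTC)_part1"]))
print(gap.describe())
```

If a Network ID appears more than once (as comes up further down in the chat), this join creates extra row pairings, which is where the timestamps become useful as a tiebreaker.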
Eric Leung
@erictleung
May 04 2016 00:34
@joeybuczek yes, that appears to be correct.
Joey Buczek
@joeybuczek
May 04 2016 00:34
Kind of funny... one person started their survey on 4/23/2016 2:21 and completed it by 5/1/2016 4:11
those are some long, thoughtful responses
:P
Quincy Larson
@QuincyLarson
May 04 2016 03:14

@sudeepnarkar @ozkoc Thanks for your kind words! Gathering the data was just a small part of it. Analyzing it is the real work :)

@krisgesling thanks! Yes - I am amazed at the geographic diversity of the responses.

@erictleung We shouldn't need two separate CSV files - we should be able to get everything into one file. Once we've merged everything from part 1 of the survey into part 2, we can probably just delete part 1 and rename part 2. If you can do this once, you will save everyone a ton of trouble down the line.

CamperBot
@camperbot
May 04 2016 03:14
:cookie: 138 | @ozkoc |http://www.freecodecamp.com/ozkoc
quincylarson sends brownie points to @sudeepnarkar and @ozkoc and @krisgesling and @erictleung :sparkles: :thumbsup: :sparkles:
:cookie: 4 | @sudeepnarkar |http://www.freecodecamp.com/sudeepnarkar
:cookie: 394 | @krisgesling |http://www.freecodecamp.com/krisgesling
:cookie: 341 | @erictleung |http://www.freecodecamp.com/erictleung
Daniel
@profoundhub
May 04 2016 03:50
about @ozkoc
CamperBot
@camperbot
May 04 2016 03:50
:cookie: 138 | @ozkoc |http://www.freecodecamp.com/ozkoc
Mat Lane
@zydecat
May 04 2016 07:01
Not sure if it has been addressed in this room, but would there be any issue with my conducting an analysis on the raw data as part of a master's dissertation that I am undertaking early next year? @QuincyLarson I'm guessing that you'd be the person I'd need to get permission from for this. Cheers, Mat.
evaristoc
@evaristoc
May 04 2016 11:06

Hola everyone,

@erictleung @joeybuczek according to @koustuvsinha the networkID is acting as the primary key between the two files. I haven't checked. @joeybuczek it's quite possible that some people skipped the second part. Duplicate data is also possible. The survey will likely have several discrepancies; we have to try to identify them and at least reduce their impact.
EVERYONE BE AWARE: you have to draw conclusions based on the context of how rigorous the fieldwork was. Those who know statistics: you could make observations about the validity of some of the conclusions. I suggest keeping your conclusions within the context in which the data was gathered and being VERY CAREFUL with any generalisation to a larger audience.

Eric Leung
@erictleung
May 04 2016 11:09
@evaristoc yeah, I've still been working on putting the data together. Sorry for the lack of communication lately. I've pushed my progress up to my branch if you want to take a look. The Network ID is not actually unique: I've found that you can have duplicate IDs but with different answers. So maybe two people used the same computer to finish the survey? I don't think there were a significant number of duplicates, so assuming they are separate respondents should be fine.
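A quick way to check that observation, assuming a placeholder path part2.csv for the second-part CSV: count how many Network IDs occur more than once and eyeball those rows.

```python
import pandas as pd

part2 = pd.read_csv("part2.csv", low_memory=False)  # placeholder path

counts = part2["Network ID"].value_counts()
dupes = counts[counts > 1]
print(len(dupes), "Network IDs appear more than once")

# Look at the duplicated rows to see whether the answers really differ.
dup_rows = part2[part2["Network ID"].isin(dupes.index)]
print(dup_rows.sort_values("Network ID").head(10))
```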
evaristoc
@evaristoc
May 04 2016 11:16
@erictleung Good! I will have to wait until later today to give it a look though... Contact you?
Joey Buczek
@joeybuczek
May 04 2016 11:22
I already combined the data last night and saw 442 duplicate network IDs ... given that that leaves 15,211 unique IDs, it's a very small percentage
Eric Leung
@erictleung
May 04 2016 11:24
@joeybuczek did you just match on network ID? And what software did you use to combine the data?
Joey Buczek
@joeybuczek
May 04 2016 11:25
I matched on Network ID, yes, and I simply imported into Excel from the web, nothing fancy, but I did some pivots based on the visuals already presented and it's accurate
Well, 99.9999% accurate given the dupes
But I'll look at it again today with fresh eyes and try to resolve the duplicates by matching on submit times as a second identifier
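Resolving the duplicates that way might look like the sketch below: the same join as before (placeholder file names again), but filtered so part one's submit time matches part two's start time. This assumes the two timestamps match exactly; if they only agree to within a few seconds, a tolerance-based comparison would be needed instead of strict equality.

```python
import pandas as pd

part1 = pd.read_csv("part1.csv", low_memory=False)  # placeholder paths
part2 = pd.read_csv("part2.csv", low_memory=False)

combined = part1.merge(part2, on="Network ID", how="inner",
                       suffixes=("_part1", "_part2"))

# Use the timestamp link as a second identifier to break ties
# among rows that share a Network ID.
matched = combined[combined["Submit Date (UTC)_part1"]
                   == combined["Start Date (UTC)_part2"]]

print(len(combined), "rows joined on Network ID alone")
print(len(matched), "rows after also matching the timestamps")
```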
Thomas Wolfe
@twolfe2
May 04 2016 14:43
Hi all, I just saw the Medium post with the high level results from the survey and they look great, nice job! Is there any way that I could help with the data visualizations?
Jason Boxman
@jboxman
May 04 2016 17:27
I've nearly finished the D3 projects.
I may be able to help with survey visualization
Eric Leung
@erictleung
May 04 2016 17:36

@twolfe2 thanks for your interest! What kind of visualization skills do you have? You can always head to the GitHub survey repository, take a look at the questions, start conversations about what kinds of visualizations you think would work, or even ask some questions yourself!

@jboxman best of luck finishing the rest! Looking forward to seeing what kind of visualizations we can make out of the data.

@zydecat glad to have you interested in using the data for your master's dissertation! I don't think @QuincyLarson should have a problem with it but let's wait on his approval :smile:

CamperBot
@camperbot
May 04 2016 17:36
erictleung sends brownie points to @twolfe2 and @jboxman and @zydecat and @quincylarson :sparkles: :thumbsup: :sparkles:
:cookie: 259 | @twolfe2 |http://www.freecodecamp.com/twolfe2
:star2: 1128 | @quincylarson |http://www.freecodecamp.com/quincylarson
:cookie: 373 | @zydecat |http://www.freecodecamp.com/zydecat
:cookie: 248 | @jboxman |http://www.freecodecamp.com/jboxman
Jason Boxman
@jboxman
May 04 2016 17:45
But first I have to hop on a call
I forked the repo though, I hope to have time to take a look at it later
It would be neat if we could stick it in MongoDB and query it
Looks perfect for a document-oriented db
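Loading the responses into MongoDB could be as simple as the sketch below, assuming a local MongoDB instance and a placeholder file combined-survey.csv for the merged data; the database, collection, and field names are illustrative.

```python
import csv
from pymongo import MongoClient

# Placeholder path to a merged/cleaned export of the survey.
with open("combined-survey.csv", newline="", encoding="utf-8") as f:
    responses = list(csv.DictReader(f))

client = MongoClient("mongodb://localhost:27017/")
collection = client["new_coder_survey"]["responses"]
collection.insert_many(responses)

# Illustrative query: the exact field name and value coding depend on how the
# CSV headers end up being cleaned.
print(collection.count_documents({"already_working": "1"}))
```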
Jason Boxman
@jboxman
May 04 2016 18:02
Looks like data cleaning is in progress: FreeCodeCamp/2016-new-coder-survey#26
Eric Leung
@erictleung
May 04 2016 18:05

@jboxman yeah, I think @QuincyLarson will want to store this in a db with an API for people to more easily query it in the future.

And yes, I'm a part of that cleaning and combining effort :smile: I should be finishing up soon. In the meantime, you can at least explore and familiarize yourself with the raw data.

Jason Boxman
@jboxman
May 04 2016 18:06
Awesome, thanks @erictleung !
CamperBot
@camperbot
May 04 2016 18:06
:cookie: 342 | @erictleung |http://www.freecodecamp.com/erictleung
jboxman sends brownie points to @erictleung :sparkles: :thumbsup: :sparkles: