These are chat archives for FreeCodeCamp/DataScience

1st
Sep 2015
Rex Schrader
@SaintPeter
Sep 01 2015 07:33
Random late night thought: With user data of which Bonfire/Waypoint they are on at a given time, correlate with activity on the /Help channels. See if particular Bonfires/Waypoints correlate strongly with higher levels of activity. Possible Conclusion: Said Bonfires/Waypoints either need clarification or more/better lead up material.
CamperBot
@camperbot
Sep 01 2015 07:33
type bonfire name to get some info on that bonfire. And check HelpBonfires chatroom
Ariel
@ArielLeslie
Sep 01 2015 15:55
@SaintPeter Given the attrition rate, I feel like you would just end up really biased toward the earliest bonfires
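One way to soften the bias raised here is to divide help-room activity by the number of users who actually reached each challenge, rather than comparing raw message counts. A minimal sketch, assuming the data were available as (user, challenge) pairs and a per-challenge head count; the function name, field layout, and sample numbers are all hypothetical:

```python
from collections import Counter

def help_rate_by_challenge(help_events, users_on_challenge):
    """Help-room messages per user who reached each challenge.

    help_events: iterable of (user, challenge) pairs, one per help message.
    users_on_challenge: dict mapping challenge -> number of users who reached it.
    """
    messages = Counter(challenge for _user, challenge in help_events)
    return {
        challenge: messages[challenge] / reached
        for challenge, reached in users_on_challenge.items()
        if reached > 0  # skip challenges nobody has reached yet
    }

# Made-up data: the early challenge has more messages but far more users.
events = [("a", "early")] * 30 + [("b", "late")] * 10
reached = {"early": 300, "late": 20}
rates = help_rate_by_challenge(events, reached)
```

With these made-up numbers the late challenge has fewer messages in total but a higher rate per user, which is the signal raw counts would hide.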
evaristoc
@evaristoc
Sep 01 2015 16:05

Welcome! You are reading my mind, @ArielLeslie! I haven't checked for attrition yet. I was planning to make a small attrition analysis and present it in the "You know that..." section for the next report.

I am making weekly reports. Shall I put you on the list? Just come and check! Any comment will be more than welcome!

Ariel
@ArielLeslie
Sep 01 2015 16:20
I'd be interested in that.
Rex Schrader
@SaintPeter
Sep 01 2015 18:13
I would like to see the attrition rate as well . . . I'm sure it's a nice power curve drop-off.
But you can filter for "active users" or something
I know that the data is going to be noisy as heck. It would be interesting to classify users in certain ways: For example, someone who is an "expert" coming in would complete LOTS of early waypoints very quickly.
Someone who is a genuine novice may only complete a few a day
I think that we might be able to tease these sort of relationships out and be able to group users into rough categories.
Then we can see how far each type gets, which problems send them to the help rooms, etc.
There is also going to be noise from broken tests as well.
Rex Schrader
@SaintPeter
Sep 01 2015 18:19
My hope is that we can point to spikes and explain them . . . For example, if you look at cohorts of people starting at the same time, or groups that reached the same problem at the same time vs. people who reached a problem at different times.
evaristoc
@evaristoc
Sep 01 2015 18:30

@SaintPeter and @ArielLeslie: By attrition, I understand people who leave and do not return... I was planning to work on the data we have for the Help channel... my plan is just to count at least 1 participation per day...
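A minimal sketch of that counting rule, assuming the chat data can be reduced to (user, day) pairs. The 30-day cutoff for "left and did not return" is an arbitrary assumption, and the user names are made up:

```python
from datetime import date

def active_days(messages):
    """Map each user to the set of days with at least one message."""
    days = {}
    for user, day in messages:
        days.setdefault(user, set()).add(day)
    return days

def attrited_users(messages, as_of, gap_days=30):
    """Users whose last active day is more than gap_days before as_of."""
    return {
        user
        for user, days in active_days(messages).items()
        if (as_of - max(days)).days > gap_days
    }

messages = [
    ("ana", date(2015, 7, 1)),
    ("ana", date(2015, 7, 1)),  # a second message on the same day counts once
    ("ana", date(2015, 7, 2)),
    ("bo", date(2015, 8, 25)),
]
gone = attrited_users(messages, as_of=date(2015, 9, 1))
```

Here `ana` was last seen 61 days before the cutoff date and counts as gone, while `bo` was active a week earlier and does not.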

@SaintPeter I have still not covered all those parts you are mentioning, for example about the completion of the exercises, or how to detect the problematic questions. Actually, those types of questions were the reason why Quincy and Berkeley opened this room. Other questions are related to the bot.

The purpose is to get there... But we are also busy trying to do something more FCC-ish: building a small app in JS for presenting the data... An idea would be to try to make this room not only for DataScience topics, but also a FCC playground to learn and practice JS, frontend, backend, etc... with a focus on data analysis...

Rex Schrader
@SaintPeter
Sep 01 2015 18:32
Cool
evaristoc
@evaristoc
Sep 01 2015 18:33
Let's see how it goes...
Rex Schrader
@SaintPeter
Sep 01 2015 18:33
I'm just thinking about the types of questions I'd like to answer, that I think would be helpful. A lot of this is dependent on the new site API. If I can query users, I can write scripts to distill their data down, in terms of bonfire completion, etc.
CamperBot
@camperbot
Sep 01 2015 18:33
type bonfire name to get some info on that bonfire. And check HelpBonfires chatroom
evaristoc
@evaristoc
Sep 01 2015 18:40

Yes, @SaintPeter... I would use the past data to tune a methodology and a code...

Also if you need to have just a taste of the data that there is in the chat, we can give you some or show you how to get it... We are using the API... if you are interested to do it yourself or with the team, it is up to you... we are working just defining small goals per week, trying not to affect our other activities, but trying to keep active...

Anyway: it will be great to have contributors: then we can easily achieve a final bigger stuff...

@SaintPeter I am planning with the rest of the team to advance with the app so we can eventually call more new people to contribute, hoping they get interested in the idea...
Rex Schrader
@SaintPeter
Sep 01 2015 18:48
The API you're referring to is the Gitter API or the FCC site API?
evaristoc
@evaristoc
Sep 01 2015 18:59
@SaintPeter Gitter... so far, Berkeley and Quincy are talking about making available not an API directly to FCC site but a repo with "log" files...
@SaintPeter but I don't have news about the progress of that part of the project...
@SaintPeter They want the data to be available to public for analysis purposes...
Ariel
@ArielLeslie
Sep 01 2015 19:07
If interactions are going to be log based, there is a lot you can do with logstash and elasticsearch
evaristoc
@evaristoc
Sep 01 2015 19:13
@ArielLeslie Good point (miscellaneous info: Gitter uses elasticsearch tech for its search feature)
@ArielLeslie I actually don't know the format of the file though... I said log, but it could be other kind...
@ArielLeslie I will bring that view to Berkeley: just in case...
Ariel
@ArielLeslie
Sep 01 2015 19:15
Logstash really works best on an active log file because it reads each entry as it is added, but elasticsearch doesn't require a logstash filter
If nothing else, I've found the documentation of elasticsearch to be very good
evaristoc
@evaristoc
Sep 01 2015 19:21
@ArielLeslie I also read a bit... but your point is interesting, actually... it is about which format allows the best way to analyse the data... I would just re-address this point with Berkeley and in the next report... (Disclaimer: I am not involved in the decision, but we could suggest some ideas, as we normally do...)
Ariel
@ArielLeslie
Sep 01 2015 19:24
I haven't done a whole lot with this kind of data analytics (at least not at scale). I just saw "log files" and ran with it.
evaristoc
@evaristoc
Sep 01 2015 19:28

@ArielLeslie hahaha! So, then!? We say in Venezuela: "you kill tigers, but THEN you get scared of the tiger skins"?

No, but really: your point is VEERY valid...

You should have some idea of data analysis for sure, and the nightmares of cleaning data, joining files, etc... this is the worst part, actually: getting everything up for analysis...
So if there are some ideas to share on what could be a useful data deployment, that should be always welcome...
Ariel
@ArielLeslie
Sep 01 2015 19:31
Consider everything I say as coming from the peanut gallery :)
evaristoc
@evaristoc
Sep 01 2015 19:32
@ArielLeslie sounds like there is a lot of wisdom down there, in the peanut gallery...
:)
Ariel
@ArielLeslie
Sep 01 2015 19:33
:) experience at least
evaristoc
@evaristoc
Sep 01 2015 19:34
Hehehe!!
Ariel
@ArielLeslie
Sep 01 2015 19:36
I think that a major component of a decision like that is probably how much you want to do in-house vs. how much you want to open it up to the community to say "Here is safe data. Show us what you can do." In the latter case tools like elasticsearch (and others, of course) are great because they are already so pluggable. On the other hand, if you are working toward a specific goal and want to optimize the associated behavior, there are probably better options
evaristoc
@evaristoc
Sep 01 2015 19:38
Yes... I think they are more about the first one.... so: not more words: elasticsearch: period!
No really... I think they are about the first thing, so it still makes a lot of sense...
But how they are going to make it available, that is something I don't really get... I think they are planning to open a repo for data to be downloaded...
Ariel
@ArielLeslie
Sep 01 2015 19:41
I wonder if a tool like logstash (or similar) could be configured to create repo commits of sanitized log files
evaristoc
@evaristoc
Sep 01 2015 19:41
So the way the data should be manipulated would be left to the user...
Ariel
@ArielLeslie
Sep 01 2015 19:42
There would need to be some sort of process on the FCC side that both sanitizes data and keeps the repo relatively up to date
evaristoc
@evaristoc
Sep 01 2015 19:42
About logstash: I guess so... to some extent
CamperBot
@camperbot
Sep 01 2015 19:42
you need to ask about @someone!
evaristoc
@evaristoc
Sep 01 2015 19:42
what is with you, camperbot??
Exactly...
Ariel
@ArielLeslie
Sep 01 2015 19:43
these are just tools that I'm familiar with using, so I'm just mentally thinking of ways to achieve goals
evaristoc
@evaristoc
Sep 01 2015 19:43
Do you use them at work?
Ariel
@ArielLeslie
Sep 01 2015 19:43
I did for one project.
We were creating alarms based off of log events and analyzing relevant data to send with the alarms to speed up user trouble-shooting
evaristoc
@evaristoc
Sep 01 2015 19:45
Ok... to be honest, I haven't used them before, but I don't see why you can't sanitize the data with them, actually: you SHOULD be able to: logs are now SUCH an important source of information!!!
Ok!!
Hmmm.... sounds a bit like the camperbot...
Of course it is a different thing but... well: nowadays almost EVERYTHING sounds like camperbot here (don't tell dcsan...)
Ariel
@ArielLeslie
Sep 01 2015 19:47
You definitely can sanitize data.
That's what I was musing about. There needs to be some automated process that sanitizes data and pushes it to the public source.
evaristoc
@evaristoc
Sep 01 2015 19:49
Yeah! I think that it is what is going to happen, indeed...
Ariel
@ArielLeslie
Sep 01 2015 19:50
one way or another, yeah. What I don't know about is how easy/hard/frustrating it is to automate pushes to a repo as opposed to a database/other
evaristoc
@evaristoc
Sep 01 2015 19:50
I normally use the word: "clean the data"... sanitize is a sort of new term? Sounds "sanitary"
Rex Schrader
@SaintPeter
Sep 01 2015 19:51
Sanitize in the sense that it doesn't have any personally identifying information.
As opposed to "cleaning" the data - removing parts that are wrong, malformed, or broken
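The distinction can be sketched as two separate passes over the records. This is only an illustration: the field names (`user`, `email`, `challenge`, `completed_on`) are hypothetical and are not the actual FCC schema:

```python
def clean(records):
    """Cleaning: drop records that are malformed (missing user or challenge)."""
    return [r for r in records if r.get("user") and r.get("challenge")]

def sanitize(records):
    """Sanitizing: keep only fields that do not identify a person."""
    keep = ("challenge", "completed_on")
    return [{k: r[k] for k in keep if k in r} for r in records]

records = [
    {"user": "ana", "email": "ana@example.com",
     "challenge": "bonfire-1", "completed_on": "2015-09-01"},
    {"user": "", "challenge": "bonfire-2"},  # malformed: no user
]
public = sanitize(clean(records))
```

Cleaning runs first because it may need the identifying fields to decide whether a record is valid; sanitizing then strips them before anything is published.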
evaristoc
@evaristoc
Sep 01 2015 19:51
Hey man! Sounds like "sorry, I have an emergency!"
Rex Schrader
@SaintPeter
Sep 01 2015 19:51
The first is for user privacy, the second is for your sanity :)
evaristoc
@evaristoc
Sep 01 2015 19:52
Hehehe!
Rex Schrader
@SaintPeter
Sep 01 2015 19:53
Haha - Sanitize - I think this usage is somewhat new, but is closer to the second meaning: "to make (something) more pleasant and acceptable by taking things that are unpleasant or offensive out of it", but without the negative connotation
evaristoc
@evaristoc
Sep 01 2015 19:53
@ArielLeslie I would also prefer a database! It is not only pushing: it is also pulling
@SaintPeter hahaha!
Rex Schrader
@SaintPeter
Sep 01 2015 19:55
In terms of pushing data to a repo, that is not hard to do with git - a push command can easily do that and the extraction can be automated.
It would be ideal to push to a live database, though
I just don't know what volume of data we're looking at. If it's large, then pushing to a file and then reading the file into a DB may be more efficient/manageable
Maybe something daily
evaristoc
@evaristoc
Sep 01 2015 19:56
I always would prefer the DB, but then they have to think on an API or so, I guess...
Not much: the data of about 300,000 people registered in FCC?
Rex Schrader
@SaintPeter
Sep 01 2015 19:56
95k registered
but you're only going to send deltas after the initial load
evaristoc
@evaristoc
Sep 01 2015 19:57
Oh!! I was referring to another course? Ok...
Rex Schrader
@SaintPeter
Sep 01 2015 19:57
Well, I assume it would just be deltas
you lose anonymity if you retain a reconstitutible user id
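A minimal sketch of a delta export, assuming each event carries a completion date and no user id at all, which sidesteps the anonymity concern at the cost of not being able to follow one user over time. The field names and dates are hypothetical:

```python
from datetime import date

def delta_since(events, last_export):
    """Return only the events completed after the previous export date."""
    return [e for e in events if e["completed_on"] > last_export]

events = [
    {"challenge": "waypoint-1", "completed_on": date(2015, 8, 30)},
    {"challenge": "bonfire-1", "completed_on": date(2015, 9, 1)},
]
# The initial load would publish everything; later runs publish only the delta.
delta = delta_since(events, last_export=date(2015, 8, 31))
```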
evaristoc
@evaristoc
Sep 01 2015 19:57
Indeed... I don't think that right now there is a lot of info there, although certainly relevant...
And yes: the idea is anonymous info, like that one I reported about you today, for example...
Ariel
@ArielLeslie
Sep 01 2015 19:58
@SaintPeter Yeah. In that case you would need the FCC data sanitization to be on the other side: between the DB and the user
Rex Schrader
@SaintPeter
Sep 01 2015 19:58
Here are things I'm thinking would be useful/interesting:
1) User Start Date (to the day)
2) User challenge (waypoint/bonfire/etc) completion date (to the day)
I would love to be able to correlate between the FCC site and Gitter data
evaristoc
@evaristoc
Sep 01 2015 19:59
Well... remember they have all that info in a DB, in fact...
Ariel
@ArielLeslie
Sep 01 2015 19:59
I was just going to say that I would be really interested in a "community involvement" set of metrics
Rex Schrader
@SaintPeter
Sep 01 2015 20:00
For the gitter side, I think it would be interesting to do some data characterization: Can we determine low/medium/high gitter utilization. Are there clear buckets?
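A first pass at such buckets could simply split users at the tertiles of their message counts. This is a sketch under that assumption only: it always produces three groups, so it cannot show whether the buckets are "clear", and the user ids and counts are made up:

```python
from statistics import quantiles

def utilization_buckets(message_counts):
    """Label each user low/medium/high by tertile of message count."""
    # n=3 yields two cut points that split the counts into three groups.
    low_cut, high_cut = quantiles(message_counts.values(), n=3)
    labels = {}
    for user, count in message_counts.items():
        if count <= low_cut:
            labels[user] = "low"
        elif count <= high_cut:
            labels[user] = "medium"
        else:
            labels[user] = "high"
    return labels

counts = {"u1": 1, "u2": 2, "u3": 3, "u4": 4, "u5": 5,
          "u6": 6, "u7": 7, "u8": 8, "u9": 100}
labels = utilization_buckets(counts)
```

Plotting a histogram of the counts before settling on cut points would be the more honest way to answer whether natural groups exist.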
I gotta go - have a meeting - I'll check back later.
Ariel
@ArielLeslie
Sep 01 2015 20:00
It would be a good way of identifying "power users"
evaristoc
@evaristoc
Sep 01 2015 20:00
Actually the only point for real-time data is occurring here, in the chat...
Yes, we are going to work on that, @ArielLeslie
Ariel
@ArielLeslie
Sep 01 2015 20:01
Not trivial problems, but interesting ones
partly because it forces you to weight actions and qualities
evaristoc
@evaristoc
Sep 01 2015 20:02
What would be the interesting ones for you?
Ariel
@ArielLeslie
Sep 01 2015 20:04
Just for my own curiosity, I'd be interested in the different "types" of campers (young and early career, career changers, hobbyists, professional devs, etc) and how their behavior differs as well as how it changes over time and eventual outcomes
evaristoc
@evaristoc
Sep 01 2015 20:05
This is not a date site, eh?
Ariel
@ArielLeslie
Sep 01 2015 20:05
date site?
evaristoc
@evaristoc
Sep 01 2015 20:06
I am joking... but that is hard...
Ariel
@ArielLeslie
Sep 01 2015 20:06
Very! I just meant that that's the kind of information I'm most interested in.
Another interesting outcome would be a resource for identifying/matching potential pairs for pair-programming or mentors
You know how Words With Friends will suggest opponents by saying "Ariel plays at the same pace as you"?
evaristoc
@evaristoc
Sep 01 2015 20:08
Ok... I am expecting to find a lot of gaps in the information... BUT there are ways to approximate that information... although some additional resources (ie time) would be needed...
Ariel
@ArielLeslie
Sep 01 2015 20:08
again: hard!
evaristoc
@evaristoc
Sep 01 2015 20:09
And that last point you mention: SNA (social network analysis)... I believe we can play a bit with that here too...
Ariel
@ArielLeslie
Sep 01 2015 20:09
In the realistic-scope, just having a resource for the common questions like "how many people actually finish?" "how long does it really take?" "Am I going too slow?" would be great
evaristoc
@evaristoc
Sep 01 2015 20:10
How do people associate in FCC? This is more a research question though... but I would like to do something like that...
Ariel
@ArielLeslie
Sep 01 2015 20:11
Some would require self-reporting
evaristoc
@evaristoc
Sep 01 2015 20:11
Let's see: for now my plan is to prepare a small-size Text Mining project to see if we can go further with that...
Ariel
@ArielLeslie
Sep 01 2015 20:12
that's nontrivial itself
evaristoc
@evaristoc
Sep 01 2015 20:12
Yes... there are other ways, although less precise... very much deeply in pure DataScience, Machine Learning, etc...
Ariel
@ArielLeslie
Sep 01 2015 20:12
Oh sure, but from a cost/value perspective I don't know that they're sufficiently significant
evaristoc
@evaristoc
Sep 01 2015 20:13
It is not, but for example one of the documents I will use for the training dataset has been used to determine age (categories) and gender with a good level of accuracy...
Ariel
@ArielLeslie
Sep 01 2015 20:13
a machine learning model that advises/predicts success of campers based on interactions would be a fascinating PhD project
evaristoc
@evaristoc
Sep 01 2015 20:14
Hey!!! Have you ever tried research, @ArielLeslie?
Ariel
@ArielLeslie
Sep 01 2015 20:14
Only at the undergraduate level
evaristoc
@evaristoc
Sep 01 2015 20:14
You seem to have questions...
Ariel
@ArielLeslie
Sep 01 2015 20:14
<<Nerd
evaristoc
@evaristoc
Sep 01 2015 20:14
Hehehehe!
Ariel
@ArielLeslie
Sep 01 2015 20:15
I'll do a PhD eventually, but not yet. I'm also married to a mathematician, so this is dinner-table conversation for me :D
evaristoc
@evaristoc
Sep 01 2015 20:15
Bad or good nerd? Are there bad and good?
Ariel
@ArielLeslie
Sep 01 2015 20:15
nerds like us are allowed to be unironically enthusiastic about stuff… Nerds are allowed to love stuff, like jump-up-and-down-in-the-chair-can’t-control-yourself love it. Hank, when people call people nerds, mostly what they’re saying is ‘you like stuff.’ Which is just not a good insult at all. Like, ‘you are too enthusiastic about the miracle of human consciousness’
evaristoc
@evaristoc
Sep 01 2015 20:16
Hehehe! Look! Nice...
Maths! And you a computer scientist! What a combination! Do you have kids!?
Ariel
@ArielLeslie
Sep 01 2015 20:17
Nope. Just very smart dogs :D
evaristoc
@evaristoc
Sep 01 2015 20:17
That would be INTERESTING!
Nerd dogs!!
Ariel
@ArielLeslie
Sep 01 2015 20:17
We hear that a lot :D
evaristoc
@evaristoc
Sep 01 2015 20:18
Well... if there are many people saying it, there is a big chance that it could be true...
Statistics...
Ariel
@ArielLeslie
Sep 01 2015 20:19
I think in this case it just reflects certain cultural biases... But I tend to agree with the majority of our friends who say our kids would either be brilliant or totally fucked up.
evaristoc
@evaristoc
Sep 01 2015 20:20
Very much black&white!! Hehehehe! No middle terms!
Ariel
@ArielLeslie
Sep 01 2015 20:20
It's an inclusive or at least
evaristoc
@evaristoc
Sep 01 2015 20:21
It can be both? fucked brilliant? hopefully not brilliantly fucked up...
Well... I will be asking you again... I don't know... 10 years? And see what happened...
Hey, Smart Tiger-Killer! I have to go! So see you here again, eh!!? Not in 10years!
Take it easy!!! They are going to be brilliant!
Ariel
@ArielLeslie
Sep 01 2015 20:27
Take care