These are chat archives for FreeCodeCamp/DataScience

31st
Jul 2015
Quincy Larson
@QuincyLarson
Jul 31 2015 16:12
Let me kick off this room by saying that our near-term plans are to open up our entire anonymized dataset for academic study. This will include:
1) Date of signup
2) True/false values for whether a given user has added LinkedIn, GitHub, Facebook, Twitter or CodePen URLs
3) Reported location
4) A Unix timestamp of when they completed each challenge, if they've completed it yet
5) Their longest consecutive streak of completing challenges each day and their current streak
6) Whether they've unsubscribed from our email
7) The timestamps of their non-challenge Brownie Points (submitting on Camper News then being upvoted, thank-you points for helping other campers in chat)
We plan to make this weekly-updated .csv file available at a public URL
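A minimal sketch of how a file like this might be read once it is published. The column names (`signup_date`, `has_github`, `challenge_<name>` and so on), the `true`/`false` encoding, and the sample rows are all assumptions made for illustration; the real .csv layout is not specified above.

```python
import csv
import io
from datetime import date, datetime, timezone

# Hypothetical sample rows; the actual column names and values are unknown.
SAMPLE = """signup_date,has_github,location,longest_streak,current_streak,unsubscribed,challenge_hello_world,challenge_fizzbuzz
2015-03-01,true,Amsterdam,12,3,false,1425200000,
2015-06-15,false,Tokyo,4,0,true,1434400000,1434500000
"""


def parse_camper(row):
    """Convert one raw CSV row (all strings) into typed values."""
    # An empty cell is assumed to mean the challenge is not completed yet.
    completed = {
        name[len("challenge_"):]: datetime.fromtimestamp(int(value), tz=timezone.utc)
        for name, value in row.items()
        if name.startswith("challenge_") and value
    }
    return {
        "signup_date": date.fromisoformat(row["signup_date"]),
        "has_github": row["has_github"] == "true",
        "location": row["location"],
        "longest_streak": int(row["longest_streak"]),
        "current_streak": int(row["current_streak"]),
        "unsubscribed": row["unsubscribed"] == "true",
        "completed": completed,
    }


def load_campers(csv_text):
    """Parse the whole CSV text into a list of typed camper records."""
    return [parse_camper(row) for row in csv.DictReader(io.StringIO(csv_text))]


campers = load_campers(SAMPLE)
```

Storing completion times as timezone-aware datetimes keeps later questions (time between challenges, streaks) independent of the analyst's local clock.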
dc
@dcsan
Jul 31 2015 16:13
w00t datascience :) we can have some sweet D3 graphy stuff!
Quincy Larson
@QuincyLarson
Jul 31 2015 16:13
Regarding mining Gitter’s archive - Google indexes all of Gitter’s messages. Rather than trying to scrape them, I will ask if they can prepare an endpoint we can hit for a big dump. I don’t know how high priority this is, but considering that it would avoid a boat load of API calls, I think they’ll consider it.
@evaristoc has already started mining our chat room data, and has started drafting a paper about it here: https://docs.google.com/document/d/1vowY6o943e8NhPSmxifFvIthU0sLnAdTXE19z9yJQFA/edit
dc
@dcsan
Jul 31 2015 16:14
oh thats really interesting. ML for the Bot
i was thinking last night about an auto generated FAQ
based on, literally FAQs
we could start sending all the chatter to wit.ai and then mine it afterwards
Quincy Larson
@QuincyLarson
Jul 31 2015 16:16
There’s a dark side to trying to measure/automate everything, and I see way too many developers fall into this trap. So I’m going to post this XKCD to remind everyone that automation has costs, too: https://imgs.xkcd.com/comics/is_it_worth_the_time.png
dc
@dcsan
Jul 31 2015 16:17
aka shaving a yak
evaristoc
@evaristoc
Jul 31 2015 16:38
Hi @QuincyLarson! Thanks for the info and the room!
@dcsan hehehe!
dc
@dcsan
Jul 31 2015 16:39
do you have experience writing ML code then?
not sure if you saw but we are working a chat bot for the rooms
right now you have to ask it specifically for help on a topic
but we could pipe the chatter through some ML analysis to get a list of FAQs?
evaristoc
@evaristoc
Jul 31 2015 16:45
@dcsan are you writing to me?
dc
@dcsan
Jul 31 2015 16:45
yep!
sry i'll use DMs then more
evaristoc
@evaristoc
Jul 31 2015 16:45
I have some experience with ML, yes!
but python and R...
No, I haven't seen the chat bot yet... @benmcmahon100 mentioned something in the /CurriculumDevelopment chat we had
evaristoc
@evaristoc
Jul 31 2015 16:53
I am going jogging but we can keep talking about this... I am curious... I would like to see whether I can contribute to some aspects of the project, if you think it is worthwhile
dc
@dcsan
Jul 31 2015 16:53
there's a demo in
still buggy though
most ML libs seem in those two languages
are there any decent "ML as a service" companies out there yet?
I was using wit.ai for simple classification
evaristoc
@evaristoc
Jul 31 2015 17:01
I cannot remember one off the top of my head right now... but I am sure there are... It seems for the time being that the focus is more on text mining, right? Anyway, dcsan: let me go for a jog and I will come back... I will check the bot. I also thought of using the bot to create "tickets". But after having a look at the problem of finding help in /Help, I just asked myself "wait: we can also let our trainees know more about how to use the Gitter search..."... The thing is that I have barely used the service...
This can be a bot...
evaristoc
@evaristoc
Jul 31 2015 19:44
wit.ai is quite outstanding... I thought you were talking about something different, but SaaS for NLP as open source is something I haven't seen until now... I have no suggestions, I am afraid...
dc
@dcsan
Jul 31 2015 19:45
api.ai is another one
they're pretty much aimed at "IOT" stuff tho, so not much of a classification engine
looking for something else to classify our chats and see what questions come up again and again. it's a classic problem in ML circles, so I'm sure there must be a project out there.
evaristoc
@evaristoc
Jul 31 2015 19:47
Well, they classify the text... and apparently in a super way... they should be using techniques and classifications that are becoming standard
dc
@dcsan
Jul 31 2015 19:48
they do classify but only after we trim the set down a bit
evaristoc
@evaristoc
Jul 31 2015 19:48
ah...
dc
@dcsan
Jul 31 2015 19:48
they can kind of figure out
evaristoc
@evaristoc
Jul 31 2015 19:48
so you have to do a bit of data preparation...
dc
@dcsan
Jul 31 2015 19:48
does anyone know about XX
can someone help me with XX
whats XXX all about
but trying to figure out what are the actual questions ...
throwing a ton of text at some engine and hope it finds repeat groups of words, that are question types
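The three phrasings listed above can be treated as templates with a slot for the topic. The sketch below is one possible illustration using plain regular expressions; it is not how wit.ai or the bot works, and the names `TEMPLATES`, `extract_topic` and `count_topics` are hypothetical.

```python
import re
from collections import Counter

# One pattern per phrasing quoted in the chat; XX becomes a named group.
TEMPLATES = [
    re.compile(r"does anyone know about (?P<topic>.+)", re.IGNORECASE),
    re.compile(r"can someone help me with (?P<topic>.+)", re.IGNORECASE),
    re.compile(r"what'?s (?P<topic>.+?) all about", re.IGNORECASE),
]


def extract_topic(message):
    """Return the lower-cased topic if the message fits a template, else None."""
    for pattern in TEMPLATES:
        match = pattern.search(message)
        if match:
            return match.group("topic").strip(" ?.!").lower()
    return None


def count_topics(messages):
    """Count how often each extracted topic appears."""
    topics = (extract_topic(message) for message in messages)
    return Counter(topic for topic in topics if topic)
```

Fixed templates only catch questions that follow a known phrasing, which is exactly the limitation raised in the chat: finding the question types themselves needs some form of grouping over the raw text.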
evaristoc
@evaristoc
Jul 31 2015 19:49
ah!! ok... I think I understand..
I thought they were providing more than that... but still it is pretty useful and good, especially for short communications...
instructions...
dc
@dcsan
Jul 31 2015 19:52
yeah
we could break all our inputs up
there are other engines that could figure if $input is likely a question or a reply
then we can sort the questions in some fuzzy way
to see what comes up the most
do some kind of clustering
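The steps above (split the inputs, keep the likely questions, group them fuzzily, see what comes up most) could be sketched roughly as follows. This is a toy stand-in under stated assumptions: a crude first-word heuristic replaces the "other engines" for question detection, and greedy grouping by Jaccard word overlap replaces real clustering. The stop-word list, the threshold and all function names are hypothetical.

```python
import re

# Assumed word lists for illustration only; a real system would tune these.
QUESTION_STARTS = {"does", "do", "can", "could", "how", "what", "whats",
                   "why", "where", "who", "is", "are"}
STOPWORDS = {"a", "an", "the", "i", "do", "to", "how", "can", "someone",
             "what", "are", "is", "me", "with", "about", "does", "anyone",
             "know"}


def tokens(text):
    """Lower-cased content words of a message, as a set."""
    return set(re.findall(r"[a-z0-9']+", text.lower())) - STOPWORDS


def is_question(text):
    """Crude guess: ends with '?' or starts with a typical question word."""
    words = re.findall(r"[a-z']+", text.lower())
    return text.strip().endswith("?") or (bool(words) and words[0] in QUESTION_STARTS)


def jaccard(a, b):
    """Share of words two token sets have in common (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_questions(messages, threshold=0.5):
    """Greedily group questions whose word overlap with a group's first
    member reaches the threshold; largest groups are returned first."""
    clusters = []  # list of (token set of first member, member messages)
    for message in messages:
        if not is_question(message):
            continue
        message_tokens = tokens(message)
        for representative, members in clusters:
            if jaccard(message_tokens, representative) >= threshold:
                members.append(message)
                break
        else:
            clusters.append((message_tokens, [message]))
    return sorted((members for _, members in clusters), key=len, reverse=True)
```

Greedy grouping depends on message order and compares only against the first member of each group, so it is best read as a way to eyeball recurring questions rather than as a proper clustering method.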
evaristoc
@evaristoc
Jul 31 2015 19:54
hmmm... that is interesting...
so in the ideal world, your bot would be a true learning robot...
dc
@dcsan
Jul 31 2015 19:58
thats the goal
btw this is a public room right?
evaristoc
@evaristoc
Jul 31 2015 19:59
that I don't know.... I guess it is...
evaristoc
@evaristoc
Jul 31 2015 20:16
I will check over the next few days to see if we can come up with the tools you are asking for to analyse the data... I guess that a full open source NLP engine could prove hard to find, I would say not necessarily due to the algorithms per se but because of the challenges of data cleaning and/or the corpora that they use to standardise the text... my guess is that those corpora are an asset that not many people are willing to share... but who knows...
In the meantime, do you have anything else in mind about data analysis that we can work on easily? I would like to combine this with completing the training in the meantime... Let me know? DM is ok too! Leaving now...
dc
@dcsan
Jul 31 2015 20:17
ill give it some thought
evaristoc
@evaristoc
Jul 31 2015 20:22
Have a look at what @BerkeleyTrue suggested at /CurriculumDevelopment... I would like to start something simple, no ML but more exploring the data...
One of the questions that popped into my mind was to explore the activity in the /Help channel: there are hours I don't see because I am asleep... I wonder: what are the peak times of activity in the /Help channel? What is the most requested help at each peak? How is the activity of the helpers at those times? Are there enough helpers to cover the demand at each peak? This could be an exploration of one month of data, just to show something...
I'll leave you, @dcsan! Nice job, man!
Bye everyone
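The peak-time question could be explored with something as small as the sketch below. It assumes messages are available as (user, timestamp) pairs in the same "Jul 31 2015 16:12" format this archive uses, and that month names are in English; the sample pairs are taken from this room, not from /Help. It counts every active user per hour, since telling helpers apart from askers would need a label the archive does not carry.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Timestamp layout as printed in this archive, e.g. "Jul 31 2015 16:12".
STAMP_FORMAT = "%b %d %Y %H:%M"

# (user, timestamp) pairs copied from messages earlier in this room.
SAMPLE = [
    ("QuincyLarson", "Jul 31 2015 16:12"),
    ("dcsan", "Jul 31 2015 16:13"),
    ("QuincyLarson", "Jul 31 2015 16:13"),
    ("dcsan", "Jul 31 2015 16:14"),
    ("evaristoc", "Jul 31 2015 19:44"),
    ("dcsan", "Jul 31 2015 19:45"),
    ("evaristoc", "Jul 31 2015 19:47"),
]


def parse_stamp(stamp):
    """Parse an archive timestamp into a datetime (no timezone attached)."""
    return datetime.strptime(stamp, STAMP_FORMAT)


def messages_per_hour(messages):
    """Number of messages posted in each hour of the day."""
    return Counter(parse_stamp(stamp).hour for _, stamp in messages)


def users_per_hour(messages):
    """Number of distinct users who posted in each hour of the day."""
    active = defaultdict(set)
    for user, stamp in messages:
        active[parse_stamp(stamp).hour].add(user)
    return {hour: len(users) for hour, users in active.items()}


def peak_hours(messages, top=3):
    """The busiest hours of the day, busiest first."""
    return [hour for hour, _ in messages_per_hour(messages).most_common(top)]
```

Dividing `messages_per_hour` by `users_per_hour` for each hour gives a rough messages-per-person figure, one possible first signal of whether the people online at a peak can keep up with demand.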
dc
@dcsan
Jul 31 2015 20:23
wheres the link to that?
/CurriculumDevelopment ?
where?
@evaristoc