These are chat archives for FreeCodeCamp/DataScience

3rd Feb 2016
evaristoc
@evaristoc
Feb 03 2016 08:48

Hi All:

Everyone is welcome to participate with discussions, links, etc., but

For those who are interested in real hands-on experiences:

Just remember that the current dataset on Academic Torrents is not the only place to find real data to play with:

  • this chatroom, and any chatroom on Gitter, is publicly accessible (be nice!)
  • I will be talking to Quincy about giving us some access to FCC Twitter and FCC Facebook (be nice!)
  • FCC has publications on Medium, Quora, etc. to get data from
  • CodePen: you should ask for authorisation first, because the only way to get data from CodePen is scraping; please consult with me if you are interested

Although a couple of projects already do that, we are not encouraging scraping of FCC pages, because scraping currently puts a lot of pressure on FCC databases. Only one project that scrapes on a regular basis is supported (@roelver's "top100"). So, unless you are planning a small project with low impact (e.g. @luishendrix92's "Solution Getter"), it is better not to scrape, much less on a regular basis. If you still want to go ahead with some scraping, please let us have a look at the project before implementing it. Be nice!

You can come up with new ideas, projects, etc. It doesn't have to be about FCC, but it would be nice if you incorporated JS in your project. You decide what part of the JS stack you want to include. I have a personal interest in d3js and web dev fundamentals, with a special interest in the backend (MEAN stack). However, as I discussed previously, d3js could be overkill for small projects. Additionally, angular (the A of MEAN) is no longer supported by the FCC program. It is a question of preferences, so decide your JS path and go ahead!

All levels and (almost) all types of projects are welcome!

You can add more projects and/or more data you want to get into...


My personal path...
Taking into consideration my level and expertise, as well as the value of the information for FCC, my personal preferences at the moment are:

  • chatroom analysis: foundations of social network analysis (SNA); user life cycle; speech acts; campers' performance
  • FCC dataset: user life cycle; foundations of code submission analysis (I proposed a collaborative project to @luishendrix92); campers' performance (e.g. the report by @george-stepanek was a simple but nice one)
  • products: interactive visualization tools; simple recommenders or indexes; currently proposed for the hackathon is a registration and evaluation app (user stories: a peer-to-peer evaluation section and a resolution-time evaluation section, with leaderboards) that is being worked on by @jbmartinez
  • others: Google Analytics; I have a special interest in social-minded topics (environment, health, wealth, etc.); also communicating to large audiences through articles (falling closer to digital journalism)
Serenity
@qmikew1
Feb 03 2016 08:50
heya @evaristoc I'm going to download (yes, will actually do it this time) the ubuntu chat corpus described — looks like around 0.5 GB. I can definitely see (though not well versed in python) the overhead in processing (i.e., the loops per user, then multiplied out across all users)... as an aside, I wonder if capturing responses from one user name to another matching a question origin (or hit) would be helpful --- hey you just posted while I was writing this lol
evaristoc
@evaristoc
Feb 03 2016 08:52
@qmikew1 yes I wanted to be the first :)
that is part of the challenge of doing the analysis with large datasets ;)... that is why learning algos and data structures could be useful
Serenity
@qmikew1
Feb 03 2016 09:00
@evaristoc no doubt. (I'm a glutton for punishment, thinking about processing overhead at 4am)
evaristoc
@evaristoc
Feb 03 2016 09:01
@qmikew1 Try to see what you can do... start by evaluating performance on a sample, don't do it all at once... just take a fraction and tune your code on that fraction first
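A minimal sketch of that sampling advice in python (the function names are mine, not from any project code):

```python
import itertools

# Tune your pipeline on a fraction of the data first. `lines` can be an
# open file handle or any other iterable, so the whole log never has to
# fit in memory.
def head_sample(lines, n=1000):
    """Take the first n records to profile/tune your code cheaply."""
    return list(itertools.islice(lines, n))

def fraction_sample(lines, k=10):
    """Take every k-th record for a more representative slice."""
    return [line for i, line in enumerate(lines) if i % k == 0]
```

Once the code runs fast and correctly on the sample, point it at the full file.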
@qmikew1 what exactly is your idea?
@qmikew1 and why not try the FCC chatroom instead for your practice ;)?

what do you mean by:

I wonder if capturing responses from one user name to another matching a question origin (or hit) would be helpful

HelpBasejumps is a room that has no analyses yet...
Serenity
@qmikew1
Feb 03 2016 09:05
@evaristoc if, say, a user asks a question and an answer carrying the questioner's tag is identified (i.e., among the approaches in the gist)...
and the amount of responses
and.... content among those
(responses) could be inferred content as well (not sure I have this fully developed)
that is to say, the patterns in responses could somewhat confirm the question's validity as well
(i.e., additional criteria to indicate question's pattern)
Serenity
@qmikew1
Feb 03 2016 09:13
In other words, you can (possibly) draw conclusions from the responses (if they exist) -
evaristoc
@evaristoc
Feb 03 2016 09:15

Not really... :)

The last phrase:

that is to say, the patterns in responses could somewhat confirm the question's validity as well

in theory, yes. You expect some redundancy that could become a pattern in some cases, but Natural Language Processing is a bit hard: precision could be low because that redundancy is not that reliable.

"Questions" can be expressed as a question ("?") or as a request ("hey! I need help for this!"), for example. In short, you can communicate the same thing using different phrases. Then you have a lot of misspellings, etc.... that we understand but the machine doesn't...
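To make the point concrete, here is a toy heuristic for spotting "questions" (the phrase list is invented for illustration; exactly because phrasing varies so much, a rule like this will have poor precision and recall on real chat text):

```python
import re

# Crude rule: a message "asks a question" if it ends with "?" or opens
# a help request with a known phrase. Misspellings and implicit
# requests slip straight through - which is the limitation discussed.
HELP_RE = re.compile(r"\b(help|how do i|anyone know|can someone)\b", re.IGNORECASE)

def looks_like_question(msg):
    msg = msg.strip()
    return msg.endswith("?") or bool(HELP_RE.search(msg))
```

For example, `looks_like_question("hey! I need help for this!")` matches via the phrase list even though there is no question mark, while a misspelled "halp" would be missed.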

If the ubuntu dataset you are using is annotated (i.e. pre-classified), that would be great! You can use it for comparisons for sure...

The common practice when you have pre-classified data is to go for a supervised method. Generally speaking it consists of:
1) Get a small classified section of the dataset to use for training
2) Get another classified section for test
3) Apply a classification model to the first, and then use the test set to verify how good your model is

(I don't remember if the ubuntu dataset was annotated though...)
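A toy version of those three steps in plain python (the labeled messages and the word-count "model" are invented for illustration; a real project would use an annotated corpus and a proper classifier):

```python
from collections import Counter

# 1) A small classified section for training (invented examples).
train = [("how do I fix this error?", "question"),
         ("anyone know why this fails?", "question"),
         ("thanks, that worked", "chat"),
         ("good morning everyone", "chat")]

# 2) Another classified section held out for testing.
test = [("why does this loop never end?", "question"),
        ("good work everyone", "chat")]

# 3a) "Train": count how often each word appears under each label...
word_counts = {"question": Counter(), "chat": Counter()}
for text, label in train:
    word_counts[label].update(text.lower().split())

def classify(text):
    scores = {label: sum(counts[w] for w in text.lower().split())
              for label, counts in word_counts.items()}
    return max(scores, key=scores.get)

# 3b) ...then verify on the held-out test set.
accuracy = sum(classify(t) == lab for t, lab in test) / len(test)
```

On this tiny invented data the model scores perfectly, which says nothing about real performance; the point is only the train/test discipline.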

Serenity
@qmikew1
Feb 03 2016 09:20
yeah I actually have to look at the data set (lol) .... but the main reason (yes, I went on a bit of a detour) is I wanted to see why/how the first two issues apply — the first one actually, where specific ubuntu technologies are used, and whether that is adaptable or not --
evaristoc
@evaristoc
Feb 03 2016 09:22
Can you elaborate a bit more, @qmikew1?
You wanted to see why/how the first two issues of what?
If the triggering question is identified, depending on the question, check the amount of answers?

Ahh... ok:

I wonder if capturing responses from one user name to another matching a question origin (or hit) would be helpful

Yes, you want to navigate to the first trigger question, right?
Serenity
@qmikew1
Feb 03 2016 09:27
so, the abstract talks about ubuntu technologies (their example, Unity — which sucks btw — is ubuntu-context-specific). The questions and exercises in the/this curriculum aren't infinite. I'm wondering if a contextual model can be applied to the exercises and the technologies being taught. (i.e., why would that not work?)
evaristoc
@evaristoc
Feb 03 2016 09:28
hmmmm.... I didn't find an annotated dataset that I could trust for the analyses I did (over chatrooms), so I used a different approach: "time of silence". I haven't tested the effectiveness of the approach though...
I think I'm starting to get your idea. But can you explain what a contextual model means?

This:

the questions and exercises in the/this curriculum aren't infinite.

Sounds reasonable

evaristoc
@evaristoc
Feb 03 2016 09:35
@qmikew1 Are you suggesting using the ubuntu dataset on the FCC chatroom, assuming that some contextual similarities exist?
Serenity
@qmikew1
Feb 03 2016 09:36
you would swap ubuntu technologies for fcc exercises (as well as general stuff like node, css, etc.); the challenge would be the context
evaristoc
@evaristoc
Feb 03 2016 09:38
Yes! This is a usual approach in text mining: that is the sort of redundancy you are looking for, and that was my approach in this project
It was relatively OK! But the algorithm I applied was very resource-expensive (exhaustive search). I was just trying a brute-force test though.
But in general, in data mining / machine learning projects you face a usual dilemma: the more effective analyses are very resource-intensive.
Serenity
@qmikew1
Feb 03 2016 09:43
totally
evaristoc
@evaristoc
Feb 03 2016 09:44
Solving a data mining project is like facing an NP-hard problem, computationally speaking
So you will have an advantage if you know some algorithms
Serenity
@qmikew1
Feb 03 2016 09:45
or chunk things out in passes
but you're right, you can't do it in any context (i.e., via a script on a file or rows in a db) without iterating through
evaristoc
@evaristoc
Feb 03 2016 09:47

Iteration is not the problem. Iterating over the whole dataset again and again could be. I say so because under some conditions you can avoid that...

But back to your point, I think I already did something similar to what you mentioned. Shall we try something different? More at your level perhaps...

Serenity
@qmikew1
Feb 03 2016 09:50
processing large files is at my level
evaristoc
@evaristoc
Feb 03 2016 09:50
And with python? How far along are you?
Serenity
@qmikew1
Feb 03 2016 09:50
but I get what you're saying
evaristoc
@evaristoc
Feb 03 2016 09:51
If I talk about queues, hash tables, heaps, trees, multiprocessing... do you know what they are and how to implement them in python?
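For reference, python's stdlib already covers most of those structures out of the box; a quick sketch (trees usually end up as nested dicts or custom classes, and multiprocessing lives in the `multiprocessing` module):

```python
from collections import deque
import heapq

# queue: deque gives O(1) appends/pops at both ends
queue = deque()
queue.append("msg1")
queue.append("msg2")
first = queue.popleft()          # FIFO order: "msg1" comes out first

# hash table: dict gives O(1) average-case lookup/insert
counts = {}
counts["evaristoc"] = counts.get("evaristoc", 0) + 1

# heap: heapq keeps the smallest item at index 0
heap = [5, 1, 3]
heapq.heapify(heap)
smallest = heapq.heappop(heap)   # pops the minimum
```

Knowing which of these fits a processing step is often what turns "iterate over everything repeatedly" into a single pass.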
Serenity
@qmikew1
Feb 03 2016 09:52
python = non-existent (the concepts, yep)
my comments are from a scripting and db perspective
evaristoc
@evaristoc
Feb 03 2016 09:56

Ok: I think we can then take one of the scripts I already made and improve it, so you learn python in the meantime by having a look at the code...

A possible idea would be to refactor the code so it becomes more efficient. I don't want to touch OO in my scripts for these projects yet: procedural is fine for short scripts.

Serenity
@qmikew1
Feb 03 2016 09:57
sounds good
Serenity
@qmikew1
Feb 03 2016 10:01
ok, bookmarked (it will take me a while to digest)
evaristoc
@evaristoc
Feb 03 2016 10:02
@qmikew1 If I remember correctly, I included some relevant comments... What I would suggest starting with, though, is how to use the urllib and requests libraries
I think that would be the best place to start, because I am afraid we have to download some data from Gitter...
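A hedged sketch of what that download step might look like with stdlib urllib (the Gitter endpoint shape, room id, and Bearer-token header here are assumptions — check the Gitter REST API docs before relying on them; requests would make the same thing shorter):

```python
import json
import urllib.request

API = "https://api.gitter.im/v1"  # assumed Gitter REST base URL

def build_request(room_id, token, limit=100):
    """Build an authenticated GET request for a room's recent messages."""
    url = f"{API}/rooms/{room_id}/chatMessages?limit={limit}"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"})

def fetch_messages(room_id, token):
    """Download and decode one page of chat messages (network call)."""
    with urllib.request.urlopen(build_request(room_id, token)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With requests the body of `fetch_messages` collapses to roughly `requests.get(url, headers=...).json()`, which is why it is the more popular choice.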
Serenity
@qmikew1
Feb 03 2016 10:03
got it. Thank you @evaristoc, I will go through it (as time permits this week)
CamperBot
@camperbot
Feb 03 2016 10:03
qmikew1 sends brownie points to @evaristoc :sparkles: :thumbsup: :sparkles:
:star: 216 | @evaristoc | http://www.freecodecamp.com/evaristoc
evaristoc
@evaristoc
Feb 03 2016 10:04
Take it easy.
Serenity
@qmikew1
Feb 03 2016 10:04
you too
evaristoc
@evaristoc
Feb 03 2016 10:08
Anyway @qmikew1: your ideas are fine... try to work on some of them? And you don't need to be so elaborate to do something: @george-stepanek came up with interesting insights just by having a quick look at the dataset...

People!
Again: for those who would like to get some hands-on experience: read my previous post!
Serenity
@qmikew1
Feb 03 2016 10:11
@evaristoc I think the pull of the elaborate also applies to you (lol), but yes, you have to start small, your advice isn't lost... now to go answer some questions. I'm out. :smile:
noncasus
@noncasus
Feb 03 2016 21:58
anyone know how many FCC campers are active?