These are chat archives for FreeCodeCamp/DataScience
discussion on how we can use statistical methods to measure and improve the efficacy of http://freeCodeCamp.com
Everyone is welcome to participate with discussions, links, etc., but just remember that the current dataset on Academic Torrent is not the only place to find real data to play with:
Although a couple of projects already do that, we are not encouraging scraping of FCC pages because scraping is currently putting a lot of pressure on the FCC databases. Only one project that scrapes on a regular basis is supported (@roelver's "top100"). So, unless you are planning a small project with low impact (e.g. @luishendrix92's "Solution Getter"), it is better not to try scraping, much less on a regular basis. If you still want to go ahead with some scraping, please allow us to have a look at the project before you implement it. Be nice!
You can come up with new ideas, projects, etc. It doesn't have to be FCC, but it would be nice if you incorporated JS in your project. You decide what part of the JS stack you want to include. I have a personal interest in d3js and web dev fundamentals, with a special interest in the backend (MEAN stack). However, as I discussed previously, d3js could be overkill for small projects. Additionally, Angular (the A in MEAN) is no longer supported by the FCC program. It is a question of preferences, so decide your JS path and go ahead!
You can add more projects and/or more data you want to get into...
My personal path...
Taking into consideration my level and expertise, as well as the value of the information for FCC, my personal preferences at the moment are:
what do you mean by:
I wonder if capturing responses from one user name to another matching a question origin (or hit) would be helpful
No really... :)
The last phrase:
that is to say, the patterns in responses could somewhat confirm the question's validity as well
In theory, yes. You expect some redundancy that could become a pattern in some cases, but Natural Language Processing is a bit hard: precision could be low because that redundancy is not that reliable.
"Questions" can be expressed as an actual question ("?") or as a request ("hey! I need help with this!"), for example. In short, you can communicate the same thing using different phrases. Then you have a lot of misspellings, etc. that we understand but the machine doesn't...
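To make that concrete, a first-pass heuristic for spotting "questions" in chat messages could look like the sketch below. Everything here is my own invention for illustration (the patterns, the `looks_like_question` helper) — nothing comes from the FCC data, and a real classifier would need far more than a few regexes:

```python
import re

# Hypothetical heuristic: a chat message can "ask a question" in many surface
# forms, so a single check for "?" misses requests phrased as statements.
QUESTION_PATTERNS = [
    re.compile(r"\?\s*$"),                                 # ends with a question mark
    re.compile(r"\banyone\b.*\b(know|help)\b", re.I),      # "does anyone know/help..."
    re.compile(r"\bi need help\b", re.I),                  # request phrased as a statement
    re.compile(r"^(how|what|why|where|when|who)\b", re.I), # wh-word opener
]

def looks_like_question(message: str) -> bool:
    """Return True if any heuristic pattern matches the chat message."""
    return any(p.search(message.strip()) for p in QUESTION_PATTERNS)

print(looks_like_question("How do I center a div?"))      # True
print(looks_like_question("hey! I need help for this!"))  # True
print(looks_like_question("thanks, that worked"))         # False
```

Note how brittle this is: misspellings, slang, or rephrasings slip straight through, which is exactly the low-precision problem mentioned above.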
If the Ubuntu dataset you are using is annotated (i.e. pre-classified), that would be great! You can use it for comparisons for sure...
The common practice when you have pre-classified data is to go for a supervised method. Generally speaking, it consists of:
1) Get a small classified section of the dataset to use for training
2) Get another classified section for testing
3) Apply a classification model to the training section, then use the test section to verify how good your model is
(I don't remember if the ubuntu dataset was annotated though...)
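The three steps above could be sketched like this. The mini-dataset of (message, label) pairs is invented for illustration — it is not the Ubuntu dataset — and the classifier is a from-scratch Naive Bayes so the sketch stays self-contained:

```python
from collections import Counter
import math

# Invented toy data: chat messages labelled as question/statement.
# "?" is kept as a separate token on purpose so the model can learn from it.
data = [
    ("how do I fix this ?", "question"),
    ("thanks that worked", "statement"),
    ("what does this error mean ?", "question"),
    ("nice job on the project", "statement"),
    ("can someone help me ?", "question"),
    ("I pushed the fix yesterday", "statement"),
    ("why does my loop never stop ?", "question"),
    ("good morning everyone", "statement"),
]

# 1) + 2) split the classified data into a training and a test section
train, test = data[:6], data[6:]

# 3) fit a tiny Naive Bayes classifier on the training section...
def train_nb(samples):
    word_counts = {"question": Counter(), "statement": Counter()}
    label_counts = Counter()
    for text, label in samples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

word_counts, label_counts = train_nb(train)
vocab_size = len(set(word_counts["question"]) | set(word_counts["statement"]))

def predict(text):
    words = text.lower().split()
    best_label, best_score = None, -math.inf
    for label in label_counts:
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label])  # class prior
        for w in words:                        # Laplace-smoothed word likelihoods
            score += math.log((word_counts[label][w] + 1) / (total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# ...and verify it on the held-out test section
accuracy = sum(predict(text) == label for text, label in test) / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

With real data you would typically reach for an existing library (e.g. scikit-learn) instead of hand-rolling the model, but the train/test workflow is the same.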
I wonder if capturing responses from one user name to another matching a question origin (or hit) would be helpful
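That idea — pairing a reply that mentions a user with that user's earlier question — could be prototyped roughly like this. The message format and the @-mention matching rule are assumptions for the sketch, not the real FCC/Gitter schema:

```python
# Hypothetical chat sample as (user, text) pairs — invented, not real data.
chat = [
    ("alice", "how do I merge two dicts in python?"),
    ("bob", "nice weather today"),
    ("carol", "@alice use the update() method"),
    ("dave", "@bob same here"),
]

def question_response_pairs(messages):
    """Pair each question with the first later message @-mentioning its author."""
    pairs = []
    for i, (user, text) in enumerate(messages):
        if "?" not in text:
            continue  # crude question heuristic; see the caveats discussed above
        for _, later_text in messages[i + 1:]:
            if "@" + user in later_text:
                pairs.append((text, later_text))
                break
    return pairs

for question, response in question_response_pairs(chat):
    print(question, "->", response)
```

Only alice's message qualifies as a question here, so the sketch pairs it with carol's `@alice` reply and skips dave's reply to bob (who never asked anything).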
the questions and exercises in the/this curriculum aren't infinite.
Iterating is not the problem. Iterating over the whole dataset again and again could be. I say so because under some conditions you can do that...
But back to your point, I think I already did something similar to what you are mentioning. Let's try something different? More at your level perhaps...
Ok: I think we can take one of the scripts I already made and improve it, so you learn Python in the meantime by having a look at the code...
A possible idea would be to reformulate the code so it becomes more efficient. I don't want to touch OO in my scripts for these projects yet: procedural is fine for short scripts.
qmikew1 sends brownie points to @evaristoc :sparkles: :thumbsup: :sparkles: