These are chat archives for FreeCodeCamp/DataScience
discussion on how we can use statistical methods to measure and improve the efficacy of http://freeCodeCamp.com
shivakrishna9 sends brownie points to @mesmoiron :sparkles: :thumbsup: :sparkles:
In addition to what @mesmoiron said, I would suggest picking a well-known standard library, especially if you are new. NLTK for Python is very popular and extensively documented. NLTK combines very well with scikit-learn, but you can also try other libraries. If you are using Python, though, don't go further than a few well-known libraries (not Theano, TensorFlow, etc.) unless you specifically want to try neural networks, which might be overkill for most sentiment analyses.
In the case of R, you should find one or two packages, though choosing is more difficult IMO. If you are not sure what to use, one thing I would suggest is to look for projects on Kaggle that focus on sentiment analysis and check what people use for R.
Pick whichever you feel more comfortable with. I am using Python (3.x).
R and Python (NLTK) may each rely on different shipped corpora (which might not be enough for good results), and the algorithms could differ a bit, so the results from each could differ too.
Depending on the level of accuracy you want to reach, both the corpora and the data size are key.
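To make the point about corpora concrete, here is a toy sketch in plain Python. It is not how NLTK or any R package scores text: the two lexicons, their weights, and the helper names `score` and `label` are all made up for illustration. It only shows that the same sentence can get a different label depending on which word list backs the scorer.

```python
import re

# Two tiny, hypothetical lexicons standing in for the word lists that
# two different toolkits might ship. The weights are invented.
LEXICON_A = {"good": 1, "great": 2, "bad": -1, "terrible": -2}
LEXICON_B = {"good": 1, "great": 1, "bad": -1, "heavy": -1}


def score(text, lexicon):
    """Sum the weights of the known words; unknown words count as 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(tok, 0) for tok in tokens)


def label(text, lexicon):
    """Map the numeric score to a coarse sentiment label."""
    s = score(text, lexicon)
    if s > 0:
        return "positive"
    if s < 0:
        return "negative"
    return "neutral"


review = "The course is good but heavy"
label_a = label(review, LEXICON_A)  # only "good" is known to lexicon A
label_b = label(review, LEXICON_B)  # "good" and "heavy" cancel out in lexicon B
```

With real libraries the differences are subtler, but the mechanism is the same: words missing from the shipped corpus simply do not contribute, so a bigger or more domain-specific corpus can change the outcome.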
evaristoc sends brownie points to @rudolphh :sparkles: :thumbsup: :sparkles:
For those interested in the text mining / information retrieval aspect of ML and DS, I am going through the Specialization in Data Mining (Coursera) by the University of Illinois. It is good, although heavy (I am doing 4 courses simultaneously) and very theoretical. It also focuses on the basics and on the less fancy but effective methods for some information retrieval problems. I already have relevant theoretical/practical background in several of the topics, so that helps.
And if someone has problems downloading videos, etc. from the new Coursera format, let me know and I can suggest a simple hack.
Just to keep this one around... https://www.quora.com/Can-scikit-learn-be-used-to-build-a-recommendation-system
If you read the link you will find a post about a project in a Data Science Bootcamp. I like that exercise. So far I have implemented very simple projects with that kind of implementation in mind, mostly with FCC users as the beneficiaries...
jayvora92 sends brownie points to @evaristoc :sparkles: :thumbsup: :sparkles:
Your challenge is to perform basic analytics on the server log file, provide useful metrics, and implement basic security measures.
The desired features are described below:
List the top 10 most active host/IP addresses that have accessed the site.
Identify the 10 resources that consume the most bandwidth on the site.
List the top 10 busiest (or most frequently visited) 60-minute periods.
Detect patterns of three failed login attempts from the same IP address over 20 seconds so that all further attempts to the site can be blocked for 5 minutes. Log those possible security breaches.
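The four features above could be sketched roughly like this in Python, using only the standard library. The challenge text does not say what the log looks like, so everything about the format is an assumption: lines in Common Log Format, a failed login being a `401` on a hypothetical `/login` path, a successful login resetting the failure count, and entries already sorted by time. All function names are made up for the sketch.

```python
import re
from collections import Counter, defaultdict, deque
from datetime import datetime, timedelta

# Assumed line shape (Common Log Format), for example:
# 10.0.0.1 - - [01/Jul/1995:00:00:01 -0400] "POST /login HTTP/1.0" 401 100
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse(line):
    """Turn one log line into a dict, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    parts = m["request"].split()
    return {
        "host": m["host"],
        "time": datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z"),
        "resource": parts[1] if len(parts) >= 2 else m["request"],
        "status": int(m["status"]),
        "bytes": 0 if m["size"] == "-" else int(m["size"]),
    }


def top_hosts(entries, n=10):
    """Feature 1: hosts with the most requests."""
    return Counter(e["host"] for e in entries).most_common(n)


def top_resources_by_bytes(entries, n=10):
    """Feature 2: resources that sent the most bytes in total."""
    totals = Counter()
    for e in entries:
        totals[e["resource"]] += e["bytes"]
    return totals.most_common(n)


def busiest_windows(entries, n=10, width=timedelta(minutes=60)):
    """Feature 3: windows are assumed to start at an event's timestamp."""
    times = [e["time"] for e in entries]  # assumed sorted ascending
    counts, end = {}, 0
    for start, t in enumerate(times):
        while end < len(times) and times[end] < t + width:
            end += 1
        counts.setdefault(t, end - start)  # keep first (largest) count per timestamp
    return sorted(counts.items(), key=lambda kv: -kv[1])[:n]


def blocked_attempts(entries, login_path="/login", fails=3,
                     window=timedelta(seconds=20), block=timedelta(minutes=5)):
    """Feature 4: return the requests that fall inside a 5-minute block."""
    recent = defaultdict(deque)  # host -> timestamps of recent failed logins
    blocked_until, flagged = {}, []
    for e in entries:
        host, t = e["host"], e["time"]
        if host in blocked_until and t < blocked_until[host]:
            flagged.append(e)  # this is what would be written to the breach log
            continue
        if e["resource"] != login_path:
            continue
        if e["status"] == 401:  # assumption: 401 means a failed login
            q = recent[host]
            q.append(t)
            while t - q[0] > window:
                q.popleft()
            if len(q) >= fails:
                blocked_until[host] = t + block
                q.clear()
        else:
            recent[host].clear()  # assumption: a successful login resets the count
    return flagged
```

To try it, something like `entries = [e for e in map(parse, open("log.txt")) if e]` (the file name is a placeholder) gives the list each function expects. For a big log the busiest-window function is the one to watch, since it keeps one count per distinct timestamp in memory.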