These are chat archives for FreeCodeCamp/DataScience

2nd
Apr 2017
Rudy Hernandez
@rudolphh
Apr 02 2017 11:21
I found this really cool presentation on D3. I just started learning it yesterday, and my CodePen shows it, but I thought I'd share because it has helped immensely, along with the documentation and some slightly dated Lynda resources.
http://www.macwright.org/presentations/dcjq/
If anyone has information on resources that helped them, please let me know. It'd be greatly appreciated. Thanks.
Shiva Krishna Yadav
@shivakrishna9
Apr 02 2017 12:49
@all Has anyone done sentiment analysis on news articles? I need some guidance.
Hèlen Grives
@mesmoiron
Apr 02 2017 13:48
@shivakrishna9 Not specifically, but I have looked into NLTK. It depends on what you want to do, but with what I know I would first build a corpus of cleaned news articles as plain text files. Next, I would strip the stop words from the files using stop lists. Then I would research a list of words that convey sentiment (i.e. positive, negative, neutral, etc.); I haven't looked into that myself. Finally, train on the dataset and use clustering to find the sentiment. You could also look for libraries that have already categorized words and use those for convenience. The difficulty with sentiment is double negation, or words used in the opposite sense to convey, say, a positive meaning. So context and the probability of a word occurring with certain types of words matter too.
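
A minimal sketch of the stop-word and sentiment-word steps described above, in plain Python. The word lists here are tiny, hypothetical placeholders (a real project would load a published stop list and sentiment lexicon, for example from NLTK), and the negation handling is deliberately crude. It scores text by lexicon lookup rather than by clustering, so it only illustrates the first steps, not the training step.

```python
import re

# Hypothetical, illustrative word lists; not taken from any real lexicon.
STOPWORDS = {"the", "a", "an", "is", "was", "of", "and", "to", "in", "it", "this"}
POSITIVE = {"good", "great", "strong", "gain", "success"}
NEGATIVE = {"bad", "weak", "loss", "crisis", "fail"}
NEGATIONS = {"not", "no", "never"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def sentiment_score(text):
    """Sum word polarities, flipping the next sentiment word after a negation."""
    score = 0
    negate = False
    for word in tokenize(text):
        if word in NEGATIONS:
            negate = not negate  # a double negation flips back
            continue
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        if polarity:
            score += -polarity if negate else polarity
            negate = False
    return score


def label(text):
    """Map the numeric score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Calling `label("The economy is not good")` walks the tokens `economy`, `not`, `good` and flips the polarity of `good`. A negation here stays active until the next sentiment word, however far away, which is exactly the kind of context problem mentioned above.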
Shiva Krishna Yadav
@shivakrishna9
Apr 02 2017 13:54
@mesmoiron thank you! It helps me a lot.
CamperBot
@camperbot
Apr 02 2017 13:54
shivakrishna9 sends brownie points to @mesmoiron :sparkles: :thumbsup: :sparkles:
:cookie: 330 | @mesmoiron |http://www.freecodecamp.com/mesmoiron
evaristoc
@evaristoc
Apr 02 2017 15:24

@shivakrishna9

In addition to what @mesmoiron said, I would suggest picking a well-known standard library, especially if you are new. NLTK for Python is very popular and extensively documented. NLTK combines very well with scikit-learn, but you can also try other libraries. If you are using Python, though, don't go further than a few well-known libraries (not Theano, TensorFlow, etc.) unless you would rather try neural networks, which might be overkill for most sentiment analyses.

In the case of R, you should find one or two packages, though choosing is more difficult IMO. If you are not sure what to use, one thing I would suggest is to find Kaggle projects that focus on sentiment analysis and check what people use for R.

Pick what you feel most comfortable with. I am using Python (3.x).

R and Python (NLTK) may each rely on different shipped corpora (which might not be enough for good results), and the algorithms could be a bit different, so the results from each could differ too.

Depending on the level of accuracy you want to reach, both the corpora and the data size are key.

@rudolphh Thanks!
CamperBot
@camperbot
Apr 02 2017 15:27
evaristoc sends brownie points to @rudolphh :sparkles: :thumbsup: :sparkles:
:cookie: 134 | @rudolphh |http://www.freecodecamp.com/rudolphh
Shiva Krishna Yadav
@shivakrishna9
Apr 02 2017 15:41
I will use NLTK, thank you @evaristoc
Jay Vora
@jayvora92
Apr 02 2017 18:38
Anyone interested in a data analytics (data engineering/data science domain) challenge? The problem is very interesting and involves a data source that has to be worked from different angles, like connecting to the server, cleansing the data, and building the necessary features. Interested people, please PM me.
evaristoc
@evaristoc
Apr 02 2017 19:49
@jayvora92 I can't take up your invitation at the moment, but I think it would be nice to know more. It sounds interesting for anyone in this group.
evaristoc
@evaristoc
Apr 02 2017 20:18

People:

For those interested in the text mining / information retrieval aspect of ML and DS, I am going through the Specialization in Data Mining (Coursera) by the University of Illinois. It is good, although heavy (I am doing 4 courses simultaneously) and very theoretical. It also focuses on the basics and on the less fancy but effective methods for some information retrieval problems. I already have relevant theoretical/practical background in several of the topics, so that helps.

And if someone has problems downloading videos, etc. from the new Coursera format, let me know and I can suggest a simple hack.

If you are interested in TensorFlow and Deep Learning, check the Udacity course by Google (very good!).
evaristoc
@evaristoc
Apr 02 2017 21:21

And...

Just to keep this one around... https://www.quora.com/Can-scikit-learn-be-used-to-build-a-recommendation-system

If you read the link you will find a post about a project in a Data Science Bootcamp. I like that exercise. So far I have implemented very simple projects with that kind of implementation in mind, mostly with FCC users as beneficiaries...

Jay Vora
@jayvora92
Apr 02 2017 21:38
@evaristoc thank you for that insight. Here is the basic summary. I have the data source in txt format and am looking for an open discussion on the best way to solve it.
CamperBot
@camperbot
Apr 02 2017 21:38
jayvora92 sends brownie points to @evaristoc :sparkles: :thumbsup: :sparkles:
:cookie: 333 | @evaristoc |http://www.freecodecamp.com/evaristoc
Jay Vora
@jayvora92
Apr 02 2017 21:38

Your challenge is to perform basic analytics on the server log file, provide useful metrics, and implement basic security measures.

The desired features are described below:

Feature 1:

List the top 10 most active host/IP addresses that have accessed the site.

Feature 2:

Identify the 10 resources that consume the most bandwidth on the site.

Feature 3:

List the top 10 busiest (or most frequently visited) 60-minute periods.

Feature 4:

Detect patterns of three failed login attempts from the same IP address over 20 seconds so that all further attempts to the site can be blocked for 5 minutes. Log those possible security breaches.
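
As a starting point for discussion, here is a rough sketch of the four features in plain Python. Several things are assumptions, since the challenge text does not state them: the log is taken to be in Common Log Format, a failed login is taken to be a request to `/login` answered with status 401, a 60-minute period is taken to start at the timestamp of a request, and all function names are made up for illustration.

```python
import re
from collections import Counter, defaultdict, deque
from datetime import datetime, timedelta

# Assumed Common Log Format, e.g.
# host - - [01/Jul/1995:00:00:01 -0400] "GET /path HTTP/1.0" 200 6245
LINE_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse(line):
    """Turn one log line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return {
        "host": m["host"],
        "time": datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z"),
        "resource": m["resource"],
        "status": int(m["status"]),
        "bytes": 0 if m["size"] == "-" else int(m["size"]),
    }


def top_hosts(entries, n=10):
    """Feature 1: hosts/IPs with the most requests."""
    return Counter(e["host"] for e in entries).most_common(n)


def top_resources(entries, n=10):
    """Feature 2: resources ranked by total bytes sent."""
    usage = Counter()
    for e in entries:
        usage[e["resource"]] += e["bytes"]
    return usage.most_common(n)


def busiest_periods(entries, n=10, window=timedelta(minutes=60)):
    """Feature 3: request counts for windows starting at each request time."""
    times = sorted(e["time"] for e in entries)
    counts = {}
    end = 0
    for start, t in enumerate(times):
        if t in counts:  # this window start was already counted
            continue
        while end < len(times) and times[end] < t + window:
            end += 1
        counts[t] = end - start
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def blocked_attempts(entries, fail_window=timedelta(seconds=20),
                     block_for=timedelta(minutes=5)):
    """Feature 4: requests that fall inside a block after three failed logins."""
    failures = defaultdict(deque)  # host -> times of recent failed logins
    blocked_until = {}
    blocked = []
    for e in sorted(entries, key=lambda e: e["time"]):
        host, now = e["host"], e["time"]
        if host in blocked_until and now < blocked_until[host]:
            blocked.append(e)
            continue
        if e["resource"] != "/login":  # assumed login endpoint
            continue
        if e["status"] == 401:  # assumed marker of a failed login
            recent = failures[host]
            recent.append(now)
            while now - recent[0] > fail_window:
                recent.popleft()
            if len(recent) >= 3:
                blocked_until[host] = now + block_for
                recent.clear()
        else:
            failures[host].clear()  # a successful login resets the count
    return blocked
```

A typical run would read the txt file line by line, keep the entries where `parse` does not return `None`, and pass that list to each function. Sorting everything in memory is fine for a sketch; a large log would probably need a streaming approach instead.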

Alice Jiang
@becausealice2
Apr 02 2017 23:55
Hi guys! I finally got an alright-ish draft done. Can I get some feedback? I'm looking for accurate and concise information with as little opinion as possible.