These are chat archives for FreeCodeCamp/DataScience

16th
Aug 2016
evaristoc
@evaristoc
Aug 16 2016 10:44

People

For those who might be interested in practising with skale.me:

People from skale.me answered a question about requirements:

Using skale-engine directly with JSON files is totally ok. You can already take advantage of parallelism from one node with multiple CPUs with skale (as opposed to a single nodeJS process), no need to go to a fully distributed system if it's not required by the volume of data or speed of computation.

Whether to set up the parallel functionality is up to us; it would be just for the sake of practising with the architecture as well. I am still seriously thinking about it, but we can decide that as a group.

For those who decide to install skale locally, just remember that:

"skale run" command is to run an app in the cloud, still in development and not ready yet... In the mean time, use "skale test" to execute your app locally on your host.

evaristoc
@evaristoc
Aug 16 2016 11:04

People

What to do after the pySpark course or with the skale.me project?

I still think the most interesting project would be the one involving skale.me. You can also make use of pySpark through the Databricks platform: if you are registered you will have access to a cluster and, I think, 20 GB of space, and there are ways to load data into your account.

If some of us decide to go ahead with more exercises during/after trainings:

  • My invitation is still the same: get some data, especially FCC data, use the tool to transform/analyse the data, and render the results on a webpage. Preparing a short Medium article for FCC is also encouraged.
  • My suggestion is to make small teams in this group.
  • If we follow the route above, my impression is that we won't be able to make a persistent demo like a Heroku website. In that case, an option could be to work on it locally and make a short video showing the results.
  • The FCC data available is not Big Data, but I am sure it will be enough for exercises, as it poses challenges similar to those of a Big Data dataset. Feel free to go ahead with any other dataset, though.
  • REMEMBER: it is NEVER about how big the data is but ALWAYS about how to extract relevant information out of it. You might end up applying the same methods to Big Data as you apply to small data (e.g. regressions, clusters, etc.), or just modifications of those techniques to suit larger datasets. Big Data technologies are usually meant to make it easier to apply those same techniques to Big Data. So no worries if the selected dataset is not Big Data.
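
To illustrate the last point, here is a minimal sketch in plain Python (not something from the pySpark course or from skale.me). It fits a simple linear regression from running sums, so the same fit can be computed on the whole dataset at once or chunk by chunk and then merged, which is roughly the kind of modification that lets a small-data technique scale. The class name `RegressionSums`, its methods, and the toy numbers are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class RegressionSums:
    """Running sums needed for a least-squares fit of y = intercept + slope * x."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def add_chunk(self, xs, ys):
        # Accumulate the sums for one chunk of (x, y) pairs.
        for x, y in zip(xs, ys):
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y

    def merge(self, other):
        # Sums from separate chunks (or workers) simply add up.
        return RegressionSums(
            self.n + other.n,
            self.sx + other.sx,
            self.sy + other.sy,
            self.sxx + other.sxx,
            self.sxy + other.sxy,
        )

    def fit(self):
        # Closed-form ordinary least squares for one predictor.
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope


# Toy data (hypothetical, generated from y = 2x + 1).
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 3, 5, 7, 9, 11]

# "Small data" style: everything in one pass.
whole = RegressionSums()
whole.add_chunk(xs, ys)

# "Big data" style: two chunks summarised separately, then merged.
first, second = RegressionSums(), RegressionSums()
first.add_chunk(xs[:3], ys[:3])
second.add_chunk(xs[3:], ys[3:])
merged = first.merge(second)

print(whole.fit())   # (1.0, 2.0)
print(merged.fit())  # (1.0, 2.0)
```

Both routes give the same intercept and slope; only the way the sums are collected changes.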
sosacrl
@sosacrl
Aug 16 2016 13:12
Does anybody have a good perspective on the new Microsoft Data Science XSeries on edX, other edX series, and other Data Science courses out there?
Alice Jiang
@becausealice2
Aug 16 2016 13:41
@sosacrl the Microsoft curriculum isn't technically an XSeries because it includes a course from another institution. All that really means, though, is that getting the certificate for the whole series (all courses must be paid for, not the free versions) is slightly more complicated. With the exception of the intro to Python and R courses, all the courses I have taken from their curriculum are more lessons in theory than practical application of data science (the hardest thing you have to do in most of them is copy/paste in the correct order), and they focus on using Microsoft tools, including things I have never heard a data scientist list as one of their tools (looking at you, Excel...). If you want a very beginner-level look at data science, it's a good set of courses to start with. Otherwise I'd really only advise someone to go through them because they only take a few hours and look good on a resume.
I should also add, I have intentionally not taken a couple of the courses because they had ridiculously awful reviews, so be sure to read through each course's review section before enrolling.
sosacrl
@sosacrl
Aug 16 2016 13:47
Thanks @alicejiang1. I'm a bit annoyed about the timeline restrictions; right now, for example, the SQL course is closed. What are your thoughts on actually paying for the certificate for this course, or for edX courses in general? Any specific course recommendations on edX or elsewhere?
Alice Jiang
@becausealice2
Aug 16 2016 14:49
I don't believe in buying knowledge and experience, personally... that's a political discussion I won't instigate here. If you want to pay for the verified certificates, go for it! They're meant to be proof of course completion, so if you want to include them in your portfolio, you can.
sosacrl
@sosacrl
Aug 16 2016 15:30
@alicejiang1 Do you recommend any specific resource for learning Data Science?
Alice Jiang
@becausealice2
Aug 16 2016 17:03
Experience
Eric Leung
@erictleung
Aug 16 2016 19:25

@sosacrl I echo @alicejiang1's suggestion, which is to just do data science. I'll admit it is vague, but it at least gets you doing things, and afterwards you can focus on interpretation. @evaristoc above has emphasized that data science is about "how to extract relevant information out of [data]."

So I would suggest Kaggle as a place to start because it gives you a data set and at least a question to answer. You can also check out the UCI Machine Learning Repository for datasets to mess around with. It gives a little less guidance than the Kaggle competitions on what you can ask about the data, but some of the datasets, such as the Iris dataset, will tell you what kind of "associated tasks" you can perform on the data (for the Iris dataset, it says you can do classification).

All in all,

  • choose a programming language (probably R or Python),
  • be curious about your data by asking questions,
  • explore potential patterns that may exist in the data, and
  • provide ways to present the data.
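
As a rough sketch of those four steps in Python, using only the standard library: the question asked here is "does the average measurement differ between groups?". The rows below are hypothetical placeholder values, NOT taken from the UCI Iris file, and the function names are made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (label, measurement) rows; illustrative values only.
ROWS = [
    ("species_a", 1.0),
    ("species_a", 2.0),
    ("species_b", 4.0),
    ("species_b", 6.0),
]


def mean_by_label(rows):
    """Explore: group the measurements by label and average each group."""
    groups = defaultdict(list)
    for label, value in rows:
        groups[label].append(value)
    return {label: mean(values) for label, values in groups.items()}


def render_report(summary):
    """Present: turn the summary into a small plain-text report."""
    lines = [f"{label}: mean = {avg:.2f}" for label, avg in sorted(summary.items())]
    return "\n".join(lines)


summary = mean_by_label(ROWS)
print(render_report(summary))
```

With a real dataset you would load the rows from a file instead, and the "present" step could just as well be a plot or a webpage.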
evaristoc
@evaristoc
Aug 16 2016 19:35
:+1: for @erictleung!
Alice Jiang
@becausealice2
Aug 16 2016 23:55
What is Data Science: A Business Perspective <- link to a current livestream of a data science meetup in Saint Louis
There is a "Perspectives from Data Scientists" counterpart coming up in October, fyi