These are chat archives for FreeCodeCamp/DataScience
discussion on how we can use statistical methods to measure and improve the efficacy of http://freeCodeCamp.com
People from skale.me answered a question about requirements:
Using skale-engine directly with JSON files is totally ok. You can already take advantage of parallelism from one node with multiple CPUs with skale (as opposed to a single nodeJS process), no need to go to a fully distributed system if it's not required by the volume of data or speed of computation.
Setting up the parallel functionality would be up to us, if only for the sake of practising with the architecture. I am still seriously considering it, but we can decide that as a group.
For those who decide to install skale locally, just remember that:
The "skale run" command runs an app in the cloud; it is still in development and not ready yet. In the meantime, use "skale test" to execute your app locally on your host.
I still think the most interesting project would be the one involving skale.me. You can also use PySpark through the Databricks platform: if you are registered you get access to a cluster and, I think, 20 GB of space, and there are ways to load data into your account.
If some of us decide to go ahead with more exercises during or after the training:
@sosacrl I echo @alicejiang1's suggestion, which is to just do data science. I'll admit it is vague but it at least gets you to doing things and then you can afterwards focus on interpretation. @evaristoc above has emphasized that data science is about "how to extract relevant information out of [data]."
So I would suggest Kaggle as a place to start because it gives you a data set and at least a question to answer. You can also check out the UCI Machine Learning Repository for datasets to mess around with. It is a little less guiding than the Kaggle competitions in terms of what you can ask about the data but some of the datasets, such as the iris dataset, will tell you what kind of "associated tasks" you can perform on the data (for the Iris dataset, it says you can do classification).
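To make the "classification" task mentioned for the iris dataset a bit more concrete, here is a minimal sketch in plain Python. It uses a nearest-centroid rule, which is just one simple way to classify and not something the repository prescribes. The six rows are hand-typed measurements in the iris format (sepal length, sepal width, petal length, petal width, in cm); treat them as illustrative rather than a verified copy of the UCI file. The names `TRAIN`, `fit_centroids` and `predict` are made up for this example.

```python
from collections import defaultdict
from math import dist

# Hand-typed sample rows in the iris format (illustrative values, not the full dataset):
# (sepal length, sepal width, petal length, petal width) in cm, plus a species label.
TRAIN = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]


def fit_centroids(rows):
    """Average the feature vectors of each species into one centroid per label."""
    grouped = defaultdict(list)
    for features, label in rows:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*samples))
        for label, samples in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest (Euclidean distance) to the flower."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


centroids = fit_centroids(TRAIN)
print(predict(centroids, (5.0, 3.4, 1.5, 0.2)))
```

With the real dataset you would load all 150 rows, hold some back for testing, and check how often the predicted label matches the true one before interpreting anything.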
All in all,