These are chat archives for FreeCodeCamp/DataScience
discussion on how we can use statistical methods to measure and improve the efficacy of http://freeCodeCamp.com
wwwfreedom sends brownie points to @evaristoc :sparkles: :thumbsup: :sparkles:
vicky002 sends brownie points to @evaristoc :sparkles: :thumbsup: :sparkles:
@wwwfreedom: Here is a more detailed overview:
If you are interested in data mining, ML, data science, etc., then yes: you are better off learning some Python. I would start with Python. But it is not the only option.
R is also an excellent option if you have a more statistical orientation. Matlab is worth considering if you are interested in modelling / solving difficult differential equations; it is also heavily used in finance/economics (where you will find a lot of applications of differential equations).
Note: be aware that ML and statistics share a common background. Although ML is more algorithmically oriented, at some point the same algorithms turn up in both fields to solve the same problems. I personally think ML is a branch of statistics, although ML was developed mostly by computer scientists and engineers.
Continuing with the languages...
R (implemented mostly in C and Fortran) is a statistics environment mainly made by statisticians. Python (implemented in C) libraries are built by computer scientists and mathematicians. Matlab has a more engineering tradition.
I have used all three, although Matlab (actually Octave, Matlab's free software clone) a bit less. I prefer Python because I find it substantially easier and faster than R, and easier for manipulating data structures than Matlab (whose atomic data structures are strongly vectorised arrays and matrices).
Python's philosophy is interesting here, very much Japanese in style: many of the solutions Python offers are actually "clones" of previously existing solutions. numpy/scipy is somewhat Matlab; pandas is somewhat R. So by learning Python properly, you will get an introduction to the other two (except for the heavy use of anonymous functions in Matlab and, to some extent, R).
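A toy sketch of that "clone" idea (not from the chat, just an illustration, assuming numpy and pandas are installed): numpy gives you Matlab-style vectorised matrix algebra, while pandas gives you an R-style labelled data frame.

```python
import numpy as np
import pandas as pd

# Matlab-style: vectorised matrix algebra with numpy
A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([1.0, 1.0])
row_sums = A @ b  # matrix-vector product, no explicit loop

# R-style: a labelled data frame with pandas
df = pd.DataFrame({"height": [1.70, 1.82, 1.65],
                   "weight": [68.0, 81.0, 59.0]})
bmi = df["weight"] / df["height"] ** 2  # vectorised column arithmetic, like R
```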
In Python, packages tend to be comprehensive: scikit-learn is currently the best-known library for DM and ML; pandas is currently the most popular library for data manipulation. Normally, once a library catches on in the Python community, everyone abandons the previous competing efforts and concentrates on that library, unless a developer believes they are after something substantially different. Instead of adding different packages as in R, in Python every new solution is built as a module of a popular existing library. Even more: successful projects become the base of other projects. For example, numpy/scipy is the base of scikit-learn, nltk and pandas, and scikit-learn/nltk/pandas are currently made to work together. This helps avoid redundancies (common in JS and R). The Python trade-off to avoid stagnation of the modules (e.g. when a better solution appears) is to expose extension points: if you want more, you can easily do it, but you have to code it yourself. In R it is much more difficult to change code, and even more difficult in Matlab/Octave. But in R you can often find a better algorithm for a specific task by looking for a different package.
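A minimal sketch of how those libraries stack (assuming numpy, pandas and scikit-learn are installed; the data here is made up): a pandas DataFrame can be passed straight to a scikit-learn estimator, because both sit on top of numpy arrays.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# A pandas DataFrame is backed by numpy arrays
X = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0]})
y = np.array([1.0, 3.0, 5.0, 7.0])  # exactly y = 2x + 1

# scikit-learn accepts the DataFrame directly; no conversion needed
model = LinearRegression().fit(X, y)
pred = model.predict(pd.DataFrame({"x": [4.0]}))
```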
Because R is made for statistics, it has much better reporting features. R used to have more powerful graphics than Python (Python is really catching up). R had more problems handling large amounts of data. This is partly because R also assigns numerous attributes to every class it builds to facilitate reporting, which in turn eats into the available memory, but also because of its main atomic data structure, the data.frame: it is not a true vectorised array. Things have changed since R got the data.table structure about 5 years ago: I think data.tables are closer to true arrays in the style of numpy/Matlab arrays/matrices. Python also has a better approach to multiprocessing when required.
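To make the "true vectorised array" point concrete (a sketch of my own, assuming numpy is installed): numpy stores a matrix as one contiguous, homogeneously typed block of memory, so whole-array arithmetic runs without a Python-level loop over rows or cells.

```python
import numpy as np

# One contiguous block of float64 values, shaped as a 3x4 matrix
m = np.arange(12, dtype=np.float64).reshape(3, 4)

# Whole-array operations: subtract each column's mean in one step
centered = m - m.mean(axis=0)
col_means = centered.mean(axis=0)  # should be all zeros
```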
@jameswinegar correctly mentioned Scala (a JVM language related to Java). I don't know Scala, but it has been a very popular option for heavyweight data manipulation and collection tasks in the last 2-5 years. Spark is a data processing framework with data mining / machine learning libraries, and a spin-off of Hadoop-like solutions: written in Scala on the JVM, it addresses the IO problem posed by many map-reduce solutions by keeping many tasks in memory instead. I have only glanced at Spark and plan to study it deeply later... The Spark API for Python (PySpark) is becoming very popular.
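Spark itself needs a cluster runtime, so here is only a toy, pure-Python sketch of the map-reduce word count that Spark speeds up (all data and names are illustrative). Classic Hadoop map-reduce writes the intermediate (word, 1) pairs to disk between stages; Spark's gain is keeping them in memory, as this single-process sketch trivially does.

```python
from collections import defaultdict

lines = ["spark keeps data in memory",
         "hadoop writes intermediate data to disk",
         "spark wins on iterative jobs"]

# map step: emit one (word, 1) pair per word
mapped = [(word, 1) for line in lines for word in line.split()]

# shuffle + reduce step: sum the counts per word (all in memory here)
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n
```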
Hope this helps.
wwwfreedom sends brownie points to @evaristoc and @jameswinegar :sparkles: :thumbsup: :sparkles:
evaristoc sends brownie points to @jameswinegar :sparkles: :thumbsup: :sparkles: