These are chat archives for FreeCodeCamp/DataScience

16th
Dec 2015
Kevin Quoc Truong
@wwwfreedom
Dec 16 2015 00:35
@evaristoc Thanks mate. Interesting blog with a JS focus. I was thinking I might have to learn Python because most ML resources seem to use it. Do you think it’s worthwhile to learn Python?
CamperBot
@camperbot
Dec 16 2015 00:35
wwwfreedom sends brownie points to @evaristoc :sparkles: :thumbsup: :sparkles:
:star: 193 | @evaristoc | http://www.freecodecamp.com/evaristoc
Vikesh Tiwari
@vicky002
Dec 16 2015 03:06
@evaristoc thanks. Will give my best :)
CamperBot
@camperbot
Dec 16 2015 03:06
vicky002 sends brownie points to @evaristoc :sparkles: :thumbsup: :sparkles:
:star: 194 | @evaristoc | http://www.freecodecamp.com/evaristoc
James Winegar
@jameswinegar
Dec 16 2015 05:56
@wwwfreedom You should definitely learn Python, and potentially Scala in the very long term. Scala is the language underpinning Spark, which is the distributed compute engine of choice for most large-scale ETL and machine learning. Spark has a Python API that calls the underlying Scala from Python, since Python acts as an excellent wrapper language. If you want to add additional data types to the Spark engine, you will need to know Scala to extend the library. Then use Python for the typical data analysis work, where a large set of libraries for performing data analysis exists.
evaristoc
@evaristoc
Dec 16 2015 09:35

@wwwfreedom: Here is a more detailed overview:
If you are interested in data mining, ML, data science, etc., then yes: you are better off learning some Python, and I would start with Python first. But it is not the only option.

R is also an excellent option and has a more statistical orientation. Matlab is a good choice if you are interested in modelling or solving difficult differential equations; it is also heavily used in finance/economics (where you will find a lot of applications of differential equations).

Note: Be aware that ML and statistics share a common background. Although ML is more algorithm-oriented, at some point the same algorithms are used in both fields to solve the same problems. I personally think that ML is a branch of statistics, although ML was developed mostly by computer scientists and engineers.

Continuing with the languages...

R (implemented mostly in C and Fortran) is a statistics package mainly made by statisticians. Python libraries (on top of an interpreter written in C) are built by computer scientists and mathematicians. Matlab has a more engineering tradition.

I have used all three, although Matlab (actually Octave, a free software clone of Matlab) a bit less. I prefer Python because I find it substantially easier and faster than R, and easier for manipulating data structures than Matlab (which has strongly vectorised arrays and matrices as its atomic data structure).

Python's philosophy is interesting here, very much Japanese style: most of the solutions that Python offers are actually "clones" of previously existing solutions. numpy/scipy is roughly Matlab; pandas is roughly R. So by learning Python properly, you will get an introduction to the other two (except for the heavy use of anonymous functions in Matlab and, to some extent, in R).

In R there is no single library for data mining and ML problems. It looks more like JavaScript in that sense: there is a repository of solutions called packages. The difference from the JS community is that all the solutions are reviewed and should provide things like manuals for using the library before being made available to the public.

In Python, packages tend to be complete: scikit-learn is currently the best-known library for data mining and ML; pandas is currently the most popular library for data manipulation. Normally, in the Python community, once a library catches on, everyone leaves previous competing efforts and concentrates on that library, unless a developer believes they are after something substantially different. Instead of adding different packages as in R, in Python every new solution is built as a module of a popular existing library. Even more: successful projects are the base of other projects. For example, numpy/scipy is the base of scikit-learn, nltk and pandas, and currently scikit-learn/nltk/pandas are made to work together. This helps to avoid redundancies (common in JS and R). Python's trade-off to avoid stagnation of the modules (e.g. when a better solution exists) is usually to provide boilerplate: if you want more, you can easily add it, but you have to write code. In R it is much more difficult to change code; it is even more difficult with Matlab/Octave. But in R you are able to find better algorithms for specific tasks by looking for a different package.

Because R is made for statistics, it has much better reporting features. R used to have more powerful graphics than Python (Python is really catching up). R had more problems handling large amounts of data. This is partly because R assigns numerous attributes to all the objects it builds in order to facilitate reporting, which in turn eats into the available memory, and partly because of its main data structure, the data.frame, which is not a real vectorised array. Things have changed since R introduced a tables data structure about 5 years ago: I think tables are true arrays in the form of numpy/Matlab arrays/matrices. Python also has a better approach to multiprocessing when required.

@jameswinegar correctly mentioned Scala (related to Java). I don't know Scala, but it has been a very popular option for heavyweight data manipulation and collection tasks in the last 2-5 years. Spark is an engine with data mining / machine learning libraries and a spin-off of Hadoop-like solutions: it runs on the Java platform and solves the I/O problem posed by many map-reduce solutions by performing many tasks in memory instead. I have only glanced at Spark and plan to study it in depth later... The Spark API for Python is becoming very popular.
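
To make the "in memory" point concrete, here is a toy sketch of the map-reduce idea in plain Python. It is not Spark and does not use any Spark API; the function names and the sample sentences are invented for the illustration. The only point is that the intermediate results of the map step stay in memory and are passed straight to the reduce step, rather than being written to disk in between.

```python
from collections import Counter
from functools import reduce


def map_words(line):
    """Map step: turn one line of text into per-word counts."""
    return Counter(line.lower().split())


def merge_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Hypothetical sample data, made up for this sketch.
lines = [
    "spark keeps data in memory",
    "hadoop writes data to disk",
    "spark is fast",
]

# Both steps operate on in-memory objects; nothing touches the disk in between.
partial_counts = [map_words(line) for line in lines]
word_counts = reduce(merge_counts, partial_counts, Counter())

print(word_counts["spark"], word_counts["data"])
```

A real engine would split `lines` across many machines; this sketch only shows the shape of the computation on one.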

Hope this helps.

evaristoc
@evaristoc
Dec 16 2015 09:48
Correction: @wwwfreedom about the relation between Scala and Spark: I think @jameswinegar has a better idea than me: as I said, I don't know much about those two yet...
Kevin Quoc Truong
@wwwfreedom
Dec 16 2015 09:51
@evaristoc @jameswinegar Thanks mates, I think you guys have swayed me into learning Python on the side, besides my main goal of completing FCC. I’m gonna start learning Python gradually from http://learnpythonthehardway.org/. If you guys know anything better let me know. Cheers :)
CamperBot
@camperbot
Dec 16 2015 09:51
wwwfreedom sends brownie points to @evaristoc and @jameswinegar :sparkles: :thumbsup: :sparkles:
:star: 195 | @evaristoc | http://www.freecodecamp.com/evaristoc
:star: 204 | @jameswinegar | http://www.freecodecamp.com/jameswinegar
evaristoc
@evaristoc
Dec 16 2015 10:22
@wwwfreedom good luck!
evaristoc
@evaristoc
Dec 16 2015 10:55
@wwwfreedom one more thing: although in Python there is relatively easy connectivity between libraries, the transition is not straightforward: each library has a different point of reference (scikit = Matlab; pandas = R), so you will have to deal with the particularities of each library's basic data structures (e.g. numpy arrays vs pandas DataFrames). R is a bit less problematic, although ggplot (a powerful graphics package) looks rather like another language. Matlab/Octave libraries don't normally pose transition issues: they are very consistent.
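A small sketch of the numpy array vs pandas DataFrame difference mentioned above, assuming numpy and pandas are installed. The column names and values are invented for the example; it only shows that the same numbers are addressed by position in numpy and by label in pandas, and that you often convert between the two when moving data from one library to another.

```python
import numpy as np
import pandas as pd

# A plain numpy array: homogeneous values, addressed by position only.
arr = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Wrapping it in a pandas DataFrame adds column labels and a row index.
# The column names are hypothetical, chosen just for this sketch.
df = pd.DataFrame(arr, columns=["height", "weight"])

# Label-based access in pandas vs position-based access in numpy.
mean_height_pandas = df["height"].mean()
mean_height_numpy = arr[:, 0].mean()

# Going back to a bare array, e.g. before handing the data to another library.
back_to_array = df.values

print(mean_height_pandas, mean_height_numpy, back_to_array.shape)
```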
James Winegar
@jameswinegar
Dec 16 2015 19:53
@evaristoc the Python API for Spark is just a wrapper that forwards your Python calls to the underlying Scala engine, which does the actual work.
evaristoc
@evaristoc
Dec 16 2015 20:23
@jameswinegar thanks for the clarification!
CamperBot
@camperbot
Dec 16 2015 20:23
evaristoc sends brownie points to @jameswinegar :sparkles: :thumbsup: :sparkles:
:star: 205 | @jameswinegar | http://www.freecodecamp.com/jameswinegar