These are chat archives for FreeCodeCamp/DataScience
discussion on how we can use statistical methods to measure and improve the efficacy of http://freeCodeCamp.com
Hmmm, interesting. I'm in the middle of watching this video (set at the right place for my comment), and the person being interviewed defined a black box model as
"models people pre-create like linear regression, random forests, or neural networks".
Things might be better these days for interpreting neural networks (and thus making them less black box-ish), but at least for linear regression and random forests, it is fairly clear what they are doing, and you can make inferences about what the data is telling you and why.
The person does mention that you can't just put data into models (which I agree with), and he might be confusing that concept with a true black box, which will just give you an answer without much interpretation or understanding of how the answer was obtained.
Side comment: this misunderstanding may be a result of easy access to tools (e.g. scikit-learn), where you just call some function and it gives you an answer, and unless you dig around a bit you never see how it arrived at that answer.
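To be fair to the tools, the fitted objects do expose their internals if you go looking. A toy sketch (assuming scikit-learn and NumPy are installed; the data here is made up) of digging into what a linear regression and a random forest actually learned:

```python
# Made-up data: y depends on the first two features; the third is pure noise.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lm = LinearRegression().fit(X, y)
print(lm.coef_)  # per-feature slopes; should land near [2, -1, 0]

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(rf.feature_importances_)  # sums to 1; the noise feature should score lowest
```

So the "black box" feeling often comes from stopping at the prediction instead of inspecting attributes like `coef_` or `feature_importances_`.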
Hmmm... interesting... I think that author might not know exactly what he was referring to, but I would say he is still right, even for how linear regression is used these days.
The way I see it, the "black box" the author seems to refer to is not the linear regression itself, but the method you select to solve it. Nowadays those methods are usually based on heuristics.
For example, @erictleung, back when I was studying statistics I was taught heuristics for solving linear regression, known as the stepwise methods.
However, they were heavily criticised: because they are heuristics, they often select variables that don't suggest any phenomenological explanation.
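To make the criticism concrete, here is a minimal sketch of that kind of heuristic: greedy forward stepwise selection that adds whichever variable most improves R². The function name and synthetic data are my own illustration, not anyone's actual method. Note that nothing in it ever asks whether a selected variable makes phenomenological sense:

```python
# Hypothetical forward stepwise selection: purely mechanical variable picking.
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y, n_select):
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        # Score each candidate by R^2 of the model with it added.
        scores = {j: LinearRegression().fit(X[:, selected + [j]], y)
                                       .score(X[:, selected + [j]], y)
                  for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 1] + 1.0 * X[:, 4] + rng.normal(scale=0.5, size=300)
print(forward_stepwise(X, y, 2))  # on this synthetic data it picks columns 1 and 4
```

The procedure happily returns *some* subset of variables either way; whether that subset means anything is left entirely to the person running it.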
There are several reasons for that lack of "explanation power".
One is applying an incorrect model (e.g. using a linear model like linear regression to explain a more complex phenomenon). The other is the data itself (data is not necessarily the best representation of the "phenomenon").
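A toy illustration of that first failure mode, with entirely made-up data (assuming scikit-learn and NumPy): a straight line fit to a quadratic phenomenon runs without complaint but explains almost nothing, while adding the right feature fixes it.

```python
# Made-up quadratic phenomenon; the linear model is the wrong model class.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=400)
y = x ** 2 + rng.normal(scale=0.2, size=400)

linear = LinearRegression().fit(x.reshape(-1, 1), y)
print(linear.score(x.reshape(-1, 1), y))  # R^2 near 0: wrong model class

quadratic = LinearRegression().fit(np.column_stack([x, x ** 2]), y)
print(quadratic.score(np.column_stack([x, x ** 2]), y))  # near 1 with the right feature
```

No error, no warning — only a critical look at the fit (or at the phenomenon) reveals that the linear model was the wrong choice.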
The fact is that we were usually asked to forget about applying heuristic methods and to apply "common sense" when selecting the variables that best fit the phenomenon (i.e. the model).
It is my impression that using common sense is not a usual practice these days, though. Countless people just apply heuristic methods without a critical evaluation of their results; they are not taught to be critical of their results.
This is probably why many descriptions of a good data scientist also stress the need for the data scientist to be an expert in the field (= "common sense"). As @erictleung noticed, mastering the methods of scikit-learn might not be enough to become a good Data Scientist.