These are chat archives for FreeCodeCamp/DataScience

15th
Feb 2017
Aditya Kurniawan
@akurniawan
Feb 15 2017 04:06 UTC
Hi guys, I’m really new to this field. Right now I’m trying to get a deeper grasp of what the effect of non-normally distributed features can be when training a model. Are there any articles that could enlighten me on this problem?
Amelia
@apottr
Feb 15 2017 04:08 UTC
@akurniawan try searching on https://scholar.google.com or http://arxiv.org
Victor Ngo
@Victorkjngo
Feb 15 2017 04:45 UTC
Is it hard to go from Junior web dev to Data science without a math background?
Amelia
@apottr
Feb 15 2017 04:47 UTC
yes
Eric Leung
@erictleung
Feb 15 2017 06:03 UTC
@akurniawan two ways to go about it. You can simulate non-normal data, fit a model on it, and see for yourself what kind of effects it has. Or you can do a Google search on it. You can get away without having a math background, but sooner or later you'll need to learn some probability, mathematical statistics, and linear algebra in order to better understand the modeling techniques you're using.
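For example, here's a minimal sketch of that simulation idea (NumPy and scikit-learn are just my tool picks, and the toy data and coefficients are made up):

```python
# Sketch of the simulation idea: fit the same linear model on a
# normally distributed target vs. a heavily skewed (log-normal) one.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=(n, 1))

# Multiplicative noise makes y log-normal, i.e. strongly right-skewed
y = np.exp(1.5 * x.ravel() + rng.normal(scale=0.5, size=n))

direct = LinearRegression().fit(x, y)
print("raw target R^2:", r2_score(y, direct.predict(x)))

# A log transform restores (approximate) normality of the residuals
logged = LinearRegression().fit(x, np.log(y))
print("log target R^2:", r2_score(np.log(y), logged.predict(x)))
```

Comparing the two scores shows how much a simple transform can matter when the data is skewed.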
Victor
@Evaderei
Feb 15 2017 08:30 UTC
anyone here learn data sci from scratch that "made it"?
it'd definitely get me super motivated
Alice Jiang
@becausealice2
Feb 15 2017 09:27 UTC
@Evaderei long time no see, friend! What would you consider having "made it" to be?
Victor
@Evaderei
Feb 15 2017 09:28 UTC
who're you?
made it == got data sci related job
Alice Jiang
@becausealice2
Feb 15 2017 09:29 UTC
Alice ;P it's been a while and I've had a within handle change. Think back to FCCs main channel with rphares and all them
Victor
@Evaderei
Feb 15 2017 09:30 UTC
within?
Alice Jiang
@becausealice2
Feb 15 2017 09:30 UTC
Not related to this, but for anyone interested: Spanner's now open to the public: https://www.wired.com/2017/02/spanner-google-database-harnessed-time-now-open-everyone/
Ah. Autocorrect still winning the war. Within should be with a
Oh I lied it should be github
3:30AM :sleepy:
evaristoc
@evaristoc
Feb 15 2017 13:37 UTC

@cuent it's always hard because it's a bit subjective. For clustering, when talking about "users" and other things that are hard to measure, keep a single thing in mind:

How much is this result applicable to my business case?

(actually that rule might apply to almost every case, really...)

The simplest way to decide the number of clusters for k-means is the elbow method. As in the reference, you can also add some statistics by applying an F-test. It is a bit subjective, and some level of arbitrariness might be required.
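
For illustration, a rough elbow-method sketch in Python with scikit-learn (make_blobs is just stand-in toy data, not from the reference):

```python
# Elbow method sketch: run k-means for a range of k and plot the
# within-cluster sum of squares (sklearn calls it inertia_); pick the
# k where the curve bends and stops improving much.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)  # toy data

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.show()
```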

To clean the data and select better candidates:

  • Check for correlations and keep only one of several correlated variables; there are a few heuristics you could follow (see the sketch after this list).
  • Check each variable's importance and whether it makes business sense.
  • Identify the flaws of the analysis based on the data available and report accordingly.
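
For the correlation bullet, one possible heuristic sketched in Python/pandas (the drop_correlated helper and the 0.9 cutoff are my own invention, purely illustrative):

```python
# Heuristic: drop one of every pair of columns whose absolute
# correlation exceeds a threshold (0.9 is an arbitrary cutoff;
# tune it to your data).
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    corr = df.corr().abs()
    # Upper triangle only, so each pair is considered exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)
```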

A classical way to visualize relations in this case is multidimensional scaling. Another is to use correlation matrices.
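
As a rough illustration of the MDS idea in Python/scikit-learn (the data frame here is a random stand-in for your variables):

```python
# MDS sketch: turn the correlation matrix into a dissimilarity matrix
# (1 - |corr|) and embed the variables in 2D, so strongly related
# variables land close together.
import numpy as np
import pandas as pd
from sklearn.manifold import MDS

df = pd.DataFrame(np.random.default_rng(0).normal(size=(200, 6)),
                  columns=list("ABCDEF"))  # stand-in data
dist = 1 - df.corr().abs()                 # correlation -> dissimilarity
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist.values)
for name, (cx, cy) in zip(dist.columns, coords):
    print(f"{name}: ({cx:.2f}, {cy:.2f})")
```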

PCA and other such techniques are accessible through R and Python. R packages will let you check the clusters too.

Let us know if this helps!

evaristoc
@evaristoc
Feb 15 2017 13:48 UTC

@cuent

I think you have already noticed that encoding, and particularly PCA as @erictleung is suggesting, is OK as long as losing information about the variable names isn't a problem for you?

Otherwise, if you need to report based on the names of the existing variables, PCA might not be the best option, even if it is better at reducing noise. Also check the PCA assumptions and see if they apply to your data; although robust, it is still a sensitive method. In any case, SVD is usually more general and powerful IMO.
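
To illustrate the naming point, a small Python sketch (random stand-in data; NumPy and sklearn are just my tool picks):

```python
# Why variable names get lost: each PCA component mixes all of the
# original columns, and the same decomposition falls out of an SVD
# of the centered data matrix.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

pca = PCA(n_components=2).fit(X)
print(pca.components_)  # rows are mixes of all 5 original variables

Xc = X - X.mean(axis=0)             # center, as PCA does internally
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
# Same components up to sign flips
print(np.allclose(np.abs(Vt[:2]), np.abs(pca.components_)))
```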

Amelia
@apottr
Feb 15 2017 15:29 UTC
thanks for the article @becausealice2
CamperBot
@camperbot
Feb 15 2017 15:29 UTC
apottr sends brownie points to @becausealice2 :sparkles: :thumbsup: :sparkles:
:warning: @becausealice2's account is not linked with freeCodeCamp. Please visit the settings and link your GitHub account.
Eric Leung
@erictleung
Feb 15 2017 16:13 UTC
@apottr @becausealice2 another article about Spanner straight from Google and link to their white paper if you're interested https://cloudplatform.googleblog.com/2017/02/inside-Cloud-Spanner-and-the-CAP-Theorem.html
Amelia
@apottr
Feb 15 2017 16:18 UTC
thanks @erictleung
CamperBot
@camperbot
Feb 15 2017 16:18 UTC
apottr sends brownie points to @erictleung :sparkles: :thumbsup: :sparkles:
:cookie: 456 | @erictleung | http://www.freecodecamp.com/erictleung
Shindeor
@shindeor
Feb 15 2017 17:18 UTC
Hello, could anyone tell me how I can get an API key for the FreeCodeCamp open-api? Thanks!
Hèlen Grives
@mesmoiron
Feb 15 2017 19:28 UTC
So far the data cleaning has slowed down. Sometimes I need three programs to process one file. Even regex has its quirks: when replacing one-or-more digits, comma, one-or-more digits with the dot equivalent, the program replaces the digits with the literal regex codes. I am grateful for all the open source contributions, but we are far from well-designed, working software. Outsourcing to low-paid countries becomes attractive. PDFs are also not created equal, so some output cannot be analysed or converted. Switching, choking and hacking my way slowly through the data pile.
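For what it's worth, in Python's re the replacement looks like this (a guess at the quirk, assuming the tool wasn't parsing backreferences in the replacement string):

```python
# Decimal-comma fix: \1 and \2 in the replacement are backreferences;
# if a tool inserts them literally, it is treating the replacement
# string as plain text rather than parsing it.
import re

text = "width 12,5 cm; height 3,75 cm"
fixed = re.sub(r"(\d+),(\d+)", r"\1.\2", text)
print(fixed)  # width 12.5 cm; height 3.75 cm
```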
Eric Leung
@erictleung
Feb 15 2017 21:09 UTC
@Psychosete here's a start on improving your math skills: probability distributions
Amelia
@apottr
Feb 15 2017 22:56 UTC
data cleaning with PDFs is my own personal hell
i have been doing this for several days with very little progress