These are chat archives for FreeCodeCamp/DataScience
discussion on how we can use statistical methods to measure and improve the efficacy of http://freeCodeCamp.com
@cuent it's always hard because it's a bit subjective. For clustering, when talking about "users" and other things that are hard to measure, keep a single question in mind:
How much is this result applicable to my business case?
(actually the rule might apply to almost every case, really...)
The simplest way to decide the number of clusters for k-means is the elbow method. As in the reference, you can also add some statistical rigor by applying an F-test. It is a bit subjective, and some level of arbitrariness might be required.
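A minimal sketch of the elbow method in Python, assuming scikit-learn is available and using a synthetic dataset (the real data and candidate range of k are up to you):

```python
# Sketch: elbow method for choosing k in k-means (synthetic data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated clusters
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Inertia (within-cluster sum of squares) for each candidate k
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# The "elbow" is where inertia stops dropping sharply; plotting
# inertias against k makes the bend visible, but judging where it
# sits remains somewhat subjective.
print(inertias)
```

You would usually plot `inertias` against `range(1, 9)` and pick the k at the bend.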
To clean the data and select better candidates:
A classical way to visualize relations in this case is multidimensional scaling. Another is to use correlation matrices.
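A small sketch of both ideas with scikit-learn and NumPy, on synthetic data (the correlated column is fabricated just to make the matrix show something):

```python
# Sketch: correlation matrix between variables + MDS embedding of
# observations, on synthetic data.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 1] += X[:, 0]  # make two variables deliberately correlated

# Correlation matrix between the 4 variables (columns)
corr = np.corrcoef(X, rowvar=False)

# MDS embeds the 50 observations in 2D while trying to preserve
# their pairwise distances, which helps spot cluster structure
embedding = MDS(n_components=2, random_state=0).fit_transform(X)
```

Plotting `embedding` as a scatter plot, or `corr` as a heatmap, is the usual next step.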
PCA and similar techniques are accessible through both R and Python. R packages will let you inspect the clusters too.
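For the Python side, a minimal PCA sketch with scikit-learn (the R equivalent would be something like `prcomp()`; data here is random just to show the calls):

```python
# Sketch: dimensionality reduction with PCA in scikit-learn.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))  # 100 observations, 5 variables

pca = PCA(n_components=2).fit(X)
reduced = pca.transform(X)  # shape (100, 2)

# Fraction of total variance captured by each retained component
explained = pca.explained_variance_ratio_
```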
Let us know if this helps!
I think you have already noticed that encoding, and particularly PCA as @erictleung is suggesting, is fine as long as losing the original variable names isn't a problem for you.
Otherwise, if you need to report results in terms of the existing variable names, PCA might not be the best option, even if it's better at reducing noise. Also check the PCA assumptions and see whether they hold for your data; although robust, it is still a sensitive method. In any case, SVD is usually more general and powerful IMO.
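To make the PCA/SVD connection concrete, here's a NumPy sketch (synthetic data again): PCA on column-centered data is just the SVD of that centered matrix, so SVD gives you the same scores plus the full decomposition to work with.

```python
# Sketch: PCA via SVD of the column-centered data matrix.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 6))
Xc = X - X.mean(axis=0)  # center each column, as PCA does

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Principal-component scores are U scaled by the singular values;
# the rows of Vt are the component loadings
scores = U * s

# Sanity check: the factors reconstruct the centered data exactly
recon = (U * s) @ Vt
```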
apottr sends brownie points to @becausealice2 :sparkles: :thumbsup: :sparkles:
apottr sends brownie points to @erictleung :sparkles: :thumbsup: :sparkles: