These are chat archives for FreeCodeCamp/DataScience
discussion on how we can use statistical methods to measure and improve the efficacy of http://freeCodeCamp.com
```python
import pandas as pd
import numpy as np
import statsmodels.api as sm

data = np.loadtxt('bostontrain.csv', delimiter=',')
X = data[:, 0:13]
Y = data[:, 13]
X = sm.add_constant(X)
est = sm.OLS(Y, X).fit()

test_data = np.loadtxt('bostontest.csv', delimiter=',')
test_data = sm.add_constant(test_data)  # the intercept column must be added to the test set too
test_predict = est.predict(test_data)
np.savetxt('anuraglahon.csv', test_predict, fmt='%1.5f')
```
Trying to improve the model I was working on for analysing text.
The previous method, counting words, worked fine for texts of about 300 words or less, but it didn't scale well to longer texts.
For your info, the method is not mine: it is based on concepts by Luhn.
Now I am combining the Luhn method with an entity-centric one.
I am also displaying in browser and terminal.
Still a work in progress and messy as always, but hopefully better than just reading through the full text.
sklearn.model_selection.train_test_split. All training and test sets are just sampling and splitting. It's been a while since I've done Python, but just choose how you wanna split all your data and select the rows based on the indices you've chosen.
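A minimal sketch of that sklearn call (the toy arrays here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 examples with 3 features each (values are arbitrary)
X = np.arange(30).reshape(10, 3)
y = np.arange(10)

# Hold out 20% of the rows for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 3) (2, 3)
```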
erictleung sends brownie points to @evaristoc :sparkles: :thumbsup: :sparkles:
@erictleung thanks man!
@GoldbergData probably adding something like this: https://stackoverflow.com/questions/28820551/interactive-selection-highlighting-of-text-inside-the-browser
I would introduce a word selection option (probably a tick option) to mark a fixed number of characters (before and after the selected keyword present in text) while making the rest of the text less visible (opacity).
evaristoc sends brownie points to @erictleung and @goldbergdata :sparkles: :thumbsup: :sparkles:
If I get fancy, I could include a choice to fill in a Google spreadsheet with sentences I am interested in.
Anyway... My main goal is to progress on this, so I won't work on any procedure that takes too much time. It's not really the core of the project right now.
```python
def error(x, y, initial_thetta):
    error_sum = 0
    m, n = x.shape
    for i in range(1, m):
        error_sum = (initial_thetta.T[i] * x[i] - y[i] ** 2)
    return (error_sum) / (2 * m)
```
```python
def error(x, y, initial_thetta):
    error_sum = 0
    m, n = x.shape
    for i in range(1, m):
        error_sum += (initial_thetta.T[i] * x[i] - y[i]) ** 2
    return (error_sum) / (2 * m)
```
Initial error is [1.73828125e-01 1.15221354e+01 7.75501562e+03 2.54704036e+03 3.29631510e+02 9.78108268e+03 5.30021348e+02 1.48438496e-01 6.07216146e+02]
@Sprinting interesting... I haven't competed that much to be honest. I would probably check the data though. It is a topic I used to analyse a lot before.
numpy? I think there are ways to skip the loop.
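For example, assuming `x` is an `(m, n)` design matrix, `theta` an `(n,)` parameter vector, and `y` an `(m,)` target vector (a sketch, not the poster's exact code), the squared-error sum can be written without the loop:

```python
import numpy as np

def error(x, y, theta):
    """Halved mean squared error, computed without an explicit loop."""
    m = x.shape[0]
    residuals = x @ theta - y          # vectorised predictions minus targets
    return (residuals ** 2).sum() / (2 * m)

# Small check on arbitrary toy values: y is chosen to match x @ theta exactly,
# so the error should be 0.0
x = np.array([[1.0, 2.0], [3.0, 4.0]])
theta = np.array([0.5, 0.5])
y = np.array([1.5, 3.5])
print(error(x, y, theta))  # 0.0
```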
Do you mean model value[i] - average value? (e.g. https://hlab.stanford.edu/brian/error_sum_of_squares.html, https://en.wikipedia.org/wiki/Partition_of_sums_of_squares)
You can also check the following: how far each observed point (y) is from its estimated point (f(x)), as a way to analyse the model's fitness.
It is expected that the more data points you add to the sum, the larger the error. However, the shape of your dataset and your procedure are not clear. I guess m is your number of "examples"?
Hope this helps.