These are chat archives for FreeCodeCamp/DataScience

1st Nov 2017
Eric Leung
@erictleung
Nov 01 2017 02:30
Splendid visual introduction and explanation of hierarchical modeling aka mixed effects modeling http://mfviz.com/hierarchical-models/
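If you want to play with the idea after reading it, here is a minimal sketch of a random-intercept (mixed effects) model in Python with statsmodels; the data and names below are synthetic, purely for illustration.

# Random-intercept (mixed effects / hierarchical) model on synthetic data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_groups, n_per_group = 5, 40
group = np.repeat(np.arange(n_groups), n_per_group)
group_intercept = rng.normal(0, 2, n_groups)[group]   # each group gets its own baseline
x = rng.normal(size=group.size)
y = 1.5 * x + group_intercept + rng.normal(size=group.size)

df = pd.DataFrame({"y": y, "x": x, "group": group})

# Fixed slope for x, random intercept per group
model = smf.mixedlm("y ~ x", data=df, groups=df["group"])
print(model.fit().summary())
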
evaristoc
@evaristoc
Nov 01 2017 09:14
@test12221 I thought from your question that you were more interested in building an app to collect the data! But @timjavins has already given you some help. I was still not able to help further. Thanks @timjavins!
CamperBot
@camperbot
Nov 01 2017 09:14
evaristoc sends brownie points to @test12221 and @timjavins :sparkles: :thumbsup: :sparkles:
:cookie: 135 | @timjavins |http://www.freecodecamp.com/timjavins
api offline
evaristoc
@evaristoc
Nov 01 2017 11:02

People:

Just to let you know we have been working on this:

https://github.com/freeCodeCamp/open-data

The repository hasn't been made public yet, but it is in fact a repository that hasn't been worked on since its creation two years ago - I am just trying to reactivate it.

As you will see, it is a collection of datasets regarding fCC data, from small to big. That includes the recently posted Facebook dataset (@mcbarlowe and @dmesquita), as well as the "big dump", the surveys and some other projects.

Some of them are just for reference (e.g. the leaderboard or the bots), so anyone can have a look at how to do that. They are included because they deal with fCC data.

Most importantly, we are not publishing only datasets but also scripts and code for scrapers, data collection, data handling, descriptive statistics and so on.

This is an invitation to explore fCC data and leave your analyses available to other users. If you decide to do so and we think you have done a nice job, we will be interested in linking to your work or Github repository.

The fCC open-data repository is frequently found through organic search, so it might give more visibility to your work.

There are already more datasets and analyses, and we hope there will be even more in the future.

Do you want to start a new project with new or different data? Let us know and we will see what we can do.

So far I have been very active with the Gitter data, having worked with it for about two years already. I will make my work available over the next few days, hoping it will be useful to you.

What about your work? We hope to be able to include it too!

evaristoc
@evaristoc
Nov 01 2017 16:20

This specialization has been SO good! I have already finished the first three courses, so two to go. But I will be repeating them, probably a few more times, to learn all the many concepts and techniques. I already understand a lot, but I really need to revisit the content to consolidate the learning.
Alice Jiang
@becausealice2
Nov 01 2017 20:27
coursera spam D:
evaristoc
@evaristoc
Nov 01 2017 21:22
@becausealice2 yes it is... sorry guys!!!!!
I will delete those that can't be opened.
Joshua
@BananaCire
Nov 01 2017 21:30
Hi everyone. I'm taking a data structures and analytics course and need help with a project.
Matthew Barlowe
@mcbarlowe
Nov 01 2017 21:32
What help do you need?
Joshua
@BananaCire
Nov 01 2017 21:32
Alright, so I'm making an infix expression evaluator.
Alice Jiang
@becausealice2
Nov 01 2017 21:32
@evaristoc I'm just giving you a hard time ;)
Joshua
@BananaCire
Nov 01 2017 21:33
I'm using two stacks
but my operand stack doesn't seem to read more than one item

import java.util.Scanner;
import java.util.Stack;
import java.util.StringTokenizer;

public class InfixEvaluator { // class name assumed; the original snippet omitted the class declaration and imports

public static void main(String[] args) {

    String expression;
    int value1;
    int value2;
    int result;
    String optr;

    Scanner stdin = new Scanner(System.in);

    Stack<String> operatorStack = new Stack<String>();
    Stack<Integer> operandStack = new Stack<Integer>();

    System.out.print("Type an infix expression: ");
    expression = stdin.nextLine();
    StringTokenizer st = new StringTokenizer(expression,"()*/+-",true);  
    while (st.hasMoreTokens()) {
        String token = st.nextToken();
        if(token.equals("(")) {
            operatorStack.push(token);
        } else if(token.matches("[0-9]+")) {
            operandStack.push(Integer.parseInt(token));
        } else if(token.equals(")")) {
            while(operatorStack.peek() != "(") {
                value1 = operandStack.pop();
                value2 = operandStack.pop();
                optr = operatorStack.pop();
                result = calculate(value1, value2, optr);
                operandStack.push(result);
            }
        } else if(token.equals("*") || token.equals("/")
                || token.equals("+") || token.equals("-")) {
            while(!operatorStack.isEmpty() && precedence(token) <= precedence(operatorStack.peek())) {
                value1 = operandStack.pop();
                value2 = operandStack.pop();
                optr = operatorStack.pop();
                result = calculate(value1, value2, optr);
                operandStack.push(result);
            }
            operatorStack.push(token);
            while(!operatorStack.isEmpty()) {
                value1 = operandStack.pop();
                value2 = operandStack.pop();
                optr = (String) operatorStack.pop();
                result = calculate(value1, value2, optr);
                operandStack.push(result);
            }
            result = (int) operandStack.peek();
            System.out.println("Result = " + result);

        }

    }

}

private static int calculate(int v1, int v2, String opr) {
    int result = 0;

    if(opr.equals("*")) {
        result = v1 * v2;
    } else if(opr.equals("/")) {
        result = v1 / v2;
    } else if(opr.equals("+")) {
        result = v1 + v2;
    } else if(opr.equals("-")) {
        result = v1 - v2;
    }
    return result;

}

private static int precedence(String opr) {
    int temp = 0;
    if(opr.equals("*") || opr.equals("/")) {
        temp = 1;
    } else if(opr.equals("+") || opr.equals("-")) {
        temp = 2;
    } else if(opr.equals("(") || opr.equals(")")) {
        temp = 3;
    }
    return temp;
}

}

Alice Jiang
@becausealice2
Nov 01 2017 21:35
Some good news: I've been confirmed as a presenter at a conference next weekend. I'm still waiting to hear whether my second topic was accepted, so I might not be making a technical presentation, which I'm genuinely mad about, but whatever. I'm speaking at a conference!
Joshua
@BananaCire
Nov 01 2017 21:35
Congrats Alice!
@becausealice2
Alice Jiang
@becausealice2
Nov 01 2017 21:36
Thanks :D
evaristoc
@evaristoc
Nov 01 2017 21:36
@becausealice2 great! which topic?
Alice Jiang
@becausealice2
Nov 01 2017 21:36
and I'm not skilled enough to help with your issue, so good luck!
proactive community inclusion
I was hoping they'd favor my presentation on D3 since that's more career-relevant but whatever
evaristoc
@evaristoc
Nov 01 2017 21:46

@BananaCire I more or less see what your purpose is. However, I cannot really correct it because it is in C, right?

Anyway: I think you should start by checking your (else) if statements. Can you print the output for each test and see whether execution is entering them as expected?

If you think this help is not enough, maybe you want to post the same problem in the forum?

Joshua
@BananaCire
Nov 01 2017 21:48
java
thanks. I'll start with what you said
This is really my first time asking for help online.
What is a good forum?
Tim Blazina
@tblazina
Nov 01 2017 21:58
Hi all! I have a fairly open question about doing regression on discrete values. Say I am trying to predict the consumption over time of some product that is sold in units, so 0, 1, 2, 3, etc. My original inclination was to treat this as a sort of time-series regression problem; however, doing it that way I ultimately end up predicting continuous values as outputs, which then introduces decisions about rounding that don't make sense (predicting sales of 1.4 units is nonsensical in my case). However, I'm having trouble wrapping my head around how to frame this as a classification problem, because if, for example, a training data point is 3, I intuitively think that a model predicting 4 is a much better outcome than one predicting 15, but I am fairly unfamiliar with classification problems in general. Anyone have any experience with this?
evaristoc
@evaristoc
Nov 01 2017 22:04

@BananaCire
I was proposing to try https://forum.freecodecamp.org/ to see if someone comes up with an answer, but there are several other places to ask.

Even if you can't find an answer to your question in the freeCodeCamp forum, I am sure you will find pointers there to other places where you can look (i.e. other sites).

Joshua
@BananaCire
Nov 01 2017 22:11
@evaristoc thank you
CamperBot
@camperbot
Nov 01 2017 22:11
bananacire sends brownie points to @evaristoc :sparkles: :thumbsup: :sparkles:
:cookie: 377 | @evaristoc |http://www.freecodecamp.com/evaristoc
Matthew Barlowe
@mcbarlowe
Nov 01 2017 22:13
If you haven't tried Stack Overflow, I would head there as well.
Joshua
@BananaCire
Nov 01 2017 22:14
oh yeah thanks @mcbarlowe
CamperBot
@camperbot
Nov 01 2017 22:14
bananacire sends brownie points to @mcbarlowe :sparkles: :thumbsup: :sparkles:
:cookie: 132 | @mcbarlowe |http://www.freecodecamp.com/mcbarlowe
evaristoc
@evaristoc
Nov 01 2017 22:34

@tblazina

First, out of curiosity, what are the explanatory variables that your model is considering?

I wouldn't call it a classification problem if your purpose is prediction. However, I think you can try some techniques used in classification for this problem.

In the past I explored random forests for predictive purposes, and on that occasion I got a moderate level of accuracy (just over 50%). Be aware that ML algorithms like these have some limitations.

Here is a reference about using things like Decision Trees for regression; one of the tricks is changing the metric:

https://chrisalbon.com/machine-learning/decision_tree_regression.html

You will likely need to train the model yourself, so you must know why and how.

However, if I were you, I would try this first:

http://www.theanalysisfactor.com/regression-models-for-count-data/

There are variants of that model that allow for different kinds of data types.

Yours is a common problem, although approaches might differ depending on the data. There are other methods too; I would invite you to find more and share them here.
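
For illustration, here is a minimal sketch in Python with statsmodels of the count-data (Poisson) regression idea from that last link; the file and column names (sales.csv, units_sold, past_sales, product_category) are invented.

# Poisson GLM for count data: predicted values are expected counts (>= 0),
# so there is no need to round a continuous prediction to whole units.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("sales.csv")  # hypothetical file with an integer 'units_sold' column

model = smf.glm("units_sold ~ past_sales + C(product_category)",
                data=df, family=sm.families.Poisson())
result = model.fit()
print(result.summary())

# If the counts are overdispersed (variance much larger than the mean),
# a Negative Binomial family is a common alternative:
# smf.glm("units_sold ~ past_sales", data=df, family=sm.families.NegativeBinomial())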

Tim Blazina
@tblazina
Nov 01 2017 22:40
@evaristoc thanks a lot! I'll have a look at these and will post any other info I find here. For your information, we're dealing with sales of products, so we're using exponentially weighted moving averages of previous sales and some one-hot-encoded relevant categorical variables. Even treating it as a continuous regression problem we can get OK accuracy, but I think we can improve greatly; I just wasn't sure how to frame these problems.
CamperBot
@camperbot
Nov 01 2017 22:40
tblazina sends brownie points to @evaristoc :sparkles: :thumbsup: :sparkles:
:cookie: 378 | @evaristoc |http://www.freecodecamp.com/evaristoc
evaristoc
@evaristoc
Nov 01 2017 22:48
@tblazina :) Good luck! Linear regression is indeed a good approximation that could be enough for your business goals. Time series? Use it if you think you can identify sensible patterns that would be better to control (considering your data, you surely have some...).
evaristoc
@evaristoc
Nov 01 2017 23:25

Making up for the Coursera spam...

1) Using the cost function as an evaluation metric: we use the F1 score as an evaluation metric to compare the performance of different classifiers. Why can't we just use the cost function (e.g. cross entropy) to compare two classifiers?
A good answer:
The F1 score is often used for anomaly detection because it measures more than just prediction accuracy. If a data set has 99% false examples, then always predicting false will give 99% accuracy, but it won't detect any anomalies at all (see the short sketch at the end of this list).
Cost is closely related to prediction accuracy, so it has the same drawbacks if the data set is highly skewed.
But the F1 score can only be used for classification systems. If you're doing a linear prediction, then the cost value J is about the only tool you have.

2) Can you explain the meaning of "cross entropy" in simple terms? I have seen the term used in the context of describing the loss function.
A good answer: https://math.stackexchange.com/questions/1074276/how-is-logistic-loss-and-cross-entropy-related

3) L2 regularization cost and dataset size. In this week's videos, Andrew explains that the L2 cost is computed by taking the sum of the squares of the weights and then multiplying by Lambda/(2*m), where m is the number of training examples in the dataset. Could anyone give an intuition for why the regularization cost needs to be scaled down by m, especially since the dimensions of the weight matrices do not depend on m?
One of the answers (not the best one; the best included some math and LaTeX...):
The intuitive reason why we want to "balance" the two cost components is that when we have less data, we can't be sure that the optimal weights we find in the sample available to us are representative of what the optimal weights would be in an idealized perfect model that had access to all the data that would ever matter for understanding the phenomenon you are looking at. However, as you acquire more data, you are less and less worried about keeping this "balance", because the sample you are using to train becomes more and more representative of the "world" you need to model. In the hypothetical limit, as m goes to infinity, you'd need no "balance" because overfitting doesn't really come into it anymore; at that point you'd hypothetically be training on all the information you could ever use to improve the representativeness of the parameters in modelling the underlying phenomenology you were acquiring data for. I am not sure which part in particular you want to formalize precisely, but you could start by looking up material on ridge regression, which is the stats name for L2 regularization, like Jason mentioned above (you may also want to look into LASSO regression while you are at it). Maybe you'll find some theorems that will answer your questions (do share if you find something interesting!).

4) How do you deal with the imbalance problem in a NN?
A good answer:
I found a great article on how to deal with imbalanced datasets in general: https://www.analyticsvidhya.com/blog/2017/03/imbalanced-classification-problem/

5) Why normalize with variance but not with standard deviation?
Answer:
(I leave this one to you to find out ;) )
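
Regarding 1): a quick way to see the accuracy-vs-F1 point, sketched in Python with scikit-learn; the labels below are synthetic, purely for illustration.

# A classifier that always predicts "false" on a 99%-false data set:
# accuracy looks great, but the F1 score for the rare class is 0.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1] * 10 + [0] * 990   # 1% anomalies, 99% normal cases
y_pred = [0] * 1000             # always predict "not an anomaly"

print(accuracy_score(y_true, y_pred))             # 0.99
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0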