These are chat archives for FreeCodeCamp/DataScience

17th
Nov 2017
Eric Leung
@erictleung
Nov 17 2017 02:37
Looks like they're dropping Python 2.7 support in numpy soon https://github.com/numpy/numpy/blob/master/doc/neps/dropping-python2.7-proposal.rst
Matthew Barlowe
@mcbarlowe
Nov 17 2017 02:59
and pandas as well I believe
Himanshu Chawla
@hchawla437
Nov 17 2017 05:21
@evaristoc I really appreciate your help, but I still didn't get one point: how will I use bagging to validate my full training distribution? In my mind I am thinking that I will use bagging to average the precision on the initial labeled data (2k out of 70k) in training. Any thoughts on it?
evaristoc
@evaristoc
Nov 17 2017 11:09

@hchawla437
Yes, you are correct - whatever technique you apply (bagging is one possibility), you should do it on your 2K labeled data.

Again, my suggestion would be to keep it simple at the beginning. For example, I have a similar issue to yours, and so far I have practiced a simple embedded model without bagging. Why? Although I am interested in the accuracy of the model, high accuracy is not what I am looking for. Variance is one thing that might change. Why? Because the dataset might not be representative, so why bother so much about accuracy now (IMO)?

Let me explain my last point a bit more. The more new examples you add to the training dataset, the more representative of the population it becomes, and therefore the more stable the classification (less variance). But that might not be the case yet. If 2k is not representative enough, the composition of the training dataset might change as new examples are added, and therefore the distribution of the data will also change. Thus, it is likely that the model will change too.

Furthermore, whatever technique you apply and whatever accuracy you get, there is something that won't be that different between approaches: you still have to check manually that the classes were correctly assigned at every iteration. HOWEVER... it is expected that you will check fewer examples at each iteration, because your model gets better at classifying the classes (as in my case).
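That iterative loop could be sketched roughly like this (a hypothetical illustration with synthetic data, not the actual setup — the model, sizes, and names are placeholders): fit on the labeled pool, score the unlabeled pool, and queue only the least confident predictions for manual checking.

```python
# Sketch of the iterate-and-manually-check loop described above.
# All data and names here are synthetic/illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X_labeled = rng.normal(size=(200, 5))
y_labeled = (X_labeled[:, 0] > 0).astype(int)  # stand-in labels
X_unlabeled = rng.normal(size=(1000, 5))

clf = LogisticRegression().fit(X_labeled, y_labeled)
proba = clf.predict_proba(X_unlabeled)

# Confidence = probability of the predicted class; low confidence -> review
confidence = proba.max(axis=1)
to_review = np.argsort(confidence)[:50]  # the 50 least confident examples
print(len(to_review), "examples queued for manual checking")
```

As the model improves between iterations, the confidence threshold (or review budget) can shrink, which is why fewer examples need checking each round.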

In my case I have a multiclass problem with unbalanced classes, so it is a bit harder than yours, I think. I go through precision and recall estimates for each class to get a better understanding of how my model is performing in each case, and then try to find better features. Yours is a binary problem (alert/not alert), and I guess it is balanced? I suspect it is easier. I think you can even start by trying a Naïve Bayes classifier as part of your model selection.
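For example, a Naïve Bayes baseline plus per-class precision/recall might look like this (a sketch on synthetic stand-in data; `GaussianNB` is just one possible Naïve Bayes variant, chosen here because the fake features are continuous):

```python
# Sketch: Naive Bayes baseline with per-class precision/recall.
# X and y stand in for the ~2k labeled examples; here they are synthetic.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # stand-in binary labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = GaussianNB().fit(X_tr, y_tr)

# Precision and recall broken down by class, as discussed above
print(classification_report(y_te, clf.predict(X_te)))
```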

evaristoc
@evaristoc
Nov 17 2017 11:14
@hchawla437
Keep in mind that you will have to revise your model as soon as you get more labeled examples. The more labeled examples, the better. Then you can start experimenting with more advanced models.
evaristoc
@evaristoc
Nov 17 2017 11:28

@hchawla437
To tell you more: in my case I found that the data distribution changed between iterations, so I decided to be happy with a model with an overall F1 score of around 55%-60% as a target and put more emphasis on understanding the data.

I found that the iteration process was a good opportunity for that.

It could be that in your case, those 2K are already representative. You might find out after 2-3 iterations that you have indeed enough data. If that is your case, then it is time to try a fancier model if you think applicable, include more new examples at each iteration and yes: put more emphasis on improving your accuracy estimates.

Matthew Barlowe
@mcbarlowe
Nov 17 2017 13:23
è
Himanshu Chawla
@hchawla437
Nov 17 2017 13:28
@evaristoc In my case I too have unbalanced classes, and I have to classify alerts as positive alerts and negative alerts rather than alerts or not. You can think of it as high-priority alerts or not. Also, to label my unlabelled data in training, I am thinking of using bagging (bag on the 2k labelled from training), then giving scores to the 68k using the samples from bagging, and taking the average score across all the bagging samples to label them. What do you think, is it a good approach, or did you try something else in your case?
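The bag-then-average idea could be sketched like this (purely illustrative: the data is synthetic, the base classifier is a placeholder, and `n_bags` and the 0.5 threshold are arbitrary choices):

```python
# Sketch of the described approach: bootstrap the 2k labeled examples,
# fit one classifier per bag, score the 68k unlabeled rows with each,
# and average the scores to assign pseudo-labels. All data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(2000, 8))
y_lab = (X_lab[:, 0] > 0).astype(int)  # stand-in labels
X_unlab = rng.normal(size=(68000, 8))

n_bags = 10
scores = np.zeros(len(X_unlab))
for _ in range(n_bags):
    idx = rng.integers(0, len(X_lab), size=len(X_lab))  # bootstrap sample
    clf = LogisticRegression().fit(X_lab[idx], y_lab[idx])
    scores += clf.predict_proba(X_unlab)[:, 1]
scores /= n_bags  # average positive-class score across bags

pseudo_labels = (scores > 0.5).astype(int)
```

Keeping the raw averaged scores around (not just the thresholded labels) would also let you pick only high-confidence examples to add back to training.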
evaristoc
@evaristoc
Nov 17 2017 20:28

@hchawla437

I have not seen your data, but my first impression is that you might be too ambitious about what the 2000 labeled examples can do for you. Again, I haven't seen your data, but it might be that 2000 training examples is a very small sample. In the end it depends on the level of accuracy that would satisfy a possible client/tutor.

I would encourage you, though, to try your idea and see how it works. Good luck!

Alice Jiang
@becausealice2
Nov 17 2017 22:07
Anyone here got tips on how to make a sad LinkedIn not look quite so sad?
Eric Leung
@erictleung
Nov 17 2017 22:20
@becausealice2 here's some advice on writing a LinkedIn summary http://www.newstoliveby.net/2014/07/17/how-to-write-linkedin-summary/ Hope that helps a bit.
Alice Jiang
@becausealice2
Nov 17 2017 22:21
It's a good start! Thanks :)
Eric Leung
@erictleung
Nov 17 2017 22:21
@becausealice2 try to get involved with projects and post interesting articles. There's also this bit of advice on how to increase your visibility in jobs without being inaccurate. (Hint: it has to do with changing your current job description.)
Alice Jiang
@becausealice2
Nov 17 2017 22:22
Projects I got a-plenty. Just need to actually add them to my profile :/
Alice Jiang
@becausealice2
Nov 17 2017 22:23
It's all the words, and being attractive to as many potential employers as possible without actually actively participating, that I struggle with
Alice Jiang
@becausealice2
Nov 17 2017 22:31
so, before I go in and actually change anything: I hadn't logged in to LinkedIn for almost a year, and my summary currently is: "As a data scientist, I can confirm that size does matter."
I don't understand why no one thinks I'm a competent professional :joy:
Eric Leung
@erictleung
Nov 17 2017 22:37
@becausealice2 it's always nice to have coworkers with a sense of humor :laughing:
Matthew Barlowe
@mcbarlowe
Nov 17 2017 23:07
While we're talking about LinkedIn: if you give me endorsements I'll do a little quid pro quo