These are chat archives for FreeCodeCamp/DataScience
discussion on how we can use statistical methods to measure and improve the efficacy of http://freeCodeCamp.com
Yes, you are correct - whatever technique you apply (bagging is one possibility), you should do it on your 2K labeled data.
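If you do go the bagging route, a minimal sketch with scikit-learn might look like this - note that `X` and `y` are simulated stand-ins for your 2K labeled examples, and the base learner and `n_estimators` are arbitrary choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Simulated stand-in for the 2K labeled examples
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

bagged = BaggingClassifier(
    DecisionTreeClassifier(),  # base learner; swap in whatever you prefer
    n_estimators=50,           # number of bootstrap-resampled models
    random_state=0,
)

# Evaluate on the labeled data only, via cross-validation
scores = cross_val_score(bagged, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```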
Again, my suggestion would be to keep it simple at the beginning. For example, I have a similar issue to yours, and so far I have used a simple embedded model without bagging. Why? Although I am interested in the accuracy of the model, high accuracy is not what I am looking for. Variance is one thing that might change. Why? Because the dataset might not be representative, so why bother so much about accuracy now (IMO)?
Let me explain my last point a bit more. The more new examples you add to the training dataset, the more representative of the population it becomes, and therefore the more stable the classification (less variance). But that might not be the case yet. If 2K is not representative enough, the composition of the training dataset might change with newly added examples, and therefore the distribution of the data will change too. Thus, it is likely that the model will change as well.
Furthermore, whatever technique you apply and whatever accuracy you get, there is something that won't be that different between approaches - you still have to check manually that the classes were correctly assigned at every iteration. HOWEVER... it is expected that you will check fewer examples at each iteration because your model gets better at classifying (as in my case).
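To make that concrete, here is a rough, hypothetical sketch of the label-check-retrain loop in scikit-learn. The data is simulated and the 0.9 confidence cutoff is an arbitrary assumption, so treat it as an illustration, not your actual pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Simulated pool: pretend the first 2K rows are the hand-labeled ones
X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[:2000] = True

model = LogisticRegression(max_iter=1000)
for it in range(3):
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[~labeled])
    confidence = proba.max(axis=1)
    needs_review = confidence < 0.9  # low-confidence predictions -> manual check
    print(f"iteration {it}: {needs_review.sum()} of {len(confidence)} predictions need a manual look")
    # Accept the confident pseudo-labels; here we simply keep the true labels,
    # but in practice you would store the (reviewed) predicted labels instead
    idx = np.where(~labeled)[0]
    labeled[idx[~needs_review]] = True
```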
In my case I have a multiclass problem with unbalanced classes, so it is a bit harder than yours, I think. I go through precision and recall estimates for each class to get a better understanding of how my model performs in each case, and then try to find better feature engineering. Yours is a binary problem (alert/not alert), and I guess it is balanced? I suspect it is easier. I think you can even start by trying a Naïve Bayes classifier as part of your model selection.
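For example, a minimal sketch of that kind of baseline with scikit-learn - the data here is simulated as a stand-in for a roughly balanced binary alert/not-alert set, and GaussianNB is just one of several Naïve Bayes variants you could try:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Simulated, roughly balanced binary stand-in for alert/not-alert
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.5, 0.5], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
# classification_report prints precision, recall and F1 per class
print(classification_report(y_test, nb.predict(X_test), target_names=["not alert", "alert"]))
```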
I'll tell you more: in my case I found that the data distribution changed between iterations, so I decided to settle for a model with an overall F1 score of around 55-60% as a target and put more emphasis on understanding the data.
I found that the iteration process was a good opportunity for that.
It could be that in your case those 2K are already representative. You might find out after 2-3 iterations that you indeed have enough data. If that is the case, then it is time to try a fancier model if you think it applicable, include more new examples at each iteration, and yes: put more emphasis on improving your accuracy estimates.
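One quick way to sanity-check whether 2K is enough is a learning curve: if the cross-validated score has flattened well before you reach the full 2000 examples, more data mostly buys stability. A hedged sketch, again on simulated data and with an arbitrary choice of classifier:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Simulated stand-in for the 2K labeled examples
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)
for n, s in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} training examples -> mean CV score {s:.3f}")
```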
I have not seen your data, but my first impression is that you might be too ambitious about what 2000 labeled examples can do for you. Again, I haven't seen your data, but it might be that 2000 training examples is a very small sample. And in the end it depends on the level of accuracy that would satisfy a possible client/tutor.
I would encourage you, though, to try your idea and see how it works. Good luck!