These are chat archives for FreeCodeCamp/DataScience

16th
Nov 2017
Rajath
@rajathrao
Nov 16 2017 07:19
@Rhistina Thanks for the reply. Well, my goal was to start with, say, some exploratory analysis to find relationships between variables, then create some visualizations, etc. Since I was interested in that, I started with SQL, Tableau, R and some Python. My goal is also to get a job as a data scientist / data analyst, but the job market asks for a horde of other things, especially ETL tools like Informatica, SSIS, etc., and data warehousing is a different ball game. So I wanted to hear from anyone who has got a job as a data analyst or data scientist what to focus on.
CamperBot
@camperbot
Nov 16 2017 07:19
rajathrao sends brownie points to @rhistina :sparkles: :thumbsup: :sparkles:
:cookie: 251 | @rhistina |http://www.freecodecamp.org/rhistina
Alice Jiang
@becausealice2
Nov 16 2017 08:05
Data science is a team sport. What we recommend you focus on depends on what role you want to fill. Don't worry too much about job postings. They're more guidelines than actual rules.
Rajath
@rajathrao
Nov 16 2017 08:07
@becausealice2 cool sounds great!
Himanshu Chawla
@hchawla437
Nov 16 2017 12:02
I have 70k records in my training dataset and only 2k of them are labelled. I want to try supervised learning on it and want to understand how I can use bagging to validate performance on the full distribution of the training set. Any help will be much appreciated.
evaristoc
@evaristoc
Nov 16 2017 14:10
@hchawla437 I can try to help but I need more details. First, what kind of data do you have?
Himanshu Chawla
@hchawla437
Nov 16 2017 17:39
@evaristoc I have email alerts data and out of 100k only 2k are labeled as positive and negative
evaristoc
@evaristoc
Nov 16 2017 17:52

@hchawla437

email alert data

What does the message look like? Can you show an example? Is it just that (alerts)? It seems to be a relatively easy problem, although I need to know more.

evaristoc
@evaristoc
Nov 16 2017 18:02
I think I know what they might look like, but can you show one positive and one negative?
Please fragment them so it is not a big message in the post - it is only to have a quick look at it...
Himanshu Chawla
@hchawla437
Nov 16 2017 18:58
@evaristoc data is not the constraint here. The problem is how you would use bagging when your training data is not fully labelled.
evaristoc
@evaristoc
Nov 16 2017 19:29

@hchawla437
I don't understand why you are saying data is not a constraint. You are in fact asking to solve a problem where data is a constraint.

OK. My purpose was to find out whether you were indeed dealing with a relatively simple problem. I will assume I know the sort of messages (positive and negative) you have.

One usual approach is to use semi-supervised learning. That means that you might have some additional work to do, I am afraid.

Among these techniques there is one that I have seen called "self-training" in some reference books: an iterative process of labeling data and adding it to your existing labeled dataset in order to label more data. The thing is, you or a group of you should still check that the labeling was correct.
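
A rough sketch of what such a self-training loop could look like, in plain Python. Everything here is an assumption for illustration: the data is synthetic 2-D points rather than real email alerts, the classifier is a toy nearest-centroid model, and the names `self_train`, `threshold` and `max_rounds` are made up. It only shows the iterate, pseudo-label, and grow-the-labeled-set idea.

```python
import random

def centroids(X, y):
    """Mean point per class label."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        s = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            s[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def predict_with_confidence(cents, x):
    """Return (label, confidence) for a two-class problem.

    Confidence is a distance margin in [0.5, 1]: 1 means x sits on its
    nearest centroid, 0.5 means it is equally far from both.
    """
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5, label)
        for label, c in cents.items()
    )
    (d1, label), (d2, _) = dists[0], dists[1]
    conf = 1.0 - d1 / (d1 + d2) if d1 + d2 > 0 else 0.5
    return label, conf

def self_train(X_lab, y_lab, X_unlab, threshold=0.8, max_rounds=10):
    """Repeatedly pseudo-label confident points and add them to the labeled set."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        cents = centroids(X_lab, y_lab)
        confident, rest = [], []
        for x in pool:
            label, conf = predict_with_confidence(cents, x)
            (confident if conf >= threshold else rest).append((x, label))
        if not confident:
            break  # nothing left that the model is sure about
        # In practice a person should spot-check these pseudo-labels.
        for x, label in confident:
            X_lab.append(x)
            y_lab.append(label)
        pool = [x for x, _ in rest]
    return centroids(X_lab, y_lab), X_lab, y_lab, pool

# Synthetic stand-in data: two well-separated blobs.
rng = random.Random(0)

def blob(cx, cy, n):
    return [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)) for _ in range(n)]

X_lab = blob(-2, -2, 10) + blob(2, 2, 10)
y_lab = ["negative"] * 10 + ["positive"] * 10
X_unlab = blob(-2, -2, 200) + blob(2, 2, 200)

model, X_all, y_all, leftover = self_train(X_lab, y_lab, X_unlab)
print(len(X_all), "labeled after self-training;", len(leftover), "left unlabeled")
```

The points that stay in `leftover` are the ones the model was never confident about, which are good candidates for manual labeling.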

If you follow that approach, I would suggest keeping it simple at the beginning. If bagging, I would suggest using not too many models, and those should be relatively simple. You should expect poor performance at the beginning. Why? You still have to train your classifier on the initial small dataset, so don't get fancy.
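
To make "not too many, relatively simple models" concrete, here is a minimal bagging sketch in plain Python. It is an illustration under assumptions, not a recommendation of a specific library: the base model is a one-split decision stump, the eight data points are invented, and the names `fit_stump`, `fit_bagging` and `n_models` are hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Pick the single feature/threshold split with the fewest training errors."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            left = [lab for x, lab in zip(X, y) if x[f] <= t]
            right = [lab for x, lab in zip(X, y) if x[f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            errors = sum(lab != l_lab for lab in left)
            errors += sum(lab != r_lab for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:]  # (feature index, threshold, left label, right label)

def stump_predict(stump, x):
    f, t, l_lab, r_lab = stump
    return l_lab if x[f] <= t else r_lab

def fit_bagging(X, y, n_models=5, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Majority vote over the individual stumps."""
    votes = Counter(stump_predict(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Tiny invented labeled set: feature 0 separates the two classes.
X = [(0.1, 1.0), (0.3, 0.8), (0.2, 1.2), (0.4, 0.9),
     (2.1, 1.1), (2.4, 0.7), (1.9, 1.0), (2.2, 1.3)]
y = ["negative"] * 4 + ["positive"] * 4

models = fit_bagging(X, y, n_models=5)
print(bagging_predict(models, (3.0, 1.0)))
```

An ensemble like this could serve as the classifier inside a self-training loop; starting with a handful of stumps keeps each round cheap while the labeled set is still small.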

There are other techniques applicable to these situations and they might depend on the kind of data you have. In particular, I have found that the "self-training" technique let me focus better on feature engineering during the process. Having a small dataset could help you find the relevant features that might distinguish your model in further steps. However, keep an open mind and be ready for changes - having small data means that you might not have a representative set, and therefore the model might seriously underfit once a larger number of examples is added.

Let me know if this helps.