These are chat archives for FreeCodeCamp/DataScience

31st
Mar 2018
evaristoc
@evaristoc
Mar 31 2018 09:15
@becausealice2 0o0 Niceeee!!!
@SauravDeb Relation Extraction is a topic in Text Mining, so I assume that is what you are talking about. I don't know exactly what problem you are facing, but I suspect Naive Bayes might not help: you are not after a classification. You might be after pattern discovery. For a quick check I would suggest trying unsupervised methods for pattern discovery first. That is the simplest option; to get more out of it, you might need a bit of NLP knowledge. For a more advanced implementation you might need to go the NN route, word2vec being a possible candidate (again, not sure about your problem), or even deep learning, but you need a lot of (annotated) data for that to work correctly.
Saurav Deb
@SauravDeb
Mar 31 2018 10:33
@erictleung Thanks for the reply. I'm very confused about how to approach my problem, so it's actually okay if my question confused you. @evaristoc Thanks to you too, and yes, your assumption was spot on. I will certainly go on and try unsupervised methods, but for the time being can you give me any advice on how to proceed with supervised learning methods? Also, if you could give me a head start on how to proceed with unsupervised methods in R, I'd be really grateful.
evaristoc
@evaristoc
Mar 31 2018 11:50

@SauravDeb

I have done a few projects involving Relation Extraction myself, nothing fancy but enough for what they were meant to do. A simple Apriori algorithm was OK for what I did. If you are on a more advanced assignment, you are probably expected to work with graphs and ontologies, so you already have a classified set of rules to follow and can assign terms to objects in the text.
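Just to illustrate the Apriori idea (a hedged sketch in Python with mlxtend and pandas, my own choice of libraries, not the code I actually used): treat each sentence as a "transaction" of terms and mine term combinations that co-occur often.

```python
# Hedged sketch: frequent term co-occurrence with Apriori (mlxtend), not my original code.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Each "transaction" is the set of tokens from one sentence (toy data).
sentences = [
    ["aspirin", "treats", "headache"],
    ["aspirin", "reduces", "fever"],
    ["ibuprofen", "treats", "headache"],
    ["ibuprofen", "reduces", "fever"],
]

te = TransactionEncoder()
onehot = te.fit(sentences).transform(sentences)             # boolean matrix, one column per term
df = pd.DataFrame(onehot, columns=te.columns_)

itemsets = apriori(df, min_support=0.5, use_colnames=True)  # frequent term sets
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```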

Sorry, R is not my main tool, but there are people here who might help. If you are insisting on using supervised methods, is it because you have some data? If you describe your approach and what your train/test dataset is about, there are people here who could help, me included. Please share a link to your code and datasets (and try to keep this chat thread as clean as possible, please).

@erictleung already mentioned the applicability of some methods. You might have to compare several of them. NB tends to be barely effective with unbalanced classes, which is usually the case for any text classification where some entities are under-represented.

I searched quickly and found an article that might interest you:
https://www.microsoft.com/developerblog/2016/09/13/training-a-classifier-for-relation-extraction-from-medical-literature/

(I REALLY REALLY LOVE doing this, but many people come here with questions that are already answered on the Internet, so please do your research first?)

It seems LR is OK. I like random forest too for your case. SVM is excellent but won't really work if good normalization of the data can't be achieved, which is usually the case with text data. Normalizing text can be really tough and data-intensive.
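To make the comparison concrete, here is a rough scikit-learn sketch (Python rather than R, my assumption) of the usual TF-IDF + classifier pipeline; the toy texts and labels are made up:

```python
# Hedged sketch: compare a few classifiers on TF-IDF features (scikit-learn).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Toy data: each text is a candidate sentence, each label a relation type (made up).
texts  = ["aspirin treats headache", "aspirin causes nausea",
          "ibuprofen treats fever",  "ibuprofen causes dizziness"] * 10
labels = ["treats", "causes", "treats", "causes"] * 10

for name, clf in [("LR", LogisticRegression(max_iter=1000)),
                  ("RF", RandomForestClassifier(n_estimators=100)),
                  ("SVM", LinearSVC())]:
    pipe = Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
                     ("clf", clf)])
    scores = cross_val_score(pipe, texts, labels, cv=3)
    print(name, scores.mean())
```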

If you had good datasets I would think of AdaBoost, but IMO it only adds value if you have 1 or 2 strong classifiers and a few other weak ones. I invite you to find a boosting procedure that works better with some sort of fuzzy rules.
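As a minimal starting point (note that scikit-learn's AdaBoost boosts many copies of a single weak learner, by default decision stumps, rather than the mixed strong/weak ensemble I described, so take this only as a sketch):

```python
# Hedged sketch: AdaBoost over TF-IDF features (scikit-learn's default weak learner is a decision stump).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import AdaBoostClassifier

boosted = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("ada", AdaBoostClassifier(n_estimators=200)),
])
# boosted.fit(train_texts, train_labels)      # train_texts/train_labels are hypothetical
# predictions = boosted.predict(test_texts)
```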

@erictleung: do you have any experience with relationship extraction techniques applied to your field?

I hope this can help.

@SauravDeb I have the gut feeling that word2vec could also work for this case, but I haven't tried it.
evaristoc
@evaristoc
Mar 31 2018 11:58

@SauravDeb Here are some very quick examples of what I was suspecting regarding deep learning (convolutional, though I would say recurrent could be more effective but harder to implement); I haven't really read them:
https://github.com/may-/cnn-re-tf
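That repo is TensorFlow-based; just to give the flavour (a hedged toy sketch in Keras, not the repo's code), a convolutional relation classifier looks roughly like this:

```python
# Hedged sketch: a toy text-CNN relation classifier in Keras (NOT the code from the repo above).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dense

vocab_size, max_len, n_relations = 10000, 50, 5           # made-up sizes

model = Sequential([
    Input(shape=(max_len,)),                               # padded sequences of token ids
    Embedding(vocab_size, 128),                            # token ids -> dense vectors
    Conv1D(128, 3, activation="relu"),                     # slide a window over 3 tokens
    GlobalMaxPooling1D(),                                  # keep the strongest feature per filter
    Dense(64, activation="relu"),
    Dense(n_relations, activation="softmax"),              # one output per relation type
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, ...)   # X_train: padded id sequences, y_train: relation ids (hypothetical)
```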

Here is something using word2vec that you could also read:

Saurav Deb
@SauravDeb
Mar 31 2018 13:48
@evaristoc, you have been a wonderful person :) I'll check those links to see if I find something of use in the context of my problem. And I couldn't find any annotated data on the internet, ready for training, that could fit my model. However, if you could tell me how to proceed with choosing an unsupervised method, it'd be great.
Saurav Deb
@SauravDeb
Mar 31 2018 13:59
And thanks a lot for telling me that this isn't a classification problem because it seems I've been misled by other sources.
Saurav Deb
@SauravDeb
Mar 31 2018 14:07
@evaristoc I've been trying to find out how I should approach the pattern discovery thing you so correctly mentioned, but haven't had any success :worried:.
Saurav Deb
@SauravDeb
Mar 31 2018 15:05
@evaristoc The link you provided (https://www.microsoft.com/developerblog/2016/09/13/training-a-classifier-for-relation-extraction-from-medical-literature/) was amazing :+1: and made a lot of things clear for me. However, I couldn't follow the part about the tasks performed under Feature Representations. Could you kindly point me to someplace with a simple explanation of the concepts used there?
Kartik Mudgal
@Sprinting
Mar 31 2018 16:34
@becausealice2 is this a toy project ?
I hope to god this is a toy project
why would someone use js for ml
Josh Goldberg
@GoldbergData
Mar 31 2018 17:04
Haha @Sprinting
Alice Jiang
@becausealice2
Mar 31 2018 17:38
@Sprinting I'm not using it at all. It's blowing up my Twitter feed.
Kartik Mudgal
@Sprinting
Mar 31 2018 17:49
haha, I was not assuming you did
evaristoc
@evaristoc
Mar 31 2018 22:01

@SauravDeb

  1. I must admit I confused the term Relation Extraction with a different technique. My fault. You can indeed treat Relation Extraction very much like a classification problem, as I found out while looking at your problem. Apologies.
  2. The unsupervised techniques I found were mostly based on clustering. This is an example: https://aclanthology.info/pdf/P/P09/P09-1115.pdf (a rough sketch of the clustering idea follows below).
  3. Bag of Words is commonly a word counter. An n-gram is a sequence of n adjacent words that you count as one entity: a 1-gram is a single word, a 2-gram is two adjacent words, and so on. The BoW model they describe looks like a simple encoding.
  4. POS (part-of-speech) tagging is an NLP method that assigns a grammatical category to each word in a sentence. POS tagging usually requires ML itself, but it is so common that many libraries do it automatically nowadays.
  5. Word embeddings are vectorizations of the relations between words, and this is where word2vec or something similar should be used. word2vec is an unsupervised NN implementation that finds similarity patterns among the words in the corpus and lays those relations out in a vector space. It is a standard technique. They implement Doc2Vec, which is the same idea but at the document level. (A combined sketch of 3-5 also follows below.)
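For 2, a rough illustration of the clustering idea (scikit-learn in Python, my own choice, not the method from the paper): vectorize the text between each entity pair, cluster those contexts, and inspect each cluster for a relation type.

```python
# Hedged sketch of the clustering idea for unsupervised relation discovery (scikit-learn).
# Not the paper's method -- just TF-IDF + KMeans over the text between entity pairs.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical "contexts": the words appearing between two detected entities.
contexts = ["treats the symptoms of", "is prescribed for", "is indicated for",
            "causes severe", "may lead to", "results in chronic"]

X = TfidfVectorizer().fit_transform(contexts)
km = KMeans(n_clusters=2, random_state=0).fit(X)
for label, ctx in zip(km.labels_, contexts):
    print(label, ctx)   # inspect the clusters and name them, e.g. "treatment" vs "side effect"
```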

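And for 3-5, a very rough Python sketch of those feature representations (scikit-learn, NLTK and gensim are my own choices here, not necessarily what the article uses):

```python
# Hedged sketch of the three feature representations described above (3-5).
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from gensim.models import Word2Vec

sentences = ["aspirin treats headache", "ibuprofen reduces fever"]   # toy corpus

# 3. Bag of words with unigrams and bigrams: each column counts a 1- or 2-gram.
bow = CountVectorizer(ngram_range=(1, 2))
X = bow.fit_transform(sentences)
print(bow.get_feature_names_out())

# 4. POS tagging: one grammatical tag per token.
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")   # one-time downloads
print(nltk.pos_tag(nltk.word_tokenize(sentences[0])))   # e.g. [('aspirin', 'NN'), ...]

# 5. Word embeddings: word2vec places words in a vector space by their co-occurrence patterns.
tokenized = [s.split() for s in sentences]
w2v = Word2Vec(tokenized, min_count=1)   # tiny toy corpus, just to show the API
print(w2v.wv["aspirin"][:5])             # first 5 dimensions of the word's vector
```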
I hope this helps. Good luck!!