[
{ value: 'He', tag: 'word', normal: 'he', pos: 'PRP' },
{ value: 'is', tag: 'word', normal: 'is', pos: 'VBZ', lemma: 'be' },
{ value: 'trying', tag: 'word', normal: 'trying', pos: 'VBG', lemma: 'try' },
{ value: 'to', tag: 'word', normal: 'to', pos: 'TO' },
{ value: 'fish', tag: 'word', normal: 'fish', pos: 'VB', lemma: 'fish' }
]
You can .map() this array to any format that you may require.
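For instance, a minimal sketch of such a mapping, assuming the tagged token array above is held in a variable named tokens (both the variable name and the target shape are only illustrative):
var simplified = tokens.map( function ( t ) {
  // Keep only the surface form and its POS tag.
  return { word: t.value, pos: t.pos };
} );
// => [ { word: 'He', pos: 'PRP' }, { word: 'is', pos: 'VBZ' }, ... ]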
text and entityType are required; uid and value are optional and will be returned as-is if that entity is found.
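For example, a single entry for learn() could be shaped like this (the entityType and value shown here are only illustrative):
var entity = {
  text: 'united kingdom',   // mandatory: the text pattern to spot
  entityType: 'country',    // mandatory: illustrative type
  uid: 'uk',                // optional: returned as-is when the entity is found
  value: 'United Kingdom'   // optional: returned as-is when the entity is found
};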
Tokenization is a prerequisite to Named Entity Recognition (NER), so a text is first tokenised before identifying the entities in it. The value represents the contents of a token, and uid represents the unique id given to the various text patterns representing a single entity. While preparing content for the learn() API, multiple patterns of an entity can be defined with a unique identity (as in uid as uk), which helps the recognize() API detect these patterns as one entity only. You can try this by learning u.k./UK/United Kingdom with uid as uk and testing the outcomes with various combinations of these patterns; a rough sketch is shown below. Hope this is useful. Cheers!
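Rough sketch of that experiment (the entityType 'country' is only an assumption for illustration, and the text patterns should mirror how the tokenizer splits the input):
var NER = require( 'wink-ner' );
var Tokenizer = require( 'wink-tokenizer' );
var ner = NER();
var tokenize = Tokenizer().tokenize;
// Three surface patterns of the same entity share the uid 'uk'.
ner.learn( [
  { text: 'u.k.', entityType: 'country', uid: 'uk' },
  { text: 'uk', entityType: 'country', uid: 'uk' },
  { text: 'united kingdom', entityType: 'country', uid: 'uk' }
] );
// Any matching variant should come back as a single entity with uid 'uk'.
var tokens = ner.recognize( tokenize( 'He moved from the United Kingdom to the US.' ) );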
Thanks for this explanation, now it is clear to me what the uid is used for. :)
I was already able to set up my own NE recognizer, but it is not quite doing what I had expected.
I want to build a resume parser, and for this I'm trying to get all the needed information with NER. I used a training data set from the internet which I manipulated into the form { text: "sample", entityType: "sample" }. After I applied .learn() and .recognize(), nearly no entity was found correctly; everything was tagged as either word, punctuation, or alien. I wanted to look for names, skills, experience, etc.
I had a look at the data set and my idea is that the recognizer is kind of overfitted. (The data set consists mostly of Indian resumes and the 'text' values are sometimes quite long, for example "C# (1 year), C++, JS".)
My question now is: is there a way to really teach the recognizer what I am looking for, or is it just checking whether the desired strings are found anywhere in the tokens?
Sorry for the spam, but I wanted to make things clear :)
Thanks in advance for any help, I really appreciate this chat!
wink-ner is a gazetteer-based (i.e. look-up driven) NER which can spot patterns smartly. Therefore it will be able to spot skills, cities, etc. easily, but names and experience could be tricky, especially if you are looking for generalization. We will try and share some ideas on how you can achieve some of it in the next couple of days. If you have annotated data, then you could consider using wink-perceptron to achieve the objective. A similar case study is there on our blog – NLP in Agriculture.
Here is a simple example that may help you:
// Load wink-ner and wink-tokenizer.
var NER = require( 'wink-ner' );
var Tokenizer = require( 'wink-tokenizer' );
// Instantiate the tokenizer and the NER.
var tokenize = Tokenizer().tokenize;
var ner = NER();
var trainingData = [
{ text: 'c + +', entityType: 'skill', uid: 'C++' },
{ text: 'c #', entityType: 'skill', uid: 'C#' },
{ text: 'php', entityType: 'skill', uid: 'PHP' },
{ text: 'my sql', entityType: 'skill', uid: 'MySQL' },
{ text: 'mysql', entityType: 'skill', uid: 'MySQL' },
{ text: 'python', entityType: 'skill', uid: 'Python' },
{ text: 'javascript', entityType: 'skill', uid: 'Javascript' },
{ text: 'java script', entityType: 'skill', uid: 'Javascript' },
{ text: 'nodejs', entityType: 'skill', uid: 'Node.js' },
{ text: 'node js', entityType: 'skill', uid: 'Node.js' },
{ text: 'web design', entityType: 'skill', uid: 'Web Design' },
];
ner.learn( trainingData );
var r = 'I have worked in C++, node js, MY SQL, extensively and have limited web design experience! My email is r2d2@gmail.com.';
var tokens = tokenize( r );
tokens = ner.recognize( tokens );
tokens.forEach( ( t ) => {
if ( t.uid ) console.log( `Skill: ${t.uid}` );
if ( t.tag === 'email' ) console.log( `E-Mail: ${t.value}` );
} );
It produces the following output:
Skill: C++
Skill: Node.js
Skill: MySQL
Skill: Web Design
E-Mail: r2d2@gmail.com
Please download the latest version of wink-ner and use it.
Really nice work you have done here!
Do you have a recommendation on how to extract the entity (if one is associated) for each word in a sentence? So far I combine the information from different sources of 'doc' and it feels really hacky. Also, when extracting the entities, is there a way to find out what type each one is?