Extensions set via `set_extension` are stored as class properties on the `Underscore` class, which is attached to classes like `Doc`. The only way I can think of to allow different behaviour is reloading the modules to "reset" them before adding the next extensions. So no, as far as I know there is currently no way to bind the extensions to a pipeline instead of the classes.
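To illustrate the class-level behaviour described above, here is a minimal sketch (the attribute name `my_attr` and the value are made up for illustration): because the extension is registered on the `Doc` class itself, every pipeline sees it.

```python
import spacy
from spacy.tokens import Doc

# The extension is registered on the Doc class, not on a pipeline:
Doc.set_extension("my_attr", default=None)

nlp_a = spacy.blank("en")
nlp_b = spacy.blank("en")

doc_a = nlp_a("hello world")
doc_b = nlp_b("hello world")

doc_a._.my_attr = 42
print(doc_a._.my_attr)               # 42
print(Doc.has_extension("my_attr"))  # True -- visible to every pipeline
```

Both `nlp_a` and `nlp_b` share the extension, which is exactly why it cannot be scoped to a single pipeline without tricks like module reloading.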
document `text.txt` is a file with some wiki data
@johann-petrak This section explains how the tokenizer works. You can add special cases if you have some specific edge cases which you want to handle, or write a custom tokenizer with your own ruleset. But most likely you just want to add a regex for your parentheses situation to the existing ruleset.
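For reference, adding a special case to the existing tokenizer looks like this (a minimal sketch; the string "gimme" and the way it is split are just illustration):

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")

# Teach the tokenizer to split the exact string "gimme" into two tokens:
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

print([t.text for t in nlp("gimme that")])  # ['gim', 'me', 'that']
```

Special cases only fire on exact string matches, so for parentheses patterns you would more likely adjust the tokenizer's infix/suffix regexes instead.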
About your second question: from what I can see in the code, the EntityRuler internally uses a (Phrase)Matcher and its methods, so it might be possible to use all pattern rules. If so, this definitely needs to be added to the docs.
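In practice that would mean the EntityRuler accepts both phrase patterns (strings, handled by the internal PhraseMatcher) and token patterns (lists of dicts, handled by the internal Matcher). A sketch, with made-up labels and patterns (spaCy v3 API shown; in the v2 of this thread you would instantiate `EntityRuler(nlp)` and pass it to `nlp.add_pipe`):

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # phrase pattern (string) -> internal PhraseMatcher
    {"label": "ORG", "pattern": "Apple"},
    # token pattern (list of dicts) -> internal Matcher
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
])

doc = nlp("Apple opened an office in San Francisco")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Apple', 'ORG'), ('San Francisco', 'GPE')]
```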
@johann-petrak Good to hear that they helped :+1: As @ines stated in some issue (can't find it), the docstrings should not be too big, but should include the main information and, mostly, a link to the API docs. There you should find the most useful information or further links to other parts of the docs (e.g. usage). I just created #4064 to update the Tokenizer docs.
I haven't worked with the EntityRuler so far, so I can't speak to its real-world usage or the difference to the Matcher. But I think you are right, and I can't tell why there might be/is such a difference. Maybe you want to open an issue about it to get some... And you already opened #4063 ^^
valve, and fix spelling errors. Is this something I can use spaCy for, or should I be using something else to pre-process the text before spaCy?
```python
from spacy.matcher import PhraseMatcher

title_matcher = PhraseMatcher(nlp.vocab)
prepared_titles = list(nlp.pipe([title.lower() for title in JOB_TITLES]))
title_matcher.add("JOB_TITLE", None, *prepared_titles)
```
`PhraseMatcher(nlp_model.vocab, attr='LEMMA')` but it doesn't work and, even worse, it matches every single word.
The `attr='LEMMA'` option does not work for me either. I get the same results as you if I use `nlp = English()`, but I get no matches if I use `nlp = spacy.load('en_core_web_sm')`. Maybe it is a bug, or maybe the LEMMA attribute is not actually supported by the PhraseMatcher? The best workaround I can think of is adding the plurals as well to the patterns...
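Another workaround, besides adding the plurals: matching on a different token attribute such as `LOWER` works even without a statistical pipeline, since the lowercase form (unlike the lemma) does not depend on a tagger/lemmatizer. A sketch with made-up patterns and text:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# Match on the lowercase form instead of the lemma -- no model needed.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("TERM", [nlp("machine learning")])

doc = nlp("Machine Learning is fun")
print([doc[start:end].text for _, start, end in matcher(doc)])
# ['Machine Learning']
```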
Hi all. I'm looking at the tutorials for training a TextCategorizer model. I came across these lines:
```python
with textcat.model.use_params(optimizer.averages):
    # evaluate on the dev data split off in load_data()
    scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
```
Hoping someone can help me understand why `textcat.model.use_params(optimizer.averages)` is used here and what it's doing.
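For context, my understanding (not an authoritative answer): the optimizer keeps a running average of the weights seen during training, and `use_params` temporarily swaps those averaged weights in, so the evaluation runs against the averaged model; the raw training weights are restored when the block exits. Conceptually it behaves like this toy context manager (a sketch, not thinc's actual implementation):

```python
from contextlib import contextmanager

class ToyModel:
    def __init__(self):
        self.weights = [1.0, 2.0]  # current (noisy) training weights

    @contextmanager
    def use_params(self, averages):
        # Swap in the averaged weights, restore the originals on exit.
        backup = self.weights
        self.weights = averages
        try:
            yield
        finally:
            self.weights = backup

model = ToyModel()
with model.use_params([1.5, 1.5]):   # e.g. a running average of past weights
    print(model.weights)  # [1.5, 1.5] -- evaluation sees averaged weights
print(model.weights)      # [1.0, 2.0] -- training weights are restored
```

Evaluating on averaged weights usually gives a smoother, more representative dev score than evaluating on the weights of the very last update.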