Johann Petrak
@johann-petrak
Does anyone know how the Tokenizer in spaCy works and how the rules for splitting can be changed? The German tokenizer seems to split text like "dies(und)das" into 5 tokens, while the English tokenizer makes one token out of it. How can I change the behaviour of the German tokenizer to match the English one, just for parentheses that have no spaces around them and are enclosed between alphabetic words?
Johann Petrak
@johann-petrak
Also, I am puzzled by the token rules that can be used in an entity ruler versus the token rules for a matcher: in a matcher, only a subset of the attributes is allowed and they have to be given in all uppercase, while in the entity ruler all attributes can be used with their original case? But the value for a matcher can use the extended pattern syntax and the value for an entity ruler cannot? Does the entity ruler also allow quantifiers?
Björn Böing
@BreakBB

@johann-petrak This section explains how the tokenizer works. You can add special cases if you have some specific edge cases you want to handle, or write a custom tokenizer with your own ruleset. But most likely you just want to adjust the existing ruleset's regexes for your parentheses situation, along the lines of the sketch below.
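A minimal sketch of that idea, assuming spaCy v2 and the small German model; the filter on a literal "\(" is a rough heuristic, so inspect nlp.Defaults.infixes yourself to pick the right patterns:

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("de_core_news_sm")  # assumed model

# Defaults.infixes is a tuple of regex strings; drop the ones that
# split on a literal parenthesis (rough heuristic, verify the patterns)
infixes = [p for p in nlp.Defaults.infixes if r"\(" not in p]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("dies(und)das")])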

About your second question: from what I can see in the code, the EntityRuler internally uses a (Phrase)Matcher and its methods, so it might be possible to use all pattern rules. If so, this definitely needs to be added to the docs.

Johann Petrak
@johann-petrak
@BreakBB thanks, those pointers helped a lot! I think it would be great if the spaCy API docs had pointers to where those details are documented.
I understand now that German has an infix rule that splits on the parentheses while English does not.
The Matcher vs EntityRuler details are still a mystery to me: in the documentation, the examples seem to show that for the entity ruler one can use the original lower-case versions of the token attribute names, while the matcher needs the special subset of uppercase names. So if you saw in the code that the entity ruler uses the matcher internally, there should be no such difference?
Björn Böing
@BreakBB

@johann-petrak Good to hear that they helped :+1: As @ines stated in some issue (can't find it), the docstrings should not be too big, but should include the main information and, mostly, a link to the API docs. In those docs you should find the most useful information or further links to other parts of the docs (e.g. usage). I just created #4064 to update the Tokenizer docs.

I haven't worked with the EntityRuler so far, so I can't tell about the real usage or the difference to the Matcher. But I think you are right, and I can't tell why there might be such a difference. Maybe you want to open an issue about it to get some... And you already opened #4063 ^^
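From a quick look at the docs, both components appear to accept the same uppercase token-attribute syntax, including "OP" quantifiers; a sketch (spaCy v2 style, untested):

import spacy
from spacy.matcher import Matcher
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")

# Matcher: token patterns with uppercase attribute names
matcher = Matcher(nlp.vocab)
matcher.add("DEV", None, [{"LOWER": "software"}, {"LOWER": "developer"}])

# EntityRuler: same token-pattern syntax, quantifiers via "OP"
ruler = EntityRuler(nlp)
ruler.add_patterns([
    {"label": "JOB_TITLE",
     "pattern": [{"LOWER": "senior", "OP": "?"}, {"LOWER": "developer"}]},
])
nlp.add_pipe(ruler, before="ner")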

Fabien B
@keschnir
Hello World, I'm a complete beginner in machine learning and the end of my internship is all about NLP. I was hoping you could share some links or resources if I describe my need.
I have a model that uses entity match rules to find specific words in my texts. Now that this pipe works like a charm, I'd like to find further information surrounding those words. My texts are real estate descriptions, which means I can find multiple surface areas in my texts. For example, I could have something like "This apartment counts 3 rooms for a total surface of 50m². The living room is 20m², ideal for students. Gas heating, (etc..)." I'd like to be able to know that the total area is 50 and the living room is 20, because for now I only listed the areas mentioned, but not what those areas relate to.
There are many other entities that could use the same method, but I don't know where and how to start, or even the name of what I'm doing ^^
fizban99
@fizban99
@keschnir If you have already identified the surface_value (e.g. "50m²") and surface_name (e.g. "living room") as entities, one simple option would be to use the sentencizer and assume that if a surface_value and a surface_name occur in the same sentence, they are related. I believe what you want is to extract the relationships between the named entities found in a "Named Entity Recognition" process. I understand its full implementation is planned for a future release of spaCy, but there are currently some workarounds, such as the simple rule method I mentioned, or training the dependencies as if they were a POS tagging task. An intermediate option between simple rules and full relationship training could be to use the POS tagging of one of the pre-trained models to get the syntactic dependencies and assume that the surface_value and the surface_name will have a syntactic relationship.
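A minimal sketch of the same-sentence rule, assuming your entity rules already set the (hypothetical) labels SURFACE_NAME and SURFACE_VALUE:

import spacy

nlp = spacy.load("en_core_web_sm")  # plus your custom entity rules
doc = nlp("This apartment counts 3 rooms for a total surface of 50m². "
          "The living room is 20m², ideal for students.")

for sent in doc.sents:
    # hypothetical labels, assumed to be set by your own pipeline
    names = [e for e in sent.ents if e.label_ == "SURFACE_NAME"]
    values = [e for e in sent.ents if e.label_ == "SURFACE_VALUE"]
    for name in names:
        for value in values:
            print(name.text, "->", value.text)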
Fabien B
@keschnir
@fizban99 thx a lot, I might have found what I need in the code examples of the official doc. This example uses the POS tree, which can be retrained for my purpose.
King Chung Huang
@kinghuang_gitlab
I’d like to use spaCy for NER, but I need to clean up the text first. One of the cleanup tasks is to expand shorthand text like vlv to valve, and to fix spelling errors. Is this something I can use spaCy for, or should I be using something else to pre-process the text before spaCy?
fizban99
@fizban99
@kinghuang_gitlab spaCy is based on a non-destructive tokenizer, so it generally assumes the text is already cleaned before getting into the spaCy pipeline. Typically you would clean up your text with regular expressions, replacing, adding and removing according to your needs. If your text comes from an OCR process, it might be an iterative process until you are more or less happy with all the pre-processing rules...
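For instance, a minimal regex pre-processing sketch (the shorthand table is made up):

import re

ABBREVIATIONS = {"vlv": "valve", "pmp": "pump"}  # hypothetical shorthand table

def expand_shorthand(text):
    # \b so that only whole tokens are replaced
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b")
    return pattern.sub(lambda m: ABBREVIATIONS[m.group(1)], text)

print(expand_shorthand("check vlv 3 and pmp 7"))  # -> check valve 3 and pump 7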
King Chung Huang
@kinghuang_gitlab
@fizban99 Thanks, that helps.
jonathan-bonnaud
@jonathan-bonnaud
Hello everyone. I am using the PhraseMatcher to match job titles (Software Developer, etc.). I have a big list of job titles and I give it to the PhraseMatcher this way:
title_matcher = PhraseMatcher(nlp.vocab)
prepared_titles = list(nlp.pipe([title.lower() for title in JOB_TITLES]))
title_matcher.add("JOB_TITLE", None, *prepared_titles)
The problem is that when the job title in the text from which I want to extract is in plural form, it doesn't match.
I tried to instantiate the matcher with PhraseMatcher(nlp_model.vocab, attr='LEMMA') but it doesn't work and, even worse, it matches every single word.
Does anyone have a solution? Am I doing something wrong?
fizban99
@fizban99
@jonathan-bonnaud The attr='LEMMA' option does not work for me either. I get the same results as you if I use nlp = English(), but I get no matches if I use nlp = spacy.load('en_core_web_sm'). Maybe it is a bug, or maybe the LEMMA attribute is not actually supported by the PhraseMatcher? The best workaround I can think of is adding the plurals to the patterns as well...
jonathan-bonnaud
@jonathan-bonnaud
@fizban99 Thank you for your answer! Sure, well, for now, I won't support matching of plurals then...
Girraj Jangid
@Girrajjangid
Is it possible to download
fizban99
@fizban99
@jonathan-bonnaud I opened an issue about this (#4100) and it seems that the problem was that you were using an empty model, so the lemmas were not set. If you load one of the models, it should work. It was not working for me because I was using make_doc, and that does not set the lemmas even when a model is loaded.
jonathan-bonnaud
@jonathan-bonnaud
@fizban99 Does it mean that English() is an empty model?
fizban99
@fizban99
It is empty. It provides lemmatization via lookups, tokenization and a little more. But the PhraseMatcher does not support lemmatization via lookups and expects the lemmas to be set by a statistical model.
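So the working setup should look roughly like this (spaCy v2 style, using nlp() rather than make_doc() so the pattern docs get their lemmas set):

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")  # statistical model, so lemmas are set
matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")

# nlp(), not nlp.make_doc(): the pattern docs need lemmas too
patterns = list(nlp.pipe(["software developer"]))
matcher.add("JOB_TITLE", None, *patterns)

doc = nlp("We are hiring software developers.")
print([doc[start:end].text for _, start, end in matcher(doc)])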
jonathan-bonnaud
@jonathan-bonnaud
Ah okay. That's why! Thanks!
Girraj Jangid
@Girrajjangid
Is it possible to download the 'en_core_web_lg' model on Colab? Please provide any source.
fizban99
@fizban99
@Girrajjangid not an expert on colab, but this seems to work for me:
!python -m spacy download en_core_web_lg
import spacy
# load by full path; the shortcut name may not resolve until the Colab runtime is restarted
nlp = spacy.load("/usr/local/lib/python3.6/dist-packages/en_core_web_lg/en_core_web_lg-2.1.0")
doc = nlp(u"displaCy uses JavaScript, SVG and CSS.")
spacy.displacy.render(doc, style='dep', jupyter=True)
Vladyslav Shcherbyna
@valdislav
Hello everybody. Does anyone know good examples of how to tackle email/message closing signatures? Basically, to get rid of such signatures, is it OK to assume that no text will exist after the signature? Is there a way to handle text that may occur after a signature (e.g. part of another email included inside the message, so there are two signatures with text between them)?
Mustafa Qamar-ud-Din
@mustafa-qamaruddin
I would like to contribute to spaCy model implementation, stuff like Brown clustering etc. How do I get started?
dugast
@chris.dugast_gitlab
Hi there: is it possible to get access to the data behind the spaCy language models and the spaCy NER models?
Mat Leonard
@mcleonard

Hi all. I'm looking at the tutorials for training a TextCategorizer model. I came across these lines:

with textcat.model.use_params(optimizer.averages):
    # evaluate on the dev data split off in load_data()
    scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)

Hoping someone can help me understand why textcat.model.use_params(optimizer.averages) is used here and what it's doing.

Damiano Porta
@damianoporta
guys, I split the sentences of my documents by looking at \n characters, because it is a semi-structured document. The problem is when I have very short sentences. Let me give you an example. Suppose we have a text like "Birthdate:\n1980-01-01". I split the text into two sentences because of the \n, but the result is "birthdate:" and "1980-01-01". So the problem is: how can spaCy understand (tag with the NER model) that 1980-01-01 is a birthdate? I have added a custom label called "BIRTHDATE" but, without context, it should be very hard to get good accuracy. If I pass the entire text to train the model, the result will be better, but I always read that I should segment the documents into sentences to: 1. speed up the training 2. spend less time annotating
manish maharjan
@mmanishh
Why is spaCy using a residual CNN instead of an RNN for NER?
Joseph Catrambone
@JosephCatrambone
Morning, all. I just downloaded the en-1.1.0.tar.gz model and I was wondering where one should extract it. The normal spacy.en.download approach has failed me.
PeterWolf
@PeterWolf-tw
@stuz5000 Hi, this may be a little bit old. We've just worked out a solution for the Chinese language (Traditional Chinese currently). Would you like to give it a try? https://github.com/Droidtown/ArticutAPI/blob/master/Docs/Articut-GraphQL/ReadMe_EN.md
manish maharjan
@mmanishh
Can somebody explain this? explosion/spaCy#4181
Campbells
@aCampello
Hey guys, does anyone know if there is a way to see spaCy's NER precision and recall per label? The releases show the average performance, but in some applications it would be interesting to see it per label.
tsoernes
@tsoernes
@mcleonard it's explained in the textcat training tutorial on spacy.io
Campbells
@aCampello
Sorry, I meant the precision and recall for the pre-trained models (e.g., en_core_web_sm)
amirouche
@amirouche
spaCy is blocking inside nlp('my toy project'); any idea what might be causing this? (using en_core_web_lg)
It is non-deterministic.
It happens after at least a thousand calls.
pythonBerg
@pythonBerg
Hey guys, I use several models for custom named entities. For the more complex ones, I also have a classifier to refine results. Can anyone tell me how I can ask a model if a classifier is present? Is it a matter of checking if a pipeline component exists?
martijnvanbeers
@martijnvanbeers
pythonBerg: that sounds like a good approach to me, yes
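Presumably something along these lines (the model name is made up):

import spacy

nlp = spacy.load("my_custom_model")  # hypothetical model name
# pipe_names lists the components currently in the pipeline
if "textcat" in nlp.pipe_names:
    textcat = nlp.get_pipe("textcat")  # the classifier component
    print("classifier present:", textcat)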
Damiano Porta
@damianoporta
hello guys, a quick question. Can I continue training a model with a different dropout value? During the first training I used 0.2; now I would like to increase it to 0.3. Should I create a new model, or can I simply increase that value?
kickme26
@kickme26
Hello guys, I need to use spaCy for language detection. I need to identify whether a given text (the text also contains a date) is American English or UK English. Could anyone help me out? Based on that, I need to extract the given date.
Bram Vanroy
@BramVanroy
@amirouche Memory issue, perhaps? How is your memory consumption?
@damianoporta It's generally not advised to change such values during training. You can try, of course. I advise you to note down exactly at which step/epoch you changed the dropout. Documentation is everything.
Damiano Porta
@damianoporta
@BramVanroy Ok, but I am not changing that value during the training (programmatically). I set a different dropout value when I want to train the model for more epochs. For example, at the moment I have trained my model for 25 epochs. Now, can I continue for the next 25 using a different dropout value, OR should I start from 1 again?
Bram Vanroy
@BramVanroy
@damianoporta I wouldn't suggest it, but you can if you want.
Damiano Porta
@damianoporta
@BramVanroy ok, do you think 0.2 is a good start for dropout?
I think I should increase it to 0.3 to "generalize" better
Bram Vanroy
@BramVanroy
@damianoporta Can't answer that. It depends completely on your dataset, task, progress, and so on. Do a grid search to optimise, i.e. try multiple values and keep the best model (e.g. 0, 0.2, 0.4).
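In sketch form, where train_model and evaluate_model stand in for your own routines:

# hypothetical helpers: plug in your own training/evaluation code
best_score, best_dropout = 0.0, None
for dropout in (0.0, 0.2, 0.4):
    nlp = train_model(train_data, dropout=dropout)
    score = evaluate_model(nlp, dev_data)
    if score > best_score:
        best_score, best_dropout = score, dropout
print("best dropout:", best_dropout)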
Damiano Porta
@damianoporta
@BramVanroy ok, thank you
Rasmus Bonnevie
@Bonnevie
What is the proper way to replace one of the standard pipeline components with a custom part? I want to keep most of the pipeline, but I would like to replace the dependency parser with an alternative model; to integrate properly, I need to set the dep tags. I am looking a bit at how spacy-stanfordnlp goes about it, but I want to just set the dep tags, not replace the whole pipeline. Is there a setter method for dep, or is there a way to subclass DependencyModel?
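One possible shape for this, loosely following spacy-stanfordnlp (my_external_parser is a made-up stand-in for the alternative model; this assumes spaCy v2, where token.head, token.dep_ and doc.is_parsed are writable):

import spacy

nlp = spacy.load("en_core_web_sm")

def custom_parser(doc):
    # my_external_parser is hypothetical: it should return one
    # (head_index, dep_label) pair per token
    for token, (head_i, dep_label) in zip(doc, my_external_parser(doc)):
        token.head = doc[head_i]
        token.dep_ = dep_label
    doc.is_parsed = True  # mark the doc as parsed, like spacy-stanfordnlp does
    return doc

nlp.replace_pipe("parser", custom_parser)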