psychosis448
@psychosis448
'ner' is in the model's pipeline and does exist though
It works without issue when I don't add the ruler before 'ner'
skylerilenstine
@skylerilenstine
Are there any examples of people using spaCy to create a chatbot? I know ChatterBot uses it, but that's using the similarity feature. I'm thinking more along the lines of using the NLP tools to create follow-up questions/statements. Like if someone types "I like noodles, they remind me of home.", using the parser to return something like "Why do you think them reminding you of home makes you like them?" I'm able to code that example, but I'm just curious if anyone has made a bot this way (using specific rules), to give me ideas of other ways to follow up on / parse user input. Thanks!
Björn Böing
@BreakBB
@skylerilenstine Just have a look at the spaCy universe. Rasa NLU is listed there as well
Robert (Bob) Borges
@BobBorges
Hi all. I'm just starting with spaCy, and I've run into something I've never experienced before. If I run a script that uses spaCy, it works as expected in the first instance of the kernel (using spyder), but if I run the same script again, it throws an error. In Python 2, the error is "AttributeError: 'NoneType' object has no attribute 'literal_eval'". In Python 3, it's "PicklingError: Could not pickle object as excessively deep recursion required." The problem line seems to be nlp = spacy.load("nl_core_news_sm"). It doesn't seem to be my script, as this behaviour is reproduced with the code examples from the spaCy website. I've tried the English and Dutch packages, and two computers (but the same OS, Linux Mint 19.1), with the same result. If I restart the kernel in spyder, the script runs as expected, but only the first time. Has anyone seen this before? How can it be solved? Now on day 2 of googling with no leads to go on... any ideas?
Johann Petrak
@johann-petrak
I am confused about which things in spaCy are global: for example, registering extensions is done by invoking a class method on Token, but what if I have several spaCy pipelines (e.g. for different languages) at the same time and I need different extensions on different pipelines?
Is there a way to declare extensions local to a pipeline?
Björn Böing
@BreakBB
@johann-petrak The docs state that the information you pass into set_extension is stored as class properties in the Underscore class on either Token, Span or Doc. The only way I can think of to allow different behaviour is reloading the modules to "reset" them before adding the next extensions. So no, as far as I know there is currently no way to bind the extensions to a pipeline instead of the classes
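A minimal sketch of that behaviour, assuming spaCy 2.x (the extension name my_attr is invented for illustration):

import spacy
from spacy.tokens import Token

# registered once, class-wide
Token.set_extension("my_attr", default=None)

nlp_en = spacy.blank("en")
nlp_de = spacy.blank("de")

# the extension is visible on tokens from both pipelines
print(nlp_en("hello")[0]._.my_attr)  # None
print(nlp_de("hallo")[0]._.my_attr)  # None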
Johann Petrak
@johann-petrak
@BreakBB thanks, that confirms what I thought but found hard to believe; it looks like a severe design error to me. System-global settings like this severely hinder flexibility IMO.
Björn Böing
@BreakBB
@johann-petrak For your case this might be true, but I think for many others this design allows straightforward usage. Moreover, I am not a spaCy expert, so there might be a way, and @ines or @honnibal can correct me.
martijnvanbeers
@martijnvanbeers
you could namespace your properties with the language, I guess, to work around it (roughly as sketched below)
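Roughly like this, with a hypothetical per-language naming scheme:

from spacy.tokens import Token

# one copy of the extension per language
for lang in ("en", "de"):
    Token.set_extension(lang + "_my_attr", default=None)

# an English component then uses token._.en_my_attr, a German one
# token._.de_my_attr; both names exist globally, but each pipeline
# only touches its own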
Nela Moore ♡
@NelaMoore_twitter
Hi guys, I wanted to ask if anyone knows whether it is possible to use spaCy in a WP (WooCommerce) plugin, as an API for text analysis. The client is using SiteGround, which supports Python. Can anyone give me any suggestions?
Rahul Shinde
@shinde-rahul

@BobBorges the code given in ^^

import spacy
ruzie = open("documenttext.txt", "r").read()
nlp = spacy.load("nl_core_news_sm")
ruzie = nlp(ruzie)
for token in ruzie:
    print(token.text, token.pos_, token.dep_)

is working as expected (code executed through the terminal though)

documenttext.txt is a file with some wiki data
Johann Petrak
@johann-petrak
Hmm, what is the recommended way to add tokenization rules for whitespace-containing constructs, or constructs which contain other characters that would normally cause a split into separate tokens? Let's say I want to use the English tokenizer but turn everything that looks like "somecharacters(morecharacters)evenmorecharacters" into a single token; how would I best achieve this?
Robert (Bob) Borges
@BobBorges
Thanks @shinde-rahul, it works for me through the terminal too. I guess that means it's a problem with spyder then.
Robert (Bob) Borges
@BobBorges
It was spyder :/
Rahul Shinde
@shinde-rahul
@BobBorges, version conflicts?
@NelaMoore_twitter the answer is yes, you can expose spaCy behind a service endpoint and consume it from the plugin (rough sketch below)
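A sketch of that setup, assuming a small Flask service (Flask and the /analyse endpoint are my choice, nothing spaCy- or WordPress-specific):

import spacy
from flask import Flask, jsonify, request

app = Flask(__name__)
nlp = spacy.load("en_core_web_sm")

@app.route("/analyse", methods=["POST"])
def analyse():
    # run the posted text through the pipeline and return the entities
    doc = nlp(request.json["text"])
    return jsonify([{"text": ent.text, "label": ent.label_} for ent in doc.ents])

# the WordPress plugin would then POST text to this endpoint instead of
# running Python itself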
Robert (Bob) Borges
@BobBorges
@shinde-rahul Still not really sure. While trying to verify that spyder was the culprit, I started messing with pycharm, and either that or the spyder upgrade got me an incompatible ipython version and left my pip and pip3 pointing at the wrong versions. Trying to fix that, I broke the whole installation :( Clean install until 4am; now I'm back at the point I was at for my first post. Proceeding more cautiously this time.
Robert (Bob) Borges
@BobBorges
@shinde-rahul After a clean install + virtualenv setup, the above code is working as expected w/out the spyder upgrade. I'm pretty sure it was the upgrade that screwed up my global env – live & learn. No idea what the original problem was, but it seems it had nothing to do with spaCy.
Rahul Shinde
@shinde-rahul
@BobBorges, an integration error or corrupt updates, maybe
Johann Petrak
@johann-petrak
Does anyone know how the Tokenizer in spaCy works and how the rules for splitting can be changed? The German tokenizer seems to split text like "dies(und)das" into 5 tokens, while the English tokenizer makes one token out of it. How can I change the behaviour of the German tokenizer to match the English one, just for parentheses without spaces around them, enclosed between alphabetic words?
Johann Petrak
@johann-petrak
Also, I am puzzled by the token rules that can be used in an entity ruler versus those for a matcher: in a matcher, only a subset of the attributes is allowed and they have to be given in all uppercase, while in the entity ruler all attributes can be used with the original case? But the value for a matcher can use the extended pattern syntax while for an entity ruler it can't? Does the entity ruler also allow quantifiers?
Björn Böing
@BreakBB

@johann-petrak This section explains how the tokenizer works. You can add special cases if you have some specific edge cases you want to handle, or write a custom tokenizer with your own ruleset. But most likely you just want to add to the existing ruleset some regex for your parentheses situation; see the sketch after this message.

About your second question: from what I see by looking at the code, the EntityRuler internally uses a (Phrase)Matcher and its methods, so it might be possible to use all pattern rules. If so, this definitely needs to be added to the docs
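A rough sketch of those tokenizer options for the parentheses case, assuming spaCy 2.x (the regex is only a starting point):

import re
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("de_core_news_sm")

# option 1: a special case for one known string
nlp.tokenizer.add_special_case("dies(und)das", [{ORTH: "dies(und)das"}])

# option 2: a token_match regex that keeps word(word)word together
# (this overwrites the existing token_match, so adjust with care)
nlp.tokenizer.token_match = re.compile(r"^\w+\(\w+\)\w+$").match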

Johann Petrak
@johann-petrak
@BreakBB thanks, those pointers helped a lot! I think it would be great if the spaCy API docs had pointers to where those details are documented.
I understand now that German has an infix rule that splits on the parentheses while English does not.
The Matcher vs EntityRuler details are still a mystery to me: in the documentation, the examples seem to show that for the entity ruler one can use the original lowercase versions of the token attribute names, while the matcher needs the special subset of uppercase names. So if you saw in the code that the entity ruler uses the matcher internally, there should be no such difference?
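For reference, a sketch of both pattern styles with uppercase token attributes, which as far as I can tell are accepted by both (whether lowercase attribute spellings also work in the Matcher I can't confirm):

from spacy.lang.en import English
from spacy.matcher import Matcher
from spacy.pipeline import EntityRuler

nlp = English()

# Matcher: the pattern is a list of token-attribute dicts
matcher = Matcher(nlp.vocab)
matcher.add("HELLO_WORLD", None, [{"LOWER": "hello"}, {"LOWER": "world"}])

# EntityRuler: the lowercase "label"/"pattern" keys describe the entry,
# while the token attributes inside look just like Matcher patterns
ruler = EntityRuler(nlp)
ruler.add_patterns([{"label": "GREETING",
                     "pattern": [{"LOWER": "hello"}, {"LOWER": "world"}]}])
nlp.add_pipe(ruler)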
Björn Böing
@BreakBB

@johann-petrak Good to hear that they helped :+1: As @ines stated in some issue (can't find it), the docstrings should not be too big but should include the main information and mostly a link to the API docs. In those docs you should find the most useful information or further links to other parts of the docs (e.g. usage). I just created #4064 to update the Tokenizer docs.

I haven't worked with the EntityRuler yet, so I can't speak to its real usage or its difference from the Matcher. But I think you are right, and I can't tell why there might be/is such a difference. Maybe you want to open an issue about it to get some... And you already opened #4063 ^^

Fabien B
@keschnir
Hello World! I'm a complete beginner in machine learning and the end of my internship is all about NLP. I was hoping you could share some links or resources if I describe my need.
I have a model that uses entity match rules to find specific words in my texts. Now that this pipe works like a charm, I'd like to find further information surrounding those words. My texts are real estate descriptions, which means I can find multiple surface areas in one text. For example, I could have something like "This apartment counts 3 rooms for a total surface of 50m². The living room is 20m², ideal for students. Gas heating, (etc..)." I'd like to be able to know that the total area is 50 and the living room is 20, because for now I have only listed the areas mentioned, but not what those areas relate to.
There are many other entities that could use the same method, but I don't know where and how to start, or even the name of what I'm doing ^^
fizban99
@fizban99
@keschnir If you have already identified the surface_value (e.g. "50m²") and surface_name (e.g. "living room") as entities, one simple option would be to use the sentencizer and assume that if a surface_value and a surface_name occur in the same sentence, they are related (see the sketch below). I believe what you want is to extract the relationships between named entities found by a "Named Entity Recognition" process. I understand a full implementation of this is planned for a future release of spaCy, but there are currently some workarounds, such as the simple rule method I mentioned, or training the dependencies as if they were POS tags. An intermediate option between simple rules and full relationship training could be using the POS tagging of one of the pre-trained models to get the syntactic dependencies and assume that the surface_value and the surface_name will have a syntactic relationship.
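A sketch of that sentence-cooccurrence rule, assuming the entities have already been set by an earlier component and the doc has sentence boundaries (the label names SURFACE_VALUE and SURFACE_NAME are invented):

def related_surfaces(doc):
    # pair up every surface name with every surface value in the same sentence
    pairs = []
    for sent in doc.sents:
        values = [e for e in sent.ents if e.label_ == "SURFACE_VALUE"]
        names = [e for e in sent.ents if e.label_ == "SURFACE_NAME"]
        for name in names:
            for value in values:
                pairs.append((name.text, value.text))
    return pairs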
Fabien B
@keschnir
@fizban99 thx a lot, I might have found what I need in the code examples of the official docs. This example uses the POS tree, which can be retrained for my purpose.
King Chung Huang
@kinghuang_gitlab
I'd like to use spaCy for NER, but I need to clean up the text first. One of the cleanup tasks is to expand shorthand text like vlv to valve, and to fix spelling errors. Is this something I can use spaCy for, or should I be using something else to pre-process the text before spaCy?
fizban99
@fizban99
@kinghuang_gitlab spaCy is based on a non-destructive tokenizer, so it generally assumes the text has already been cleaned before entering the spaCy pipeline. Typically you would clean up your text with regular expressions, replacing, adding and removing according to your needs (rough sketch below). If your text comes from an OCR process, it might be an iterative process until you are more or less happy with all the pre-processing rules...
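The kind of pre-processing meant here, as a sketch with a made-up shorthand table (nothing spaCy-specific happens before nlp() is called):

import re

# hypothetical shorthand-to-word table, extended as new cases turn up
SHORTHAND = {r"\bvlv\b": "valve", r"\bpmp\b": "pump"}

def clean(text):
    for pattern, replacement in SHORTHAND.items():
        text = re.sub(pattern, replacement, text)
    return text

# doc = nlp(clean(raw_text))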
King Chung Huang
@kinghuang_gitlab
@fizban99 Thanks, that helps.
jonathan-bonnaud
@jonathan-bonnaud
Hello everyone. I am using the PhraseMatcher to match job titles (Software Developer, etc). I have a big list of job titles and I give it to the PhraseMatcher this way:

title_matcher = PhraseMatcher(nlp.vocab)
prepared_titles = list(nlp.pipe([title.lower() for title in JOB_TITLES]))
title_matcher.add("JOB_TITLE", None, *prepared_titles)

The problem is that when the job title in the text from which I want to extract is in plural form, it doesn't match.
I tried to instantiate the matcher with PhraseMatcher(nlp_model.vocab, attr='LEMMA') but it doesn't work and, even worse, it matches every single word.
Does anyone have a solution? Am I doing something wrong?
fizban99
@fizban99
@jonathan-bonnaud The attr='LEMMA' option does not work for me either. I get the same results as you if I use nlp = English(), but I get no matches if I use nlp = spacy.load('en_core_web_sm'). Maybe it is a bug, or maybe the LEMMA attribute is not actually supported by the PhraseMatcher? The best workaround I can think of is adding the plurals to the patterns as well...
jonathan-bonnaud
@jonathan-bonnaud
@fizban99 Thank you for your answer! Sure, well, for now, I won't support matching of plurals then...
fizban99
@fizban99
@jonathan-bonnaud I opened an issue about this (#4100) and it seems that the problem was that you were using an empty model and the lemmas were not set. If you load one of the models, it should work. It was not working for me because I was using make_doc, and it does not set the lemmas even with a model loaded.
jonathan-bonnaud
@jonathan-bonnaud
@fizban99 Does it mean that English() is an empty model?
fizban99
@fizban99
It is empty. It provides lemmatization via lookups, tokenization and a little more. But the PhraseMatcher does not support lemmatization via lookups and expects the lemmas to be set by a statistical model; see the sketch below.
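Putting that together as a sketch: load a statistical model and build the patterns with nlp() rather than nlp.make_doc(), so the lemmas are set:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab, attr="LEMMA")
# the pattern doc comes from nlp(), not nlp.make_doc(), so it has lemmas
matcher.add("JOB_TITLE", None, nlp("software developer"))

doc = nlp("We hired two software developers last year.")
print([doc[start:end].text for _, start, end in matcher(doc)])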
jonathan-bonnaud
@jonathan-bonnaud
Ah okay. That's why! Thanks!
Girraj Jangid
@Girrajjangid
Is it possible to download the 'en_core_web_lg' model on Colab? Please point me to any source
fizban99
@fizban99
@Girrajjangid not an expert on Colab, but this seems to work for me:

!python -m spacy download en_core_web_lg
import spacy
# load the model from its install path inside the Colab runtime
nlp = spacy.load("/usr/local/lib/python3.6/dist-packages/en_core_web_lg/en_core_web_lg-2.1.0")
doc = nlp(u"displaCy uses JavaScript, SVG and CSS.")
spacy.displacy.render(doc, style='dep', jupyter=True)
Vladyslav Shcherbyna
@valdislav
Hello everybody. Does anyone know good examples of how to handle email/message closing signatures? Basically, for getting rid of such signatures, is it ok to assume that no text will exist after the signature? Is there a way to handle text which may appear after a signature (e.g. part of another email included inside one, so there are two signatures with text between them)?
Mustafa Qamar-ud-Din
@mustafa-qamaruddin
I would like to contribute to spaCy's model implementation, stuff like Brown clustering etc. How do I get started?
dugast
@chris.dugast_gitlab
Hi there: is it possible to get access to the data behind the spaCy language models and the spaCy NER models?
Mat Leonard
@mcleonard

Hi all. I'm looking at the tutorials for training a TextCategorizer model. I came across these lines:

with textcat.model.use_params(optimizer.averages):
    # evaluate on the dev data split off in load_data()
    scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)

Hoping someone can help me understand why textcat.model.use_params(optimizer.averages) is used here and what it's doing.

Damiano Porta
@damianoporta
Guys, I split the sentences of my documents by looking at \n characters, because they are semi-structured documents. The problem is when I have very short sentences. I'll give you an example. Let's suppose we have a text like "Birthdate:\n1980-01-01". I split the text into two sentences because of the \n, but the result is "birthdate:" and "1980-01-01". So the problem is: how can spaCy understand (tag with the NER model) that 1980-01-01 is a birthdate? I have added a custom label called "BIRTHDATE" but, without context, it would be very hard to get good accuracy. If I pass the entire text to train the model, the result will be better, but I have always read that I should segment the documents into sentences to: 1. speed up the training, 2. spend less time annotating.
manish maharjan
@mmanishh
Why is spaCy using a residual CNN instead of an RNN for NER?