    theolivenbaum
    @theolivenbaum
    hmm interesting... don't see why an 80MB file would fail with OOM, but it is definitely a lot of text to parse at once :)
    Bia10
    @Bia10
    this is quite sad... guess I'll have to try either a custom HorizontalParsingStrategy or some custom LocationTextExtractionStrategy
    theolivenbaum
    @theolivenbaum
    you could try splitting it by page instead :)
    Bia10
    @Bia10
    the simpleParsingStrategy utterly fails on mathematical objects like subscripts/superscripts
    theolivenbaum
    @theolivenbaum
    i.e. change your TextFromPDF to return IEnumerable<string>, and in your loop replace text.Append(PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy)); with yield return PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy);
    Bia10
    @Bia10
    its not a problem of catalyst per se
    the iText7 pdf parsing works in weird ways
    Bia10
    @Bia10
    found the problem
    iText7 somehow duplicates the previous page's text onto the next one o.O....
    causing the length of the pages to bloat
    theolivenbaum
    @theolivenbaum
    ahh :D
    that explains it!
    JBastawrose
    @JBastawrose
    hello folks, newb here.. trying to get catalyst up and running for a project but the link to the API docs seems to be broken. Basically, I would just like to load a few thousand questions and use this tool to return the most similar questions to what the user has entered. Any advice?
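As a library-independent baseline for the "most similar questions" use case above, here is a minimal sketch that ranks stored questions by bag-of-words cosine similarity. It does not use Catalyst. The names `tokenize`, `cosine` and `most_similar` are hypothetical helpers made up for this illustration, and plain word overlap will miss synonyms that an embedding-based approach would catch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query, questions, top_n=3):
    """Return up to top_n stored questions sharing words with the query, best first."""
    query_vec = Counter(tokenize(query))
    scored = [(cosine(query_vec, Counter(tokenize(q))), q) for q in questions]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [q for score, q in scored[:top_n] if score > 0]
```

For a few thousand questions, scoring every candidate on each query like this should be fast enough. Precomputing the `Counter` per stored question would be the first optimisation to try.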
    theolivenbaum
    @theolivenbaum
    Hi @JBastawrose - could you share your code so far? Happy to help you with some directions
    Иван Сердюк
    @oceanfish81_twitter
    A reminder that your project will be presented at https://msstage.com/ , http://www.dotnetfest.com/indexe.html
    Daniel Ellison
    @zigguratt
    @theolivenbaum As @JBastawrose said, the link to the API docs is broken. Does the documentation exist somewhere else? I'm interested in using Catalyst so that I can avoid using IKVM in order to get at OpenNLP in C#.
    Daniel Ellison
    @zigguratt
    The Getting Started link is also broken. I think the main problem is that catalyst.curiosity.ai doesn't exist anymore.
    Siderite
    @Siderite
    Hello!
    Can you guys help me out with using the Catalyst library to detect meaning in text commands?
    I am not versed in NLP, but I would like to use your library to extract meaning from chat lines for a bot. Consider something like a text based adventure game where I want to understand that "throw goblin out the window" and "defenestrate goblin" mean the same thing. I processed the phrases, but all I get is a bunch of Tokens. I can't find an easy way to split a text into separate sentences, or to get the relationships between tokens in something like "unlock the blue door with the red key", so that I clearly understand that the door is blue and the key is red.
    Daniel Ellison
    @zigguratt
    @theolivenbaum Any progress on finding the links to Catalyst documentation?
    Daniel Ellison
    @zigguratt
    I'm currently working with Apache's Java-based OpenNLP via IKVM but since I started using the opennlp.tools.parser, load times have skyrocketed. I would think that a native C# NLP library has to be faster.
    Daniel Ellison
    @zigguratt
    I'm guessing that Curiosity has little interest in its free Catalyst library anymore. I guess I'll just have to put up with a Java-based solution in OpenNLP. It's too bad: I would love to have been able to use a C#-native library to avoid so much overhead.
    theolivenbaum
    @theolivenbaum
    hi @zigguratt - unfortunately not, but the samples on how to use it are available on the github repository - is there anything specific that you're trying to achieve that is not covered there?
    I can imagine the pain of using OpenNLP, I tried it many years ago and the dependency on IKVM was a hard one to handle :(
    If you would like to contribute to the project, happy to add you to the repository as well - on our side we are active users of Catalyst, and welcome any improvements there!
    Daniel Ellison
    @zigguratt

    Thank you for your response, @theolivenbaum. I need to parse user input into a syntax parse tree. For example, OpenNLP parses "open one of my eyes" into

    (VP (VB open) (NP (NP (CD one)) (PP (IN of) (NP (PRP$ my) (NNS eyes)))))

    I don't see that type of thing in the examples. I'm relatively new to NLP (though I have decades of coding experience) so an API reference would be very helpful.

    I'm realizing that standard NLP doesn't handle commands very well, often seeing them as noun phrases instead of verb phrases. As a quick example, if I parse an imperative command like "open door", OpenNLP gives me (NP (JJ open) (NN door)). In other words, it sees the phrase as "an open door" instead of "open the door". I want it to parse as (VP (VB open) (NP (NN door))). Is this something Catalyst can do? Perhaps by training?

    Please excuse any misuse or misunderstanding of terms. It's all new terminology to me. I would appreciate any help or guidance here.

    Daniel Ellison
    @zigguratt

    After a lot of research I stumbled on someone with the same problem. They were advised to "hack" OpenNLP by adding a pronoun like "they" before the command to force the parser to see the input as a verb phrase. So I would give the parser "they open door" and get back

    (S (NP (PRP they)) (VP (VBP open) (NP (NN door))))

    at which point I can just extract the verb phrase.

    I assume this would work for Catalyst as well. I would still love to use it instead of IKVM/OpenNLP.
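The "extract the verb phrase" step described above can be sketched without any NLP library, because it only needs to read the bracketed string the parser returns. This is a minimal sketch assuming Penn-Treebank-style output like the examples quoted in this message. `parse_tree`, `find_first`, `to_brackets` and `verb_phrase` are hypothetical helper names, not part of OpenNLP or Catalyst.

```python
def parse_tree(text):
    """Parse a bracketed parse string into nested lists: [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        node = [tokens[pos]]  # first item after '(' is the label, e.g. S, VP, NN
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())       # nested constituent
            else:
                node.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1  # consume ')'
        return node

    return read()


def find_first(node, label):
    """Depth-first search for the first constituent with the given label."""
    if node[0] == label:
        return node
    for child in node[1:]:
        if isinstance(child, list):
            found = find_first(child, label)
            if found is not None:
                return found
    return None


def to_brackets(node):
    """Serialise a nested-list tree back into bracketed form."""
    parts = [c if isinstance(c, str) else to_brackets(c) for c in node[1:]]
    return "(" + " ".join([node[0]] + parts) + ")"


def verb_phrase(parse_output):
    """Return the first VP of a bracketed parse as a string, or None if there is none."""
    vp = find_first(parse_tree(parse_output), "VP")
    return None if vp is None else to_brackets(vp)
```

Note that with the pronoun prefix the verb is tagged VBP rather than VB, as in the output quoted above, so downstream code that matches on the tag would need to accept both.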

    theolivenbaum
    @theolivenbaum
    Hi @zigguratt! Unfortunately there is still some work to be done on the dependency parser code before it is ready...
    An alternative for you for now would be to call spacy on python using the awesome pythonnet integration: https://github.com/pythonnet/pythonnet
    We use it for hosting python code on our platform, and it works quite nicely - of course you still have some limitations, but IMO it is a much better solution than the extremely outdated OpenNLP package
    Daniel Ellison
    @zigguratt
    Thanks, @theolivenbaum. I'm getting up to speed on spaCy now. My execution context is Unity3D. I've already installed IronPython into a Unity project and am now figuring out how to install spaCy so that it can be accessed from within Unity. Just a quick question since the answer isn't immediately obvious: would I write my parsing code in Python? With OpenNLP I can use C# to call the OpenNLP libraries. I don't have a problem coding in Python; I just need to know what needs to be done.
    theolivenbaum
    @theolivenbaum
    To be honest if you're running this in unity, I think the best approach would be to split the code and use some network interface between the two parts. You should be able to code mostly in CSharp with just some glue code to call the spacy in python and convert the data - I wonder if it is even possible to automatically wrap a spacy doc in a catalyst doc - might prototype something for that effect...
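A minimal sketch of the Python side of such a split, assuming a local HTTP endpoint that accepts JSON and returns token records. Everything here is an illustrative assumption: the port, the JSON shape, and the names `whitespace_parse`, `spacy_parse`, `handle_request`, `ParseHandler` and `serve` are hypothetical. The default parser is a whitespace placeholder so the sketch runs without spaCy installed. `spacy_parse` shows where a loaded spaCy pipeline would plug in.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def whitespace_parse(text):
    """Placeholder parser: one record per whitespace-separated token."""
    return [{"text": tok, "index": i} for i, tok in enumerate(text.split())]


def spacy_parse(text, nlp):
    """Token records from a loaded spaCy pipeline, e.g. nlp = spacy.load("en_core_web_sm")."""
    return [
        {"text": t.text, "index": t.i, "pos": t.pos_, "dep": t.dep_, "head": t.head.i}
        for t in nlp(text)
    ]


def handle_request(body, parse=whitespace_parse):
    """Turn a JSON request body {"text": ...} into a JSON response {"tokens": [...]}."""
    payload = json.loads(body.decode("utf-8"))
    return json.dumps({"tokens": parse(payload["text"])}).encode("utf-8")


class ParseHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        response = handle_request(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(port=8765):
    """Block forever, answering parse requests on localhost only."""
    HTTPServer(("127.0.0.1", port), ParseHandler).serve_forever()
```

Calling `serve()` starts the endpoint. The C# side would then POST `{"text": "open door"}` and map the returned records onto its own types. Binding to 127.0.0.1 keeps the service off the network, but it is still a second process to ship alongside the game.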
    Daniel Ellison
    @zigguratt
    That does sound cleaner, and certainly has its attractions. I considered that when starting out. My only concern would be in packaging things up for distribution. A single executable would be far better than having multiple moving parts. I'll have to investigate the packaging options for Unity. If I can pull it off while keeping installation dead simple I'll definitely look into it. If it's not obvious, not only am I new to NLP, I'm also just starting out with Unity and, in fact, C#. I do have decades of professional software development experience however, so I'm not a total newbie.
    theolivenbaum
    @theolivenbaum
    yeah i think just that unity has some restrictions of what it supports - but i might be mistaken there - worth testing it :)
    btw the link I sent you is the wrong one: this is what you need: https://github.com/henon/Python.Included
    or this if you would rather manage the python installation locally: https://github.com/henon/pythonnet_netstandard
    Daniel Ellison
    @zigguratt
    Thanks for the links! I'll definitely look into it. And an extra thank you because this now has nothing to do with Catalyst. :)
    theolivenbaum
    @theolivenbaum
    just wrote a wrapper for spacy under catalyst, works like a charm :)
    will publish a nuget package over the weekend for it with some sample code for how to use it
    Daniel Ellison
    @zigguratt
    Nice!
    theolivenbaum
    @theolivenbaum
    There's a sample project in the repository on github... Very much wip, need to connect better with some of the data from spacy, and expose more functionality from it. If you've any suggestions feel free to send a PR!
    Daniel Ellison
    @zigguratt
    Thanks for that! I took a look at Python.Included. It's pretty fantastic but only works on Windows right now. Python.NET covers Windows, Mac, and Linux, but would require the user to have Python installed. These look like fantastic solutions for the future, but for my particular project, to use any Python-based solution I would have to set up a public server that can be queried by the Unity game engine at runtime. I'm not eager to do that, however, so it looks like I'm stuck using IKVM/OpenNLP for the time being. Ah well. It works, anyway. Thank you for your help, @theolivenbaum !
    Roberto
    @GalawynRM
    Good evening, I need some help to write my first pattern
    I would like to recognize patterns of this type: C01 or C01 col. 2, i.e. one letter + 1 or 2 numeric digits, then an optional token (col or col.) and 1 or 2 numeric digits
    mp => mp.Add(
    new PatternUnit(P.Single().WithTokens(quadriTokens)),
    new PatternUnit(P.Single().HasNumeric())
    )
    this one is not working (quadriTokens contains a list of acceptable letters A->H)
    and of course, the one I would like to write is
    mp => mp.Add(
    new PatternUnit(P.Single().WithTokens(quadriTokens)),
    new PatternUnit(P.Single().WithLength(1, 2).HasNumeric()),
    new PatternUnit(P.SingleOptional().WithTokens(colTokens)),
    new PatternUnit(P.SingleOptional().WithLength(1, 2).HasNumeric())
    )
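To pin down what the pattern above is meant to match, independent of how Catalyst tokenizes the text, here is a plain regular-expression sketch of the same rule: one letter A to H, 1 or 2 digits, then optionally "col" or "col." followed by 1 or 2 digits. It is not Catalyst's pattern API. `REF_PATTERN` and `find_refs` are hypothetical names for this illustration. It can serve as a reference for the expected matches while debugging the PatternUnit version.

```python
import re

# letter A-H, 1-2 digits, then optionally "col" or "col." and 1-2 more digits
REF_PATTERN = re.compile(r"\b([A-H])(\d{1,2})(?:\s+col\.?\s*(\d{1,2}))?\b")


def find_refs(text):
    """Return (letter, number, column) tuples; column is None when absent."""
    return [match.groups() for match in REF_PATTERN.finditer(text)]
```

For example, `find_refs("C01 col. 2")` yields one tuple with the column filled in, while `find_refs("C01")` yields the same letter and number with `None` as the column. Whether the tokenizer splits "C01" into a letter token and a number token is worth checking, since a pattern written as two separate units would only match if it does.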