    Bia10
    @Bia10
    hi there, how do I stop the spotter from recognizing the [Email/Url] capture tag entities, or omit it from doing so?
    theolivenbaum
    @theolivenbaum
    Hi @Bia10 - this is added by the tokenizer (I know, a bit weird :D), because it uses a check for emails or URLs to avoid tokenizing them - and as that came for free, the check was added by default: https://github.com/curiosity-ai/catalyst/blob/master/Catalyst/src/Models/Base/FastTokenizer.cs#L322
    I'll push an option to make it possible to disable it
    here: https://github.com/curiosity-ai/catalyst/blob/master/Catalyst/src/Models/Base/FastTokenizer.cs#L15
    You can set FastTokenizer.DisableEmailOrURLCapture = true globally to disable it now
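    As a minimal sketch of using that flag (the exact namespace is an assumption - in the repo, FastTokenizer lives under Catalyst/src/Models - and it presumably must be set before any documents are tokenized):

```csharp
using Catalyst;
using Catalyst.Models; // assumed namespace for FastTokenizer

// Globally disable the built-in [Email]/[Url] capture before building the
// pipeline, so the spotter never sees those pre-captured entities.
FastTokenizer.DisableEmailOrURLCapture = true;

var nlp = Pipeline.TokenizerFor(Language.English);
// ... add spotters and process documents as usual
```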
    Bia10
    @Bia10
    ty, btw I've got another question
    so I was parsing my PDF using iText5 and everything goes smoothly
    however doing that in iText7 fails with massive memory accumulation...
    the code for iText5 PDF to string:
        private static string TextFromPDF(string path)
        {
            using (var reader = new PdfReader(path))
            {
                var text = new StringBuilder();
    
                for (var i = 1; i <= reader.NumberOfPages; i++)
                {
                    text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
                }
    
                return text.ToString();
            }
        }
    iText7 code
        private static string TextFromPDF(string path)
        {
            using (var pdfReader = new PdfReader(path))
            using (var pdfDocument = new PdfDocument(pdfReader)) {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                var text = new StringBuilder();
    
                for (var i = 1; i <= pdfDocument.GetNumberOfPages(); i++) {
                    text.Append(PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy));
                }
    
                return text.ToString();
            }
        }
    Bia10
    @Bia10
    before catalyst starts processing...
    [screenshots: memory usage before processing]
    after
    [screenshot: memory usage after processing]
    this continues until I am out of memory
    this doesn't happen when iText5 is used for parsing the PDF
    there memory doesn't go over 600 MB max; here it breezes through 6 GB...
    I suspect it has something to do with the extraction strategy
    the target book I am parsing is in English and heavy with mathematics
    Bia10
    @Bia10
    basically I don't understand why catalyst explodes in memory when using iText7, while it works swiftly and correctly with iText5
    theolivenbaum
    @theolivenbaum
    we try to process all the text lazily - i.e. using IEnumerable<Document> on the call to pipeline.Process(...)
    what I imagine is happening is that this is not working in your case - i.e. you're first creating all documents, then parsing them all, then post-processing them. Can you share the code where you're calling the pipeline?
    Bia10
    @Bia10
    sure
    it's a simple program for now
        private static void Main()
        {
            const string path = @"C:\test";
            var booksText = Extract.TextFromMultiplePDFs(path);
            //Console.WriteLine(booksText.First());
            var generatedDocs = booksText.Select(text => new Document(text, Language.English));
            var spotter = SpotterFromKeyWordSet(KeyWordSet.categoryTheory);
            var processedDocs = SpotterProcessMultipleDocs(spotter, generatedDocs);
    
            foreach (var doc in processedDocs) {
                Print.DocEntities(doc);
            }
    
            Console.ReadLine();
        }
    
        private static Spotter SpotterFromKeyWordSet(IReadOnlyCollection<KeyWordSet> keyWordSets)
        {
            var tag = keyWordSets.First().Tag;
            var captureTag = keyWordSets.First().CaptureTag;
    
            //Perform entity recognition on a gazetteer-like model.
            var spotter = new Spotter(Language.Any, 0, tag, captureTag) {
                Data = { IgnoreCase = true } //In some cases, it might be better to set it to false, and only add upper/lower-case exceptions as required
            };
    
            foreach (var set in keyWordSets) {
                spotter.AddEntry(set.Keyword);
            }
    
            return spotter;
        }
    
        private static IEnumerable<IDocument> SpotterProcessMultipleDocs(IProcess spotter, IEnumerable<IDocument> docsToProcess)
        {
            var nlp = Pipeline.TokenizerFor(Language.English);
            nlp.Add(spotter); //When adding a spotter model, the model propagates any exceptions on tokenization to the pipeline's tokenizer
    
            return nlp.Process(docsToProcess);
        }
    theolivenbaum
    @theolivenbaum
    you could rewrite it like this and give it a try to see if it improves; this should also make better use of the pipeline due to how it implements multithreading:
        var nlp = Pipeline.TokenizerFor(Language.English);
        nlp.Add(spotter); //When adding a spotter model, the model propagates any exceptions on tokenization to the pipeline's tokenizer

        foreach (var doc in nlp.Process(generatedDocs))
        {
            Print.DocEntities(doc);
        }
    (this also avoids creating a pipeline for each doc, which is way more expensive than the actual parsing :) )
    Bia10
    @Bia10
    doesn't seem to help much
    at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
    at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
    at System.Threading.Tasks.Parallel.ForWorker<TLocal>
    at System.Threading.Tasks.Parallel.ForEachWorker<TSource,TLocal>
    at System.Threading.Tasks.Parallel.ForEach<TSource>
    at Catalyst.Pipeline.<Process>d__36.MoveNext()
    at Makise.Program.SpotterProcessMultipleDocs(IProcess spotter, IEnumerable`1 docsToProcess) in C:\Users\Bia\source\repos\Makise\Program.cs:line 46
    at Makise.Program.Main() in C:\Users\Bia\source\repos\Makise\Program.cs:line 20
    Bia10
    @Bia10
    with the iText7 simple parsing strategy I always get OutOfMemoryException: Array dimensions exceeded supported range. even though I still have a few GB of memory free...
    [screenshot: OutOfMemoryException]
    is there some kind of text the tokenizer won't parse well?
    Bia10
    @Bia10
    what I am missing from catalyst at this stage is some nice way to see what's going on in the pipeline/tokenizer while they are processing.
    theolivenbaum
    @theolivenbaum
    good point... you can make it single-threaded and do one doc at a time by calling nlp.ProcessSingle(doc); it should give you a better overview of the problem
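    A sketch of that single-threaded variant, reusing the names from the program above (assuming ProcessSingle returns the processed document; if it mutates the document in place instead, the loop body stays the same shape):

```csharp
var nlp = Pipeline.TokenizerFor(Language.English);
nlp.Add(spotter);

// One document at a time, no Parallel.ForEach involved - if one book's
// text blows up, the failing document is immediately obvious.
foreach (var text in booksText)
{
    var doc = nlp.ProcessSingle(new Document(text, Language.English));
    Print.DocEntities(doc);
}
```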
    theolivenbaum
    @theolivenbaum
    the only thing I could imagine for the OOM is that you are trying to tokenize some really large text, and that is hitting some limits...
    2 things you could check:
    what is the length of the string you're trying to process that fails
    if you're running your code as a 32-bit or 64-bit process
    If you could share with me the text that causes it to fail, I'm happy to check here what it could be... either PM me or send by mail: rafael@curiosity.ai
    Bia10
    @Bia10
    okay
    after a comparative run on both iText7 and iText5
    here's the result
    [screenshot: comparison of extracted text]
    as I thought before, the PDF is not properly parsed by iText7 with the simple parsing strategy
    theolivenbaum
    @theolivenbaum
    hmm interesting... I don't see why an 80MB string would fail with OOM, but it is definitely a lot of text to parse at once :)
    Bia10
    @Bia10
    this is quite sad... guess I'll have to try either a custom HorizontalParsingStrategy or some custom LocationTextExtractionStrategy
    theolivenbaum
    @theolivenbaum
    you could try splitting it by page instead :)
    Bia10
    @Bia10
    the SimpleTextExtractionStrategy utterly fails on mathematical objects like subscripts/superscripts
    theolivenbaum
    @theolivenbaum
    i.e. change your TextFromPDF to return IEnumerable<string>, and in your loop replace text.Append(PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy)); with yield return PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy);
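    Applied to the iText7 snippet from earlier in the thread, that suggestion could look like this (a sketch - the strategy handling is kept exactly as in the original code):

```csharp
private static IEnumerable<string> TextFromPDF(string path)
{
    using (var pdfReader = new PdfReader(path))
    using (var pdfDocument = new PdfDocument(pdfReader))
    {
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();

        for (var i = 1; i <= pdfDocument.GetNumberOfPages(); i++)
        {
            // Yield each page lazily instead of accumulating a StringBuilder,
            // so the caller can stream one page at a time through the pipeline.
            yield return PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy);
        }
    }
}
```

    The caller would then build one Document per page rather than one per book, keeping any single tokenization call small.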
    Bia10
    @Bia10
    it's not a problem of catalyst per se
    the iText7 pdf parsing works in weird ways
    Bia10
    @Bia10
    found the problem
    iText7 somehow duplicates the previous page's text onto the next one o.O...
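    That duplication matches how SimpleTextExtractionStrategy behaves in iText7: the strategy object accumulates extracted text internally, so reusing one instance across pages (as in the iText7 snippet above) makes each GetTextFromPage call return everything extracted so far. A likely fix is to create a fresh strategy per page - a sketch of the changed loop:

```csharp
for (var i = 1; i <= pdfDocument.GetNumberOfPages(); i++)
{
    // A new strategy per page: reusing one instance accumulates the text
    // of all previous pages into every subsequent GetTextFromPage result.
    var strategy = new SimpleTextExtractionStrategy();
    text.Append(PdfTextExtractor.GetTextFromPage(pdfDocument.GetPage(i), strategy));
}
```

    This would also explain the memory blow-up: with a shared strategy the total extracted text grows quadratically with the page count, which iText5's page-based API avoided.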