Python-based OCR package using recurrent neural networks.
I have the chance to know the character font I want to recognize with OCRopus,
so I'm generating the ground truth data myself.
I'm looking for a way to achieve excellent learning, since I can generate as much data as I want.
Is the font size I'm using for the ground truth data important?
I'm considering size 18.
I'm also struggling with segmentation; my ground truth data is designed to produce one word per image.
First problem: the segmentation process seems to ignore short words.
By the way, I don't understand the meaning of the blue-colored words in the above image.
As a result of the segmentation I get pretty much all the words, each one in its own image, but I still have some exceptions like these:
one line = 2 words
That's my second problem.
Can anyone help, please?
You should make sure that you are working with 300 dpi images. Moreover, for training you should not only take lines with single words; you should work with lines as similar as you expect them to occur in a text. I.e., usually one works with a sample text (full sentences, paragraphs) and generates its lines. For that you can use the command ocropus-linegen. You only need a suitable font and a text file for that (see the example in the README). This creates the lines directly and does not require running the page segmentation first.
Did anyone try to embed OCRopus in an Android application?
@zuphilip I'm now following your advice and using ocropus-linegen with my desired font, thank you for the info!
@zuphilip How much data is enough? I see ocropus-linegen has a maxlines default of only 200, while I can generate many more. I've set maxlines to 100,000; I don't know if this will help...
For my Gothic print model https://github.com/jze/ocropus-model_fraktur I used 3,400 lines (164,000 characters) from different books. However, for an initial model 300 lines might be sufficient. For a robust model you should not rely only on generated text but also train with real-world scan images.
Does anyone know the solution to this problem: tmbdev/ocropy#158?
I have an image of a newspaper with 4 columns. Column segmentation is correct, but the lines are not in serial order (left to right, column by column). It seems to jump from one column to another and back after the first column. Any idea what could be causing this?
@urhub: There is some complicated logic in the library; it does not necessarily know which reading order of the document is wanted. (Ocropy might not even know about columns...) You might need to figure out the xy-coordinates and apply a simple column-line-grouping algorithm.
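For example, here is a rough sketch of such a column-line grouping. The box format `(x0, y0, x1, y1)`, the page width, and the column count are all made up for illustration; ocropy's actual line-box output may look different:

```python
# Hedged sketch: reorder detected line bounding boxes into
# column-by-column reading order. Boxes are (x0, y0, x1, y1);
# page width and column count are illustrative assumptions.

def reading_order(boxes, page_width, n_columns):
    """Assign each box to a column by its horizontal center,
    then sort within each column top-to-bottom."""
    col_width = page_width / n_columns

    def key(box):
        x0, y0, x1, y1 = box
        column = int(((x0 + x1) / 2) // col_width)
        return (column, y0)

    return sorted(boxes, key=key)

# Example: two columns on an 800 px wide page, lines listed out of order.
lines = [(420, 100, 780, 130),   # column 2, first line
         (20, 300, 380, 330),    # column 1, third line
         (20, 100, 380, 130),    # column 1, first line
         (420, 200, 780, 230),   # column 2, second line
         (20, 200, 380, 230)]    # column 1, second line
ordered = reading_order(lines, page_width=800, n_columns=2)
```

This assumes columns are roughly equal-width and vertical; for skewed or ragged layouts you would need to cluster the x-coordinates instead of dividing the page evenly.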
In lstm.py a normalization of Unicode characters is performed. Unfortunately the NFKC normalization changes LATIN SMALL LETTER LONG S into an ordinary s. That is a problem when dealing with old German texts. Is it really necessary to use NFKC normalization, or would it be sufficient to use NFC normalization?
Hello! My guess is that it should work, but it would need training a new model. Did you try training using unicodedata.normalize('NFC', s)? Also, just out of interest: why is LATIN SMALL LETTER LONG S -> s a problem? Is it that you want it to be extracted as LATIN SMALL LETTER LONG S? Or does it get confused with f? Or something else?
@e1000 Yes, I am currently running a model training with the NFC normalization. How to encode the s is often discussed; I discussed it two weeks ago at an OCR workshop in Berlin. So far, I thought you should encode it as s. However, you can always replace the ſ with an s in the ground truth or the resulting text if you don't want it, whereas after the normalization it is impossible to go back. Therefore, it seems to be a good idea to include the ſ in the model. It is also necessary for comparing with Tesseract 4.0 results, which include the ſ.
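For reference, the difference between the two normalization forms on ſ can be checked directly with Python's unicodedata; the sample word is just an illustration:

```python
import unicodedata

long_s = "\u017F"            # LATIN SMALL LETTER LONG S (ſ)
word = "Wei" + long_s + "e"  # "Weiſe", an old German spelling

# NFKC applies compatibility mappings and folds ſ into s ...
nfkc = unicodedata.normalize("NFKC", word)   # -> "Weise"

# ... while NFC only composes characters and leaves ſ intact.
nfc = unicodedata.normalize("NFC", word)     # -> "Weiſe"

# If ſ is unwanted, it can still be replaced afterwards; the
# reverse (recovering ſ after NFKC folding) is not possible.
folded = nfc.replace(long_s, "s")            # -> "Weise"
```

This shows why NFC preserves the information needed for old German texts: the lossy step can always be applied later, but never undone.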
@Azrael1 Yes, I have seen it pause before. Each time, pressing Enter on the keyboard causes it to resume. This happens only during an interactive session.
Hello everyone, I am thinking of working on documentation for Ocropy as part of a class project. My plan is to collate everything that exists into one document, to begin with. Let me know if such a thing has already been done and would be a repeat. Any comments or suggestions?
@urhub We started to document some things in the wiki: https://github.com/tmbdev/ocropy/wiki . The wiki is open for everyone to contribute. Moreover, there are some links there to blog posts explaining some of the ocropus workflows.
@zuphilip Thanks for that information. I will work with the wiki then. I will open a question issue to gather input on what users want to see in the wiki.
@zuphilip I also want to use the methods learned in class, which are Zotero and LibreOffice. I am not sure how that connects with the wiki. I will produce a LibreOffice document first and then look into updating the wiki. I will use input from #256 to create the contents. Hope that is okay.
I want to add a new language that has a different script from Latin. Could you tell me how I can do that?
Has anyone tested ocropy2?
I have not, but I hope to find some time soon to check out https://github.com/NVlabs/ocropus3 , which bundles all the repos tmbdev developed for his next-gen OCR system, as well as the slides from his DAS 2018 workshop.
Can anyone suggest how to do OCR using deep learning or RCNN?
Hi everyone, I am new to ocropus and was wondering why the segmentation of my image detects 228 lines but only writes out 189, when I have already set --noise 0.
What does the document look like, and which lines are missing? That could be a clue.
Hi all, I was wondering what the text extraction / blob detection method used by ocropus is called (the compute_boxmap function). The result from that function looks like blob detection, but the implementation doesn't look like MSER or the other common algorithms.
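For comparison, the kind of result you describe resembles plain connected-component labeling on the binarized image. Here is a minimal pure-Python sketch of that idea; this is only an illustration, not ocropy's actual compute_boxmap implementation:

```python
from collections import deque

def blob_boxes(binary):
    """Label 4-connected foreground components in a binary image
    (list of lists of 0/1) and return one bounding box per blob
    as (min_row, min_col, max_row, max_col)."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if binary[r][c] and not seen[r][c]:
                # Breadth-first flood fill over this component,
                # tracking the extremes of its bounding box.
                queue = deque([(r, c)])
                seen[r][c] = True
                r0 = r1 = r
                c0 = c1 = c
                while queue:
                    y, x = queue.popleft()
                    r0, r1 = min(r0, y), max(r1, y)
                    c0, c1 = min(c0, x), max(c1, x)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                boxes.append((r0, c0, r1, c1))
    return boxes

# Two separate blobs in a tiny 4x6 image.
image = [[1, 1, 0, 0, 0, 0],
         [1, 1, 0, 0, 1, 1],
         [0, 0, 0, 0, 1, 1],
         [0, 0, 0, 0, 0, 0]]
boxes = blob_boxes(image)
```

In practice a library such as scipy.ndimage does the same labeling much faster; the sketch is just to show why the output can look like generic blob detection without involving MSER.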