Python-based OCR package using recurrent neural networks.
@zuphilip Now I'm following your advice and using ocropus-linegen with my desired font. Thank you for the info!
@zuphilip How much data is enough? I see ocropus-linegen has a maxlines default of only 200, while I could generate many more. I've set maxlines to 100,000; I don't know if this will help...
For my Gothic print model https://github.com/jze/ocropus-model_fraktur I used 3,400 lines (164,000 characters) from different books. However, for an initial model 300 lines might be sufficient. For a robust model you should not rely only on generated text but also train with real-world scan images.
Does anyone know the solution to this problem: tmbdev/ocropy#158?
I have image of a newspaper with 4 columns. Column segmentation is correct, but lines are not in serial order, left to right, column by column. It seems to jump from one column to another and back, after the first column. Any idea what could be causing this?
@urhub: There is some complicated logic in the library; it does not necessarily know what reading order of the document is wanted. (Ocropy might not even be aware of columns...) You might need to look at the xy-coordinates of the detected lines and apply a simple column-line-grouping algorithm yourself.
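A minimal sketch of such a grouping, assuming each detected line is available as an (x0, y0, x1, y1) bounding box with the origin at the top left; the function name and the gap threshold are hypothetical, not part of ocropy:

```python
def sort_lines_by_column(boxes, column_gap=50):
    """Return line boxes in reading order: left column first, top to bottom.

    boxes: list of (x0, y0, x1, y1) tuples.
    column_gap: minimum jump in the left edge (in pixels) that starts a new column.
    """
    # Sort by left edge so columns appear in left-to-right order.
    boxes = sorted(boxes, key=lambda b: b[0])
    columns = []
    for box in boxes:
        # Start a new column when the left edge jumps past the gap threshold.
        if columns and box[0] - columns[-1][-1][0] < column_gap:
            columns[-1].append(box)
        else:
            columns.append([box])
    # Within each column, order lines by their top edge (y0).
    ordered = []
    for col in columns:
        ordered.extend(sorted(col, key=lambda b: b[1]))
    return ordered
```

Real newspaper layouts may need a more robust clustering (e.g. on column centers), but for a clean 4-column scan a fixed threshold like this is often enough.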
In lstm.py a normalization of Unicode characters is performed. Unfortunately the NFKC normalization changes the LATIN SMALL LETTER LONG S into an ordinary s. That is a problem when dealing with old German texts. Is it really necessary to use NFKC normalization, or would it be sufficient to use NFC normalization?
Hello! My guess is it should work, but it would need training a new model. Did you try training using unicodedata.normalize('NFC', s)? Also, just out of interest: why is the mapping LATIN SMALL LETTER LONG S -> s a problem? Is it that you want it to be extracted as LATIN SMALL LETTER LONG S? Or does it get confused with f? Or something else?
@e1000 Yes, I am currently running a model training with the NFC normalization. How to encode the long s is often discussed; I discussed it two weeks ago in Berlin at an OCR workshop. So far, I thought you should encode it as s. However, you can always replace the ſ with an s in the ground truth or in the resulting text if you don't want it, whereas after the normalization it is impossible to go back. Therefore, it seems to be a good idea to include the ſ in the model. It is also necessary for comparing with Tesseract 4.0 results, which include the ſ.
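For reference, the difference between the two normalization forms is easy to verify with Python's standard unicodedata module:

```python
import unicodedata

long_s = "\u017F"  # LATIN SMALL LETTER LONG S (ſ)

# NFKC applies compatibility mappings: it folds ſ into a plain s,
# which irreversibly loses the distinction needed for old German texts.
print(unicodedata.normalize("NFKC", long_s))  # -> "s"

# NFC only composes canonical sequences and leaves ſ untouched.
print(unicodedata.normalize("NFC", long_s))   # -> "ſ"
```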
@Azrael1 Yes, I have seen it pause before. Every time, pressing Enter on the keyboard causes it to resume. This happens only during an interactive session.
Hello everyone, I am thinking of working on documentation for Ocropy as part of a class project. My plan is to collate everything that exists into one document, to begin with. Let me know if such a thing has been done already and would be a repeat. Any comments or suggestions?
@urhub We started to document some things in the wiki: https://github.com/tmbdev/ocropy/wiki . The wiki is open for everyone to contribute. Moreover, there are some links there to blog posts explaining some of the OCRopus workflows.
@zuphilip Thanks for that information. I will work with the wiki then. I will open a question issue to gather input on what users want to see in the wiki.
@zuphilip I also want to use the methods learnt in class, which means using Zotero and LibreOffice. I am not sure how that connects with the wiki. I will produce a LibreOffice document first and then look into updating the wiki. I will use input from #256 to create the contents. Hope that is okay.
I want to add a new language that has a different script from Latin. Could you tell me how I can do that?
Has anyone tested ocropy2?
I have not, but I hope to find some time soon to check out https://github.com/NVlabs/ocropus3 , which bundles all the repos tmbdev developed for his next-gen OCR system, as well as the slides from his DAS 2018 workshop.
Can anyone suggest how to do OCR using deep learning or an RCNN?
Hi everyone, I am new to ocropus and was wondering why the segmentation of my image detects 228 lines but only writes out 189, when I have already set --noise 0.
What does the document look like and what lines are missing? That could be a clue.
Hi all, I was wondering what the text extraction / blob detection method used by ocropus is called (the compute_boxmap function). The result from that function looks like blob detection, but the implementation doesn't look like MSER or the other common algorithms.
Can ocropy be installed/built on OSX Catalina?
I'm pretty certain you can run ocropy on OSX; it's just Python. However, I would strongly recommend using the OCR-D toolchain, because we have a much improved version of ocropy that is Python 3 compatible, supports PAGE-XML, and has all kinds of fixes.
@kba Just wondering, when you say a "much improved version of ocropy": is it just a Python 3 compatible version with some fixes for PAGE-XML, or has the core functionality of the tool changed in any way? I.e., did the method used by any part of the tool change?