Hi, is there a way to run ocropy on Windows other than with docker-ocropy?
@shashank_hzl_twitter It seems possible to install Python and the required modules on Windows, such that ocropy can be run there as well; see tmbdev/ocropy#73 . However, I have no experience with this, and I guess that using the Docker images on Windows might be much easier.
Just for completeness, the problem with ocropus-rtrain is fixed now. Thank you @jze for the report and also for your interesting blog post!
I am looking for additional ground truth data for my general-purpose Fraktur model https://github.com/jze/ocropus-model_fraktur If you have public-domain snippets and texts, it would be nice if you could provide them to me so I can include them in the model.
I'm preparing data for training, and when running pageseg I sometimes get errors like this: book/0585.bin.png: scale (5.198) less than --minscale; skipping
Looking at the image (0585.bin.png), I find nothing visually strange.
I know I can make the error go away by using --minscale 5,
but I'm afraid this will hurt my training accuracy.
I don't really know what "scale" is about.
Can you please help, guys?
This image gives a scale of 8.12404:
book/0170.bin.png: scale (8.12404) less than --minscale; skipping
The default --minscale is 12.
@jze any idea?
To my knowledge, that value indicates how tall the characters are. If the characters are too small, it might be difficult to distinguish them from one another. If you have an original color or grayscale scan, you might have better success by generating larger bitonal images. Here is an experiment I did on that topic: https://comsys.informatik.uni-kiel.de/lang/de/res/optimizing-binarization-for-ocropus/ OCRopus also has an option to work with grayscale images; however, I have not tested that in detail.
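For intuition, here is a rough, pure-Python sketch of the idea: treat each connected blob of black pixels as a character candidate and take the median blob height as the "scale". This is only my approximation for illustration; ocropus-gpageseg computes its estimate differently from connected-component statistics, so the numbers will not match exactly.

```python
from statistics import median

def component_heights(binary):
    """Heights of the 4-connected components of 1-pixels in a 2D 0/1 grid."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    heights = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not seen[y][x]:
                # flood-fill one component, tracking its vertical extent
                stack = [(y, x)]
                seen[y][x] = True
                ymin = ymax = y
                while stack:
                    cy, cx = stack.pop()
                    ymin, ymax = min(ymin, cy), max(ymax, cy)
                    for ny, nx in ((cy+1, cx), (cy-1, cx), (cy, cx+1), (cy, cx-1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                heights.append(ymax - ymin + 1)
    return heights

def estimate_scale(binary):
    """Median character height as a stand-in for the pageseg 'scale'."""
    hs = component_heights(binary)
    return median(hs) if hs else 0.0

# two synthetic 10-pixel-tall "characters" on a 20x30 page
page = [[0] * 30 for _ in range(20)]
for y in range(5, 15):
    for x in range(5, 9):
        page[y][x] = 1
    for x in range(15, 19):
        page[y][x] = 1
print(estimate_scale(page))  # 10.0
```

So a scale of 5 really means the text on that page is rendered only a handful of pixels tall, which is why upscaling the scan before binarization helps.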
I happen to know the font of the characters I want to recognize with OCRopus,
so I'm generating the ground truth data myself.
I'm looking for a way to get excellent training results, since I can generate as much data as I want.
Is the font size I use for the ground truth data important?
I'm considering size 18.
And I'm struggling with segmentation; my ground truth data is designed to produce one word per image.
First problem: the segmentation process seems to ignore short words.
And by the way, I don't understand the meaning of the blue-colored words in the above image.
As a result of the segmentation I get pretty much all the words, each in its own image, but I still have some exceptions like these:
one line = 2 words
That's my second problem.
Can anyone help, please?
You should make sure that you are working with 300 dpi images. Moreover, for training you should not only take lines with single words; you should work with lines similar to those you expect to occur in a text. I.e., usually one works with a sample text (full sentences, paragraphs) and generates lines from it. For that you can use the command ocropus-linegen; you only need a suitable font and a text file (see the example in the README). This creates the lines directly and does not require a page-segmentation step first.
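As a back-of-the-envelope check on why the resolution matters (my own rule of thumb, not something from the OCRopus docs): a point is 1/72 inch, so the nominal pixel size of a rendered font depends directly on the dpi:

```python
def pt_to_px(pt, dpi):
    # 1 pt = 1/72 inch, so pixels = points / 72 * dpi
    return pt / 72.0 * dpi

print(pt_to_px(18, 300))  # 75.0 px nominal em size at 300 dpi
print(pt_to_px(18, 72))   # 18.0 px at screen resolution -- likely too small
```

That is, the same 18 pt font is comfortably large at 300 dpi but falls well below a sensible character height when rendered at screen resolution, which matches the minscale warnings above.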
Has anyone tried to embed OCRopus in an Android application?
@zuphilip I'm now following your advice and using ocropus-linegen with my desired font. Thank you for the info!
@zuphilip How much data is enough? I see ocropus-linegen has a maxlines default of only 200, while I could generate many more. I've set maxlines to 100,000; I don't know if this will help...
For my Gothic print model https://github.com/jze/ocropus-model_fraktur I have used 3,400 lines (164,000 characters) from different books. However, for an initial model 300 lines might be sufficient. For a robust model you should not rely only on generated text but also train with real-world scan images.
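One more suggestion along these lines (my own practice, not an official OCRopus workflow): keep a random held-out set of line images so you can measure the error rate on lines the model never saw during training. A minimal sketch, with hypothetical file names:

```python
import random

# hypothetical paths; in practice collect them with glob.glob("book/*/*.bin.png")
lines = ["book/0001/%06x.bin.png" % i for i in range(1, 301)]

random.seed(0)  # reproducible split
held_out = set(random.sample(lines, len(lines) // 10))  # ~10% for evaluation
train = [l for l in lines if l not in held_out]

print(len(train), len(held_out))  # 270 30
```

You would then train only on the `train` list and report character error rates on the held-out lines; with generated data it is especially easy to overfit, so an untouched evaluation set is worth the small bookkeeping effort.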