    Jesper Zedlitz
    @jze
    To my knowledge, that value indicates how tall the characters are. If the characters are too small, it might be difficult to distinguish between them. If you have an original color or grayscale scan, you might have better success generating larger bitonal images. Here is an experiment I did on that topic: https://comsys.informatik.uni-kiel.de/lang/de/res/optimizing-binarization-for-ocropus/
    OCRopus also has an option to work with grayscale images. However, I have not tested that in detail.
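    A minimal sketch of the "enlarge first, then binarize" idea with Pillow (the file names, the 2x factor, and the fixed threshold are only illustrative; ocropy's own ocropus-nlbin does proper adaptive binarization and is usually the better choice):
        # Enlarge a grayscale scan before thresholding so the characters
        # come out taller in the bitonal image.
        from PIL import Image

        img = Image.open("scan.png").convert("L")                         # grayscale
        img = img.resize((img.width * 2, img.height * 2), Image.LANCZOS)  # upscale first
        bw = img.point(lambda p: 255 if p > 128 else 0).convert("1")      # crude fixed threshold
        bw.save("scan.bin.png")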
    The Z
    @farirat
    I happen to know the font of the characters I want to recognize with OCRopus,
    so I'm generating the ground-truth data myself.
    I'm looking for a way to get excellent training results, since I can generate as much data as I want.
    Is the font size I'm using for the ground-truth data important?
    I'm considering size 18.
    The Z
    @farirat
    I'm also struggling with segmentation. My ground-truth data is designed to yield one word per image:
    data_0001.png
    First problem: the segmentation process seems to ignore short words:
    0001.pseg.png
    By the way, what is the meaning of the blue-colored words in the image above?
    As a result of the segmentation I get nearly all the words, each one in its own image, but I still have some exceptions like these:
    01019a.bin.png
    01000b.bin.png
    one line = 2 words
    That's my second problem.
    Can anyone help, please?
    Philipp Zumstein
    @zuphilip
    You should make sure that you are working with 300 dpi images. Moreover, for training you should not take only lines with single words; you should work with lines similar to those you expect to occur in a real text. Usually one takes a sample text (full sentences, paragraphs) and generates lines from it. For that you can use the command ocropus-linegen. You only need a suitable font and a text file (see the example in the README). This creates the lines directly, without needing to run page segmentation first.
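    For example, the README invocation looks roughly like this (the sample text and font are files shipped in the ocropy repository, so adjust the paths to your own text and font):
        ocropus-linegen -t tests/tomsawyer.txt -f tests/DejaVuSans.ttf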
    The Z
    @farirat
    Did anyone try to embed OCRopus in an Android application?
    The Z
    @farirat
    @zuphilip I'm now following your advice and using ocropus-linegen with my desired font, thank you for the info!
    @zuphilip How much data is enough? I see ocropus-linegen has a maxlines default of only 200, while I have the possibility to generate many more. I've set maxlines to 100,000; I don't know if this will help...
    Jesper Zedlitz
    @jze
    For my Gothic print model https://github.com/jze/ocropus-model_fraktur I used 3,400 lines (164,000 characters) from different books. However, for an initial model 300 lines might be sufficient. For a robust model you should not rely only on generated text but also train with real-world scan images.
    Philipp Zumstein
    @zuphilip
    See also @danvk's experience with training: http://www.danvk.org/2015/01/11/training-an-ocropus-ocr-model.html
    The Z
    @farirat
    OK, now I have training running on ~70k Arabic lines.
    Mc Hammer
    @sirchris7_twitter
    Hi there, I was just wondering if there is a way to speed up the training process of ocropus-rtrain, because right now it seems to use only one core of my hardware.
    Bhavishya Pohani
    @Azrael1
    Hi, I observed that ocropy randomly pauses during training. It does not move forward after that but is still using 100% of one core
    Has anyone else experienced this issue?
    Jesper Zedlitz
    @jze
    Does your ground-truth data contain the right characters? You can use my little script to find out: https://comsys.informatik.uni-kiel.de/res/ocropus-check-groundtruth/
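    The core idea fits in a few lines of Python; here is a minimal sketch (not the linked script, and the book/*/*.gt.txt pattern is an assumption based on the usual ocropy directory layout):
        # List every distinct character in the ground truth with its
        # codepoint, name, and frequency, so stray characters stand out.
        import collections
        import glob
        import io
        import unicodedata

        counts = collections.Counter()
        for path in glob.glob("book/*/*.gt.txt"):
            with io.open(path, encoding="utf-8") as f:
                counts.update(f.read().replace("\n", ""))

        for char, n in counts.most_common():
            print(repr(char), hex(ord(char)), unicodedata.name(char, "UNKNOWN"), n)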
    Bhavishya Pohani
    @Azrael1
    Anyone know the solution to this problem tmbdev/ocropy#158 ?
    Umesh Rao
    @urhub
    I have an image of a newspaper with 4 columns. Column segmentation is correct, but the lines are not in serial order, left to right, column by column. The order seems to jump from one column to another and back after the first column. Any idea what could be causing this?
    e1000
    @e1000
    @urhub : There is some complicated logic in the library; it does not necessarily know which reading order of the document is wanted. (Ocropy might not even know about columns...) You might need to figure out the xy-coordinates and apply a simple column-line grouping algorithm.
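    Something like this hypothetical sketch (the function name, the fixed column count, and the page width are made up for illustration):
        # Bucket line bounding boxes by the column their x-center falls
        # into, then read each column top to bottom, columns left to right.
        def reading_order(boxes, n_columns=4, page_width=3000):
            """boxes: list of (x0, y0, x1, y1) line bounding boxes in pixels."""
            col_width = page_width / n_columns
            def key(box):
                x0, y0, x1, y1 = box
                column = int(((x0 + x1) / 2) // col_width)  # which column the line sits in
                return (column, y0)                         # column first, then vertical position
            return sorted(boxes, key=key)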
    Jesper Zedlitz
    @jze
    In lstm.py a normalization of Unicode characters is performed. Unfortunately, the NFKC normalization changes the LATIN SMALL LETTER LONG S into an ordinary s. That is a problem when dealing with old German texts. Is it really necessary to use NFKC normalization, or would it be sufficient to use NFC normalization?
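    The difference is easy to check with Python's standard library:
        import unicodedata

        long_s = "\u017F"  # LATIN SMALL LETTER LONG S, "ſ"
        print(unicodedata.normalize("NFKC", long_s))  # -> "s"  (the long s is folded away)
        print(unicodedata.normalize("NFC", long_s))   # -> "ſ"  (preserved)
        # If the model keeps the ſ, downstream code can still do
        # text.replace("\u017F", "s") -- but after NFKC there is no way back.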
    e1000
    @e1000
    Hello! My guess is it should work, but it would need training a new model. Did you try training with unicodedata.normalize('NFC', s)? Also, just out of interest: why is "LATIN SMALL LETTER LONG S" -> s a problem? Is it that you want it to be extracted as "LATIN SMALL LETTER LONG S"? Or does it get confused with f? Or something else?
    Jesper Zedlitz
    @jze
    @e1000 Yes, I am currently running a model training with NFC normalization. How to encode the ſ is often discussed; I talked about it two weeks ago at an OCR workshop in Berlin. Until now I thought you should encode it as s. However, you can always replace the ſ with an s in the ground truth or in the resulting text if you don't want it, whereas after the normalization it is impossible to go back. Therefore, it seems to be a good idea to include the ſ in the model. That is also necessary for comparing with Tesseract 4.0 results, which include the ſ.
    Umesh Rao
    @urhub
    @Azrael1 Yes, I have seen it pause before. Every time, pressing Enter on the keyboard causes it to resume. This happens only during an interactive session.
    Umesh Rao
    @urhub
    Hello everyone, I am thinking of working on documentation for Ocropy as part of a class project. My plan is to begin by collating everything that already exists into one document. Let me know if such a thing has been done already and would be a repeat. Any comments or suggestions?
    Philipp Zumstein
    @zuphilip
    @urhub We started to document some things in the wiki: https://github.com/tmbdev/ocropy/wiki . The wiki is open for everyone to contribute. Moreover, there are links there to blog posts explaining some of the ocropus workflows.
    Umesh Rao
    @urhub
    @zuphilip Thanks for that information. I will work with the wiki then. I will open a question issue to gather input on what users want to see in the wiki.
    Umesh Rao
    @urhub
    @zuphilip I also want to use the methods learned in class, namely Zotero and LibreOffice. I am not sure how that connects with the wiki. I will produce a LibreOffice document first and then look into updating the wiki. I will use input from #256 to create the contents. Hope that is okay.
    Umesh Rao
    @urhub
    Hi all, I have uploaded my document at https://github.com/digiah/oldOCR/blob/master/ocropy_getting_started.pdf. Let me know of any corrections or omissions and I will fix it. Thanks.
    Philipp Zumstein
    @zuphilip
    @urhub Thank you for sharing!
    direselign
    @direselign
    I want to add a new language that has a different script than Latin. Could you tell me how I can do that?
    christophered
    @christophered
    Has anyone tested ocropy2?
    Konstantin Baierer
    @kba
    I have not, but I hope to find some time soon to check out https://github.com/NVlabs/ocropus3, which bundles all the repos tmbdev developed for his next-gen OCR system, as well as the slides from his DAS 2018 workshop.
    Nightfury10497
    @NightFury10497
    Can anyone suggest how to do OCR using deep learning or R-CNN?
    Konstantin Baierer
    @kba
    tesseract 4, ocropus, ocropus2, kraken, calamari ...
    Gqia189
    @Gqia189
    Hi everyone, I am new to ocropus and was wondering why the segmentation of my image detects 228 lines but only writes out 189, even though I have already set --noise 0.
    Konstantin Baierer
    @kba
    What does the document look like and what lines are missing? That could be a clue.