    stefanCCS
    @stefanCCS

    And now, after a restart, I get this error. Maybe it is related to this as well:

    Warning: LSTMTrainer deserialized an LSTMRecognizer!
    malloc(): invalid size (unsorted)
    make: *** [Makefile:267:

    Or, any other idea?

    helkejaa
    @helkejaa
    @bertsky I suspected that I would have to go deeper into the tools. Thanks!
    Robert Sachunsky
    @bertsky
    @helkejaa you can also use the newest master, just released as v0.10.1, which gives you -P tesseract_parameters '{"textord_noise_rejwords": "0", "textord_noise_rejrows": "0"}' etc.
    I wonder whether SetBoundingBoxComponents(self, bool include_upper_dots, bool include_lower_dots) would help in your case, though. Can you try and add tessapi.SetBoundingBoxComponents(True, True) at the top of recognize's page loop and re-run your example?
    (Also, did the dpi setting give any improvements?)
    stefanCCS
    @stefanCCS

    But without knowing your setup it is hard to give recommendations. Best to file an issue in the tesstrain repo.

    Issue created: (tesseract-ocr/tesstrain#210)

    Konstantin Baierer
    @kba

    We have just released a new version of ocrd_all with major changes in ocrd_tesserocr and ocrd_cis. In this context, we have also revised (OCR-D/ocrd-website#189) the workflow guide on OCR-D's homepage.

    Notably, ocrd_tesserocr has been refactored so that all the segmentation processors delegate to ocrd-tesserocr-recognize and behavior is controlled by the parameters segmentation_level, textequiv_level and model. See the workflow guide's section on region segmentation, https://github.com/OCR-D/tesserocr README and the --help output for details.

    This allows you to use Tesseract for region and line segmentation (and even deskewing and recognition) in one step, reducing the problem of overlap and loss of precision between API calls when running in multiple separate workflow steps. For single-step multi-level segmentation there is a new shorthand processor ocrd-tesserocr-segment, but the old single-level processors still exist, too.

    To better address the problem of overlap (even in single-level segmentation), the new option shrink_polygons=true allows annotating tight hull polygons for all new segments instead of coarse bounding boxes.

    ocrd-tesserocr-recognize now also exposes all internal Tesseract variables in a new tesseract_parameters dict.

    There is one backwards incompatibility, though: overwrite_words has been renamed and generalized to overwrite_segments.
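A minimal sketch of how the parameters named above could be assembled into one ocrd-tesserocr-recognize call. The file group names and the model are borrowed from a later message in this room; the value chosen for textequiv_level and the helper names migrate_params and recognize_command are assumptions for illustration only, so please check the --help output for the values your version accepts.

```python
import json
import shlex


def migrate_params(params):
    """Return a copy with the old overwrite_words key renamed to overwrite_segments."""
    params = dict(params)
    if "overwrite_words" in params:
        # keep an explicitly given overwrite_segments, otherwise carry the old value over
        params.setdefault("overwrite_segments", params.pop("overwrite_words"))
    return params


def recognize_command(input_grp, output_grp, params):
    """Build the command line as one shell-quoted string (needs Python 3.8+ for shlex.join)."""
    argv = [
        "ocrd-tesserocr-recognize",
        "-I", input_grp,
        "-O", output_grp,
        "-p", json.dumps(migrate_params(params)),
    ]
    return shlex.join(argv)


params = {
    "segmentation_level": "line",
    "textequiv_level": "line",  # assumed value, not confirmed in this thread
    "model": "khm+jpn",
    "shrink_polygons": True,
    "overwrite_words": True,  # old name, renamed by migrate_params
    "tesseract_parameters": {
        "textord_noise_rejwords": "0",
        "textord_noise_rejrows": "0",
    },
}
print(recognize_command("OCR-D-SEG-REG", "OCR-D-TESSPROB", params))
```

The sketch only prints the command; it does not run the processor. Serializing with json.dumps and quoting with shlex.join avoids hand-escaping the nested tesseract_parameters dict on the shell.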

    Another new processor is ocrd-tesserocr-fontshape, which uses Tesseract's pre-LSTM models to detect font styles like italic, bold and more. This is very useful for works where font style conveys semantics, like dictionaries or thesauri.

    While ocrd-typegroups-classifier, a processor that can predict a wide variety of (historical) fonts/types including confidence for a page, has been part of ocrd_all for a long time, it is now also documented in the workflow guide.

    The second major update is cisocrgroup/ocrd_cis#77, which brings not only lots of fixes for robustness and speed, but also improves the incremental behaviour of segmentation, makes clipping and resegmentation optimise globally instead of locally, and allows a page level resegmentation (reassigning conflicting text lines between overlapping regions).

    Further, ocrd-segment-repair now finally tries to fix PAGE invalidities and inconsistencies automatically.

    Last but not least, we have revised the workflow recommendations, which now offer a single-processor workflow as a benchmark test for more involved workflows, and contain improved/simplified configurations for the "best" and "fast" workflows.

    Kay-Michael Würzner
    @wrznr
    This is amazing. Great job. Many, many thanks to @bertsky, @kba and the whole OCR-D team. You are awesome!
    Stefan Weil
    @stweil
    I agree with @wrznr. Thank you very much.
    Kay-Michael Würzner
    @wrznr
    image.png
    So beautiful: a stepped ſſ ligature within Antiqua.
    Clemens Neudecker
    @cneud
    I am happy to announce that the website for the 6th International Workshop on Historical Document Imaging and Processing (HIP’21) alongside #icdar2021 is now online with preliminary information about the workshop - please consider submitting original work (further info on CfP will follow early next year) and share widely!
    https://blog.sbb.berlin/hip2021
    jbarth-ubhd
    @jbarth-ubhd
    Want to train Calamari: we have a very good scan here of a book printed in the Bembo font: https://digi.hadw-bw.de/view/di016/0166 . But how to train? Text lines with random characters, or (random) pseudo-words generated from German trigrams? With or without noise? I already added random blur + unsharp mask before binarization, but I'm not sure how biased / "real" the training data should be.
    helkejaa
    @helkejaa

    I wonder whether SetBoundingBoxComponents(self, bool include_upper_dots, bool include_lower_dots) would help in your case, though. Can you try and add tessapi.SetBoundingBoxComponents(True, True) at the top of recognize's page loop and re-run your example?

    You mean in recognize.py? And if so, where exactly? changing dpi did not help essentially.

    Robert Sachunsky
    @bertsky
    @helkejaa yes, somewhere between the Tesseract init and the main page loop, e.g. here

    changing dpi did not help essentially.

    too bad! But thanks for letting me know.

    helkejaa
    @helkejaa
    I'm not getting any different results with adding tessapi.SetBoundingBoxComponents(True, True) to recognize.py and running ocrd-tesserocr-recognize -I OCR-D-SEG-REG -O OCR-D-TESSPROB -p '{"dpi":500, "segmentation_level":"line", "overwrite_segments":true, "model":"khm+jpn"}'.
    20201215-204530.png
    helkejaa
    @helkejaa
    Perhaps a somewhat similar problem, especially for ocrd-cis-ocropy-segment, can be encountered with Arabic script, where the upper parts of Kāf letters end up segmented as their own lines.
    20201215-220236.png
    For Arabic script ocrd-tesserocr fares quite well, so this is not such a huge problem. It's only that I've grown to like ocrd-cis-ocropy-segment because of its spread feature.
    helkejaa
    @helkejaa
    I tried a couple of dpi values for these as well, but no luck. I hope I'm doing it right and that there isn't some other factor I'm not aware of. Sorry for the spam; these problems keep haunting me whenever I work with these languages and scripts.
    Robert Sachunsky
    @bertsky
    @helkejaa don't worry – these are very legitimate and helpful questions! (It's just that we have not tried our tools on diverse sets of languages and scripts yet, and don't have the materials at hand. If you could post the original or binarized version of the above samples, I could take a look myself.)
    Are you sure that even ocrd-cis-ocropy-segment does not respond to an increased dpi override here?
    Regarding spread (i.e. tight polygonal fit): you can also run Tesseract with shrink_polygons=true now to get at least some polygonal approximate hull, and you could also combine Tesseract with Ocropy by using ocrd-cis-ocropy-resegment afterwards.
    helkejaa
    @helkejaa
    @bertsky I'm glad if this helps. OCR-D is already a dream come true for me after many painful years with ABBYY Finereader. Here are some samples from the dictionaries in question: Pashto-Russian, Khmer-Japanese.
    Setting -p '{"dpi":xx}', even to the point that some regions became too small to line-segment, did not prevent the incorrect segmentation.
    helkejaa
    @helkejaa
    That is, while some became too small to be segmented, the ones that were barely big enough had this segmentation problem.
    Robert Sachunsky
    @bertsky
    Thanks! This did help, I can see why now: the problem is not the overall scale parameter (i.e. the median size of glyph components, whose estimation depends on DPI), but the rather high vertical-to-horizontal ratio, in combination with a very large inter-line distance, which does not meet the expectations of the Ocropy segmentation (in its standard parameterization). The problem is that in blackletter / broken fonts for Roman script, the opposite is true: consecutive lines are very close to each other, and ascenders/descenders frequently overlap each other when seen from the side. So to avoid merging vertically adjacent lines, Ocropy needs to split earlier.
    I could expose the vscale parameter (a factor for various filters in vertical direction) to control this trade-off with a-priori knowledge.
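The trade-off described here can be illustrated with a toy model. This is not Ocropy's algorithm: it only bridges empty rows in a vertical ink profile, and the names segment_rows and base_gap as well as the sample profiles are made up for the illustration. The vscale argument merely mimics the idea of a factor that stretches the vertical tolerance.

```python
def segment_rows(profile, vscale=1.0, base_gap=1):
    """Split a vertical ink profile (ink per pixel row) into text lines.

    Runs of inked rows separated by at most int(base_gap * vscale)
    empty rows are treated as one line.
    """
    max_gap = int(base_gap * vscale)
    lines = []
    for row, ink in enumerate(profile):
        if ink == 0:
            continue
        if lines and row - lines[-1][1] - 1 <= max_gap:
            lines[-1][1] = row  # same line: extend it, bridging a small gap
        else:
            lines.append([row, row])  # gap too large: start a new line
    return [tuple(line) for line in lines]


# Tall script with detached upper strokes (3 empty rows above the body)
# and a very large inter-line distance (10 empty rows).
sparse = [1] * 2 + [0] * 3 + [1] * 6 + [0] * 10 + [1] * 2 + [0] * 3 + [1] * 6
# Blackletter-like page: two lines only 2 empty rows apart.
dense = [1] * 8 + [0] * 2 + [1] * 8

for name, profile in (("sparse", sparse), ("dense", dense)):
    for vscale in (1.0, 3.0):
        print(name, vscale, segment_rows(profile, vscale))
```

With the small tolerance the detached strokes of the sparse profile become lines of their own, while the dense profile is split correctly; with the larger tolerance the sparse profile comes out right but the two dense lines merge. That is why a single fixed setting cannot serve both kinds of material.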
    helkejaa
    @helkejaa
    @bertsky Interesting! I hope there will be some solution in the future. Thanks!
    Uwe Hartwig
    @M3ssman
    Good morning @all! Does anyone know around here when the Tesseract-Team (TesTea) will release a new stable version? The latest tag dates back to Dec 2019 ...
    Stefan Weil
    @stweil
    Who is the Tesseract team? And what is TesTea? The latest official stable version is Tesseract 4.1.1. There is still no stable Tesseract 5 because that would require freezing the Tesseract API and starting a Tesseract 6 for further API changes. Personally I'd like to eliminate the proprietary Tesseract data types GenericVector and STRING and replace them by standard types before tagging Tesseract 5.0.0. I already suggested tagging new alpha versions for Tesseract 5. Would that help?
    Konstantin Baierer
    @kba
    I do think that tagging alpha releases or even just date-versioned snapshots would help debugging and make it clearer to "casual" tesseract users that development is ongoing.

    TesTea

    Not a wise choice of abbreviation for an English-speaking audience ;-)

    Uwe Hartwig
    @M3ssman
    Pardon, please forget about my stupid attempt to create funny shortcuts. @stweil May I give any sort of assistance?
    Stefan Weil
    @stweil
    Maybe you want to comment on this discussion?
    Elisabeth Engl
    @EEngl52

    Many of you have probably already left for your well-deserved Christmas holidays. Therefore, we've decided to suspend the regular open TechCall, where various issues and topics of overriding interest are discussed, next Wednesday. Instead, we will offer you the opportunity to discuss topics or current problems, which are probably only of interest for yourself and not for the whole group.

    If you are interested in this more individual meeting, feel free to join us on Wednesday, 23.12., 2-3 pm in the usual room (https://meet.gwdg.de/b/eli-ufa-unu).

    The next regular open TechCall will then take place in the new year on 13.1.

    Until then, we wish you a Merry Christmas and a Happy New Year!

    Stefan Weil
    @stweil
    Tesseract news: there is now a tagged pre-release 5.0.0-alpha-20201224 which I suggest using for production. The current git master includes massive changes which have broken much functionality and which are also incompatible with current tesserocr. I estimate that it will take some days to get a usable git master again.
    Stefan Weil
    @stweil
    Tesseract git master should now be usable again. The public API was stripped a lot (3 files completely removed).
    Stefan Weil
    @stweil
    Are there libraries with digitized music scores? How do you handle OCR for such images? The latest capella-scan switched from ABBYY to Tesseract and might be interesting for libraries, too.
    jpb-badw
    @jpb-badw
    Hello, I'm currently attempting an installation on a new computer. Should I follow the instructions on the OCR-D website or the ones on GitHub?
    Clemens Neudecker
    @cneud
    @jpb-badw Both sets of instructions should be up to date. Note that the recommended installation for users is via Docker, as specified in the Setup Guide, and that one is a bit more user-friendly. If you are keen to dive a little deeper and experiment with the code, you can of course also perform a native installation, for which the ocrd_all README has the most detailed instructions. Hope it works for you!
    Elisabeth Engl
    @EEngl52
    @/all this Wednesday(!), 2-3pm is our next open TechCall. Feel free to join us in https://meet.gwdg.de/b/eli-ufa-unu if you are interested in the following topics:
    jbarth-ubhd
    @jbarth-ubhd
    grafik.png
    I'm using tesseract v5.0.0-alpha-859-gd13e - about 0.15% of ocr is missing; no suspicious errors/warnings. Impression of the pages affected:
    Konstantin Baierer
    @kba
    Can you elaborate on what "0.15% of ocr is missing" means: compared to a manual transcription or to a different engine? What data is missing? Could this be because of bad segmentation/cropping?
    jbarth-ubhd
    @jbarth-ubhd
    0.15% of the output files (*.hocr) are missing. Updated to 5.0.0-alpha-20201224 now...
    jbarth-ubhd
    @jbarth-ubhd
    Oops. Seems this could have been a problem with nfs/lsdf
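For cross-checking which output files are missing after a batch run, a small sketch may help. It assumes one <stem>.hocr per input image with the same base name; that naming, the .tif extension and the helper name missing_outputs are assumptions, not something stated in this thread.

```python
from pathlib import PurePath


def missing_outputs(image_names, hocr_names):
    """Return the sorted base names of images that have no matching .hocr file."""
    done = {PurePath(name).stem for name in hocr_names}
    wanted = {PurePath(name).stem for name in image_names}
    return sorted(wanted - done)


images = ["0001.tif", "0002.tif", "0003.tif"]
hocr = ["0001.hocr", "0003.hocr"]
print(missing_outputs(images, hocr))
```

On real data the two lists could come from directory listings, e.g. the names yielded by pathlib.Path(...).glob("*.hocr"). Comparing the result across two runs would also show whether the same pages go missing each time, or different ones, as one might expect from a storage problem.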
    Elisabeth Engl
    @EEngl52
    @/all this Wednesday, 2-3pm is our next open TechCall. Feel free to join us in https://meet.gwdg.de/b/eli-ufa-unu if you are interested in the following topics:
    Konstantin Baierer
    @kba
    Sorry, typo on my part, that should read "New in 2.22.0" (2.20.0 was released in November 2020)
    jbarth-ubhd
    @jbarth-ubhd
    In Oct. 2020 @cneud wrote »Dear users of sbb-textline-detector tool in favour of a newer version ...«, but the newer version is not available yet, right?
    Konstantin Baierer
    @kba
    https://github.com/qurator-spk/eynollah is the newer version, but its OCR-D interface is not ready yet; we're working on getting it out asap.