    Stefan Weil
    @stweil
    Which GPU hardware is recommended for use with OCR-D?
    7 replies
    Lena Hinrichsen
    @lena-hinrichsen
    @/all this Wednesday, 2–3pm is our next open TechCall. Feel free to join us in https://meet.gwdg.de/b/eli-ufa-unu if you are interested in the topics:
    • Organizing development in the coordination project (Paul Pestov)
    • Vision for the workflow engine by GWDG (Triet Doan)
    Lena Hinrichsen
    @lena-hinrichsen
    @/all The next meeting of OCR(-D) & Co will take place on 3 September 2021, and everybody interested is invited. At this open meeting in barcamp format, developers, users and all other interested parties can exchange ideas about OCR(-D) in an informal, low-barrier setting.
    We meet from 10 to 11 am in a BBB room. At the beginning of each meeting, participants have the opportunity to suggest topics of interest to them, which can then be discussed in small groups. Discussion points can also be entered in advance in the HedgeDoc. While the Open TechCall is where the OCR-D community discusses technical topics, OCR(-D) & Co also offers participants without in-depth OCR-D knowledge the opportunity to contribute their own questions and ideas and to exchange openly with each other.
    Robert Sachunsky
    @bertsky
    BTW @kba @paulpestov @tdoan @anguelos regarding the issue of GPU resource allocation in servers (under workspace / page parallelization): at least Tensorflow has a feature that comes close to transparent paging (called "unified memory"), and, if used properly, it has little overhead. I have explained this in detail in my workflow server / processing server prototype.
    We have deployed and performance-tested this in a practical system:
    1. in TF, set per_process_gpu_memory_fraction and control via environment variable (how many processes are meant to share the same physical GPU?): https://github.com/OCR-D/ocrd_segment/blob/b27131bcadf563c947fee51aabe7656d1c185b8c/ocrd_segment/classify_formdata_layout.py#L107
    2. in TF, activate use_unified_memory and thus prevent OOM (in case you misestimated peak allocation): https://github.com/OCR-D/ocrd_segment/blob/b27131bcadf563c947fee51aabe7656d1c185b8c/ocrd_segment/classify_formdata_layout.py#L109
    3. at the start of the server, configure processors such that as many instances get GPU access as do actually fit and set the envvar accordingly: https://git.informatik.uni-leipzig.de/smarthec/smarthec_webservice#docker-compose
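    The three steps above can be sketched in Python roughly as follows. This is only an illustration, not code from the linked repositories: the envvar name `GPU_PROCS` and the 5% headroom are assumptions, and the TF session setup is shown in comments since it needs a GPU to run.

```python
import os

def gpu_memory_fraction(n_procs: int, headroom: float = 0.05) -> float:
    """Fraction of GPU memory each of n_procs co-resident processes may
    claim, leaving a little headroom for the CUDA context itself."""
    if n_procs < 1:
        raise ValueError("need at least one process")
    return (1.0 - headroom) / n_procs

# Applied in a TF1-style processor, it would look roughly like this
# (sketch only; requires tensorflow and a GPU, so not executed here):
#
#   import tensorflow as tf
#   n = int(os.environ.get("GPU_PROCS", "1"))  # hypothetical envvar
#   opts = tf.compat.v1.GPUOptions(
#       per_process_gpu_memory_fraction=gpu_memory_fraction(n),
#       experimental=tf.compat.v1.GPUOptions.Experimental(
#           use_unified_memory=True))  # page instead of OOMing on overflow
#   session = tf.compat.v1.Session(
#       config=tf.compat.v1.ConfigProto(gpu_options=opts))
```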
    Matthias Boenig
    @tboenig

    @/all https://pad.gwdg.de/s/uHnvC1wUW
    As announced at the OCR-D workshop, OCR-D would like to organize a forum for Ground Truth. I propose that it take place on Thursday between 1:00 and 2:00 pm. Our plan is to start on 16 September.
    Topics:

    • Using the Ground Truth Guidelines (https://ocr-d.de/de/gt-guidelines/trans/)
    • GT storage, GT usage, GT curation
    • Transformations from ABBYY, ALTO, ... to the Ground Truth format PAGE
    • Questions about PAGE

    Stefan Weil
    @stweil
    tesseract-ocr now uses main as the default git branch of all repositories (see tesseract-ocr/tesseract#3554). I suggest doing that for the OCR-D repositories as well.
    Robert Sachunsky
    @bertsky

    I suggest doing that for the OCR-D repositories as well.

    I disagree. (See my arguments here.)

    Lena Hinrichsen
    @lena-hinrichsen
    @/all Unfortunately, our Open TechCall has to be cancelled this week. We look forward to seeing you all at the announced GT Call on Thursday (1:00-2:00 p.m.) instead! The next Open TechCall will take place September 29.
    Thomas Werkmeister
    @twerkmeister
    Hey there, I'm just trying out ocrd and running into some issues with processing speed. I just fired up a binarization example, ocrd-olena-binarize -I OCR-D-IMG -O OCR-D-BIN -P impl sauvola, from the user guide on a small subset of my data (400 pages or so), and each page takes about 4 seconds to process, including writing the mets.xml twice. At the same time I see almost no CPU / memory utilization. I am running inside Docker on an Ubuntu server with 8 cores / 64 GB of RAM. Importing was slow too. Even for this small test run it is painfully slow, with binarization taking 26 min, but I'd potentially want to process multiple 100k documents, and that's out of the question at this speed. Is this expected, or am I likely doing something wrong? Any tips would be much appreciated.
    9 replies
    Lena Hinrichsen
    @lena-hinrichsen
    @/all Unfortunately, next week's GT Call has to be cancelled. We look forward to seeing you all on 14 October for the next GT Call instead. Also, our Tech Call will take place next Wednesday at 2pm as scheduled. Another opportunity for exchange is our Barcamp, held as usual on the first Friday of the month at 10am, which we will also remind you of next week.
    Lena Hinrichsen
    @lena-hinrichsen
    @/all this Wednesday, 2–3pm is our next open Tech Call. Feel free to join us in https://meet.gwdg.de/b/eli-ufa-unu if you are interested in the topics:
    • GPU for OCR(-D)
      • What models of GPU are you using?
      • What should we recommend for potential buyers?
      • What GPU-benefitting processors are you using?
      • Best Practices and experiences around efficient data transfer from/to GPU/CPU/RAM
      • Techniques for efficient scaling (multi-device) and sharing (multi-kernel/client)
    • specification/conventions for TextEquiv/@index in and outside of OCR-D, cf. OCR4all/LAREX#282
    Lena Hinrichsen
    @lena-hinrichsen
    @/all The next meeting of OCR(-D) & Co will take place tomorrow, on 1 October 2021, and everybody interested is invited. At this open meeting in barcamp format, developers, users and all other interested parties can exchange ideas about OCR(-D) in an informal, low-barrier setting.
    We meet from 10 to 11 am in a BBB room. At the beginning of each meeting, participants have the opportunity to suggest topics of interest to them, which can then be discussed in small groups. Discussion points can also be entered in advance in the HedgeDoc. While the Open TechCall is where the OCR-D community discusses technical topics, OCR(-D) & Co also offers participants without in-depth OCR-D knowledge the opportunity to contribute their own questions and ideas and to exchange openly with each other.
    Lena Hinrichsen
    @lena-hinrichsen

    Meet @NWeichselbaumer & Christoph Reske from the OCR-D module project Font Group Recognition for Improved OCR and other interesting guests tomorrow and the day after tomorrow at (German language) expert talk Typometrische Daten in Drucken des 17. Jahrhunderts: https://www.hab.de/event/typometrische-daten/

    Registration for online participation by e-mail at: beyer@hab.de / cboveland@hab.de

    jbarth-ubhd
    @jbarth-ubhd
    Running sbb-binarize, eynollah-segment, calamari-recognize [maximum-cuda] on our RTX 3090 on 196 TIFFs, started 1 Oct at 12:00. 129 finished so far. 2700 s/TIFF.
    6 replies
    jbarth-ubhd
    @jbarth-ubhd
    nvidia-smi: | 0 N/A N/A 2274068 C .../headless-tf1/bin/python3 271MiB |
    jbarth-ubhd
    @jbarth-ubhd
    calamari-recognize does not run on GPU, even with maximum-cuda
    jbarth-ubhd
    @jbarth-ubhd
    (image grafik.png: GPU memory usage; red = sbb-binarize, blue = eynollah-segment, green = calamari-recognize)
    jbarth-ubhd
    @jbarth-ubhd
    From tensorflow(2?) docs: »...to configure a virtual GPU device with tf.config.set_logical_device_configuration and set a hard limit on the total memory to allocate on the GPU.«
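    For reference, the TF2 API quoted from the docs would be used roughly like this. This is a sketch, not tested against the processors discussed here; the import is deferred so the snippet can be read without tensorflow installed, and any memory limit (in MiB) is an arbitrary example:

```python
def cap_gpu_memory(limit_mib: int) -> None:
    """Pin the first visible GPU to a hard memory limit via the TF2
    logical-device API, so one process cannot grab the whole card."""
    import tensorflow as tf  # deferred: only needed when actually capping
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=limit_mib)])
```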
    jbarth-ubhd
    @jbarth-ubhd
    found in eynollah.py: #gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=7.7, allow_growth=True)
    jbarth-ubhd
    @jbarth-ubhd
    issue created.
    Mike Gerber
    @mikegerber

    calamari-recognize does not run on GPU, even with maximum-cuda

    It should? Could you open a GitHub issue for that as well?

    Mike Gerber
    @mikegerber

    calamari-recognize does not run on GPU, even with maximum-cuda

    It should? Could you open a GitHub issue for that as well?

    OCR-D/ocrd_calamari#68 @jbarth-ubhd

    Lena Hinrichsen
    @lena-hinrichsen
    @/all We would like to use tomorrow's Open TechCall (2–3pm Berlin Time) to collect feedback on our software documentation. What is missing, outdated or lacking in detail? We have also collected further questions for feedback on our software as a suggestion: https://pad.gwdg.de/75dyxG6gS-e0Q04_fpm-ng
    Feel free to join us in https://meet.gwdg.de/b/eli-ufa-unu !
    Lena Hinrichsen
    @lena-hinrichsen

    @/all Join us tomorrow, 1–2 pm (Berlin Time) in our next OCR-D-GT-Call. Tomorrow's discussion will be focused on problems transcribing difficult cases. Link to our Big Blue Button: https://meet.gwdg.de/b/eli-ufa-unu

    PS: You can easily look up upcoming events on our website now: https://ocr-d.de/en/community.html

    Uwe Hartwig
    @M3ssman
    FYI: We've updated our repository for Fulltext Generation of Historical German Newspapers with the latest training data set (16k) and a rather small but double-keyed word list for Tesseract 4.x fine-tuned training based on regular frk or gt4hist (https://github.com/ulb-sachsen-anhalt/ulb-zeitungsprojekt-hp1)
    AriVesalainen
    @AriVesalainen
    I'm about to start writing my Data Science Master's thesis at the University of Helsinki. The topic will be linked to page layout/segmentation, especially in OCR. I would be interested in integrating this with OCR-D; what would be the best way to do this, and whom should I contact to agree on practicalities?
    3 replies
    Mike Gerber
    @mikegerber

    dinglehopper the OCR evaluation tool just got 50x faster!

    Many thanks to @maxbachmann of the RapidFuzz string matching library, who implemented all my API wishes (support for all hashable types and an implementation of editops()). This is particularly interesting for anyone using the python-Levenshtein library: python-Levenshtein is GPL-licensed, which makes it incompatible with your own non-GPL license, while RapidFuzz is MIT-licensed. (Other advantages of RapidFuzz: support for all hashable types, and it's actively maintained: I had a bug fixed and the fix released within an hour!)

    1 reply
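    The point about "support for all hashable types" is the crucial one for OCR evaluation: distances must be computable over sequences of grapheme clusters or tokens, not just plain strings. A minimal pure-Python illustration of that idea (not the RapidFuzz API itself):

```python
def levenshtein(a, b) -> int:
    """Edit distance over any two sequences of hashable items,
    e.g. strings, lists of grapheme clusters, or token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (x != y)))     # substitution
        prev = cur
    return prev[-1]
```

    The same function compares "kitten" with "sitting" or a list like ["ft", "er"] with ["ft", "e", "r"], which is exactly what a GPL-free, hashable-type-aware library enables for dinglehopper.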
    Lena Hinrichsen
    @lena-hinrichsen

    @/all We would like to use this week's Open TechCall (Wednesday, 2–3pm Berlin Time) again to collect feedback:

    • Documentation: What is missing, outdated or lacking in detail?
    • CLI: Which features of the processor CLIs do you use most often, and which seldom?
    • CLI: Are there missing ocrd subcommands for tasks you do regularly?
    • CLI: How do you parallelize workflows? Do you know the .. operator and ocrd workspace merge commands?
    • Python API: Do you use it directly in scripts, e.g. instantiating a workspace or using the PAGE API? How could we improve it?
    • Evaluation: How do you evaluate the results, what kind of tools and measures do you use?

    Feel free to join us in https://meet.gwdg.de/b/eli-ufa-unu

    Mike Gerber
    @mikegerber
    @bertsky Could you point me to the discussion of putting more than one CUDA Toolkit version into a container image/system? Can't find an issue - or maybe it was just discussed here in Gitter? (Context: OCR-D/ocrd_calamari#68)
    Found it: OCR-D/ocrd_all#263
    Lena Hinrichsen
    @lena-hinrichsen
    @/all Join us tomorrow, 1–2 pm (Berlin Time) in our next OCR-D-GT-Call. Tomorrow's discussion will be focused on Layout/Structure Ground Truth. Link to our Big Blue Button: https://meet.gwdg.de/b/eli-ufa-unu
    Mike Gerber
    @mikegerber

    In 2020 I built an updated Ubuntu (and Debian) package for olena using OCR-D's version, because it was painful to build Olena from source. Is there still interest in this? ocrd_olena seems to build it from source. I'm trying to figure out what to do with the package, e.g. whether to set up an APT repo or not. For myself, having the .deb online is good enough.

    Ubuntu: https://qurator-data.de/~mike.gerber/olena_2.1-0+ocrd-git+1/
    Debian: https://qurator-data.de/~mike.gerber/olena_2.1-0+ocrd-git+1-debian10/

    Mike Gerber
    @mikegerber

    dinglehopper the OCR evaluation tool just got 50x faster!

    Many thanks to @maxbachmann of the RapidFuzz string matching library, who implemented all my API wishes (support for all hashable types and an implementation of editops()). This is particularly interesting for anyone using the python-Levenshtein library: python-Levenshtein is GPL-licensed, which makes it incompatible with your own non-GPL license, while RapidFuzz is MIT-licensed. (Other advantages of RapidFuzz: support for all hashable types, and it's actively maintained: I had a bug fixed and the fix released within an hour!)

    Same author (@maxbachmann) seems to have taken up the maintenance of python-Levenshtein at https://github.com/maxbachmann/Levenshtein, which is also a great development. (But it still has the "GPL problem", of course)

    1 reply
    Lena Hinrichsen
    @lena-hinrichsen
    @/all The next meeting of OCR(-D) & Co will take place on Friday, 5 November 2021, and everybody interested is invited. At this open meeting in barcamp format, developers, users and all other interested parties can exchange ideas about OCR(-D) in an informal, low-barrier setting.
    We meet from 10 to 11 am in a BBB room. At the beginning of each meeting, participants have the opportunity to suggest topics of interest to them, which can then be discussed in small groups. Discussion points can also be entered in advance in the HedgeDoc. While the Open TechCall is where the OCR-D community discusses technical topics, OCR(-D) & Co also offers participants without in-depth OCR-D knowledge the opportunity to contribute their own questions and ideas and to exchange openly with each other.
    Lena Hinrichsen
    @lena-hinrichsen
    @/all this Wednesday, 2–3pm is our next open TechCall. Feel free to join us in https://meet.gwdg.de/b/eli-ufa-unu if you are interested in the topics:
    • PAGE Viewer in browse-ocrd
    • Algorithms for evaluating reading order performance?
    • Bugfix release of OCR-D/core
    Lena Hinrichsen
    @lena-hinrichsen
    @/all Join us tomorrow, 1–2 pm (Berlin Time) in our next OCR-D-GT-Call, where we will continue the discussion on Layout/Structure Ground Truth. Link to our Big Blue Button: https://meet.gwdg.de/b/eli-ufa-unu
    Robert Sachunsky
    @bertsky
    @hnesk just to let you know: your new PageView also works flawlessly via Broadwayd in the browser (across platforms):
    (screenshot: ocrd-browser-pageview-broadwayd-euler.png)
    Mike Gerber
    @mikegerber
    Awesome news! I have a few small issues with PAGE Viewer and it's great to have an alternative
    Lena Hinrichsen
    @lena-hinrichsen
    @/all Please note that our TechCall and GT Call cannot take place next week. Instead, we would like to draw your attention to the Kitodo user meeting (via Webex on 24/25 November). At the user meeting you will also be able to meet OCR-D project participants. Among others, @stroetgen @bertsky @stweil @m-kotzyba @liljams and @tdoan2010 will report on their work in the OCR-D implementation projects Integration of Kitodo and OCR‑D for productive mass digitisation and OPERANDI: OCR-D Performance Optimisation and Integration, but there will also be other exciting talks and workshops on the Kitodo software. For the programme and instructions on how to register (by 19 November), please visit https://www.kitodo.org/en/anwendertreffen
    Konstantin Baierer
    @kba
    Heads-up for users of the OCR-D resource manager: there was an inconsistency between the specification and the implementation in core when downloading to / loading from the current working directory. When you specify --location cwd, models are looked for directly in the current working directory, i.e. a resource <fname> is looked for in <cwd>/<fname>, not in <cwd>/ocrd-resources/<processor>/<fname> as is the case for the other location options. While it has always been implemented this way, the spec has been corrected to reflect this.
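    In pseudo-Python, that lookup rule reads roughly like this. It is a sketch of the behaviour described above, not resource-manager code; the base directories for the data and system locations are illustrative assumptions, not quoted from the spec:

```python
from pathlib import Path

def resolve_resource(location: str, processor: str, fname: str) -> Path:
    """Sketch of the corrected rule: --location cwd resolves a resource
    directly in the working directory; other locations resolve under
    an ocrd-resources/<processor>/ subdirectory."""
    if location == "cwd":
        return Path.cwd() / fname
    bases = {  # illustrative base directories only
        "data": Path.home() / ".local" / "share",
        "system": Path("/usr/local/share"),
    }
    return bases[location] / "ocrd-resources" / processor / fname
```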
    Stefan Weil
    @stweil
    Tesseract (the software which is used for ocrd-tesserocr) release 5.0.0-rc3 is now available, and the final release 5.0.0 is planned to follow soon. Please report any remaining problems which should be fixed in 5.0.0. See https://github.com/tesseract-ocr/tesseract/releases/tag/5.0.0-rc3.
    4 replies
    Lena Hinrichsen
    @lena-hinrichsen
    @/all Tomorrow and the day after tomorrow, vBIB21 will take place on the topic of Digital Communities. The programme includes many exciting lectures. In a Speakers Corner, moderated by @wrznr, we will discuss our community activities in OCR-D with you. Do you have any feedback on what we offer the community in OCR-D? Maybe ideas for new formats or other suggestions? You are welcome to contribute them as well – tomorrow in Speakers Corner B from 11:15 a.m. (CET). The complete programme and the links to the rooms can be found at https://www.vbib.net/vbib21-programm/ . Please note that our and most other presentations will be held in German but you are very welcome to give comments in English, of course.
    Konstantin Baierer
    @kba

    We just released a new minor version 2.28.0 of OCR-D/core, with these changes:

    Added:

    • Store parameterization of processors in METS for provenance, #747
    • ocrd workspace find --download: Add a --wait option to wait between downloads, #745
    • bashlib: Check fileGrps when parsing CLI args, #743, OCR-D/ocrd_olena#76
    • Dockerfile: Install time to have /usr/bin/time in the image, #748, OCR-D/ocrd_all#271

    Fixed:

    • ocrd-dummy: Also set pcGtsId, v0.0.2, #739

    Thanks to @bertsky for his contributions.

    If you have any question or comments, feel free to respond here or in the release announcement discussion on GitHub.

    A new release of ocrd_all will follow later today or tomorrow.

    Konstantin Baierer
    @kba

    We now also released a new version v2021-11-30 of ocrd_all.

    It includes the brand-new tesseract 5.0.0, the latest OCR-D/core v2.28.0, as well as fixes to workflow_configuration and a fix (OCR-D/ocrd_segment@bdc6771) in ocrd_segment.

    Thanks to @bertsky, @stweil and all contributors!

    The Docker images are being rebuilt and should be deployed and available in a few hours.

    jbarth-ubhd
    @jbarth-ubhd
    I've just noticed that the current calamari models here https://github.com/Calamari-OCR/calamari_models (»v5«) do not run with the current ocrd_all (docker).
    Mike Gerber
    @mikegerber
    @jbarth-ubhd It's a bit of a Gordian knot because the Calamari models require a newer Calamari version which apparently requires Python 3.7 which (AFAIK) isn't supported in ocrd_all yet. For now I can offer our (Qurator SPK's) model at https://qurator-data.de/calamari-models/GT4HistOCR/2019-12-11T11_10+0100/
    Lena Hinrichsen
    @lena-hinrichsen
    @/all Look forward to the innovations in the OCR(-D) & Co format! Tomorrow at 10am (CET) you can join us in WorkAdventure, where you can walk around our virtual space with an avatar and talk to others nearby (similar to Wonder, Gather etc.). Take the opportunity to get to know each other or to meet other community members and project participants. We look forward to seeing you! The first date in 2022 will be 14 January, so it should not clash with your winter holiday plans. After that, we will continue as usual on the first Friday of every month.
    Konstantin Baierer
    @kba
    @/all We have an interesting agenda for our Open Tech Call. Feel free to join us on Wednesday 2pm CET in https://meet.gwdg.de/b/eli-ufa-unu