    Lucas Sulzbach
    @sulzbals

    Dear users of sbb-textline-detector (cc @sulzbals) - this is just advance notice that SBB/SPK will soon phase out the sbb-textline-detector tool in favour of a newer version of a layout detection tool which brings numerous improvements, such as more granular CLI/API access, detection of additional layout elements (marginalia, headlines, initials), and generally much improved performance. We aim to provide the new tool with OCR-D compliant interfaces via our regular GitHub https://github.com/qurator-spk - if all goes well, roughly by the end of the month.

    Very interesting! I am looking forward to this new tool!

    u-mierendorff
    @u-mierendorff

    Hello :)
    I have been experimenting with OCR-D for a while now and like the modularized concept, which allows me to compare different OCR pipelines.

    After starting with a simple workflow of a few steps that works fine (tesserocr-segment-region -> tesserocr-segment-line -> tesserocr-segment-word -> tesserocr-recognize), I wanted to explore other workflows and remove some of the explicit segmentation steps at the beginning.

    While simpler workflows sometimes work (tesserocr-segment-region -> tesserocr-segment-line -> tesserocr-recognize{'overwrite_words': True, 'textequiv_level': 'line'}), I had problems with the following workflows:

    • A) Region segmentation + Tesseract recognition (produces the best-quality text output of all my experiments, but the PAGE only contains region segmentation and no line/word segmentation)
        ```
        ocrd-tesserocr-segment-region -I OCR-D-IMG -O OCR-D-SEG-BLOCK
        ocrd-tesserocr-recognize -p '{"model": "xxx", "textequiv_level": "region"}' -I OCR-D-SEG-BLOCK -O OCR-D-OCR-TESS
        ```
    • B) Just Tesseract recognition (I expected this to work, as I can also run tesseract 5.0.0 on the command line without any parameters, but this does not return any recognized text at all)
        ```
        ocrd-tesserocr-recognize -p '{"model": "xxx"}' -I OCR-D-IMG -O OCR-D-OCR-TESS
        ```

    Are these cases currently not supported by the tesserocr OCR-D modules or am I doing something wrong here?

    Robert Sachunsky
    @bertsky
    @u-mierendorff Hi and welcome!
    You need to do line segmentation before you can do recognition; that's true for all OCR engines. And because of OCR-D's modularization, there is currently no processor which does both in one step (which would of course be possible, e.g. with Tesseract's API). See also ocrd-tesserocr-recognize -h and the workflow guide.
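    For illustration, a minimal workflow with explicit line segmentation could look like this (just a sketch; the model name "xxx" is a placeholder):
    ```
    ocrd-tesserocr-segment-region -I OCR-D-IMG -O OCR-D-SEG-BLOCK
    ocrd-tesserocr-segment-line -I OCR-D-SEG-BLOCK -O OCR-D-SEG-LINE
    ocrd-tesserocr-recognize -p '{"model": "xxx", "textequiv_level": "line"}' -I OCR-D-SEG-LINE -O OCR-D-OCR
    ```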
    stefanCCS
    @stefanCCS
    [image: grafik.png]
    Hi,
    I have an example where ocrd-anybaseocr-crop behaves "strangely".
    I have the image shown above,
    and get as a result the cropped image below:
    [image: grafik.png]
    Using a simple workflow like this:
    ocrd-cis-ocropy-binarize \
      -I OCR-D-IMG \
      -O OCR-D-BIN
    ocrd-anybaseocr-crop \
      -I OCR-D-BIN \
      -O OCR-D-CROP
    Any idea?
    u-mierendorff
    @u-mierendorff

    @bertsky Ah OK, thank you. Regarding the "need to do line segmentation": when doing only region segmentation and then recognition, it properly recognizes all text, but the output file has the text attached to the region and no word/line segmentation in it.

    The reason why I am experimenting with fewer/different segmentation steps is that the recognition quality of the text decreases significantly with more segmentation steps. So I want to compare this.

    So if I understand correctly, what I want to achieve is currently not possible, but I could adapt the ocrd-tesserocr programs to make it work?

    Robert Sachunsky
    @bertsky

    @u-mierendorff

    When doing only region segmentation and then recognition, it properly recognizes all text, but the output file has the text attached to the region and no word/line segmentation in it.

    Sorry, what do you mean by it? The (non-OCR-D) Tesseract CLI?
    Yes, there is a loss of precision when modularising Tesseract like we do currently, because its API only gives us coarse bounding boxes. (And possibly also because we don't allow Tesseract to do the binarization itself, see OCR-D/ocrd_tesserocr#144).
    I have a branch here locally that does all the segmentation at once, which gives lines and words without overlap. But if you then wanted to combine (say) Tesseract region segmentation with Ocropy line segmentation, you would still have to shrink the bboxes to polygons and remove all the text lines again. I can make a public PR for others to experiment with this approach if you like.

    @stefanCCS
    yes, I have seen this happen too. ocrd-anybaseocr-crop has some cryptic options which I have not been able to look into yet; maybe they can help. But there is also another cropping tool, ocrd-tesserocr-crop, which is more reliable but cannot cope at all with textual noise such as facing pages.
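    A sketch of the same workflow with that cropper swapped in:
    ```
    ocrd-cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN
    ocrd-tesserocr-crop -I OCR-D-BIN -O OCR-D-CROP
    ```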
    u-mierendorff
    @u-mierendorff

    @bertsky

    When doing only region segmentation and then recognition, it properly recognizes all text, but the output file has the text attached to the region and no word/line segmentation in it.

    Sorry, what do you mean by it? The (non-OCR-D) Tesseract CLI?
    No, I mean the tesserocr-recognize CLI tool. I apply the following process in an OCR-D workspace containing an RGB JPEG image and a METS file (no other steps involved):

    ocrd-tesserocr-segment-region -I OCR-D-IMG -O OCR-D-SEG-BLOCK
    ocrd-tesserocr-recognize -p '{"model": "xxx", "textequiv_level": "region"}' -I OCR-D-SEG-BLOCK -O OCR-D-OCR-TESS

    The result is a PAGE file with only region segmentation (no lines or words) and all recognized text attached to the region as a single string in the XML. The text quality itself is better than if I do more segmentation in the previous steps (e.g. an additional tesserocr-segment-line), but I am missing the segmentation information (lines/words) in the output, which tesserocr-recognize must have produced internally at some point in order to do the recognition.

    Yes, there is a loss of precision when modularising Tesseract like we do currently, because its API only gives us coarse bounding boxes. (And possibly also because we don't allow Tesseract to do the binarization itself, see OCR-D/ocrd_tesserocr#144).
    I have a branch here locally that does all the segmentation at once, which gives lines and words without overlap. But if you then wanted to combine (say) Tesseract region segmentation with Ocropy line segmentation, you would still have to shrink the bboxes to polygons and remove all the text lines again. I can make a public PR for others to experiment with this approach if you like.

    Ah Ok, that explains this a bit.
    I would have thought that binarization is not required with modern Tesseract 5.0.0, but if it is done implicitly anyway, then my assumption is probably wrong.

    Sure, would be cool to look into that.

    Robert Sachunsky
    @bertsky
    @u-mierendorff ah, of course, ocrd-tesserocr-recognize already offers line segmentation + recognition in one step. Yes, that's exactly what I mean. And yes, naturally you won't see lines or words at that hierarchy level.
    So I will experiment some more and publish a PR for a new ocrd-tesserocr-segment (i.e. all-in-one, without differentiation into regions, lines and words).
    Tesseract 4 (and the current alpha version, 5) did not modernize segmentation in any way, BTW. It only brought LSTM recognition (which comes with neural word and glyph segmentation, and indeed does not need binarized input in principle).
    Internal binarization is quite bad (Otsu); that's why OCR-D workflows should avoid it by binarizing externally (with good algorithms/models). But as the linked PR shows, this also has drawbacks in some situations.
    Stefan Weil
    @stweil
    Still missing is support for region detection + line segmentation + recognition up to word level in one step. That could be used for very simple and fast workflows.
    Robert Sachunsky
    @bertsky
    @stweil yes, probably best to offer all those options in a single processor.
    @kba how about, instead of a generalised ocrd-tesserocr-segment, I simply extend ocrd-tesserocr-recognize such that, if no text regions whatsoever are present, it first does region segmentation, and if no text lines are present, it first does line segmentation? (Remember, we are already "polymorphic" on the input side below the line level.) So the textequiv_level parameter would denote the hierarchy level only on the output side, while on the input side any level would be allowed.
    Konstantin Baierer
    @kba

    I simply extend ocrd-tesserocr-recognize such that, if no text regions whatsoever are present, it first does region segmentation, and if no text lines are present, it first does line segmentation?

    Sounds reasonable, there obviously is a need for this :+1:

    Konstantin Baierer
    @kba
    Any CUDA users who can help @Witiko with getting calamari running on GPU in docker? OCR-D/ocrd_calamari#46
    jbarth-ubhd
    @jbarth-ubhd
    Before the new sbb_textline is out, here is a segmenter comparison: https://digi.ub.uni-heidelberg.de/diglitData/v/testset-ls-v3.pdf
    PS: done without region-clip, binarization=wolf
    Konstantin Baierer
    @kba
    :clap: thanks for making and sharing these comparisons!
    stefanCCS
    @stefanCCS

    Hi, I have tried to get "browse-ocrd" to run on Windows 10.
    My aim is to run it truly "natively" on Windows (and not "only" in some kind of Linux VM under Windows, which I assume would work somehow).
    The good news is that I could build GTK natively under Windows following this: https://github.com/wingtk/gvsbuild
    (I have created a 32-bit variant, but could also create a 64-bit variant (at least, I think so), in case it is needed.)
    --> if somebody needs the GTK build, I can provide it - please contact me in private chat.
    The bad news is that if I do "pip install browse-ocrd" in my Python 3.8 Windows environment, I get an error that pip cannot find the include files of my freshly built GTK
    (which is somewhat expected, as they are located "somewhere else" in my file system):

    error C1083: Cannot open include file: 'msvc_recommended_pragmas.h': No such file or directory

    The question now is: how can I set up the pip environment so that the includes and libs are found?

    stefanCCS
    @stefanCCS
    Well, with the help of Johannes I got the hint to use --global-option=build_ext --global-option="-I... ".
    Unfortunately, this appears not to be compatible with building "wheels".
    --> therefore, I will now stop my effort to get browse-ocrd to run "natively" on Windows 10.
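    For anyone who wants to pick this up: the full command would have looked roughly like this (the include/lib paths are hypothetical and depend on where gvsbuild installed GTK; --no-binary :all: is my guess at forcing a source build to sidestep the wheel issue):
    ```
    pip install browse-ocrd --no-binary :all: ^
        --global-option=build_ext ^
        --global-option="-IC:\gtk-build\gtk\x64\release\include" ^
        --global-option="-LC:\gtk-build\gtk\x64\release\lib"
    ```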
    Mike Gerber
    @mikegerber
    ...
      File "/home/noelte/ocrd_all/venv/lib/python3.6/site-packages/ocrd/decorators/__init__.py", line 81, in ocrd_cli_wrap_processor
        run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
      File "/home/noelte/ocrd_all/venv/lib/python3.6/site-packages/ocrd/processor/helpers.py", line 68, in run_processor
        processor.process()
      File "/home/noelte/ocrd_all/venv/lib/python3.6/site-packages/qurator/dinglehopper/ocrd_cli.py", line 37, in process
        gt_file = self.workspace.mets.find_files(fileGrp=gt_grp, pageId=page_id)[0]
    IndexError: list index out of range
    ocrd-dinglehopper assumes that you have files for every page id in both the GT group and in the OCR group
    Mike Gerber
    @mikegerber
    (ocrd-dinglehopper should issue a warning and skip a page if there is no matching GT or OCR file for a page. qurator-spk/dinglehopper#34)

    Confirmed. Dinglehopper does not use the standard pattern (established by @finkf in ocrd_cis and re-used in various processors) of searching for matching pageIds across input fileGrps, but strictly requires each pageId to be present in both input fileGrps, or it crashes.

    where do i find the "standard pattern"?

    Mike Gerber
    @mikegerber

    i'm debugging/reproducing an issue with the cuda docker images and i fail to get output from this processor run through ocrd process:

    % docker run --gpus all --rm -u `id -u` -v /tmp/actevedef_718448162.first-page+binarization+segmentation:/data -w /data -v /srv/data/qurator-data/calamari-models/GT4HistOCR/2019-07-22T15_49+0200/:/models -- ocrd/all:maximum-cuda nice -n 10 ocrd process -l DEBUG --overwrite 'calamari-recognize -I OCR-D-SEG-LINE-SBB -O OCR-D-OCR -P checkpoint /models/\*.ckpt.json'
    2020-10-16 14:43:10,731.731 DEBUG ocrd.resolver.workspace_from_url - Deriving dst_dir /data from /data/mets.xml
    2020-10-16 14:43:10,731.731 DEBUG ocrd.resolver.workspace_from_url - workspace_from_url
    mets_basename='mets.xml'
    mets_url='/data/mets.xml'
    src_baseurl='/data'
    dst_dir='/data'
    2020-10-16 14:43:10,731.731 DEBUG ocrd.resolver.download_to_directory - directory=|/data| url=|/data/mets.xml| basename=|mets.xml| if_exists=|skip| subdir=|None|
    2020-10-16 14:43:10,731.731 DEBUG ocrd.resolver.download_to_directory - Stop early, src_path and dst_path are the same: '/data/mets.xml' (url: '/data/mets.xml')
    2020-10-16 14:43:11,697.697 DEBUG ocrd.workspace_validator - input_file_grp=['OCR-D-SEG-LINE-SBB'] output_file_grp=[]
    2020-10-16 14:43:11,698.698 INFO ocrd.task_sequence.run_tasks - Start processing task 'calamari-recognize -I OCR-D-SEG-LINE-SBB -O OCR-D-OCR -p '{"checkpoint": "/models/*.ckpt.json", "voter": "confidence_voter_default_ctc", "textequiv_level": "line", "glyph_conf_cutoff": 0.001}''
    2020-10-16 14:43:11,698.698 DEBUG ocrd.processor.helpers.run_cli - Running subprocess 'ocrd-calamari-recognize --working-dir /data --mets mets.xml --log-level DEBUG --input-file-grp OCR-D-SEG-LINE-SBB --output-file-grp OCR-D-OCR --parameter {"checkpoint": "/models/*.ckpt.json", "voter": "confidence_voter_default_ctc", "textequiv_level": "line", "glyph_conf_cutoff": 0.001} --overwrite'
    2020-10-16 14:43:48,879.879 INFO ocrd.task_sequence.run_tasks - Finished processing task 'calamari-recognize -I OCR-D-SEG-LINE-SBB -O OCR-D-OCR -p '{"checkpoint": "/models/*.ckpt.json", "voter": "confidence_voter_default_ctc", "textequiv_level": "line", "glyph_conf_cutoff": 0.001}''
    2020-10-16 14:43:48,881.881 INFO ocrd.cli.process - Finished

    is the output logged somewhere? the ocrd-calamari-recognize call seems ok and should produce a lot of output

    (leaving ocrd process out gives me the expected output)
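    For reference, the direct invocation that does produce output would be roughly this (a sketch with the same mounts, bypassing ocrd process):
    ```
    docker run --gpus all --rm -u `id -u` \
      -v /tmp/actevedef_718448162.first-page+binarization+segmentation:/data -w /data \
      -v /srv/data/qurator-data/calamari-models/GT4HistOCR/2019-07-22T15_49+0200/:/models \
      -- ocrd/all:maximum-cuda \
      ocrd-calamari-recognize -m mets.xml -I OCR-D-SEG-LINE-SBB -O OCR-D-OCR \
        -p '{"checkpoint": "/models/*.ckpt.json"}'
    ```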
    Konstantin Baierer
    @kba
    STDOUT/STDERR of processors run with ocrd process is only logged when the task fails.
    This is a known issue (OCR-D/core#592), I'll investigate.
    Mike Gerber
    @mikegerber
    yeah, that should be fixed :) in this case, the processor runs without error, but i want the output anyway to investigate GPU usage etc.
    sgoettel
    @sgoettel
    Hey guys, some of you might know me already from several meetings. I'm Sebastian Göttel, working at the BBAW in Berlin. I'm currently working on a digitized cemetery record from a Jewish community in Germany. I've transcribed the whole record in Transkribus and exported it in ALTO format. Now my problem is that the generated METS file does not work with the DFG-Viewer. My main goal is to view the register in the DFG-Viewer, but I don't know how to generate a proper METS file. I've tried a lot of the recommended tools from https://www.loc.gov/, but none of them works properly. Kay told me there should be a way to make this work with the OCR-D workflow, so I've downloaded the core files, but I'm not sure which "tool" might be the right one to create a working METS file to view the pages in the DFG-Viewer.
    My ALTO and JPG files are stored under https://kaskade.dwds.de/~goettel/friedhofregister_der_juedischen_gemeinde/
    Would highly appreciate it if someone could help me out here.
    Kay-Michael Würzner
    @wrznr
    AFAIK @stweil managed to combine OCR-D workflows and DFG-Viewer output. I think the likely way is: 1. generate an empty METS with ocrd workspace init, 2. bulk-add the image and ALTO files, and 3. perform some magic to make the whole thing DFG-Viewer-compliant.
    The third step is where @stweil could be of help, I suppose.
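    A sketch of the first two steps (file names, IDs and MIME types are illustrative):
    ```
    # create an empty METS in the current directory
    ocrd workspace init .
    # add each image and its ALTO file under the same physical page ID
    for i in 0001 0002; do
      ocrd workspace add -G IMG -i IMG_$i -g PHYS_$i -m image/jpeg images/$i.jpg
      ocrd workspace add -G FULLTEXT -i FULLTEXT_$i -g PHYS_$i -m application/alto+xml alto/$i.xml
    done
    ```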
    sgoettel
    @sgoettel

    I've created a workspace with ocrd workspace init and installed ocrd-import via VIRTUAL_ENV; the generated METS included the images and XML files (although I'm not sure why you need PAGE-XML files instead of ALTO). It did work, though the result is still not accepted by the DFG-Viewer.

    Maybe I've done something wrong, not sure though. Whoever wants to give it a shot: my files are now stored under https://kaskade.dwds.de/~goettel/strelitz/ (PAGE-XML, ALTO, JPGs, everything)

    Stefan Weil
    @stweil
    @sgoettel, could you please add the generated METS file which fails to work, too?
    @wrznr, my "magic" is documented here: https://github.com/OCR-D/ocrd-website/wiki/How-to-create-ALTO-for-DFG-Viewer. It currently only covers the use case of adding ALTO to an existing (working) METS file.
    Uwe Hartwig
    @M3ssman
    @stweil @wrznr We've added more documentation on the ALTO topic: https://github.com/OCR-D/ocrd-website/wiki/How-to-create-searchable-Fulltext-Data-for-DFG-Viewer
    Uwe Hartwig
    @M3ssman
    @sgoettel First of all, using Transkribus' ALTO export is not a really good idea. Second: your file storage is missing a METS file, I only see a TEI file. If you manage to create an empty METS file, add the ALTO output to a fileGrp named "FULLTEXT", and also add these files to the METS physical structMap to link them to the proper images, that sounds quite good (regarding the METS part, and given that you do not need information about the logical structure of your document, which is where the navigation on the left in the DFG-Viewer comes from). I personally have never worked with empty METS files; we always have a full-blown MODS section integrated, with at least a descriptive and an administrative section. I cannot imagine that an empty METS/MODS file with only METS information will do the job, since, IIUC, the DFG-Viewer also requires some MODS metadata to be present. If you have no MODS data and are missing the logical structure, the DFG-Viewer does not even know, e.g., the title of the work to display.
    Uwe Hartwig
    @M3ssman
    For more information, please have a look at http://dfg-viewer.de/metadaten
    Regarding the docs, it looks like the METS file requires a MODS or a TEI section to be present. I have never worked with the latter, but off the top of my head you should add a section like <mets:tei> and put all your TEI stuff there, provided you also add the proper TEI namespace.
    sgoettel
    @sgoettel

    @sgoettel, could you please add the generated METS file which fails to work, too?

    Yes, I'm trying to generate it again, but I'm having trouble starting/using ocrd-import via env; I had the same problem yesterday and don't remember how I finally managed to make it work. I will try again during the day. And thanks for the reply!

    @sgoettel First of all, using Transkribus' ALTO export is not a really good idea.
    Okay, I didn't know that. My PAGE-XML files are also generated by Transkribus though; not sure if this is important to mention.

    Second: your file storage is missing a METS file, I only see a TEI file.
    Yeah, I didn't upload the one created by Transkribus since it basically contained nothing.

    Uwe Hartwig
    @M3ssman
    @sgoettel The PAGE exported by Transkribus is by default PAGE v2013 (rather old). OCR-D, for example, uses PAGE v2019. Depending on what you plan to do with the Transkribus PAGE data, you may run into trouble.
    Konstantin Baierer
    @kba
    The problem with Transkribus' PAGE-XML is not so much the namespace (sed -i 's,2013-07-15,2019-07-15,') but that they reuse the PAGE-XML namespace and add their own elements to it.
    sgoettel
    @sgoettel
    my TEI file is just for the "Deutsches Textarchiv"; the only reason I've generated PAGE-XML and ALTO is to create a METS
    Uwe Hartwig
    @M3ssman
    @sgoettel I guess you should try to use OCR-D as @wrznr suggests, to add the files to a rather empty METS (see the link from @stweil). If you manage this, you might try to edit the METS XML and put the TEI into place.
    Konstantin Baierer
    @kba
    here's a gist of how this can be achieved: https://gist.github.com/kba/407df1bf65577f1c85752b75d9c8a970
    Konstantin Baierer
    @kba
    FYI: @Witiko has documented his workflow, running OCR-D processors with GNU parallel and plotting the execution timing as a pie chart, in a gist: https://gist.github.com/Witiko/1f92c84b030f7ed2e5ff2b67a4710409
    Robert Sachunsky
    @bertsky

    @stweil yes, probably best to offer all those options in a single processor.
    @kba how about, instead of a generalised ocrd-tesserocr-segment, I simply extend ocrd-tesserocr-recognize such that, if no text regions whatsoever are present, it first does region segmentation, and if no text lines are present, it first does line segmentation? (Remember, we are already "polymorphic" on the input side below the line level.) So the textequiv_level parameter would denote the hierarchy level only on the output side, while on the input side any level would be allowed.

    @u-mierendorff I have created OCR-D/ocrd_tesserocr#158, which brings all-in-one segmentation and all-in-one segmentation+recognition (as well as a dedicated text style processor, fontshape, via WordFontAttributes on pre-LSTM models).
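    A sketch of what a call could then look like (parameter names here are tentative and may change; see the PR for the actual interface):
    ```
    # tentative: segment regions, lines and words, then recognize, in a single run
    ocrd-tesserocr-recognize -I OCR-D-IMG -O OCR-D-OCR \
      -p '{"model": "xxx", "segmentation_level": "region", "textequiv_level": "word"}'
    ```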

    Kay-Michael Würzner
    @wrznr
    FYI: In the context of the international Open Access Week 2020, the SLUB is organizing a small virtual workshop on Open Science topics. Those interested in joining may want to check https://www.slub-dresden.de/open-science/veranstaltungen/ for details.