    SB2020-eye
    @SB2020-eye
    Is there a way to add padding to ocrd-anybaseocr-crop so that a little extra is kept around the edges?
    Konstantin Baierer
    @kba
    There's the parameter padding for this IIRC

    Yup:

            pad = self.parameter['padding']
            border_polygon = polygon_from_bbox(min_x - pad, min_y - pad, max_x + pad, max_y + pad)

    So try with -P padding 5 to pad with 5 pixels.
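For illustration, the effect of that `padding` parameter on the crop box can be sketched like this (`pad_bbox` is a hypothetical helper on plain tuples, not part of OCR-D):

```python
# Sketch of the padding arithmetic from the snippet above, using a
# plain (min_x, min_y, max_x, max_y) tuple instead of OCR-D's types.

def pad_bbox(bbox, pad):
    """Grow a bounding box by `pad` pixels on every side."""
    min_x, min_y, max_x, max_y = bbox
    return (min_x - pad, min_y - pad, max_x + pad, max_y + pad)

print(pad_bbox((100, 200, 400, 600), 5))  # -> (95, 195, 405, 605)
```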

    SB2020-eye
    @SB2020-eye

    Great! I have a somewhat related question as well.

    I get a significant difference in results from ocrd-anybaseocr-crop depending on whether or not I run ocrd-skimage-denoise-raw first. The image includes a small vertical band of text from the opposite page, skewed on account of the curve of that page. If I do not run ocrd-skimage-denoise-raw, the piece of the opposite page is kept when I crop; if I do, it is not.

    I want it cropped out (plus some padding -- thanks, @kba!). But for other aspects, I don't want the output ocrd-skimage-denoise-raw gives. (It makes the strokes of glyphs jagged, for instance, in a way that leaving it out does not.)

    So is there a way to get the crop I want but somehow "go back" to the image sans ocrd-skimage-denoise-raw? Or is there a better way to go about this altogether?

    Konstantin Baierer
    @kba

    So is there a way to get the crop I want but somehow "go back" to the image sans ocrd-skimage-denoise-raw? Or is there a better way to go about this altogether?

    All the previous versions of the images are stored as pc:AlternativeImage in the PAGE-XML, with a @comments attribute explaining the operation that created each one. If you use the OCR-D/core workspace API, you can filter/select by these @comments to get e.g. the pc:Border from ocrd-anybaseocr-crop but apply it to the original image (the one in pc:Page/@imageFilename).
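To illustrate the idea (not the OCR-D workspace API itself, which does this selection for you): with the stdlib XML parser you can filter pc:AlternativeImage entries by their @comments provenance. The PAGE snippet below is made up.

```python
# Illustration only: selecting pc:AlternativeImage entries by their
# @comments attribute. In real workflows the OCR-D/core workspace API
# performs this filtering; the sample document here is invented.
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

PAGE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="OCR-D-IMG/page1.png">
    <AlternativeImage filename="OCR-D-DENOISER/page1.png" comments="despeckled"/>
    <AlternativeImage filename="OCR-D-CROP/page1.png" comments="cropped"/>
  </Page>
</PcGts>"""

root = ET.fromstring(PAGE)
# keep only derived images whose provenance does NOT include denoising
clean = [img.get("filename")
         for img in root.findall(".//pc:AlternativeImage", NS)
         if "despeckled" not in (img.get("comments") or "")]
print(clean)  # -> ['OCR-D-CROP/page1.png']
```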

    Robert Sachunsky
    @bertsky
    @SB2020-eye adding to @kba's answer, yes, you can even do that on the user side (just by configuration, not necessarily programming), but that depends on your workflow. If you shared it, then I could give advice...
    SB2020-eye
    @SB2020-eye

    Thanks @kba and @bertsky . The workflow -- the relevant part -- is just this:

    ocrd-skimage-denoise-raw -I OCR-D-IMG -O OCR-D-DENOISER
    ocrd-sbb-binarize -I OCR-D-DENOISER -O OCR-D-BIN -P model default
    ocrd-anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP
    ocrd-sbb-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P model default

    (While I'm at it... I believe I picked up the idea for this last line -- a 2nd binarization -- from a workflow or other documentation. But I may have misunderstood something. I'm not sure how it could be helpful: I would have expected the result of the crop to already be a binary image. Could someone please set me straight on this?)

    Robert Sachunsky
    @bertsky

    @SB2020-eye yes, 2nd binarization has been recommended in our workflow guide (from experiences with rule-based binarization), but it has since been shown to be potentially detrimental (at least for rule-based binarization). You need a 2nd binarization though, if you want to remove the raw denoising after cropping. To make that happen, insert a step working on OCR-D-CROP to remove the despeckled AlternativeImage. Either directly on the PAGE-XML files of that fileGrp, e.g. via ...

    xmlstarlet ed --inplace -N pc=http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 -d '/pc:PcGts/pc:Page/pc:AlternativeImage[contains(@comments,"despeckled")]' OCR-D-CROP/*.xml

    or by an extra processor, e.g. ...

    ocrd-preprocess-image -I OCR-D-CROP -O OCR-D-CROP2 -P input_feature_filter despeckled -P output_feature_added none -P command 'cp @INFILE @OUTFILE'

    (and then replace -I OCR-D-CROP with -I OCR-D-CROP2 in the 2nd binarization).
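For anyone who prefers Python over xmlstarlet, the same deletion can be sketched with the stdlib parser (the sample document below is made up; in practice you would loop over OCR-D-CROP/*.xml and write each file back):

```python
# A stdlib-Python equivalent of the xmlstarlet deletion above:
# drop every pc:AlternativeImage whose @comments mentions "despeckled".
import xml.etree.ElementTree as ET

PCNS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
ET.register_namespace("pc", PCNS)

doc = ET.fromstring(f"""<pc:PcGts xmlns:pc="{PCNS}">
  <pc:Page>
    <pc:AlternativeImage comments="despeckled"/>
    <pc:AlternativeImage comments="cropped"/>
  </pc:Page>
</pc:PcGts>""")

for page in doc.findall(f"{{{PCNS}}}Page"):
    for img in list(page.findall(f"{{{PCNS}}}AlternativeImage")):
        if "despeckled" in (img.get("comments") or ""):
            page.remove(img)

remaining = [img.get("comments") for img in doc.iter(f"{{{PCNS}}}AlternativeImage")]
print(remaining)  # -> ['cropped']
```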

    SB2020-eye
    @SB2020-eye

    When I run ocrd-sbb-binarize -I OCR-D-SEG-REPAIRO -O OCR-D-BIN-REGO1 -P model default -P operation_level region, I get

    UnboundLocalError: local variable 'page_image' referenced before assignment

    Here's more of the script, in case it helps:

    ocrd-tesserocr-segment-region -I OCR-D-DESKEWBi -O OCR-D-SEG-REGO -P find_tables false -P shrink_polygons true
    ocrd-segment-repair -I OCR-D-SEG-REGO -O OCR-D-SEG-REPAIRO -P plausibilize true
    ocrd-sbb-binarize -I OCR-D-SEG-REPAIRO -O OCR-D-BIN-REGO1 -P model default -P operation_level region
    SB2020-eye
    @SB2020-eye
    Also getting a long error when I run ocrd-pc-segmentation -I OCR-D-DESKEWBi -O OCR-D-SEG-REGP. The error message includes 2021-03-19 20:05:11.685318: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory. Is this something about not having a GPU? I'm using Ubuntu on Windows/WSL.
    SB2020-eye
    @SB2020-eye

    @SB2020-eye yes, 2nd binarization has been recommended in our workflow guide [...] (and then replace -I OCR-D-CROP with -I OCR-D-CROP2 in the 2nd binarization).

    Thanks for this guidance on crops, @bertsky!

    Konstantin Baierer
    @kba

    UnboundLocalError: local variable 'page_image' referenced before assignment

    :blush: That is a bug in the OCR-D bindings, qurator-spk/sbb_binarization#27

    tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory

    That is not the root cause, it's just a warning that you don't have GPU support set up. Can you post here/DM me the full error stacktrace?

    SB2020-eye
    @SB2020-eye

    @kba re: bug -- thanks! (almost wrote "yay" -- I was afraid I was getting to a dead end -- but I won't "yay" a bug. Just good to know. :smile: )

    And re: full error -- yes. DM-ing now.

    Konstantin Baierer
    @kba

    @kba re: bug -- thanks! [...]

    It is fixed in master now, so if you want to test before the next ocrd_all release, feel free.

    Merlijn B. W. Wajer
    @MerlijnWajer_gitlab

    I've been working on converting Abbyy XML to hOCR (preserving as much as possible - individual character, confidence levels, picture block types, etc), but I'm running into a problem with tables. After reading https://groups.google.com/g/ocropus/c/-s33xn9fBGY I assumed that <tr> and <td> would indeed be allowed inside a <table class="ocr_table">. So I implemented this, but now the validator from ocr-fileformat gives:

    lxml.etree.XMLSyntaxError: Unexpected end tag : tr, line 1591, column 10

    Does anyone have some insight into how I am supposed to write hOCR tables?

    Konstantin Baierer
    @kba
    Can you post an example where the validator (hocr-spec-python?) fails? This looks like a genuine XML error (unbalanced opening/closing tags) but it might also be an oversight in the validator :grimacing:
    Merlijn B. W. Wajer
    @MerlijnWajer_gitlab
    Argh... You're right, it was just a stupid mistake on my side. I was writing <tr> for the table cells, not <td>... Sorry to waste your time.
    In any case, I'm still working on the code (so will cleanup & add doc this week), and was going to ask folks to give it a try (or hard look) if they want to, but here's the code if anyone is interested right now: https://git.archive.org/merlijn/archive-hocr-tools/-/blob/abbyy/bin/abbyy-to-hocr
    Konstantin Baierer
    @kba
    :tada: awesome, I'll have a look. We should integrate this into https://github.com/UB-Mannheim/ocr-fileformat
    Merlijn B. W. Wajer
    @MerlijnWajer_gitlab
    It's not production ready, but when it is, I think that would be great. I've tried hard to preserve as much as possible and also added baseline calculation based on least-squares (abbyy baselines make no sense to me), and in general aim to mimic the output that Tesseract generates. I believe the existing converter does lose a fair amount of information, and I had trouble figuring out how to improve it. (We are thinking of converting ~22 million Abbyy documents to hOCR, so we'll want to make sure we do it right the first time around :-) )
    mittagessen
    @mittagessen
    @MerlijnWajer_gitlab IIRC AbbyyXML baseline is just the y-coordinate. It assumes a horizontal line, so it's fairly useless for most purposes.
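The least-squares baseline mentioned above can be sketched as an ordinary line fit through the bottom points of the character boxes (the coordinates here are invented; no claim is made about the actual archive-hocr-tools code):

```python
# Sketch: fit a baseline y = m*x + b through the bottom-centre points
# of character bounding boxes by ordinary least squares (closed form).

def fit_baseline(points):
    """Ordinary least-squares line through (x, y) points."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
    return m, b

# three character bottoms lying exactly on y = 0.5*x + 10
m, b = fit_baseline([(0, 10), (10, 15), (20, 20)])
print(round(m, 3), round(b, 3))  # -> 0.5 10.0
```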
    SB2020-eye
    @SB2020-eye

    I am running:

    ocrd-tesserocr-recognize -P segmentation_level none -P textequiv_level line -I SEG -O OCR
    ocrd-tesserocr-recognize -I OCR -O GLYPH2 -P textequiv_level glyph -P segmentation_level word

    I have some questions about my results and wanted to see if anyone can assist:

    1. Is there a way to extend ("pad"?) lines?
      (I have some lines broken into two (but definitely on the same horizontal). And I have more than one line slightly cutting off part of a glyph.)
    2. Looking at an xml file in PageViewer, as I go from lines to words, why would there be some words with no boxes around them?
      (I briefly hypothesized, "Maybe it thinks it's blank space?" But that hypothesis doesn't hold, because there is an instance in which the first word in a line is left out (i.e., no box around it). So the command recognized something there when run at the line level, but it didn't when run at the word level. (Didn't it?))
      (A similar phenomenon seems to happen going from word to glyph level.)
    3. Why are the boxes covering words and glyphs often cutting off their top halves (and going down into the blank space below them)?
      (I have seen this behavior in other results, too.)
    SB2020-eye
    @SB2020-eye
    Untitled.jpg
    SB2020-eye
    @SB2020-eye
    (1. 2nd line -- example of a split line)
    (2. 9th line -- example of a line (this one also split) in which there is a word with no box around it)
    (and 2nd line, 3rd word -- example of a word in which there are glyphs with no boxes around them)
    Untitled2.jpg
    SB2020-eye
    @SB2020-eye
    (3. example of the tops of words and glyphs being cut off -- the line level catches the entirety of words and glyphs, plus padding on top and bottom; the word and glyph levels cut words and glyphs in half along the horizontal, yet include all the bottom padding)
    (My apologies that you can't see the words themselves and have to go on my description; the image is not open-source.)
    Kuldeep Pal
    @kuldeep27396
    Hello guys,
    I need help with OCR. I have used a pre-trained model for layout analysis; it detects paragraph text, tables, and images in an image. But now I want to run OCR on those detected regions. How do I do that? Is there anyone who can help me with that?
    Elisabeth Engl
    @EEngl52
    @/all this Wednesday, 2-3pm is our next open TechCall. Feel free to join us in https://meet.gwdg.de/b/eli-ufa-unu if you are interested in the following topics:
    João Macedo
    @joaossmacedo
    I'm having some issues downloading sbb_textline_detection. I've tried to run "pip install ." inside a clone of https://github.com/qurator-spk/sbb_textline_detection.git . Do I need to add it to PATH or something like that? I'm using macOS
    Clemens Neudecker
    @cneud
    I am not sure anyone has experience with sbb_textline_detection on MacOS. Which version of Python/pip are you using?
    Konstantin Baierer
    @kba

    I've tried to run "pip install ." inside a clone of

    Did that succeed? Or was there an error? Are you using a venv or conda (which you should, it makes life easier)?

    SB2020-eye
    @SB2020-eye
    What is the difference between a "model" and a "ground truth dataset usable for OCR training and evaluation"? (I'm looking at the link to @cneud 's datasets which is found on the ocrd-website Wiki.)
    SB2020-eye
    @SB2020-eye
    Is there any way to manually alter OCR-D results -- like add a missed word, for example? (I guess this would have to be through some kind of interface/GUI?)
    Konstantin Baierer
    @kba

    What is the difference between a "model" and a "ground truth dataset usable for OCR training and evaluation"? (I'm looking at the link to @cneud 's datasets which is found on the ocrd-website Wiki.)

    "model" is the result of training, while "ground truth dataset" is the training input

    Is there any way to manually alter OCR-D results -- like add a missed word, for example? (I guess this would have to be through some kind of interface/GUI?)

    You can edit the PAGE-XML with a text editor or with a GUI like Transkribus or Aletheia
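Editing the XML directly can be as simple as inserting a missing Word element. A minimal sketch (stdlib only, with the element structure simplified from real PAGE-XML, which also needs namespaces, ids, and Coords):

```python
# Hypothetical, simplified patch: insert a missed word into a line.
# Real PAGE-XML additionally requires the pc: namespace, @id
# attributes, and pc:Coords; this only shows the mechanics.
import xml.etree.ElementTree as ET

line = ET.fromstring(
    "<TextLine>"
    "<Word><TextEquiv><Unicode>quick</Unicode></TextEquiv></Word>"
    "<Word><TextEquiv><Unicode>fox</Unicode></TextEquiv></Word>"
    "</TextLine>")

# build the missing word and splice it in between the existing two
word = ET.Element("Word")
ET.SubElement(ET.SubElement(word, "TextEquiv"), "Unicode").text = "brown"
line.insert(1, word)

print([w.findtext("TextEquiv/Unicode") for w in line.findall("Word")])
# -> ['quick', 'brown', 'fox']
```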

    SB2020-eye
    @SB2020-eye
    Thx, @kba!
    SB2020-eye
    @SB2020-eye

    Hi. I'm trying out Photoshop's new Super Resolution to try to get better slices of individual glyphs from my original image. Of course, the files are huge.

    Is there any way to get around my computer not having enough memory (CPU -- I don't have an external GPU) using sbb_binarization (even if it takes a long time to get the result)? (Killed process below:)

     ocrd-sbb-binarize -I OCR-D-IMG -O OCR-D-BIN -P model default
    11:18:41.752 INFO processor.SbbBinarize - INPUT FILE 0 / Folio_073r-Enhanced2x
    /home/scott/src/github/OCR-D/ocrd_all/venv/local/sub-venv/headless-tf1/lib/python3.6/site-packages/PIL/Image.py:2850: DecompressionBombWarning: Image size (122623200 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack.
      DecompressionBombWarning,
    11:18:53.934 INFO processor.SbbBinarize - Binarizing on 'page' level in page 'Folio_073r-Enhanced2x'
    11:18:55.270 INFO processor.SbbBinarize - Predicting with model /home/scott/.local/share/ocrd-resources/ocrd-sbb-binarize/default/model_bin2.h5 [1/4]
    /home/scott/src/github/OCR-D/ocrd_all/venv/bin/ocrd-sbb-binarize: line 2: 20469 Killed                  /home/scott/src/github/OCR-D/ocrd_all/venv/local/sub-venv/headless-tf1/bin/ocrd-sbb-binarize "$@"
    mittagessen
    @mittagessen
    Run the algorithms at original scale and interpolate the coordinates onto the output of the superresolution method?
    Alternatively, if you're really desperate plop in a swap file. But it will be extremely slow.
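The first suggestion (run at original scale, then map coordinates onto the super-resolved image) amounts to a uniform scaling of the polygon points. A sketch under that assumption, with made-up coordinates and a hypothetical helper name:

```python
# Sketch: segment at original resolution, then project the resulting
# polygon coordinates onto the 2x super-resolved image.

def scale_polygon(polygon, factor):
    """Scale (x, y) coordinate pairs by a uniform factor."""
    return [(round(x * factor), round(y * factor)) for x, y in polygon]

region = [(10, 20), (300, 20), (300, 450), (10, 450)]
print(scale_polygon(region, 2))
# -> [(20, 40), (600, 40), (600, 900), (20, 900)]
```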
    Stefan Weil
    @stweil

    New models for Tesseract: https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/.

    The training was run from scratch on cleaned ground truth from GT4HistOCR, AustrianNewspapers and Fibeln. It used tesstrain with the latest default network specification which results in smaller and faster networks, so OCR is faster, too.

    Uwe Hartwig
    @M3ssman
    @stweil awesome! Is this new version usable with both Tesseract 4.1.1 and the latest master build?
    Stefan Weil
    @stweil
    Yes, those LSTM models work with all Tesseract releases >= 4.0.
    Elisabeth Engl
    @EEngl52
    @/all this Wednesday, 2-3pm is our next open TechCall. Feel free to join us in https://meet.gwdg.de/b/eli-ufa-unu if you are interested in the following topics:
    Robert Sachunsky
    @bertsky
    Does anyone know of any spell checkers for historic (pre-1901) German – regardless how crude – perhaps even in hunspell/myspell/aspell/ispell format?
    SB2020-eye
    @SB2020-eye

    Hi. I just want to double-check something.

    For the sbb_binarization models found here, is the one labeled "2020-01-16" the one that is called "default" in OCR-D?

    Konstantin Baierer
    @kba
    Correct. The newest model is called default-2021-03-09 for the OCR-D/core resource manager.
    Konstantin Baierer
    @kba
    Anyone has experience with https://github.com/JaidedAI/EasyOCR ?