Yup:
pad = self.parameter['padding']
border_polygon = polygon_from_bbox(min_x - pad, min_y - pad, max_x + pad, max_y + pad)
So try with -P padding 5 to pad with 5 pixels.
Great! I have a somewhat related question as well.
I get a significant difference in results from ocrd-anybaseocr-crop depending on whether or not I run ocrd-skimage-denoise-raw first. The image includes a small vertical band of text from the opposite page, skewed on account of the curve of that page. If I do not use ocrd-skimage-denoise-raw, that piece of the opposite page is kept when I crop; but if I use ocrd-skimage-denoise-raw, it is not.
I want it cropped out (plus some padding -- thanks, @kba!). But in other respects I don't want the output ocrd-skimage-denoise-raw gives. (It makes the strokes of glyphs jagged, for instance, in a way that leaving it out does not.)
So is there a way to get the crop I want but somehow "go back" to the image sans ocrd-skimage-denoise-raw? Or is there a better way to go about this altogether?
All the previous versions of the images are stored as pc:AlternativeImage elements in the PAGE-XML, each with a @comments attribute explaining the operation that created it. If you use the OCR-D/core workspace API, you can filter/select by these @comments, e.g. to get the pc:Border from ocrd-anybaseocr-crop but apply it to the original image (the one in pc:Page/@imageFilename).
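For illustration, a minimal sketch of that filtering with OCR-D/core's Workspace.image_from_page (its feature_selector/feature_filter arguments are matched against those @comments features); the mets.xml path, fileGrp name and output filenames here are only placeholders:
from ocrd import Resolver
from ocrd_modelfactory import page_from_file
from ocrd_utils import MIMETYPE_PAGE

# open the existing workspace (path is a placeholder)
workspace = Resolver().workspace_from_url('mets.xml')
for input_file in workspace.mets.find_files(fileGrp='OCR-D-CROP', mimetype=MIMETYPE_PAGE):
    pcgts = page_from_file(workspace.download_file(input_file))
    page = pcgts.get_Page()
    # select an image that has the Border crop applied ("cropped"),
    # but skip any AlternativeImage annotated as "despeckled" (the raw-denoised one)
    page_image, page_coords, _ = workspace.image_from_page(
        page, input_file.pageId,
        feature_selector='cropped',
        feature_filter='despeckled')
    page_image.save('cropped_%s.png' % input_file.pageId)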
Thanks @kba and @bertsky. The workflow -- the relevant part -- is just this:
ocrd-skimage-denoise-raw -I OCR-D-IMG -O OCR-D-DENOISER
ocrd-sbb-binarize -I OCR-D-DENOISER -O OCR-D-BIN -P model default
ocrd-anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP
ocrd-sbb-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P model default
(While I'm at it... I believe I picked up the idea for this last line -- a 2nd binarization -- from a workflow or other documentation. But I may have misunderstood something. I'm not sure how it could be helpful: I would have expected the result of cropping to already be a binary image. Could someone please set me straight on this?)
@SB2020-eye yes, a 2nd binarization has been recommended in our workflow guide (from experiences with rule-based binarization), but it has since been shown to be potentially detrimental (at least for rule-based binarization). You do need a 2nd binarization, though, if you want to remove the raw denoising after cropping. To make that happen, insert a step working on OCR-D-CROP to remove the despeckled AlternativeImage. Either directly on the PAGE-XML files of that fileGrp, e.g. via ...
xmlstarlet ed --inplace -N pc=http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 -d '/pc:PcGts/pc:Page/pc:AlternativeImage[contains(@comments,"despeckled")]' OCR-D-CROP/*.xml
or by an extra processor, e.g. ...
ocrd-preprocess-image -I OCR-D-CROP -O OCR-D-CROP2 -P input_feature_filter despeckled -P output_feature_added none -P command 'cp @INFILE @OUTFILE'
(and then replace -I OCR-D-CROP with -I OCR-D-CROP2 in the 2nd binarization).
When I run ocrd-sbb-binarize -I OCR-D-SEG-REPAIRO -O OCR-D-BIN-REGO1 -P model default -P operation_level region, I get:
UnboundLocalError: local variable 'page_image' referenced before assignment
Here's more of the script, in case it helps:
ocrd-tesserocr-segment-region -I OCR-D-DESKEWBi -O OCR-D-SEG-REGO -P find_tables false -P shrink_polygons true
ocrd-segment-repair -I OCR-D-SEG-REGO -O OCR-D-SEG-REPAIRO -P plausibilize true
ocrd-sbb-binarize -I OCR-D-SEG-REPAIRO -O OCR-D-BIN-REGO1 -P model default -P operation_level region
ocrd-pc-segmentation -I OCR-D-DESKEWBi -O OCR-D-SEG-REGP
The error message includes 2021-03-19 20:05:11.685318: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory. Is this something about not having a GPU? I'm using Ubuntu on Windows/WSL.
Thanks for this guidance on crops, @bertsky!
UnboundLocalError: local variable 'page_image' referenced before assignment
:blush: That is a bug in the OCR-D bindings, qurator-spk/sbb_binarization#27
tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
That is not the root cause, it's just a warning that you don't have GPU support set up. Can you post here/DM me the full error stacktrace?
@kba re: bug -- thanks! (almost wrote "yay" -- I was afraid I was getting to a dead end -- but I won't "yay" a bug. Just good to know. :smile: )
And re: full error -- yes. DM-ing now.
It is fixed in master now, so if you want to test before the next ocrd_all release, feel free.
I've been working on converting Abbyy XML to hOCR (preserving as much as possible -- individual characters, confidence levels, picture block types, etc.), but I'm running into a problem with tables. After reading https://groups.google.com/g/ocropus/c/-s33xn9fBGY I assumed that <tr> and <td> would indeed be allowed inside a <table class="ocr_table">. So I implemented this, but now the validator from ocr-fileformat gives:
lxml.etree.XMLSyntaxError: Unexpected end tag : tr, line 1591, column 10
Does anyone have some insight into how I am supposed to write hOCR tables?
I am running:
ocrd-tesserocr-recognize -P segmentation_level none -P textequiv_level line -I SEG -O OCR
ocrd-tesserocr-recognize -I OCR -O GLYPH2 -P textequiv_level glyph -P segmentation_level word
I have some questions about my results and wanted to see if anyone can assist:
What is the difference between a "model" and a "ground truth dataset usable for OCR training and evaluation"? (I'm looking at the link to @cneud's datasets found on the ocrd-website Wiki.)
"model" is the result of training, while "ground truth dataset" is the training input
Is there any way to manually alter OCR-D results -- like add a missed word, for example? (I guess this would have to be through some kind of interface/GUI?)
You can edit the PAGE-XML with a text editor or with a GUI like Transkribus or Aletheia.
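If you'd rather script such corrections, here is a minimal sketch using the PAGE bindings from ocrd_models; the file path and the corrected word are placeholders, and it only touches the line-level TextEquiv (any word/glyph-level entries would stay out of sync):
from ocrd_models.ocrd_page import parse, to_xml

# parse one PAGE-XML file from an OCR output fileGrp (path is a placeholder)
pcgts = parse('OCR-D-OCR/OCR-D-OCR_0001.xml', silence=True)
page = pcgts.get_Page()
for region in page.get_TextRegion():
    for line in region.get_TextLine():
        textequivs = line.get_TextEquiv()
        if not textequivs:
            continue
        # example correction: replace a misrecognized word in the line text
        text = textequivs[0].get_Unicode()
        textequivs[0].set_Unicode(text.replace('teh', 'the'))
with open('OCR-D-OCR/OCR-D-OCR_0001.xml', 'w', encoding='utf-8') as f:
    f.write(to_xml(pcgts))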
Hi. I'm trying out Photoshop's new Super Resolution to try to get better slices of individual glyphs from my original image. Of course, the files are huge.
Is there any way to work around my computer not having enough memory when running sbb_binarization on CPU (I don't have an external GPU), even if it takes a long time to get the result? (Killed process below:)
ocrd-sbb-binarize -I OCR-D-IMG -O OCR-D-BIN -P model default
11:18:41.752 INFO processor.SbbBinarize - INPUT FILE 0 / Folio_073r-Enhanced2x
/home/scott/src/github/OCR-D/ocrd_all/venv/local/sub-venv/headless-tf1/lib/python3.6/site-packages/PIL/Image.py:2850: DecompressionBombWarning: Image size (122623200 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack.
DecompressionBombWarning,
11:18:53.934 INFO processor.SbbBinarize - Binarizing on 'page' level in page 'Folio_073r-Enhanced2x'
11:18:55.270 INFO processor.SbbBinarize - Predicting with model /home/scott/.local/share/ocrd-resources/ocrd-sbb-binarize/default/model_bin2.h5 [1/4]
/home/scott/src/github/OCR-D/ocrd_all/venv/bin/ocrd-sbb-binarize: line 2: 20469 Killed /home/scott/src/github/OCR-D/ocrd_all/venv/local/sub-venv/headless-tf1/bin/ocrd-sbb-binarize "$@"
New models for Tesseract: https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/.
The training was run from scratch on cleaned ground truth from GT4HistOCR, AustrianNewspapers and Fibeln. It used tesstrain with the latest default network specification, which results in smaller and faster networks, so OCR is faster, too.
Hi. I just want to double-check something.
For the sbb_binarization models found here, is the one labeled "2020-01-16" the one that is called "default" in OCR-D?