I wonder whether SetBoundingBoxComponents(self, bool include_upper_dots, bool include_lower_dots) would help in your case, though. Can you try and add tessapi.SetBoundingBoxComponents(True, True) at the top of recognize's page loop and re-run your example?
dpi setting give any improvements?)
But without knowing your setup it is hard to give recommendations. Best to file an issue in the tesstrain repo.
Issue created: (tesseract-ocr/tesstrain#210)
We have just released a new version of ocrd_all with major changes in ocrd_tesserocr and ocrd_cis. In this context, we have also revised (OCR-D/ocrd-website#189) the workflow guide on OCR-D's homepage.
Notably, ocrd_tesserocr has been refactored so that all the segmentation processors delegate to ocrd-tesserocr-recognize and behavior is controlled by the parameters segmentation_level, textequiv_level and model. See the workflow guide's section on region segmentation, the https://github.com/OCR-D/tesserocr README and the --help output for details.
This allows you to use Tesseract for region and line segmentation (and even deskewing and recognition) in one step, reducing the problem of overlap and loss of precision between API calls when running in multiple separate workflow steps. For single-step multi-level segmentation there is a new shorthand processor ocrd-tesserocr-segment, but the old single-level processors still exist, too.
To better address the problem of overlap (even in single-level segmentation), the new option shrink_polygons=true allows annotating tight hull polygons for all new segments instead of coarse bounding boxes.
ocrd-tesserocr-recognize now also exposes all internal Tesseract variables in a new tesseract_parameters dict.
There is one backwards incompatibility, though: overwrite_words has been renamed and generalized to overwrite_segments.
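As a rough illustration of the parameters mentioned above, here is a minimal Python sketch that assembles a parameter set for ocrd-tesserocr-recognize and renames the old overwrite_words key to overwrite_segments. The file group names (OCR-D-BIN, OCR-D-OCR), the model name Fraktur, the chosen level values and the example Tesseract variable are assumptions for illustration only, and the helper functions are hypothetical; check the --help output for the values your installation actually accepts.

```python
import json
import shlex


def migrate_params(params):
    """Return a copy where the old overwrite_words key is renamed to overwrite_segments."""
    migrated = dict(params)
    if "overwrite_words" in migrated:
        old_value = migrated.pop("overwrite_words")
        # Keep an explicitly given overwrite_segments value if both keys are present.
        migrated.setdefault("overwrite_segments", old_value)
    return migrated


def recognize_command(input_grp, output_grp, params):
    """Build the argument list for one ocrd-tesserocr-recognize call (hypothetical helper)."""
    return [
        "ocrd-tesserocr-recognize",
        "-I", input_grp,
        "-O", output_grp,
        "-p", json.dumps(migrate_params(params), sort_keys=True),
    ]


# Example values are assumptions, not recommendations.
params = {
    "segmentation_level": "region",   # segment from regions downwards
    "textequiv_level": "line",        # annotate text results at line level
    "model": "Fraktur",               # assumed model name
    "shrink_polygons": True,          # tight hull polygons instead of bounding boxes
    "overwrite_words": True,          # old parameter name, renamed by migrate_params
    "tesseract_parameters": {"preserve_interword_spaces": "1"},
}

cmd = recognize_command("OCR-D-BIN", "OCR-D-OCR", params)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell inside a workspace directory; the script itself only builds the command and does not run it.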
Another new processor is ocrd-tesserocr-fontshape, which uses Tesseract's pre-LSTM models to detect font styles like italic, bold and more - very useful for works where font style conveys semantics, like dictionaries or thesauri.
While ocrd-typegroups-classifier, a processor that can predict a wide variety of (historical) fonts/types including confidence for a page, has been part of ocrd_all for a long time, it is now also documented in the workflow guide.
The second major update is cisocrgroup/ocrd_cis#77, which brings not only lots of fixes for robustness and speed, but also improves the incremental behaviour of segmentation, makes clipping and resegmentation optimise globally instead of locally, and allows a page level resegmentation (reassigning conflicting text lines between overlapping regions).
Further, ocrd-segment-repair now finally tries to fix PAGE invalidities and inconsistencies automatically.
Last but not least, we have revised the workflow recommendations, which now offer a single-processor workflow as a benchmark test for more involved workflows, and contain improved/simplified configurations for the "best" and "fast" workflows.
I wonder whether SetBoundingBoxComponents(self, bool include_upper_dots, bool include_lower_dots) would help in your case, though. Can you try and add tessapi.SetBoundingBoxComponents(True, True) at the top of recognize's page loop and re-run your example?
You mean to recognize.py? And if so, where exactly? changing dpi did not help essentially.
changing dpi did not help essentially.
too bad! But thanks for letting me know.
ocrd-tesserocr fares quite well, so this is not such a huge problem. Only that I've grown to like ocrd-cis-ocropy-segment because of its spread feature.
dpi override here?
spread (i.e. tight polygonal fit): you can also run Tesseract with shrink_polygons=true now to get at least some approximate polygonal hull, and you could also combine Tesseract with Ocropy by using ocrd-cis-ocropy-resegment afterwards.
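To make that combination concrete, here is a small Python sketch that chains a Tesseract segmentation step with shrink_polygons=true into an Ocropy resegmentation step, feeding each output file group into the next step. Which Tesseract processor to use (ocrd-tesserocr-segment here), the file group names and the helper function are assumptions for illustration; it only prints the command lines and does not run them.

```python
import json
import shlex


def chain_steps(first_input, steps):
    """Turn (processor, output_grp, params) tuples into command lines,
    using each step's output file group as the next step's input."""
    commands = []
    input_grp = first_input
    for processor, output_grp, params in steps:
        cmd = [processor, "-I", input_grp, "-O", output_grp]
        if params:
            # Only pass -p when there are parameters to set.
            cmd += ["-p", json.dumps(params)]
        commands.append(cmd)
        input_grp = output_grp
    return commands


# File group names are placeholders, not a recommended naming scheme.
steps = [
    ("ocrd-tesserocr-segment", "OCR-D-SEG", {"shrink_polygons": True}),
    ("ocrd-cis-ocropy-resegment", "OCR-D-RESEG", {}),
]

commands = chain_steps("OCR-D-BIN", steps)
for cmd in commands:
    print(shlex.join(cmd))
```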
OCR-D is already a dream come true for me after many painful years with ABBYY Finereader. Here are some samples from the dictionaries in question: pashto-russian, khmer-japanese.
-p '{"dpi":xx}' even to the point that some regions became too small to line-segment did not inhibit the incorrect segmentation.
vscale parameter (a factor for various filters in vertical direction) to control this trade-off with a-priori knowledge.
GenericVector and STRING and replace them by standard types before tagging Tesseract 5.0.0. I already suggested tagging new alpha versions for Tesseract 5. Would that help?
TesTea
Not a wise choice of abbreviation for an English-speaking audience ;-)
Many of you have probably already left for your well-deserved Christmas holidays. Therefore, we've decided to suspend next Wednesday's regular open TechCall, where various issues and topics of general interest are discussed. Instead, we will offer you the opportunity to discuss topics or current problems that are probably only of interest to you and not to the whole group.
If you are interested in this more individual meeting, feel free to join us on Wednesday, 23.12., 2-3 pm in the usual room (https://meet.gwdg.de/b/eli-ufa-unu).
The next regular open TechCall will then take place in the new year on 13.1.
Until then, we wish you a Merry Christmas and a Happy New Year!
tesserocr. I estimate that it will take some days to get a usable git master again.
ocrd workspace rename-group OCR-D/core#655
OcrdPage.get_AllAlternativeImages OCR-D/core#654