    Johannes Künsebeck
    @hnesk
    Thanks for the feedback. I will try to wrap this as an OCR-D processor. I'm not sure if this will be very reusable: the de-keystoning/page-splitting process with voussoir depends on markers to find the correct perspective transform, so this is a very specialized processing step and not really a general-purpose perspective correction tool.
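    (For illustration only: marker-based rectification of this kind essentially boils down to a four-point perspective transform, e.g. with OpenCV. The marker coordinates and output size below are made up, and voussoir's actual pipeline differs.)

    import cv2
    import numpy as np

    # Hypothetical pixel positions of four corner markers found on the scan
    # (in the real tool these would come from its marker-detection step).
    src = np.float32([[120, 95], [2310, 130], [2295, 3180], [105, 3150]])

    # Target rectangle the page should be mapped onto (here: 2200 x 3000 px).
    dst = np.float32([[0, 0], [2200, 0], [2200, 3000], [0, 3000]])

    img = cv2.imread("scan.tif")
    M = cv2.getPerspectiveTransform(src, dst)           # 3x3 homography
    rectified = cv2.warpPerspective(img, M, (2200, 3000))
    cv2.imwrite("scan_dekeystoned.tif", rectified)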
    Stefan Weil
    @stweil

    page-splitting or de-keystoning are outside the scope of what we develop at OCR-D in an official capacity

    What about 19th century newspaper scans from microfilm? I think they are part of OCR-D, and they are double sided.

    Elisabeth Engl
    @EEngl52
    Newspapers (irrespective of their century) are not part of OCR-D! OCR-D focuses on VD materials, which don't include newspapers. And I can't think of a reason why the VD should change their scope of materials for a VD19, whenever that might be begun...
    Uwe Hartwig
    @M3ssman
    @stweil ULB is using a microfilm scanner together with the proprietary software stack "QuantumScan" / "QuantumProcess", where the former coordinates the scanner machine and the latter takes care of page-splitting (as part of the preprocessing).
    Stefan Weil
    @stweil
    So your primary images - the microfilm images - are also double sided. We know from our Reichsanzeiger that many pages can easily be split, but there is a certain percentage of scans where this is really difficult (no gap, or too small a gap, between the left and right page). Therefore I would not trust a proprietary software stack to always do it perfectly.
    Uwe Hartwig
    @M3ssman
    @stweil We have qualified staff who supervise this process, correct frames and the like.
    Stefan Weil
    @stweil

    Newspapers (irrespective of their century) are not part of OCR-D! OCR-D focuses on VD materials, which don't include newspapers. And I can't think of a reason why the VD should change their scope of materials for a VD19, whenever that might be begun...

    The DFG mentions the 19th century explicitly, too: "Die zu entwickelnden Lösungen sollen eine Volltextdigitalisierung von Druckwerken des 19. Jahrhunderts ebenfalls einbeziehen" ["The solutions to be developed shall also include full-text digitization of printed works of the 19th century"]. OCR-D is not restricted to VD material (although that is the main focus). And isn't a newspaper printed, too? Then it is a "Druckwerk" (printed work).

    Stefan Weil
    @stweil
    Also: "Am Ende des Gesamtvorhabens soll ein konsolidiertes Verfahren zur OCR-Verarbeitung von Digitalisaten des gedruckten deutschen Kulturerbes des 16. bis 19. Jahrhunderts erarbeitet worden sein." ["By the end of the overall project, a consolidated procedure for OCR processing of digitized German printed cultural heritage from the 16th to the 19th century shall have been developed."] So really no newspapers? I interpreted that differently.
    Elisabeth Engl
    @EEngl52
    OCR-D should and does also cover the first half of the 19th century. But OCR-D explicitly communicates that it does not cover OCR for newspapers. As a mass digitization project it has to focus on the majority of the VD titles. And considering the number of titles, newspapers form just a very small percentage of the early modern print production.
    Kay-Michael Würzner
    @wrznr
    @stweil You also have to consider that the whole Newspaper digitization business is handled in a different funding initiative by the DFG. They are in the process of mass digitization already.
    Kay-Michael Würzner
    @wrznr
    @stweil What is the difference between OMAR and the current approach of using multiple models at recognition time?
    Elisabeth Engl
    @EEngl52

    @/all our next open TechCall takes place next Tuesday, 11-12 am. Feel free to join it if you are interested in the following topics:

    • documentation / discussion
    • core
    • spec
    • ocrd_all

    for the conference details also see https://hackmd.io/OOMgg3ZeSqK4vfKL1wRbwQ?view

    Stefan Weil
    @stweil

    What is the difference between OMAR and the current approach of using multiple models at recognition time?

    OMA = One Model to recognize them All (acronym coined ad hoc because Clemens likes acronyms, derived from the "one ring to rule them all"). That can also be a set of models used at recognition time. It just means that the old approach to choose a model based on the script(s) used in a book or other criteria is replaced by the simpler rule to always use the same model (or set of models).

    Of course this still has limitations. Clemens' model covers Latin, Greek and Hebrew scripts, our models currently only work with Latin and some Greek glyphs. So Arabic, Chinese, ... scripts still need other models.

    It's only OMA for OCR-D. And maybe there will be different models for OCR workflows with and without binarization.

    The other acronym which I mentioned was UDO (User Driven OCR), derived from PDA (Patron Driven Acquisition).
    Robert Sachunsky
    @bertsky
    @stweil regarding OMA and my question in the pilot kickoff call (on what basis do you expect generic models to do better than script/century-specific training?), is this based on the CER measurements you reported in your poster for the final phase 2 presentation at DFG in Bonn?
    If so, could you please share details on the methodology (processing workflow, evaluation) and access to the data?
    (The poster states data is from 16th to 19th century, and you used your GT4HistOCR model; but the GT4HistOCR corpus is quite unbalanced with most of the data in 19th century Fraktur, and all measurements I made in the past show a much worse accuracy on pre-19th century data...)
    Lucas Sulzbach
    @sulzbals
    Hello everyone! I have been working on a fork of sbb_textline_detector for a couple of months now and I would like to share it here: https://github.com/sulzbals/gbn. It was built with layout analysis of German-Brazilian publications from the 19th-20th centuries in mind, but it might be useful for other cases as well. The idea was to replicate the functionality of the original project in a more modular and customizable toolset, but other changes and features in the cropping/segmentation routines were also implemented along the way. You can find more details in the README. I do not consider it a "finished" project as there is still much to be done, but it is already possible to do some testing in an OCR-D workflow with these tools.
    Clemens Neudecker
    @cneud
    Hi @sulzbals and welcome! I've been lurking around your fork for some weeks now, and I am very impressed with the way you structured and extended our rough prototype code, closely following the OCR-D conventions, all by yourself - kudos! I've been wanting to reach out directly to maybe have a call/exchange on your work (and our plans for OCR-D and also Qurator), but did not yet manage to find the time for it. So thanks for saying hello here, and I am looking forward to following up soon, either here or bilaterally.
    Uwe Hartwig
    @M3ssman
    Hello @all! I'm looking for a tool that is able to read structural information like chapters and sections from METS/MODS and put this into a table of contents of a PDF. Any suggestions? https://github.com/UB-Mannheim/ocrd_pagetopdf creates a text layer, but apparently only for a single image + PAGE file - is that right?
    etodt
    @etodt
    Hi, I am Eduardo Todt, professor of @sulzbals. We are working with enthusiasm on this topic and we hope to exchange ideas with you. Lucas is really doing very well here, I am proud of him. In the next weeks I hope we can add another member to collaborate with the team.
    Kay-Michael Würzner
    @wrznr
    Welcome @etodt, it is great news that OCR-D is reaching out internationally and we are all eager to support you (and get support from you 😉).
    Konstantin Baierer
    @kba

    structural information like chapters and sections from METS/MODS and put this into a table of content of a PDF.

    Not that I'm aware of. But with a bit of pre-processing and a toolkit like https://github.com/ocelotconsulting/hummus-toc it wouldn't be too difficult (generating that information in the first place is the hard part IMHO). Can you create a feature request in ocrd_pagetopdf so we can discuss possible solutions? We could either extend ocrd_pagetopdf or create a dedicated processor for it.
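    A rough sketch of the pre-processing step meant here, i.e. pulling the logical structMap out of a METS file with lxml (element and attribute names follow the METS schema; the plain-text output is only illustrative, and turning it into PDF bookmarks would be the follow-up step):

    from lxml import etree

    NS = {"mets": "http://www.loc.gov/METS/"}

    def logical_toc(mets_path):
        """Yield (depth, label, type) for each div in the logical structMap."""
        tree = etree.parse(mets_path)
        struct = tree.find('.//mets:structMap[@TYPE="LOGICAL"]', NS)
        if struct is None:
            return
        def walk(div, depth=0):
            label = div.get("LABEL") or div.get("TYPE") or ""
            yield depth, label, div.get("TYPE")
            for child in div.findall("mets:div", NS):
                yield from walk(child, depth + 1)
        for top in struct.findall("mets:div", NS):
            yield from walk(top)

    for depth, label, typ in logical_toc("mets.xml"):
        print("  " * depth + f"{label} ({typ})")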

    Konstantin Baierer
    @kba
    BTW: ocrd_pagetopdf implemented a multipage parameter a few weeks ago, which allows producing a single PDF file from all the pages in a METS. In case users haven't noticed it yet, see the usage section of ocrd_pagetopdf.
    Robert Sachunsky
    @bertsky

    @kba

    Though de-keystoning is more of a shearing operation IIUC?

    yes, I guess it can be approximated as just shear in 3d. So if we had a descriptive annotation of that mapping in PAGE, then our coordinate-conversion API (based on affine transformations) could compensate.
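    (Illustration only, not the actual OCR-D core API: assuming the keystone were annotated as an affine map, compensating polygon coordinates would amount to applying the inverse matrix. The shear value below is arbitrary.)

    import numpy as np

    # Assumed affine approximation of the keystone distortion:
    # a 2x2 linear part (here: pure horizontal shear) plus a translation.
    shear = 0.08
    A = np.array([[1.0, shear],
                  [0.0, 1.0]])
    t = np.array([0.0, 0.0])

    def compensate(points):
        """Map distorted polygon points back to the rectified image plane."""
        pts = np.asarray(points, dtype=float)
        return (np.linalg.inv(A) @ (pts - t).T).T

    # Example: a text line polygon from the distorted image
    line = [(100, 200), (900, 264), (900, 300), (100, 236)]
    print(compensate(line))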
    But I still don't see any robust/usable keystone detector or general dewarper out there:

    • I cannot get the DFKI dewarper to run at all (OCR-D/ocrd_anybaseocr#40 and https://github.com/OCR-D/ocrd_anybaseocr/issues/61),
    • the tool @hnesk talked about needs a special physical marker/aid on the page,
    • Fred Weinhaus' unperspective needs a dark background around the page
    • the tool recommended by @jbarth-ubhd makes strong assumptions on the page layout (e.g. that text lines always run across the page, so multi-column or table layouts will yield many wrong "sight lines" and thus the transform will be too cautious), and it will be hard to wrap this for OCR-D (because of the downscaling, internal thresholding, implicit cropping involved)
    • leptonica has a parametric dewarper based on textline estimation, too; jsbueno/pyleptonica#11 can be used to access modern Python/Leptonica, and it would not be hard to wrap (I think). But again, there are some strong assumptions on page layout (finding enough long text lines which cover significant parts of the page)

    @/all does anyone know other promising tools?

    Clemens Neudecker
    @cneud
    Possibly of interest here as well: yet another promising approach combining language models and layout analysis model training https://arxiv.org/pdf/1912.13318.pdf
    vchristlein
    @VChristlein
    @bertsky I don't have a real idea, but is de-warping on the page level really needed? If the text (base-)lines are correctly identified, you can dewarp them and send them to your favorite text recognizer.
    Robert Sachunsky
    @bertsky
    @VChristlein I believe it is. Both for curled pages and for non-perpendicular perspective. IMHO baseline detection is always more or less fragile when you have complex (e.g. multi-column or ornamented) layouts or when textlines have steep angles towards the spine. Even if it succeeds, merely v-shifting cannot correct for the horizontal compression, so recognition is still impaired. And page segmentation (i.e. region detection) is also much more difficult without page-level dewarping.
    vchristlein
    @VChristlein
    Thank you for the explanation. If you have some dewarped/straight material, you could easily create deformed material (better: transform it on the fly and use it as augmentation) and then train a CNN to dewarp it again... And a quick search gave me the following repos that do exactly this: https://github.com/thomasjhuang/deep-learning-for-document-dewarping (although I think that a GAN architecture is somewhat strange on a global level - interesting that this works...) and https://github.com/cvlab-stonybrook/DewarpNet (looks more sophisticated)
    Robert Sachunsky
    @bertsky

    If you have some dewarped/straight material, you could easily create deformed material (better: transform it on the fly and use it as augmentation) and then train a CNN to dewarp it again...

    Absolutely. You would think this is an easy problem nowadays. Esp. with augmenters like imgaug or ocrodeg.
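    For instance, a minimal sketch with ocrodeg (following its README as far as I remember; the file name and parameter values are arbitrary): generate a smooth random displacement field and warp a clean page with it, which gives (warped, clean) training pairs.

    import ocrodeg
    import matplotlib.pyplot as plt

    # Load a clean (already straight) page as a grayscale float image.
    clean = plt.imread("clean_page.png")
    if clean.ndim == 3:
        clean = clean.mean(axis=2)

    # Smooth random displacement field: the two numbers control the smoothness
    # and the maximum displacement in pixels (values here are arbitrary).
    noise = ocrodeg.bounded_gaussian_noise(clean.shape, 100.0, 5.0)
    warped = ocrodeg.distort_with_noise(clean, noise)

    # (warped, clean) is one training pair for a dewarping CNN.
    plt.imsave("warped_page.png", warped, cmap="gray")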

    https://github.com/thomasjhuang/deep-learning-for-document-dewarping (although I think that a GAN architecture is somewhat strange on a global level - interesting that this works...)

    This is pretty much what ocrd-anybaseocr-dewarp attempted. However, they don't provide an off-the-shelf model, and the training data representation looks strange. Plus:

    NVIDIA GPU (11G memory or larger)

    ouch. (pix2pixHD with less GPU RAM or CPU only seems to be impossible...)

    https://github.com/cvlab-stonybrook/DewarpNet (looks more sophisticated)

    Indeed. I'll have a look – thanks a lot!

    vchristlein
    @VChristlein
    ouch. (pix2pixHD with less GPU RAM or CPU only seems to be impossible...)
    you could switch to normal pix2pix and CPU should actually work, too...
    Robert Sachunsky
    @bertsky

    you could switch to normal pix2pix and CPU should actually work, too...

    oh? How do I do that (without rewriting the whole thing)?

    vchristlein
    @VChristlein
    and thanks for the hint with ocrodeg, that looks nice; we currently use only the PyTorch standard augmentations and the albumentations library...
    well, I know too little about the DFKI code; if it is PyTorch, then the model is typically placed in a model file which you could exchange. I cannot recommend CPU for training, but for inference it should be possible
    vchristlein
    @VChristlein
    hm, exchanging might be too time-consuming, but CPU should be doable: you need to load the trained model once with a GPU, save it to CPU and then you can use it with CPU (just remove .cuda()) - but of course GPU will be faster than CPU processing...
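    In PyTorch terms the conversion might look roughly like this (a sketch only; the checkpoint name and the generator class are placeholders, not the actual DFKI/pix2pixHD code):

    import torch

    # One-time conversion (needs the GPU machine once): load the GPU-trained
    # weights, move every tensor to the CPU, save a CPU-only checkpoint.
    state = torch.load("latest_net_G.pth")             # placeholder file name
    state = {k: v.cpu() for k, v in state.items()}
    torch.save(state, "latest_net_G_cpu.pth")

    # Later, on a machine without CUDA: map_location forces everything onto CPU.
    state = torch.load("latest_net_G_cpu.pth", map_location="cpu")
    # model = GeneratorNet(...)          # placeholder for the actual model class
    # model.load_state_dict(state)
    # model.eval()
    # with torch.no_grad():
    #     output = model(input_tensor)   # no .cuda() calls anywhere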
    Robert Sachunsky
    @bertsky

    hm, exchanging might be too time-consuming, but CPU should be doable: you need to load the trained model once with a GPU, save it to CPU and then you can use it with CPU (just remove .cuda())

    wow, that sounds almost doable – thx!! I will try that (on a larger GPU) – if this works, we'll have at least some dewarping in OCR-D (where most users run via Docker which is currently CPU-only)

    vchristlein
    @VChristlein
    let me know if you encounter any problems, my colleague M.Seuret has done exactly this in the past and knows more about the exact details
    jbarth-ubhd
    @jbarth-ubhd
    olena: I thought one k would fit almost all, but this is not the case: https://digi.ub.uni-heidelberg.de/diglitData/v/olena-sauvola-k-202007.pdf
    Robert Sachunsky
    @bertsky
    @jbarth-ubhd Sauvola in most implementations assumes normalized input (i.e. full dynamic range between 0/fg and 255/bg). So you might want to use ocrd-skimage-normalize first (doing simple contrast stretching) or something more elaborate via ocrd-preprocess-image with ImageMagick.
    Also, Sauvola and Niblack have a window size parameter which must be set large enough to encompass areas with enough empty and text share to yield good fg/bg statistics, but small enough to still be localized. So practically, there is a dependency on DPI of the images.
    I guess as a rule of thumb you could say the window size should equal the DPI value, rounded to an odd number (which gives windows an edge length of about 1 inch). That's why ocrd-skimage-binarize makes this the default choice for window size (looking at the DPI metadata).
    See https://github.com/OCR-D/ocrd-website/wiki/Binarization-%E2%80%93-Practicioner's-View
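    Spelled out with scikit-image, the rule of thumb amounts to something like the following (a sketch, not what ocrd-skimage-binarize literally does; the file name and k value are placeholders, and the DPI is simply read from the image metadata if present):

    import numpy as np
    from PIL import Image
    from skimage.filters import threshold_sauvola

    img = Image.open("page.tif").convert("L")
    dpi = int(round(float(img.info.get("dpi", (300, 300))[0])))  # fall back to 300 DPI

    # Window size of about 1 inch, forced to an odd number of pixels.
    window = dpi if dpi % 2 else dpi + 1

    gray = np.array(img)
    binary = gray > threshold_sauvola(gray, window_size=window, k=0.2)
    Image.fromarray((binary * 255).astype(np.uint8)).save("page.bin.png")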
    jbarth-ubhd
    @jbarth-ubhd
    Oh, but the original images (all 300 dpi) already cover the full range from 0..255. Perhaps not optimally distributed, with the histogram peaks for white at 220 and black at 30+90. Thanks, will try it.
    Robert Sachunsky
    @bertsky
    I see. Then I don't think normalization is necessary. And ocrd-olena-binarize's default window size of 101 should suffice for 300 DPI images. Could you please upload the originals somewhere, so I can have a look myself?
    Robert Sachunsky
    @bertsky
    @/all a new PAGE namespace is due in 2 weeks – I don't know if there's still time for some badly wanted new features like PRImA-Research-Lab/PAGE-XML#25 – please help by voting/discussing if you can.
    jbarth-ubhd
    @jbarth-ubhd
    olena sauvola, with different blackness & whiteness & noise levels (vertical) and k from 0.025 to 0.475, see https://digi.ub.uni-heidelberg.de/diglitData/v/olena-k-20200702.png . If you open this image in GIMP and set the threshold to 255, you will see that darker images require a higher k and brighter images a lower k. At least two horizontal stripes should have a clean white background per group. First column = original, second = ground truth, then k = 0.025 in steps of 0.025. Image shrunk to 25% (original resolution 300 dpi)
    jbarth-ubhd
    @jbarth-ubhd
    olena sauvola, table with "best k" according to psnr(Ground Truth, k=0.025 to 0.475 in 0.025 increments); rows = black level, columns = white level:
    noise level 0 --> sigma = 0.1 * (linear(white)-linear(black))
    black \ white    64    128    191    255
      0           0.275  0.275  0.325  0.350
     64             -    0.200  0.250  0.300
    128             -      -    0.125  0.200
    191             -      -      -    0.100

    noise level 1 --> sigma = 0.2 * (linear(white)-linear(black))
    black \ white    64    128    191    255
      0           0.275  0.275  0.325  0.300
     64             -    0.200  0.250  0.275
    128             -      -    0.125  0.175
    191             -      -      -    0.100

    noise level 2 --> sigma = 0.4 * (linear(white)-linear(black))
    black \ white    64    128    191    255
      0           0.275  0.275  0.300  0.225
     64             -    0.175  0.250  0.200
    128             -      -    0.125  0.150
    191             -      -      -    0.075

    noise level 3 --> sigma = 0.8 * (linear(white)-linear(black))
    black \ white    64    128    191    255
      0           0.375  0.350  0.375  0.250
     64             -    0.175  0.300  0.225
    128             -      -    0.125  0.025
    191             -      -      -    0.025

    -> olena sauvola parameter k is not independent of foreground/background level.
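    For reference, such a "best k" sweep can be scripted roughly as follows with scikit-image (a sketch under the assumption of aligned grayscale ground-truth and degraded images; file names are placeholders, and the actual experiment used ocrd-olena-binarize rather than skimage's Sauvola):

    import numpy as np
    from skimage.io import imread
    from skimage.filters import threshold_sauvola
    from skimage.metrics import peak_signal_noise_ratio

    gt = imread("ground_truth.png", as_gray=True)      # clean reference page
    orig = imread("degraded.png", as_gray=True)        # degraded original

    best_k, best_psnr = None, -np.inf
    for k in np.arange(0.025, 0.5, 0.025):
        binary = (orig > threshold_sauvola(orig, window_size=101, k=k)).astype(float)
        psnr = peak_signal_noise_ratio((gt > 0.5).astype(float), binary, data_range=1.0)
        if psnr > best_psnr:
            best_k, best_psnr = k, psnr

    print(f"best k = {best_k:.3f} (PSNR {best_psnr:.2f} dB)")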
    jbarth-ubhd
    @jbarth-ubhd
    Is the OCR-D TechCall now? Does someone have the link for me?
    Elisabeth Engl
    @EEngl52
    The TechCall is always on a Tuesday; the next one is next Tuesday: https://hackmd.io/OOMgg3ZeSqK4vfKL1wRbwQ?view
    I will post the agenda later today
    Elisabeth Engl
    @EEngl52

    @/all our next open TechCall takes place next Tuesday, 11-12 am. Feel free to join it if you are interested in the following topics:

    for the conference details also see https://hackmd.io/OOMgg3ZeSqK4vfKL1wRbwQ?edit