    Kay-Michael Würzner
    @wrznr
    Wow. Do you happen to have an OCR-D processor for it as well?
    Kay-Michael Würzner
    @wrznr

    @jbarth-ubhd This is what I got for the page you posted a week ago:

    jrrthumben vnd dienſtbarkeit / ſo hartiglich geſtraf
    fet / wie auß dieſer vnd ander geſchichtbuͤchern / zus
    vernemen / Wird er on zweiffel wider vns noch an⸗
    ders jemandt / gleich als im Roſengarten ſitzen /
    vnd mit der warhafftigenlehr/Gottsdienſten/Re-
    gimentẽ / sucht vnd erbarkeit / vnſers gefallens kurtz
    weil vnd gauckelſpil treiben laſſen / Derhalben be⸗
    ker vnd wende ſich noch meniglich / vnd ſuch guad
    bey dem barmhertzigen Gott / dieweil die zu finden.
    : Zumandern/fan auß ſolchen anzeigungen auch
    etlicher maß / die warheit / vnd der recht Chriſtlich
    glaub geſterckt werden] Dieweil deñoch hierauß er
    ſcheint ¿mit welcherley grauſamer erſchrecklicher
    vnzucht / auch wütenden blutgirigẽ vnmeßlichẽ we
    ſen / Die gottloſen vnd Heyden / ſo die gnad Chriſti
    verachtẽ / von Gott geſtrafft werde, Dz alfo jrs für
    temens ſchleunige außrichtung | vnd des geivalts
    mechtige erbreterunge / jnen zu verdamlichen ſcha⸗
    den / vñ entlichen nachtheil vnd verderben gerathen
    muß/zu beſtetigung jrs gottloſenwandels vü.heuf
    fung der ſuͤnd / damit ſie auch Gotts zeitig rach er⸗
    wecken welchs zorn bleibt ob dem vnglaubigen.
    Zum dritten / dienen gleichwol der gſtalt bücher]
    zu vnterricht der lands art / der leute gewonheit / ge⸗
    brauch vnd ſitten / Auch zu erinnerung der vrſach
    durch welche menſchlich paron u reden) bißher
    dd HY

    We definitely have to work on the special characters. Otherwise, it is pretty okay.

    Also, the umlauts have to be unified.
    Kay-Michael Würzner
    @wrznr
    (Note that the last line suffers from a segmentation error.)
    jbarth-ubhd
    @jbarth-ubhd
    @wrznr which toolchain?
    Kay-Michael Würzner
    @wrznr
    olena-binarize
    anybaseocr-crop
    cis-ocropy-denoise
    cis-ocropy-deskew
    tesserocr-segment-region
    tesserocr-segment-line
    cis-ocropy-resegment
    tesserocr-recognize
    I guess that resegmentation is very, very important for this particular page:
    OCR-D-IMG-DEWARP_0001_region0000_region0000_line0019.png
    OCR-D-IMG-RESEG_0001_region0000_region0000_line0019.png
    Pls. notice the successful removal of the intruders from the line below.
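    The eight steps above chain together by feeding each processor's output fileGrp into the next processor's input. A minimal sketch of how such a chain could be assembled into an `ocrd process` command line; the derived fileGrp names are invented for illustration:

    ```python
    # Sketch: build `ocrd process` task strings for the workflow above,
    # chaining each step's output fileGrp into the next step's input.
    steps = [
        "olena-binarize",
        "anybaseocr-crop",
        "cis-ocropy-denoise",
        "cis-ocropy-deskew",
        "tesserocr-segment-region",
        "tesserocr-segment-line",
        "cis-ocropy-resegment",
        "tesserocr-recognize",
    ]

    def chain_tasks(steps, first_grp="OCR-D-IMG"):
        tasks, in_grp = [], first_grp
        for step in steps:
            # Derive a (hypothetical) output fileGrp name from the step name
            out_grp = "OCR-D-" + step.split("-")[-1].upper()
            tasks.append(f'"{step} -I {in_grp} -O {out_grp}"')
            in_grp = out_grp
        return "ocrd process " + " ".join(tasks)

    print(chain_tasks(steps))
    ```
    
    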
    Robert Sachunsky
    @bertsky

    tesserocr-segment-region

    I recommend always adding segment-repair with plausibilize=true and then cis-ocropy-clip afterwards.

    tesserocr-segment-line
    cis-ocropy-resegment

    note: IMO you can squash these 2 into just cis-ocropy-segment (which is polygonal right away).

    Kay-Michael Würzner
    @wrznr

    I recommend always adding ...

    It is only one region.

    note: IMO you can squash these...

    IMHO, tesserocr-segment-line is often more reliable.

    jbarth-ubhd
    @jbarth-ubhd
    grafik.png
    @wrznr: semi-manual comparison with calamari (green=better than calamari, red=worse) (no punctuation, base letter differences only)
    jbarth-ubhd
    @jbarth-ubhd
    1. Use sauvola-ms-split!
    I wonder why sauvola-ms-split "k" behaves so differently from sauvola without "ms"; see another comparison with different scales and k-values (0.17, 0.34, 0.68) here: https://digi.ub.uni-heidelberg.de/diglitData/jb/ocrd-binarize3.html . Sauvola-ms loses information on low contrast earlier than sauvola without "ms". And in many cases there are no very different-scale fonts besides the journal title on the first page. Larger fonts are often thinner, too, so I do not see where the advantage of "ms" is.
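    For context on what k controls: Sauvola's local threshold is T = m · (1 + k · (s/R − 1)), where m and s are the local window mean and standard deviation and R is the dynamic range of s (128 for 8-bit images). In flat, low-contrast regions s is small, so the threshold drops well below the mean, and the larger k is, the more faint strokes get pushed to background. A small sketch (the mean/std values are invented):

    ```python
    def sauvola_threshold(mean, std, k=0.34, R=128.0):
        """Sauvola's local threshold: T = m * (1 + k * (s / R - 1)).

        mean, std: local mean and standard deviation of the window;
        k: sensitivity (the values compared above: 0.17, 0.34, 0.68);
        R: dynamic range of the standard deviation (128 for 8-bit images).
        """
        return mean * (1.0 + k * (std / R - 1.0))

    # In a low-contrast region (small std) the threshold sinks further
    # below the local mean as k grows, so faint ink is lost earlier:
    for k in (0.17, 0.34, 0.68):
        print(k, sauvola_threshold(mean=100.0, std=10.0, k=k))
    ```
    
    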
    Robert Sachunsky
    @bertsky

    I wonder why sauvola-ms-split "k" is so much different from sauvola without "ms"

    This is strange indeed. Sauvola-multiscale appears to be defunct in the lower-contrast half, regardless of k. I will investigate.
    Meanwhile, you should also know that in the original paper that introduced it (https://www.researchgate.net/publication/271923570_Efficient_Multiscale_Sauvola's_Binarization), the default choice of k differed across scales (0.2 at the lowest, 0.3 at medium scales, 0.5 at the highest) and had been empirically determined as the optimum (as had Sauvola's original k=0.34). But the Olena implementation then silently crept away from that decision over the years by overriding with an equal k at all scales. (And that's also what we locked on in our OCR-D wrapper.) I will fix that soon though – see discussion here.

    so I do not see where the advantage of "ms" is.

    Our recommendation was based on a slight advantage in a direct OCR CER comparison on OCR-D GT (see here). In that setting, we used the default k, which really just might have been the 0.2/0.3/0.5 scheme as originally intended, but I'm not sure.

    In short: please wait for the fix on ms-k in Olena upstream and our wrapper subsequently, then let's do that comparison again (and I'll also run the GT measurements again).
    Kay-Michael Würzner
    @wrznr
    @jbarth-ubhd Which workflow led to the calamari result? Can you post the whole text?
    jbarth-ubhd
    @jbarth-ubhd
    ocrd-olena-binarize -p '{"impl":"sauvola"}'
    sbb_textline_detector with https://qurator-data.de/sbb_textline_detector/
    ocrd-calamari-recognize with https://qurator-data.de/calamari-models/
    OCR (Calamari+ALTO) https://digi.ub.uni-heidelberg.de/diglitData/jb/schiltberger1580-ocr-calamari.zip
    Kay-Michael Würzner
    @wrznr
    👌🏻
    jbarth-ubhd
    @jbarth-ubhd
    sbb_textline is perhaps oversized for this kind of text
    Stefan Weil
    @stweil
    For all who want to learn more about neural networks: OpenHPI currently offers a course on deep learning for computer vision. The course started three days ago, but it is still possible to join.
    jbarth-ubhd
    @jbarth-ubhd
    effect of k on different olena binarization methods (original test image is 3000 px wide): https://digi.ub.uni-heidelberg.de/diglitData/jb/ocrd-olena-k.png
    [ y is the ratio of correct pixels; ground truth from the synthetic image was generated by (E(black)+E(background))/2 < pixel value ? 1 : 0 ]
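    The ground-truth rule in brackets reads: a pixel counts as background (1) iff its value exceeds the midpoint between the expected black and background intensities. A sketch of that rule and of the y-axis metric (intensity values below are invented):

    ```python
    def gt_binarize(pixels, mean_black, mean_background):
        """Ground truth from a synthetic image: a pixel is background (1)
        iff its value exceeds the midpoint (E(black)+E(background))/2."""
        threshold = (mean_black + mean_background) / 2.0
        return [1 if p > threshold else 0 for p in pixels]

    def pixel_accuracy(predicted, truth):
        """The y axis of the plot: ratio of correctly classified pixels."""
        correct = sum(p == t for p, t in zip(predicted, truth))
        return correct / len(truth)

    truth = gt_binarize([30, 200, 115, 125], mean_black=20, mean_background=220)
    print(truth)
    ```
    
    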
    jbarth-ubhd
    @jbarth-ubhd
    Added "myMethod" based on local histogram analysis. It tries to find the fore/background ratio and crops the histogram extrema accordingly to find a local threshold value. Very few assumptions; just to test what's possible statistically. Not sure whether it always converges.
    Matthias Boenig
    @tboenig
    Is there no word segmentation with calamari?
    Robert Sachunsky
    @bertsky
    @tboenig there is: you just need to set textequiv_level=word (it's line by default)
    Matthias Boenig
    @tboenig
    @bertsky Thank you.
    Kay-Michael Würzner
    @wrznr
    To all the people in home office: Who would be interested in a concerted effort to train HTR models for Tesseract and Calamari? I.e., setting up a model hierarchy, systematically gathering GT, training, and performance evaluation. We could thus make use of the Corona time to close the gap to Transkribus, which admittedly exists.
    Christian Reul
    @chreul
    @wrznr I'm in
    Clemens Neudecker
    @cneud
    Christian Reul
    @chreul
    looks good but there are no text lines, just words. am i missing something? i guess lines could be generated relatively easily, though.
    Kay-Michael Würzner
    @wrznr
    @chreul :+1:
    Stefan Weil
    @stweil
    We started transcribing German books which combine print and handwriting: https://github.com/UB-Mannheim/Fibeln.
    So yes, @wrznr, that's a good idea.
    Kay-Michael Würzner
    @wrznr
    @stweil Do you collect the information which lines are handwritten and which are printed? I somehow doubt that mixed models would be the right way to go here.
    I'd also say that the “handwritten” parts are most likely printed as well.
    Stefan Weil
    @stweil
    Some more Fibeln are nearly finished and will be on GitHub soon. And no, marking the handwritten lines still has to be done.
    Kay-Michael Würzner
    @wrznr
    @chreul @stweil Are you ready to have a call on this matter tomorrow?
    Christian Reul
    @chreul
    yes. i am available from ca. 10 to 17h with the exception of 14:30 to 15:30h
    Stefan Weil
    @stweil
    I am busy from 10h to 11h. Perhaps we can meet at 11h? https://meet.jit.si/ocr-d.
    Works best with Jitsi Meet app (mobile, tablet), Chrome or https://github.com/jitsi/jitsi-meet-electron/releases for desktop.
    Maybe @FVoigtschild and @Ma-Nuechter want to join us, too. Both are working on transcriptions and training of new models.
    Kay-Michael Würzner
    @wrznr
    11AM it is then. Wonderful!
    000000.png
    Stefan Weil
    @stweil
    New transcriptions online: https://github.com/UB-Mannheim/Weisthuemer.
    Robert Sachunsky
    @bertsky

    Note (for those who have to deal with layout data): We now have converters for segmentation from/into COCO and from/into color-coded mask images:

    • ocrd-segment-extract-pages (now exports mask images and COCO, too)
    • ocrd-segment-from-masks (new, imports mask images)
    • ocrd-segment-from-coco (new, imports COCO file)

    These also try to convert PAGE region @type and @custom=subtype. Mask images by default use the color scheme from PAGEViewer. Import from PubLayNet is also supported (i.e. no need for https://github.com/bertsky/ocrd_publaynet).
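    For readers unfamiliar with it, a COCO segmentation file (as consumed/produced by the converters above) is a single JSON document with images, annotations (polygons linked to an image and a category), and categories, which here would carry the PAGE region types. A minimal example with invented ids and coordinates:

    ```python
    import json

    # Minimal COCO-style segmentation file; ids, sizes, and the polygon
    # are made up, and the category name mirrors a PAGE region type.
    coco = {
        "images": [
            {"id": 1, "file_name": "0001.png", "width": 2000, "height": 3000},
        ],
        "annotations": [
            {
                "id": 1,
                "image_id": 1,
                "category_id": 1,
                # polygon as a flat [x1, y1, x2, y2, ...] list
                "segmentation": [[100, 100, 900, 100, 900, 400, 100, 400]],
                "bbox": [100, 100, 800, 300],  # x, y, width, height
                "iscrowd": 0,
            },
        ],
        "categories": [
            {"id": 1, "name": "TextRegion", "supercategory": ""},
        ],
    }

    print(json.dumps(coco, indent=2)[:60])
    ```
    
    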

    Uwe Hartwig
    @M3ssman
    @tboenig Hello Mr. Boenig! We're waiting for the conference host to join @dfnconf!