    Robert Sachunsky
    @bertsky
    Further ideas to make the comparison more fair: use tesstrain's train/val split and only take the list.eval results (@stweil could you please publish this, e.g. at your model page); Calamari is from a cross-fold training, so I guess we would actually have to do the same thing with Tesseract as well (make external split, train 5 models and use them mixed together at inference time) ...
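    (For reference, a rough sketch of what evaluating a model only on the held-out list.eval lines could look like with Tesseract's lstmeval – the paths and checkpoint name below are only illustrative placeholders:)
     lstmeval --model data/GT4HistOCR/checkpoints/GT4HistOCR_best.checkpoint \
              --traineddata data/GT4HistOCR/GT4HistOCR.traineddata \
              --eval_listfile data/GT4HistOCR/list.eval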
    Robert Sachunsky
    @bertsky

    1.2% on average, 0.9% on dta19 and 2.5% on pre19 – the I/J confusion only occurs on the dta19 subcorpus; maybe I should have used the corrected corpus for testing as well?

    follow-up: I have cross-checked Tesseract models:

    • with the old GT4HistOCR_2000000 model on the old (uncorrected) corpus, we get only 0.7% on average (and the I/J error goes away; the single largest contribution now being unstripped spaces at the end of the line).
    • with the new GT4HistOCR model on the new (corrected) corpus, we get 1.1% on average (the largest contributor to which is r/ꝛ – which is peculiar considering this was one of the major fixes, with well over 5200 more lines using ꝛ; perhaps these changes were only made after the last public model was trained, @stweil?)
    Robert Sachunsky
    @bertsky
    I have started external documentation for all of the above and will continue with further cross-checks. (But I do need your help, @stweil and @mikegerber!)
    Mike Gerber
    @mikegerber
    @bertsky I've been AFK yesterday and I'll look into it. Did you consider that the I vs J confusion for Stefan's model might not be a confusion at all, i.e. I and J being identical?
    Ah you ran it on GT4HistOCR itself
    Robert Sachunsky
    @bertsky
    @mikegerber what do you mean identical? (I know that for Fraktur there's only one glyph for both, but IIUC we want the OCR to learn J in that case.)
    Mike Gerber
    @mikegerber
    I also haven't trained on the corrected corpus (yet), so that may be worth mentioning as well
    Robert Sachunsky
    @bertsky
    Yes, on the training and test set that is. Results on OCR-D GT are much worse (as earlier reported; likely due to runtime dependencies like padding/polygonalization and binarization/masking).
    Yes, training yet another Calamari model on the new data (and with mild normalization) would be fantastic!
    Mike Gerber
    @mikegerber

    Results on OCR-D GT are much worse (as earlier reported; likely due to runtime dependencies like padding/polygonalization and binarization/masking).

    Could you point me to that earlier report, please?

    Robert Sachunsky
    @bertsky
    @mikegerber sorry, cannot find it. I'm quite sure I made a markdown table of it, either on Github or Gitter, but search functionality is limited on both. The best I could find is the above table for the binarization dependency (just 2 of 19 bagits though). I think I'll just redo that evaluation with recent processors and improved workflow (to get good line images; i.e. binarization, deskewing, dewarping, polygonalization, clipping)...
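    (For the line-image improvements, roughly something along these lines with the ocrd_cis processors – a sketch only; file-group names are placeholders and parameters are omitted:)
     ocrd process \
       "cis-ocropy-deskew -I OCR-D-SEG -O OCR-D-SEG-DESKEW" \
       "cis-ocropy-clip -I OCR-D-SEG-DESKEW -O OCR-D-SEG-CLIP" \
       "cis-ocropy-resegment -I OCR-D-SEG-CLIP -O OCR-D-SEG-RESEG" \
       "cis-ocropy-dewarp -I OCR-D-SEG-RESEG -O OCR-D-SEG-DEWARP"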
    Robert Sachunsky
    @bertsky
    @/all I have just updated the workflow server prototype OCR-D/core#652 for OCR-D (ocrd workflow server TASKS / ocrd workflow client process -m METS) to run multiple requests/workspaces in parallel via a multi-processing worker queue. It now uses uwsgi as server instead of the built-in, single-threaded Flask development/debug server.
    (You could have had parallelism and load balancing already before by running the WF server in a Docker swarm; but of course, that's still possible when you do need to cross a single machine's boundary.)
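    (Hedged usage sketch based on the invocation above – this is a prototype, so the exact options may differ:)
     # start the workflow server with the desired TASKS (same task syntax as `ocrd process`)
     ocrd workflow server "cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN" &
     # then submit several workspaces concurrently; they get dispatched to the worker queue
     ocrd workflow client process -m /data/ws1/mets.xml &
     ocrd workflow client process -m /data/ws2/mets.xml &
     wait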
    Robert Sachunsky
    @bertsky
    @mikegerber sorry, I made a mistake in measuring CER on the Calamari model: I forgot to .strip() the lines! With this fix, CER for the Qurator GT4HistOCR model becomes 0.8% (or 0.4% when discounting all the errors related to quotation) on the old corpus and 1.1% (or 0.6% without quotation) on the corrected corpus – mostly attributable to J/I then (mind the reverse direction).
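    (In shell terms, the fix amounts to normalizing the line files before computing CER – a minimal sketch, file names illustrative:)
     # strip leading/trailing whitespace from GT and OCR text lines prior to evaluation
     sed -i 's/^[[:space:]]*//;s/[[:space:]]*$//' gt-lines/*.gt.txt ocr-lines/*.txt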
    Stefan Weil
    @stweil

    Further ideas to make the comparison more fair: use tesstrain's train/val split and only take the list.eval results (@stweil could you please publish this)

    https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021/

    Robert Sachunsky
    @bertsky
    @stweil I don't understand. Sure, comparing with frak2021 would also be interesting. But IIUC that has been trained on additional material, too. Also, I was asking specifically for the train/val split used for tesstrain (to get a better idea of how this might generalise). Could you please publish your list.eval (ideally for all models)?
    Stefan Weil
    @stweil
    That's the link for list.train and list.eval used by the latest training (frak2021). https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/Fraktur_5000000/ now also provides both lists for that older training.
    Stefan Weil
    @stweil
    • with the new GT4HistOCR model on the new (corrected) corpus, we get 1.1% on average (the largest contributor to which is r/ꝛ – which is peculiar considering this was one of the major fixes, with well over 5200 more lines using ꝛ; perhaps these changes were only made after the last public model was trained, @stweil?)
    The first fixes for 'ꝛ' were made in May 2020, after training of our GT4HistOCR models.
    Robert Sachunsky
    @bertsky

    That's the link for list.train and list.eval used by the latest training (frak2021). https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/Fraktur_5000000/ now also provides both lists for that older training.

    Oh, sorry – my browser still had the old page cached. I now have the list files. Thank you very much!

    Looks like frak2021 was trained on ~53k lines of AustrianNewspapers and ~3.5k Fibeln in addition to GT4HistOCR. So that's probably not a big deal when comparing with other models which have been trained solely on GT4HistOCR – but if at all possible, could you please provide the list files for tesstrain/GT4HistOCR and/or ocrd-train/data/GT4HistOCR as well?

    The first fixes for 'ꝛ' were made in May 2020, after training of our GT4HistOCR models.

    thx for clarifying!

    Mike Gerber
    @mikegerber
    @bertsky I'll make time for this, I wanted to investigate some issues like interaction of segmentations vs OCR methods for a long time anyway. We should try to make this reproducible so we can gradually fix any issues (for any component)
    @bertsky I've also evaluated @chreul's GT4HistOCR model and it seemed to be a tiny bit better. However, it was only so slightly better that I had doubts whether the difference is statistically significant
    Stefan Weil
    @stweil

    could you please provide the list files for tesstrain/GT4HistOCR and/or ocrd-train/data/GT4HistOCR as well?

    https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/GT4HistOCR/

    Robert Sachunsky
    @bertsky
    @stweil thx, much appreciated!
    Robert Sachunsky
    @bertsky
    @mikegerber absolutely with you! (Tesseract also has these dependencies I'm afraid, esp. horz/vert. padding, and – since it does not have any built-in augmentation – binarization.)
    aurichje
    @aurichje

    Hey everyone.
    I'm new to OCR-D but I've been experimenting with it and plan to use it for an MA project. First of all: thank you for all the amazing work! My technical knowledge is limited, but thanks to the well-written docs I am able to make use of OCR-D.
    However, I did run into an issue using the eynollah segmentation processor -- I hope this is the right place to address it. Any help would be much appreciated.

    When running the workflow below, I get an inverse text line order within each region, i.e. the actual last line of the region is line_0001. Accordingly, the resulting text from a recognizer is scrambled.

    Here's the workflow I used, I'll drop in the resulting XML and a screenshot below:
    I used this image

    
    ocrd process \
      "cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN" \
      "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
      "skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li" \
      "skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
      "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
      "eynollah-segment -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG -P models default -P curved_line true" \
      "calamari-recognize -I OCR-D-SEG -O OCR-D-OCR -P checkpoint_dir qurator-gt4histocr-1.0"

    Does anyone have an idea what I'm doing wrong?

    screenshot eynollah.png
    Mike Gerber
    @mikegerber
    @aurichje Interesting, could you open an issue at https://github.com/qurator-spk/eynollah/ please?
    @aurichje I'd also like to point out that the extra cropping and deskewing is probably too much, as eynollah does that itself. Also: why binarize twice? (We either use ocrd_olena with sauvola-ms-split or sbb_binarization, so I can't say much about skimage and ocropy binarization. The denoising would also be included with sbb_binarization.)
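    (A trimmed-down variant of the workflow above along those lines – a sketch only; the olena-binarize parameter is assumed, please verify against the processor's documentation:)
     ocrd process \
       "olena-binarize -I OCR-D-IMG -O OCR-D-BIN -P impl sauvola-ms-split" \
       "eynollah-segment -I OCR-D-BIN -O OCR-D-SEG -P models default -P curved_line true" \
       "calamari-recognize -I OCR-D-SEG -O OCR-D-OCR -P checkpoint_dir qurator-gt4histocr-1.0"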
    aurichje
    @aurichje
    @mikegerber Done. Thanks for pointing out the extra preprocessing steps.
    Mike Gerber
    @mikegerber
    @aurichje The colleague working on eynollah (@vahidrezanezhad) was already informed by @kba in our team channel, so I think he'll solve this soon :)
    Elisabeth Engl
    @EEngl52

    @/all this Wednesday, 2-3pm is our next open TechCall. Feel free to join us in https://meet.gwdg.de/b/eli-ufa-unu if you are interested in the following topics:

    BTW: if you have any suggestions for topics to be discussed in some future TechCall, feel free to add them anytime in this HackMD

    SB2020-eye
    @SB2020-eye

    Has anyone here ever utilized any object detection methods on old handwriting?

    (I'm still trying to cut out individual glyphs (and ligatures) from a manuscript, the script of which looks almost identical to this...

    Chad_one_word.png
    Mike Gerber
    @mikegerber
    I think someone here did HTR with specifically trained Tesseract models (maybe @wrznr and/or @stweil?)
    SB2020-eye
    @SB2020-eye

    ... I've continued to fall short trying binarization-based methods and/or OCR-D procedures (plus a handful of other avenues I won't go into, unless someone wants to know).

    I'm hoping someone here has some familiarity with custom object detection and/or semantic segmentation who would be willing to suggest some direction for me.)

    (I've tried out imageai with some preliminary test runs, but I can't get it to find anything. I'm not sure where the problem lies. But I'm not necessarily looking for specific help with imageai per se -- just for someone who can make suggestions, or tell me "that route won't work", or whatever. :smile: )
    Robert Sachunsky
    @bertsky

    I think someone here did HTR with specifically trained Tesseract models (maybe @wrznr and/or @stweil?)

    yes, they did successfully train HTR for https://github.com/tesseract-ocr/tesstrain/wiki/German-Konzilsprotokolle and https://github.com/tesseract-ocr/tesstrain/wiki/Fibeln.
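    (Both were trained with tesstrain; a rough sketch of such a run – the variable values are placeholders only, and the ground truth line images/transcriptions go into data/MODEL_NAME-ground-truth by default:)
     # fine-tune from an existing Fraktur model on line image/transcription pairs
     make training MODEL_NAME=Konzilsprotokolle START_MODEL=frk TESSDATA=~/tessdata_best MAX_ITERATIONS=100000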

    Robert Sachunsky
    @bertsky
    @SB2020-eye what specifically is your problem with these materials? If you have GT, you could train an OCR engine (Tesseract / Calamari / Kraken / Ocropus)... Depending on the amount of data, you may need to fine-tune from some existing model, or mix in extra training data. But this looks like a standard OCR task plus extracting the glyph segmentation from the OCR at runtime (which all engines can deliver, albeit not very precisely – and Tesseract has a bug there IIRC).
    Or is the OCR's glyph segmentation output too imprecise? In that case, you might want to try some form of forced alignment with a classic segmentation-based recognizer. Tesseract seems to be able to do this (Tesseract::ApplyBoxes with find_segmentation=true)...
    But since you are asking specifically for an object detection/segmentation model: I would avoid semantic segmentation for this kind of task, since the objects overlap / float into each other and you want to find discrete units. I would try training Mask-RCNN with glyph bboxes / pixel masks as objects. You might even be able to create the training data synthetically from OCR and then mix in some of your precious GT.
    Robert Sachunsky
    @bertsky

    Tesseract seems to be able to do this (Tesseract::ApplyBoxes with find_segmentation=true)...

    This is even available from the CLI: tesseract line.box - --psm 7 -c tessedit_resegment_from_line_boxes=1 – where line.box is a Tesseract box file with the line string and the bounding box of the line (could be the full image I suppose).

    SB2020-eye
    @SB2020-eye

    Thanks very much. One problem is that, despite you being perfectly clear, I imagine, I only understand about 50% of all that. Lol.

    So please know the question that follows isn't meant to have any impertinence at all.

    Can what you're describing help me get to this?

    Chad_one_word e.png
    ...With perhaps this in between (but only if needed)?
    SB2020-eye
    @SB2020-eye
    Chad_one_word.png
    Robert Sachunsky
    @bertsky
    @SB2020-eye Yes, we're talking about the same problem. I described two principal ways to get there:
    1. from a "legacy" (pre-LSTM) OCR's character segmentation (in Tesseract: chopper). Especially if you already know the correct character sequence (this is called forced alignment).
      Tesseract can in principle still do this, but it does not always work for various reasons:
       # create LINEIMAGE.box with the overall bbox and text result (MODELNAME can be LSTM):
       tesseract LINEIMAGE.png LINEIMAGE -c tessedit_create_wordstrbox=1 --psm 7 -l MODELNAME
       # split up text result into characters with white space in between (old word/line box format):
       sed -i 's/./& /g' LINEIMAGE.box
       # run the chopper and reclassifier from LINEIMAGE.box (MODELNAME should be pre-LSTM):
       tesseract LINEIMAGE.png LINEIMAGE_glyph -c tessedit_resegment_from_line_boxes=1 --psm 7 -l MODELNAME
      (IIUC, unfortunately, the unicharsets of both models need to match, and the resegmenter/reclassifier must be able to find the exact same text spanning the whole line, otherwise you'll get APPLY_BOX: FAILURE: can't find segmentation. Also, it's not trivial to get polygons or pixel masks from this result; not sure if possible via API or would require changes to Tesseract.)
    2. from a standalone neural object detection model. Mask-RCNN detects multiple (possibly overlapping) instances of objects, and has 3 output layers: instances' bounding boxes (as an x/y+w/h regression task), instances' pixel masks (as a fully convolutional network within the bbox), and instances' classes (as a cross-entropy/softmax classification task). So if you feed it images of lines annotated by bboxes and fg masks for each glyph (and possibly even the glyph's codepoint as class) during training, then it should be able to learn glyph segmentation (and identification). The side-remark was that since glyph-segmented GT is rare, you could create such training material from the results of an existing OCR system with glyph coordinates output (see the sketch below). (Of course, if the OCR makes systematic errors, you should try to fix these or filter them out of the training data, otherwise the segmentation network will try to reproduce them.)
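    (For the "glyph coordinates from an existing OCR" part, one way to get glyph-level output within OCR-D – parameter names as I recall them, please double-check against ocrd_tesserocr; the model name is just an example:)
     ocrd process \
       "tesserocr-recognize -I OCR-D-SEG -O OCR-D-OCR-GLYPH -P model frk -P textequiv_level glyph"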
    Robert Sachunsky
    @bertsky
    (Cannot edit my comment: the second command line is actually slightly more complicated:
    while IFS=$'#' read BOXES TEXT; do echo -n "$BOXES#"; echo $TEXT | sed 's/./& /g'; done < LINEIMAGE.box | sponge LINEIMAGE.box
    SB2020-eye
    @SB2020-eye

    Thank you, @bertsky . So, I think I understand more of #2 above, so I'd like to explore this route.

    Can anyone point out any resources laying out the steps to this approach?

    SB2020-eye
    @SB2020-eye
    (More specifically:
    1. What is a good (ideally free) tool for annotation of this sort (and some kind of guide, video, or at least basic documentation)?
    2. @bertsky , how would I go about creating the training material from the results of an existing OCR system?
    3. I am searching for Mask-RCNN and instance segmentation methods now (especially looking for something I can grasp and therefore implement). But if anyone can make any suggestions on this, I would be grateful.)
    Robert Sachunsky
    @bertsky
    @EEngl52 @/all Regarding TechCall tomorrow, I would like to suggest tesseract-ocr/tesstrain#261 (Tesseract using training BCER instead of testing CER for checkpointing and evaluation) as topic – hopefully we can get a discussion on how to proceed!
    Robert Sachunsky
    @bertsky
    Does anyone here know how to train Calamari synthetically? There's a class calamari_ocr.ocr.dataset.datareader.generated_line_dataset.line_generator.LineGenerator which takes TTF files and text input and produces a GT dataset, but I don't know how to use that from the CLI (or what kind of parameters around this would be a good choice)...