    Robert Sachunsky
    @bertsky
    You would have to add it to recognize.py for now (SetVariable('tessedit_char_blacklist', 'äöü')). But we could expose arbitrary parameter assignments in the tool json...
    Kay-Michael Würzner
    @wrznr
    Do you know how it works internally?
    Robert Sachunsky
    @bertsky
    Yes I do! (I brought this from Tess3 to LSTMs...)
    It acts like a filter (blacklist) or selector (whitelist) during CTC. (If you discount characters encoded as multiple positions internally, and dictionaries, as these make the explanation more involved.)
    So it's as if you don't (blacklist) or only (whitelist) have the respective characters in your unicharset (model) at all.
    You can even combine both: with a blacklist + unblacklist.
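The filter/selector behaviour described above can be pictured as masking characters out of the per-timestep distributions before best-path CTC decoding. A toy sketch in plain Python (not Tesseract's actual code; the symbol set and probabilities are made up):

```python
# Toy best-path CTC decoding with a character blacklist.
# Illustration only: suppressing symbols in the decoder acts as if
# they were never in the unicharset (the model's character set).

BLANK = ""  # CTC blank symbol

def best_path_decode(timesteps, charset, blacklist=frozenset()):
    """Greedy CTC decode: pick argmax per timestep, collapse repeats,
    drop blanks. Blacklisted characters are masked out first."""
    allowed = [c for c in charset if c not in blacklist]
    out, prev = [], None
    for dist in timesteps:  # dist: dict char -> probability
        # mask suppressed characters; renormalizing the remaining
        # mass would not change the argmax
        best = max(allowed + [BLANK], key=lambda c: dist.get(c, 0.0))
        if best != BLANK and best != prev:
            out.append(best)
        prev = best
    return "".join(out)

charset = ["ä", "a", "e", "h"]
# three timesteps where "ä" narrowly beats "a"
timesteps = [
    {"ä": 0.5, "a": 0.4, BLANK: 0.1},
    {BLANK: 0.9, "ä": 0.05, "a": 0.05},
    {"h": 0.8, BLANK: 0.2},
]

print(best_path_decode(timesteps, charset))                   # äh
print(best_path_decode(timesteps, charset, blacklist={"ä"}))  # ah
```

With `ä` blacklisted, the next-best hypothesis `a` wins at the first timestep, which is exactly the "probability mass goes to the runner-up" effect discussed below.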
    Kay-Michael Würzner
    @wrznr
    Does it influence the transition probabilities between characters?
    And a related question: If I want hyphens to be encoded with ⸗ but cannot blacklist -, what would I do?
    Robert Sachunsky
    @bertsky
    IIUC, it distributes the probability mass of the suppressed hypotheses uniformly. So fine-tuning is of course always better than decoder changes. But since you are mixing different models, you have a good chance that another model (with the right kind of umlaut) is better than the next-best hypothesis.

    If I want hyphens to be encoded with ⸗ but cannot blacklist -, what would I do?

    Train a specific post-correction model! I did this for s/ſ and it "repaired" from 3% CER to 0.2% with cor-asv-ann.
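To make the 3% → 0.2% figure concrete: CER is the edit distance between OCR output and ground truth, divided by the ground-truth length. A minimal sketch in plain Python (the sample strings and the trivial rule-based s → ſ "correction" are invented for illustration; a real post-correction model like cor-asv-ann is a trained sequence model, not a rule):

```python
# Character error rate = edit_distance(OCR, GT) / len(GT).
# Toy illustration of measuring a post-correction step.

def levenshtein(a, b):
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(ocr, gt):
    return levenshtein(ocr, gt) / len(gt)

def restore_long_s(text):
    # made-up rule: every non-word-final "s" becomes "ſ"
    # (real long-s distribution is not this simple)
    return " ".join(
        "".join("ſ" if c == "s" and i < len(w) - 1 else c
                for i, c in enumerate(w))
        for w in text.split(" "))

gt  = "daſs die Preſſe"   # ground truth with long s
ocr = "dass die Presse"   # OCR output confusing ſ and s

print(f"CER before: {cer(ocr, gt):.2%}")                 # 20.00%
print(f"CER after:  {cer(restore_long_s(ocr), gt):.2%}") # 0.00%
```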

    Kay-Michael Würzner
    @wrznr
    Sounds good. Do you have a Gist for applications like this?
    Robert Sachunsky
    @bertsky
    (But of course, since this is in the nature of neural post-correction, there are now some new, unrelated errors among those 0.2%.)
    Mike Gerber
    @mikegerber

    in this case GT files processed with ocrd_repair_inconsistencies
    ...which has long been renamed ocrd-segment-repair with sanitize=True and integrated into ocrd_segment

    ?

    Robert Sachunsky
    @bertsky
    I have some general descriptions/recipes in https://github.com/ASVLeipzig/cor-asv-ann-data-processing for training OCR vs GT, but nothing specific on such tailored tasks. But I will soon!
    Kay-Michael Würzner
    @wrznr
    :+1:
    @bertsky ocrd_repair_inconsistencies ≠ ocrd-segment-repair!
    Robert Sachunsky
    @bertsky
    @mikegerber Oh, I'm sorry – I confused this with an earlier incarnation of ocrd_segment. Your repo ocrd_repair_inconsistencies is of course something completely new!
    Kay-Michael Würzner
    @wrznr
    And has a totally different scope.
    Although, it has a very “broad” name. :grin:
    Mike Gerber
    @mikegerber
    hehe, i should call it ocrd-sanitizer-plausibilizer :)
    no, really, that's a good point, it should have a better name
    Robert Sachunsky
    @bertsky
    yes, or we wait with renaming tools and parameters until we have a good plan for a unified treatment of layout post-processing/validation/repairs (with shared components).
    Mike Gerber
    @mikegerber
    in this specific case it's easy to rename because it shouldn't be part of any pipeline, so i would just do it when inspiration for a better name strikes me. but i'm also for integration into some broader repair tool if that hypothetical tool has fine control over which repairs are done (e.g. "only reorder my xml elements to fix textequiv inconsistencies")
    Kay-Michael Würzner
    @wrznr
    @stweil Have you ever trained a model with the data provided at https://github.com/jze/ocropus-model_fraktur?
    Or @mikegerber maybe you did?
    Mike Gerber
    @mikegerber
    @wrznr not yet, have to check it out
    Kay-Michael Würzner
    @wrznr
    Actually looks very promising. But, as with GT4HistOCR, only binarized images and the notorious nrm images.
    Kay-Michael Würzner
    @wrznr
    @mikegerber Why does the line extractor write the PAGE XML to stdout?
    Mike Gerber
    @mikegerber
    @wrznr i'm investigating qurator-spk/sbb_textline_detector#16
    Mike Gerber
    @mikegerber
    @wrznr ocrd_page_generateds.py's parse() prints to stdout unless silence is True; not sure if that qualifies as a bug in parse(), but I'm going to silence it in the OCR-D interface
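    parsers generated by generateDS follow a common convention: parse() builds the object tree and, unless silenced, echoes it back to stdout. A toy stand-in for that pattern (not the actual ocrd_page_generateds.py code; the signature is assumed from generateDS conventions):

```python
# Sketch of the generateDS parser convention: parse() builds the
# tree and, unless silence=True, writes it back to stdout.
# Toy stand-in, not the real ocrd_page_generateds.py.
import io
import sys
import xml.etree.ElementTree as ET

def parse(source, silence=False):
    tree = ET.parse(source)
    root = tree.getroot()
    if not silence:
        # generateDS-generated code does roughly rootObj.export(sys.stdout, 0)
        sys.stdout.write(ET.tostring(root, encoding="unicode"))
    return root

doc = io.StringIO("<PcGts><Page/></PcGts>")
root = parse(doc, silence=True)  # nothing printed
print(root.tag)                  # PcGts
```

    so the fix on the caller side is simply passing silence=True.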
    Kay-Michael Würzner
    @wrznr
    Great, thank you. I would not consider it a bug.
    Mike Gerber
    @mikegerber
    @wrznr fix is committed
    Kay-Michael Würzner
    @wrznr
    @mikegerber Thanks!
    image.png
    A page from the Börsenblatt segmented with sbb-textline-detector.
    While the general result quality is impressive, a number of interface-related questions pop up:
    1. It is called textline-detector but it (also) does region detection, why?
    2. Existing regions are silently overwritten, why?
    3. The detected lines do not respect the detected regions, why?
    Mike Gerber
    @mikegerber
    beware that the author of the segmentation is not here; what @cneud and I can answer is limited
    Kay-Michael Würzner
    @wrznr
    image.png
    Lines.
    No worries! As I wrote: The result is impressive.
    I guess that you and @cneud are responsible for the design of the OCR-D interface anyway.
    And it is great that you released this tool. Many thanks!
    Mike Gerber
    @mikegerber
    1. and 2. seem to be the same question :D i can partly answer question 2: the underlying code uses solely the image as input
    could you open an issue with the questions? otherwise i'd do it myself
    Kay-Michael Würzner
    @wrznr
    I am doing it this very moment.
    Mike Gerber
    @mikegerber
    there might be some stuff going on that doesn't easily translate to "the common workflow", for example because the deskewing uses the textline detection (and vice versa, as i understood it). region detection and line detection could maybe be separated AFAICT, but i'm not 100% sure
    could you also create an issue with an upload of your sample? i see a deskewing problem in the lines. my colleague @vahidrezanezhad is normally more than happy to take a look
    Stefan Weil
    @stweil

    @stweil Have you ever trained a model with the data provided at https://github.com/jze/ocropus-model_fraktur?

    No, we are still busy with GT4HistOCR and the ÖNB data, mixing both in one model after enhancing the ÖNB texts with long s.