    Robert Sachunsky
    @bertsky
    @mikegerber yes, you are always applying NFC (and maybe that's reasonable). As to compatibility / transcription levels: there will always be different demands, so libraries should offer as many useful cases as possible. (And having a library implementation of GT transcription level normalization is a must for OCR-D.) And yes, let's discuss that on Monday :-)
    @wrznr just giving Konstantin some space :-)
    Mike Gerber
    @mikegerber
    @bertsky the way it's built, it should give the same results if i used NFD, because i am always comparing grapheme clusters
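A minimal sketch of why comparing grapheme clusters makes the choice between NFC and NFD irrelevant. This is not the actual evaluation code: the helper names `clusters` and `count_mismatches` are hypothetical, the cluster segmentation is a simplification (base character plus following combining marks, not full UAX #29), and the comparison is a naive position-wise one instead of a real alignment.

```python
import unicodedata


def clusters(text, form="NFC"):
    """Normalize text, then split it into approximate grapheme clusters.

    Simplification: a cluster is one base character plus any combining
    marks that follow it. Real segmentation follows UAX #29.
    """
    text = unicodedata.normalize(form, text)
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch  # attach combining mark to the previous cluster
        else:
            out.append(ch)  # start a new cluster
    return out


def count_mismatches(gt, ocr, form="NFC"):
    """Count differing clusters position by position (hypothetical helper)."""
    a, b = clusters(gt, form), clusters(ocr, form)
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


decomposed = "Schu\u0308tze"   # u followed by COMBINING DIAERESIS
precomposed = "Sch\u00fctze"   # precomposed u with diaeresis

for form in ("NFC", "NFD"):
    print(form, count_mismatches(decomposed, precomposed, form),
          count_mismatches("Schutze", precomposed, form))
```

Under either form the two spellings of the umlaut end up as one identical cluster, and a missing diaeresis counts as exactly one differing cluster.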
    Mike Gerber
    @mikegerber
    i am not going to do it on monday, but judging from conversations i had elsewhere there's also a need for a short talk on normalization - there's a lot of confusion about what it means in different contexts or use cases
    Konstantin Baierer
    @kba
    @stweil https://github.com/Doreenruirui/okralact/tree/master/install/tesseract_patch patched versions for lstmtraining in tesseract by @Doreenruirui
    Mike Gerber
    @mikegerber

    Despite the smaller issues, the trained models work quite well for Tesseract as one can see for an image from the Reichsanzeiger: https://ub-backup.bib.uni-mannheim.de/~stweil/ocrd-train/results/.

    with ocrd_tesserocr 0.2.2 (couldn't update yet, also still using tesseract 4.0 in that container) the segmentation results on this are horrible. native tesseract 4.1 (i still have rc4) is better but still bad. ocrd_tesserocr copes better with the bad segmentation than ocrd_calamari

    Robert Sachunsky
    @bertsky
    you need at least 0.3.0, better 0.4.0 (current master)
    Mike Gerber
    @mikegerber
    0.4.0 requires tesseract from ubuntu 19 or a newer tesseract on ubuntu 18 etc. pp. that's why i'm not there yet
    Robert Sachunsky
    @bertsky
    I don't understand. ocrd_tesserocr only requires a reasonably recent tesserocr (which is ocrd-fork-tesserocr==3.0.0rc2 from OCR-D/tesserocr as long as @sirfz does not update to 4.1). The underlying Tesseract you can always build/install manually
    Mike Gerber
    @mikegerber
    the newest ocrd_tesserocr update pulled in a new ocrd-fork-tesserocr that doesn't build with the ubuntu 18.04 tesseract, so this broke my setup for now.
    Robert Sachunsky
    @bertsky
    true, because 18.04 only ships a 4.0 pre-release! You have to build and install Tesseract manually on your system
    Mike Gerber
    @mikegerber
    as i said, i couldn't update yet
    Robert Sachunsky
    @bertsky
    update what? Are you really running Tesseract from deb? (why?)
    Mike Gerber
    @mikegerber
    update tesseract.
    Robert Sachunsky
    @bertsky
    @mikegerber well, regardless of why you cannot install system-wide, you can always install into a local PREFIX and set your (virtualenv) shell's PATH and LD_LIBRARY_PATH accordingly...
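A rough sketch of the local-PREFIX idea from the message above, written in Python so it can be used from a script inside the virtualenv. The prefix path and the helper name `env_with_prefix` are assumptions for illustration, and the `bin`/`lib` layout assumes a default autotools-style install.

```python
import os
import shutil


def env_with_prefix(prefix, base_env=None):
    """Return a copy of the environment with PREFIX/bin and PREFIX/lib
    prepended to PATH and LD_LIBRARY_PATH (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    for var, subdir in (("PATH", "bin"), ("LD_LIBRARY_PATH", "lib")):
        new = os.path.join(prefix, subdir)
        old = env.get(var)
        env[var] = new + os.pathsep + old if old else new
    return env


# Assumed install location; adjust to wherever Tesseract was installed.
prefix = os.path.expanduser("~/.local/tesseract")
env = env_with_prefix(prefix)

# Shows which tesseract binary would be picked up (None if there is none).
print(shutil.which("tesseract", path=env["PATH"]))
```

The resulting `env` dict can be passed to `subprocess.run(..., env=env)`, so the locally built binary and its shared libraries take precedence over the distribution packages.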
    Stefan Weil
    @stweil
    Isn't there a Tesseract 4.1 from @sirfz already?
    Mike Gerber
    @mikegerber
    @bertsky i have some hard to fix technical difficulties here that just manifested yesterday when i tried to update. just believe me for now :) i hope i can fix this next week
    Robert Sachunsky
    @bertsky
    No, there is no new release (2.4.0 is still most recent). Besides, Mike's problem seems to be Tesseract itself
    (2.4.0 is Nov 2018)
    Mike Gerber
    @mikegerber
    na, the whole story is: i use this stuff in an 18.04 container. yesterday's update forced me to update tesseract from the ubuntu-supplied tesseract 4.0 pre-release (i wasn't aware of that) and then a whole different problem with ubuntu's apt repository (or SBB's web proxy) blocks my further attempts to update.
    Robert Sachunsky
    @bertsky
    ok, so you have deb trouble, but installing under /usr/local should always work (just sayin)
    Mike Gerber
    @mikegerber
    on my host system i have 4.1.0rc4 and there i have a problem from hell: some AVX2 code in 4.1 versions hard resets my computer unless i disable AVX2, so i am somewhat reluctant to change anything because it involves patching or being super careful with disabling AVX2 at runtime
    Kay-Michael Würzner
    @wrznr
    Focus?
    Mike Gerber
    @mikegerber
    @bertsky no, i can't because i cannot update my container because ubuntu is broken :)
    Robert Sachunsky
    @bertsky
    ouch!!
    Mike Gerber
    @mikegerber
    but never mind, i don't think this is going anywhere
    Robert Sachunsky
    @bertsky
    thanks – I consider myself warned.
    Stefan Weil
    @stweil
    https://packages.debian.org/sid/tesseract-ocr is 4.1 for Debian - made some days ago, so rather fresh. It works for me on Debian stable, too.
    Mike Gerber
    @mikegerber
    image.png
    @stweil this is my colleague vahid's binarization + segmentation plus ocrd_calamari with the gt4histocr model
    these are the files: http://area.staatsbibliothek-berlin.de/sbb-upload/qurator/2019-08-23-reichsanzeiger/index.html - i hope they download ok because i'm "ripe for the week end"
    Konstantin Baierer
    @kba
    works for me
    Stefan Weil
    @stweil
    That looks really good. Especially the segmentation is impressive.
    Clemens Neudecker
    @cneud
    Yes! Very happy with the outputs from QURATOR colleagues @mikegerber & @vahidrezanezhad and the synergies for OCR-D. We will supply OCR-D processors for binarization and segmentation in the coming months via https://github.com/qurator-spk.
    Konstantin Baierer
    @kba

    @bertsky @finkf

    I'm getting segfaults for Image.new() in PIL. Have you ever experienced that?

    <PIL.Image.Image image mode= size=0x0 at 0x7F5B3947EF28>
    AFTER
    corrupted size vs. prev_size
    Fatal Python error: Aborted

    Current thread 0x00007f5b4ed6c740 (most recent call first):
      File "/home/kba/env/py3.6.5/lib/python3.6/site-packages/PIL/Image.py", line 2378 in new
      File "/home/kba/build/github.com/OCR-D/monorepo/core/ocrd_utils/ocrd_utils/__init__.py", line 373 in polygon_mask
      File "/home/kba/build/github.com/OCR-D/monorepo/core/ocrd_utils/ocrd_utils/__init__.py", line 254 in image_from_polygon
      File "/home/kba/build/github.com/OCR-D/monorepo/core/ocrd/ocrd/workspace.py", line 366 in image_from_segment
      File "/home/kba/build/github.com/OCR-D/monorepo/ocrd_tesserocr/ocrd_tesserocr/recognize.py", line 267 in _process_existing_words
      File "/home/kba/build/github.com/OCR-D/monorepo/ocrd_tesserocr/ocrd_tesserocr/recognize.py", line 211 in _process_lines
      File "/home/kba/build/github.com/OCR-D/monorepo/ocrd_tesserocr/ocrd_tesserocr/recognize.py", line 183 in _process_regions
      File "/home/kba/build/github.com/OCR-D/monorepo/ocrd_tesserocr/ocrd_tesserocr/recognize.py", line 144 in process
      File "/home/kba/build/github.com/OCR-D/monorepo/ocrd_tesserocr/test/test_recognize.py", line 62 in runTest
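The "Fatal Python error ... (most recent call first)" layout above looks like what Python's standard `faulthandler` module prints when the process receives a fatal signal. For anyone trying to reproduce such a crash outside the test runner, a small sketch of switching it on explicitly follows; the helper name `enable_crash_tracebacks` is made up for illustration.

```python
import faulthandler
import sys


def enable_crash_tracebacks():
    """Dump Python-level tracebacks on SIGSEGV, SIGABRT and similar signals.

    Writes to the original stderr so the output survives redirection of
    sys.stderr by a test runner.
    """
    if not faulthandler.is_enabled():
        faulthandler.enable(file=sys.__stderr__, all_threads=True)
    return faulthandler.is_enabled()


enable_crash_tracebacks()
```

The same effect is available without code changes via `python -X faulthandler` or by setting the environment variable `PYTHONFAULTHANDLER=1`. It only shows where the Python side was when the native code crashed; it does not identify the faulty C library.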
    Clemens Neudecker
    @cneud
    Weird! A quick Google search brings up this workaround, which is unrelated to the lib that causes the segfault: export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4"
    Possibly this bug in Python 3.6 & 3.7 https://bugs.python.org/issue35924
    Konstantin Baierer
    @kba

    With the LD_PRELOAD, I get an additional statement

    src/tcmalloc.cc:283] Attempt to free invalid pointer 0x5010000

    (did export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4)

    https://bugs.python.org/issue35924

    That bug is about curses, the terminal windowing lib. I don't think it's related. but will check tomorrow.

    thanks for the leads
    Clemens Neudecker
    @cneud
    It apparently occurs in multiple use cases e.g. r9y9/gantts#14, pytorch/pytorch#10567
    Konstantin Baierer
    @kba
    Yeah, it's some bug in the C code. Or it might just be a corrupt binary somewhere. Gonna try a fresh virtualenv tomorrow.
    Konstantin Baierer
    @kba
    Might also be a bug in the tesserocr integration.
    Clemens Neudecker
    @cneud
    it persisted?
    Konstantin Baierer
    @kba
    Ha! Uninstalling all tesseract/leptonica remnants and reinstalling from http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu fixed it! :tada:
    also, upstream tesserocr 2.4.1 seems to work just fine, so we can scrap ocrd-fork-tesserocr I hope
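After a cleanup like this, one way to confirm which Tesseract and Leptonica builds the Python binding is actually linked against is to ask tesserocr itself. A small sketch, assuming tesserocr is installed in the active virtualenv; the wrapper name `linked_tesseract_info` is hypothetical, while `tesserocr.tesseract_version()` is the binding's own function.

```python
def linked_tesseract_info():
    """Return the version string reported by the linked Tesseract library,
    or None if tesserocr cannot be imported in this environment."""
    try:
        import tesserocr
    except ImportError:
        return None
    # Reports the Tesseract version together with Leptonica and the
    # image libraries it was built with.
    return tesserocr.tesseract_version()


info = linked_tesseract_info()
print(info if info is not None else "tesserocr is not importable here")
```

If the reported versions differ from what `tesseract --version` prints on the command line, leftovers from an older install are probably still on the library path.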
    Clemens Neudecker
    @cneud
    :clap:
    Stefan Weil
    @stweil
    Buona notte!
    Robert Sachunsky
    @bertsky
    PAGE viewer for 2019 namespace has been released: https://github.com/PRImA-Research-Lab/prima-page-viewer/releases/tag/1.4