    Konstantin Baierer
    @kba
    Hi, this is probably a good place to talk about schematron for hocr.
    I've mostly set up the tooling so far in schematron-cli.
    But gotta go for now, be back later.
    wanghaisheng
    @wanghaisheng
    I have previously written several Schematron documents, as well as an XSD and Schematron validator, for validating CDA documents (XML used for information exchange between different health care information systems).
    Konstantin Baierer
    @kba
    Great. What I want to do is replace hocr-check, at least the structural checks, with Schematron and cross-reference the assertions/reports with the hocr-spec.md file, so we can validate hOCR in a meaningful way and possibly produce analytical output for debugging etc.
    Konstantin Baierer
    @kba
    Okay, I have set this up now so that all the necessary steps can be run: Compile Schematron to XSLT, remove namespaces (XHTML), run XSLT, post-process to some easier to use format.
    Check out the schematron branch https://github.com/kba/hocr-spec/tree/schematron
    To test: run make -C vendor (which will set up schematron-cli) and ./hocr-validate <some-hocr>
    I appreciate any feedback/contributions, thanks @wanghaisheng
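
For readers following the steps above, the "remove namespaces (XHTML)" step can be sketched in Python. This is only an illustration using the standard library, not the code in the schematron branch, which this log does not show. The function name strip_namespace and the sample markup are made up for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def strip_namespace(root, namespace=XHTML_NS):
    """Remove the given namespace from every element tag, in place.

    Hypothetical helper: it only rewrites element tags, not attributes.
    """
    prefix = "{" + namespace + "}"
    for element in root.iter():
        # Comments and processing instructions have non-string tags.
        if isinstance(element.tag, str) and element.tag.startswith(prefix):
            element.tag = element.tag[len(prefix):]
    return root


# Made-up hOCR fragment in the XHTML namespace.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div class="ocr_page" title="bbox 0 0 2300 3200"/>'
    "</body></html>"
)

root = strip_namespace(ET.fromstring(sample))
page = root.find("body/div")  # plain paths work once the namespace is gone
print(root.tag, page.get("class"))
```

Once the namespace is gone, XPath expressions and generated XSLT can refer to plain element names such as div or span, which is presumably why the pipeline does this before running the compiled stylesheet.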
    Philipp Zumstein
    @zuphilip
    Small question about bounding boxes: Do the specs actually mention whether they are measured from the top or from the bottom? (Tesseract seems to do it from the top and ocropy with ocropus-hocr from the bottom...?)
    Konstantin Baierer
    @kba

    The bbox coordinates are supposed to be from top left to bottom right:

    the bounding box of the page; for pages, the top left corner must be at (0,0), so a typical page bounding box will look like bbox 0 0 2300 3200
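
If a tool does emit y coordinates measured from the bottom edge, converting them to the top-left convention quoted above is a small calculation. A minimal sketch, assuming the bbox is given as (x0, y0, x1, y1) with y measured upward from the bottom and that the page height is known; flip_bbox is a hypothetical helper, not part of any hOCR tool:

```python
def flip_bbox(bbox, page_height):
    """Convert a bbox with a bottom-left origin to a top-left origin.

    Assumes bbox = (x0, y0, x1, y1) with y0 <= y1 measured from the bottom.
    """
    x0, y0, x1, y1 = bbox
    # The old upper edge (y1) becomes the new top; the old lower edge the new bottom.
    return (x0, page_height - y1, x1, page_height - y0)


# A box near the top of a 3200 pixel high page, measured from the bottom.
print(flip_bbox((100, 3000, 500, 3100), page_height=3200))  # (100, 100, 500, 200)
```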

    Philipp Zumstein
    @zuphilip
    Thanks, I had only read the first mention of bbox and didn't look carefully at the remaining parts. Then it looks like ocropus-hocr is doing it wrong.
    Dominik Suszczewicz
    @54a81914e70c4fe_twitter
    Hello
    Konstantin Baierer
    @kba
    why, hello there.
    Audun Bjørkøy
    @audub
    Hello, we are trying to get a grasp on the hOCR standard. From different sources we get slightly different syntaxes. We looked for a schema, but there seems to be none? Our initial question is that some hocr files seem to have the <xml> tags wrapping the html content, while others seem to just have <html> tags. What would be the correct standard? Is there a preferred python library out there for parsing hocr files? Thanks.
    Philipp Zumstein
    @zuphilip
    There is a validation tool for hocr files: https://github.com/kba/hocr-spec-python . Some tools for parsing hocr files can be found at https://github.com/tmbdev/hocr-tools , which was started by @tmbdev, who also first came up with the hocr format in the context of ocropus.
    Konstantin Baierer
    @kba

    Our initial question is that some hocr files seem to have the <xml> tags wrapping the html content, while others seem to just have <html> tags.

    At the time of the specification, XHTML was the HTML standard du jour (with an XML preamble); nowadays it's HTML5 (with <!DOCTYPE html>). In practice, it makes little difference for the hOCR use case, except for self-closing elements etc.
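
To illustrate how little the preamble matters for extraction, here is a rough sketch using Python's standard html.parser, which accepts both flavours, including self-closing elements. The class HocrElementCollector, the helper collect and both sample documents are invented for this example; selecting elements whose class starts with "ocr" is a simplifying assumption.

```python
from html.parser import HTMLParser


class HocrElementCollector(HTMLParser):
    """Collect (tag, classes, title) for elements with an ocr* class."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <div ... />.
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if any(name.startswith("ocr") for name in classes):
            self.elements.append((tag, classes, attributes.get("title") or ""))


def collect(markup):
    parser = HocrElementCollector()
    parser.feed(markup)
    parser.close()
    return parser.elements


xhtml_doc = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div class="ocr_page" title="bbox 0 0 2300 3200"/>'
    "</body></html>"
)
html5_doc = (
    "<!DOCTYPE html>\n"
    "<html><body>"
    '<div class="ocr_page" title="bbox 0 0 2300 3200"></div>'
    "</body></html>"
)

print(collect(xhtml_doc) == collect(html5_doc))  # True
```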

    Is there a preferred python library out there for parsing hocr files?

    As @zuphilip mentioned, you could have a look at the hocr-spec-python tool, which should have a fairly complete implementation of all the title properties and class properties.

    For a tokenizer/parser of title properties, have a look at https://github.com/kba/hocr-dom/blob/master/hocr-dom/src/property-parser.js. It's JS but fairly simple to port. Also, there are a few Python scripts on GitHub based on lxml, BeautifulSoup etc.
    The main idea of hOCR is to reuse HTML with a few conventions which makes it relatively easy to parse.
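
As a rough idea of what parsing title properties can look like in Python, here is a minimal sketch. It is not a port of property-parser.js; it assumes properties are separated by semicolons, that each property is a name followed by whitespace-separated values, and that quoted values contain no semicolons. The names parse_title and parse_bbox are hypothetical.

```python
import shlex


def parse_title(title):
    """Split an hOCR title attribute into {property name: [values]}."""
    properties = {}
    for chunk in title.split(";"):
        # shlex keeps double-quoted values (e.g. file names) together.
        # Note: it raises ValueError on unbalanced quotes or stray apostrophes.
        tokens = shlex.split(chunk)
        if not tokens:
            continue  # tolerate empty chunks such as a trailing semicolon
        name, *values = tokens
        properties[name] = values
    return properties


def parse_bbox(properties):
    """Return the bbox as four integers (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = (int(value) for value in properties["bbox"])
    return (x0, y0, x1, y1)


props = parse_title('bbox 0 0 2300 3200; x_wconf 95; image "page 1.png"')
print(props)
print(parse_bbox(props))  # (0, 0, 2300, 3200)
```

A real parser would also need to validate value types per property and handle quoting more carefully, which is where the linked tools go further.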
    Audun Bjørkøy
    @audub
    Sorry for the late reply, thanks a lot guys! We have it all up with IIIF and UniversalViewer at the moment. Still testing performance and such, but it looks promising. Thanks again.
    Konstantin Baierer
    @kba
    Sounds great. If/once you have something running publicly, I'd appreciate a link; it's always interesting to see cool tech in practice.