make -C vendor (which will set up schematron-cli) and
Our initial question is that some hOCR files seem to wrap the HTML content in <xml> tags, while others seem to use just <html> tags.
At the time of the specification, XHTML was the HTML standard du jour (with an XML preamble); nowadays it's HTML5 (with
<!DOCTYPE html>). In practice, it makes little difference for the hOCR use case, apart from details such as self-closing elements.
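To illustrate that the two flavours can be read the same way, here is a minimal sketch using only Python's standard-library `html.parser`, which tolerates both an XML preamble and an HTML5 doctype. The class and function names (`HocrClassCollector`, `collect_hocr_elements`) and the two sample documents are made up for this example, and matching on a class that starts with `ocr` is a simplifying assumption, not something taken from the spec.

```python
from html.parser import HTMLParser


class HocrClassCollector(HTMLParser):
    """Collect (tag, class, title) for elements whose class starts with 'ocr'.

    Hypothetical helper for illustration; the 'ocr' prefix check is a
    simplification and ignores elements carrying several classes.
    """

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        attr = dict(attrs)
        css_class = attr.get("class") or ""
        if css_class.startswith("ocr"):
            self.elements.append((tag, css_class, attr.get("title") or ""))


def collect_hocr_elements(markup):
    """Return the hOCR-looking elements found in an (X)HTML string."""
    parser = HocrClassCollector()
    parser.feed(markup)
    parser.close()
    return parser.elements


# Made-up XHTML-style sample with an XML preamble.
XHTML_SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="ocr_page" title="bbox 0 0 800 600">
<span class="ocrx_word" title="bbox 10 10 50 30">Hello</span>
</div></body></html>"""

# The same made-up content as HTML5.
HTML5_SAMPLE = """<!DOCTYPE html>
<html><body>
<div class="ocr_page" title="bbox 0 0 800 600">
<span class="ocrx_word" title="bbox 10 10 50 30">Hello</span>
</div></body></html>"""

if __name__ == "__main__":
    for tag, css_class, title in collect_hocr_elements(XHTML_SAMPLE):
        print(tag, css_class, title)
```

A lenient HTML parser like this one sidesteps the XHTML/HTML5 question; a strict XML parser would only be an option for files that are well-formed XML.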
Is there a preferred Python library out there for parsing hOCR files?
As @zuphilip mentioned, you could have a look at the hocr-spec-python tool, which should have a fairly complete implementation of all the
title properties and
title properties, have a look at https://github.com/kba/hocr-dom/blob/master/hocr-dom/src/property-parser.js. It's JS but fairly simple to port. There are also a few Python scripts on GitHub based on lxml, BeautifulSoup, etc.
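For a rough idea of what parsing the title attribute involves, here is a minimal Python sketch. It is not a port of the linked JS parser and not how hocr-spec-python does it; `parse_title` and `_coerce` are hypothetical names. It assumes properties are separated by semicolons, that each property is a name followed by whitespace-separated values, and that quoted values contain no semicolons.

```python
import shlex


def _coerce(value):
    """Turn a token into int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def parse_title(title):
    """Parse an hOCR title attribute into {property_name: [values]}.

    Simplified sketch: splits on ';' first, so a semicolon inside a quoted
    value would break it, and numeric-looking strings are always coerced.
    """
    properties = {}
    for chunk in title.split(";"):
        # shlex.split keeps quoted values such as "page 1.png" together.
        tokens = shlex.split(chunk)
        if not tokens:
            continue  # skip empty chunks, e.g. after a trailing semicolon
        name, raw_values = tokens[0], tokens[1:]
        properties[name] = [_coerce(v) for v in raw_values]
    return properties


if __name__ == "__main__":
    print(parse_title('image "page 1.png"; bbox 0 0 800 600; x_wconf 95'))
```

Returning a list of values for every property keeps the result uniform, at the cost of one-element lists for single-valued properties such as `x_wconf`.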