These are chat archives for beniz/deepdetect

14th Feb 2019
cchadowitz-pf
@cchadowitz-pf
Feb 14 17:11
hi @beniz! just caught jolibrain/deepdetect#543 going by and was curious whether there's any info on the trained models and/or language support? i'm not up to date on CTC and NCNN so any info available would be greatly appreciated. thanks again for the great work you all do!
Emmanuel Benazera
@beniz
Feb 14 17:13
Hi @cchadowitz-pf, how are things? Yes, we've patched NCNN so that OCR runs on the RPi3 and on embedded devices at large. Expect ~600ms for a ResNet-18 + LSTM + CTC. Basically, we are treating NCNN as an inference-only, lightweight CPU version of Caffe.
Now, on the docs + models side, there's an update coming soon; some of us are actually busy on it, with a refreshed website, an open-source platform for training, and the release of many models, including text detection and good OCR
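(A minimal sketch of what querying such a service might look like over DeepDetect's REST API, assuming a local server and an already-created CTC-based OCR service named "ocr"; the `ctc` output flag follows the DeepDetect docs, but exact parameters may vary by version.)

```python
# Sketch: OCR prediction against a DeepDetect server via its REST API.
# Assumes a server on localhost:8080 and an existing service named "ocr";
# the "ctc" output flag is taken from the DeepDetect docs; check your version.
import requests

DD_URL = "http://localhost:8080"

def ocr_predict(image_path):
    payload = {
        "service": "ocr",                        # hypothetical service name
        "parameters": {"output": {"ctc": True}}, # decode CTC output to a string
        "data": [image_path],
    }
    r = requests.post(DD_URL + "/predict", json=payload)
    r.raise_for_status()
    return r.json()

print(ocr_predict("crop.png"))
```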
cchadowitz-pf
@cchadowitz-pf
Feb 14 17:15
:+1: very cool. are there reference models/data that you're working with, or is this a model that you're building internally? i'm personally interested in multi-lingual OCR as I'm relying on Tesseract at the moment.
oh cool!!
that all sounds super exciting, i can't wait to check it all out :)
Emmanuel Benazera
@beniz
Feb 14 17:17
We have an internal OCR dataset generator we've improved from SynthText; overall, it's similar to https://code.fb.com/ai-research/rosetta-understanding-text-in-images-and-videos-with-machine-learning/
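(As a toy illustration only, not Jolibrain's actual generator: the core idea of a SynthText-style pipeline is to render strings from a text corpus onto background images and keep the rendered string as the ground-truth label. A minimal sketch with Pillow; the word list, font, and background paths are placeholders.)

```python
# Toy SynthText-style sample generator: draw a random string onto a
# background image and return the string as the ground-truth label.
# WORDS, font, and background paths are placeholders, not the real setup.
import random
from PIL import Image, ImageDraw, ImageFont

WORDS = ["street", "exit", "coffee", "42", "open"]  # stand-in corpus

def make_sample(background_path, font_path, out_path):
    img = Image.open(background_path).convert("RGB")
    text = " ".join(random.sample(WORDS, random.randint(1, 3)))
    font = ImageFont.truetype(font_path, size=random.randint(18, 48))
    ImageDraw.Draw(img).text(
        (random.randint(0, img.width // 2), random.randint(0, img.height // 2)),
        text, font=font, fill=(255, 255, 255))
    img.save(out_path)
    return text  # ground truth for this sample
```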
cchadowitz-pf
@cchadowitz-pf
Feb 14 17:18
:+1: awesome
Emmanuel Benazera
@beniz
Feb 14 17:18
@cchadowitz-pf I'm not sure I have your email; if you can shoot a quick message to contact@jolibrain.com, we will be talking to people we know in order to get feedback on the website, platform, and model selection.
our OCR models are trained with Caffe, then converted to NCNN for customers that need embedded models. The full pipeline will be documented, though some of it will come out earlier.
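(For reference, NCNN ships a `caffe2ncnn` converter that takes a Caffe deploy prototxt and caffemodel and emits the NCNN .param/.bin pair; a minimal sketch of that step, with file names as placeholders. The Caffe files may first need upgrading to the newer proto format with Caffe's upgrade tools.)

```python
# Sketch of the Caffe -> NCNN conversion step using NCNN's caffe2ncnn tool.
# File names are placeholders.
import subprocess

subprocess.run(
    ["caffe2ncnn",
     "deploy.prototxt",    # Caffe network definition
     "model.caffemodel",   # trained Caffe weights
     "model.param",        # output: NCNN network definition
     "model.bin"],         # output: NCNN weights
    check=True,
)
```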
If you have metrics for Tesseract on some tiny dataset you can share, that'd be interesting to us.
we will provide a model that detects text, and an OCR model to run on the crops
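(A sketch of how such a two-stage pipeline might be wired together; `detect_text` and `recognize` are hypothetical stand-ins for the detection and OCR models, not actual DeepDetect calls.)

```python
# Hypothetical two-stage pipeline: a detector proposes text boxes,
# then an OCR model runs on each cropped region.
from PIL import Image

def ocr_image(path, detect_text, recognize):
    img = Image.open(path).convert("RGB")
    results = []
    for box in detect_text(img):    # detector returns (x1, y1, x2, y2) boxes
        crop = img.crop(box)        # crop the text region
        results.append((box, recognize(crop)))  # OCR on the crop
    return results
```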
cchadowitz-pf
@cchadowitz-pf
Feb 14 17:22
sure, i'll send a message now. very interesting - have you done any comparisons with existing OCR engines like Tesseract v4?
oops, didn't see that message first - i don't have any metrics i can share at the moment, but i'll see if i can come up with something.
Emmanuel Benazera
@beniz
Feb 14 17:23
We could find a tiny set of images to share.
cchadowitz-pf
@cchadowitz-pf
Feb 14 17:24
sure, if you have a set put together it shouldn't be hard for me to get some stats on our end for comparison
Emmanuel Benazera
@beniz
Feb 14 17:25
Tesseract v4 should be using LSTMs as well. We don't have metrics vs Tesseract. We have millions of generated samples; I could share a small sample.
cchadowitz-pf
@cchadowitz-pf
Feb 14 17:25
yup, Tesseract v4 is using LSTMs, though the codebase/platform isn't quite as convenient
generally speaking i'm more interested in the case of "text in the wild" vs scanned-document OCR, but both are still relevant in their own way
Emmanuel Benazera
@beniz
Feb 14 17:42
@cchadowitz-pf I've put a quick tarball here: https://deepdetect.com/stuff/random_ocr_crops.tar.gz It contains a text file with the ground truth. It's single-word and multi-word strings auto-generated by our code, then cropped.
Our focus is 'text in the wild' as well
some samples are 'impossible', but that's fine for us.
On a much larger set (9k samples instead of 100), our best model hits ~71.7% accuracy (i.e., exactly recognized strings)
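(For clarity, "exactly recognized string" is a sequence-level exact match; a small sketch of that metric alongside a character error rate, for comparison at the character level.)

```python
# Exact-match (sequence-level) accuracy as described above, plus a
# character error rate (edit distance / reference length) for comparison.

def exact_match_accuracy(preds, truths):
    return sum(p == t for p, t in zip(preds, truths)) / len(truths)

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def char_error_rate(pred, truth):
    return levenshtein(pred, truth) / max(1, len(truth))
```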
cchadowitz-pf
@cchadowitz-pf
Feb 14 17:50
are you computing your accuracy at the character, word, phrase, or image level?
and i'm assuming that tarball is all alphanumeric+punctuation, right? no diacritics, right-to-left scripts, or other non-English languages?
Emmanuel Benazera
@beniz
Feb 14 19:56
For the generated data: alphanumeric + punctuation (with the exception of *), left-to-right (though strings can be reversed), English, no accents
for the model: lowercase, alphabet size is 69
The generated data uses a mix of sources, including the whole English Wikipedia; it can be hacked for whatever, as it's a modification of SynthText.
Emmanuel Benazera
@beniz
Feb 14 20:05
I believe there's still much room for improving the models
cchadowitz-pf
@cchadowitz-pf
Feb 14 20:47
well, if it encourages you, Tesseract v4 out of the box is fairly poor at images such as these. not sure what sorts of results you get from your models, but Tesseract doesn't get too many of these right
Emmanuel Benazera
@beniz
Feb 14 21:06
mmm, can't upload images here...
Since it is generated, the distribution of this test set is very close to that of the training set; for that reason we have an external test set. I'll send some pictures from it if needed.
cchadowitz-pf
@cchadowitz-pf
Feb 14 21:20
no worries, was just curious how they compared :)