These are chat archives for cltk/cltk

26th
Sep 2018
Ben Nagy
@bnagy
Sep 26 2018 02:17
I looked at LEMLAT and it seemed promising. In fact I think I looked at all of these: https://github.com/diyclassics/lemmatizer-experiments/tree/master/notebooks
I am currently very happy with the Collatinus results and speed, but the downside is that it's a huge install, and launching the server etc is a lot of moving parts for users who aren't computer people.
It will take me a little while to learn Python, but at some point I should be able to do a better wrap of sfst-python to make something which would install cleanly. The trouble is that for my own purposes that still doesn't give me a way to select lemmas when there are multiple options. This is where even a simple statistical model built from a bank of manually analyzed data would be great.
Ben Nagy
@bnagy
Sep 26 2018 02:26
Anyway, summary, for now I am going to finish this present project with Collatinus. For CLTK I think it would be good to have a data structure that could be queried by morphology somehow - so I say 'cadam 1st sing future active' and it gives me a number in [0,1] representing the percentage of that morphology for that form. If you have hints as to how I could build that then I'll check it out at some point :)
(such a structure could then be used by any of the taggers, lemmatizers, etc that return multiple options)
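A minimal sketch of the kind of structure described above, assuming a bank of hand-analyzed tokens is available as (form, analysis) pairs. The class name `MorphFrequencyTable`, its methods, and the counts are hypothetical and invented for illustration; none of this is an existing CLTK API.

```python
from collections import Counter, defaultdict


class MorphFrequencyTable:
    """Relative frequency of each analysis of a surface form.

    Hypothetical helper, not part of CLTK: counts are expected to come
    from manually analyzed data.
    """

    def __init__(self):
        # form -> Counter mapping analysis string -> number of occurrences
        self._counts = defaultdict(Counter)

    def add(self, form, analysis, count=1):
        """Record `count` hand-verified occurrences of `form` with `analysis`."""
        self._counts[form.lower()][analysis] += count

    def probability(self, form, analysis):
        """Return a number in [0, 1]; 0.0 if the form was never seen."""
        counts = self._counts.get(form.lower())
        if not counts:
            return 0.0
        return counts[analysis] / sum(counts.values())

    def ranked(self, form):
        """Return (analysis, relative frequency) pairs, most frequent first."""
        counts = self._counts.get(form.lower(), Counter())
        total = sum(counts.values())
        return [(a, c / total) for a, c in counts.most_common()]


# Toy counts, invented purely for illustration.
table = MorphFrequencyTable()
table.add("cadam", "1st sing future active indicative", 3)
table.add("cadam", "1st sing present active subjunctive", 1)

print(table.probability("cadam", "1st sing future active indicative"))  # 0.75
```

Any tagger or lemmatizer that returns several candidates could then sort them with `ranked()`. The numbers are only as trustworthy as the annotated data behind them.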
Kyle P. Johnson
@kylepjohnson
Sep 26 2018 17:27
  • so I say 'cadam 1st sing future active' and it gives me a number in [0,1] representing the percentage of that morphology for that form.
^^^ Do you mean a percentage likelihood that cadam is 1st sing future active (vs subjunctive, other ambiguities)? The trouble I see is that this likelihood would be closely tied to the training corpora, which are currently rather small and mostly confined to Classical Latin.
Ben Nagy
@bnagy
Sep 26 2018 23:31
I agree that it’s only as good as your training corpus, but as I read the code right now it’s better than nothing (the present approach seems to be ‘most represented lemma, otherwise just guess’). Getting access to a better and more varied corpus of hand-verified Latin might be possible?
Ben Nagy
@bnagy
Sep 26 2018 23:39
I’m not trying to be critical here, and I know it would be at best a “use with caution” feature. For true POS tagging it will be awful. For lemmatising, imvho, probably pretty reasonable - I have seen a few verbs with collisions where one lemma is a top 500 word and the other is a dodo egg.
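The lemma-collision case can be sketched as a frequency comparison: pick a candidate only when it clearly dominates the others, and otherwise decline to choose. The function `pick_lemma`, the `min_ratio` threshold, and the counts below are assumptions made for illustration, not taken from CLTK or any real corpus.

```python
def pick_lemma(candidates, lemma_counts, min_ratio=10.0):
    """Return the dominant candidate lemma, or None if there is no clear winner.

    candidates   -- lemmas proposed by an analyzer for one surface form
    lemma_counts -- lemma -> corpus frequency (hypothetical data)
    min_ratio    -- how many times more frequent the top lemma must be
                    than the runner-up (arbitrary illustrative threshold)
    """
    ranked = sorted(candidates, key=lambda l: lemma_counts.get(l, 0), reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0]
    top = lemma_counts.get(ranked[0], 0)
    runner_up = lemma_counts.get(ranked[1], 0)
    if top == 0:
        # No candidate was ever seen, so there is no evidence to choose on.
        return None
    if runner_up == 0 or top / runner_up >= min_ratio:
        return ranked[0]
    return None


# Invented counts: "est" can come from sum ("to be") or edo ("to eat").
counts = {"sum": 5000, "edo": 12}
print(pick_lemma(["sum", "edo"], counts))  # sum
```

Returning `None` for close calls keeps the "use with caution" character of the feature: the caller can fall back to whatever it did before.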