These are chat archives for cltk/cltk
@kylepjohnson I went through it, yes. The converter to canonical form seems to be the closest thing to what I'm trying to do. However, it still doesn't standardise some kinds of spellings. Here's what I got trying to tokenise two forms of the same word:
In : normalize_middle_english("kynge kyng", alpha_conv=True)
Out: 'kynge kyng'
It might not be a common issue, but is there a way to fix this?
Err, apologies for the misnomer. What I meant was tokenisation followed by normalisation, i.e. collapsing multiple spellings/forms of the same word into a single token. I figured that would be fairly useful for large ME (Middle English) corpora, especially when working with several corpora at once.
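To make that concrete, here is a rough sketch of the behaviour I have in mind. It is plain Python, not CLTK code: `VARIANTS`, `tokenise` and `normalise` are names I made up for illustration, and the variant table is a hand-written assumption rather than anything derived from a real corpus.

```python
import re

# Hypothetical lookup table: each attested spelling maps to one canonical
# form chosen by hand. A real table would have to come from a lexicon.
VARIANTS = {
    "kynge": "kyng",
    "kinge": "kyng",
    "king": "kyng",
}


def tokenise(text):
    """Lowercase the text and return its runs of letters as word tokens.

    The pattern matches Unicode letters only, so thorn and yogh survive
    while punctuation, digits and underscores are dropped.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def normalise(tokens, variants=VARIANTS):
    """Replace each token by its canonical form; unknown tokens pass through."""
    return [variants.get(token, token) for token in tokens]


tokens = normalise(tokenise("kynge kyng"))
```

With this, both spellings end up as the same token, which is what I was hoping the canonical-form converter would do. A lookup table obviously won't scale to a whole corpus on its own, so it's only meant to show the target output, not how it should be implemented.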
And Punkt is for sentence tokenisation, is it not? How would it work for something like normalisation of spellings?