These are chat archives for cltk/cltk

1 Mar 2019
Shubhangi Dutta
@celinarose
Mar 01 16:41
Hello everyone, I'm Shubhangi Dutta from IIIT Hyderabad.
I'd like to contribute to cltk. I'm trying to write a tokeniser for Middle English but I'm facing some issues with non-standardised spellings and normalisation. Does anyone have any pointers?
Kyle P. Johnson
@kylepjohnson
Mar 01 17:46
Hi guys, we're not part of GSoC this summer. Contributions are still welcome.
@celinarose have you used our default tokenizer? http://docs.cltk.org/en/latest/middle_english.html#stopword-filtering
It is not specific to Middle English (ME), but our other users have not had issues. That said, we want to know about any shortcomings.
Shubhangi Dutta
@celinarose
Mar 01 18:57

@kylepjohnson I went through it, yes. The converter to canonical form seems to be the closest thing to what I'm trying to do. However, it still doesn't standardise some kinds of spellings. Here's what I got trying to tokenise two forms of the same word:

In [2]: normalize_middle_english("kynge kyng", alpha_conv=True)
Out[2]: 'kynge kyng'

It might not be a common issue, but is there a way to fix this?

Kyle P. Johnson
@kylepjohnson
Mar 01 19:03
That function is for normalization, not tokenization: http://docs.cltk.org/en/latest/middle_english.html#text-normalization
We usually recommend NLTK's tokenizer: from nltk.tokenize.punkt import PunktLanguageVars
Shubhangi Dutta
@celinarose
Mar 01 19:26

Err, apologies. What I meant was tokenisation followed by normalisation, that is, collapsing multiple spellings/forms of the same word into a single token. I figured normalisation would be fairly useful for large ME corpora, especially when working with multiple corpora. I'm sorry about the misnomer.

And Punkt is for sentence tokenisation, is it not? How would it work for something like normalisation of spellings?

Kyle P. Johnson
@kylepjohnson
Mar 01 19:33
No, it's for words. See the docs here; there is a full example: http://docs.cltk.org/en/latest/middle_english.html#stopword-filtering
Shubhangi Dutta
@celinarose
Mar 01 19:45
Ah, thank you.
Kyle P. Johnson
@kylepjohnson
Mar 01 19:53
Sure, keep at it and let us know if you hit any problems.
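
For readers following the thread, the "tokenise, then collapse variant spellings into one token" idea discussed above could be sketched roughly as follows. This is only an illustration under stated assumptions: it is not CLTK's or NLTK's API, the regex is a simple stand-in for a real word tokenizer, and the VARIANTS table, tokenize and normalize_tokens are hypothetical names with hand-picked entries.

```python
import re

# Hypothetical variant table: maps attested spellings to one chosen headword.
# The entries are illustrative only and are not taken from CLTK.
VARIANTS = {
    "kynge": "kyng",
    "kinge": "kyng",
    "king": "kyng",
}

# Simple stand-in for a real word tokenizer (assumption, not Punkt).
WORD_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return WORD_RE.findall(text.lower())


def normalize_tokens(tokens, variants=VARIANTS):
    """Replace each token with its headword; unknown tokens pass through unchanged."""
    return [variants.get(tok, tok) for tok in tokens]


tokens = tokenize("The kynge and the kyng")
print(normalize_tokens(tokens))  # ['the', 'kyng', 'and', 'the', 'kyng']
```

A lookup table like this only covers spellings someone has listed, so for a large corpus it would probably need to be generated from a lexicon or supplemented with rule-based or edit-distance matching.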