Shubhangi Dutta
Hello everyone, I'm Shubhangi Dutta from IIIT Hyderabad.
I'd like to contribute to cltk. I'm trying to write a tokeniser for Middle English but I'm facing some issues with non-standardised spellings and normalisation. Does anyone have any pointers?
Kyle P. Johnson
Hi guys, we're not part of GSoC this summer. Contributions are still welcome.
@celinarose have you used our default tokenizer? http://docs.cltk.org/en/latest/middle_english.html#stopword-filtering
It is not specific to ME, but our other users have not had issues. That said, we want to know about any shortcomings.
Shubhangi Dutta

@kylepjohnson I went through it, yes. The converter to canonical form seems to be the closest thing to what I'm trying to do. However, it still doesn't standardise some kinds of spellings. Here's what I got trying to tokenise two forms of the same word:

In [2]: normalize_middle_english("kynge kyng", alpha_conv=True)
Out[2]: 'kynge kyng'

It might not be a common issue, but is there a way to fix this?
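
For illustration only, a minimal sketch of the kind of post-processing described here: tokenise first, then collapse variant spellings into one token with a lookup table. This is not CLTK's API. The `VARIANTS` table and the `tokenize` and `normalize_tokens` helpers are hypothetical names, and the table would have to be built by hand or derived from a corpus.

```python
import re

# Hypothetical, hand-built table mapping variant spellings to one chosen
# canonical form. Real coverage would need a much larger, curated list.
VARIANTS = {
    "kynge": "kyng",
    "kinge": "kyng",
    "king": "kyng",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    \\w is Unicode-aware in Python 3, so letters such as thorn and yogh
    are kept inside tokens.
    """
    return re.findall(r"\w+", text.lower())


def normalize_tokens(tokens, variants=VARIANTS):
    """Replace each known variant with its canonical form; keep the rest."""
    return [variants.get(token, token) for token in tokens]


tokens = normalize_tokens(tokenize("kynge kyng"))
print(tokens)
```

A table like this only handles spellings that have been listed in advance; unseen variants pass through unchanged.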

Kyle P. Johnson
That function is for normalization, not tokenization: http://docs.cltk.org/en/latest/middle_english.html#text-normalization
We usually recommend NLTK's tokenizer: from nltk.tokenize.punkt import PunktLanguageVars
Shubhangi Dutta

Err, apologies for the misnomer. What I meant was tokenisation followed by normalisation, so that multiple spellings/forms of the same word end up as a single token. I figured normalisation would be fairly useful for large ME corpora, especially when working with multiple corpora.

And Punkt is for sentence tokenisation, is it not? How would it work for something like normalisation of spellings?

Kyle P. Johnson
No, it's for words. See the docs here; there is a full example: http://docs.cltk.org/en/latest/middle_english.html#stopword-filtering
Shubhangi Dutta
Ah, thank you.
Kyle P. Johnson
Sure, keep at it and let us know if you hit any problems.
I want to implement a couple of new things that CLTK does not possess but NLTK does for the English language. I want to do some kind of research work. Is that allowed?
Please tell me. Thank you.
Amr Keleg
Hi all,
I am working on porting the Khoja Stemmer to Python, and I am wondering whether there is any advantage to writing Unicode characters as \u064a instead of ي in the source code.
Thanks :smile:
Chatziargyriou Eleftheria
Hi @AMR-KELEG! It is generally preferred to keep things as unambiguous as possible. Take, for example, the Cyrillic alphabet, which contains characters that look identical to their Latin counterparts but are encoded as completely different code points.
Amr Keleg
@Sedictious Interesting, I see.
Thanks :smile:
Todd Cook
The ideal format is the Unicode escape value with the visual representation placed in an inline documentation-style comment, such as:
'\u064a' #: ي
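
As a small sketch of why the escape form helps, the snippet below uses only the standard library to show that two visually identical letters are distinct code points. The constant names and the `describe` helper are made up for this example.

```python
import unicodedata

# Escapes in the code, with the glyph kept in a trailing comment.
YEH = "\u064a"  #: ي
LATIN_A = "\u0061"  #: a
CYRILLIC_A = "\u0430"  #: а (looks like the Latin letter, but is not)


def describe(char):
    """Return the code point and the official Unicode name of a character."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


# The two "a" glyphs render alike but compare unequal.
print(LATIN_A == CYRILLIC_A)
for char in (YEH, LATIN_A, CYRILLIC_A):
    print(describe(char))
```

With literals alone, a reviewer cannot tell the two "a" characters apart by eye; the escapes make the difference visible in the source.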
Hi everyone!
I'm a second-year undergraduate in Computer Science who's interested in contributing to the Classical Language Toolkit for GSoC 2020. I'd like to know how I can get started and get familiar with the projects.
Thank you,
William Michael Short
Incidentally, how far advanced is the Anglo-Saxon stuff? I feel like the dialect issue would be a challenge.
Sunil Kumar
Hi, I am looking to expand the Tamil corpus for CLTK. How do I get started?
Rutvik Trivedi
@kylepjohnson Sir, is CLTK planning to apply for GSoC this year?
Kyle P. Johnson
@Rutvik-Trivedi we don't know yet. Maybe.
I wish to contribute to it. Please tell me how to get started.
I know NLP, but I don't know any JavaScript.
Hi everyone