Kyle P. Johnson
Hi @deeox can you DM me this link, please?
@/all GSoC announces orgs this week, so it won't be until after then that someone from the CLTK team will be able to review draft documents. Thank you
Is CLTK participating in GSoC 2019?
Sushant Mehta
CLTK wasn't selected. I'd like to contribute to CLTK during my summer.
Shubhangi Dutta
Hello everyone, I'm Shubhangi Dutta from IIIT Hyderabad.
I'd like to contribute to cltk. I'm trying to write a tokeniser for Middle English but I'm facing some issues with non-standardised spellings and normalisation. Does anyone have any pointers?
Kyle P. Johnson
Hi guys, we're not part of GSoC this summer. Contributions still welcome
@celinarose have you used our default tokenizer?
It is not specific to ME, but our other users have not had issues. That said, we want to know about any shortcomings
Shubhangi Dutta

@kylepjohnson I went through it, yes. The converter to canonical form seems to be the closest thing to what I'm trying to do. However, it still doesn't standardise some kinds of spellings. Here's what I got trying to tokenise two forms of the same word:

In [2]: normalize_middle_english("kynge kyng", alpha_conv=True)
Out[2]: 'kynge kyng'

It might not be a common issue, but is there a way to fix this?

Kyle P. Johnson
That function is for normalization, not tokenization.
We usually recommend NLTK's tokenizer: from nltk.tokenize.punkt import PunktLanguageVars
Shubhangi Dutta

Err, apologies. What I meant was tokenisation followed by normalisation (I figured normalisation would be fairly useful for large ME corpora, especially if working with multiple corpora). I'm sorry about the misnomer, I was thinking of tokenisation followed by normalisation of multiple spellings/forms of the same word into a single token.

And Punkt is for sentence tokenisation, is it not? How would it work for something like normalisation of spellings?

Kyle P. Johnson
No, it's for words. See the docs here, there is a full example:
Shubhangi Dutta
Ah, thank you.
Kyle P. Johnson
Sure, keep at it and let us know if you hit any problems.
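For anyone following the thread, here is a minimal sketch of "tokenisation followed by normalisation" using only the Python standard library. It is not CLTK's or NLTK's implementation: the regex tokenizer is a stand-in for a real word tokenizer, and the `VARIANTS` table, `tokenize` and `normalize_tokens` are hypothetical names invented for this illustration. A real Middle English variant table would have to be built from a corpus or a dictionary.

```python
import re

# Hypothetical variant table (an assumption for illustration only):
# maps attested spellings to one chosen headword.
VARIANTS = {
    "kynge": "kyng",
    "kinge": "kyng",
    "king": "kyng",
}

# Simple stand-in tokenizer: runs of word characters, or single punctuation marks.
WORD_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return WORD_RE.findall(text.lower())


def normalize_tokens(tokens, variants=VARIANTS):
    """Replace each token with its headword if it is a known variant."""
    return [variants.get(tok, tok) for tok in tokens]


tokens = tokenize("kynge kyng")
print(normalize_tokens(tokens))
```

With a lookup like this, both spellings from the example above collapse to the same token. A table-based approach only covers variants someone has listed, so rule-based or edit-distance matching may be needed for unseen spellings.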
I want to implement a couple of things that NLTK has for English but CLTK does not have yet. I would like to do this as research work. Is that allowed?
Please let me know. Thank you.
Amr Keleg
Hi all,
I am working on porting the Khoja stemmer to Python and I am wondering whether there is any advantage to writing Unicode characters as \u064a instead of ي in the source code.
Thanks :smile:
Chatziargyriou Eleftheria
Hi @AMR-KELEG! It is generally preferred to keep things as unambiguous as possible. Take for example the Cyrillic alphabet, which contains characters that look identical to their Latin counterparts but have completely different code points.
Amr Keleg
@Sedictious Interesting, I see.
Thanks :smile:
Todd Cook
The ideal format is the Unicode escape value with the visual representation placed in an inline documentation-style comment, such as:
'\u064a' #: ي
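A small sketch of why the escape form helps, using only the standard library. The `describe` helper is a hypothetical name made up for this example. The escaped and literal forms are the same string to Python, while look-alike characters from different scripts are not equal.

```python
import unicodedata


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


yeh = "\u064a"  #: ي
latin_a = "a"
cyrillic_a = "\u0430"  #: а (looks like Latin "a")

# The look-alikes compare unequal because their code points differ.
print(latin_a == cyrillic_a)

# Printing code points and names makes the difference visible.
for ch in (yeh, latin_a, cyrillic_a):
    print(describe(ch))
```

Writing the escape leaves no doubt about which code point is meant, and the trailing comment keeps the source readable for people who know the script.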
Hi everyone!
I'm a second-year undergraduate in Computer Science who's interested in contributing to the Classical Language Toolkit for GSoC 2020. I'd like to know how I can get started and get familiar with the projects.
Thank you,
William Michael Short
Incidentally, how far advanced is the Anglo-Saxon stuff? I feel like the dialect issue would be a challenge.
Sunil Kumar
Hi, I am looking to expand the Tamil corpus for CLTK. How do I get started?
Rutvik Trivedi
@kylepjohnson Sir, is CLTK planning to apply for GSoC this year?
Kyle P. Johnson
@Rutvik-Trivedi we don't know yet. Maybe
I wish to contribute to it. Please tell me how to get started.
I know NLP but I don't know any JavaScript.
Hi everyone