Indranil Biswas
@kylepjohnson I just went through the project idea and have CLTK set up. I can contribute to the Sanskrit part of the project extension, but one section of the wiki is unclear to me.
Do we have to prepare, gather, or annotate datasets for the extension?
for the data, how much preparation will be required. Concerning this last point, please remember that GSoC is about code, not data cleaning. We are not able to accept an otherwise brilliant application that also requires 6 weeks of data annotation or cleanup. If you believe you are able to do your data prep during application period or Community Bonding period, please explain that.
Soham Ghosh
"For GSoC 2019, we are not encouraging applicants to make small code contributions, but instead to use this time to learn about the CLTK and make excellent proposals." Does this mean that we need to solve the beginner's problem to qualify for the selection process in GSoC 2019?
Hello everyone. I would love to contribute to cltk. Can someone help me please?
Indranil Biswas
Hi ! @saurabhbazzad check out the page https://github.com/cltk/cltk/wiki/Project-ideas
@kylepjohnson Hi Kyle, I have sent you my answers to the 6 questions by email.
Deepak Divya Tejaswi

Hey everyone! I'm Deepak Divya Tejaswi, studying a CSE + Economics dual major at BITS Pilani Goa. I stumbled upon this project while going through GSoC projects and was fascinated by the idea and its impact. I am looking forward to contributing positively to the community.

@kylepjohnson I am interested in contributing towards adding the Sanskrit language. I read the latest blog article and came up with some brief answers to the questions, which can be viewed here

I have also included some working links to Sanskrit text datasets in the first answer.

Kyle P. Johnson
Hi @deeox can you DM me this link, please?
@/all GSoC announces orgs this week, so it won't be until after then that someone from the CLTK team will be able to review draft documents. Thank you
Is CLTK participating in GSoC 2019?
Sushant Mehta
CLTK wasn't selected. I'd like to contribute to CLTK during my summer.
Shubhangi Dutta
Hello everyone, I'm Shubhangi Dutta from IIIT Hyderabad.
I'd like to contribute to cltk. I'm trying to write a tokeniser for Middle English but I'm facing some issues with non-standardised spellings and normalisation. Does anyone have any pointers?
Kyle P. Johnson
Hi guys, we're not part of GSoC this summer. Contributions still welcome
@celinarose have you used our default tokenizer? http://docs.cltk.org/en/latest/middle_english.html#stopword-filtering
It is not specific to ME, but our other users have not had issues. That said, we want to know about any shortcomings
Shubhangi Dutta

@kylepjohnson I went through it, yes. The converter to canonical form seems to be the closest thing to what I'm trying to do. However, it still doesn't standardise some kinds of spellings. Here's what I got trying to tokenise two forms of the same word:

In [2]: normalize_middle_english("kynge kyng", alpha_conv=True)
Out[2]: 'kynge kyng'

It might not be a common issue, but is there a way to fix this?

Kyle P. Johnson
That function is for normalization, not tokenization: http://docs.cltk.org/en/latest/middle_english.html#text-normalization
We usually recommend NLTK's tokenizer: from nltk.tokenize.punkt import PunktLanguageVars
Shubhangi Dutta

Err, apologies. What I meant was tokenisation followed by normalisation (I figured normalisation would be fairly useful for large ME corpora, especially if working with multiple corpora). I'm sorry about the misnomer, I was thinking of tokenisation followed by normalisation of multiple spellings/forms of the same word into a single token.

And Punkt is for sentence tokenisation, is it not? How would it work for something like normalisation of spellings?

Kyle P. Johnson
No, it's for words. See the docs here, there is a full example: http://docs.cltk.org/en/latest/middle_english.html#stopword-filtering
Shubhangi Dutta
Ah, thank you.
Kyle P. Johnson
Sure, keep at it and let us know if you hit any problems.
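For the tokenise-then-normalise idea discussed above, a minimal sketch might look like the following. It uses only the standard library: the regex tokenizer is a stand-in, not NLTK's or CLTK's tokenizer, and `VARIANTS`, `tokenize`, and `normalize_tokens` are hypothetical names with made-up entries, not part of either library.

```python
import re

# Hypothetical variant table: maps variant spellings to one chosen canonical form.
# The entries are illustrative only and are not taken from CLTK.
VARIANTS = {
    "kynge": "kyng",
    "kinge": "kyng",
    "king": "kyng",
}


def tokenize(text):
    """Lowercase the text and split it into runs of word characters.

    This is a simple stand-in for a real word tokenizer.
    """
    return re.findall(r"\w+", text.lower())


def normalize_tokens(tokens, variants=VARIANTS):
    """Replace each token with its canonical form if the table lists one."""
    return [variants.get(token, token) for token in tokens]


tokens = tokenize("The kynge and the kyng")
print(normalize_tokens(tokens))
```

A hand-built table only covers the spellings someone has listed, so for large corpora it would probably need to be combined with rule-based normalisation or generated from a lexicon.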
I want to implement a couple of new things that CLTK does not have but NLTK has for the English language. I want to do some kind of research work. Is that allowed?
Please let me know. Thank you.
Amr Keleg
Hi all,
I am working on porting the Khoja stemmer to Python, and I am wondering whether there is any advantage to writing Unicode characters as \u064a instead of ي in the source code.
Thanks :smile:
Chatziargyriou Eleftheria
Hi @AMR-KELEG! It is generally preferred to keep things as unambiguous as possible. Take for example the Cyrillic alphabet, which contains characters that look identical to their Latin counterparts but have completely different code points.
Amr Keleg
@Sedictious Interesting, I see.
Thanks :smile:
Todd Cook
The ideal format is the Unicode escape value, with the visual representation placed in an inline documentation-style comment, such as:
'\u064a' #: ي
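As a small illustration of the points above (a sketch, with constant names chosen only for this example): the escape and the literal glyph produce the same string in Python, while look-alike letters from different scripts do not compare equal, which is why the explicit escape plus a comment is easier to review.

```python
import unicodedata

# The escape and the literal glyph denote the same one-character string.
YEH_ESCAPED = "\u064a"  #: ي
YEH_LITERAL = "ي"
print(YEH_ESCAPED == YEH_LITERAL)
print(unicodedata.name(YEH_ESCAPED))

# Look-alike letters from different scripts are distinct code points.
LATIN_A = "\u0061"  #: a
CYRILLIC_A = "\u0430"  #: а
print(LATIN_A == CYRILLIC_A)
print(unicodedata.name(LATIN_A))
print(unicodedata.name(CYRILLIC_A))
```

With the escapes written out, a reviewer can tell the two "a" characters apart without relying on how the editor renders them.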
Hi everyone!
I'm a second-year undergraduate in Computer Science who's interested in contributing to the Classical Language Toolkit for GSoC 2020. I'd like to know how I can get started and get familiar with the project.
Thank you,
William Michael Short
Incidentally, how far advanced is the Anglo-Saxon stuff? I feel like the dialect issue would be a challenge.
Sunil Kumar
Hi I am looking to expand the tamil corpus for CLTK. How do I get started?
Rutvik Trivedi
@kylepjohnson Sir, is CLTK planning to apply for GSoC this year?
Kyle P. Johnson
@Rutvik-Trivedi we don't know yet. Maybe
I wish to contribute to it. Please tell me how to get started.
I know NLP, but I don't know any JavaScript.
Hi everyone