Soham Ghosh
@isohamnemesis
"For GSoC 2019, we are not encouraging applicants to make small code contributions, but instead to use this time to learn about the CLTK and make excellent proposals." Does this mean that we need to solve the beginner's problem to qualify for the selection process in GSoC 2019?
saurabhbazzad
@saurabhbazzad
Hello everyone. I would love to contribute to cltk. Can someone help me please?
Indranil Biswas
@glitch401
Hi @saurabhbazzad! Check out the page https://github.com/cltk/cltk/wiki/Project-ideas
MC
@michiboo
@kylepjohnson Hi Kyle, I have sent you my answers to the 6 questions by email.
Deepak Divya Tejaswi
@deeox

Hey everyone! I'm Deepak Divya Tejaswi, studying a CSE+Economics dual major at BITS Pilani Goa. I stumbled upon this project while going through GSoC projects and was fascinated by its idea and impact. I am looking forward to contributing positively to the community.

@kylepjohnson I am interested in contributing towards adding the language Sanskrit. I hope I have answered some of the questions in the following doc. I read the latest blog article and came up with some brief answers to questions that can be viewed here

I have also included some working links to Sanskrit text datasets in the first answer.

Kyle P. Johnson
@kylepjohnson
Hi @deeox can you DM me this link, please?
@/all GSoC announces orgs this week, so it won't be until after then that someone from the CLTK team will be able to review draft documents. Thank you
SANMITRA
@sanmitraD
Is CLTK participating in GSoC 2019?
Sushant Mehta
@SMe12435
No
CLTK wasn't selected. I'd like to contribute to CLTK during my summer.
Shubhangi Dutta
@celinarose
Hello everyone, I'm Shubhangi Dutta from IIIT Hyderabad.
I'd like to contribute to cltk. I'm trying to write a tokeniser for Middle English but I'm facing some issues with non-standardised spellings and normalisation. Does anyone have any pointers?
Kyle P. Johnson
@kylepjohnson
Hi guys, we're not part of GSoC this summer. Contributions still welcome
@celinarose have you used our default tokenizer? http://docs.cltk.org/en/latest/middle_english.html#stopword-filtering
It is not specific to ME, but our other users have not had issues. That said, we want to know about any shortcomings
Shubhangi Dutta
@celinarose

@kylepjohnson I went through it, yes. The converter to canonical form seems to be the closest thing to what I'm trying to do. However, it still doesn't standardise some kinds of spellings. Here's what I got trying to tokenise two forms of the same word:

In [2]: normalize_middle_english("kynge kyng", alpha_conv=True)
Out[2]: 'kynge kyng'

It might not be a common issue, but is there a way to fix this?

Kyle P. Johnson
@kylepjohnson
That function is for normalization, not tokenization: http://docs.cltk.org/en/latest/middle_english.html#text-normalization
We usually recommend NLTK's tokenizer: from nltk.tokenize.punkt import PunktLanguageVars
Shubhangi Dutta
@celinarose

Err, apologies. What I meant was tokenisation followed by normalisation (I figured normalisation would be fairly useful for large ME corpora, especially if working with multiple corpora). I'm sorry about the misnomer; I was thinking of tokenisation followed by normalisation of multiple spellings/forms of the same word into a single token.

And Punkt is for sentence tokenisation, is it not? How would it work for something like normalisation of spellings?

Kyle P. Johnson
@kylepjohnson
No, PunktLanguageVars also tokenizes words (via its word_tokenize method). See the docs here; there is a full example: http://docs.cltk.org/en/latest/middle_english.html#stopword-filtering
Shubhangi Dutta
@celinarose
Ah, thank you.
Kyle P. Johnson
@kylepjohnson
Sure, keep at it and let us know if you hit any problems.
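The "tokenise, then fold variant spellings into one token" idea from the exchange above could be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the `VARIANTS` table, the regex tokenizer, and the function names are hypothetical and are not part of CLTK or NLTK. A real Middle English normaliser would need a much larger, curated variant list or a rule-based approach.

```python
import re

# Hypothetical variant table (illustrative only, not from CLTK):
# maps attested spellings to a single chosen headword.
VARIANTS = {
    "kynge": "kyng",
    "kinge": "kyng",
    "king": "kyng",
}

# Simple word pattern; stands in for whatever word tokenizer is actually used.
WORD_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return WORD_RE.findall(text.lower())


def normalize_tokens(tokens, variants=VARIANTS):
    """Replace each known variant with its headword; leave other tokens as-is."""
    return [variants.get(tok, tok) for tok in tokens]


if __name__ == "__main__":
    print(normalize_tokens(tokenize("kynge kyng")))
```

With this table, both spellings from the example collapse to the same token, while words missing from the table pass through unchanged.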
Srinivas
@srinivasmachiraju
I want to implement a couple of features that NLTK has for English but CLTK does not have yet, as a kind of research project. Is that allowed?
Please let me know. Thank you.
Amr Mohamed
@AMR-KELEG
Hi all,
I am working on porting the Khoja stemmer to Python, and I am wondering whether there is any advantage to writing Unicode characters as \u064a instead of ي in the source code.
Thanks :smile:
Chatziargyriou Eleftheria
@Sedictious
Hi @AMR-KELEG! It is generally preferred to keep things as unambiguous as possible. Take, for example, the Cyrillic alphabet, which contains characters that look identical to their Latin counterparts but have completely different code points.
Amr Mohamed
@AMR-KELEG
@Sedictious Interesting, I see.
Thanks :smile:
Todd Cook
@todd-cook
The ideal format is the Unicode escape value, with the visual representation placed in an inline documentation-style comment, such as:
'\u064a' #: ي
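To make the point about escapes and look-alike characters concrete, here is a small sketch using only Python's standard `unicodedata` module. The helper name `describe` and the constant names are made up for illustration.

```python
import unicodedata

# Escape form with the glyph in a comment, as suggested above.
YEH = "\u064a"  #: ي

# Two letters that render almost identically but are different characters.
LATIN_A = "\u0061"     #: a (Latin)
CYRILLIC_A = "\u0430"  #: а (Cyrillic)


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    print(describe(YEH))
    print(describe(LATIN_A))
    print(describe(CYRILLIC_A))
    print(LATIN_A == CYRILLIC_A)  # False: same look, different code points
```

Writing the escape keeps the exact code point visible in the source, and the trailing comment keeps the line readable for people who know the script.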
HarshvardhanJha1
@HarshvardhanJha1
Hi everyone!
I'm a second-year undergraduate in Computer Science who's interested in contributing to the Classical Language Toolkit for GSoC 2020. I'd like to know how I can get started and get familiar with the project.
Thank you,
Harshvardhan
William Michael Short
@wmshort
Incidentally, how far advanced is the Anglo-Saxon stuff? I feel like the dialect issue would be a challenge.
Sunil Kumar
@DragonPG
Hi, I am looking to expand the Tamil corpus for CLTK. How do I get started?