Kyle P. Johnson
@kylepjohnson
Hi @deeox can you DM me this link, please?
@/all GSoC announces orgs this week, so it won't be until after then that someone from the CLTK team will be able to review draft documents. Thank you
SANMITRA
@sanmitraD
Is CLTK participating in GSoC 2019?
Sushant Mehta
@SMe12435
No
CLTK wasn't selected. I'd like to contribute to CLTK during my summer.
Shubhangi Dutta
@celinarose
Hello everyone, I'm Shubhangi Dutta from IIIT Hyderabad.
I'd like to contribute to cltk. I'm trying to write a tokeniser for Middle English but I'm facing some issues with non-standardised spellings and normalisation. Does anyone have any pointers?
Kyle P. Johnson
@kylepjohnson
Hi guys, we're not part of GSoC this summer. Contributions still welcome
@celinarose have you used our default tokenizer? http://docs.cltk.org/en/latest/middle_english.html#stopword-filtering
It is not specific to ME, but our other users have not had issues. That said, we want to know about any shortcomings
Shubhangi Dutta
@celinarose

@kylepjohnson I went through it, yes. The converter to canonical form seems to be the closest thing to what I'm trying to do. However, it still doesn't standardise some kinds of spellings. Here's what I got trying to tokenise two forms of the same word:

In [2]: normalize_middle_english("kynge kyng", alpha_conv=True)
Out[2]: 'kynge kyng'

It might not be a common issue, but is there a way to fix this?

Kyle P. Johnson
@kylepjohnson
That function is for normalization, not tokenization: http://docs.cltk.org/en/latest/middle_english.html#text-normalization
We usually recommend NLTK's tokenizer: from nltk.tokenize.punkt import PunktLanguageVars
Shubhangi Dutta
@celinarose

Err, apologies for the misnomer. What I meant was tokenisation followed by normalisation, that is, collapsing multiple spellings/forms of the same word into a single token. I figured normalisation would be fairly useful for large ME corpora, especially when working with multiple corpora.

And Punkt is for sentence tokenisation, is it not? How would it work for something like normalisation of spellings?

Kyle P. Johnson
@kylepjohnson
No, it's for words. See the docs here; there is a full example: http://docs.cltk.org/en/latest/middle_english.html#stopword-filtering
Shubhangi Dutta
@celinarose
Ah, thank you.
Kyle P. Johnson
@kylepjohnson
Sure, keep at it and let us know if you hit any problems.
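The "tokenise, then normalise spelling variants" idea discussed above can be sketched roughly as follows. This is only a minimal illustration using the Python standard library, not CLTK's or NLTK's implementation: the `VARIANTS` table, `tokenize`, and `normalize_tokens` are hypothetical names, and the spelling entries are made up for the example.

```python
import re

# Hypothetical lookup table: attested spelling -> chosen canonical form.
# The entries are illustrative only and are not taken from CLTK.
VARIANTS = {
    "kynge": "kyng",
    "kinge": "kyng",
    "king": "kyng",
}


def tokenize(text):
    """Lowercase the text and return runs of letters as word tokens.

    The pattern [^\\W\\d_]+ matches Unicode letters only, so characters
    such as thorn and yogh stay inside a token.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def normalize_tokens(tokens, variants=VARIANTS):
    """Replace each token with its canonical form when one is listed."""
    return [variants.get(token, token) for token in tokens]


if __name__ == "__main__":
    tokens = tokenize("kynge kyng")
    print(normalize_tokens(tokens))
```

A hand-built table like this only covers spellings someone has listed; a real ME corpus would likely need rules or a lemmatizer on top of it.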
Srinivas
@srinivasmachiraju
I want to implement a couple of features that NLTK has for English but CLTK does not yet possess. I'd like to do this as a kind of research work. Is that allowed?
Please let me know. Thank you.
Amr Mohamed
@AMR-KELEG
Hi all,
I am working on porting the Khoja Stemmer to Python, and I am wondering whether there is any advantage to writing Unicode characters as \u064a instead of ي in the source code.
Thanks :smile:
Chatziargyriou Eleftheria
@Sedictious
Hi @AMR-KELEG! It is generally preferred to keep things as unambiguous as possible. Take, for example, the Cyrillic alphabet, which contains characters that look identical to their Latin counterparts but have completely different code points.
Amr Mohamed
@AMR-KELEG
@Sedictious Interesting, I see.
Thanks :smile:
Todd Cook
@todd-cook
The ideal format is the Unicode escape value with the visual representation placed in an inline documentation-style comment, such as:
'\u064a' #: ي
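To make the point about escapes and look-alike characters concrete, here is a small sketch using only Python's standard `unicodedata` module. The variable names are just for illustration; the escape and the literal denote the same string once parsed, so the choice only affects how readable and unambiguous the source file is.

```python
import unicodedata

# Escape form with the glyph in a trailing comment, as suggested above.
yeh_escaped = "\u064a"  #: ي
# Literal form of the same character.
yeh_literal = "ي"

# Both spellings produce the same one-character string at runtime.
same_char = yeh_escaped == yeh_literal
print(same_char)
print(unicodedata.name(yeh_escaped))

# Look-alike characters: Latin "a" and Cyrillic "а" render almost
# identically but are different code points, so they never compare equal.
latin_a = "\u0061"     #: a
cyrillic_a = "\u0430"  #: а
print(unicodedata.name(latin_a))
print(unicodedata.name(cyrillic_a))
print(latin_a == cyrillic_a)
```

With escapes, a reviewer can tell the two "a" characters apart at a glance, which is hard to do with the literals alone.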
HarshvardhanJha1
@HarshvardhanJha1
Hi everyone!
I'm a second-year undergraduate in Computer Science who's interested in contributing to the Classical Language Toolkit for GSoC 2020. I'd like to know how I can get started and get familiar with the projects.
Thank you,
Harshvardhan
William Michael Short
@wmshort
Incidentally, how far advanced is the Anglo-Saxon stuff? I feel like the dialect issue would be a challenge.
Sunil Kumar
@DragonPG
Hi, I am looking to expand the Tamil corpus for CLTK. How do I get started?
Rutvik Trivedi
@Rutvik-Trivedi
@kylepjohnson Sir, is CLTK planning to apply for GSoC this year?
Kyle P. Johnson
@kylepjohnson
@Rutvik-Trivedi we don't know yet. Maybe