for the data, how much preparation will be required. Concerning this last point, please remember that GSoC is about code, not data cleaning. We cannot accept an otherwise brilliant application that also requires six weeks of data annotation or cleanup. If you believe you can do your data prep during the application period or the Community Bonding period, please explain that.
Hey everyone! I'm Deepak Divya Tejaswi, a CSE + Economics dual-major student at BITS Pilani Goa. I stumbled upon this project while going through the GSoC projects and was fascinated by its idea and potential impact. I am looking forward to contributing positively to the community.
@kylepjohnson I am interested in contributing by adding support for Sanskrit. I read the latest blog article and came up with brief answers to its questions; I hope I have answered some of them in the following doc, which can be viewed here.
I have also included some working links to Sanskrit text datasets in the first answer.
@kylepjohnson I went through it, yes. The converter to canonical form seems to be the closest thing to what I'm trying to do. However, it still doesn't standardise some spelling variants. Here's what I got when trying to tokenise two forms of the same word:
In : normalize_middle_english("kynge kyng", alpha_conv=True)
Out: 'kynge kyng'
It might not be a common issue, but is there a way to fix this?
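In the meantime, one workaround I can think of is post-processing the output with a hand-built table of known variants. A rough sketch (the table entries are hypothetical, just to show the idea; a real table would have to be built from the corpora themselves):

# Collapse known spelling variants to one canonical form.
VARIANTS = {
    "kynge": "kyng",
    "kinge": "kyng",
}

def collapse_variants(text):
    """Replace each whitespace-separated token with its canonical spelling."""
    return " ".join(VARIANTS.get(tok, tok) for tok in text.split())

print(collapse_variants("kynge kyng"))  # -> 'kyng kyng'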
Err, apologies for the misnomer. What I meant was tokenisation followed by normalisation, i.e. collapsing multiple spellings/forms of the same word into a single token. I figured normalisation would be fairly useful for large ME corpora, especially when working across multiple corpora.
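Concretely, something like this rough sketch (the tokeniser is deliberately crude, and the variant table is again just a made-up example):

import re

# Hypothetical variant table, for illustration only.
CANONICAL = {"kynge": "kyng", "kinge": "kyng"}

def tokenise(text):
    """Very rough word tokeniser: pull out runs of word characters."""
    return re.findall(r"\w+", text.lower())

def normalise(tokens):
    """Map spelling variants of the same word onto a single token."""
    return [CANONICAL.get(tok, tok) for tok in tokens]

print(normalise(tokenise("Kynge and kyng.")))
# -> ['kyng', 'and', 'kyng']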
And Punkt is for sentence tokenisation, is it not? How would it work for something like normalisation of spellings?