These are chat archives for cltk/cltk

31st
Dec 2018
Kyle P. Johnson
@kylepjohnson
Dec 31 2018 05:51

@/all The mentors have slowly begun to get ready for GSoC 2019. We appreciate all the excitement and eagerness to contribute, coming from you all.

Before doing anything else, please read this blog post which we just published. Its purpose is to help you (a potential GSoC student) learn from the mistakes (and successes) of previous project applications, so as to improve your own application.

SeenivasanSeeni
@Seenivasanseeni
Dec 31 2018 07:25
@kylepjohnson Can school textbooks and newspaper articles be used even though they are still owned by others?
SeenivasanSeeni
@Seenivasanseeni
Dec 31 2018 08:06
@Erikishiru What should be the first feature to be included? I think it is the corpus, right?
Piyush Yadav
@Erikishiru
Dec 31 2018 08:09
yeah
we can start with Sangam literature https://en.wikipedia.org/wiki/Sangam_literature
SeenivasanSeeni
@Seenivasanseeni
Dec 31 2018 08:11
Is there a digitized edition of it?
Piyush Yadav
@Erikishiru
Dec 31 2018 08:12
will have to search for that
SeenivasanSeeni
@Seenivasanseeni
Dec 31 2018 08:12
Those literary works are completely different from current speech. They have a lot of complex word joins.
Piyush Yadav
@Erikishiru
Dec 31 2018 08:13
I will go to the Department of History - Delhi University, to meet a Prof.
I think they will be able to help us with some resources
SeenivasanSeeni
@Seenivasanseeni
Dec 31 2018 08:13
yeah.
Are newspapers and school textbooks fine?
Piyush Yadav
@Erikishiru
Dec 31 2018 08:15
I don't think so. They will mostly be licensed. We are looking for open source resources.
SeenivasanSeeni
@Seenivasanseeni
Dec 31 2018 08:16
Okay. If you consider literary works, the typical grammar is like that of the Thirukkural, which is quite heavy and is not even used in real life. Using that literature will not give any considerable results when applied to real-world NLP.
And one more thing: we only need a small amount of text for the corpus, right? Up to 10 MB?
Piyush Yadav
@Erikishiru
Dec 31 2018 08:20
If we get resources, we can call in expert linguists. The point of studying texts that are not in use today is that they help us study the collective history of this world. Tamil is the oldest (even preceding Sanskrit, according to some texts), so it will be a good language to incorporate.
SeenivasanSeeni
@Seenivasanseeni
Dec 31 2018 08:21
Okay.
SeenivasanSeeni
@Seenivasanseeni
Dec 31 2018 08:28
We can use the Tamil version of Wikipedia. Wikipedia's content is under a Creative Commons license, so it won't cause any problems, right?
@Erikishiru What do you say about it?
Piyush Yadav
@Erikishiru
Dec 31 2018 08:32
Yeah, we can do that.
SeenivasanSeeni
@Seenivasanseeni
Dec 31 2018 08:32
How do I create a repo in cltk for this corpus?
SeenivasanSeeni
@Seenivasanseeni
Dec 31 2018 08:43
Great.
What should the file format be?
It's best to use .txt.
@Erikishiru There are many formats being used across the various corpora. We need to define a format.
Piyush Yadav
@Erikishiru
Dec 31 2018 08:46
.txt and .json are being used.
Look at the Greek corpus for further reference.
SeenivasanSeeni
@Seenivasanseeni
Dec 31 2018 08:47
For now, I am going to use .txt, and later we can do it in JSON.
@Erikishiru The text needs to be free of numerals and non-Tamil characters, right?
Piyush Yadav
@Erikishiru
Dec 31 2018 08:49
Tamil numerals can be included.
SeenivasanSeeni
@Seenivasanseeni
Dec 31 2018 08:50
There were such numerals. They are not used now and so won't be useful.
It is tough to spot them in literature.
Piyush Yadav
@Erikishiru
Dec 31 2018 08:52
It's not about what is being used now. This is the Classical Language Toolkit.
SeenivasanSeeni
@Seenivasanseeni
Dec 31 2018 08:52
Okay.
What about (extra) spaces and newlines?
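
The cleaning rules discussed above (keep Tamil text including Tamil numerals, drop non-Tamil characters, tidy extra spaces and newlines) could be sketched roughly as follows. This is only a minimal illustration, not an agreed CLTK convention: the function name `clean_tamil_text` is hypothetical, and the choices to keep basic punctuation and to allow at most one blank line between paragraphs are assumptions.

```python
import re

# Assumption: keep the Tamil Unicode block (U+0B80-U+0BFF, which contains the
# letters, the Tamil digits, and the Tamil number signs), whitespace, and a
# little basic punctuation. Everything else is dropped.
DISALLOWED = re.compile(r"[^\u0B80-\u0BFF\s.,;:!?()\-]")


def clean_tamil_text(raw):
    """Hypothetical helper: strip non-Tamil characters and extra whitespace."""
    kept = DISALLOWED.sub("", raw)
    # Collapse runs of whitespace inside each line and trim line ends.
    lines = [" ".join(line.split()) for line in kept.splitlines()]
    text = "\n".join(lines)
    # Allow at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

A stricter or looser character whitelist may suit a particular corpus better; the pattern is the only part that would need to change.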
SeenivasanSeeni
@Seenivasanseeni
Dec 31 2018 09:03
@Erikishiru Take a look at this: https://github.com/Seenivasanseeni/cltk_tamil_corpus. Is this repo fine? It is small for now. I will add new articles gradually.
@Erikishiru After we get copies of the literature, we can add them too.
SeenivasanSeeni
@Seenivasanseeni
Dec 31 2018 09:15
@Erikishiru There is no _check_corpus_availability(), and no language is added directly in the list_corpora method.
I tested importing the corpus and it worked. It downloaded the contents to "~/cltk_data".
SeenivasanSeeni
@Seenivasanseeni
Dec 31 2018 09:28
@Erikishiru Should I merge the branches now or later?
Piyush Yadav
@Erikishiru
Dec 31 2018 09:54
merge as you go
SeenivasanSeeni
@Seenivasanseeni
Dec 31 2018 10:01
@Erikishiru Take a look at this: cltk/cltk#848
and also cltk/cltk#847 to transfer ownership.
Ghost
@ghost~5bd5e42dd73408ce4fad0b93
Dec 31 2018 11:32
@Seenivasanseeni Is there any interpretable way to incorporate numbers in Classical Tamil? They are available only in Kalvettu afaik.
SeenivasanSeeni
@Seenivasanseeni
Dec 31 2018 11:44
@SunilKu12355774_twitter I don't understand. Can you explain ?
Ghost
@ghost~5bd5e42dd73408ce4fad0b93
Dec 31 2018 11:50
I am just confused because ancient Tamil numerals are not available in text format.
How can they be scraped?
SeenivasanSeeni
@Seenivasanseeni
Dec 31 2018 12:16
@SunilKu12355774_twitter You are correct. We will try to find them, but there is no guarantee.
Piyush Yadav
@Erikishiru
Dec 31 2018 12:51
They are available in Unicode (UTF-8 encoded), and we can use this for conversions: https://pypi.org/project/Open-Tamil/
Ghost
@ghost~5bd5e42dd73408ce4fad0b93
Dec 31 2018 13:04
@Erikishiru does it also include ancient tamil numbers?
Piyush Yadav
@Erikishiru
Dec 31 2018 13:27
Ghost
@ghost~5bd5e42dd73408ce4fad0b93
Dec 31 2018 13:52
Yep just saw it.
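
As a rough illustration of the point about Unicode: the Tamil digits sit at U+0BE6 to U+0BEF, and the traditional signs for ten, one hundred, and one thousand at U+0BF0 to U+0BF2, so they can be read as ordinary text. The sketch below uses only the Python standard library rather than Open-Tamil (whose API is not shown in this chat), and the function name `traditional_tamil_to_int` is hypothetical. It assumes the simple multiplicative-additive notation and only covers values below 10,000.

```python
import unicodedata

# Traditional Tamil number signs: ten, one hundred, one thousand.
TAMIL_MULTIPLIERS = {"\u0BF0": 10, "\u0BF1": 100, "\u0BF2": 1000}


def traditional_tamil_to_int(text):
    """Hypothetical helper: read a digit-times-sign numeral such as 2x1000 4x100 5x10 3."""
    total = 0
    pending = None  # digit waiting for a multiplier sign
    for ch in text:
        if ch in TAMIL_MULTIPLIERS:
            # A bare sign counts once, e.g. the sign for ten alone is 10.
            factor = pending if pending is not None else 1
            total += factor * TAMIL_MULTIPLIERS[ch]
            pending = None
        elif "\u0BE6" <= ch <= "\u0BEF":
            if pending is not None:
                raise ValueError("two digits in a row; looks like positional notation")
            pending = unicodedata.digit(ch)
        else:
            raise ValueError(f"not a Tamil numeral character: {ch!r}")
    return total + (pending or 0)


# Positional (modern-style) digit strings need no helper: Python's int()
# accepts any Unicode decimal digits.
modern_value = int("\u0BE7\u0BE8\u0BE9")
```

Inscriptional forms that have no Unicode code point would still need manual transcription before anything like this applies.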
SeenivasanSeeni
@Seenivasanseeni
Dec 31 2018 15:11
@Erikishiru Can you review #848?
SeenivasanSeeni
@Seenivasanseeni
Dec 31 2018 17:48
We can also refer to this: https://github.com/AshokR/TamilNLP
Piyush Yadav
@Erikishiru
Dec 31 2018 20:23
https://github.com/AshokR/TamilNLP is licensed; read the README file.