These are chat archives for cltk/cltk

17th
Feb 2018
Jiaxin_Bai
@marcos0318
Feb 17 2018 09:31
I have used the Chinese corpus
That corpus is not really good material for learning ancient Chinese or for doing research on it
Jiaxin_Bai
@marcos0318
Feb 17 2018 09:36
First, because of its contents: it is not representative of ancient Chinese. The contents are all Buddhist scriptures, which were mostly used in temples, not by the public
They are hard to understand and very abstract, so they are only useful for research on Buddhism in Chinese
Jiaxin_Bai
@marcos0318
Feb 17 2018 09:44
Second, there is no punctuation. Although the original copies of ancient Chinese texts do not have punctuation, it is very important for a corpus to have it. Ancient text without punctuation is hard to read and cannot be understood directly, and adding punctuation is also an important topic and skill taught in high school in China. We also need the punctuation as labels when we treat the text as training data for machine learning.
I have found a domain that has better materials, and I am going to add them to the corpus over the next few days
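To make the "punctuation as labels" point concrete, here is a minimal sketch of how a punctuated text could be turned into training pairs. The function name `punctuation_labels`, the `PUNCTUATION` set, and the `"O"` (no punctuation) tag are all assumptions made for illustration; none of this is part of CLTK.

```python
# Hypothetical helper, not a CLTK API: turn punctuated text into
# (character, label) training data for a punctuation-restoration model.

# Assumed set of punctuation marks to treat as labels.
PUNCTUATION = set("，。、；：？！")


def punctuation_labels(text):
    """Return (chars, labels) where labels[i] is the punctuation mark
    that follows chars[i], or "O" if no mark follows it."""
    chars, labels = [], []
    for ch in text:
        if ch in PUNCTUATION:
            # Attach the mark to the preceding character, if any.
            # With consecutive marks, the last one wins.
            if labels:
                labels[-1] = ch
        else:
            chars.append(ch)
            labels.append("O")
    return chars, labels


chars, labels = punctuation_labels("学而时习之，不亦说乎？")
for ch, label in zip(chars, labels):
    print(ch, label)
```

The unpunctuated characters become the model input and the labels become the targets, which is why a corpus that already has punctuation is needed.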