I don't understand the structure of the Pali texts very well yet.
Right now I just used the first `p` and the first `h` to generate the original title, but parsed all strings from the body element as their own nodes:
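For anyone curious, the extraction step I'm describing is roughly this. A minimal sketch with stdlib `ElementTree`; the tag names and the sample XML below are illustrative stand-ins, not the actual Pali source files:

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for a TEI-like source file; the real
# Pali texts are more complex (and namespaced).
sample = """
<TEI>
  <teiHeader><fileDesc><titleStmt>
    <p>Some prefatory paragraph</p>
  </titleStmt></fileDesc></teiHeader>
  <text>
    <body>
      <head>Dhammapada</head>
      <p>yamakavaggo</p>
      <p>manopubbangama dhamma</p>
    </body>
  </text>
</TEI>
"""

root = ET.fromstring(sample)

# The first <head> and first <p> in the tree make up the title...
first_p = root.find(".//p")
first_head = root.find(".//head")
title = f"{first_head.text} {first_p.text}"

# ...while every non-empty string inside <body> becomes its own node.
body = root.find(".//body")
nodes = [el.text.strip() for el in body.iter() if el.text and el.text.strip()]
```

The real conversion obviously has to cope with texts where the title elements are missing or nested differently, which is where the Pali structure gets confusing.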
After the hackathon and some cleanup during the past week, we now have 7347 cltk_json compliant texts in these languages: ['bengali', 'chinese', 'english', 'greek', 'hebrew', 'hindi', 'javanese', 'latin', 'middle_english', 'old_english', 'pali', 'punjabi', 'sanskrit']
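For reference, a cltk_json document looks roughly like this. The field names below are from memory, so treat them as an approximation rather than the authoritative schema:

```python
import json

# Approximate shape of a cltk_json document -- field names here
# are from memory and may differ from the repo's actual schema.
doc = {
    "author": "Unknown",
    "englishTitle": "Dhammapada",
    "originalTitle": "Dhammapada",
    "language": "pali",
    "text": {
        # Nested chapter -> line mapping; nesting depth varies by text.
        "1": {
            "1": "manopubbangama dhamma ...",
            "2": "...",
        },
    },
}

# The converters emit one such JSON file per text.
serialized = json.dumps(doc, ensure_ascii=False)
```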
Ingesting the texts into the mongo database now, and will adapt the updated version of the API to look for text corpora with the cltk_json dir in them… not sure how we want to merge the capitains_text_corpora with the cltk_json dirs… Hrm…
Right, I just manually reviewed a handful of the capitains_text_corpora json files that looked good and added them to the database.
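The ingestion itself is just a loop over the JSON files; a sketch of the general idea (the directory and collection names are hypothetical, and the pymongo part is commented out so the loader runs standalone):

```python
import json
from pathlib import Path

def load_cltk_json(corpus_dir):
    """Yield one parsed document per *.json file under corpus_dir."""
    for path in sorted(Path(corpus_dir).rglob("*.json")):
        with open(path, encoding="utf-8") as f:
            yield json.load(f)

# With a running mongod, the rest is roughly:
#
#   from pymongo import MongoClient
#   collection = MongoClient()["cltk"]["texts"]  # hypothetical db/collection names
#   collection.insert_many(load_cltk_json("cltk_json/"))
```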
@lukehollis @kylepjohnson Hello Mentors, I would like to contribute to the CLTK Archive project and to building cltk_api with the Flask framework. Please guide me on the next steps.
Hello all, if you’re interested in contributing to the cltk_frontend application, I have posted an updated copy of the full DB and updated the readme on the repo:
This dump, when restored, produces a database of about 7.95 GB in mongo.
Some had been asking for which text corpora still needed to be converted, so I created an issue here: cltk/cltk_api#43 If you find any errors, please let me know.
Looks like in general, we’re a little under halfway through
Hello, I was wondering what the (production) status of the CLTK API(v2?) is, and whether it would be possible to work on it as part of Google Summer of Code? I was originally looking into implementing some NLP for Classical Chinese but, since CLTK core doesn't provide a straightforward way to access the three existing corpora, it wasn't clear to me whether I should use something like Capitains and work with the raw TEI XML (see cltk/cltk#560) or use the converted JSON+API instead. Since this issue seems to be cropping up across several Github issues, could working on better documentation about how to read corpora also be a project in its own right (cltk/cltk#615)?
Hello, are there any Arabic mentors in CLTK this year? Thanks in advance.
Kyle P. Johnson
@kevinstadler This is a great question. Thanks for digging into the issue. I'll venture a short answer here to get things started…
1) I wrote v2 of the API, which has two basic functions: serving texts and doing NLP processing. However, @lukehollis leads the web projects and can answer exactly how the project will leverage it in the near future.
2) Luke can speak to the workflow of JSON and TEI-formatted texts.
3) About writing a better reader, we have had lots of thoughts… but not many decisions. @diyclassics has done some work, within the core Python project, to create a reader. However, nothing we'd quite call official yet.
4) So you are correct in seeing this as a weak spot throughout the CLTK; however, I believe we have some OK ad hoc solutions. But this topic falls as much into the frontend as the back, so I think @lukehollis should give his full opinion too, about whether this is a priority for GSoC '18.
@aboelhamd Yes, we are very happy to say that we do have an Arabic mentor this year.
Thank you, Kyle. So what should I do next? Contact this mentor?
Kyle P. Johnson
@aboelhamd good question. Please do the first two steps first, then I will introduce you to the Arabic mentor:
@aboelhamd I will actually move this conversation to the channel for the python project
@kylepjohnson thanks for the heads-up! I've already been in contact with @diyclassics about trying the new stop word module out on the existing Chinese corpora, but he said he wasn't sure who'd be suitable for mentoring a Classical Chinese project. Either way I'd probably write up some documentation just for myself about how to effectively work with TEI which could then also go into the main docs, but for this it would be good to know which is the "way to go" for corpus access. There might not be much point in creating documentation for one of the ad hoc solutions when there is already a new dedicated system for it in the pipeline, which is what I'm not clear about.
At the moment I'm aiming to write a proposal that combines improving TEI corpus access documentation with some Chinese NLP tools, but depending on what the project leads think is of higher priority I'd be happy to push the documentation component even more, to also cover polishing up the cltk/tutorials for inclusion in the main docs for example. So I'd be interested to hear what people think is more relevant at this time!
@kylepjohnson @lukehollis I've had another go at reading the Chinese corpora with MyCapytain, but no luck. I've consequently added implementing a dedicated TEIXMLReader for CLTK to my proposal, which you should be able to access and comment on via the GSoC website. I'd be grateful for any feedback on the scope and level of detail of the current proposal; I'd be happy to add more technical details regarding the implementation by Tuesday if you feel that's necessary!
We ingest this JSON into a postgres database with something like this: https://github.com/CtrHellenicStudies/TextServer/tree/master/src/modules/cts/texts/json Depending on the goals of the organization (@kylepjohnson @pletcher @suheb @jtauber), we could continue to do something like this in the future. It sounds like, especially with @jtauber and Eldarion, there may be a more robust way of offering this text with metadata in the future?
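The ingestion linked above essentially boils down to flattening the nested text object into reference/content rows. A rough sketch of that flattening, using `sqlite3` here purely so the snippet is self-contained (the TextServer itself targets postgres, and the table name below is hypothetical):

```python
import sqlite3

def flatten(text, prefix=()):
    """Walk a nested {chapter: {line: str}} dict into (ref, contents) rows."""
    for key, value in text.items():
        if isinstance(value, dict):
            yield from flatten(value, prefix + (key,))
        else:
            yield (".".join(prefix + (key,)), value)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE text_nodes (ref TEXT, contents TEXT)")

# Two sample lines with dotted CTS-style references like "1.1".
rows = flatten({"1": {"1": "arma virumque cano", "2": "Troiae qui primus"}})
conn.executemany("INSERT INTO text_nodes VALUES (?, ?)", rows)
```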
Can I ask how to test the app from CLTK? I'm new to programming and I want to see how it works. Thanks!
Is the project still alive?
Kyle P. Johnson
@Seenivasanseeni this project is on hold at the moment, thanks for asking