    Luke Hollis
    @lukehollis
    the sefaria repo had a json dir in the root, so i was using converted_json
    Kyle P. Johnson
    @kylepjohnson
    So in https://github.com/cltk/bengali_text_wikisource -- just put your new files in under json
    For Sefaria, what was in the json dir?
    That might have been an earlier attempt at doing what you are doing now
    Luke Hollis
    @lukehollis
    that’s from Josh with the sefaria export
    they’re pretty close, but the text is nested lists instead of dict
    Kyle P. Johnson
    @kylepjohnson
    Ah yes, the jagged array …
    Well, how about we decide upon our own unique namespace for that top-level dir? Like cltk_json or sim.?
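    For illustration, here is a minimal sketch of flattening a Sefaria-style jagged array (nested lists) into a citation-keyed dict, which is roughly the shape the cltk_json format wants. The helper and the sample data are hypothetical, not taken from the actual converter code:

    ```python
    # Flatten Sefaria-style nested lists (a "jagged array") into a dict
    # keyed by dotted citation paths, e.g. {"1.1": "...", "1.2": "..."}.
    # Illustrative sketch only.

    def flatten_jagged(text, path=()):
        """Recursively walk nested lists, yielding (citation, passage) pairs."""
        if isinstance(text, str):
            yield ".".join(str(i + 1) for i in path), text
        else:
            for i, child in enumerate(text):
                yield from flatten_jagged(child, path + (i,))

    jagged = [["In the beginning...", "And the earth..."], ["Second chapter..."]]
    flat = dict(flatten_jagged(jagged))
    # flat == {"1.1": "In the beginning...", "1.2": "And the earth...",
    #          "2.1": "Second chapter..."}
    ```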
    Luke Hollis
    @lukehollis
    cltk_json sounds good to me!
    Kyle P. Johnson
    @kylepjohnson
    cool :)
    this is kinda exciting, actually. I'll reach out to you, Luke, about how to document this formally in the cltk whitepaper i have finally committed to writing
    Luke Hollis
    @lukehollis
    awesome! looking forward to hearing more about it
    ok to push the cltk_json to repos directly or would you rather check out the prs?
    Kyle P. Johnson
    @kylepjohnson
    nah, just go ahead to push away!
    Luke Hollis
    @lukehollis
    :thumbsup:
    they have these codex hackathons out your way in sf too :) highly recommended!
    Kyle P. Johnson
    @kylepjohnson
    really? Let's follow up about this, I want to go
    Luke Hollis
    @lukehollis
    yeah! first one was at github hq, i think. hrmm.. trying to find more info about it. here’s the website http://codexhackathon.com/
    Luke Hollis
    @lukehollis
    some really ugly code to convert the poeti d’italia in lingua latina: https://github.com/cltk/latin_text_poeti_ditalia/blob/master/converter.py#L35
    Luke Hollis
    @lukehollis
    i don’t understand the structure of the pali texts very well
    right now i just used the first p and the first h to generate the original title, but parsed all strings from the body elem as their own node:
    [Screenshot attached: Screen Shot 2017-02-12 at 15.12.02.png]
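    The approach described above can be sketched roughly as follows. The XML snippet is a made-up, simplified TEI-like structure (the real Pali files are more complex); the idea is: take the first head/p for the title, then treat each text string under body as its own node:

    ```python
    # Sketch: title from the first <head> (falling back to the first <p>),
    # then every text string under <body> becomes its own node.
    import xml.etree.ElementTree as ET

    xml = """<TEI><text><body>
      <head>Dhammapada</head>
      <p>First passage.</p>
      <p>Second passage.</p>
    </body></text></TEI>"""

    root = ET.fromstring(xml)
    body = root.find(".//body")
    head = body.find("head")
    title = (head if head is not None else body.find("p")).text
    nodes = [t.strip() for t in body.itertext() if t.strip()]
    # title == "Dhammapada"; nodes holds each string as its own node
    ```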
    Luke Hollis
    @lukehollis
    After the hackathon and some cleanup during the past week, we now have 7,347 cltk_json-compliant texts in these languages: ['bengali', 'chinese', 'english', 'greek', 'hebrew', 'hindi', 'javanese', 'latin', 'middle_english', 'old_english', 'pali', 'punjabi', 'sanskrit']
    Luke Hollis
    @lukehollis
    Ingesting the texts into the mongo database now, and will adapt the updated version of the API to look for text corpora with the cltk_json dir in them… not sure how we want to merge the capitains_text_corpora with the cltk_json dirs… Hrm…
    Right, I just manually reviewed a handful of the capitains_text_corpora json files that looked good and added them to the database.
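    A rough sketch of that ingestion step, with hypothetical paths and collection names (the real pipeline's details aren't shown here): walk the corpus repos, load every JSON file found under a cltk_json/ dir, and hand the documents to mongo:

    ```python
    # Sketch: collect cltk_json documents from corpus repos for ingestion.
    # Directory layout and collection names are illustrative assumptions.
    import json
    import os

    def collect_cltk_json(repos_root):
        """Walk corpus repos and load every JSON file under a cltk_json/ dir."""
        docs = []
        for dirpath, _dirnames, filenames in os.walk(repos_root):
            if os.path.basename(dirpath) != "cltk_json":
                continue
            for name in sorted(filenames):
                if name.endswith(".json"):
                    with open(os.path.join(dirpath, name), encoding="utf-8") as f:
                        docs.append(json.load(f))
        return docs

    # Ingest step (requires a running mongod; shown for illustration only):
    # from pymongo import MongoClient
    # MongoClient().cltk.texts.insert_many(collect_cltk_json("corpora/"))
    ```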
    Aahan Bhatt
    @Aahanbhatt
    @lukehollis @kylepjohnson Hello mentors, I would like to contribute to the CLTK Archive project and to building the cltk_api with the Flask framework. Please guide me on the next steps.
    Luke Hollis
    @lukehollis
    Hello all, if you’re interested in contributing to the cltk_frontend application, I have posted an updated copy of the full DB and updated the readme on the repo:
    Here’s the link to the new DB:
    This dump, when restored, generates a 7.95 GB database in mongo
    Luke Hollis
    @lukehollis
    Some people had been asking which text corpora still needed to be converted, so I created an issue here: cltk/cltk_api#43 If you find any errors, please let me know.
    Looks like in general, we’re a little under halfway through
    Kevin Stadler
    @kevinstadler
    Hello, I was wondering what the (production) status of the CLTK API(v2?) is, and whether it would be possible to work on it as part of Google Summer of Code? I was originally looking into implementing some NLP for Classical Chinese but, since CLTK core doesn't provide a straightforward way to access the three existing corpora, it wasn't clear to me whether I should use something like Capitains and work with the raw TEI XML (see cltk/cltk#560) or use the converted JSON+API instead. Since this issue seems to be cropping up across several Github issues, could working on better documentation about how to read corpora also be a project in its own right (cltk/cltk#615)?
    aboelhamd
    @aboelhamd
    Hello, are there any Arabic mentors in CLTK this year?
    Thanks in advance.
    Kyle P. Johnson
    @kylepjohnson
    @kevinstadler This is a great question. Thanks for digging into the issue. I'll venture a short answer here to get things started …
    1) I wrote the v2 of the API, which has 2 basic functions: serving texts and doing NLP processing. However @lukehollis leads the web projects and can answer exactly how the project will leverage it in the near future.
    2) Luke can speak to the workflow of JSON and TEI-formatted texts.
    3) About writing a better reader, we have had lots of thoughts … but not many decisions. @diyclassics has done some work, within the core python project, to create a reader. However nothing we'd quite call official yet.
    4) So you are correct in seeing this as a weak spot throughout the CLTK, however I believe we have some OK ad hoc solutions. But this topic falls as much into the frontend as the backend, so I think @lukehollis should give his full opinion too, about whether this is a priority for GSoC '18.
    @aboelhamd Yes, we are very happy to say that we do have an Arabic mentor this year.
    aboelhamd
    @aboelhamd
    Thank you, Kyle. So what do I do next? Contact this mentor?
    Kyle P. Johnson
    @kylepjohnson
    @aboelhamd good question. Please do the first two steps first, then I will introduce you to the Arabic mentor:
    Kyle P. Johnson
    @kylepjohnson
    1) Do the Beginners' exercises, with Classical Arabic: https://github.com/cltk/cltk/wiki/Beginners'-exercises
    2) Write a draft proposal according to our GSoC proposal template: https://github.com/cltk/cltk/wiki/GSoC-proposal-template . For Arabic, you'll want to focus on what NLP processing you will be able to add for Classical Arabic (things like word tokenization, POS tagging, etc.). If training data is required, it is critical that you explain what free data you will use.
    @aboelhamd I will actually move this conversation to the channel for the python project
    Kevin Stadler
    @kevinstadler
    @kylepjohnson thanks for the heads-up! I've already been in contact with @diyclassics about trying the new stop word module out on the existing Chinese corpora, but he said he wasn't sure who'd be suitable for mentoring a Classical Chinese project. Either way I'd probably write up some documentation just for myself about how to effectively work with TEI which could then also go into the main docs, but for this it would be good to know which is the "way to go" for corpus access. There might not be much point in creating documentation for one of the ad hoc solutions when there is already a new dedicated system for it in the pipeline, which is what I'm not clear about.
    At the moment I'm aiming to write a proposal that combines improving TEI corpus access documentation with some Chinese NLP tools, but depending on what the project leads think is of higher priority I'd be happy to push the documentation component even more, to also cover polishing up the cltk/tutorials for inclusion in the main docs for example. So I'd be interested to hear what people think is more relevant at this time!
    Kevin Stadler
    @kevinstadler
    @kylepjohnson @lukehollis I've had another go at reading the Chinese corpora with MyCapytain, but no luck. I've consequently added implementing a dedicated TEIXMLReader to CLTK to my proposal, which you should be able to access and comment on via the GSoC website. I'd be grateful for any feedback that you have on the scope and level of detail of the current proposal, I'd be happy to add more technical details regarding the implementation by Tuesday if you feel that it's necessary!
    Luke Hollis
    @lukehollis
    Hi @kevinstadler, thank you for this contribution, and I think we need as a community to make a better decision about the textserver and options there. We have a minimally-scoped cltk_json data format described here: https://github.com/cltk/cltk_api/wiki/JSON-data-format-specifications
    We ingest this json to a postgres database with something like this: https://github.com/CtrHellenicStudies/TextServer/tree/master/src/modules/cts/texts/json . Depending on the goals of the organization (@kylepjohnson @pletcher @suheb @jtauber), we could continue to do something like this in the future. It sounds like, especially with @jtauber and Eldarion, there may be a more robust version of offering this text with metadata in the future?
    JLJJLKJ
    @Panji12miror
    can I ask something: how do I test the app from cltk? I am new to programming and I want to test how it works
    thanks
    SeenivasanSeeni
    @Seenivasanseeni
    Is the project still alive?
    Kyle P. Johnson
    @kylepjohnson
    @Seenivasanseeni this project is on hold at the moment, thanks for asking