These are chat archives for cltk/cltk_api

11th
Feb 2017
Luke Hollis
@lukehollis
Feb 11 2017 19:30
Sent over a PR to the bengali_text_wikisource corpus with a converter.py file and a converted_json dir of the converted json files. Not married to this approach, but this is how I’ve been getting texts into the frontend so far, and it seems that for the API we need to add the converted files to repos somewhere so they can be downloaded with the Corpus importer. Lmk how best to integrate the converted json files.
Here’s the link to the PR: cltk/bengali_text_wikisource#1
Kyle P. Johnson
@kylepjohnson
Feb 11 2017 20:30
hey Luke! I didn't get an alert about the PR. It looks good to me. I added a few comments to your questions (nothing definitive, though)
Luke Hollis
@lukehollis
Feb 11 2017 20:31
sweet, thanks! got another for sefaria hebrew texts
Kyle P. Johnson
@kylepjohnson
Feb 11 2017 20:31
so cool!
Luke Hollis
@lukehollis
Feb 11 2017 20:32
having separate repos for the converted corpora sounds good to me!
or i suppose we could have all the converted files in one repo
that might make the api easier. the dir names in the single repo could be the corpora that are converted
Kyle P. Johnson
@kylepjohnson
Feb 11 2017 20:34
I would prefer separate files, that is, the json files right next to their "original" formats
This is because we will have lang experts assigned to particular repos (Bengali experts in charge of everything bengali_ and Latinists to latin_)
Luke Hollis
@lukehollis
Feb 11 2017 20:35
that sounds good to me—i’ll create separate repos like bengali_text_wikisource__converted?
Kyle P. Johnson
@kylepjohnson
Feb 11 2017 20:36
If we put everything into one repo, then there could be a drift between source text and our downstream variant
> i’ll create separate repos like bengali_text_wikisource__converted
No, I don't think that's necessary either
instead, just put a json dir in the root of the repo
Luke Hollis
@lukehollis
Feb 11 2017 20:36
oh okay that sounds perfect!
Kyle P. Johnson
@kylepjohnson
Feb 11 2017 20:37
See what i'm sayin?
Luke Hollis
@lukehollis
Feb 11 2017 20:37
the sefaria repo had a json dir in the root, so i was using converted_json
Kyle P. Johnson
@kylepjohnson
Feb 11 2017 20:37
So in https://github.com/cltk/bengali_text_wikisource -- just put your new files in under json
For Sefaria, what was in the json dir?
That might have been an earlier attempt at doing what you are doing now
Luke Hollis
@lukehollis
Feb 11 2017 20:39
that’s from Josh with the sefaria export
they’re pretty close, but the text is nested lists instead of dict
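To illustrate the difference between the two shapes, here is a minimal sketch of turning nested lists into nested dicts. It is not the converter.py from the PR: the function name `nested_lists_to_dict`, the choice of 0-based string keys, and the placeholder strings are all assumptions made for illustration, and the real converted files may use a different schema.

```python
def nested_lists_to_dict(node):
    """Recursively convert nested lists into dicts keyed by position.

    Assumption: keys are 0-based indices stored as strings, since JSON
    object keys must be strings. Leaves (anything that is not a list)
    are returned unchanged.
    """
    if isinstance(node, list):
        return {str(i): nested_lists_to_dict(child) for i, child in enumerate(node)}
    return node


# Hypothetical nested-list text: sections of unequal length.
nested = [["section 1, line 1", "section 1, line 2"], ["section 2, line 1"]]
converted = nested_lists_to_dict(nested)
print(converted)
```

One reason a dict can be convenient here is that each passage gets an explicit key, so sections of different lengths do not depend on list position alone.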
Kyle P. Johnson
@kylepjohnson
Feb 11 2017 20:39
Ah yes, the jagged array …
Well, how about we decide upon our own unique namespace for that top-level dir? Like cltk_json or similar?
Luke Hollis
@lukehollis
Feb 11 2017 20:42
cltk_json sounds good to me!
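A rough sketch of what writing converted files into a cltk_json dir at the repo root could look like. Only the directory name cltk_json comes from the discussion above; the helper `write_cltk_json`, the one-file-per-work layout, and the file name `example_work` are hypothetical, and a temporary directory stands in for a real corpus repo.

```python
import json
import tempfile
from pathlib import Path


def write_cltk_json(repo_root, name, data):
    """Write `data` as <repo_root>/cltk_json/<name>.json and return the path.

    Assumption: one JSON file per converted work; the real layout may differ.
    """
    out_dir = Path(repo_root) / "cltk_json"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{name}.json"
    with out_path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-Latin scripts (e.g. Bengali, Hebrew)
        # readable in the file instead of \uXXXX escapes.
        json.dump(data, f, ensure_ascii=False, indent=2)
    return out_path


# Usage with a temporary directory standing in for a corpus repo checkout.
with tempfile.TemporaryDirectory() as tmp:
    path = write_cltk_json(tmp, "example_work", {"0": {"0": "placeholder text"}})
    loaded = json.loads(path.read_text(encoding="utf-8"))
```

Keeping the output under its own top-level name avoids clashing with a json dir that a source repo may already ship.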
Kyle P. Johnson
@kylepjohnson
Feb 11 2017 20:42
cool :)
this is kinda exciting, actually. I'll reach out to you, Luke, about how to document this formally in the cltk whitepaper i have finally committed to writing
Luke Hollis
@lukehollis
Feb 11 2017 20:43
awesome! looking forward to hearing more about it
ok to push the cltk_json to repos directly or would you rather check out the prs?
Kyle P. Johnson
@kylepjohnson
Feb 11 2017 20:44
nah, just go ahead and push away!
Luke Hollis
@lukehollis
Feb 11 2017 20:44
:thumbsup:
they have these codex hackathons out your way in sf too :) highly recommended!
Kyle P. Johnson
@kylepjohnson
Feb 11 2017 20:47
really? Let's follow up about this, I want to go
Luke Hollis
@lukehollis
Feb 11 2017 20:49
yeah! first one was at github hq, i think. hrmm.. trying to find more info about it. here’s the website http://codexhackathon.com/