These are chat archives for cltk/cltk_api

22 Nov 2015
Luke Hollis
@lukehollis
Nov 22 2015 18:59
Right now text saved to mongo via the ingest / coptic strategy looks a little like this: db.text.find({ work : "apophthegmata-patrum-sahidic-26-cassian"})
{ "_id" : ObjectId("565204513d024e7bdd56a39c"), "work" : "apophthegmata-patrum-sahidic-26-cassian", "subwork" : { "n" : 1, "slug" : "1" }, "text" : " ⲁ ϥ ϫⲟⲟ̇ ⲥ ⲛ̇ϭⲓ̇ ⲁ̇ⲡⲁ ⲕⲁ ", "html" : "", "genre" : "", "n" : 33 }
{ "_id" : ObjectId("565204513d024e7bdd56a39d"), "work" : "apophthegmata-patrum-sahidic-26-cassian", "subwork" : { "n" : 1, "slug" : "1" }, "text" : " ⲥⲓ̇ⲁ̇ⲛⲟⲥ · ϫⲉ ⲁ̇ ϥ ", "html" : "", "genre" : "", "n" : 34 }
{ "_id" : ObjectId("565204513d024e7bdd56a39e"), "work" : "apophthegmata-patrum-sahidic-26-cassian", "subwork" : { "n" : 1, "slug" : "1" }, "text" : " ϫⲟⲟ̇ ⲥ ⲛ̇ϭⲓ̇ ⲟⲩⲁ̇ ⲛ ⲉⲛ ⲥⲩⲛ ", "html" : "", "genre" : "", "n" : 35 }
{ "_id" : ObjectId("565204513d024e7bdd56a39f"), "work" : "apophthegmata-patrum-sahidic-26-cassian", "subwork" : { "n" : 1, "slug" : "1" }, "text" : " ⲕ̇ⲗⲏⲧⲓ̇ⲕⲟⲥ · ⲉ̇ ⲁ̇ ϥ ⲁ̇ⲡⲟ ", "html" : "", "genre" : "", "n" : 36 }
{ "_id" : ObjectId("565204513d024e7bdd56a3a0"), "work" : "apophthegmata-patrum-sahidic-26-cassian", "subwork" : { "n" : 1, "slug" : "1" }, "text" : " ⲧⲁⲥⲥⲉ ⲛ̇ ⲛⲉϥ ⲭ̇ⲣⲏⲙⲁ ", "html" : "", "genre" : "", "n" : 37 }
{ "_id" : ObjectId("565204513d024e7bdd56a3a1"), "work" : "apophthegmata-patrum-sahidic-26-cassian", "subwork" : { "n" : 1, "slug" : "1" }, "text" : " ⲧⲏⲣ ⲟⲩ ⲁ̇ ϥ ⲧⲁ̇ⲁ ⲩ ⲛ̇ ⲛ̇ ", "html" : "", "genre" : "", "n" : 38 }
{ "_id" : ObjectId("565204513d024e7bdd56a3a2"), "work" : "apophthegmata-patrum-sahidic-26-cassian", "subwork" : { "n" : 1, "slug" : "1" }, "text" : " ϩⲏⲕⲉ · ⲁ̇ ϥ ⲕⲁ ϩⲛ̇ ", "html" : "", "genre" : "", "n" : 39 }
{ "_id" : ObjectId("565204513d024e7bdd56a3a3"), "work" : "apophthegmata-patrum-sahidic-26-cassian", "subwork" : { "n" : 2, "slug" : "2" }, "text" : " ⲕⲟⲩⲓ̇ ⲛⲁ ϥ ⲉ̇ⲧⲃⲉ ⲧⲉϥ ", "html" : "", "genre" : "", "n" : 40 }
{ "_id" : ObjectId("565204513d024e7bdd56a3a4"), "work" : "apophthegmata-patrum-sahidic-26-cassian", "subwork" : { "n" : 2, "slug" : "2" }, "text" : " ⲭⲣ ⲓ̇ⲁ̇ ⲙⲁⲩⲁ̇ⲁ ϥ ̄· ⲙ̇ⲡⲉ ϥ ", "html" : "", "genre" : "", "n" : 1 }
Again, this is some of the simplest things I could think of to get the api off the ground for interacting with the client side application
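For reference, a minimal pymongo sketch of how the frontend-facing part of the API could pull these chunks back out (the collection and field names follow the documents above; the connection URI and database name are assumptions):

from pymongo import MongoClient

# Assumed local mongod and database name; collection and fields match the dump above.
client = MongoClient("mongodb://localhost:27017")
db = client["cltk_api"]

# All text chunks for one work, in reading order by the running counter "n".
chunks = db.text.find(
    {"work": "apophthegmata-patrum-sahidic-26-cassian"},
    {"_id": 0, "subwork": 1, "n": 1, "text": 1},
).sort("n", 1)

for chunk in chunks:
    print(chunk["subwork"]["slug"], chunk["n"], chunk["text"])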
Luke Hollis
@lukehollis
Nov 22 2015 19:04
It definitely misses a lot and can be vastly improved in future iterations
Rob Jenson
@ferthalangur
Nov 22 2015 19:11
So I hate to be the bearer of complication, but I think you're going to need some metadata about the text itself, or you're going to be in a world of hurt down the road.
For example, is that UTF-8, UTF-16 or UTF-32 encoded text?
Luke Hollis
@lukehollis
Nov 22 2015 19:12
metadata about the text itself is saved like this:
db.works.find()
{ "_id" : ObjectId("565204493d024e7bdd569b1d"), "title" : "IT-NB IB16 f.16 White Monastery Manuscript XL 93-94 (93:ii.6-94)", "slug" : "it-nb-ib16-f-16-white-monastery-manuscript-xl-93-94-93-ii-6-94", "ingest_documents" : [ "/Users/lrh/cltk_data/coptic/text/coptic_text_scriptorium/abraham/shenoute.abraham.our.father_TEI/Abraham.XL93-94_TEI.xml" ], "corpus" : "coptic", "authors" : [ { "slug" : "shenoute" } ] }
{ "_id" : ObjectId("5652044a3d024e7bdd569ba7"), "title" : "IT-NB IB2 ff. 26-27 White Monastery Manuscript YA 518-520", "slug" : "it-nb-ib2-ff-26-27-white-monastery-manuscript-ya-518-520", "ingest_documents" : [ "/Users/lrh/cltk_data/coptic/text/coptic_text_scriptorium/abraham/shenoute.abraham.our.father_TEI/Abraham.YA518-520_TEI.xml" ], "corpus" : "coptic", "authors" : [ { "slug" : "shenoute" } ] }
{ "_id" : ObjectId("5652044a3d024e7bdd569c61"), "title" : "IT-NB IB2 ff. 28-30 White Monastery Manuscript YA 525-530", "slug" : "it-nb-ib2-ff-28-30-white-monastery-manuscript-ya-525-530", "ingest_documents" : [ "/Users/lrh/cltk_data/coptic/text/coptic_text_scriptorium/abraham/shenoute.abraham.our.father_TEI/Abraham.YA525-530_TEI.xml" ], "corpus" : "coptic", "authors" : [ { "slug" : "shenoute" } ] }
{ "_id" : ObjectId("5652044a3d024e7bdd569dd7"), "title" : "IT-NB IB2 ff. 31-33 White Monastery Manuscript YA 535-540", "slug" : "it-nb-ib2-ff-31-33-white-monastery-manuscript-ya-535-540", "ingest_documents" : [ "/Users/lrh/cltk_data/coptic/text/coptic_text_scriptorium/abraham/shenoute.abraham.our.father_TEI/Abraham.YA535-40_TEI.xml" ], "corpus" : "coptic", "authors" : [ { "slug" : "shenoute" } ] }
{ "_id" : ObjectId("5652044a3d024e7bdd569f4d"), "title" : "FR-BN 130/5 ff 21-22 White Monastery Manuscript YA 547-50", "slug" : "fr-bn-130-5-ff-21-22-white-monastery-manuscript-ya-547-50", "ingest_documents" : [ "/Users/lrh/cltk_data/coptic/text/coptic_text_scriptorium/abraham/shenoute.abraham.our.father_TEI/Abraham.YA547-50_TEI.xml" ], "corpus" : "coptic", "authors" : [ { "slug" : "shenoute" } ] }
{ "_id" : ObjectId("5652044b3d024e7bdd56a050"), "title" : "FR-BN 130/4 ff. 110-111 White Monastery Manuscript YA 551-554", "slug" : "fr-bn-130-4-ff-110-111-white-monastery-manuscript-ya-551-554", "ingest_documents" : [ "/Users/lrh/cltk_data/coptic/text/coptic_text_scriptorium/abraham/shenoute.abraham.our.father_TEI/Abraham.YA551-54_TEI.xml" ], "corpus" : "coptic", "authors" : [ { "slug" : "shenoute" } ] }
{ "_id" : ObjectId("5652044b3d024e7bdd56a132"), "title" : "GB-OB MS.Clarendon Press b4. ff.54-57 White Monastery Manuscript ZH frgs. 1a-d", "slug" : "gb-ob-ms-clarendon-press-b4-ff-54-57-white-monastery-manuscript-zh-frgs-1a-d", "ingest_documents" : [ "/Users/lrh/cltk_data/coptic/text/coptic_text_scriptorium/abraham/shenoute.abraham.our.father_TEI/Abraham.ZHfrgmts1a_d_TEI.xml" ], "corpus" : "coptic", "authors" : [ { "slug" : "shenoute" } ] }
db.authors.find()
{ "_id" : ObjectId("56520449316733f4e2e9cd21"), "name_original" : "Shenoute", "name_english" : "Shenoute", "slug" : "shenoute" }
{ "_id" : ObjectId("56520450316733f4e2e9cd2b"), "name_original" : "anonymous", "name_english" : "anonymous", "slug" : "anonymous" }
{ "_id" : ObjectId("56520453316733f4e2e9cd3b"), "name_original" : "Besa", "name_english" : "Besa", "slug" : "besa" }
{ "_id" : ObjectId("56520457316733f4e2e9cd3e"), "name_original" : "None", "name_english" : "None", "slug" : "none" }
Also, as we've discussed here and there--this is only the portion of the API that feeds text to the frontend
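And to make the linkage concrete, a rough sketch (same assumed pymongo setup as above) of resolving a work's author slugs against db.authors:

# Look up a work by slug, then resolve its author slugs against db.authors.
work = db.works.find_one(
    {"slug": "it-nb-ib16-f-16-white-monastery-manuscript-xl-93-94-93-ii-6-94"}
)
authors = [db.authors.find_one({"slug": a["slug"]}) for a in work.get("authors", [])]
print(work["title"], [a["name_english"] for a in authors if a])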
Rob Jenson
@ferthalangur
Nov 22 2015 19:16
Hmmm ...
So the frontend is not going to have any trouble figuring out whether to interpret those text strings as UTF-8, UTF-16 or UTF-32 characters?
Luke Hollis
@lukehollis
Nov 22 2015 19:18
I think you definitely understand unicode better than me :D
what do you think?
We could save the encoding used to open the file as part of the document
But I haven't had to include the character encoding spec in APIs in the past
Rob Jenson
@ferthalangur
Nov 22 2015 19:19
Well ... most of your encoding / decoding libraries will have to know what a string of "characters" is encoded as.
Otherwise your Chinese will turn into Turkish.
Luke Hollis
@lukehollis
Nov 22 2015 19:22
We can try it! Here.. give me one sec, and I'll standup a server with this text
What's the demoing tool that allows you to tunnel traffic to your local machine...
hrmmm..
Yeah, this thing: https://ngrok.com/
one sec
Have to finish my ramen also.
Rob Jenson
@ferthalangur
Nov 22 2015 19:23
It looks like you're opening your Coptic files as UTF-8
Luke Hollis
@lukehollis
Nov 22 2015 19:24
Ah, did you run it?
Rob Jenson
@ferthalangur
Nov 22 2015 19:25
No, just glancing at the code
Luke Hollis
@lukehollis
Nov 22 2015 19:25
ooh cool okay
Will also add a quick resource endpoint to direct queries to mongo and retrieve results
Rob Jenson
@ferthalangur
Nov 22 2015 19:26
But yeah, I think that you'd be safer to include the text encoding as a parameter in the API, wherever you have a text string
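Something like this, say -- the "encoding" field here is hypothetical, not part of the current ingest output:

{
    "work": "apophthegmata-patrum-sahidic-26-cassian",
    "subwork": {"n": 1, "slug": "1"},
    "n": 33,
    "text": " ⲁ ϥ ϫⲟⲟ̇ ⲥ ⲛ̇ϭⲓ̇ ⲁ̇ⲡⲁ ⲕⲁ ",
    "encoding": "utf-8",  # hypothetical field naming the encoding of "text"
}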
Luke Hollis
@lukehollis
Nov 22 2015 19:26
What are the advantages to opening them in UTF-8 versus another encoding?
Rob Jenson
@ferthalangur
Nov 22 2015 19:27
Well, actually, you have to know how the files were written.
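A minimal sketch of what goes wrong when you guess incorrectly, assuming a file that really was written as UTF-8 (the filename is just a placeholder):

# Write a small Coptic sample as UTF-8.
with open("cassian.txt", "w", encoding="utf-8") as f:
    f.write("ⲁⲡⲁ ⲕⲁⲥⲥⲓⲁⲛⲟⲥ")

# Reading it back with the same encoding round-trips cleanly.
with open("cassian.txt", encoding="utf-8") as f:
    print(f.read())   # ⲁⲡⲁ ⲕⲁⲥⲥⲓⲁⲛⲟⲥ

# Reading it as latin-1 raises no error -- every byte is "valid" -- but yields mojibake.
with open("cassian.txt", encoding="latin-1") as f:
    print(f.read())   # garbage characters, silently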
Luke Hollis
@lukehollis
Nov 22 2015 19:28
Also: tangent--remember when MySQL had latin1_swedish_ci as its default char encoding
nightmares..
Rob Jenson
@ferthalangur
Nov 22 2015 19:28
Yes.
Luke Hollis
@lukehollis
Nov 22 2015 19:28
ahh!
i'm sure they have their reasons
Rob Jenson
@ferthalangur
Nov 22 2015 19:28
they were storing text originally with that encoding
Luke Hollis
@lukehollis
Nov 22 2015 19:28
that's the last major unicode problem I've had to deal with
Rob Jenson
@ferthalangur
Nov 22 2015 19:29
So here's the deal ... where do your input files come from?
Luke Hollis
@lukehollis
Nov 22 2015 19:29
the cltk repos
*corpora repos
Rob Jenson
@ferthalangur
Nov 22 2015 19:29
OK, where did they come from?
Luke Hollis
@lukehollis
Nov 22 2015 19:30
Rob Jenson
@ferthalangur
Nov 22 2015 19:30
So the only way to figure out how a piece of text is encoded is to use a binary editor / analyzer
And have knowledge of how that particular language could be encoded
Rob Jenson
@ferthalangur
Nov 22 2015 19:31
For many European languages, encoding(latin_1, $x) === encoding(utf-8, $x)
yessssss.
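More precisely: the two agree byte-for-byte on the ASCII range and diverge as soon as accented characters appear, e.g.:

# Pure ASCII encodes to identical bytes in latin-1 and UTF-8.
"cafe".encode("latin-1") == "cafe".encode("utf-8")   # True

# Outside ASCII they differ: latin-1 uses one byte, UTF-8 two (or more).
"café".encode("latin-1")   # b'caf\xe9'
"café".encode("utf-8")     # b'caf\xc3\xa9'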
Luke Hollis
@lukehollis
Nov 22 2015 19:32
Okay, let me try some of the first ingest documents in there and see what I can do
Kyle P. Johnson
@kylepjohnson
Nov 22 2015 20:52
Oh man, while you guys have fun with encodings, I've been in TEI hell. I really don't know if this is the right way to go about things, but here's my thinking:
Earlier, I was parsing the TEI with the idea of serving it to the API. However this became so dang complicated that I threw my hands up. So now, I'm trying to break down the files into something very simple, which I'll be able to serve rather easily.
The simple data structure looks like this:
{'author': author_name,
  'text': [
    {'book': 1,
     'chapters':
       [{'chapter': 1, 'text': real_text}, …]
    }
  ]
}
@lukehollis chardet is terrific!
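A minimal sketch of how chardet could be used on an ingest document before decoding it (the path is shortened to a bare filename here; detect() only returns a best guess plus a confidence score):

import chardet

# Read the raw bytes and let chardet guess the encoding.
with open("Abraham.XL93-94_TEI.xml", "rb") as f:
    raw = f.read()

guess = chardet.detect(raw)   # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
text = raw.decode(guess["encoding"])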
Luke Hollis
@lukehollis
Nov 22 2015 20:56
Hey that looks perfect
Kyle P. Johnson
@kylepjohnson
Nov 22 2015 20:56
You think so? Heh, I've been feeling kinda sheepish after failing the first time parsing the TEI
Luke Hollis
@lukehollis
Nov 22 2015 20:57
yeah, I mean, if we can even get them boiled down to the most essential metadata for accessing the data, we can query them for the web frontend
I think on Thursday I wanted the API to do more things than the frontend app really needs--and it can do all of those things at some point--but if we get parsing even to the data model above, we can display things pretty easily
Kyle P. Johnson
@kylepjohnson
Nov 22 2015 20:58
And concerning metadata, some of this stuff is a mess! For instance, Ammianus's metadata says it is book-chapter-section. However, only some sections are numbered. E.g., it gives a section number for 6, then 13, then 30
Luke Hollis
@lukehollis
Nov 22 2015 20:58
chardet looks great!
Oh man
Kyle P. Johnson
@kylepjohnson
Nov 22 2015 20:59
Yeah, go for it. I am not knowledgeable about any of the Coptic files, however if you get stuck on something I can try to be of service
Luke Hollis
@lukehollis
Nov 22 2015 20:59
That's going to be rough...
okay awesome
Kyle P. Johnson
@kylepjohnson
Nov 22 2015 21:00
Ha. We're up to it
Luke Hollis
@lukehollis
Nov 22 2015 21:00
Do you think it's worthwhile to attempt to infer metadata about those section numbers without the number attribute?
this is something I was running into with coptic texts as well
Kyle P. Johnson
@kylepjohnson
Nov 22 2015 21:02
That's my question/concern exactly … I don't know. If the metadata says it has sections, but they're not well marked up in the actual text … then that becomes complicated
Complicated enough that, in these cases, I lose interest in worrying about sections at all
Luke Hollis
@lukehollis
Nov 22 2015 21:03
oh yeah?
Okay
Kyle P. Johnson
@kylepjohnson
Nov 22 2015 21:03
Or save it for another time, perhaps …
Luke Hollis
@lukehollis
Nov 22 2015 21:03
Yeah
yeah--I mean, that's a good perspective--we can do something smarter with those sections when we have a functioning prototype and can see whether they're really necessary or not
Kyle P. Johnson
@kylepjohnson
Nov 22 2015 21:04
There's no right answer. My concern is that doing it "right" will be so hard that we'll jeopardize getting anything done at all
Luke Hollis
@lukehollis
Nov 22 2015 21:04
Yes.
Yes--that's really true
The main potential problem I'm wondering about is whether it's at the level of line numbers for a poem
because I think we might at some point have to query lines 100 - 200 or somesuch
and if we lose the metadata for the line number, it seems like that line would simply be missing from the database
Kyle P. Johnson
@kylepjohnson
Nov 22 2015 21:06
(1) about the sections as in Ammianus -- they cannot be caught programmatically, I don't think. So that means we'll need a human to go through and add this markup for us.
Luke Hollis
@lukehollis
Nov 22 2015 21:06
But we can jump that hurdle if it actually happens--how is it they say it?--premature optimization is the root of all evil
Kyle P. Johnson
@kylepjohnson
Nov 22 2015 21:07
(2) For poetry the situation is better. For poetry I decided upon this datastructure:
    {'author': 'Vergil',
     'text': [
       {'book': 1,
        'line': ['aaaaa', 'bbbbb', 'cccc']
        }
     ]
    }
Luke Hollis
@lukehollis
Nov 22 2015 21:08
awesome! that's perfect
Kyle P. Johnson
@kylepjohnson
Nov 22 2015 21:08
Line number can be found via the index of the 'line' field
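In other words, assuming the structure above is bound to a variable called poem, line 2 of Book 1 would be:

# Zero-based list index, so line n lives at index n - 1.
book_1 = next(b for b in poem["text"] if b["book"] == 1)
line_2 = book_1["line"][2 - 1]   # 'bbbbb'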
Luke Hollis
@lukehollis
Nov 22 2015 21:09
that sounds great
Kyle P. Johnson
@kylepjohnson
Nov 22 2015 21:09
You sure? If it sucks don't be shy in saying so!
Luke Hollis
@lukehollis
Nov 22 2015 21:09
No way, that's about all I think we need at this point
maybe a work title
'work' : {
     'slug' : "aeneid"
}
you know, somesuch
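So a poetry document might end up looking something like this -- just a sketch, with the 'work' field being the proposed addition:

{'work': {'slug': 'aeneid'},
 'author': 'Vergil',
 'text': [
     {'book': 1,
      'line': ['aaaaa', 'bbbbb', 'cccc']
      }
 ]
}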
Kyle P. Johnson
@kylepjohnson
Nov 22 2015 21:10
Ah, yes let me hunt that out
So far I can parse 13 files total.
Luke Hollis
@lukehollis
Nov 22 2015 21:13
nice! :D we're getting there
coptic was very similar across files, but I'm sure I'm missing a lot there
I'll make sure I merge your latest
Kyle P. Johnson
@kylepjohnson
Nov 22 2015 21:15
Wacky stuff! Yeah, merge it but give it a few hours before working with the output.
Thanks @ferthalangur. I'll have something tangible for your feedback sometime today, I hope
Rob Jenson
@ferthalangur
Nov 22 2015 21:57
@lukehollis makes a very good point. Don't let Perfect be the enemy of the Good. We're going to run into boundary cases and have to refine the procedure. Better to get a small sample on the first cut, and then generalize it better later.
Rob Jenson
@ferthalangur
Nov 22 2015 22:15
Which Ammianus were you vexed by, @kylepjohnson?
Y'all might want to add some of the following to your .gitignore files:
Luke Hollis
@lukehollis
Nov 22 2015 23:47
Hey that gitignore looks great!