These are chat archives for cltk/cltk_api

17th
Mar 2016
Luke Hollis
@lukehollis
Mar 17 2016 00:09 UTC
Okay interesting..
Looks good, but what we have now is structured so that the key is always an integer
We can design for this use case though
Kyle P. Johnson
@kylepjohnson
Mar 17 2016 01:42 UTC
Yes, I know this isn't perfect ... but I don't know what would be better. There's something else, I'm sure.
The central problem is: How can we
... How can we convey
(Sorry, iPhone app keeps sending my message too soon!)
How can we mark the beginning of a new speaker? In my NLP work I couldn't care less who the speaker is, but for reading this is obviously necessary!
Luke, when you're ready to work on this, I expect you'll have better ideas
Luke Hollis
@lukehollis
Mar 17 2016 04:41 UTC
Oh man, that's a good question! Hrm... well, it seems like if it works for now, I'm all for it.
We can work toward a more complex solution after we get to a simple one, I think
One other thought on this: we're currently sorting on all the keys for book-chapter, chapter-section, book-line, etc.
So when we have something like
{
   'text' : {
      '1' : {
         '1' : "This is the first line",
         '2' : "This is the second line",
         ... 
      }
   }
}
Our current queries look like this:
Luke Hollis
@lukehollis
Mar 17 2016 04:47 UTC
Texts.find({work:"the_work"}, {sort:{ n_1 : 1, n_2 : 1}})
Which means generally sort by the first nested keys and then sort by the second nested keys
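A minimal sketch of what that sort amounts to, assuming the nested document is first flattened into one record per line. The field names n_1 and n_2 come from the query above; flatten_text and the sample data are hypothetical. The keys are converted to integers because string keys would put '10' before '2'.

```python
def flatten_text(doc):
    """Turn {'text': {'1': {'1': line, ...}}} into flat records (hypothetical helper)."""
    records = []
    for n_1, inner in doc["text"].items():
        for n_2, line in inner.items():
            # Store the keys as integers so that 10 sorts after 2.
            records.append({"n_1": int(n_1), "n_2": int(n_2), "text": line})
    return records


doc = {
    "text": {
        "2": {"1": "Book two, line one"},
        "1": {"10": "Book one, line ten", "2": "Book one, line two"},
    }
}

# Same ordering as sort: {n_1: 1, n_2: 1}: first key ascending, then second key.
ordered = sorted(flatten_text(doc), key=lambda r: (r["n_1"], r["n_2"]))
for record in ordered:
    print(record["n_1"], record["n_2"], record["text"])
```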
The simplest possible solution (maybe not optimal?) that I could imagine if we wanted to keep sorting on the keys would just be to do something like this:
Luke Hollis
@lukehollis
Mar 17 2016 04:54 UTC
{
   "1" : {
      "speaker" : "Actor 1",
      "text" : "This is the first line"
   },
   "2" : {
      "speaker" : "Actor 2",
      "text" : "This is the second line"
   },
   "3" : {
      "speaker" : "Actor 2",
      "text" : "This is the third line"
   }
}
But, we can ingest data from either data format--whatever makes the most sense for the document conversion.
When we move things to Text objects in the database, I think we'll just have to add a speaker property that we can use across genres when appropriate
I was thinking about this back in the day with the speakers in the Eclogues
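A rough sketch of the speaker property idea, assuming a Text object is just a record with a line number, the text, and an optional speaker. The Text class and actor_lines_to_texts are hypothetical stand-ins, not the actual database schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Text:
    """Hypothetical stand-in for a Text object in the database."""
    n_1: int
    text: str
    speaker: Optional[str] = None  # only set for genres that have speakers


def actor_lines_to_texts(lines):
    """Build Text objects from {"1": {"speaker": ..., "text": ...}, ...}."""
    texts = [
        Text(n_1=int(key), text=value["text"], speaker=value.get("speaker"))
        for key, value in lines.items()
    ]
    # Keep sorting on the integer key, as with the existing queries.
    return sorted(texts, key=lambda t: t.n_1)


drama = {
    "2": {"speaker": "Actor 2", "text": "This is the second line"},
    "1": {"speaker": "Actor 1", "text": "This is the first line"},
    "3": {"speaker": "Actor 2", "text": "This is the third line"},
}

texts = actor_lines_to_texts(drama)
for t in texts:
    print(t.n_1, t.speaker, t.text)
```

Leaving speaker as None by default means non-dramatic texts can use the same object unchanged.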
Sameer Chaudhari
@sameeriitkgp
Mar 17 2016 05:12 UTC
I agree with Luke.
Sameer Chaudhari
@sameeriitkgp
Mar 17 2016 12:15 UTC
@kylepjohnson -- If you agree, we can create an issue for this actor-line
I think I can take care of the document conversion and the database part.
What do you think @lukehollis ?
Kyle P. Johnson
@kylepjohnson
Mar 17 2016 15:00 UTC
@SameerIITKGP and @lukehollis I agree, too. I like your solution much better than mine
@SameerIITKGP I think you can go for it. I thank you both for being so helpful. I knew this would be a problem and had avoided it
Luke Hollis
@lukehollis
Mar 17 2016 17:50 UTC
sounds good! I think we should just add the speaker parameter to the Text objects collection schema and add the check for ingesting "actor-line" documents in the text sync
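One way the check in the text sync might look, as a sketch only. It assumes an "actor-line" document is a mapping whose values are all objects with both a "speaker" and a "text" key; is_actor_line is a hypothetical name.

```python
def is_actor_line(lines):
    """Return True if every value looks like {"speaker": ..., "text": ...}."""
    return bool(lines) and all(
        isinstance(value, dict) and "speaker" in value and "text" in value
        for value in lines.values()
    )


actor_doc = {
    "1": {"speaker": "Actor 1", "text": "This is the first line"},
    "2": {"speaker": "Actor 2", "text": "This is the second line"},
}
plain_doc = {"1": "This is the first line", "2": "This is the second line"}

print(is_actor_line(actor_doc))
print(is_actor_line(plain_doc))
```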
Kyle P. Johnson
@kylepjohnson
Mar 17 2016 18:37 UTC
Wonderful! Seeing drama in the frontend will be powerful
Luke Hollis
@lukehollis
Mar 17 2016 18:39 UTC
totally! that'll be good
Was wondering about involving the cltk core more in the API
So when you GET api.cltk.org:5000/core/stemmer?input="quid faciat laetas segetes quo sidere...etc."
it will run something like this:
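from cltk.stem.latin.stem import Stemmer  # Latin stemmer; import path assumed from the cltk core docs

def stem_input(text):
    # text is the value of the "input" query parameter
    s = Stemmer()
    return s.stem(text)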
and you will get back "Quid fac laet seget quo sid ... etc."
Seems like pretty low-hanging fruit if we could keep it simple.
Idk about the path /core/stemmer... maybe we could just make that match the directory structure of the cltk core package
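A small sketch of what matching the directory structure could mean: the URL path is turned into a dotted module path under the package. path_to_module is a hypothetical helper and the example path is only illustrative; nothing here imports the cltk core.

```python
def path_to_module(path, package="cltk"):
    """Map a URL path such as /stem/latin/stem to a dotted module path."""
    parts = [part for part in path.strip("/").split("/") if part]
    return ".".join([package] + parts)


print(path_to_module("/stem/latin/stem"))
```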
Luke Hollis
@lukehollis
Mar 17 2016 18:46 UTC
This is in my head as I look more at getting scansion and text reuse on the frontend
Kyle P. Johnson
@kylepjohnson
Mar 17 2016 20:21 UTC
@lukehollis In and of itself, this isn't wrong.
If you are OK testing out some ideas now in the API and don't mind if it gets redone, definitely go for implementing the routing you're thinking of.
I think I'll need to sit down with pen and paper to come to a good API proposal that will be flexible enough for all processing and all texts
Comment/criticism about your URI in particular: I think we'll be happier posting text as data objects, then calling the processing type in the URI
Kyle P. Johnson
@kylepjohnson
Mar 17 2016 20:26 UTC
Because there will come a time when we could be posting very large data packets to the API
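A sketch of that shape, under stated assumptions: the text travels in a JSON request body and the URI only names the processing type. The "text" field name, handle_post, and the PROCESSORS table are hypothetical, and placeholder_stem is a stand-in so the example runs without cltk installed.

```python
import json


def placeholder_stem(text):
    # Stand-in for the cltk core stemmer. It only lowercases the text;
    # it does NOT stem anything.
    return text.lower()


# Processing type (taken from the URI) -> function to run on the posted text.
PROCESSORS = {"/core/stemmer": placeholder_stem}


def handle_post(path, body):
    """Run the processor named by the URI on the text carried in a JSON body."""
    processor = PROCESSORS.get(path)
    if processor is None:
        return 404, {"error": "unknown processing type"}
    try:
        text = json.loads(body)["text"]
    except (ValueError, KeyError, TypeError):
        return 400, {"error": 'expected a JSON object with a "text" field'}
    return 200, {"result": processor(text)}


status, payload = handle_post(
    "/core/stemmer", json.dumps({"text": "Quid faciat laetas segetes"})
)
print(status, payload)
```

Because the text is in the body rather than the query string, its size is not limited by URI length.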
Luke Hollis
@lukehollis
Mar 17 2016 21:48 UTC
Ah sure, that sounds great! well, I'll create some issues for this
makes sense, since long URIs could end up breaking things in the future
Luke Hollis
@lukehollis
Mar 17 2016 22:51 UTC
Okay, created an issue to explore the stemmer and cltk api so that we can start moving forward there: cltk/cltk_api#20