These are chat archives for cltk/cltk_api

18th
Mar 2016
Luke Hollis
@lukehollis
Mar 18 2016 05:04
There have been a few requests for more comprehensive web application architecture diagramming, so here's what I propose:
I'm not married to any of this, but I think that this solves the problem of leveraging the CLTK advanced NLP functionality to render data to the frontend interface.
Kyle P. Johnson
@kylepjohnson
Mar 18 2016 13:19
This looks very good
I'll point people here too if they need
Manvendra Singh
@manu-chroma
Mar 18 2016 13:26
@lukehollis what is the use of the MongoDB database here?
can you explain its function briefly
can't the Meteor application query the API in real time and serve whatever is required, rather than getting the whole set of data and storing it in the DB?
Kyle P. Johnson
@kylepjohnson
Mar 18 2016 13:33
@manu-chroma querying in real time might be too slow, so Luke is planning on pre-processing some text
We may move away from this, but it's the best way to get started for us
Luke Hollis
@lukehollis
Mar 18 2016 13:36
Yes, that's all
We have to keep the text somewhere so for each Text object created we can request definitions, commentary, translations, etc. from the API.
It'd be awesome if we could just get it all at once!
And in the future, we may be able to, but that seems like it would require the Flask API application also having a database
Also, Meteor needs a Mongo database for User account, user notes/comments, and other things that are only on the frontend
Manvendra Singh
@manu-chroma
Mar 18 2016 13:40
And in the future, we may be able to, but that seems like it would require the Flask API application also having a database
@lukehollis I had this doubt in my mind. Thanks for your response!
Luke Hollis
@lukehollis
Mar 18 2016 13:42
Ah, sure--well, I think there's a lot of ways that we can do it..
Manvendra Singh
@manu-chroma
Mar 18 2016 13:42
To retrieve data faster, we can use a very simple hack: query the various kinds of data from the API asynchronously. I don't know if that's already implemented in the frontend.
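A minimal sketch of the async idea, written in Python for illustration only (the frontend itself is Meteor/JavaScript). `fetch_resource` is a hypothetical stand-in that sleeps instead of calling the real API, and the three resource kinds are assumptions based on the data mentioned earlier in the thread.

```python
import asyncio


async def fetch_resource(kind: str, text_id: str) -> dict:
    """Hypothetical stand-in for one API request; sleeps instead of doing I/O."""
    await asyncio.sleep(0.01)
    return {"kind": kind, "text_id": text_id}


async def fetch_all(text_id: str, kinds: list[str]) -> dict[str, dict]:
    """Issue all requests concurrently and collect the results by kind."""
    results = await asyncio.gather(*(fetch_resource(k, text_id) for k in kinds))
    return dict(zip(kinds, results))


def load_text_resources(text_id: str) -> dict[str, dict]:
    """Synchronous entry point for callers that are not already async."""
    kinds = ["definitions", "commentary", "translations"]  # assumed resource kinds
    return asyncio.run(fetch_all(text_id, kinds))
```

With concurrent requests the total wait is roughly that of the slowest call rather than the sum of all of them.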
Luke Hollis
@lukehollis
Mar 18 2016 13:43
The nice part about only rendering from MongoDb is we can leverage the Meteor websocket DDP connection
and then put our queries in our react .jsx templates
Manvendra Singh
@manu-chroma
Mar 18 2016 13:44
That will surely make the site more responsive while the data is being fetched in real time
Luke Hollis
@lukehollis
Mar 18 2016 13:44
yeah! and should be easy to program
Manvendra Singh
@manu-chroma
Mar 18 2016 13:46
Caching can also be implemented in the frontend with the aid of MongoDB
If I'm not wrong, it will help in reducing the load on API by some amount
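A rough sketch of that caching idea, in Python for illustration. The dict is only a stand-in for a MongoDB collection, and `fetch_from_api` is a hypothetical callable, not part of the CLTK API.

```python
from typing import Callable


class ReadThroughCache:
    """Serve documents from a local store and call the API only on a miss."""

    def __init__(self, fetch_from_api: Callable[[str], dict]) -> None:
        self._fetch = fetch_from_api  # hypothetical API client function
        self._store: dict[str, dict] = {}  # stand-in for a MongoDB collection
        self.api_calls = 0  # counts how often the API was actually hit

    def get(self, doc_id: str) -> dict:
        """Return the cached document, fetching and storing it on first use."""
        if doc_id not in self._store:
            self.api_calls += 1
            self._store[doc_id] = self._fetch(doc_id)
        return self._store[doc_id]
```

Repeated reads of the same document then reach the API only once, which is where the reduction in load would come from.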
Luke Hollis
@lukehollis
Mar 18 2016 13:47
Yes, I think you're correct
Actually, I imagine that we'll only have to update content rarely.. and won't really have to do the interval sync I proposed
We'll more have to load all content from the API once
Manvendra Singh
@manu-chroma
Mar 18 2016 13:48
We'll more have to load all content from the API once
@lukehollis What do you mean by this ?
Luke Hollis
@lukehollis
Mar 18 2016 13:50
and then just put a webhook on the CLTK git repos that tells our API to update the corpus with git, and then tells the frontend application to invalidate its cache of resources for that document and reingest them
we can make it as granular as we need to
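One way the webhook step might look, as a sketch in Python. It assumes a GitHub-style push payload in which each commit lists `added`, `modified`, and `removed` file paths, and it assumes cached documents are keyed by their path in the corpus repo. Both are assumptions for illustration, not how cltk_frontend is known to work.

```python
def changed_paths(payload: dict) -> set[str]:
    """Collect every file path touched by the commits in a push payload."""
    paths: set[str] = set()
    for commit in payload.get("commits", []):
        for key in ("added", "modified", "removed"):
            paths.update(commit.get(key, []))
    return paths


def invalidate(cache: dict[str, dict], payload: dict) -> set[str]:
    """Drop cached documents whose source file changed; return the dropped keys."""
    stale = {path for path in changed_paths(payload) if path in cache}
    for path in stale:
        del cache[path]
    return stale
```

The returned set is the list of documents to reingest, so the granularity is one file per cache entry.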
Manvendra Singh
@manu-chroma
Mar 18 2016 13:51
@lukehollis Webhook idea is a great one.
Luke Hollis
@lukehollis
Mar 18 2016 13:51
The text server sync gets everything from the API and stores it in the Mongodb database: https://github.com/cltk/cltk_frontend/blob/master/server/text-server-sync.js
It seems like once we have all the XML we want to serve converted to JSON
we'll only really have to do this once.
If we want to keep adding documents, we can keep it on the interval sync
In this script, does the server get data from all endpoints of the API?
Luke Hollis
@lukehollis
Mar 18 2016 13:54
Yes
Everything
Manvendra Singh
@manu-chroma
Mar 18 2016 13:55
Alright
Going by the architecture, I think we'll have to mainly optimize frontend for the load.
Luke Hollis
@lukehollis
Mar 18 2016 13:56
Even when we get all our JSON/XML there.. it's still not that much data, I think--compared to what most of the technologies involved here have handled elsewhere
Manvendra Singh
@manu-chroma
Mar 18 2016 13:57
@lukehollis will there be an NGINX server accepting all requests on behalf of the frontend as well?
Luke Hollis
@lukehollis
Mar 18 2016 13:58
Sure! sounds good--if you have any ideas there, let us know--one thing I forgot to add on our diagram is the NGINX server that proxies requests to our frontend and API applications
haha, good timing
Manvendra Singh
@manu-chroma
Mar 18 2016 13:58
Haha
Then we can set up routes in the NGINX server config file to serve all media related content.
Luke Hollis
@lukehollis
Mar 18 2016 13:59
yes, the nginx will take care of a lot of the heavy lifting here
that sounds great!
if we ever need to scale (fingers crossed we do!), I think one of the easiest workflows for that when we have our docker container in place looks like this "getting started" example: http://kubernetes.io/docs/hellonode/
w/Kubernetes
Manvendra Singh
@manu-chroma
Mar 18 2016 14:00
I agree.
Docker containers should be developed in parallel.
Scaling using Docker Swarm would be so much more convenient!
http://kubernetes.io/ is indeed the way to go
Luke Hollis
@lukehollis
Mar 18 2016 14:03
oh interesting! haven't used docker swarm before, but it looks good
anyway, some workflow like this with our application container seems like it would be good
This link should be present on the README as well
Luke Hollis
@lukehollis
Mar 18 2016 14:05
thanks--I added it to the Challenges paragraph
if you'd like to adapt or build on this to show how docker/kubernetes/etc. might be used when we're ready, I can share it with you
Manvendra Singh
@manu-chroma
Mar 18 2016 14:07
I think it might be early for me to model something given that the containers are in early dev stage.
I'll try to read more about these technologies and see if I can add something useful eventually
Luke Hollis
@lukehollis
Mar 18 2016 14:20
Cool sounds good--higher priorities now anyway
Kyle P. Johnson
@kylepjohnson
Mar 18 2016 15:32
If you guys want to make images for Kubernetes go for it. I'm not married to Docker
Manvendra Singh
@manu-chroma
Mar 18 2016 15:33
Haha
Manvendra Singh
@manu-chroma
Mar 18 2016 16:02
Kubernetes is an open source container cluster manager by Google. Now each container in the cluster can be a docker container.
Comparison should be Kubernetes vs Docker Swarm
And not Kubernetes vs Docker Containers
Manvendra Singh
@manu-chroma
Mar 18 2016 16:40
@kylepjohnson @lukehollis Have a look https://github.com/fabric/fabric
Can be used for automation and deployment
Rob Jenson
@ferthalangur
Mar 18 2016 17:52
Don't forget Amazon ECS.
Manvendra Singh
@manu-chroma
Mar 18 2016 17:53
+1
Rob Jenson
@ferthalangur
Mar 18 2016 17:53
Manu is quite right. The containers are Docker containers in all three cases. Kube/Swarm/ECS are the Cloud infrastructure to run them, connect them, spin them up and down as needed.
You know, I just finished writing up a ridiculously long email about the #https #tls #certificate stuff and I'm wondering whether it's all known stuff by the intended recipients @lukehollis , @kylepjohnson and @manu-chroma
Eh ... I'll send it anyway.
Manvendra Singh
@manu-chroma
Mar 18 2016 17:57
We'll tell you that when we read it :smile:
Kyle P. Johnson
@kylepjohnson
Mar 18 2016 17:58
no, we probably do not know all we should about the subject, please do send
Rob Jenson
@ferthalangur
Mar 18 2016 17:58
The TL;DR is ... The "outside world" should be interfacing with our stuff through nginx on port 80 or 443 (preferably 443 ... there is a Q there for you Kyle about that).
Exposing a Meteor app, or gunicorn unicorn server to the Internet is unwise.
Kyle P. Johnson
@kylepjohnson
Mar 18 2016 17:59
I'm good with forcing everything to 443
Rob Jenson
@ferthalangur
Mar 18 2016 18:00
OK. cool.
You manage the cltk.org domain on DreamHost?
I'm picking apart the pieces on the net, but some stuff is obfuscated.
Kyle P. Johnson
@kylepjohnson
Mar 18 2016 18:01
Yes, I manage the domain thru Dreamhost
Rob Jenson
@ferthalangur
Mar 18 2016 18:01
Are there going to be more pieces possibly in the future?
I think I found {NULL, docs, api}.cltk.org
Kyle P. Johnson
@kylepjohnson
Mar 18 2016 18:03
I don't know. such as?
ah yes.
right now we have docs. and api.
When we decide upon a name for Luke's app, it could go under a subdomain
Rob Jenson
@ferthalangur
Mar 18 2016 18:03
I don't know ... latin.pejoratives.cltk.org ... greek.pejoratives.cltk.org ... it all depends on what pieces might need a direct interface.
Kyle P. Johnson
@kylepjohnson
Mar 18 2016 18:03
Why do you ask? Do we need to consider this for SSL certs?
heh
Rob Jenson
@ferthalangur
Mar 18 2016 18:04
Yes. There is a "sweet spot" when buying SSL certs individually versus a Wildcard
The other consideration is that if it does end up needing to be scaled behind a Load Balancer (LB), some architectures put SSL certs on the LB and on the individual servers
Kyle P. Johnson
@kylepjohnson
Mar 18 2016 18:05
Rob, have you seen Encryption everywhere?
Rob Jenson
@ferthalangur
Mar 18 2016 18:06
Yep.
Kyle P. Johnson
@kylepjohnson
Mar 18 2016 18:07
With it, do we get wildcard certs or individual ones?
Sorry, I meant Let's Encrypt
Rob Jenson
@ferthalangur
Mar 18 2016 18:07
individual ... and they are the lowest trust SSL certs.
That's what I thought.
Kyle P. Johnson
@kylepjohnson
Mar 18 2016 18:07
cool, thanks
If we go with Let's Encrypt, will we need a different cert for each server?
Rob Jenson
@ferthalangur
Mar 18 2016 18:09
Yes. You can bind multiple names in the same cert, but each server must have a cert that matches all the names that might be used to access it.
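A simplified sketch, in Python, of checking whether the names bound into a cert cover every hostname a server might be reached by. The wildcard rule here (a `*` only as the whole leftmost label, matching exactly one label) is a simplification of real certificate matching, and the cert name lists in the usage are made up for illustration.

```python
def name_matches(pattern: str, hostname: str) -> bool:
    """Match a hostname against one certificate name, allowing a leftmost '*' label."""
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    if len(p_labels) != len(h_labels):
        return False  # a wildcard stands for exactly one label
    if p_labels[0] == "*":
        return p_labels[1:] == h_labels[1:]
    return p_labels == h_labels


def uncovered_names(cert_names: list[str], hostnames: list[str]) -> list[str]:
    """Return the hostnames that no name in the certificate covers."""
    return [h for h in hostnames if not any(name_matches(n, h) for n in cert_names)]
```

For example, a cert listing only `cltk.org` and `docs.cltk.org` would leave `api.cltk.org` uncovered, and a `*.cltk.org` wildcard would cover `api.cltk.org` but not the bare `cltk.org`.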
I have to look a little harder at this. There is no reason why we can't use L.E. certs.
I'm a little wigged out about the "automatic renewal" process, but I have to look at it more closely. I don't think ISRG would publish something that is going to be a security nightmare from the getgo.
Kyle P. Johnson
@kylepjohnson
Mar 18 2016 18:13
thanks, your experience here is super valuable. Other things that could come up, but I haven't really considered, are possible load balancing servers, cdns, devops stuff like that
Rob Jenson
@ferthalangur
Mar 18 2016 18:17
Ah ... believe it or not ... I forgot something in that email. :)
@lukehollis has convinced me that I need a Pro Bono project in my portfolio, so I will be kibbitzing in that area. Keep in mind that all I know about NLP comes from sitting next to a bunch of linguists and LISP programmers working on a project that I wasn't cleared to be curious about. :)
Rob Jenson
@ferthalangur
Mar 18 2016 18:22
Kyle ... you had asked @manu-chroma about whether you could take over managing the Let's Encrypt certs he was setting up. It might be good, if the mail interface for Dreamhost lets you set up email aliases or lists, to create accounts for the projects using Role email addresses, and then point those addresses at one or more of the people doing the work at the time, and yourself. Then you can redelegate / reallocate as people move in and out of the projects.
Kyle P. Johnson
@kylepjohnson
Mar 18 2016 18:23
Hi Rob, anything and everything you have to offer is completely welcome. very appreciated
Rob Jenson
@ferthalangur
Mar 18 2016 18:23
e.g.: sysadmin@cltk.org or admin@cltk.org for subscribing to Cloud services and such ..
Kyle P. Johnson
@kylepjohnson
Mar 18 2016 18:23
good idea. I'll get on it this weekend
Rob Jenson
@ferthalangur
Mar 18 2016 18:23
webmaster@cltk.org for any messages generated by the applications
Kyle P. Johnson
@kylepjohnson
Mar 18 2016 18:24
anything else you think we should be doing, but aren't, let us know
Rob Jenson
@ferthalangur
Mar 18 2016 18:24
Will do.
Kyle P. Johnson
@kylepjohnson
Mar 18 2016 18:26
@ferthalangur just pinged you about something
Rob Jenson
@ferthalangur
Mar 18 2016 18:26
Ah ... I did think of two things, looking at the server that @manu-chroma set up ..
Manvendra Singh
@manu-chroma
Mar 18 2016 18:26
@ferthalangur Can you please give your thoughts on the ideas @lukehollis and I were discussing regarding the architecture of cltk webapp as whole just before you joined the conversation in this very thread. Thanks!
Rob Jenson
@ferthalangur
Mar 18 2016 18:28
You might want to create a second account, for yourself, on the server, in case you get locked out of cltk user / lose the password.
Give that user sudo with NOPASSWD privs, instead of SSH'ing in as root. This is kinda "best practices" these days (Security best practices has as much FOTM (Flavor of the Month) in it as development frameworks or anything else in IT these days).
Kyle P. Johnson
@kylepjohnson
Mar 18 2016 18:37
@ferthalangur I'll write to you with our server info
(this weekend)
Rob Jenson
@ferthalangur
Mar 18 2016 18:41
@lukehollis If the API is going to accept arbitrary-length strings as input parameters, there are at least two considerations: (1) Escaping and sanitizing metacharacters that might cause problems for the function they get passed to. Remember our convo a couple of nights ago about "tainted" input. (2) Are there length restrictions to URIs? Technically ... no. However, practically, every component that handles your request must also not fail on a really long URI.
Example ... disk caching: Many cache schemes use the key (i.e., the URI) intact. If it's a memcache, your lookup table will be doing comparisons of really long strings or the hash of really long strings. If it's a disk cache, depending on implementation, it might be writing filenames with the entire query string as the last component, and there are actual limits to how long a file name or a pathname can be before things "break"
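One common way around the long-key problem, sketched in Python: hash the URI and use the digest as the cache key or filename. This is an illustration, not something the project is known to do, and the `/stem` path in the usage is hypothetical.

```python
import hashlib


def cache_key(uri: str) -> str:
    """Map a URI of any length to a fixed-length, filename-safe key."""
    # SHA-256 hex digests are always 64 characters from [0-9a-f].
    return hashlib.sha256(uri.encode("utf-8")).hexdigest()
```

A call such as `cache_key("/stem?text=" + long_text)` always yields 64 characters, which stays well under the 255-byte filename limit found on many filesystems. If the original URI is needed later, it has to be stored alongside the cached value, since the hash cannot be reversed.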
@kylepjohnson ... your "todo" list for this weekend is quite impressive!
Kyle P. Johnson
@kylepjohnson
Mar 18 2016 18:48
:weary:
Luke Hollis
@lukehollis
Mar 18 2016 22:23
ahh! :)
Yes, let me think more about how the API might work with the input parameters
Luke Hollis
@lukehollis
Mar 18 2016 22:53
I think Kyle is correct that if we're not careful about the way we handle input parameters in GETs, we'll run into those "URI too long" errors
working on an API right now for a different project that is running into those errors :/
Our workaround is to move our inhouse Node.js application to use the Meteor framework's native websockets DDP layer
So then we didn't have to think about it as much
For the Flask API, I think it makes much more sense to remain as a RESTful API instead of websocket connection
We can configure certain parameters in NGINX, I believe, to allow very long URIs, but even then sometimes local networks (like at universities) enforce maximum limits for URIs
We could ask our users to segment their content if they run into that issue on their local network...
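Segmenting could also be done on the client side rather than by hand. A minimal Python sketch, assuming the text is split on whitespace and the limit applies to the URL-encoded form of each segment; the limit value is whatever the network in question enforces, which is not known here.

```python
from urllib.parse import quote


def segment_text(text: str, max_encoded_len: int) -> list[str]:
    """Split text on whitespace into chunks whose URL-encoded form fits the limit.

    A single word that is longer than the limit is still emitted as its own
    chunk, so callers should treat that case separately.
    """
    chunks: list[str] = []
    current: list[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and len(quote(candidate)) > max_encoded_len:
            # Adding this word would exceed the limit: close the current chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be sent as its own GET request and the results concatenated, at the cost of losing any context that spans a chunk boundary.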
Luke Hollis
@lukehollis
Mar 18 2016 22:59
As far as validating instead of sanitizing input values to ensure our code is secure, this is what the stemmer function looks like:
We could write up a list of expectations of an input value and then validate against that..?
If we're only talking about the use case that our frontend application will need, during the development period, we could restrict the API to a few IP addresses.
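A sketch of the "list of expectations" approach, in Python, for a hypothetical single-word input parameter. The length limit and the letters-only rule are assumptions chosen for illustration, not the actual constraints of the CLTK stemmer or API.

```python
MAX_WORD_LENGTH = 100  # assumed limit, not taken from the CLTK API


def validate_word(value: str) -> str:
    """Return the value unchanged if it meets every expectation, else raise."""
    if not isinstance(value, str):
        raise ValueError("input must be a string")
    if not 1 <= len(value) <= MAX_WORD_LENGTH:
        raise ValueError("input length out of range")
    # str.isalpha() accepts Unicode letters only, so digits, whitespace,
    # punctuation, and shell or path metacharacters are all rejected.
    if not value.isalpha():
        raise ValueError("input must contain letters only")
    return value
```

Rejecting anything outside the expected shape means nothing has to be escaped afterwards. One caveat: letters written with separate combining marks (decomposed Unicode) would be rejected by `isalpha()`, so input may need normalizing first.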