Community of DevSecOps Talks podcast https://devsecops.fm
I actually bumped into that with Nomad, but it looks like it is also a problem in the K8s world
@veggiemonk is it something you worked around? Or how do people in MLOps go about that?
it seems like the usual approach of long-running pods won’t work here since there is no resource sharing
i.e. you would have to update the app to run like a batch job so it releases resources when there is nothing to calculate
Am I thinking in the right direction?
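To make the batch-job idea concrete, here is a minimal sketch in Python. It assumes the work arrives on an in-process queue and uses squaring as a stand-in for the real GPU computation; `run_batch` and `idle_timeout` are hypothetical names, not part of Nomad or Kubernetes. The point is only that the process returns (and can exit) once there is nothing left to calculate, instead of idling while holding the GPU.

```python
import queue
from typing import List


def run_batch(tasks: "queue.Queue", idle_timeout: float = 0.1) -> List[int]:
    """Process tasks until the queue stays empty, then return so the process can exit."""
    results = []
    while True:
        try:
            item = tasks.get(timeout=idle_timeout)
        except queue.Empty:
            # Nothing left to calculate: return instead of idling,
            # so the scheduler can hand the GPU to another job.
            return results
        results.append(item * item)  # stand-in for the real GPU computation


if __name__ == "__main__":
    q = queue.Queue()
    for n in (1, 2, 3):
        q.put(n)
    print(run_batch(q))  # prints [1, 4, 9]
```

In a real setup the queue would probably be an external one, and the scheduler would start the job again when new work shows up.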
That is one of the reasons why MLOps is much more difficult than streamlined software development. You cannot (currently) request a fraction of a GPU or share one. So the game becomes "How cheaply can you run that workload?". The only way to minimize cost is to scale nodes according to demand, meaning a node pool with autoscaling. If the workload can be interrupted, preemptible instances (spot instances on AWS) are much cheaper, but their availability might make the application suffer. Also take into account that the magical drivers need to be installed; usually a DaemonSet will do the trick. Also, not all datacenters are made equal. A DC might have a specific GPU type, but that does not automatically mean that another zone in the same region will have that specific GPU. The compute density is also some kind of magical formula that needs to be figured out. It means that having 4 nodes with 4 CPUs/16GB RAM/1 GPU might not be as cost effective as one node with 16 CPUs/64GB RAM/4 GPUs, depending on the usage. To top it off, GPUs require special pet-like care. Instances with GPUs can be terminated at any time for maintenance. So it is all the problems of "normal" distributed systems but with special hardware, crazy resource consumption, huge amounts of data, and various cycles (training, integrating the model into the app in CI, inference/serving the model). So just making it work is insanely expensive, and cost optimization is quite important unless there is an infinite amount of money available.
in short, it is a challenge
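The compute-density point (4 small nodes vs. one big node) can be sketched with a bit of arithmetic. This is a rough illustration only: the hourly prices below are made up, not taken from any provider, and the model assumes whole nodes are scaled up and down every hour to cover the GPU demand.

```python
import math
from typing import List


def hourly_cost(gpus_needed: int, gpus_per_node: int, node_price: float) -> float:
    """Cost for one hour when whole nodes are scaled to cover the GPU demand."""
    nodes = math.ceil(gpus_needed / gpus_per_node)
    return nodes * node_price


def total_cost(demand: List[int], gpus_per_node: int, node_price: float) -> float:
    """Sum the hourly cost over a demand profile (GPUs needed per hour)."""
    return sum(hourly_cost(d, gpus_per_node, node_price) for d in demand)


# Hypothetical prices per node-hour, for illustration only.
SMALL = {"gpus_per_node": 1, "node_price": 1.0}  # 4 CPUs / 16GB RAM / 1 GPU
LARGE = {"gpus_per_node": 4, "node_price": 3.5}  # 16 CPUs / 64GB RAM / 4 GPUs

if __name__ == "__main__":
    bursty = [1, 0, 1, 0]  # one GPU needed every other hour
    steady = [4, 4, 4, 4]  # four GPUs busy all the time
    for name, demand in (("bursty", bursty), ("steady", steady)):
        print(name, total_cost(demand, **SMALL), total_cost(demand, **LARGE))
    # prints:
    # bursty 2.0 7.0
    # steady 16.0 14.0
```

With these made-up numbers the small nodes win for the bursty profile and the big node wins for the steady one, which is the "depending on the usage" part.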
Yes, spot instances make a lot of sense. You can pre-install drivers, i.e. start from a custom AMI
though in the current setup we are trying to take advantage of the GPUs we have in a private DC
so they are available 24/7
we just need to make good use of them
and in Nomad as in K8S you can only allocate a whole GPU
what is interesting is that some of the cards, like the Titan RTX, are recognized as more than one
I guess they have more GPUs on board (I know that must sound silly but I’m so far away from hardware...)
How do you measure one GPU? One card does not necessarily mean one GPU, right?
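One way to check what the driver itself counts as "one GPU" is to list the devices it enumerates. The sketch below parses the output of `nvidia-smi -L`; the sample text and its UUIDs are placeholders, not output from a real machine, and `parse_gpu_list` / `list_gpus` are hypothetical helper names. As far as I know, schedulers count the devices the driver reports rather than physical cards, but treat that as an assumption.

```python
import re
import subprocess
from typing import List

# Matches lines such as: "GPU 0: TITAN RTX (UUID: GPU-...)"
_GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (.+?)\)\s*$")


def parse_gpu_list(output: str) -> List[dict]:
    """Turn `nvidia-smi -L` style text into a list of device records."""
    devices = []
    for line in output.splitlines():
        m = _GPU_LINE.match(line.strip())
        if m:
            devices.append(
                {"index": int(m.group(1)), "name": m.group(2), "uuid": m.group(3)}
            )
    return devices


def list_gpus() -> List[dict]:
    """Ask the driver for its device list (requires nvidia-smi on the PATH)."""
    out = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    ).stdout
    return parse_gpu_list(out)


if __name__ == "__main__":
    # Placeholder sample, so the parser can be tried without a GPU.
    sample = (
        "GPU 0: TITAN RTX (UUID: GPU-00000000-0000-0000-0000-000000000000)\n"
        "GPU 1: TITAN RTX (UUID: GPU-11111111-1111-1111-1111-111111111111)\n"
    )
    print(len(parse_gpu_list(sample)))  # prints 2
```

Comparing that count with the number of physical cards in the box should show whether a card really is reported as more than one device.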
This is the new way of security and we could look more in then some pod
Me too, I haven't built a computer in decades. I left that behind because it is quite an expensive and time-consuming hobby. I have no good resources on GPU hardware utilization.
I'm not even sure that on-prem is really a financially viable solution when you see the time and expertise it takes to care for those GPUs. I haven't seen any good numbers on that. I think if you have a massive team of data scientists hammering GPUs day in, day out, it might make sense. After all, at scale everything is different. BUT if you have high scale, you're going to pay for a team of highly specialized datacenter operators who know GPUs. They also forget that the data needs to come from somewhere, and the amount of data is now usually counted in PB (> TB > GB). I'd like to see the networking cost of that. It may actually be cheaper to outsource that work to a cloud provider, so I'm still skeptical about the "saving money" part.
I heard that the new RTX 30 series is considered a major evolution in GPU architecture. Here is the NVIDIA keynote: http://youtube.com/watch?v=QKx-eMAVK70 As a colleague put it: "If the stock price did not really improve after the announcement, it might not be such a big deal." There is some truth in it, but I'm not sure it applies in COVID times.
I hope that the task of managing those on-prem clusters will never fall on me 🙏
On a different topic, have you seen this tool: https://googlecontainertools.github.io/kpt configuration-as-data (vs configuration-as-code) .... I did not understand the difference at first but it is a very interesting concept.... I'd love to talk about it once I know more
Q: How is kpt different from other solutions?
A: Rather than expressing configuration as code, kpt represents configuration packages as data, in particular as YAML or JSON objects adhering to the kubernetes resource model
FAQ does not really give you much
I would love the website to tell what it is for and what problem it is intended to solve
from the FAQ I get a feeling that it is some kind of GitOps tool thingy for K8S
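My rough understanding of "configuration as data", sketched in Python: the resources stay plain data (what you would get from loading the YAML/JSON), and tools are just functions that read and rewrite that data. This is NOT the kpt API; `set_label` and the two sample resources are made up for illustration.

```python
import copy
import json
from typing import List


def set_label(resources: List[dict], key: str, value: str) -> List[dict]:
    """Return copies of the resources with metadata.labels[key] set to value."""
    result = []
    for res in resources:
        res = copy.deepcopy(res)  # leave the input package untouched
        res.setdefault("metadata", {}).setdefault("labels", {})[key] = value
        result.append(res)
    return result


if __name__ == "__main__":
    # Hypothetical package: plain objects following the Kubernetes resource model.
    package = [
        {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "inference"}},
        {"apiVersion": "v1", "kind": "Service", "metadata": {"name": "inference"}},
    ]
    print(json.dumps(set_label(package, "team", "mlops"), indent=2))
```

The contrast with configuration-as-code would be that nothing here generates the manifests from templates or a programming language; the manifests themselves are the input and the output.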
I was advocating for Docker Swarm over Kubernetes a few years ago because Devs and Ops were overwhelmed and needed a smaller solution. Especially since moving from Swarm to K8s later should not require a massive overhaul of everything.... I don't think it is worth it now unless Docker Swarm has a clear set of features that are difficult to implement in Kubernetes.