    Andrey Devyatkin
    @Andrey9kin
    i.e. you would have to update the app to run like a batch job so it releases resources when there is nothing to calculate
    Am I thinking in the right direction?
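    A minimal sketch of the batch-job idea on Kubernetes, assuming the NVIDIA device plugin is installed on the nodes; the Job name, image, and entrypoint are placeholders, not anything from the chat:

```yaml
# Hedged sketch: a Kubernetes Job holds its GPU only while it runs
# and releases it on completion, unlike a long-running Deployment
# that keeps the GPU while idle.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model            # placeholder name
spec:
  backoffLimit: 2              # retry a couple of times on failure
  template:
    spec:
      restartPolicy: Never     # pod terminates when the work is done
      containers:
        - name: trainer
          image: tensorflow/tensorflow:latest-gpu  # placeholder image
          command: ["python", "train.py"]          # hypothetical entrypoint
          resources:
            limits:
              nvidia.com/gpu: 1  # GPU is freed once the pod completes
```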
    Julien Bisconti
    @veggiemonk
    That's one of the reasons why MLOps is much more difficult than streamlined software development.
    You cannot (currently) request a fraction of a GPU or share one. So the game becomes "How cheap can you run that workload?".
    The only way to minimize cost is to scale nodes according to demand, meaning you need a node pool with autoscaling.
    If the workload can be interrupted, preemptible instances (spot instances on AWS) are much cheaper, but their availability might make the application suffer. Also take into account that the magical drivers need to be installed; usually a DaemonSet will do the trick.
    Also, all datacenters are not made equal. A DC might have a specific GPU type; that does not automatically mean another zone in the same region will have it. Compute density is also some kind of magical formula that needs to be figured out: having 4 nodes with 4 CPUs/16 GB RAM/1 GPU might not be as cost-effective as one node with 16 CPUs/64 GB RAM/4 GPUs, depending on the usage. To top it off, GPUs require special pet-like care; instances with GPUs can be terminated at any time for maintenance. So it is all the problems of "normal" distributed systems, but with special hardware, crazy resource consumption, huge amounts of data, and various cycles (training, integrating the model into the app in CI, inference/serving the model). Just making it work is insanely expensive, and cost optimization is quite important unless there is an infinite amount of money available.
    in short, it is a challenge
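    A minimal sketch of the "a DaemonSet will do the trick" pattern for driver installation. The node label and installer image are hypothetical; GKE ships a real nvidia-driver-installer DaemonSet along these lines:

```yaml
# Hedged sketch: run a driver-installer pod on every GPU node.
# The image and the "gpu: true" node label are illustrative assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-driver-installer   # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-driver-installer
  template:
    metadata:
      labels:
        name: nvidia-driver-installer
    spec:
      nodeSelector:
        gpu: "true"               # illustrative label marking GPU nodes
      hostPID: true               # installer needs access to the host
      containers:
        - name: installer
          image: example.com/nvidia-driver-installer:latest  # placeholder
          securityContext:
            privileged: true      # required to load kernel modules
```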
    Andrey Devyatkin
    @Andrey9kin
    Yes, spot instances make a lot of sense. You can pre-install the drivers, i.e. start from a custom AMI
    though in the current setup we are trying to take advantage of the GPUs we've got in a private DC
    so they are available 24/7
    we just need to make good use of them
    and in Nomad, as in K8s, you can only allocate a whole GPU
    what's interesting is that some of the cards, like the Titan RTX, are recognized as more than one
    I guess they have more GPUs on board (I know that must sound silly, but I'm so far away from hardware...)
    How do you measure one GPU? One card does not necessarily mean one GPU, right?
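    For reference, the whole-GPU constraint on the Kubernetes side: nvidia.com/gpu is an extended resource, and extended resources only accept whole integers, so a fractional request like 0.5 is rejected by the API. Nomad's device plugin counts whole devices in the same way. Pod name and image below are placeholders:

```yaml
# Hedged sketch: GPUs are requested as whole units.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod                    # placeholder name
spec:
  containers:
    - name: cuda-app
      image: nvidia/cuda:11.0-base # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1        # valid; "0.5" would be rejected
```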
    Andrey Devyatkin
    @Andrey9kin
    Do you have some good resources to check out?
    This is the new way of security, and we could look into it more in some pod episode
    Julien Bisconti
    @veggiemonk
    Me too, I haven't built a computer in decades. I left that behind because it is quite an expensive and time-consuming hobby.
    I have no good resources on GPU hardware utilization.
    I'm not even sure that on-prem is really a financially viable solution when you see the time and expertise it takes to care for those GPUs. I haven't seen any good numbers on that. I think if you have a massive team of data scientists hammering GPUs day in, day out, it might make sense; after all, at scale everything is different. BUT if you have high scale, you're going to pay for a team of highly specialized datacenter operators who know GPUs. People also forget that the data needs to come from somewhere, and the amount of data is now usually counted in PB (> TB > GB). I'd like to see the networking cost of that. It may actually be cheaper to outsource that work to a cloud provider, so I'm still skeptical about the "saving money" part.
    I heard that the new RTX 30 series is considered a major evolution in GPU architecture. Here is the NVIDIA keynote: http://youtube.com/watch?v=QKx-eMAVK70
    As a colleague put it: "If the stock price did not really improve after the announcement, it might not be such a big deal." There is some truth to that, though I'm not sure it applies in COVID times.
    If you want proof that running a data center is usually more expensive than the cloud, here is a Twitter thread written by the person who built Google's physical datacenters: https://threadreaderapp.com/thread/1102401615263223809.html
    I hope that the task of managing those on-prem cluster will never fall on me 🙏
    Julien Bisconti
    @veggiemonk
    On a different topic, have you seen this tool: https://googlecontainertools.github.io/kpt
    configuration-as-data (vs configuration-as-code) .... I did not understand the difference at first but it is a very interesting concept....
    I'd love to talk about it once I know more
    Andrey Devyatkin
    @Andrey9kin

    Q: How is kpt different from other solutions?

    A: Rather than expressing configuration as code, kpt represents configuration packages as data, in particular as YAML or JSON objects adhering to the kubernetes resource model

    The FAQ does not really give you much
    I would love the website to say what this is for and what problem it is intended to solve
    from the FAQ I get the feeling that it is some kind of GitOps tool thingy for K8s
    Julien Bisconti
    @veggiemonk
    I agree that the website is bad at explaining. The asciinema recordings should be converted to documentation: https://googlecontainertools.github.io/kpt/reference/pkg/
    I hate having to play a video to understand something.
    I see very little difference from the git submodule command
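    For the submodule comparison: a kpt package records its upstream git source as plain data in a Kptfile, roughly like the sketch below (v1alpha1-era layout as written by `kpt pkg get`; the package name and ref are illustrative). Unlike a submodule pointer, though, the package files themselves are copied into your repo, where you can edit them and later merge upstream updates with `kpt pkg update`:

```yaml
# Hedged sketch of a Kptfile: the upstream source is just data,
# which is the "configuration-as-data" idea in practice.
apiVersion: kpt.dev/v1alpha1
kind: Kptfile
metadata:
  name: helloworld               # illustrative package name
upstream:
  type: git
  git:
    repo: https://github.com/GoogleContainerTools/kpt
    directory: package-examples/helloworld-set
    ref: v0.3.0                  # illustrative version that was fetched
```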
    Andrey Devyatkin
    @Andrey9kin
    Okay
    That is different
    Andrey Devyatkin
    @Andrey9kin
    @veggiemonk @mattiashem episode 15 is ready to go out: devsecopstalks/website#18. Any final edits?
    Mattias Hemmingsson
    @mattiashem
    Nice, looks great! Push the button (merge)
    Andrey Devyatkin
    @Andrey9kin
    it is out now
    Julien Bisconti
    @veggiemonk
    Great
    Mattias Hemmingsson
    @mattiashem
    Can I share this one?
    Andrey Devyatkin
    @Andrey9kin
    sure
    though this link can't be seen without a podcaster login
    take the Spotify link if you want
    Mattias Hemmingsson
    @mattiashem
    I'll try, thanks :-)
    Julien Bisconti
    @veggiemonk
    I was advocating for Docker Swarm over Kubernetes a few years ago because devs and ops were overwhelmed and needed a smaller solution, especially since moving from Swarm to K8s later should not require a massive overhaul of everything... I don't think Swarm is worth it now unless it has a clear set of features that are difficult to implement in Kubernetes.
    Mattias Hemmingsson
    @mattiashem
    wow, Docker Swarm is still out there :-)
    Andrey Devyatkin
    @Andrey9kin
    Did you read the article? ;)
    it seems that they want to keep the Docker Swarm interface but move the backend to K8s
    Mattias Hemmingsson
    @mattiashem
    yee, and run Swarm inside K8s? To make it simple for devs...