    Grant Rostig
    @grantrostig

    μSystem Project http://plg.uwaterloo.ca/~usystem/

    The following is a brief summary of the μSystem project, being developed at the University of Waterloo. The project is investigating the development of a highly concurrent shared-memory programming system.

    Why concurrency?
    Why shared memory?
    Horizontal and Vertical Shared Memory
    Projects
        Concurrency in C
        μC++
        Real-time μC++
        Concurrent Monitoring, Visualization and Debugging (MVD)
        uDatabase
    Future
    Contact
    Teaching
    Source
    ok, slack lies about what is at that point. Comparison of the performance of their REST web server "Acorn" vs Apache and Node.js
    11% faster than httpd, 79% faster than node.
    Ed Browne
    @dabro
    #include<os> is a "unikernel" https://en.wikipedia.org/wiki/Unikernel
    Grant Rostig
    @grantrostig

    @dabro I looked at unikernels for at least 20 hours a few years ago. I basically decided that Linux servers are more efficient in CPU and memory, partly via shared libraries etc. I don't think the security advantage is that important to me. Also, they are slow to start, similar to Amazon AWS Lambdas. Very cool idea though!!

    What about these unikernel processes interests you? Where and what is "slack lies" in your comment above?

    Acorn being faster than httpd doesn't mean much to me. There are much faster Linux web servers, correct?
    Ed Browne
    @dabro
    Ah, sorry, meant Gitter lies. That spot in the YouTube video is a performance comparison graph.
    Concerning unikernel startup performance, you would need to show me the stats. Unikernel virtual machines start in microseconds to milliseconds; for all I know, AWS Lambdas are the same tech. Certainly many of the unikernels are designed to run in Xen environments. But you're right; the principal win is the combined low-latency, high-security model, with no competing technology within reach as far as I know. If you don't need that security, in terms of minimal trusted code base and near-complete hardware isolation, then other tech like OS containers may be very competitive.
    Ed Browne
    @dabro
    You'd also need to show me stats on Linux servers being more efficient; I don't see how at least at the micro level, since you've stripped out all process management, scheduling, generic resource management, etc.
    I see unikernels as a natural end-point in application isolation. First we had processes, then super-processes--ie "containers" with namespaces and resource quotas--and now "hyper"-processes where the OS kernel is essentially slimmed down to a "micro-kernel" hypervisor.
    Ed Browne
    @dabro
    This shouldn't be conceptually mixed up with intra-process concurrency; whether you are in a traditional OS process or one of these souped-up versions, either shared-memory and/or minimal-context-switch concurrency enabled by coroutines/green threads within a single protection domain (virtual memory, processor ring, limited number of OS and hardware threads) will be crucial for performance and code modularity.
    But I think OS/POSIX threads are an abomination as an application concurrency primitive. Each app should get around one OS thread per hardware thread, so it can use all the hardware resources, then manage its own concurrency scheduling internally without involving the kernel.
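    The "one OS thread per hardware thread, app-managed scheduling" model can be sketched as a plain task pool: N kernel threads pinned to the hardware, with all finer-grained scheduling done in user space. This is an illustrative sketch only, not μC++ or any particular runtime; `TaskPool` and its methods are invented names.

    ```cpp
    #include <condition_variable>
    #include <functional>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // Sketch: one OS thread per hardware thread; the application schedules
    // its own tasks in user space, so dispatching a task needs no kernel call.
    class TaskPool {
    public:
        TaskPool() : done_(false) {
            unsigned n = std::thread::hardware_concurrency();
            if (n == 0) n = 2;                       // fallback if unknown
            for (unsigned i = 0; i < n; ++i)
                workers_.emplace_back([this] { run(); });
        }
        ~TaskPool() {                                // drain queue, then join
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_all();
            for (auto &t : workers_) t.join();
        }
        void submit(std::function<void()> task) {
            { std::lock_guard<std::mutex> lk(m_); tasks_.push(std::move(task)); }
            cv_.notify_one();
        }
    private:
        void run() {
            for (;;) {
                std::function<void()> task;
                {
                    std::unique_lock<std::mutex> lk(m_);
                    cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
                    if (done_ && tasks_.empty()) return;
                    task = std::move(tasks_.front());
                    tasks_.pop();
                }
                task();                              // user-space dispatch
            }
        }
        std::vector<std::thread> workers_;
        std::queue<std::function<void()>> tasks_;
        std::mutex m_;
        std::condition_variable cv_;
        bool done_;
    };

    int main() {
        std::atomic<int> count{0};
        {
            TaskPool pool;
            for (int i = 0; i < 100; ++i)
                pool.submit([&count] { ++count; });
        }   // destructor drains and joins
        std::cout << count.load() << "\n";           // prints 100
    }
    ```

    The point of the sketch is only the shape: the kernel is consulted when a worker blocks on the condition variable, not per task.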
    Ed Browne
    @dabro
    As the guy above points out, this is what hardware VMs do today to give a guest multiple "cores", meaning additional kernel context switch scheduling within the guest (which assumes it is directly managing hardware) adds needless complexity and performance problems. Just get the guest OS out of the way, providing minimal support. ie unikernels.
    I'm not averse to standardized API's where an app could expose its concurrency to the kernel for extraordinary circumstances, extra safety, or to optimize inter-app scheduling or IPC; just against each thread switch requiring the kernel to be involved, which means each process has to have the kernel loaded into its memory domain.
    The unikernel model replicates the 1990's "micro-kernel" architecture that lost out to unikernels like Linux and Windows due to performance. The idea was to have most of the OS isolated in user-space processes, to minimize the cost of porting OS' between architectures and to increase security. In fact, Windows NT started this way, and major security vulnerabilities were introduced when code meant for service hosts was pushed into the kernel for performance reasons.
    Ed Browne
    @dabro
    However, there's a really big difference between then and now. Micro-kernels had to match unikernel performance on shared-memory machines, where a monolithic, highly coupled kernel could accomplish major performance gains by avoiding multiple data copies and serialization costs by mutating another worker thread's state directly. Today, however, we're in a cloud environment where serialization and message passing are almost completely unavoidable. I don't know of another way to pick up the performance unikernels made possible.
    Unikernel performance came at a major cost: complexity and correctness issues. This meant only a few organizations had the resources to even play: Microsoft, Linux consortium/community, Apple, and eventually Google with Android (living off Linux and Java). Everyone else fell by the wayside due to the sheer size and maintenance/testing burden; the Linux kernel is 20M LOC! That's insane; there's no way to genuinely secure that (and by genuine, I mean reasonably, not perfectly).
    Grant Rostig
    @grantrostig
    Unikernel performance came at a major cost: complexity and correctness issues. This meant only a few organizations had the resources to even play: Microsoft, Linux
    Linux is not unikernel: https://en.wikipedia.org/wiki/Unikernel
    Xen is one of the technologies I looked at.
    Grant Rostig
    @grantrostig
    The one that most interested me is/was http://osv.io/ because it is written in C++ (not Java, as the Xen website states).
    OSv has http://www.seastar-project.org/, which is high-performance.
    But you're right; the principal win is the combined low-latency, high-security model, with no competing technology within reach as far as I know. If you don't need that security, in terms of minimal trusted code base and near-complete hardware isolation, then other tech like OS containers may be very competitive.
    I had not considered the latency issue when comparing unikernels and Linux containers. I suppose a hardware interrupt would probably get handled faster in a unikernel. Interesting. I wonder how much faster? However, if the unikernel process is not in memory then it is much slower to start. It probably even has to allocate a VM just to start. Start latency is important too.
    Grant Rostig
    @grantrostig

    You'd also need to show me stats on Linux servers being more efficient; I don't see how at least at the micro level, since you've stripped out all process management, scheduling, generic resource management, etc.

    I did say the shared libraries were the point where Linux is more efficient in terms of memory re-use. Also the issue of having VMs allocated to a unikernel that does not fully use the CPU but yet has a hold of one, presumably waiting for an interrupt to tell it that it has work to do. Perhaps I'm misunderstanding something, but that is what makes timesharing really great: other processes can run on the CPU. Now, if you had a unikernel fully (or almost fully) utilized, then the context switching of Linux would start to play a larger factor. But what daemon is fully utilized? Anyway, I have no stats to support my ideas, and I may misunderstand the context switching and VM launching and switching, but I'm going with my limited knowledge.

    By the way, I dislike typing. I prefer to talk about these sorts of things. You did lose me in some of the other detail you typed. To get to the bottom of things I would prefer audio. My phone number is 512 Threenine 4 36twoooone. I don't have your number.

    Lastly, my performance considerations are comparing bare-metal performance of unikernels versus bare-metal performance of Linux. Certainly not AWS virtual machines. VMs still mystify me, especially how they would give you a multi-core machine. That is an area I would like to know more about too. I guess the video started to talk about that, but I don't want to have to listen to his whole talk for just one sentence or two on that subject. I did scan the talk and it was all stuff I had heard before, but with less humility and balance. LOL :)
    Ed Browne
    @dabro
    Ouch, I meant monolithic kernel performance in reference to how Linux, Windows, BSD, and OSX accomplished performance vs micro-kernels. My bad.
    Ed Browne
    @dabro
    Concerning VMs, your typical host (Xen, KVM, VMware...) allocates a host thread per guest core per machine. So in VirtualBox, if you create a 4-core guest, VB will create four host OS threads. Typically, these will be pinned to a bare-metal core to improve cache performance; however, when blocked on IO etc. they will be scheduled out for other host threads. So e.g. two VMs will timeshare the underlying cores efficiently, just as if they were two normal host processes; a blocked guest will not usually squat on the hardware, unless it's doing busy-waiting or some such.
    Ed Browne
    @dabro
    Beyond that, I'm as unclear as you :smile: or more so. I don't think guest threads are typically visible to the underlying host, even on a processor that has hardware virtualization with page and IO virtualization. When a guest has a "core" running on hardware, it is able to do its own thread scheduling management, deciding which guest kernel or app thread to load onto the "core" next. This involves managing thread queues it keeps in its own virtual memory space, distinct from the host's queues. Typically, an OS will have something to do, so if one of the guest threads blocks, the guest kernel scheduler will be context switched onto the host's thread which is providing the guest "core," which will then put the next worker thread onto the same underlying host thread. Eventually, the host machine will fire a timer interrupt which will cause the host to context switch the whole guest "core" thread (just an underlying host thread) off onto a host wait queue.
    Ed Browne
    @dabro
    A key issue there for performance is how to handle the guest's own interrupts, IO, virtual memory pages, and other resource tables usage. These are all typically privileged instructions, and if they're trapped and translated each time by the host it's terribly slow. So IBM, Intel, AMD, ARM etc do flatten the guests' tables into the host's underlying hardware tables. However, through various implementations they are kept tagged separately, and each guest can use privileged instructions to only modify its own portions of the tables. I think host hypervisors however can modify all. So for example, a host's virtual page table will hold the tables for all the guests, and the hardware will automatically map and allow a guest to only use pages assigned to it, similar to a bare machine allowing a process to write directly to virtual memory assigned to it, with the hardware making automatic translations using cached mappings.
    Ed Browne
    @dabro
    Similarly, if a guest tries to write to a real IO device to which it has been granted access, the hardware will automatically map the write to the correct host buffer. https://en.wikipedia.org/wiki/X86_virtualization#Intel-VT-d.
    Ed Browne
    @dabro
    As far as cache management across context switches of any kind, I simply don't know. I think some of that is architecture dependent; a ways back, different architectures chose differently whether to tag cache lines based on virtual or physical addresses. However, I think either way they typically invalidated cache lines with a page table swap involved in process switching. I don't know if optimizations have been introduced, so that if memory is shared across two processes and already in L1/2/3 cache, it doesn't get flushed when switching processes. Without that, I'm not sure that OS-level virtualization via containers would be any more performant than hardware virtualization. Yes, you would have increased memory efficiency from shared library management, but that's only raw RAM; who's worried about that today for code? On a 16 GB machine, library code space is a pittance, and grows even less significant as I scale up to be able to hold max data in memory for fast queries.
    Ed Browne
    @dabro
    Even if I put 4-8 guests on a typical host machine (32GB+), each with their own 1-2 GB+ OS/app footprint, it's going to be data that is the problem. Obviously, OS-based containers radically slimmed that down to a few MB per app minimum, but unikernels drastically reduce that footprint even further. This is like going from OS threads at 1-10 MB, through green threads at 1-8 KB (Go, Erlang, Scala Akka, Haskell), to 10-1000 B coroutines.
    We can talk if you'd like. I don't like phones, but can VTC on skype or some such. I typically have a hard time thinking and speaking simultaneously.
    BTW, concerning coroutines, I found this on the Boost.Context site:

    A execution_context provides the means to suspend the current execution path and to transfer execution control, thereby permitting another execution_context to run on the current thread. This state full transfer mechanism enables a execution_context to suspend execution from within nested functions and, later, to resume from where it was suspended. While the execution path represented by a execution_context only runs on a single thread, it can be migrated to another thread at any given time.

    A context switch between threads requires system calls (involving the OS kernel), which can cost more than thousand CPU cycles on x86 CPUs. By contrast, transferring control among them requires only fewer than hundred CPU cycles because it does not involve system calls as it is done within a single thread.
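    The in-thread control transfer Boost.Context describes can be made concrete with the POSIX ucontext API, which does the same kind of stackful switch. A sketch, assuming Linux/glibc; note that glibc's swapcontext also saves the signal mask, which does involve a system call, so Boost.Context's register-only switch is cheaper in practice.

    ```cpp
    #include <ucontext.h>
    #include <iostream>

    // Two execution paths sharing one OS thread: main and a "coroutine"
    // with its own stack, switched by swapping register contexts.
    static ucontext_t main_ctx, co_ctx;
    static char co_stack[64 * 1024];            // the coroutine's private stack

    static void coroutine_body() {
        std::cout << "step 1 in coroutine\n";
        swapcontext(&co_ctx, &main_ctx);        // suspend, resume main
        std::cout << "step 2 in coroutine\n";
        // falling off the end transfers to uc_link (main_ctx)
    }

    int main() {
        getcontext(&co_ctx);
        co_ctx.uc_stack.ss_sp = co_stack;
        co_ctx.uc_stack.ss_size = sizeof co_stack;
        co_ctx.uc_link = &main_ctx;             // where to go when body returns
        makecontext(&co_ctx, coroutine_body, 0);

        std::cout << "main: first resume\n";
        swapcontext(&main_ctx, &co_ctx);        // run until coroutine suspends
        std::cout << "main: second resume\n";
        swapcontext(&main_ctx, &co_ctx);        // run coroutine to completion
        std::cout << "main: done\n";
    }
    ```

    No kernel scheduling decision is involved in choosing which path runs next; the application decides by calling swapcontext.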

    Ed Browne
    @dabro
    Boost's coroutines I think are "stackful", meaning each coroutine gets its own stack, which allows yielding from nested functions. C++20 seems to be going "stackless", which is lighter-weight and can even elide any heap allocation depending on lifetime analysis, becoming as cheap as a standard function call, at the cost of only being able to yield from inside that one coroutine body.
    Finally, I found that you should be able to play with C++20 coroutines in clang 3.9 via the -fcoroutines-ts CLI option. However, I wasn't able to find that option in the versions I have on my machine.
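    For reference, the stackless version looks like this with a standard C++20 compiler: a minimal generator whose "frame" is a compiler-generated object rather than a separate stack, so co_yield is legal only in this one coroutine body. A sketch assuming -std=c++20; the `Generator` type here is hand-rolled for illustration, not a library type.

    ```cpp
    #include <coroutine>
    #include <exception>
    #include <iostream>

    // Minimal C++20 stackless generator yielding ints.
    struct Generator {
        struct promise_type {
            int current = 0;
            Generator get_return_object() {
                return Generator{std::coroutine_handle<promise_type>::from_promise(*this)};
            }
            std::suspend_always initial_suspend() noexcept { return {}; }
            std::suspend_always final_suspend() noexcept { return {}; }
            std::suspend_always yield_value(int v) noexcept { current = v; return {}; }
            void return_void() noexcept {}
            void unhandled_exception() { std::terminate(); }
        };
        std::coroutine_handle<promise_type> h;
        explicit Generator(std::coroutine_handle<promise_type> h_) : h(h_) {}
        Generator(Generator&& other) noexcept : h(other.h) { other.h = {}; }
        Generator(const Generator&) = delete;
        ~Generator() { if (h) h.destroy(); }
        bool next() { h.resume(); return !h.done(); }   // resume; false when finished
        int value() const { return h.promise().current; }
    };

    Generator counter(int limit) {
        for (int i = 0; i < limit; ++i)
            co_yield i;                // suspends here; no kernel involvement
    }

    int main() {
        Generator g = counter(3);
        while (g.next())
            std::cout << g.value() << "\n";   // prints 0, 1, 2
    }
    ```

    The whole suspend/resume is an ordinary function call plus a state save into the frame object, which is what makes the "as cheap as a function call" claim plausible.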
    Ed Browne
    @dabro
    Another good article on Boost.Fibers (stack-based asynchronous “coroutines”):
    Ed Browne
    @dabro
    @grantrostig See minute 31:30 above for good discussion of using futures and promises to conduct async callbacks
    Ed Browne
    @dabro
    @grantrostig another good description of futures and promises, given by the designer of the C++20 coroutine proposal. https://youtu.be/8C8NnE1Dg4A?t=40m22s. Context is how to redesign futures for a C++ with coroutines.
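    The future/promise split those talks describe is already in standard C++11: the producer side holds the promise, the consumer blocks on (or, with coroutines, would await) the future. A minimal sketch; note plain std::future has no then()-style continuation, which is exactly what the redesign discussion is about.

    ```cpp
    #include <future>
    #include <iostream>
    #include <thread>

    int main() {
        std::promise<int> p;
        std::future<int> f = p.get_future();   // consumer end of the channel

        std::thread producer([&p] {
            p.set_value(42);                   // fulfil the promise
        });

        std::cout << f.get() << "\n";          // blocks until the value arrives; prints 42
        producer.join();
    }
    ```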
    Grant Rostig
    @grantrostig
    @dabro , interesting, thanks for posting.