    Trevor L. McDonell
    @tmcdonell
    Vulkan/Metal/CUDA all use LLVM internally, just with different source languages; we need to figure out how to skip the middle-man and directly generate the LLVM IR their frontends would produce. more or less, anyway... insert hand-waving
    Robbert van der Helm
    @robbert-vdh
    I was thinking more from a usability point of view! With Vulkan you could use Accelerate in your fancy todo list or recipe sharing app without the user having to install CUDA or an LLVM toolchain themselves, since everything needed to run Vulkan applications is included with the GPU drivers*.
    statusfailed
    @statusfailed_gitlab
    @tmcdonell the code right now is pretty huge, so I'll try to make a test case
    Trevor L. McDonell
    @tmcdonell
    @robbert-vdh ah. yeah needing to install LLVM separately is maybe annoying. on windows I don’t know if there is a package distribution which is easy to get (Chocolatey?). I have considered switching to just calling clang on the command line to compile, rather than linking directly to the LLVM library, since you can get that binary directly from llvm.org. But I’m not sure if it would improve things substantially enough (at all?) to bother with the change.
    @statusfailed_gitlab great, thanks!
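    (For illustration, a minimal sketch of that "shell out to clang" idea; the helper name is hypothetical, and it assumes clang is on the PATH. clang accepts textual LLVM IR (.ll files) directly:)
    import System.Process (callProcess)

    -- Hypothetical sketch: compile a textual LLVM IR module to a shared
    -- object by invoking clang as a subprocess, instead of linking libLLVM.
    compileIR :: FilePath -> FilePath -> IO ()
    compileIR irFile soFile =
      callProcess "clang" ["-O2", "-shared", "-fPIC", "-o", soFile, irFile]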
    Slaus Blinnikov
    @SlavMFM
    @tmcdonell was asking about OpenCL support to just quickly check whether my algorithm would run ~200x faster on my 5700xt GPU than its C++ version currently does on CPU, because I can't compile it with GPU targeting with either SYCL (ComputeCpp) or OpenMP/OpenACC. Rumors around the web say AMD dropped SPIR-V support entirely (couldn't find an official statement though); it seems true, given the 5700xt is a very popular GPU that has been up'n'running for 1.5 years already and still doesn't have it. It seems like the whole RDNA (2?) architecture isn't supported across all of these techs, even under AMD's own tech stack: RadeonOpenCompute/ROCm#887 . So LLVM targeting modern (2-year-old) AMD GPUs looks pretty foggy ^^
    @tmcdonell the 5700xt under OpenCL is da beast though! >1000 times faster than CPU and >1TB/s random-access memory throughput, but writing OpenCL is very tedious, so I was hoping for the accelerate-hs escape route ^^. I'm wondering: is it possible to directly target OpenCL for code generation (instead of LLVM/SPIR)? Shouldn't that be even simpler, since OpenCL is a "slightly" higher-level representation than LLVM? Where can I learn more about code generation/conversion? Thanks in advance.
    Trevor L. McDonell
    @tmcdonell
    @SlavMFM that has been a consistent problem with AMD in fact; they never commit to a technology stack and make the support libraries necessary to actually use it. They’ve conceded the compute space entirely to NVIDIA because of this. Which is unfortunate.
    Trevor L. McDonell
    @tmcdonell
    The problem with OpenCL is (a) it’s basically dead, to be subsumed by Vulkan, I believe (which is more a graphics API), and (b) we actually used to generate CUDA C code, but going that route is much more of a pain than you would expect; the semantics of LLVM actually match much better to a functional language, despite being relatively low level. OpenCL being a lowest-common-denominator language also makes it difficult to get low-level control (e.g. accessing specific instructions is usually done with target-specific extensions). Anyway... I really would like to support AMD devices (I bought one recently to start work on this) but AMD themselves are pretty terrible at communicating how to actually do this.
    Troels Henriksen
    @athas
    AMD now seems to follow an interesting strategy of only supporting their compute-line of GPUs in ROCm. I can't tell if it's intentional or lack of resources. It's really crazy not to have any entry-level way of doing AMD compute, and I can't believe even AMD would be that stupid.
    I think LLVM can target all kinds of AMD architectures though, since that is what the fully operational Vulkan and OpenGL frontends use. It's just the compute parts that are nowhere in sight.
    Trevor L. McDonell
    @tmcdonell
    generating [Open]C[L] from the Accelerate AST is not so difficult though; it’s just tricky to do it robustly.
    Troels Henriksen
    @athas
    What is the challenging aspect of it?
    Trevor L. McDonell
    @tmcdonell
    @athas yeah, I’m not too hopeful about ROCm as a target actually, at least until it’s more fully-baked
    Trevor L. McDonell
    @tmcdonell
    making sure there are no implicit conversions, for example (OpenCL int /= Haskell Int, as I’m sure you are familiar with recently), and dealing with aggregate types, which I remember was a source of bugs. that was all a long time ago, maybe I’m not such a bad programmer anymore. probably other things too but I don’t remember. It’s all things which can be solved with extensive testing, but (at the time) I couldn’t reflect the C types into the Haskell type system to catch all those bugs for me, so they kept slipping in...
    (actually I used to silently try to use 32-bit ints on the GPU whenever I could, loop counters and such, because they are so much faster, but eventually abandoned that because it just isn’t robust)
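    (To illustrate the size mismatch: Haskell's Int is the machine word size, 64 bits on most current platforms, while OpenCL's int is always 32 bits, so an implicit conversion silently truncates. A quick check in plain Haskell:)
    import Data.Int (Int32)
    import Foreign.Storable (sizeOf)

    main :: IO ()
    main = do
      print (sizeOf (undefined :: Int))    -- 8 bytes on a 64-bit platform
      print (sizeOf (undefined :: Int32))  -- 4 bytes, matching OpenCL's int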
    Troels Henriksen
    @athas
    Yes, the implicit conversions are super annoying, but I think I eventually got rid of those by using a phantom-typed expression language that I then transform to C.
    Some could still sneak in, I guess. My main problem with generating C code has been supporting complex control flow, e.g. jumping out of multiple loops. That requires goto in C, and works fine with NVIDIA, but often triggers compiler bugs on AMD.
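    (A minimal sketch of that phantom-typing trick, hypothetical rather than Futhark's actual generator: tagging each expression with the C type it denotes means a conversion has to be an explicit constructor, so it can't sneak in.)
    {-# LANGUAGE GADTs #-}
    import Data.Int (Int32)

    -- The type parameter tracks the C type of the expression, so an
    -- Int32 expression can never appear where a Float is expected.
    data CExp t where
      IntLit   :: Int32 -> CExp Int32
      FloatLit :: Float -> CExp Float
      Add      :: CExp t -> CExp t -> CExp t
      ToFloat  :: CExp Int32 -> CExp Float  -- conversions are explicit nodes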
    Trevor L. McDonell
    @tmcdonell
    @athas ah yes you are right, that was a problem too; that impedance mismatch going from HS expressions to C statements
    statusfailed
    @statusfailed_gitlab
    Is there a way to evaluate an Exp a? Usually when I type in an expression of that type at the REPL, it shows me a value, but sometimes I get a big pretty-printed expression
    Trevor L. McDonell
    @tmcdonell
    the only way to evaluate things is with run (and its variants), which all evaluate array expressions. but you can create a scalar (one-element) array with unit
    statusfailed
    @statusfailed_gitlab
    Ah cool, ok!
    Trevor L. McDonell
    @tmcdonell
    what you are seeing is the show instance for Exp (functions and expressions), and I guess the simplifier is able to reduce it down to a single value in some cases. There is a show instance for Acc (functions and expressions) which does the same thing too.
    Robbert van der Helm
    @robbert-vdh
    @statusfailed_gitlab I use this in my tests:
    -- run here can come from any backend, e.g.
    -- Data.Array.Accelerate.LLVM.Native (or the interpreter)
    evalExp :: Elt a => Exp a -> a
    evalExp e = head . A.toList $ run (unit e)
    statusfailed
    @statusfailed_gitlab
    Ah nice :-)
    I will steal that; I also want to write unit tests for my expressions :D
    @tmcdonell actually the simplifier seems really clever; I have only run into a couple of cases where it's not able to reduce to a single value
    this particular one has a 'coerce' at the top, maybe that's why?
    (in fact, for a long time I thought the Show instance for Exp a was actually evaluating the expression, not just pretty-printing it)
    Troels Henriksen
    @athas
    What is the easiest way to compile accelerate-examples without any CUDA stuff? Setting the llvm-ptx flag to false does not seem to do the trick.
    Troels Henriksen
    @athas
    I figured it out: an llvm-ptx: false flag on both accelerate-examples and accelerate-fft.
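    (Assuming stack as the build tool, that looks something like this in stack.yaml:)
    flags:
      accelerate-examples:
        llvm-ptx: false
      accelerate-fft:
        llvm-ptx: false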
    Trevor L. McDonell
    @tmcdonell
    yes, I was just about to say that. sorry I didn't get your message in time!
    Troels Henriksen
    @athas
    Do you know if Accelerate does something particularly fancy to the nbody example when compiled with the llvm-cpu backend? It is much faster than I would expect (runtime does not seem to scale quadratically with n).
    Trevor L. McDonell
    @tmcdonell
    no, there's no special code path for the cpu backend
    Troels Henriksen
    @athas
    Does Accelerate do the equivalent of a C compiler's -ffast-math?
    Trevor L. McDonell
    @tmcdonell
    I haven't looked at the generated code in a while (possibly never for the cpu backend, that was implemented when we were still generating CUDA!)
    yes, it does do that
    Troels Henriksen
    @athas
    Oh, cool, that makes sense.
    I'm asking because I have a student who is finishing up a thesis on a multicore backend for Futhark, and I am helping him benchmark Accelerate to compare against a more mature backend. Performance is mostly identical for compute-bound programs, but sometimes Accelerate is way faster on pretty straightforward code (like nbody); the difference goes away if I recompile the Futhark-generated code with -ffast-math.
    Have you had any trouble in practice with using -ffast-math? I have been too paranoid to use it.
    Trevor L. McDonell
    @tmcdonell
    it came up once before, which is why these compensated sum functions exist (which effectively disable -ffast-math)
    *which also
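    (For reference, compensated summation in the Kahan style looks roughly like this in plain Haskell; a generic sketch, not Accelerate's actual implementation. The correction term only survives if the compiler is not allowed to reassociate the float arithmetic, which is why such functions have to opt out of -ffast-math:)
    import Data.List (foldl')

    -- Kahan summation: carry a correction term c that recovers the
    -- low-order bits lost in each addition to the running sum s.
    kahanSum :: [Double] -> Double
    kahanSum = fst . foldl' step (0, 0)
      where
        step (s, c) x =
          let y  = x - c         -- apply the correction from the previous step
              s' = s + y         -- low-order bits of y are lost here
              c' = (s' - s) - y  -- recover them for the next step
          in  (s', c')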
    Troels Henriksen
    @athas
    Ah, cool! I also considered whether one could exploit the fact that if a user asks for parallel summation of floats, then they tacitly claim that float addition is associative, and thus clearly they don't mind losing a bit of accuracy.
    But it also looks like -ffast-math does stuff like use CPU instructions for e.g. square roots, rather than calling the math library. I'm less sure how to handle that.
    Trevor L. McDonell
    @tmcdonell
    yeah, I'm a bit in two minds about it as well, but as you said there is a sort of tacit agreement here. at least in LLVM -ffast-math is an alias for a few different options, so you could choose to enable only the ones you are comfortable with
    (and on a per-instruction basis)
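    (In LLVM IR the fast-math flags really are per instruction: fast is the umbrella flag, and individual flags such as nnan, ninf, nsz, arcp, contract, afn, and reassoc can be applied selectively, e.g.:)
    %x = fadd fast double %a, %b          ; all fast-math flags enabled
    %y = fadd reassoc nsz double %a, %b   ; only reassociation + no signed zeros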
    Trevor L. McDonell
    @tmcdonell
    @SlavMFM not sure if you are still in the channel, but out of curiosity what OS(s) are you running? it helps planning where to spend development effort etc., if we want to start on an AMD target
    Slaus Blinnikov
    @SlavMFM
    @tmcdonell oh, a bit embarrassing ^^, I hope not to draw attention away from other important directions! I have Ubuntu Linux, but the distro doesn't matter, I guess; I had to update the kernel to v5.4, as that was the only way to get OpenCL working: https://askubuntu.com/questions/1209725/how-to-get-opencl-support-for-navi10-gpus-from-amd/1211465#1211465
    Trevor L. McDonell
    @tmcdonell
    good to know, thank you! (:
    Callan McGill
    @Boarders
    If I wanted to work with a vector of an arbitrary but statically known size in Accelerate, how would I do that? For example, the k-means example sticks to tuples; how would one work with vectors of arbitrary known size (even if just up to the tuple size Accelerate supports)?