    Trevor L. McDonell
    @tmcdonell
    by the way, `the` is just shorthand for indexing a zero-dimensional array with (!)
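    for reference, a tiny sketch of that equivalence (assuming the standard Accelerate API; scalarExample is just a made-up name):
    import Data.Array.Accelerate as A

    -- 'the' extracts the single element of a zero-dimensional (Scalar) array;
    -- indexing with the empty shape does the same thing
    scalarExample :: Acc (Scalar Int) -> Exp Int
    scalarExample s = the s  -- equivalent to: s ! index0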
    statusfailed
    @statusfailed_gitlab
    somehow I'd totally missed you could even index arrays
    lmao
    oh right, you can't do different indexes at different elements though, that's what I thought ^^
    Trevor L. McDonell
    @tmcdonell
    different indexes at different elements? what do you mean?
    Robbert van der Helm
    @robbert-vdh
    @statusfailed_gitlab Have you tried something like this for a simple lookup table?
    import           Data.Array.Accelerate          as A
    import qualified Data.Array.Accelerate.LLVM.PTX as PTX
    import qualified Prelude                        as P
    
    xs :: Vector Int
    xs = fromList (Z :. 8) $ P.cycle [0, 1, 2, 3]
    
    -- | A lookup table implementing increment modulo 4.
    incMod4Table :: Vector Int
    incMod4Table = fromList (Z :. 4) [1, 2, 3, 4]
    
    -- | Use the indices from the second argument to look up elements from the
    -- lookup table.
    lookup
        :: (Elt a, Shape sh)
        => Acc (Vector a)
        -> Acc (Array sh Int)
        -> Acc (Array sh a)
    lookup table = map (table !!)
    
    main :: P.IO ()
    main = P.print P.$! PTX.runN lookup incMod4Table xs
    statusfailed
    @statusfailed_gitlab
    re: different indexes at different elements, I actually thought it was not possible to do exactly what Robbert is doing above ^
    :D
    Robbert van der Helm
    @robbert-vdh
    You can also do some interesting things when you combine that with permute. I've used it before to combine the elements of a vector with an existing matrix, looking up which coordinates those elements should map to in another array.
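    Something like this sketch, say (hypothetical names, and assuming Accelerate 1.3's permute, where the index permutation returns an Exp (Maybe sh')):
    import Data.Array.Accelerate as A

    -- Combine a vector with an existing matrix: the target coordinate of
    -- each element is looked up in a separate array of coordinates, and
    -- collisions are resolved with (+).
    scatterInto
        :: Acc (Matrix Int)   -- existing matrix to combine into
        -> Acc (Vector DIM2)  -- target coordinate for each element
        -> Acc (Vector Int)   -- elements to scatter
        -> Acc (Matrix Int)
    scatterInto mat coords xs =
        permute (+) mat (\ix -> Just_ (coords ! ix)) xs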
    Slaus Blinnikov
    @SlavMFM
    Hello. Reading the accelerate-llvm docs, it seems to target either the CPU or CUDA, leaving AMD GPUs aside; am I right? Is there a way to target AMD GPUs as an Accelerate backend? Thanks in advance.
    Trevor L. McDonell
    @tmcdonell
    Hello @SlavMFM! yes that is (unfortunately) correct, at least for the moment. There is an open issue AccelerateHS/accelerate-llvm#7 where we have begun to discuss this, but we have very limited resources and haven't been able to develop this yet. If you have time or resources to help contribute to this that would be greatly appreciated!
    statusfailed
    @statusfailed_gitlab
    @tmcdonell the developer of the haskell vulkan library is pretty active, it might be worth reaching out?
    oh I guess that's not really the same approach as in the thread you posted though
    Unrelated: I have a (deterministic) accelerate program which is giving different results with the PTX and Native backends- is this a library bug, or are there some unexpected ways this could happen?
    Trevor L. McDonell
    @tmcdonell
    yep, probably we will need something like the Vulkan bindings to control the device; I imagine it will serve a similar function as the CUDA bindings package of mine. the other main issue is how to generate the code in the first place; AMD are comparatively terrible with documentation...
    it could be a bug... can you make an issue with a test case? (or send it to me directly if you don’t want to share the code)
    Robbert van der Helm
    @robbert-vdh
    A Vulkan backend would be really cool to have, since that would allow Accelerate to work with any GPU (including Intel iGPUs) on all platforms without depending on the LLVM or CUDA toolchains. Although for macOS you'd still need something like MoltenVK. But yeah, mapping Accelerate's AST to GLSL (or directly to SPIR-V, but that sounds like a lot more effort without any real benefit) sounds like a pretty huge undertaking, and then you'd still have to do the rest of the plumbing and device management.
    Trevor L. McDonell
    @tmcdonell
    Vulkan/Metal/CUDA all use LLVM internally, just with different source languages; we need to figure out how to skip the middle-man and directly generate the LLVM IR their frontends would produce. more or less, anyway... insert hand-waving
    Robbert van der Helm
    @robbert-vdh
    I was thinking more from a usability point of view! With Vulkan you could use Accelerate in your fancy todo list or recipe sharing app without the user having to install CUDA or an LLVM toolchain themselves, since everything needed to run Vulkan applications is included with the GPU drivers*.
    statusfailed
    @statusfailed_gitlab
    @tmcdonell the code right now is pretty huge, so I'll try to make a test case
    Trevor L. McDonell
    @tmcdonell
    @robbert-vdh ah. yeah needing to install LLVM separately is maybe annoying. on Windows I don’t know if there is a package distribution which is easy to get (Chocolatey?). I have considered switching to just calling clang on the command line to compile, rather than linking directly to the LLVM library, since you can get that binary directly from llvm.org. But I’m not sure if it would improve things substantially enough (at all?) to bother with the change.
    @statusfailed_gitlab great, thanks!
    Slaus Blinnikov
    @SlavMFM
    @tmcdonell I was asking about OpenCL support just to quickly check whether my algorithm would run ~200× faster on my 5700xt GPU than its C++ version currently does on the CPU, because I can't compile it to target the GPU with either SYCL (ComputeCpp) or OpenMP/OpenACC. Rumor around the web is that AMD dropped SPIR-V support entirely (I couldn't find an official statement though); it seems true, given the 5700xt is a very popular GPU, up and running for 1.5 years already, and still doesn't have it. It seems the whole RDNA (2?) architecture isn't supported across all of these technologies, even under AMD's own stack: RadeonOpenCompute/ROCm#887 . So LLVM targeting modern (2-year-old) AMD GPUs looks pretty foggy ^^
    @tmcdonell the 5700xt under OpenCL is da beast though! >1000 times faster than the CPU, and >1 TB/s of random-access memory throughput. But writing OpenCL is very tedious, so I was hoping for an accelerate-hs escape route ^^. I'm wondering: is it possible to directly generate OpenCL code (instead of LLVM/SPIR)? Shouldn't that be even simpler, OpenCL being a "slightly" higher-level representation than LLVM? Where can I learn more about the code generation/conversion? Thanks in advance.
    Trevor L. McDonell
    @tmcdonell
    @SlavMFM that has been a consistent problem with AMD in fact; they never commit to a technology stack and make the support libraries necessary to actually use it. They’ve conceded the compute space entirely to NVIDIA because of this. Which is unfortunate.
    Trevor L. McDonell
    @tmcdonell
    The problem with OpenCL is (a) it’s basically dead, to be subsumed by Vulkan, I believe (which is more a graphics API), and (b) we actually used to generate CUDA C code, but going that route is much more of a pain than you would expect; the semantics of LLVM actually match much better to a functional language, despite being relatively low level. OpenCL being a lowest-common-denominator language also makes it difficult to get low-level control (e.g. accessing specific instructions is usually done with target-specific extensions). Anyway... I really would like to support AMD devices (I bought one recently to start work on this) but AMD themselves are pretty terrible at communicating how to actually do this.
    Troels Henriksen
    @athas
    AMD now seems to follow an interesting strategy of only supporting their compute-line of GPUs in ROCm. I can't tell if it's intentional or lack of resources. It's really crazy not to have any entry-level way of doing AMD compute, and I can't believe even AMD would be that stupid.
    I think LLVM can target all kinds of AMD architectures though, since that is what the fully operational Vulkan and OpenGL frontends use. It's just the compute parts that are nowhere in sight.
    Trevor L. McDonell
    @tmcdonell
    generating [Open]C[L] from the Accelerate AST is not so difficult though; it’s just tricky to do it robustly.
    Troels Henriksen
    @athas
    What is the challenging aspect of it?
    Trevor L. McDonell
    @tmcdonell
    @athas yeah, I’m not too hopeful about using ROCm as a target actually, at least until it’s more fully-baked
    Trevor L. McDonell
    @tmcdonell
    making sure there are no implicit conversions, for example (OpenCL int /= Haskell Int, as I’m sure you are familiar with recently), and dealing with aggregate types was a source of bugs, I remember. that was all a long time ago, maybe I’m not such a bad programmer anymore. probably other things too but I don’t remember. It’s all things which can be solved with extensive testing, but (at the time) I couldn’t reflect the C types into the Haskell type system to catch all those bugs for me, so they kept slipping in...
    (actually I used to silently try to use 32-bit ints on the GPU whenever I could, loop counters and such, because they are so much faster, but eventually abandoned that because it just isn’t robust)
    Troels Henriksen
    @athas
    Yes, the implicit conversions are super annoying, but I think I eventually got rid of those by using a phantom-typed expression language that I then transform to C.
    Some could still sneak in, I guess. My main problem with generating C code has been supporting complex control flow, e.g. jumping out of multiple loops. That requires goto in C, and works fine with NVIDIA, but often triggers compiler bugs on AMD.
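    For instance, a minimal sketch of the idea (made-up names, nothing like Futhark's actual representation): the phantom parameter tracks the C type of every expression, so a silent int/long mix-up becomes a Haskell type error before any C is generated.
    {-# LANGUAGE GADTs #-}
    import Data.Int (Int32, Int64)

    -- The parameter 't' is a phantom recording the C type of the
    -- expression; conversions have to be written out explicitly.
    data CExp t where
      IntLit    :: Int32 -> CExp Int32
      LongLit   :: Int64 -> CExp Int64
      Add       :: CExp t -> CExp t -> CExp t
      IntToLong :: CExp Int32 -> CExp Int64

    -- Generate C source; by construction no implicit conversion slips in.
    render :: CExp t -> String
    render (IntLit n)    = show n
    render (LongLit n)   = show n ++ "l"
    render (Add a b)     = "(" ++ render a ++ " + " ++ render b ++ ")"
    render (IntToLong e) = "((long)" ++ render e ++ ")"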
    Trevor L. McDonell
    @tmcdonell
    @athas ah yes you are right, that was a problem too; that impedance mismatch going from HS expressions to C statements
    statusfailed
    @statusfailed_gitlab
    Is there a way to evaluate an Exp a? Usually when I type in an expression of that type at the REPL, it shows me a value, but sometimes I get a big pretty-printed expression
    Trevor L. McDonell
    @tmcdonell
    the only way to evaluate things is with run (and its variants), which all evaluate array expressions. but you can create a scalar (one-element) array with unit
    statusfailed
    @statusfailed_gitlab
    Ah cool, ok!
    Trevor L. McDonell
    @tmcdonell
    what you are seeing is the Show instance for Exp (functions and expressions), and I guess the simplifier is able to reduce it down to a single value in some cases. There is a Show instance for Acc (functions and expressions) which does the same thing too.
    Robbert van der Helm
    @robbert-vdh
    @statusfailed_gitlab I use this in my tests:
    -- 'run' comes from a backend, e.g. Data.Array.Accelerate.Interpreter
    evalExp :: Elt a => Exp a -> a
    evalExp e = head . A.toList $ run (unit e)
    statusfailed
    @statusfailed_gitlab
    Ah nice :-)
    I will steal that; I also want to write unit tests for my expressions :D
    @tmcdonell actually the simplifier seems really clever; I have only run into a couple of cases where it's not able to reduce to a single value
    this particular one has a 'coerce' at the top, maybe that's why?
    (in fact, for a long time I thought the Show instance for Exp a was actually evaluating the expression, not just pretty-printing it)
    Troels Henriksen
    @athas
    What is the easiest way to compile accelerate-examples without any CUDA stuff? Setting the llvm-ptx flag to false does not seem to do the trick.
    Troels Henriksen
    @athas
    I figured it out: an llvm-ptx: false flag on both accelerate-examples and accelerate-fft.
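    In a stack-based setup that looks something like this (a sketch; adapt to your build tool):
    # stack.yaml: build both packages without the CUDA backend
    flags:
      accelerate-examples:
        llvm-ptx: false
      accelerate-fft:
        llvm-ptx: false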
    Trevor L. McDonell
    @tmcdonell
    yes, I was just about to say that. sorry I didn't get your message in time!
    Troels Henriksen
    @athas
    Do you know if Accelerate does something particularly fancy to the nbody example when compiled with the llvm-cpu backend? It is much faster than I would expect (runtime does not seem to scale quadratically with n).