    Troels Henriksen
    @athas
    Is this on the master branch?
    Troels Henriksen
    @athas
    Ah, I see what is going on. It's kind of an edge case where the compiler ends up performing 27 rotates sequentially, whereas your hand-written tabulate_3d becomes a single GPU kernel. The memory traffic is the same, so I wonder if it's just the kernel launch overhead.
    Troels Henriksen
    @athas
    Interesting! This is inspiring a new optimisation.
    I think what is happening is that copying each array individually duplicates the (relatively) expensive index calculations, which are only done once with tabulate_3d.
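    For illustration, a minimal sketch of the two formulations being compared (the names, the element type, and the (dx, dy, dz) offset are made up for this example; tabulate_3d may be hand-written as a map over iotas or taken from the prelude):
    let stream_rotate [nx][ny][nz] (f: [nx][ny][nz]f32) (dx: i32) (dy: i32) (dz: i32): [nx][ny][nz]f32 =
      -- One rotate (one copy, with its own index arithmetic) per
      -- dimension, repeated for each of the 27 arrays.
      rotate dx (map (rotate dy) (map (map (rotate dz)) f))
    let stream_tabulate [nx][ny][nz] (f: [nx][ny][nz]f32) (dx: i32) (dy: i32) (dz: i32): [nx][ny][nz]f32 =
      -- A single perfect map nest: one kernel, with the index
      -- arithmetic done once per element.
      tabulate_3d nx ny nz (\i j k ->
        unsafe f[(i + dx) % nx, (j + dy) % ny, (k + dz) % nz])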
    Orestis
    @omalaspinas_twitter
    Nice!
    Troels Henriksen
    @athas
    @omalaspinas_twitter the root issue is that the Futhark compiler is quite keen to ensure arrays that are produced in loops have no index transformations. If possible, you can also try to rework your code structure such that the rotates are done at the start of the loop instead.
    Orestis
    @omalaspinas_twitter
    which loop do you mean?
    doing streaming before collision?
    (could certainly be done)
    Troels Henriksen
    @athas
    Yes, exactly. It might give you something.
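    As an untested sketch of the shape change (stream and collide here are trivial stand-ins for the real LBM steps):
    -- Hypothetical stand-ins: stream is the part with index
    -- transformations (rotates), collide is a plain map.
    let stream [n] (f: [n]f32): [n]f32 = rotate 1 f
    let collide [n] (f: [n]f32): [n]f32 = map (\x -> x * 0.5f32) f
    -- The rotates happen at the start of each iteration, so the value
    -- produced at the end of an iteration comes straight out of a map,
    -- with no index transformations left for the compiler to remove.
    let simulate [n] (steps: i32) (f0: [n]f32): [n]f32 =
      loop f = f0 for _i < steps do collide (stream f)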
    Orestis
    @omalaspinas_twitter
    wow
    must check, but I just got a factor of 2...
    O_o
    nice
    now rotate and indexing give the same performance.
    Troels Henriksen
    @athas
    Not bad.
    Would be nice if it weren't necessary, though.
    Orestis
    @omalaspinas_twitter
    I could also try to remove the loop completely and have it performed from the C code. (Each call from the C code would be one collide-stream combination.) Would that be OK, or would there be an overhead?
    Troels Henriksen
    @athas
    In its current form it would be OK, but you would not gain anything.
    Orestis
    @omalaspinas_twitter
    yeah it's just out of curiosity.
    Orestis
    @omalaspinas_twitter
    Hello. I am planning to add an MPI layer to my lattice Boltzmann code (to experiment with multi-GPU). To do so, I would copy from GPU memory to CPU (and back) only an envelope (or halo) located all around a 3D matrix. Is there an easy, elegant way to do that? One possibility would certainly be to also return it from the entry point of my Futhark lib, but I was wondering if something more efficient existed.
    I want to minimize the copies to/from the CPU as much as possible, for obvious efficiency reasons...
    Troels Henriksen
    @athas
    Copying the halo in a bunch of slices is the best you can do, I think. As long as the overall grid is big enough, it shouldn't be a performance problem.
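    A sketch of what those slices could look like as entry points (names and element type are invented for the example), exposing the six faces of the grid so the host copies only those:
    entry face_x0 [nx][ny][nz] (g: [nx][ny][nz]f32): [ny][nz]f32 = g[0]
    entry face_x1 [nx][ny][nz] (g: [nx][ny][nz]f32): [ny][nz]f32 = g[nx - 1]
    entry face_y0 [nx][ny][nz] (g: [nx][ny][nz]f32): [nx][nz]f32 = g[:, 0]
    entry face_y1 [nx][ny][nz] (g: [nx][ny][nz]f32): [nx][nz]f32 = g[:, ny - 1]
    entry face_z0 [nx][ny][nz] (g: [nx][ny][nz]f32): [nx][ny]f32 = g[:, :, 0]
    entry face_z1 [nx][ny][nz] (g: [nx][ny][nz]f32): [nx][ny]f32 = g[:, :, nz - 1]
    Each of these returns a small 2D array that the generated C API can copy to host memory, so only the halo crosses the bus.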
    Orestis
    @omalaspinas_twitter
    ok thanks!
    Ryan Huang
    @NPN

    Hi, I'm new to GPGPU programming and Futhark. It's been interesting, but I've hit a roadblock trying to optimize an algorithm. The problem is somewhat involved, so apologies for the length of this post.

    Seam carving is an algorithm for "content-aware image resizing". I'm trying to implement "non-cumulative seam carving", a GPU-oriented adaptation of the algorithm described in this paper. Here's the relevant description (pp. 3-4):

    [image: "Non-cumulative seam carving algorithm", excerpt from the paper]

    To summarize, there are two steps:

    1. Create the index map. For each pixel, the map's value is the index of the adjacent pixel in the next row with the least energy.
    2. Following the paths created by the indexes, sum up the energy for each seam (which runs from the top to the bottom of the image).

    Creating the index map is just a map over the energy array. But their algorithm for summing seam energy uses a tree reduction that modifies SumMap and OffsetMap in place. It doesn't seem like Futhark can do this, as reduce does not operate on unique arrays. So, my current implementation creates the temporary arrays energy_sum and next_index:

    -- Note that I use a 2D array while the paper uses a 1D array, so my
    -- indexes are x-coordinates.
    let create_index_map [h][w] (energy: [h][w]u32): [h][w]i32 =
      map (\y ->
        let row = energy[y + 1]
        let mn i1 i2 = unsafe if row[i1] < row[i2] then i1 else i2
        in map (\x ->
          let min = x
          let min = if x != 0     then mn min (x - 1) else min
          let min = if x != w - 1 then mn min (x + 1) else min
          in min
        ) (iota w)
      ) (iota (h - 1)) ++ [replicate w 0]
    
    let sum_seam_energy [h][w] (energy: [h][w]u32) (index: [h][w]i32): [w]u32 =
      let op (e1, i1) (e2, i2) =
        let energy_sum = map2 (\e i -> e + unsafe e2[i]) e1 i1
        let next_index = map (\i -> unsafe i2[i]) i1
        in (energy_sum, next_index)
      let ne = ((replicate w 0), (iota w))
      let as = zip energy index
      in (reduce op ne as).1
    
    entry main [h][w] (energy: [h][w]u32): [w]u32 =
      sum_seam_energy energy (create_index_map energy)

    Is there some way to use reduce but modify the array in place? I've looked at SOACs like reduce_by_index and scatter, but neither does what I want. (They each operate on arrays of destination indexes, but the indexes here are source indexes.) I ask because, on my laptop (Intel i5-5350U @ 4x 2.9GHz, Intel HD Graphics 6000), benchmarking the OpenCL backend with a [2000][2000]u32 energy array gives a throughput of 0.1 GB/s, 4x slower than the CPU. My only guesses for the bad performance are poor hardware or too much allocation. Though the former is true, it seems like the latter is the bigger culprit. Also, if it's helpful, running with -D shows a peak memory usage of 48 MB.

    Thank you!

    Troels Henriksen
    @athas
    @NPN are you sure that reduction is actually properly associative? It's not obvious to me.
    Anyway, there are two performance problems: the minor one is the use of ++. That's not bad in itself (it's just a copy), but concatenation tends to inhibit other optimisations. In this case, it doesn't really matter.
    The bigger problem is the reduce that contains a nontrivial operator (in particular, a map). Futhark cannot generally exploit parallelism inside reduction operators (except for perfect nestings of maps).
    Just changing the reduce to a sequential foldl is going to improve your performance, but of course limit the amount of parallelism that can be exploited to the width of the image.
    An ideal approach would be to rewrite the code such that you have a map on top and reduce on the inside (which is a very efficient pattern), but that may require rethinking the algorithm a bit.
    Alright, based on my understanding of the algorithm (finding a minimum energy "path" from top to bottom), I don't think it's straightforward to parallelise along the height as well. I don't think your reduce is giving you the right result.
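    In minimal form, the pattern being recommended (illustrative only, not the seam-carving solution itself):
    -- map on top, reduce inside: a perfect nest that the compiler
    -- parallelises fully.  The simplest instance is a row sum.
    let row_sums [n][m] (xss: [n][m]u32): [n]u32 =
      map (\xs -> reduce (+) 0 xs) xss
    The quick sequential fix, by contrast, is just replacing reduce with foldl in sum_seam_energy, i.e. (foldl op ne as).1.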
    Troels Henriksen
    @athas
    Oh, actually, now that I think about it, it might actually just be parallel.
    Troels Henriksen
    @athas
    Cool, this might be the first case I've seen of a reduction with an array-typed operator, where that was actually both correct and potentially even a good idea.
    For GPU-architectural reasons, it's possibly not a good idea for performance, but in principle this is a parallel reduction. And right now the Futhark compiler cannot turn it into good code.
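    Informally, the reason the operator is associative: a pair (e2, i2) behaves like a function sending a seam currently at column x to the accumulated energy e2[x] and the next column i2[x]; op is sequential composition of such functions, and composition is associative. The neutral element ne = (replicate w 0, iota w) adds no energy and leaves every column where it is.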
    Ryan Huang
    @NPN
    Thank you for the detailed response. Neat to hear that I didn't mess up associativity. I'll think about it and see if I can come up with a map-reduce version, otherwise I might just go back to the original dynamic programming algorithm.
    Actually, when you say "it's possibly not a good idea for performance," do you mean even if I were to implement the in-place algorithm directly using OpenCL or CUDA?
    Troels Henriksen
    @athas
    @NPN You can always write your own reduce-like operation just by using map as a building block:
    let sum_seam_energy [h][w] (energy: [h][w]u32) (index: [h][w]i32): [w]u32 =
      let op (e1, i1) (e2, i2) =
        let energy_sum = map2 (\e i -> e + unsafe e2[i]) e1 i1
        let next_index = map (\i -> unsafe i2[i]) i1
        in (energy_sum, next_index)
      let as = zip energy index
      let res = loop as while length as > 1 do
                  map (\r -> r[0] `op` r[1])
                      (unflatten (length as / 2) 2 as)
      in res[0].1
    This particular implementation only works if the number of rows h is a power of two, but that shouldn't be impossible to fix.
    On my Vega 64 GPU, it's about 27x faster than the original code.
    Now, the unflatten there is something I normally discourage (like concatenation, it can inhibit optimisation), but writing it with direct indexing unfortunately exposed a compiler bug that made the performance pretty bad.
    This approach also looks like what they are doing in that paper.
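    One hedged sketch of lifting that restriction (untested against the original program): when the current length is odd, set the stray last element aside and fold it back in afterwards. This preserves the left-to-right order, so it relies only on associativity:
    let res = loop as while length as > 1 do
                let n = length as / 2
                -- Pair up the first n*2 elements and combine each pair.
                let pairs = map (\r -> r[0] `op` r[1])
                                (unflatten n 2 (take (n * 2) as))
                -- Reattach the stray element, keeping the original order.
                in if length as % 2 == 1
                   then pairs ++ [last as]
                   else pairs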
    Ryan Huang
    @NPN
    Thanks again. Indeed, I see a 14x speedup. I remember seeing a similar "manual reduce" in the "What is the minimal basis for Futhark?" blog post, but I didn't consider trying it here. I assume the simpler structure (loop-map instead of reduce-map) helps the compiler optimize? Neat!
    Troels Henriksen
    @athas
    It's not so much the simpler structure; it's that map-inside-map is something that the compiler knows how to parallelise fully.
    The map-inside-reduce pattern is not something we see much, and it is almost always a mistake, so the compiler hasn't been taught how to deal with it. I think in practice it probably could not do much better than the hand-written loop above.
    Ryan Huang
    @NPN
    Oh, I see. This lets the compiler fuse the maps. And yeah, I suppose the explicit loop is what the reduce-map would turn into anyway.
    Troels Henriksen
    @athas
    @NPN So is this fast enough, you think?
    I can't see any obvious smoking guns left in the generated code, so any improvement beyond this will require something funky.
    Ryan Huang
    @NPN
    Yeah, I think it'll be just fine for my purposes.
    Thanks for all the help.
    Troels Henriksen
    @athas
    Anytime!