Hi, I'm new to GPGPU programming and Futhark. It's been interesting, but I've hit a roadblock trying to optimize an algorithm. The problem is somewhat involved, so apologies for the length of this post.
Seam carving is an algorithm for "content-aware image resizing". I'm trying to implement "non-cumulative seam carving", an adaptation of the algorithm for GPUs described in this paper. Here's the relevant description (pp. 3-4):
To summarize, there are two steps:
Creating the index map is just a `map` over the energy array. But their algorithm for summing seam energy uses a tree reduction that modifies `OffsetMap` in place. It doesn't seem like Futhark can do this, as `reduce` does not operate on unique arrays. So, my current implementation creates temporary arrays:
```futhark
-- Note that I use a 2D array while the paper uses a 1D array, so my
-- indexes are x-coordinates.
let create_index_map [h][w] (energy: [h][w]u32): [h][w]i32 =
  map (\y ->
         let row = energy[y + 1]
         let mn i1 i2 = unsafe if row[i1] < row[i2] then i1 else i2
         in map (\x ->
                   let min = x
                   let min = if x != 0 then mn min (x - 1) else min
                   let min = if x != w - 1 then mn min (x + 1) else min
                   in min)
                (iota w))
      (iota (h - 1)) ++ [replicate w 0]

let sum_seam_energy [h][w] (energy: [h][w]u32) (index: [h][w]i32): [w]u32 =
  let op (e1, i1) (e2, i2) =
    let energy_sum = map2 (\e i -> e + unsafe e2[i]) e1 i1
    let next_index = map (\i -> unsafe i2[i]) i1
    in (energy_sum, next_index)
  let ne = (replicate w 0, iota w)
  let as = zip energy index
  in (reduce op ne as).1

entry main [h][w] (energy: [h][w]u32): [w]u32 =
  sum_seam_energy energy (create_index_map energy)
```
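For cross-checking the semantics, here is a small Python model of `sum_seam_energy` (hypothetical helper names, plain lists instead of Futhark arrays): each reduction element is one energy row zipped with its index row, and `op` composes two path segments by following the first segment's indices into the second.

```python
from functools import reduce

def op(a, b):
    """Compose two path segments: follow a's indices into b's rows."""
    (e1, i1), (e2, i2) = a, b
    energy_sum = [e + e2[i] for e, i in zip(e1, i1)]  # e + unsafe e2[i]
    next_index = [i2[i] for i in i1]                  # unsafe i2[i]
    return (energy_sum, next_index)

def sum_seam_energy(energy, index):
    w = len(energy[0])
    ne = ([0] * w, list(range(w)))  # neutral element: zero energy, identity indices
    return reduce(op, zip(energy, index), ne)[0]
```

Note that `op` is essentially composition of "jump tables", which is associative, so using it in a reduction is at least semantically sound.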
Is there some way to use `reduce` but modify the array in place? I've looked at SOACs like `scatter`, but none of them does what I want. (They each operate on arrays of destination indexes, but the indexes here are source indexes.) I ask because, on my laptop (Intel i5-5350U @ 4x 2.9GHz, Intel HD Graphics 6000), benchmarking the OpenCL backend with a `u32` energy array gives a throughput of 0.1 GB/s, 4x slower than on the CPU. My only guesses for the bad performance are poor hardware or too much allocation. Though the former is true, the latter seems to be the bigger culprit. Also, if it's helpful, running with `-D` shows a peak memory usage of 48 MB.
One small thing: `create_index_map` ends with a concatenation (`++`). That's not bad in itself (it's just a copy), but concatenation tends to inhibit other optimisations. In this case, it doesn't really matter.
The real problem is that you have a `reduce` that contains a nontrivial operator (in particular, a `map`). Futhark cannot generally exploit parallelism inside reduction operators (except for perfect nestings of `map`s). Changing the `reduce` to a sequential `foldl` is going to improve your performance, but of course limit the amount of parallelism that can be exploited to the width of the image.
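For reference, the sequential `foldl` variant amounts to a single pass over the rows, carrying the accumulated energies and the current seam positions. A hypothetical Python sketch (invented names, plain lists):

```python
def sum_seam_energy_foldl(energy, index):
    w = len(energy[0])
    total = [0] * w            # accumulated seam energy per starting column
    cur = list(range(w))       # where each seam currently is
    for e_row, i_row in zip(energy, index):
        total = [t + e_row[c] for t, c in zip(total, cur)]
        cur = [i_row[c] for c in cur]  # follow the index map down one row
    return total
```

The two list comprehensions in the loop body are maps of width `w`, which is where the "parallelism limited to the width of the image" comes from.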
The best solution would be to restructure the algorithm so that you have a `map` on top and a `reduce` on the inside (which is a very efficient pattern), but that may require rethinking the algorithm a bit.
Assuming the `reduce` is giving you the right result, here is how I would write the tree reduction manually, with `map` as a building block:
```futhark
let sum_seam_energy [h][w] (energy: [h][w]u32) (index: [h][w]i32): [w]u32 =
  let op (e1, i1) (e2, i2) =
    let energy_sum = map2 (\e i -> e + unsafe e2[i]) e1 i1
    let next_index = map (\i -> unsafe i2[i]) i1
    in (energy_sum, next_index)
  let as = zip energy index
  let res = loop as while length as > 1 do
              map (\r -> r[0] `op` r[1]) (unflatten (length as / 2) 2 as)
  in res[0].1
```
This only works if the number of rows (`h`) is a power of two, but that shouldn't be impossible to fix.
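The same pairwise scheme can be checked against the `reduce` semantics with a hypothetical Python sketch (invented names; assumes a power-of-two number of rows):

```python
def op(a, b):
    # Compose two path segments, as in the reduction operator.
    (e1, i1), (e2, i2) = a, b
    return ([e + e2[i] for e, i in zip(e1, i1)], [i2[i] for i in i1])

def tree_sum_seam_energy(energy, index):
    xs = list(zip(energy, index))
    while len(xs) > 1:  # halve the number of segments each round
        xs = [op(xs[k], xs[k + 1]) for k in range(0, len(xs), 2)]
    return xs[0][0]
```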
The `unflatten` there is something I normally discourage (like concatenation, it can inhibit optimisation), but writing it with direct indexing unfortunately exposed a compiler bug that made the performance pretty bad.
This kind of `reduce` pattern is not something we see much, and it is almost always a mistake, so the compiler hasn't been taught how to deal with it. I think in practice it probably cannot do much better than that loop you write yourself.
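On the power-of-two restriction: since the operator has a neutral element (zero energies, identity indices), one way to lift it, sketched hypothetically in Python, is to pad the list of rows with copies of that neutral element up to the next power of two; because the neutral element leaves the result unchanged, the padded reduction computes the same answer.

```python
def pad_rows_to_pow2(energy, index):
    w, n = len(energy[0]), len(energy)
    m = 1
    while m < n:                     # next power of two >= n
        m *= 2
    ne = ([0] * w, list(range(w)))   # neutral element: zero energy, identity indices
    return list(zip(energy, index)) + [ne] * (m - n)
```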