These are chat archives for halide/Halide

18th
Oct 2017
Andrew Adams
@abadams
Oct 18 2017 17:09
Slack was downvoted at the user meetup because some companies have weird rules about Slack participation, and because an invite-based flow is discouraging to newcomers.
Zalman Stern
@zvookin
Oct 18 2017 17:10
My general thought was to try to keep both open
I.e. I'm fine with making Gitter the preferred public option, but I expect the research and other folks who like Slack will probably stay there
Zalman Stern
@zvookin
Oct 18 2017 17:26
I just commented on Martijn Courteaux's issue for thread-safe JIT calls: halide/Halide#2450
If anyone wants to check my thoughts, please do. Should be easy to provide a way to do this, but there's an issue of how to represent the full set of args in the C++ parameter list and then map them to Params and ImageParams. Not sure where it goes in priorities, but we should fix it.
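For context, a minimal sketch of the hazard (illustrative code, not from the issue): JIT argument values live on shared Param objects, so concurrent set-then-realize sequences on one pipeline can race.
```cpp
#include "Halide.h"
using namespace Halide;

Param<float> gain;
Func f;
Var x;

void define_pipeline() {
    f(x) = gain * cast<float>(x);
}

// Two threads calling this can clobber each other's gain value before
// realize() reads it; a thread-safe JIT API would need to take the full
// set of Param/ImageParam values per call.
Buffer<float> run(float g) {
    gain.set(g);                       // shared mutable state
    Buffer<float> out = f.realize(256);
    return out;
}
```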
Steven Johnson
@steven-johnson
Oct 18 2017 17:33
Is it possible to have Gitter notify me only when I’m “live”, but not via email when I’m away? (I’m assuming/hoping so but can’t find the relevant setting….)
Ah, think I found it. A little obscure.
Steven Johnson
@steven-johnson
Oct 18 2017 17:41
Still seeing timeouts in Travis jobs at random, e.g. https://travis-ci.org/halide/Halide/jobs/289282274. Restarting them usually fixes it, but I'm not sure what's going on. Maybe we've been close to the timeout limit for a while and are just now starting to hit it? Is the Travis timeout configurable?
Andrew Adams
@abadams
Oct 18 2017 17:41
I have in the past blacklisted long-running tests to keep us under the timeout
We should maybe blacklist a few more
unless the timeouts are occurring in the middle of compilation? The compiler has gotten larger too
Zalman Stern
@zvookin
Oct 18 2017 17:42
How much mileage do you think there is in optimizing our compilation?
This came up last night with the Autodesk folks. Compiles are long.
Andrew Adams
@abadams
Oct 18 2017 17:43
Compilation of libHalide?
Zalman Stern
@zvookin
Oct 18 2017 17:43
Oh, no: compilation of Halide code
Andrew Adams
@abadams
Oct 18 2017 17:44
It's a constant struggle. I think there's some low-hanging fruit in prefetching
It takes a while on some pipelines that don't even use it
Zalman Stern
@zvookin
Oct 18 2017 17:44
Though it might be worth doing a simple pass on libHalide too, to check header deps and such
Andrew Adams
@abadams
Oct 18 2017 17:44
I use ccache locally, which helps a lot for libHalide
Steven Johnson
@steven-johnson
Oct 18 2017 17:44
Have we done any basic profiling lately re: low hanging fruit?
Andrew Adams
@abadams
Oct 18 2017 17:44
Not sure if we can set up a persistent cache on travis
Not recently
Zalman Stern
@zvookin
Oct 18 2017 17:45
Do the buildbots/travis cache build artifacts?
ah
Looks like ccache is really easy to use on travis
Yeah the buildbots use ccache
not travis (yet)
Zalman Stern
@zvookin
Oct 18 2017 17:46
Cool. Maybe that will help reduce CI result times
Steven Johnson
@steven-johnson
Oct 18 2017 17:59
I’ll update travis for ccache
Andrew Adams
@abadams
Oct 18 2017 18:20
Some discussion of Halide (some of it pretty wrong): https://news.ycombinator.com/item?id=15495907
sandGorgon is deeply confused about what this new chip does and is
Steven Johnson
@steven-johnson
Oct 18 2017 18:45
@abadams: any concerns re: halide/Halide#2441?
Andrew Adams
@abadams
Oct 18 2017 18:46
lgtm. merged
Steven Johnson
@steven-johnson
Oct 18 2017 18:49
I don’t have a HackerNews account, and I'm pretty sure that signing up for one would just be inviting disaster on myself, so I'm not even gonna try to correct him
Zalman Stern
@zvookin
Oct 18 2017 19:25
I don't see anything in that HackerNews discussion that is concrete enough to respond to.
Steven Johnson
@steven-johnson
Oct 18 2017 19:54
That’s usually true for any given HackerNews discussion.
(See also http://n-gate.com/)
Andrew Adams
@abadams
Oct 18 2017 21:07
Hacking in a target feature that maintains a free list and reuses CUDA allocations instead of calling cuMemFree eagerly does indeed speed up the apps on CUDA a lot. Even on my shitty card they were bottlenecked on the allocator. On a 1080 this would be much more dramatic. Bilateral grid goes from 3ms to 2.5ms, and local Laplacian goes from 18ms to 15ms.
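A minimal sketch of the free-list idea being described (illustrative, not the actual runtime change; thread safety and per-context keying are omitted):
```cpp
#include <cuda.h>
#include <map>

// Exact-fit cache: freed blocks are parked by size instead of being
// returned to the driver with cuMemFree.
static std::multimap<size_t, CUdeviceptr> free_list;

CUresult cached_malloc(size_t bytes, CUdeviceptr *out) {
    auto it = free_list.find(bytes);
    if (it != free_list.end()) {
        *out = it->second;           // reuse a previously freed allocation
        free_list.erase(it);
        return CUDA_SUCCESS;
    }
    return cuMemAlloc(out, bytes);
}

void cached_free(size_t bytes, CUdeviceptr ptr) {
    free_list.insert({bytes, ptr});  // park it instead of cuMemFree(ptr)
}
```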
Zalman Stern
@zvookin
Oct 18 2017 21:09
We could just add an API to HalideRuntimeCuda.h to set the hysteresis amount in the free list.
A target feature for this strikes me as a bad idea
Andrew Adams
@abadams
Oct 18 2017 21:10
Yeah, it really doesn't need to be a compile-time decision
We'd set the maximum number of wasted bytes?
Zalman Stern
@zvookin
Oct 18 2017 21:10
Yeah, and give a call to free it
Andrew Adams
@abadams
Oct 18 2017 21:10
and if we hit the limit, free the least-recently-used?
Zalman Stern
@zvookin
Oct 18 2017 21:10
I think so.
Could also just add a call to turn it on and one to free them all
The place this stuff goes south is if one is making a bunch of calls on different-sized inputs
Andrew Adams
@abadams
Oct 18 2017 21:11
That sounds the same as setting the default value for wasted bytes to infinity
Zalman Stern
@zvookin
Oct 18 2017 21:11
Then the allocations can accumulate.
Andrew Adams
@abadams
Oct 18 2017 21:11
Yes, right now it's an exact fit, not a best fit
Presumably the user knows what's going on though.
Zalman Stern
@zvookin
Oct 18 2017 21:12
Limiting it is probably a good idea, simply because otherwise you'll run out of memory on the card in some situations
Andrew Adams
@abadams
Oct 18 2017 21:12
lens blur goes from 5.2ms to 3.4ms
Zalman Stern
@zvookin
Oct 18 2017 21:12
but it does require more logic and decisions in the implementation
Andrew Adams
@abadams
Oct 18 2017 21:12
Probably want to flush the free list and try again on an out-of-memory error
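Continuing the free-list sketch above, the retry-on-OOM idea might look like this (hypothetical):
```cpp
// If the driver reports out-of-memory, drop everything parked in the
// free list and try the allocation once more.
CUresult cached_malloc_with_retry(size_t bytes, CUdeviceptr *out) {
    CUresult err = cached_malloc(bytes, out);
    if (err == CUDA_ERROR_OUT_OF_MEMORY) {
        for (auto &entry : free_list) {
            cuMemFree(entry.second);
        }
        free_list.clear();
        err = cuMemAlloc(out, bytes);
    }
    return err;
}
```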
Zalman Stern
@zvookin
Oct 18 2017 21:13
It is really annoying that this sort of thing is still an everyday issue in 2017 :-)
Andrew Adams
@abadams
Oct 18 2017 21:13
Setting it to inf would then act like a garbage collector, where it reclaims memory whenever it runs out
Zalman Stern
@zvookin
Oct 18 2017 21:14
yeah
Other question is whether we need to offer the feature across GPU APIs
I.e. is any other API's allocator any faster?
Andrew Adams
@abadams
Oct 18 2017 21:14
The NVIDIA people said that OpenGL already does this internally
but we need it for more modern APIs
Zalman Stern
@zvookin
Oct 18 2017 21:15
Can we put the support in a .h like threadpool and then just plumb it through all the devices?
Andrew Adams
@abadams
Oct 18 2017 21:15
Devices may need to track different flags. E.g. my cuda free list keys off the context.
Zalman Stern
@zvookin
Oct 18 2017 21:15
Another thing in this realm is it would be nice for memoize to work with device allocations
Andrew Adams
@abadams
Oct 18 2017 21:15
Not sure if that's necessary or not
Yes
Zalman Stern
@zvookin
Oct 18 2017 21:16
The API for controlling it is likely in the device-specific header, so if it needs some extra args, etc., that can be done
I guess you can't even tell if it helps other device APIs without trying it
I need to get back to the cuda context contention thing too
Andrew Adams
@abadams
Oct 18 2017 21:28
discoverability is a big problem here
If the default behavior is to be slow
then people are going to just move on
Zalman Stern
@zvookin
Oct 18 2017 21:29
Um....
Andrew Adams
@abadams
Oct 18 2017 21:29
One way to look at this is that we're debating what the default value for memory wastage should be. Currently it's zero, and we're talking about adding a call to change it.
But maybe the default should be larger than that.
Zalman Stern
@zvookin
Oct 18 2017 21:30
It's a GPU. The default behavior is to be slow in 100 different ways unless you know what you're doing.
Andrew Adams
@abadams
Oct 18 2017 21:30
Halide tries to hide GPU boilerplate and be fast by default
Zalman Stern
@zvookin
Oct 18 2017 21:30
E.g. we compile the shader on first call too
Andrew Adams
@abadams
Oct 18 2017 21:30
On first call is very different from on every call
Is it a better default to be slow, or to hold onto some GPU memory persistently?
Zalman Stern
@zvookin
Oct 18 2017 21:31
So the size has to be larger than the compute roots on the device
We can't figure that out as a constant
It also won't help for one-shot timing.
Andrew Adams
@abadams
Oct 18 2017 21:32
A default of 1GB is going to cover a lot of ground.
Or perhaps some fraction of the free memory on the card
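A sketch of that default (hypothetical; cuMemGetInfo reports free and total memory for the current context):
```cpp
#include <cuda.h>

// Derive the cache cap from the card rather than a hard-coded constant.
size_t default_allocation_cache_limit() {
    size_t free_bytes = 0, total_bytes = 0;
    if (cuMemGetInfo(&free_bytes, &total_bytes) != CUDA_SUCCESS) {
        return 0;  // fall back to no caching if the query fails
    }
    return free_bytes / 4;  // e.g. cap the cache at a quarter of free memory
}
```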
Zalman Stern
@zvookin
Oct 18 2017 21:32
Halide allocing 1GB of GPU memory and holding onto it by default will also harm our reputation
I.e. people gotta know about this either way
Andrew Adams
@abadams
Oct 18 2017 21:33
It will only alloc 1GB if your pipeline allocs >= 1GB
The objection would be that Halide doesn't release memory unless you make a special call
not that Halide pointlessly allocates a ton of memory
(though preallocating a pool is indeed another possible design)
Zalman Stern
@zvookin
Oct 18 2017 21:34
I have no objection to adding the support. I think turning it on by default is a breaking change.
Andrew Adams
@abadams
Oct 18 2017 21:34
Being slow by default harms our reputation more than holding onto memory for too long by default
And our GPU reputation is currently bad, partially due to this
I think we agree the support is good. We're just debating the correct initial value of the setting.
Zalman Stern
@zvookin
Oct 18 2017 21:35
I'm a bit pissy about this having spent a lot of time debugging Apple's changes to malloc in Mac OS X when I worked on Camera Raw.
We can look at changing the default later
We've lived with this for a long time so far, so it isn't terribly time-pressing
Andrew Adams
@abadams
Oct 18 2017 21:35
True, once support is there
A different question about adding the support: On Windows at least, this is also useful on the CPU
freeing serializes and is slow.
Zalman Stern
@zvookin
Oct 18 2017 21:37
Yeah, hence the idea of a reusable thing in the runtime. Ideally it would also go underneath memoize. I.e. there would be some general control over how much memory Halide is keeping in cached allocations for everything.
Of course my idea of a central memory allocator with a richer interface than malloc/free would also do this :-)
Andrew Adams
@abadams
Oct 18 2017 21:38
We would want separate settings for wasted device memory and wasted host memory, right?
Or does that not make sense for unified memory architectures on mobile
I guess you could also have wasted total memory as a third limit
Zalman Stern
@zvookin
Oct 18 2017 21:39
Ideally the programmer would be able to control all of it.
Andrew Adams
@abadams
Oct 18 2017 21:39
You could also have a per-device-api granularity
Zalman Stern
@zvookin
Oct 18 2017 21:39
I'm probably slightly more sympathetic to turning this on by default in Windows for CPU mem
Is it possible to have the profiler spit out allocation time cost?
Because we could add output to the profiler telling people where to look for the APIs to control this
And we should also report the caching behavior to help tune the sizes
Andrew Adams
@abadams
Oct 18 2017 21:41
That would make it more discoverable
but not to novices - the profiler itself is not particularly discoverable
Zalman Stern
@zvookin
Oct 18 2017 21:41
More generally, I think one way to answer worries about our reputation for speed is to have a profiler that explains everything. To be sure we should tune our defaults well, but I think if this were an easy tradeoff to tune, fewer allocators would have the problem in the first place
Andrew Adams
@abadams
Oct 18 2017 21:42
I would really like madvise(... MADV_DONTNEED) on more platforms
to not give up the address space, but let the system reclaim the pages if necessary
We could use a pretty large default value safely if we had that
because all it's doing is reserving address space in the current process
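For reference, the POSIX call being described (a sketch; addr and len must be page-aligned):
```cpp
#include <sys/mman.h>
#include <cstddef>

// Let the OS reclaim the physical pages while the virtual address
// range stays mapped; the next touch faults in zero-filled pages.
void release_pages(void *addr, size_t len) {
    madvise(addr, len, MADV_DONTNEED);
}
```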
I think I tried those sorts of things on Windows, but they were just as slow as free
The issue was that for any of these things it grabbed a lock and walked the page table
Doing some work per page
So only huge pages really speeds it up
but that's not available to user processes on vanilla Windows
It's really designed for database servers
Zalman Stern
@zvookin
Oct 18 2017 21:44
To put it in perspective, I'm also worried about e.g. someone using this on an embedded device where 1GB has a completely different meaning than it does in the places we run code.
Andrew Adams
@abadams
Oct 18 2017 21:44
Well... it's not going to increase peak memory footprint if you're only using Halide
Zalman Stern
@zvookin
Oct 18 2017 21:45
Currently Halide is pretty conservative around this sort of thing
Andrew Adams
@abadams
Oct 18 2017 21:45
and we could default to a fraction of the device memory
rather than a constant
Zalman Stern
@zvookin
Oct 18 2017 21:45
For the embedded case, you might even want static allocation
Andrew Adams
@abadams
Oct 18 2017 21:45
Preallocating and serving from a pool would also be nice, yeah
If we had that logic, asm.js might have been easier :)
Zalman Stern
@zvookin
Oct 18 2017 21:46
:-)
Are embedded Buffers writable?
Andrew Adams
@abadams
Oct 18 2017 21:46
I think they go in .rodata
so no
Zalman Stern
@zvookin
Oct 18 2017 21:46
I don't think we have a way to do a static allocation that is writable, which keeps us thread-safe.
Andrew Adams
@abadams
Oct 18 2017 21:46
but they could be
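A sketch of what a writable, statically backed buffer could look like (hypothetical; this wraps user-provided storage rather than placing data in .rodata):
```cpp
#include "HalideBuffer.h"

// The storage lives in .data/.bss instead of the heap, so nothing is
// dynamically allocated; callers sharing it must synchronize writes.
static float scratch[64 * 64];
static Halide::Runtime::Buffer<float> buf(scratch, 64, 64);
```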
Zalman Stern
@zvookin
Oct 18 2017 21:47
Pretty rare use case I guess
ronlieb
@ronlieb
Oct 18 2017 21:52
Hi Folks, just joined.
Andrew Adams
@abadams
Oct 18 2017 21:57
Howdy
Welcome
ronlieb
@ronlieb
Oct 18 2017 21:58
FYI: Qualcomm pushed up a change to LLVM that will now require a patch like the following to compile properly for HVX (setting the HVX length):
```
index 6766304..cd8415a 100644
--- a/src/CodeGen_Hexagon.cpp
+++ b/src/CodeGen_Hexagon.cpp
@@ -1306,9 +1306,9 @@ string CodeGen_Hexagon::mcpu() const {
 string CodeGen_Hexagon::mattrs() const {
     std::stringstream attrs;
     if (target.has_feature(Halide::Target::HVX_128)) {
-        attrs << "+hvx-double";
+        attrs << "+hvx-length128b";
     } else {
-        attrs << "+hvx";
+        attrs << "+hvx-length64b";
     }
     attrs << ",+long-calls";
     return attrs.str();
```
I will push up a PR after some testing
If the above is better suited to email, let me know
Andrew Adams
@abadams
Oct 18 2017 22:01
Seems like an appropriate use of the channel, but it might be better to copy-paste diffs into a linked gist of some sort to not fill the screen.
Steven Johnson
@steven-johnson
Oct 18 2017 22:04
+1
Andrew Adams
@abadams
Oct 18 2017 22:38
Any suggestions for the name of the function that sets the permissible amount of wasted space? halide_set_allocation_cache_size?
ronlieb
@ronlieb
Oct 18 2017 22:52
opened PR for attrs fix, halide/Halide#2455
Zalman Stern
@zvookin
Oct 18 2017 23:01
Re: halide_set_allocation_cache_size, the parallel to halide_memoization_cache_set_size is "halide_allocation_cache_set_size"
And then "halide_allocation_cache_clear" or "halide_memoization_cache_cleanup"
The latter is what is used for memoization, but it isn't quite right here, as one might empty the allocation cache as part of normal operation
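A sketch of the entry points under discussion (proposed names and signatures only, modeled on the memoization cache API; not final):
```cpp
#include <cstdint>

extern "C" {
// Cap on the bytes the runtime may keep parked in the allocation cache.
void halide_allocation_cache_set_size(void *user_context, uint64_t size);
// Release every allocation currently parked in the cache.
void halide_allocation_cache_clear(void *user_context);
}
```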