Does anyone know how I could AllReduce a local Matrix so that the result ends up in a DistMatrix, without making too many copies? If I AllReduce into the same matrix, it looks like it ends up using memory equivalent to two full copies. I am guessing that this is for the send and receive buffers? Ideally, each element would only be sent to the process that owns it in the DistMatrix, so that every rank receives just its local portion. But it looks like I would have to write that myself. Is there a better way?
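To make the "write it myself" option concrete, here is a rough sketch of the packing step I have in mind. It is only a sketch under assumptions I have not verified against the library: the target DistMatrix uses an element-wise 2D cyclic layout on an r x c grid with zero alignments, ranks are ordered column-major over the grid, and the local Matrix is stored contiguously in column-major order. `OwnerRank` and `PackByOwner` are hypothetical helper names of my own, not library functions.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: rank owning global entry (i, j), ASSUMING an
// element-wise 2D cyclic layout on an r x c grid, zero alignments, and
// column-major ordering of ranks over the grid.
int OwnerRank(int i, int j, int r, int c) {
    return (i % r) + (j % c) * r;
}

// Hypothetical helper: reorder a full m x n column-major matrix A so that
// all entries owned by rank 0 come first, then rank 1, and so on.
// counts[q] receives the number of entries owned by rank q.
// Within each rank's segment the entries appear in column-major order of
// that rank's local matrix, because the global traversal is column-major.
std::vector<double> PackByOwner(const std::vector<double>& A,
                                int m, int n, int r, int c,
                                std::vector<int>& counts) {
    const int p = r * c;

    // First pass: count how many entries each rank owns.
    counts.assign(p, 0);
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            ++counts[OwnerRank(i, j, r, c)];

    // Prefix sums give the start of each rank's segment.
    std::vector<int> offset(p, 0);
    for (int q = 1; q < p; ++q)
        offset[q] = offset[q - 1] + counts[q - 1];

    // Second pass: copy each entry into its owner's segment.
    std::vector<double> packed(A.size());
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            packed[offset[OwnerRank(i, j, r, c)]++] =
                A[static_cast<std::size_t>(i) + static_cast<std::size_t>(j) * m];

    return packed;
}
```

With the data packed like this, I believe a single `MPI_Reduce_scatter(packed.data(), recvBuf, counts.data(), MPI_DOUBLE, MPI_SUM, comm)` would leave each rank holding only the summed entries it owns, so the receive side shrinks from a full copy to the local portion. That assumes the communicator's rank order matches the grid ordering above and that `recvBuf` is the DistMatrix's local buffer with leading dimension equal to its local height. The packed send buffer is still one full-size temporary, so this only removes the full-size receive buffer, not every copy.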