C31 needs 256 MB of SRAM built in on the die, while C32 needs 512 MB of SRAM, which would take more than 600 mm^2 of area on a 16nm node. That covers storage only; the cores and the matrix bus between the cores and the memory are also needed. Some compression tricks are needed as well, and the SRAM failure rate has to be handled. So a single 25mm x 25mm die, or even a 25mm x 32mm die, cannot handle it. The 256 MB of SRAM for C31, on the other hand, can be handled in this case and gets good performance.
256/512 MB (C31/C32) is needed for edge memory, but you also need to include node memory, another 256/512 MB pool respectively, for a total of 512/1024 MB of memory in a single chip. Alternatively, you can use TMTO at some performance penalty, e.g. 1/2 C32 performance at 768 MB (512 edge + 512/2 node), or 1/4 hash rate at 640 MB (512 + 512/4).
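To make the arithmetic above easy to check, here is a rough Python sketch. The function names and the `node_split` parameter are made up for illustration. It assumes one bit per edge, a node pool the same size as the edge pool, and that dividing the node pool by k costs a factor of k in performance, as in the 768 MB and 640 MB examples.

```python
def edge_memory_mb(edge_bits):
    """Edge pool in MB, assuming 1 bit for each of the 2**edge_bits edges."""
    return 2 ** edge_bits // 8 // 2 ** 20


def total_memory_mb(edge_bits, node_split=1):
    """Edge pool plus a node pool shrunk by the (hypothetical) TMTO factor node_split."""
    edge = edge_memory_mb(edge_bits)
    node = edge // node_split  # the full node pool equals the edge pool
    return edge + node


def relative_performance(node_split=1):
    """Assumed penalty: performance scales as 1/node_split."""
    return 1 / node_split


for name, bits, split in [("C31", 31, 1), ("C32", 32, 1), ("C32", 32, 2), ("C32", 32, 4)]:
    print(name, f"node/{split}:", total_memory_mb(bits, split), "MB,",
          f"{relative_performance(split):.2f}x performance")
```

With these assumptions the totals come out to 512 MB for C31, and 1024, 768 and 640 MB for C32 at node/1, node/2 and node/4.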
So for C32 the best way is just to have a logic die, an external memory die, and a smart algorithm.
The two issues with ASICs using external memory are granularity (cost) and performance (bandwidth). External-memory ASICs will be worse than a GPU because of memory_granularity/data_bus_width constraints.