Hi Stefano. Your case basically confirms the opinion I've had of MIC so far: even though MIC is x86, just porting code that previously ran with MPI+OpenMP on multicore over to MIC will not be fast - it is an accelerator that has to be programmed with the hardware in mind. OpenMP (up to v4.0 and all its additions) was never designed for these massively parallel shared-memory architectures. Secondly, compared to CUDA/OpenACC, MIC brings one more level of parallelism you have to care about: vectorization. On GPUs you can usually treat your whole data as one big vector; on MIC (and also on Xeons) you need the vectorization plus the multicore parallelization, preferably in separate parallelizable/unrollable loop regions. With respect to the parallelization, I also expect MIC to prefer something in between coarse-grained and fine-grained, while Xeons usually like very coarse-grained (which has to do with context-switching overhead).
Where I could see HF being useful is in that last problem: while it cannot (currently) automatically split a single loop into smaller regions, it gives you the freedom to move the loops 'down' closer to the computations in order to make the parallelization more fine-grained and/or allow vectorization - all while keeping your previous version (which runs well on BlueGene) alive in a unified codebase. You only code the parallel-region wrappers for each version; the computational code is reused. Now, whether this is feasible or not I cannot decide without first having a look at your code. The data structures would be especially interesting - are they 1D arrays? Higher-dimensional arrays? AoS? SoA?
Btw., I'm in Lugano, Switzerland, for a one-week workshop starting July 5th - in case you'd like to meet up.
do i = 1, n-1
   x = 2.0 * x * (1 + x)
end do