    Michel Müller
    @muellermichel
    Dear all. Thank you for your continued interest in Hybrid Fortran. This chat is meant for exchanging ideas about this project, about your Fortran projects, and about potential overlaps between them. For those who haven't seen it yet, new quickstart documentation is available here: https://github.com/muellermichel/Hybrid-Fortran. Hybrid Fortran version 1.0 will be released sometime in mid-summer and will include the current features plus some bugfixes and documentation. FYI, I've started a PhD on this subject this April (making Fortran performance portable to accelerators using preprocessing and static analysis).
    Stefano Zaghi
    @szaghi
    @muellermichel I never found the occasion to tell you how impressive Hybrid-Fortran is! It is a very cool project! Unfortunately, I have not yet found the time to use it, but there is an issue in my todo list... before the END I will try it! Just a question: have you tested Hybrid-Fortran with Intel MIC? One of my bosses has done some tests on MIC, and he reported that a lot of ad hoc tuning must be done to obtain an acceptable speedup... this could be frustrating because the MIC PCI board seems to have a very favorable cpu-power/cost ratio... do you have any experience with this?
    Michel Müller
    @muellermichel
    Dear Stefano. Thank you very much, it's good to hear there is some interest. I have to say that I haven't had the chance to run tests on Intel MIC yet. In theory Hybrid Fortran should be quite capable of doing a good job there, since I expect loop / storage reordering to be a central factor for MIC performance as well. There is still some work to be done to get optimal vectorization without manual reordering (from that point on I think compilers are already doing a good optimization job, e.g. loop unrolling comes 'free' nowadays AFAIK). But you should be able to just use the existing OpenMP parallelization in Hybrid Fortran, choose a good storage order and start from there. As I said, I don't have any benchmarks, so that would be a very interesting result for me as well, and I'd certainly help you along if you choose to pursue it.
    Stefano Zaghi
    @szaghi
    Dear Michel, this is on my wish list, however I cannot commit to any schedule. In a couple of days I should have a meeting with my boss to talk about his MIC tests; maybe I can tell you about his problems after the meeting. Thank you for Hybrid-Fortran, it is great!
    Stefano Zaghi
    @szaghi
    Hi Michel, today I discussed with my boss his tests on Intel MIC. He obtained a decent speedup (fully on the MIC without the host, i.e. with only one MIC accelerator; I am not sure, but I think his MIC should have 64 cores), though not a really exciting one. The not-so-obvious part is that with the baseline code (MPI+OpenMP, very classical, without any special tricks) the speedup on the Blue Gene/Q architecture is very good (up to 10K cores), but the same baseline code on the MIC has the worst speedup I have seen. It seems that false sharing is the main bottleneck. My boss tricked the OpenMP parallel regions with a manual per-thread memory assignment, i.e. OpenMP only creates the pool of threads, whereas the memory assignment and the loop subdivision per thread are done manually (mimicking an MPI decomposition); then the speedup becomes decent (though a lot of problems with vectorization remain).
    Can your Hybrid Fortran handle such dirty problems?
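    For readers following along, a minimal sketch of the pattern described above: OpenMP only spawns the thread pool, and each thread carves out its own contiguous block of the index range by hand, mimicking an MPI decomposition. The array, its size and the update are placeholders, not code from the project discussed here:

        program manual_decomposition
          use omp_lib
          implicit none
          integer, parameter :: n = 1000000
          real, allocatable  :: a(:)
          integer :: tid, nthreads, chunk, ilo, ihi, i

          allocate(a(n))
          a = 1.0

          !$omp parallel private(tid, nthreads, chunk, ilo, ihi, i)
          tid      = omp_get_thread_num()
          nthreads = omp_get_num_threads()
          chunk    = (n + nthreads - 1) / nthreads   ! block size per thread
          ilo      = tid * chunk + 1
          ihi      = min((tid + 1) * chunk, n)
          do i = ilo, ihi                            ! each thread owns one contiguous block
            a(i) = 2.0 * a(i)
          end do
          !$omp end parallel

          print *, sum(a)
        end program manual_decomposition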
    Michel Müller
    @muellermichel

    Hi Stefano. Your case basically confirms the opinion I have had on MIC so far: while MIC is x86, just porting code that previously ran on MPI+OpenMP for multicore over to MIC will not be fast - it is an accelerator that has to be programmed with respect to the hardware. OpenMP (up to v4.0 and all its additions) was never designed for these massively parallel shared-memory architectures. Secondly, compared to CUDA/OpenACC, MIC brings with it one more level of parallelism that you have to care about: the vectorization. On GPUs you can usually treat your whole data set as one big vector; on MIC (and also on Xeons) you need the vector parallelization plus the multicore parallelization, preferably in separate parallelizable/unrollable loop regions (see the sketch below). With respect to the parallelization I also expect MIC to prefer something in between coarse-grained and fine-grained, while Xeons usually like very coarse-grained (which has to do with context-switching overhead).
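    As a generic illustration of that two-level scheme (an invented example, not code from either project): the outer loop is distributed over cores with OpenMP, while the inner, unit-stride loop is left to the vector units, here marked with the OpenMP 4.0 simd directive (the compiler's auto-vectorizer would do as well):

        subroutine update(a, b, nx, ny)
          implicit none
          integer, intent(in) :: nx, ny
          real, intent(in)    :: b(nx, ny)
          real, intent(inout) :: a(nx, ny)
          integer :: i, j

          !$omp parallel do private(i)
          do j = 1, ny            ! multicore level: one j-slab per thread
            !$omp simd
            do i = 1, nx          ! vector level: contiguous, unit-stride accesses
              a(i, j) = a(i, j) + 0.5 * b(i, j)
            end do
          end do
          !$omp end parallel do
        end subroutine update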

    Where I could see HF being useful is in that last problem: while it cannot (currently) separate a single loop into smaller regions automatically, it gives you the freedom to move the loops 'down' closer to the computations in order to make the parallelization more fine-grained and/or allow vectorization - all while keeping your previous version (the one that runs well on Blue Gene) alive in a unified codebase. You only code the parallel region wrappers for each version; the computational code is reused (see the sketch below). Now, whether this is feasible or not I cannot decide without first having a look at your code. The data structures in particular would be interesting - are they 1D arrays? Higher-dimensional arrays? AOS? SOA?
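    To make that idea concrete in plain Fortran (Hybrid Fortran expresses it with its own directive syntax; everything below is an invented example): the point-wise computation is written once, and only the parallel wrappers differ between the coarse-grained and the fine-grained version:

        module relax_kernels
          implicit none
        contains
          pure subroutine relax_point(a, r, i, j)
            real, intent(inout) :: a(:,:)
            real, intent(in)    :: r(:,:)
            integer, intent(in) :: i, j
            a(i, j) = a(i, j) + 0.25 * r(i, j)   ! computational code, reused by both wrappers
          end subroutine relax_point

          ! Coarse-grained wrapper (multicore CPU / Blue Gene style):
          ! the parallelism sits on the outer j loop only.
          subroutine step_coarse(a, r, nx, ny)
            real, intent(inout) :: a(:,:)
            real, intent(in)    :: r(:,:)
            integer, intent(in) :: nx, ny
            integer :: i, j
            !$omp parallel do private(i)
            do j = 1, ny
              do i = 1, nx
                call relax_point(a, r, i, j)
              end do
            end do
            !$omp end parallel do
          end subroutine step_coarse

          ! Fine-grained wrapper (accelerator style): the loops are moved 'down'
          ! so that every (i, j) point becomes an independent work item.
          subroutine step_fine(a, r, nx, ny)
            real, intent(inout) :: a(:,:)
            real, intent(in)    :: r(:,:)
            integer, intent(in) :: nx, ny
            integer :: i, j
            !$omp parallel do collapse(2)
            do j = 1, ny
              do i = 1, nx
                call relax_point(a, r, i, j)
              end do
            end do
            !$omp end parallel do
          end subroutine step_fine
        end module relax_kernels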

    Btw. I'm in Lugano, Switzerland for a one-week Workshop from July 5th - in case you'd like to meet up.

    Stefano Zaghi
    @szaghi
    :-) For some months I cannot leave Rome... I have a 1-month-old daughter...
    The code in question is a CFD URANSE solver: 3D arrays, multiblock (Chimera). MPI decomposes the blocks over processes, and threads parallelize the loops within each block (vectorization is usually exploited on the inner-loop data, but we see that on MIC the compiler fails at it; we must re-arrange the computations...). The code has some similarities with my OFF (both are finite volume). Your suggestion of wrapping the OpenMP regions is interesting, I will think about it; maybe I will try Hybrid Fortran sooner than I expected :-)
    Michel Müller
    @muellermichel
    Congratulations, man. My wife is also expecting, due in early October - getting nervous ;).
    What would be interesting to know is how exactly you need to "re-arrange your computations". Basically, if inlining is the problem, HF could help you there. If loop unrolling is the problem, I don't think it would help (HF does not do any unrolling since I expect compilers to be good enough at this). Why did the vectorization fail on the MIC?
    Michel Müller
    @muellermichel
    Btw. this isn't the first time I've heard that MIC implementations face most of their problems with vectorization. I think Intel would have done better if they had bet on OpenCL and OpenACC from the start (abstracting vector + multicore into one big vector, like on GPUs), instead of promoting MIC as a drop-in replacement for OpenMP codes.
    Stefano Zaghi
    @szaghi
    It seems that the Intel compiler, which we are using, is not able to vectorize on MIC (while it does on Xeon, for example). Anyhow, these were only unofficial impressions... I must look at the results and the profilings in more detail. Good luck with your October launch :-)
    Henrik Holst
    @hholst80
    What is Hybrid Fortran?
    Stefano Zaghi
    @szaghi
    @hholst80 sorry for my delay, unfortunately Gitter alerts are often suppressed on my workstation because I leave the Gitter app open... I guess that you have already learned about Hybrid Fortran, but just to be sure, the project home page is here: https://github.com/muellermichel/Hybrid-Fortran. In a few words, it is a sort of high-level layer (based on directives similar to OpenMP) that extends the Fortran language to enable great support for GPGPU accelerators. @muellermichel is the developer of this great project: ask him for more details. I have not yet found the time to test it, but its specifications are great!
    Michel Müller
    @muellermichel
    I didn't get the notification at all :(
    Stefano Zaghi
    @szaghi
    @muellermichel I still have not understood the real reason, but I feel that if I leave the Gitter app open on one device, the email notifications are not sent, while they do arrive if all the Gitter apps are closed.
    Michel Müller
    @muellermichel
    Thanks. Yes, it's strange. Anyways, @hholst80, are you still interested? If yes, do you need a heads-up on Hybrid Fortran?
    f
    @f_kazemian_twitter
    Hi
    If I have a do loop like this:
    do i=1,n-1
      x = 2.0 * x * (1+x)
    end do
    How can I parallelize it or change the code to make it run faster?
    Ricardo Bánffy
    @rbanffy
    Ghost
    @ghost~59949097d73408ce4f71ac64
    Hello world
    I am a CS undergrad, currently studying High Performance Computing. We are supposed to make a mini project based only on OpenMP, without using MPI. The project span is 3 weeks. Things I have already done using OpenMP:
    • Calculation of Pi
    • Block Matrix Multiplication
    • Vector Multiplication
    Any suggestions for a suitable project which will teach me a lot but can also be done in 3 weeks?
    Michel Müller
    @muellermichel
    I'm really sorry, gitter doesn't send me any notifications :(. Have to figure out why.
    @f_kazemian_twitter : depends on how large "n" is.
    where are you indexing anything with "i"? are you really running the same operation n-1 times?
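    To make that point concrete, a minimal sketch (the initial value, n, and the array-based variant are assumptions of mine): as written, each iteration feeds the previous x into the next one, so the loop is a sequential recurrence and cannot simply be split across threads; if instead each iteration produced an independent result per index i, it would parallelize trivially with OpenMP:

        program loop_example
          implicit none
          integer, parameter :: n = 1000000
          real :: x
          real, allocatable :: y(:)
          integer :: i

          ! Original form: loop-carried dependence on x -> inherently serial.
          x = -0.5                       ! placeholder initial value
          do i = 1, n-1
            x = 2.0 * x * (1.0 + x)
          end do

          ! Hypothetical independent form: each y(i) depends only on i,
          ! so the iterations can be distributed over threads.
          allocate(y(n-1))
          !$omp parallel do
          do i = 1, n-1
            y(i) = 2.0 * real(i) * (1.0 + real(i))
          end do
          !$omp end parallel do

          print *, x, sum(y)
        end program loop_example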
    Michel Müller
    @muellermichel
    @97amarnathk : One thing you could do in 3 weeks is learn to do matrix & vector multiplication on the GPU. With Hybrid Fortran it would take you much less time - probably a few hours, mainly to get the system and compiler set up - however, to understand what's going on in the backend you'll need some more time.
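    As a starting point for such an exercise, a minimal sketch (not Hybrid Fortran syntax; the names, sizes and the use of OpenMP 4.5 target directives are assumptions of mine about one generic way to offload this): a naive matrix multiplication kernel offloaded to an accelerator:

        subroutine matmul_offload(a, b, c, n)
          implicit none
          integer, intent(in) :: n
          real, intent(in)    :: a(n, n), b(n, n)
          real, intent(out)   :: c(n, n)
          integer :: i, j, k
          real :: s

          ! Each (i, j) entry of c is an independent work item on the device.
          !$omp target teams distribute parallel do collapse(2) private(s, k) map(to: a, b) map(from: c)
          do j = 1, n
            do i = 1, n
              s = 0.0
              do k = 1, n
                s = s + a(i, k) * b(k, j)
              end do
              c(i, j) = s
            end do
          end do
          !$omp end target teams distribute parallel do
        end subroutine matmul_offload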
    Michel Mueller
    @mueller_michel_twitter
    test
    Michel Müller
    @muellermichel
    test