These are chat archives for Fortran-FOSS-Programmers/General-Discussion

6th
May 2017
Damian Rouson
@rouson
May 06 2017 01:25
I haven't followed this discussion in detail. As you guys know, I'll respond a lot more in calls than text of any form. I just can't keep up with all the text flying by me every day. Maybe it's a sign of my age. What caught my eye was @cmacmackin's mention of guard_temp and clean_temp. Whoa... that's old-school. Presumably you picked this up from the 2003 paper by G.W. Stewart in ACM Fortran Forum -- not sure if Markdown syntax works here. Say it isn't so! If you're doing such things under the hood and are really confident that you have a scheme to get it right and that users of your code will never need it, then it's OK as a last resort. Otherwise, whatever led you down this path has to get fixed in the compiler or it's sure to come back to haunt users down the road. I did similar things in a paper from 2006, but in a limited setting (expression evaluation) with a clearly articulated (published) strategy that I'd like to think covered all the cases we cared about. We at least made an attempt to automate such matters in the papers that I've sent @szaghi, so I'd recommend that route, if it's applicable, over guard_temp and clean_temp. I really hope the compiler situation hasn't set us back 14 years. That would be a travesty.
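For readers who haven't met the idiom: below is a minimal sketch of what a guard_temp/clean_temp scheme looks like, assuming Stewart-style reference counting. The type, names, and bookkeeping are illustrative, not taken from Chris's actual code.

module temp_sketch
   implicit none

   type :: vec
      real, allocatable :: x(:)
      integer           :: temporary = 0 ! > 0 flags an intermediate (operator) result
      contains
         procedure :: guard_temp
         procedure :: clean_temp
   endtype vec

   contains
      subroutine guard_temp(self)
      ! protect a temporary that is about to be used, so a nested clean_temp cannot free it too early
      class(vec), intent(inout) :: self

      if (self%temporary > 0) self%temporary = self%temporary + 1
      endsubroutine guard_temp

      subroutine clean_temp(self)
      ! drop one guard; free the payload once the last guard is gone
      class(vec), intent(inout) :: self

      if (self%temporary > 1) self%temporary = self%temporary - 1
      if (self%temporary == 1) then
         if (allocated(self%x)) deallocate(self%x)
         self%temporary = 0
      endif
      endsubroutine clean_temp
endmodule temp_sketch

Each defined operator then flags its result as a temporary and brackets its use of operands with guard_temp/clean_temp calls; the automated approach Damian mentions moves this bookkeeping out of user code.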
Damian Rouson
@rouson
May 06 2017 01:32
Also, I don't think the way to think about PURE is in terms of whether the attribute in and of itself speeds up code. That's pretty unlikely. I think of PURE and DO CONCURRENT in terms of the discipline they impart on the programmer to do things that can aid in optimization. If you write functions that conform to the restrictions that PURE requires but don't mark them PURE, a sufficiently smart compiler can do all the same related optimizations anyway. In fact, gfortran tries to detect whether your procedure could have been marked PURE even if it wasn't, and marks such procedures as implicitly pure internally. Likewise, if you do the things that DO CONCURRENT requires but write a regular DO loop, a sufficiently smart compiler will be able to optimize the code in the ways that DO CONCURRENT affords (and you will also find it easier to use other speedup strategies such as multithreading with OpenMP).

Then the question becomes the reverse: if I violate the requirements of PURE and DO CONCURRENT, what compiler optimizations am I preventing? Framing it this way also shows how difficult the question is to answer, because one then has to ask, well... how badly am I going to violate the restrictions? The sky's the limit, and one can slow code down quite considerably that way if one wants to do so. For example, PURE doesn't allow I/O, so you can slow a code down as much as you want, in direct proportion to the size of the file you read or write. It's kind of like when someone goes to the doctor and says, "It hurts when I do this," and the doctor responds simply, "Then don't do that."
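As a concrete illustration of that discipline (a minimal sketch, not code from this discussion): the function below conforms to the PURE restrictions, and the loop satisfies the DO CONCURRENT requirements, which is exactly what licenses the compiler to reorder, vectorize, or parallelize it.

module pure_sketch
   use, intrinsic :: iso_fortran_env, only : real64

   implicit none

   contains
      pure function fma(a, x, y) result(z)
      ! conforms to all PURE restrictions: intent(in) arguments, no I/O, no
      ! access to mutable global state; the result depends only on the arguments
      real(real64), intent(in) :: a
      real(real64), intent(in) :: x
      real(real64), intent(in) :: y
      real(real64)             :: z

      z = a * x + y
      endfunction fma
endmodule pure_sketch

program pure_demo
   use, intrinsic :: iso_fortran_env, only : real64
   use pure_sketch, only : fma
   implicit none
   real(real64) :: x(1000), y(1000)
   integer      :: i

   x = 1._real64
   y = 2._real64
   ! each iteration touches only y(i) and calls a pure function: no I/O and no
   ! cross-iteration data flow, so the loop is a legal DO CONCURRENT
   do concurrent(i=1:size(y))
      y(i) = fma(2._real64, x(i), y(i))
   enddo
   print*, y(1), y(size(y))
endprogram pure_demo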
Stefano Zaghi
@szaghi
May 06 2017 04:33

@rouson , Damian,
thank you for your reply.

I haven't followed this discussion in detail. As you guys know, I'll respond a lot more in calls than text of any form. I just can't keep up with all the text flying by me every day. Maybe it's a sign of my age.

:smile: On the contrary, my bad spoken English prevents me from calling you almost all the time...

What caught my eye was @cmacmackin's mention of guard_temp and clean_temp.

Good to know, I'll use that word when I really need to catch your attention :smile:

I don't think the way to think about PURE is in terms of whether the attribute in and of itself speeds up code ... if I violate the requirements of PURE and DO CONCURRENT, what compiler optimizations am I preventing?

This is exactly my point: what I would like to say to Chris is that the polymorphic allocatable version violates the pure conditions, thus it likely hinders the optimizer, whereas the non-polymorphic operators version is pure (in its contents and with the explicit attribute), thus it is likely easier to optimize. I was not concerned about the declaration, but rather about the actual contents. Moreover, I pointed out to Chris that the performance I gained is more likely due to the fact that the math operators now act on plain real arrays, so the compiler optimizer can be even more favored.

The performance comparison for Chris has not yet started: my cluster and my workstation are crunching numbers this weekend, so it will come next week. However, I did a more "synthetic" test to evaluate the defined-operators overhead in different circumstances. I compared:

  • user-defined operators acting on/returning real arrays declared as automatic arrays, which are likely allocated on the stack;
  • user-defined operators acting on/returning real arrays declared as allocatable arrays, which are likely allocated on the heap;
  • user-defined operators acting on/returning polymorphic allocatable classes, which are likely allocated on the heap;
  • plain intrinsic operators acting on real arrays, as the reference.

My results were dramatic: all user-defined operators have at least 50% overhead with respect to plain intrinsic operators, with, in general, the polymorphic version being the worst, followed by the automatic-arrays one, and with the allocatable-array version being the best. I would really like to know your opinions. My test results can be found online here and the test is this. For the sake of clarity, I report the Fortran code below. I hope I have made some design mistakes in the test, because the overhead is really not negligible. Are these results what you would expect?

User defined operators overhead hunter

! A DEFY (DEmystify Fortran mYths) test.
! Author: Stefano Zaghi
! Date: 2017-05-05
!
! License: this file is licensed under the Creative Commons Attribution 4.0 license,
! see http://creativecommons.org/licenses/by/4.0/ .

module arrays
   use, intrinsic :: iso_fortran_env, only : real64

   implicit none

   type :: array_automatic
      integer                   :: n
      real(real64), allocatable :: x(:)
      contains
         procedure, pass(lhs) :: add_automatic
         generic :: operator(+) => add_automatic
         procedure, pass(lhs) :: assign_automatic
         generic :: assignment(=) => assign_automatic
   endtype array_automatic

   type :: array_allocatable
      integer                   :: n
      real(real64), allocatable :: x(:)
      contains
         procedure, pass(lhs) :: add_allocatable
         generic :: operator(+) => add_allocatable
         procedure, pass(lhs) :: assign_allocatable
         generic :: assignment(=) => assign_allocatable
   endtype array_allocatable

   type, abstract :: array_polymorphic_abstract
      contains
         procedure(add_interface), pass(lhs), deferred :: add_polymorphic
         generic :: operator(+) => add_polymorphic
         procedure(assign_interface),      pass(lhs), deferred :: assign_polymorphic
         procedure(assign_real_interface), pass(lhs), deferred :: assign_polymorphic_real
         generic :: assignment(=) => assign_polymorphic, assign_polymorphic_real
   endtype array_polymorphic_abstract

   type, extends(array_polymorphic_abstract) :: array_polymorphic
      integer                   :: n
      real(real64), allocatable :: x(:)
      contains
         procedure, pass(lhs) :: add_polymorphic
         procedure, pass(lhs) :: assign_polymorphic
         procedure, pass(lhs) :: assign_polymorphic_real
   endtype array_polymorphic

   abstract interface
      pure function add_interface(lhs, rhs) result(opr)
      import :: array_polymorphic_abstract
      class(array_polymorphic_abstract), intent(in)  :: lhs
      class(array_polymorphic_abstract), intent(in)  :: rhs
      class(array_polymorphic_abstract), allocatable :: opr
      endfunction add_interface

      pure subroutine assign_interface(lhs, rhs)
      import :: array_polymorphic_abstract
      class(array_polymorphic_abstract), intent(inout) :: lhs
      class(array_polymorphic_abstract), intent(in)    :: rhs
      endsubroutine assign_interface

      pure subroutine assign_real_interface(lhs, rhs)
      import :: array_polymorphic_abstract, real64
      class(array_polymorphic_abstract), intent(inout) :: lhs
      real(real64),                      intent(in)    :: rhs(1:)
      endsubroutine assign_real_interface
   endinterface

   contains
      pure function add_automatic(lhs, rhs) result(opr)
      class(array_automatic), intent(in) :: lhs
      type(array_automatic),  intent(in) :: rhs
      real(real64)                       :: opr(1:lhs%n)

      opr = lhs%x + rhs%x
      endfunction add_automatic

      pure subroutine assign_automatic(lhs, rhs)
      class(array_automatic), intent(inout) :: lhs
      real(real64),           intent(in)    :: rhs(1:)

      lhs%n = size(rhs, dim=1)
      lhs%x = rhs
      endsubroutine assign_automatic

      pure function add_allocatable(lhs, rhs) result(opr)
      class(array_allocatable), intent(in) :: lhs
      type(array_allocatable),  intent(in) :: rhs
      real(real64), allocatable            :: opr(:)

      opr = lhs%x + rhs%x
      endfunction add_allocatable

      pure subroutine assign_allocatable(lhs, rhs)
      class(array_allocatable), intent(inout) :: lhs
      real(real64),             intent(in)    :: rhs(1:)

      lhs%n = size(rhs, dim=1)
      lhs%x = rhs
      endsubroutine assign_allocatable

      pure function add_polymorphic(lhs, rhs) result(opr)
      class(array_polymorphic),          intent(in)  :: lhs
      class(array_polymorphic_abstract), intent(in)  :: rhs
      class(array_polymorphic_abstract), allocatable :: opr

      allocate(array_polymorphic :: opr)
      select type(opr)
      class is(array_polymorphic)
         opr%n = lhs%n ! also set the size component, otherwise assign_polymorphic copies an undefined value
         select type(rhs)
         class is(array_polymorphic)
            opr%x = lhs%x + rhs%x
         endselect
      endselect
      endfunction add_polymorphic

      pure subroutine assign_polymorphic(lhs, rhs)
      class(array_polymorphic),          intent(inout) :: lhs
      class(array_polymorphic_abstract), intent(in)    :: rhs

      select type(rhs)
      class is(array_polymorphic)
         lhs%n = rhs%n
         lhs%x = rhs%x
      endselect
      endsubroutine assign_polymorphic

      pure subroutine assign_polymorphic_real(lhs, rhs)
      class(array_polymorphic), intent(inout) :: lhs
      real(real64),             intent(in)    :: rhs(1:)

      lhs%n = size(rhs, dim=1)
      lhs%x = rhs
      endsubroutine assign_polymorphic_real
endmodule arrays

program defy
   use, intrinsic :: iso_fortran_env, only : int64, real64
   use arrays, only : array_automatic, array_allocatable, array_polymorphic
   implicit none
   real(real64), allocatable :: a_intrinsic(:)
   real(real64), allocatable :: b_intrinsic(:)
   real(real64), allocatable :: c_intrinsic(:)
   type(array_automatic)     :: a_automatic
   type(array_automatic)     :: b_automatic
   type(array_automatic)     :: c_automatic
   type(array_allocatable)   :: a_allocatable
   type(array_allocatable)   :: b_allocatable
   type(array_allocatable)   :: c_allocatable
   type(array_polymorphic)   :: a_polymorphic
   type(array_polymorphic)   :: b_polymorphic
   type(array_polymorphic)   :: c_polymorphic
   integer(int64)            :: tic_toc(1:2)
   integer(int64)            :: count_rate
   real(real64)              :: intrinsic_time
   real(real64)              :: time
   integer                   :: N
   integer                   :: Nn
   integer                   :: i

   N = 100000
   Nn = N/100
   a_intrinsic   = [(real(i, kind=real64), i=1,N)]
   b_intrinsic   = [(real(i, kind=real64), i=1,N)]
   a_automatic   = [(real(i, kind=real64), i=1,N)]
   b_automatic   = [(real(i, kind=real64), i=1,N)]
   a_allocatable = [(real(i, kind=real64), i=1,N)]
   b_allocatable = [(real(i, kind=real64), i=1,N)]
   a_polymorphic = [(real(i, kind=real64), i=1,N)]
   b_polymorphic = [(real(i, kind=real64), i=1,N)]

   call system_clock(tic_toc(1), count_rate)
   do i=1, Nn
     c_intrinsic = a_intrinsic + b_intrinsic
   enddo
   call system_clock(tic_toc(2), count_rate)
   intrinsic_time = (tic_toc(2) - tic_toc(1)) / real(count_rate, kind=real64)
   print*, 'intrinsic: ', intrinsic_time

   call system_clock(tic_toc(1), count_rate)
   do i=1, Nn
     c_automatic = a_automatic + b_automatic
   enddo
   call system_clock(tic_toc(2), count_rate)
   time = (tic_toc(2) - tic_toc(1)) / real(count_rate, kind=real64)
   print*, 'automatic: ', time, ' + %(intrinsic): ', 100._real64 - intrinsic_time / time * 100

   call system_clock(tic_toc(1), count_rate)
   do i=1, Nn
     c_allocatable = a_allocatable + b_allocatable
   enddo
   call system_clock(tic_toc(2), count_rate)
   time = (tic_toc(2) - tic_toc(1)) / real(count_rate, kind=real64)
   print*, 'allocatable: ', time, ' + %(intrinsic): ', 100._real64 - intrinsic_time / time * 100

   ! (the archive truncates the snippet here; the final timing block below is
   ! reconstructed from the pattern of the preceding two blocks)
   call system_clock(tic_toc(1), count_rate)
   do i=1, Nn
     c_polymorphic = a_polymorphic + b_polymorphic
   enddo
   call system_clock(tic_toc(2), count_rate)
   time = (tic_toc(2) - tic_toc(1)) / real(count_rate, kind=real64)
   print*, 'polymorphic: ', time, ' + %(intrinsic): ', 100._real64 - intrinsic_time / time * 100
endprogram defy
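(A plausible way to build and run the test above, assuming the file is named defy.f90: gfortran -O3 defy.f90 -o defy && ./defy. Absolute timings, and therefore the overhead percentages, will vary with compiler, version, and optimization flags.)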
Stefano Zaghi
@szaghi
May 06 2017 04:44
@rouson , Damian I forgot this...

I just can't keep up with all the text flying by me every day.

This is the price of being the most experienced and kindest Fortran programmer available :smile: To limit spam like mine you could only become less kind, but I hope that never happens!

Chris MacMackin
@cmacmackin
May 06 2017 09:53
@rouson I find it odd that you feel guard_temp and clean_temp are "old school", because you explicitly mention them in chapter 5 of your (relatively) recent book. The 2011 and 2012 papers you sent @szaghi definitely offer a more elegant approach, but they rely on finalisation. Unfortunately, gfortran still doesn't fully support finalisation and doesn't perform it on function results. I don't see how I can use your automated process without it.
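A minimal sketch of the gap Chris is pointing at (illustrative code, hypothetical names; not his code base): a FINAL procedure that merely reports its own execution never fires for function results under the gfortran of that era, so any cleanup keyed to finalisation silently skips operator and function results.

module final_sketch
   implicit none

   type :: traced
      integer :: id = 0
      contains
         final :: note_death
   endtype traced

   contains
      function make(id) result(t)
      integer, intent(in) :: id
      type(traced)        :: t

      t%id = id
      endfunction make

      subroutine note_death(self)
      type(traced), intent(inout) :: self

      print*, 'finalizing object ', self%id
      endsubroutine note_death
endmodule final_sketch

program final_demo
   use final_sketch, only : traced, make
   implicit none
   type(traced) :: t

   ! a conforming compiler finalizes the discarded function-result temporary
   ! right after this assignment (printing a message); gfortran, as of 2017,
   ! skipped finalization of function results entirely, so cleanup hooked on
   ! FINAL procedures never ran for such temporaries
   t = make(1)
   print*, t%id
endprogram final_demo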
Damian Rouson
@rouson
May 06 2017 15:40
That's because I consider my book to be old school too! My book was submitted to the publisher in August 2010, which is centuries ago in the Internet era. :D I've learned a lot since then, and both the language and the compilers have advanced a lot since then. If I recall correctly, the Fortran 2008 standard was published in October 2010, so the official language standard at the time the book was submitted was Fortran 2003. Back then, there was only one Fortran 2003 compliant compiler: IBM. In fact, there was no compiler in existence that could correctly compile the one Fortran 2008 example in the book, the coarray Burgers equation solver in chapter 12 -- not even the Cray compiler, and Cray invented coarrays. That was the only code in the book that we could not test before publishing.

Fast forward to today and we have four Fortran 2003 compilers: IBM, Cray, Portland Group, and Intel. NAG is extremely close to full 2003 compliance (anything missing is probably minor, and I imagine their next release will offer full compliance). And GNU is only missing one major 2003 feature: parameterized derived types (PDTs), which I expect gfortran developer Paul Richard Thomas will start implementing soon. Moreover, we now have two Fortran 2008 compilers: Cray and the Intel beta release. IBM is only missing one major 2008 feature: coarrays. And GNU is only missing one major 2008 feature: the aforementioned 2003 feature (PDTs).

And the landscape is quite rosy even when one jumps forward to the upcoming Fortran 2015 standard. The major new features in Fortran 2015 are summarized in two Technical Specification (TS) documents: TS 29113 Further Interoperability with C and TS 18508 Additional Parallel Features. Four compilers already support most or all of TS 29113: IBM, Cray, Intel, and GNU. Two compilers already support parts of TS 18508: Cray and GNU. And it gets even better: GNU is only missing one new Fortran 2015 feature, teams (I believe I've found a Ph.D. student who is likely to work on adding support for it, a multi-year effort). And it gets even better than that: the 2015 standard makes Fortran the first mainstream language to offer support for fault tolerance, and last week's GCC 7.1 release supports that very feature: failed-image detection and simulation. Using the latter requires an unreleased branch of OpenCoarrays, so I haven't made any big announcements yet, but it's a huge deal for anyone interested in reaching exaflop performance on future platforms.

In short, this is a new world! Think about this unrelated but interesting fact: a paper from 2003 was written before the multi-core era, and now we're exiting the multi-core era and entering the many-core era, with Intel's Knights Landing chip having roughly 72 cores. The pace of change is mind-blowing. :O
Please send me a list of gfortran bug reports related to finalization and consider whether your organization can make a donation in support of fixing those bugs. We've got to move on from the old days.
Chris MacMackin
@cmacmackin
May 06 2017 17:08
From a quick search, I have found the following open finalisation bug reports: 37336, 64290, 67471, 67444, 65347, 59694, 80524, 79311, 71798, 70863, 69298, 68778.
Some of those bugs are duplicates. I'm only a student, so I'm doubtful I'd be able to persuade anyone to make a donation. You never know, though--sometimes there is money left in a grant that's about to expire which they're looking to spend on something.
Truth be told, I'm getting really frustrated with Fortran. If I didn't already have so much effort invested in my Fortran code base, I'd probably switch to another language. There are so many bugs related to object-oriented programming in gfortran and ifort, and I'm getting sick of having to work around them. Memory management is a massive pain and not something I want to be thinking about as a programmer. Fortran is also extremely verbose, and it takes considerably longer to write code in it than in more concise languages.
Stefano Zaghi
@szaghi
May 06 2017 19:34

@cmacmackin @rouson ,

Damian, you know how highly I think of you, but I respectfully disagree: the world could change, but it currently has not. Intel and GNU have so many OOP bugs that claiming full support of the 2003 or even the 2008 standard for those compilers is premature. Maybe the world will change next year, but in 2017 I am really in trouble doing OOP in Fortran.

I would really like to hear your new idea about functional programming, but I am skeptical: if defined operators have as big an overhead as I showed above, how can functional programming be suitable for HPC? In HASTY I tried to do a really useful, but not so complex, thing with CAF and it is blocked by compiler bugs...

Chris,

Truth be told, I'm getting really frustrated with Fortran. If I didn't already have so much effort invested in my Fortran code base, I'd probably switch to another language. There are so many bugs related to object-oriented programming in gfortran and ifort, and I'm getting sick of having to work around them. Memory management is a massive pain and not something I want to be thinking about as a programmer.

I am not as young as you, but my feeling is really the same: if I had not invested so much in Fortran, I would likely have switched to some other language two years ago. Probably I'll try to invest more in Python: I see more and more HPC courses about "optimizing Python for number-crunching". Python's performance is the worst I could imagine, but OOP is really a "new world" in Python.

Cheers