    Thomas Rolinger
    @thomasrolinger

    The default iterators provided for BlockDist arrays/domains will essentially implement the behavior you have for the coforall. So when you say forall a in A and A is indeed a block distributed array, it will do what you want (i.e., execute across the locales).

    See the first example shown here: https://chapel-lang.org/docs/modules/dists/BlockDist.html
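    For instance, here is a minimal sketch in the spirit of that first docs example (a 1-D variant; the exact declaration style may differ slightly between Chapel versions):

    use BlockDist;

    // A 1-D index space distributed in blocks across all locales:
    const Space = {1..8};
    const D: domain(1) dmapped Block(boundingBox=Space) = Space;
    var A: [D] int;

    // Each iteration executes on the locale that owns its element,
    // so this fills A with the owning locale's id:
    forall a in A do
      a = a.locale.id;

    writeln(A);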

    Zhihui Du
    @zhihuidu
    Thanks for the example.
    So you mean
    forall a in A do
      a = a.locale.id;
    will be executed across all locales based on A's distribution? I think only locale 0 will execute the forall loop. Maybe I am wrong.
    Thomas Rolinger
    @thomasrolinger
    Chapel is not SPMD, so yes, locale 0 will "execute" the forall assuming it isn't nested in another structure like a coforall. The forall iterations are chunked up and assigned to tasks located across the locales. Those iterations are then executed on the different locales. Brad and others may have an easier way of explaining the overall parallelism model.
    Zhihui Du
    @zhihuidu
    If we have a case like
    forall i in 1..100 {
    }
    it is hard to see how Chapel can assign the iterations to different locales. I am not sure how the compiler works at the low level.
    If we access multiple distributed arrays inside the forall loop, will the Chapel compiler first fetch all the values from the other locales to locale 0, compute the result, and then update the values on the other locales (so that the other locales only provide and receive data, doing no computation)? Or can the compiler ask the other locales to execute some iterations locally?
    Thomas Rolinger
    @thomasrolinger
    In the case of that forall loop that simply iterates over a range, it does not assign iterations across locales. As I mentioned earlier, block-distributed arrays and domains have default iterators that do the "magic" of distributing work.
    Zhihui Du
    @zhihuidu
    Thanks! So only the resources of locale 0 can be used for the parallel iterations.
    Thomas Rolinger
    @thomasrolinger
    In the above forall i in 1..100, yes, the default behavior is to utilize just the cores/resources on locale 0 (assuming no other enclosing structures; what I mean by this is that locale 0 is not necessarily the locale that will execute that loop. It will be whatever locale you are currently on. You could have had something like on Locales[1] { forall i in 1..100 } and that would execute on locale 1.)
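    To make that concrete, here is a small sketch (assuming the program is launched on at least two locales and starts on locale 0):

    // A forall over a range uses only the cores of the locale it starts on:
    forall i in 1..100 do
      assert(here.id == 0);    // all iterations run on locale 0

    on Locales[1] {
      forall i in 1..100 do
        assert(here.id == 1);  // now all iterations run on locale 1
    }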
    Zhihui Du
    @zhihuidu
    I understand this now.
    My question is: if we have to access data on different locales within the loop, how does the compiler handle it?
    Thomas Rolinger
    @thomasrolinger
    At runtime, there are checks to see whether data is local or remote (some checks can be done at compile time in specific cases, to statically determine that something will be local). For remote accesses, communication is performed (i.e., a get/put).
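    A programmer can observe the same distinction by hand with the .locale query; a small sketch (assuming multiple locales):

    use BlockDist;

    const D = {1..8} dmapped Block(boundingBox={1..8});
    var A: [D] int;

    for i in D {
      if A[i].locale == here then
        writeln(i, " is local: a plain load/store");
      else
        writeln(i, " is remote: the runtime performs a get/put");
    }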
    Zhihui Du
    @zhihuidu
    I see. Thanks!
    Kevin Waters
    @kwaters4
    What are the current best practices for writing a file to binary with iostyle being deprecated? Is that still being developed?
    Is channel.writeBinary the only current alternative?
    4 replies
    Brad Chamberlain
    @bradcray
    @zhihuidu : I’m only seeing this conversation now. Let me take a different stab at @thomasrolinger’s very excellent answers above, just to try to reinforce some key things:
    In Chapel, the behavior of a parallel loop like forall i in expr… depends entirely on expr. Specifically, if expr is an invocation of a parallel iterator, the recipe for how many tasks to use for the loop, where they should run, and how the loop’s iterations should be divided between those tasks is defined within the iterator. If expr is instead a variable, then there will be a default parallel iterator defined on that variable’s type which specifies these things (and if there is not, the parallel loop will not be legal).
    As a specific example, if expr is a range (like 1..n) or domain (like {1..n, 1..n}) or array (like var A: [1..n] real;) then we will invoke the default parallel iterator for the corresponding type.
    The default parallel iterators for ranges, domains, and non-distributed arrays will only ever use local resources to execute the iterations of the loop. Thus, forall i in 1..n or forall i in {1..n, 1..n} or forall a in A (for my A above) will only use local tasks to execute iterations.
    By contrast, parallel iterators over distributed domains or arrays (like Block- or Cyclic-distributed arrays) will use tasks on all locales that own a part of the distributed index space.
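    A compact way to see the contrast (a sketch; run with more than one locale):

    use BlockDist;

    config const n = 1_000;

    var A: [1..n] int;                                    // non-distributed
    forall a in A do a = here.id;                         // local tasks only

    const BD = {1..n} dmapped Block(boundingBox={1..n});
    var B: [BD] int;                                      // Block-distributed
    forall b in B do b = here.id;                         // tasks on all locales

    writeln(max reduce A);  // 0: every iteration ran on the current locale
    writeln(max reduce B);  // numLocales-1: iterations ran everywhere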
    Brad Chamberlain
    @bradcray
    And then, as Thomas says, orthogonal to all this, any task can refer to any variable within its lexical scope, whether it is local or remote, due to Chapel’s global address space. It can also reason about where that variable is using the .locale query and compare that to where it is running using the built-in here variable.
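    For example (a small sketch using the .locale query and here, as described above):

    var x: int;                    // x lives on locale 0
    on Locales[numLocales-1] {
      x = 42;                      // legal even though x may be remote
      writeln("x lives on locale ", x.locale.id,
              "; this task is running on locale ", here.id);
    }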
    Have a good weekend!
    Zhihui Du
    @zhihuidu
    @bradcray Thanks, and your answer makes me realize that my understanding of the parallel forall loop was limited. Based on your description, I have learned more about Chapel's iterator functions. If A is a distributed array, even though I did not put the forall a in A inside a coforall loc in Locales { on loc { } } structure, the default parallel iterator knows that A is a distributed array, so the loop over A will also be executed on different locales instead of just on locale 0. Is this correct?
    Brad Chamberlain
    @bradcray

    @zhihuidu : That’s correct (and in fact, you would not want to put the forall into a coforall loc in Locales on loc loop, or you’d end up doing a bunch of redundant work, since each task on each locale would invoke A’s default parallel iterator).

    Think of forall a in A as translating to forall a in A.defaultParallelIterator() where .defaultParallelIterator() is currently spelled .these() in Chapel.
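    To illustrate that the iterator itself owns the parallel recipe, here is a hedged sketch of a user-defined standalone parallel iterator (myIter is an illustrative name; real default iterators also provide leader/follower forms):

    iter myIter(n: int) {                       // serial version
      for i in 1..n do yield i;
    }

    iter myIter(n: int, param tag: iterKind)    // standalone parallel version
        where tag == iterKind.standalone {
      const numTasks = here.maxTaskPar;         // one task per local core
      coforall t in 0..#numTasks do
        for i in t*n/numTasks+1 .. (t+1)*n/numTasks do
          yield i;
    }

    var total: atomic int;
    forall i in myIter(100) do total.add(i);    // invokes the standalone form
    writeln(total.read());                      // 5050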

    Zhihui Du
    @zhihuidu
    @bradcray Thanks. Now I understand why I got worse performance when I put a forall a in A inside a coforall loc in Locales on loc loop. At first, I thought the forall a in A loop could only run on locale 0.
    npadmana
    @npadmana
    Hi all -- I think this is related to the binary I/O question from above -- is there a way to read an entire array in binary from a file (without looping over it element by element, assuming a contiguous array, etc.)? channel.read used to do this, but it doesn't look like it works anymore, and readBinary doesn't seem to accept an array?
    26 replies
    Tom Westerhout
    @twesterhout:matrix.org

    A question. I have two nested coforall loops (one over nodes and one over threads). There is quite a bit of communication happening between nodes, but no remote tasks are spawned except for "fast executeOn", (I think) no allocations are performed, just computations. The peculiar thing is that I have to insert a few calls to chpl_task_yield to have the code terminate -- without them it just hangs. What could be causing such behavior?

    EDIT: One other thing, there is no oversubscription, so tasks are not competing for cores.

    3 replies
    Tom Westerhout
    @twesterhout:matrix.org
    @ronawho: I went back and inserted some if statements to conditionally enable the yield statements, and I can't reproduce the issue anymore 🙈 So let's ignore it for now, and if I manage to reproduce it in a reliable way, I'll ping you again. Sorry for the noise
    npadmana
    @npadmana

    Hi all (again!)... I was trying to do a reduction over a user-defined record as below, but I'm running into an issue because there isn't a way to define the initial state of the reduction variable... is there a simple way around this:

    record R {
        const x : int;
        var ii : int;
    
        proc init(x0: int) {
            this.x = x0;
            this.complete();
        }
    }
    
    operator +(lhs:R, rhs:R) {
        var ret = new R(lhs.x);
        ret.ii = lhs.ii + rhs.ii;
        return ret;
    }
    
    var h = new R(10);
    
    coforall ii in 0..4 with (+ reduce h) {
        h.ii += ii;
    }
    
    writeln(h);

    fails to compile with

    $ chpl test.chpl
    test.chpl:17: error: unresolved call 'R.init()'
    test.chpl:5: note: this candidate did not match: R.init(x0: int)
    $CHPL_HOME/modules/internal/ChapelReduce.chpl:140: note: because call does not supply enough arguments
    test.chpl:5: note: it is missing a value for formal 'x0: int(64)'
    19 replies
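    For reference, one way to satisfy the R.init() call that ChapelReduce makes is to give R a zero-argument initializer so the reduction can construct its identity element (a sketch; the thread may have settled on something different):

    record R {
        const x : int;
        var ii : int;

        proc init(x0: int) {
            this.x = x0;
            this.complete();
        }

        // Zero-argument initializer: this is what R.init() resolves to
        // when the reduction builds its identity value.
        proc init() {
            this.x = 0;
            this.complete();
        }
    }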
    npadmana
    @npadmana
    Sorry, one more (for some reason, my brain is blanking on this) -- what is the best way to append to an array?
    4 replies
    kiti_Nomad
    @Kiti-Nomad
    Concurrency has never been so important as during a pandemic
    Brian Gravelle
    @mastino
    I am running into memory corruption in a Chapel application. Is there a way to use gdb to identify the offending data structure? I have been able to do this in C, but it appears Chapel is removing some of the necessary information.
    9 replies
    Kevin Waters
    @kwaters4
    Have any users or developers had success leveraging the TAU profiler for their Chapel code? Are there other profilers that anyone would recommend instead?
    6 replies
    Thomas Rolinger
    @thomasrolinger
    If I can assume that a loop is iterating over an array/domain with a default iterator, is there a way to "query" the iterator for the value some number of iterations ahead? So something like:
    for val in Array {
       // give me "val" that is N iterations ahead
    }
    1 reply
    Kevin Waters
    @kwaters4
    @lydia-duncan I was digging through some GitHub issues and I was wondering whether chapel-lang/chapel#9817 was completed? Or if there is a resource that would help with returning a Chapel array from Cython. I have been able to get the array into Chapel; now I want to get it back, and I am unsure how.
    2 replies
    Zhihui Du
    @zhihuidu
    I am confused by the testing results of Chapel's atomic arrays.
    My assumption is that the atomic array operations tmp=A[i].read() and A[i].write(tmp) should be more expensive than the corresponding regular array operations tmp=A[i] and A[i]=tmp.
    However, my tests show that for the same code, if we just replace the regular array operations with atomic array operations, sometimes (often with a large graph as input) the atomic array method performs well, and sometimes (often with a small graph as input) it performs badly compared with the regular array method. This holds whether I use one locale or two.
    Can anyone help explain these results? Thanks!
    Brad Chamberlain
    @bradcray
    Hi Zhihui — Just to make sure I’m understanding the question: when you say "the cost of regular array operations tmp=A[i] and A[i]=tmp", A here is an array of (say) non-atomic ints/reals, so not the same as the array A in the previous "atomic array operations" clause (presumably an array of atomic ints/reals). Is that right?
    Zhihui Du
    @zhihuidu
    You are right.
    I implemented two methods for the connected components algorithm: one uses a regular array and the other uses an atomic array. I found that for large graphs, the performance of the atomic array method is good. I cannot understand it; I thought the atomic array method would always be worse in performance than the regular array method.
    Brad Chamberlain
    @bradcray
    Generally speaking, I would agree with you that atomic operations in Chapel will be more expensive than non-atomic operations. The details can depend on factors like (a) whether the atomic operations are implemented using processor atomics or network atomics, or (b) whether the variable upon which the atomic operation is applied is local or remote. A best-case scenario for the two to be competitive might be remote atomic operations vs. remote (non-atomic) reads on a network that has good support for both, like Cray Aries. Or local reads vs. local processor atomics.
    Offhand, I’m not aware of a microbenchmark you could write in which [local vs. remote] x [processor vs. network] atomic operations would outperform the equivalent non-atomic reads/writes. Is there any chance that the performance difference could be due to algorithmic differences rather than atomic vs. non-atomic overheads? (That seems most likely to me.) For example, an algorithm using non-atomics may have races that benefit or hurt performance if they affect how many iterations the algorithm runs for, or the like. Or a non-deterministic algorithm’s running time might vary even for a given atomic/non-atomic choice.
    Tagging @ronawho in case he sees another explanation that I’m not.
    Zhihui Du
    @zhihuidu
    Yes, at first I thought the reason was that the atomic operations change the total number of iterations. However, I found that the iteration counts are the same. This is why I cannot understand it.
    My tests were on two different platforms; one has InfiniBand and the other does not. Both platforms show similar results.
    Let me check the code again based on your ideas about the possible causes.
    Brad Chamberlain
    @bradcray
    To my knowledge, we do not currently make use of network atomics on InfiniBand (or any non-Cray networks), so I’d expect remote atomics to be quite expensive there (an active message to do the remote processor atomic) relative to remote reads/writes. If all the atomic ops were local, the difference should be that of doing local reads/writes vs. atomic reads/writes using the processor. I suppose one other potential factor might be that the alignment and padding of the values in memory may differ between the atomic and non-atomic implementations(?).
    Brad Chamberlain
    @bradcray
    [off-gitter, Elliot verified that I didn’t mis-characterize anything]
    Zhihui Du
    @zhihuidu
    @bradcray Thanks for the detailed information! You are right, and now I think it may be caused by the nondeterminism of concurrent thread execution.
    In one iteration, multiple threads compare the value of A[i] with their own specific value tmpj. A[i] will be updated with tmpj if A[i] is larger than tmpj; otherwise, nothing changes.
    One possible parallel execution scenario is that multiple threads execute the comparison at the same time and then update A[i] concurrently, so the resulting value of A[i] may not be the smallest one. This means the subsequent comparisons and assignments will have to be executed again.
    If we use atomic assignment, the smallest value will be assigned to A[i] and the subsequent comparisons will not trigger update operations. So fewer update operations will be executed. This may make the atomic version take less time, since fewer assignments are executed.
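    A sketch of the racy compare-then-write pattern being described (illustrative code, not the actual connected components kernel):

    var A1: atomic int;                 // stands in for one element A[i]
    A1.write(max(int));

    forall tmpj in 1..1000 do
      // the read and the write are each atomic, but the compare-then-write
      // pair as a whole is not, so a larger value can still land last:
      if tmpj < A1.read() then
        A1.write(tmpj);

    writeln(A1.read());                 // usually 1, but not guaranteed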
    Brad Chamberlain
    @bradcray
    :+1:
    Zhihui Du
    @zhihuidu
    @bradcray
    In Chapel code, we let a=1, b=2, and c=3. Here only c is an atomic variable.
    We have two parallel threads.
    Thread 1 executes
    if a < c.read() then c.write(a)
    and thread 2 executes
    if b < c.read() then c.write(b)
    So the value of c can be 1 or 2; that is, the final value of c depends on which thread executes its write last. Is this correct? Thanks!
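    A runnable sketch of that scenario:

    const a = 1, b = 2;
    var c: atomic int;
    c.write(3);

    cobegin {
      { if a < c.read() then c.write(a); }   // thread 1
      { if b < c.read() then c.write(b); }   // thread 2
    }

    writeln(c.read());  // may print 1 or 2, depending on write order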
    Brad Chamberlain
    @bradcray
    @zhihuidu: That sounds correct to me.
    Zhihui Du
    @zhihuidu
    thanks! @bradcray
    Zhihui Du
    @zhihuidu
    @bradcray I checked the Chapel atomic functions, but I cannot find a way to make
    if a < c.read() then c.write(a)
    a single atomic operation.
    compareAndSwap is close, but I need a "<" condition instead of an "==" condition.
    Can Chapel support this compare-with-"<"-and-set as a single atomic operation?
    Brad Chamberlain
    @bradcray

    @zhihuidu: I think the standard technique for doing this type of pattern with atomics would be to do something like:

    var swapped = false;
    do {
      const oldC = c.read();
      if a < oldC {
        // compare-and-swap: ensure c is still 'oldC' and, if so, swap in 'a';
        // if not, spin around this loop again
        swapped = c.compareAndSwap(oldC, a);
      } else {
        swapped = true;  // c is already <= a, so there's nothing left to do
      }
    } while !swapped;

    Note that nothing about this is particularly Chapel-specific. I believe the same sort of approach would be taken in any language with atomics of this flavor (like modern C or C++).
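    Wrapped up as a reusable helper, the same idea might look like this (atomicMin is an illustrative name, not a standard library routine):

    proc atomicMin(ref c: atomic int, a: int) {
      var oldC = c.read();
      while a < oldC {
        if c.compareAndSwap(oldC, a) then return;  // installed 'a'; done
        oldC = c.read();                           // lost the race; re-check
      }
    }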