LightPegasus
@LightPegasus
I am trying to implement a parallel version of the Ford-Fulkerson algorithm (the Edmonds-Karp variant). I am using forall to split the work on my array of tuples across my locales. I cannot figure out why it is creating such a large overhead (parallel: 0.09s vs. serial: 0.008s for 10 vertices). Is there a better way to parallelize it? Should I use a different distribution?
/* Function that implements the Ford-Fulkerson Max Flow algorithm
 *
 * Return: the maximum flow from s to t
 * resGraph: an adjacency matrix that contains the capacities
 * s: source vertex
 * t: sink vertex
 */

use List, BlockDist; // modules providing `list` and `Block`

proc FordFulkerson(resGraph: [], s: int, t: int) {
  // array that stores the path found by BFS
  var parent: [0..V-1] int;
  var max_flow: int = 0; // no flow initially

  while bfs(resGraph, s, t, parent) {
    // Find the minimum residual capacity of the edges along the path
    var path_flow: int = max(int);
    var q = new list((int, int));

    var v: int = t;
    while v != s {
      q.append((v, parent[v]));
      v = parent[v];
    }

    const Space = {0..q.size-1};
    var D = Space dmapped Block(Space);
    var A: [D] (int, int) = q.toArray();

    forall i in A with (min reduce path_flow) do
      path_flow = min(path_flow, resGraph[i(1), i(0)]);

    // Update residual capacities of the edges and reverse edges along the path
    forall i in A with (ref resGraph) {
      resGraph[i(1), i(0)] -= path_flow;
      resGraph[i(0), i(1)] += path_flow;
    }

    // Add path flow to overall flow
    max_flow += path_flow;
  }
  return max_flow;
} // End of the FordFulkerson function
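One likely source of the overhead in the code above is that a new Block-distributed domain and array are built for every augmenting path, which is at most V-1 edges long; the distribution setup and cross-locale communication dwarf that little bit of work. A sketch of the same step kept local (same variable names as the code above, assuming the path fits comfortably on one locale):

```chapel
// For short augmenting paths, a plain local array avoids the cost of
// creating a distributed domain/array on every BFS iteration.
var A: [0..#q.size] (int, int) = q.toArray();

var path_flow = max(int);
forall i in A with (min reduce path_flow) do
  path_flow = min(path_flow, resGraph[i(1), i(0)]);
```

Whether this wins depends on path length; for graphs this small, the serial loop may still be fastest.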
Michael Merrill
@mhmerrill
what is the recommended way of breaking out of a coforall?
we ended up calling Errors.exit(0) to exit the program in this case but this feels a little icky, I guess we could throw an exception from the thread that wants to stop all the tasks...
Michael Merrill
@mhmerrill
I guess we could also share a var and poll it...
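A minimal sketch of that shared-flag approach: each task polls an atomic bool and winds down once any task sets it (the `foundIt` predicate is hypothetical, purely for illustration):

```chapel
var done: atomic bool;

coforall tid in 0..#here.maxTaskPar {
  while !done.read() {
    // ... do one chunk of this task's work ...
    if foundIt(tid) then   // hypothetical "we can stop now" check
      done.write(true);    // every other task sees this on its next poll
  }
}
```

This is cooperative cancellation, so responsiveness depends on how often each task reaches the poll.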
LightPegasus
@LightPegasus
I am working on a parallel BFS algorithm. I get the correct answer when running my program that uses said algorithm, but it runs slower than the serial version. I am wondering if anyone has any suggestions on what to work on or how to improve it. My code is on my GitHub: https://github.com/LightPegasus/Ford-Fulkerson
Thomas Rolinger
@thomasrolinger
@LightPegasus I’d suggest looking at the code in listing 5 in this paper: https://ieeexplore.ieee.org/document/9721333 it is far from the best performing code but it should give you somewhere to start from. In general, a distributed bag is not likely going to do what you want for BFS. The paper describes an approach to use aggregation to make the performance better but it is a bit out of date for how we could do it today. If you’re interested in that approach, let me know.
Thomas Rolinger
@thomasrolinger
Another issue not specific to BFS is that your graph data structure is a dense/full 2D matrix rather than a compressed/sparse matrix. So you’re spending tons of time in the forall on line 49 iterating over every vertex to find neighbors. A compressed representation only stores the non-zeros (the edges in the graph). That way you can easily access a given vertex’s neighbors. Also, the graph/matrix is not distributed, so you will have a ton of remote communication in that forall when you access the graph from any locale besides locale 0.
David Melkumov
@Dmelkumo
I already asked this in a thread, but I thought I'd ask again for visibility: how would I use a minloc reduction intent in a forall loop? Would I need to zip an array and its domain to iterate, and then have one variable for the min and one for the index?
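For the reduce-expression form (as opposed to the intent form), a sketch of the documented idiom is to apply `minloc` to a zip of the array and its domain, which yields a (value, index) tuple:

```chapel
var A: [0..9] int;
for i in 0..9 do A[i] = abs(i - 6) + 2;  // sample data; minimum of 2 at i = 6

// minloc over zip(values, indices) reduces to a (minValue, minIndex) tuple
var (minVal, minIdx) = minloc reduce zip(A, A.domain);

writeln(minVal, " at ", minIdx);  // 2 at 6
```

For the forall-intent version, the same pairing idea applies: iterate over the zip and reduce the (value, index) tuple, rather than tracking two separate variables.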
Tom Westerhout
@twesterhout:matrix.org

I have a weird segmentation fault:

proc getBlockPtrs(arr) {
  logDebug("getBlockPtrs(", arr, ")");
  type eltType = arr.eltType.eltType;
  var ptrs : [0 ..# arr.size] c_ptr(eltType);
  for i in arr.dim(0) {
    ref locBlock = arr[i];
    if locBlock.dom.size > 0 {
      logDebug("if");
      ref x = locBlock.data[locBlock.dom.low];
      logDebug("end if");
    }
  }
  logDebug("returning");
  return ptrs;
}

proc finalizeInitialization(...) {
  logDebug(_dataPtrs);
  logDebug("assigning...");
  _dataPtrs = getBlockPtrs(_locBlocks);
  logDebug("finalizeInitialization is done!");
}

This code prints everything except for "finalizeInitialization is done!" and fails with a segmentation fault. It seems like the error happens during array assignment. Are there techniques to debug that?

EDIT: Oh yeah, forgot to mention that the type of _dataPtrs is [0 ..# 1] c_ptr(real(64)).

Tom Westerhout
@twesterhout:matrix.org
Interestingly, when I change the getBlockPtrs function to receive ptrs by reference rather than return it, the error disappears...
Lydia Duncan
@lydia-duncan
Hmm. My guess is that the pointers being stored in ptrs are more local than you’d want them to be. They could potentially be referring to a copied version of what is sent into arr that is local to getBlockPtrs. Or maybe the ptrs array is getting cleaned up in such a way that impacts what’s being returned as well
Tom Westerhout
@twesterhout:matrix.org
I'd think so as well, but the failure happens even before my code got a chance to use ptrs. In other words, the segfault happens when I try to assign to _dataPtrs rather than when trying to dereference one of the pointers.
The code seems to consistently fail to copy arrays of c_ptr(uint(64)) of length 1. Could there be some magic conversion for size-one arrays that's taking place?
Currently, I have to trace all the uses of the array and do the following:
// This crashes:
var basisStatesPtrs : [0 ..# numLocales] c_ptr(uint(64)) = basisStates._dataPtrs;

// This works:
var basisStatesPtrs : [0 ..# numLocales] c_ptr(uint(64)) = noinit;
c_memcpy(c_ptrTo(basisStatesPtrs), c_const_ptrTo(basisStates._dataPtrs),
         numLocales:c_size_t * c_sizeof(c_ptr(uint(64))));
And I'm compiling with CHPL_COMM=none so numLocales==1
Tom Westerhout
@twesterhout:matrix.org
@benharsh: actually, if you have a minute, could you try reproducing the segfault on your side? Since you already have access to my code it should be relatively painless, I think. Could you try doing git pull and then compiling the second example: make bin/Example02. There is now a flag --enableSegFault which will cause the program to crash (twesterhout/distributed-matvec@5efe6ff). The Chapel environment that I'm using is this one: https://github.com/twesterhout/distributed-matvec/blob/master/env/setup-env-comm-none.sh
Tom Westerhout
@twesterhout:matrix.org
That's great! Should I open an issue about it or have you done so already?
Hi everyone! I am currently trying to find a way to create an array A with a different size on each of my locales. For example, say I have 2 locales: I want the domain of A to be 0..3 on locale 0 and 2..10 on locale 1. Is there any straightforward way of doing this? I've been trying some things out but have not gotten it to work as expected.
Tom Westerhout
@twesterhout:matrix.org
A random question: is anybody using Chapel with Nix? (I've checked, and there appears to be no Chapel package in nixpkgs) Has this ever been considered? and if not, would there be interest in it?
Hi all -- what is the best way to read data from a pipe (i.e. start a process and pipe its output into Chapel)?
@npadmana: I don’t do this often enough myself to be able to tell you immediately, but would look to the Subprocess module if I were trying to do this: https://chapel-lang.org/docs/modules/standard/Subprocess.html
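A small sketch along those lines, following the Subprocess docs (recent versions spell the enum `pipeStyle.pipe` and the method `readLine`; older releases used `PIPE` and `readline`):

```chapel
use Subprocess;

// Spawn a child process and capture its stdout through a pipe.
var sub = spawn(["ls", "-l"], stdout=pipeStyle.pipe);

var line: string;
while sub.stdout.readLine(line) do
  write(line);          // each line already ends in a newline

sub.wait();             // reap the child process
```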
Zhihui Du
@zhihuidu

@bradcray When an array A is evenly distributed onto many locales and each locale only works on its local part of A, this is perfect:

coforall loc in Locales {
  on loc {
    // work on the elements of A assigned to the current locale
  }
}

However, if we just do it like this:

forall a in A {
  // work on a
}

it seems that only locale 0 will execute the loop. I wonder: for the elements of A that were not assigned to locale 0, will locale 0 first get the data from the other locales, work on the elements, and then send them back, or will it ask the other locales to work directly on their own elements?

Thomas Rolinger
@thomasrolinger
@zhihuidu When you say "A is evenly distributed" do you mean via something like BlockDist or other built-in distribution? If that is the case, the forall approach should execute a given iteration on the locale where a is located. How are you determining that only locale 0 is executing the loop?
Zhihui Du
@zhihuidu

"A is evenly distributed" means there is no load-balancing problem. For example, the size of A is 100 and we have 4 locales with a block distribution.
You mean that for the code

forall a in A {
  // work on a
}

even though I did not put it inside

coforall loc in Locales {
  on loc {

  }
}

the compiler can automatically assign the work to different locales?
I assumed that all code not inside a coforall loc in Locales { on loc { } } structure would execute only on locale 0. Is this correct?

@thomasrolinger
Thomas Rolinger
@thomasrolinger

The default iterators provided for BlockDist arrays/domains will essentially implement the behavior you have for the coforall. So when you say forall a in A and A is indeed a block distributed array, it will do what you want (i.e., execute across the locales).

See the first example shown here: https://chapel-lang.org/docs/modules/dists/BlockDist.html

Zhihui Du
@zhihuidu
Thanks for the example.
So you mean

forall a in A do
  a = a.locale.id;

will be executed on all locales based on A's distribution? I thought only locale 0 would execute the forall loop. Maybe I am wrong.
Thomas Rolinger
@thomasrolinger
Chapel is not SPMD, so yes, locale 0 will "execute" the forall assuming it isn't nested in another structure like a coforall. The forall iterations are chunked up and assigned to tasks located across the locales. Those iterations are then executed on the different locales. Brad and others may have an easier method to explain the overall parallelism model.
Zhihui Du
@zhihuidu
If we have a case like

forall i in 1..100 {
}

it is hard to see how Chapel can assign the iterations to different locales. I am not sure how the compiler works at the low level.
If we access multiple distributed arrays inside the forall loop, will the Chapel compiler first get all the values from the other locales to locale 0, calculate the result, and then update the values on the other locales (so the other locales only provide and receive data, doing no calculation), or can the compiler ask the other locales to do some iterations locally?
Thomas Rolinger
@thomasrolinger
In the case of that forall loop that simply iterates over a range, it does not assign iterations across locales. As I mentioned earlier, block distributed arrays and domains have default iterators that do the "magic" of distributing work
Zhihui Du
@zhihuidu
Thanks! So only the resources of locale 0 can be used for the parallel iterations.
Thomas Rolinger
@thomasrolinger
In the above forall i in 1..100, yes, the default behavior is to use just the cores/resources of the locale the task is currently on (so not necessarily locale 0: if you had written on Locales[1] { forall i in 1..100 ... }, the loop would execute on locale 1).
Zhihui Du
@zhihuidu
I understand this now.
My question is: if we have to access data on different locales in the loop, how does the compiler handle it?
Thomas Rolinger
@thomasrolinger
At runtime, there are checks to see whether data is local or remote (some checks can be done at compile time in specific cases, to statically determine that something will be local). For remote accesses, communication is performed (i.e., a get or put).
Zhihui Du
@zhihuidu
I see. Thanks!
Kevin Waters
@kwaters4
What are the current best practices for writing a file in binary now that iostyle is deprecated? Is that still being developed?
Is channel.writeBinary the only current alternative?
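If writeBinary is available in your version, a minimal sketch looks like this (enum and method names vary across recent releases, e.g. `iomode` vs. `ioMode`, so check the IO docs for your release):

```chapel
use IO;

var f = open("data.bin", ioMode.cw);
var w = f.writer();
w.writeBinary(42:int(32));  // writes the raw 4 bytes of the value
w.close();
f.close();
```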
@zhihuidu : I’m only seeing this conversation now. Let me take a different stab at @thomasrolinger’s very excellent answers above, just to try and re-enforce some key things:
In Chapel, the behavior of a parallel loop like forall i in expr… depends entirely on expr. Specifically, if expr is an invocation of a parallel iterator, the recipe for how many tasks to use for the loop, where they should run, and how the loop’s iterations should be divided between those tasks is defined within the iterator. If expr is instead a variable, then there will be a default parallel iterator defined on that variable’s type which specifies these things (and if there is not, the parallel loop will not be legal).
As a specific example, if expr is a range (like 1..n) or domain (like {1..n, 1..n}) or array (like var A: [1..n] real;) then we will invoke the default parallel iterator for the corresponding type.
The default parallel iterators for ranges, domains, and non-distributed arrays will only ever use local resources to execute the iterations of the loop. Thus, forall i in 1..n or forall i in {1..n, 1..n} or forall a in A (for my A above) will only use local tasks to execute iterations.
By contrast, parallel iterators over distributed domains or arrays (like Block- or Cyclic-distributed arrays) will use tasks on all locales that own a part of the distributed index space.
And then, as Thomas says, orthogonal to all this, any task can refer to any variable within its lexical scope, whether it is local or remote, due to Chapel’s global address space. It can also reason about where that variable is using the .locale query and compare that to where it is running using the built-in here variable.
Have a good weekend!
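These points can be seen with a tiny experiment (a sketch using the `dmapped Block` syntax that appears earlier in this discussion; run with more than one locale to see the effect):

```chapel
use BlockDist;

const D = {1..16} dmapped Block({1..16});
var A: [D] int;

// The default parallel iterator of a Block-distributed array runs each
// iteration on the locale that owns that element, so `here.id` varies.
forall a in A do
  a = here.id;

writeln(A);  // with 2 locales: 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
```

Replacing A with a non-distributed array of the same size would print all zeros (or all of whatever `here.id` is where the loop runs), since only local tasks would be used.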
Zhihui Du
@zhihuidu
@bradcray Thanks, your answer makes me realize that my understanding of the parallel forall loop was limited. Based on your description, I have learned more about Chapel's iterator functions. If A is a distributed array, even though I did not put the forall i in A inside a coforall loc in Locales { on loc { } } structure, the default parallel iterator knows that A is a distributed array, so the loop over A will be executed on the different locales instead of just on locale 0. Is this correct?

@zhihuidu : That’s correct (and in fact, you would not want to put the forall into a coforall loc in Locales on loc loop, or you’d end up doing a bunch of redundant work, since each task on each locale would invoke A’s default parallel iterator).

Think of forall a in A as translating to forall a in A.defaultParallelIterator() where .defaultParallelIterator() is currently spelled .these() in Chapel.

Zhihui Du
@zhihuidu
@bradcray Thanks. Now I understand why I got worse performance when I put a forall a in A inside a coforall loc in Locales on loc loop. At first, I thought the forall a in A loop could only run on locale 0.
Hi all -- I think this is related to the binary I/O question from above -- is there a way to read an entire array in binary from a file (without looping over it element by element, assuming a contiguous array etc)? channel.read used to do this, but it doesn't look like it works, and readBinary doesn't seem to accept an array?
Tom Westerhout
@twesterhout:matrix.org

A question. I have two nested coforall loops (one over nodes and one over threads). There is quite a bit of communication happening between nodes, but no remote tasks are spawned except for "fast executeOn", (I think) no allocations are performed, just computations. The peculiar thing is that I have to insert a few calls to chpl_task_yield to have the code terminate -- without them it just hangs. What could be causing such behavior?

EDIT: One other thing, there is no oversubscription, so tasks are not competing for cores.

Tom Westerhout
@twesterhout:matrix.org
@ronawho: I went back and inserted some if statements to conditionally enable the yield statements, and I can't reproduce the issue anymore 🙈 So let's ignore it for now, and if I manage to reproduce it in a reliable way, I'll ping you again. Sorry for the noise

Hi all (again!)... I was trying to do a reduction over a user-defined record as below, but I ran into an issue because there wasn't a way to define the initial state of the reduction variable... is there a simple way around this?

record R {
  const x : int;
  var ii : int;

  proc init(x0: int) {
    this.x = x0;
    this.complete();
  }
}

operator +(lhs: R, rhs: R) {
  var ret = new R(lhs.x);
  ret.ii = lhs.ii + rhs.ii;
  return ret;
}

var h = new R(10);

coforall ii in 0..4 with (+ reduce h) {
  h.ii += ii;
}

writeln(h);

fails to compile with

[dev-0.1] 22:20 histogram$ chpl test.chpl
test.chpl:17: error: unresolved call 'R.init()'
test.chpl:5: note: this candidate did not match: R.init(x0: int)
$CHPL_HOME/modules/internal/ChapelReduce.chpl:140: note: because call does not supply enough arguments
test.chpl:5: note: it is missing a value for formal 'x0: int(64)'
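The error arises because the reduction needs to build an identity value via a zero-argument R.init(), which disappears once a user initializer is defined. One workaround sketch: give every field a default value so the compiler-generated zero-argument initializer comes back (note this also turns x into a var field, so it changes the record's interface and may not fit every design):

```chapel
record R {
  var x: int = 0;   // default value restores a usable zero-arg R.init()
  var ii: int = 0;
}

operator +(lhs: R, rhs: R) {
  var ret = new R(lhs.x);
  ret.ii = lhs.ii + rhs.ii;
  return ret;
}

var h = new R(10);  // compiler-generated init: x = 10, ii = 0

coforall ii in 0..4 with (+ reduce h) {
  h.ii += ii;
}

writeln(h);
```

A heavier-weight alternative is to write a custom reduction class whose `identity` accessor constructs whatever initial state you need.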