Where communities thrive

  • Join over 1.5M+ people
  • Join over 100K+ communities
  • Free without limits
  • Create your own community
  • Oct 21 2017 21:21
    @jpitts banned @Musk55
<ehildenb> Basically, every way I've come up with for detecting loop-heads at the bytecode level, I've been able to come up with counterexamples (where the detection reports that it's a loop head, even if it's not).
<ehildenb> I'm working with limited knowledge, since I don't have the entire execution trace in-hand, I only have the branch points (which is when there is a JUMPI instruction). So it's possible that by storing more information during execution I can make the detection possible, I haven't considered that yet.
<leonardoalt> I guess looking at bytecode level only you'll always have false positives because of implicit things that become loops, like allocating a new array, ABI decoding/encoding, deleting an array, etc
<ehildenb> Yeah exactly, I have some examples of dynamic inputs, etc... that trigger it falsely sometimes.
<ehildenb> For example, if you have a function signature twoAddress(address a1, address a2), the compiler will factor out the code that computes x == (x & ((2 ^ 256) - 1)), and JUMP to it for each argument. This looks a lot like a loop, in fact when you jump to it the only differences each time you jump are some elements on teh wordstack (and the gas consumed), which looks a lot like a loop.
<ehildenb> The other heuristic I've looked at is "make sure that any elements which are on the wordstack and are valid jump destinations don't change between supposed loop iterations"
<ehildenb> But this is easy to write solidity code that tricks this heuristic (though the code itself is pathological)
<leonardoalt> I guess it shouldn't be hard to add info about loop heads. Is there an issue about that already?
<ehildenb> I don't see one.
<ehildenb> Want me to make one?
<leonardoalt> yea that'd be cool
<leonardoalt> but here there's no jump back, right?
<leonardoalt> even if those annotations exist, another problem is keeping them across optimizations
<ehildenb> Hmmmm, let me see if the direction we're jumping can be used as a heuristic.
<ehildenb> But my guess is no that it can't be.
<ehildenb> Because in the example above (the twoAddress function), the compiler can choose to put the code for checking the arguments before or after the main code path that we're in.
<leonardoalt> but does it become a loop in the cfg?
<ehildenb> Yes, with naive loop detection it odes.
<ehildenb> And I'm not sure there is a more clever way to do it.
<ehildenb> Because it basically jumps to the same code twice, and the only difference is that we're looking at a different offset in the <callData>
<ehildenb> But otherwise the states are structurally identical, so it looks like a loop.
<ehildenb> But like I said, I'm only storing the branch points, so it may bethat if I store more information from betwene branch points, I can infer that this isn't a loop.
<ehildenb> Actually, thinking about this more, I think I'll go with a different more user-oriented approach anyway.Basically we can provide hints about what we think are loops, and in cases we're 100% sure about loop-identification then we can do the loop folding, but otherwise we leave it to the user to identify loops for us.
<ehildenb> Would be a lot easier if there was a distinction between branches and loops in EVM bytecode.
<leonardoalt> ah I see

<leonardoalt> @ehildenb if I compile this contract with --asm:

contract L {
    function f(uint x) public pure {
        uint y;
        for (y = 0; y < x; ++y) {
        assert(x == y);

I see

/* "loop.sol":59:88  for (y = 0; y < x; ++y) {... */
/* "loop.sol":59:88  for (y = 0; y < x; ++y) {... */
<leonardoalt> I haven't looked into the srcmap, but now I'm wondering if it's not there

<leonardoalt> Also, in

contract L {
    function f(uint[20] memory a) public pure {
        uint s;
        for (uint i = 0; i < 20; ++i) {
            s += a[i];

if you compile with --ir you will see two loops, but with in --asm you can see that one is annotated with the for loop, and the other is annotated with utility.yul, denoting that it's from a compiler routine (here it's abi decoding)

<ehildenb> Huh, TIL, that will likely help a lot.
<ehildenb> So it looks like compiler already puts in the information.
<ehildenb> Thank you!
<ehildenb> I am going to have a heuristic that will only attempt to fold a loop if no items on the wordstack are valid jump destinations which change between loop iterations, basically. This should be safe in the sense that we won't detect loops which aren't actually loops (loop abstraction results in loss of control flow precision, so we want to do it as little as possible), but also give the user or external tool the option to explicitely specify where the loops are.
<ehildenb> So that we can work on most bytecode anyway, but if you provide hints (like the above, we'll be able to work on your code too).
<ehildenb> I'm also realizing it's important to make it possible to manually identify loops externally because we want this tool to work with any semantics.
<ehildenb> But I think the heuristic of "wordstack contains no items which change between loop iteration 1 and loop iteration 2 where both the pre and post are valid jump destinations" should work most of the time, unless the code does something really silly like store the jump-back destination in the local memory....
<ehildenb> maybe it's better to instead calculate the state delta, and make sure that nowhere in the state do you have a change from one integer to another where both integers are valid jump destinations. Then you know for sure that your jump structure must be the same. Still would rule out automatic detection of loops where every iteration of the loop changes something in the state from one valid jump destination to another, but is not using that part of the state for actual control flow. Kinda a hacky heuristic.
<leonardoalt> yea I think that makes sense
<leonardoalt> why no changes?
<ehildenb> When the compiler factors out a piece of code to be reused (but it's not a loop, it's just re-used in a bunch of places, eg. internal functions or argument processing routines), it stores the location to jump back to on the wordstack.
<ehildenb> So in the twoAddress example above (which actually comes from a transferFrom example), the CFG builder sees "hey, I've seen this state before", because everything is identical except for (i) the argument it's processing (so the offset into th ecalldata that shows up on the wordstack), and (ii) the argument on the wordstack representing the location to jump back to.
<ehildenb> Specifically, here is what the state delta looks like:
<ehildenb> If you look closely, you'll see the only change in pre and post state are on the <gas> cell, and the <wordStack> cell.
<ehildenb> The <gas> cell change is expected.
<ehildenb> Here we're trying to decide "is this a loop or not?" My old heuristic was "<k>, <pc>, and <callData>" cells are the same, so it returned "true, this is a loop", even though its not, it's just two jumps to the same piece of code which process the address arguments.
<ehildenb> So I'm trying to come up with a better heuristic.
<ehildenb> And the only telling thing that this isn't a loop is that the 6th argument on the <wordStack> goes from 306 => 323, and that is the argument that is used as the "jump-back" destination after the argument processing.
<ehildenb> For a loop, that "jump-back" argument, if it's present (depending on how the loop is structured), should always be the same, so there shouldn't be a state delta.
<ehildenb> I'd rather error on the side of assuming things aren't loops. So in this case, I basically decided that if the state-delta contains a rewrite anywhere from one valid jump destination (306) to another different one (323) (and we already have the valid jump destinations computed in the <jumpDests> cell), then do not consider the two states in looping relation.
<ehildenb> So you can trick the tool into thinking things aren't loops by writing pathalogical code (think something that every iteration of a loop wrote a different valid jump destination to storage slot 0 and did nothing with it), but in that case you just lose out on automatic loop detection.

<ehildenb> Not sure if the above makes sense, but this new heuristic:

def kevmShouldSubsume(constrainedTerm1, constrainedTerm2):
    if all([getCell(constrainedTerm1, cn) == getCell(constrainedTerm2, cn) for cn in ['PC_CELL', 'K_CELL', 'CALLDATA_CELL', 'PROGRAM_CELL', 'JUMPDESTS_CELL']]):
        jumpDests   = [ i['args'][0] for i in flattenLabel('_Set_', getCell(constrainedTerm1, 'JUMPDESTS_CELL')) ]
        (state1, _) = splitConfigAndConstraints(constrainedTerm1)
        (state2, _) = splitConfigAndConstraints(constrainedTerm2)
        stateDelta  = pushDownRewrites(KRewrite(state1, state2))
        rewrites    = []
        def _collectRewrites(_k):
            if isKRewrite(_k) and isKToken(_k['lhs']) and isKToken(_k['rhs']):
                rewrites.append((_k['lhs'], _k['rhs']))
            return _k
        traverseBottomUp(stateDelta, _collectRewrites)
        return not any([(i != j) and (i in jumpDests) and (j in jumpDests) for (i, j) in rewrites])
    return False

checks that:

  • All of the <pc>, <k>, <callData>, <program>, and <jumpDests> cells are identical (could probably also add <callDepth> to this one)
  • There are no deltas in the state between two integers that are valid jump destinations.

This correctly identifies the loop in this code:

    function sum(uint n) external pure returns (uint s) {
        s = 0;
        uint i = n;
        while (0 < i) {
            s = s + i;
            i = i - 1;
        return s;

while correctly not identifying a loop in this code:

    function twoAddress(address a1, address a2) public pure returns (uint256) {
        if (a1 == a2) {
            return 0;
        return 1;