These are chat archives for ethersphere/orange-lounge

9th
Apr 2018
lash
@nolash
Apr 09 10:02
Through the work with dynamic discovery tests over the weekend I think I've uncovered some race situations in the kademlia. The changes on the WIP PR shows the locks I've added (and removed), and with these changes I'm able to run the dynamic discovery tests concurrently. ethersphere/go-ethereum#370
There are still deadlocks on shutdown, and periodically the health check fails for reconnecting nodes (hence WIP). The latter may be due to failing test logic, which I couldn't really test before I got the concurrency running, so I will investigate now.
But maybe @justelad @holisticode @nonsense who've been the most(?) involved in hangs and irregularities in kademlia lately could have a look and see if it's old news or if we might be on to something with this?
Elad
@justelad
Apr 09 10:31

@nolash what i can report so far is that the problem is not with the kademlia implementation, but with the p2p client. there’s a hang there which manifests what two nodes try to call each other around the same time. hence the hack with the bytes.Compare in the test, which tells the smaller node to ditch connecting and falls back to the bigger one to try and connect instead.

so far from what @zelig and i investigated - this is a test env bug which would very likely not manifest in production nodes.

i think that if we would to add more of these tests on top of a testing framework that might be not optimal to test all of these functionalities at once we should rethink what we’d like to test (i.e. maybe for the time being not do an all-at-once functional test). maybe split the tests to connection/dial-ins and with these dynamic discovery tests manually create a connection from the network’s god’s view and hand it out to the nodes.
right now with my discovery test PR i’ve narrowed it down to about 3.5% tests flaking, not sure if i could squeeze more at this point

Daniel A. Nagy
@nagydani
Apr 09 10:33
Nice catch, @justelad
lash
@nolash
Apr 09 10:36
Actually @justelad I initially added this hack to my WIP, but I still had problems ... :/
maybe split the tests to connection/dial-ins and with these dynamic discovery tests manually create a connection from the network’s god’s view and hand it out to the nodes.
I'm not sure if I understand this. How is this different from snapshots?
Elad
@justelad
Apr 09 10:45
when you use the hive’s start/stop methods you’re forcing it to reconnect and dial in on its own. a note about your pr is also - why not use the built in start/stop functionalities of the hive? is the test about keeping the nodes up but forecfully severing the connections between them?
it seems there’s no snapshot involved in your test in any case
lash
@nolash
Apr 09 10:48
You mean Start/Stop from p2p/simulations.Network? I don't believe the simnodes give you access to the hive object. Anyway, network restart actually creates a NEW kademlia object (new pointer), which makes the comparison before and after shutdown all the more tricky; And your state persistence code is not in this branch. Or am I mistaken?
Elad
@justelad
Apr 09 10:51
ethersphere/go-ethereum#347
Yes sorry I meant Network.Start/Stop
Elad
@justelad
Apr 09 10:56
in your test file, you should get rid of lines 501-505
that’s what’s causing your problems :)
lash
@nolash
Apr 09 10:57
? I only count 359 lines..
get rid of this block:
    // function to sanction or prevent suggesting a peer
     if k.Reachable != nil && !k.Reachable(e.addr()) {
         log.Trace(fmt.Sprintf("%08x: peer %v is temporarily not callable", k.BaseAddr()[:4], e))
         return nil
     }
lash
@nolash
Apr 09 10:59
that's in swarm/network/kademlia.go
Elad
@justelad
Apr 09 11:03
true. do so nevertheless. the bytes.compare and this reachable block are mutually exclusive.
lash
@nolash
Apr 09 11:04
It doesn't make a difference, sadly :'(
ah you mean THEN reinstate your hack?
Elad
@justelad
Apr 09 11:06
yes
they are both hacks :)
also i’d get rid of this:
div += (150000 - rand.Int63n(300000)) * div / 1000000
lash
@nolash
Apr 09 11:08
what you are saying is wait with this PR till you are done with yours :)
Elad
@justelad
Apr 09 11:09
it might reduce your clutter :D
lash
@nolash
Apr 09 11:14
doesn't help unfortunately (and that change now gives me peerpot panics sometimes)
I don't think it's the same problem
or it's a different part of the same problem
Elad
@justelad
Apr 09 11:14
i’ll pull your branch and give it a test drive
lash
@nolash
Apr 09 11:17
Please do. Have a look at the stack traces when (if) it hangs
Elad
@justelad
Apr 09 11:17
:+1:
Daniel A. Nagy
@nagydani
Apr 09 13:12
So we have a linkable page with SOS details (dates, venue, etc.)?