Noah Vesely
@nvesely
I haven't figured that out. Either we'd have to modify Tor a lot more significantly or have tcpdump running on the internal traffic and then compare the timestamps.
I do want to know, but just don't have the time right now.
Noah Vesely
@nvesely
I'm taking a closer look at your go-knn implementation @pylls. I was trying to follow along with Wang's thesis, and what you're doing obviously deviates from that. But it's apparently based on a Wang implementation. I'm thinking that since this was a December 2015 thesis, the algorithmic improvements listed there just don't have published implementations.
I'm working on implementing exactly what Wang describes in 3.2.5. I guess we can compare results between that and your implementation based on Wang's Python implementation when it's ready.
Noah Vesely
@nvesely
Curious how things are going for you @mjuarezm @gunesacar @gchers ? I took a break to work on the latest SD release for a while, but am trying to fix a couple of last bugs that are causing my crawler to fail out at a certain point. I think it's related to when sites suspect I'm a crawler and close the connection on their end--something weird in how Python handles that. I just brought in the traceback module to get some better info and figure this out.
pulls
@pylls
hi @fowlslegs. The weightlearning (alg_recommend2) and accuracy functions are identical in both implementations provided by Wang at https://crysp.uwaterloo.ca/software/webfingerprint/ and should be the same in go-knn (sans one bugfix in weightlearning). What data is used, and how, is only identical to flearner.cpp in attacks.zip. The current structure of the main part of the code is not a good way to use knn though, as you note. When it comes to 3.2.5, I'd love to hear about any differences you find and what changes you make.
I guess you worked in rounds to start with?
pulls
@pylls
I did a best-effort implementation of WLLCC as in Wang's PhD thesis, and am now playing with different parameters and datasets. One noteworthy thing I noticed, and it would be interesting to see if you see the same @fowlslegs, is that I get better results (for all metrics) as I increase rounds. In Section 3.4.4.2, Wang sees a flattening after 75 rounds, but I get increasingly better results all the way up to 800+ rounds with his dataset (100x90+9000), and better TPR as well.
Noah Vesely
@nvesely
Hey, sorry, I meant to follow up on this. I had misunderstood something, having not yet finished reading your implementation when I posted that.
The only difference I see besides the lack of rounds is that you multiply \Delta w by (0.2 + N_{bad})/k_{reco} instead of (1 + N_{bad})/k_{reco}, as described in Wang 3.2.5.
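For concreteness, the difference is just the additive constant in the \Delta w scaling; roughly like this (variable and function names here are mine, not from either codebase):

package sketch

// Hypothetical names (deltaW, nBad, kReco); only the additive constant differs.
func scaleDeltaWang(deltaW float64, nBad, kReco int) float64 {
    return deltaW * (1.0 + float64(nBad)) / float64(kReco) // Wang 3.2.5
}

func scaleDeltaGoKNN(deltaW float64, nBad, kReco int) float64 {
    return deltaW * (0.2 + float64(nBad)) / float64(kReco) // go-knn as I read it
}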
Noah Vesely
@nvesely
Did you catch any other differences besides that and the lack of rounds @pylls?
Check out figure 3 on page 150 of the USENIX "Effective Attacks and Provable Defenses..." paper https://www.usenix.org/system/files/conference/usenixsecurity14/sec14-paper-wang-tao.pdf
There they noticed TPR increases become minimal after 800 rounds.
How many rounds is optimal (i.e., not much ROI after that point) is probably a function of your dataset size as well as other features of that dataset (e.g., ratio of non-monitored to monitored sites, instances of monitored sites, etc.).
pulls
@pylls
Good catch with figure 3 on page 150, thanks! I'm also unsure about the randomization of weights (weight[i] *= 0.9 + float64(rand.Intn(100)/500.0)) at the end and the c1 contribution (weight[j] += c1 / totalfd)
didn't find that in the thesis at least so removed it from the pure WLLCC
I'm currently, very slowly, running the implementation on different large datasets to see if I can find some heuristic
for the number of rounds, that is
Noah Vesely
@nvesely
BTW everyone, @redshiftzero is the newest person joining the Freedom of the Press Foundation team and will be working on this project with me. She has a background in data science and has written a paper on HTTPS fingerprinting (I believe on Tor, but maybe just on the regular 'net), so that's exciting because my background has mostly been in crypto and, increasingly, application + OS security--this is my first ML project. Anyway, I've been mostly working on other things the last couple of weeks, but am hoping to make some real progress this week and also to kick up more collaboration with everyone. Would love to hear what everyone's been up to lately @pylls @mjuarezm @gunesacar @gchers
Noah Vesely
@nvesely
@pylls Did you use pprof to analyze your Wa-kNN implementation?
pulls
@pylls
@fowlslegs: Hi! Nope, sorry
Noah Vesely
@nvesely
@pylls Oh, I was just going to say it's pretty efficient. I have found some optimizations, though, that I'll submit PRs for in the coming weeks.
Are any of y'all going to the tor dev meeting?
pulls
@pylls
Thanks, sounds good
mjuarezm
@mjuarezm
@fowlslegs no, unfortunately, I won't be able to attend the tor-dev meeting this time..
Noah Vesely
@nvesely
Just had a session at the Tor summer meeting on website fingerprinting. Some good conversation. We've decided that a mailing list would be good as well, since a lot of people mentioned they are averse to signing in to yet another chat room, which totally makes sense to those of us who are in n Slacks, Gitters, IRCs, etc.
Mike Perry mentioned that they recently obtained funding to work on Tor proposal 254 (see https://gitweb.torproject.org/torspec.git/tree/proposals/254-padding-negotiation.txt) based on WTF-PAD.
mjuarezm
@mjuarezm
Hi @fowlslegs, thanks for the updates. I wish I could attend the tor-dev meet. Great news about the funding!
Noah Vesely
@nvesely
https://trac.torproject.org/projects/tor/wiki/org/meetings/2016SummerDevMeeting/Notes/WebsiteFingerprinting Here are notes from the meeting / a mini-wiki to try to provide resources to help get more people involved. Feel encouraged to edit it and also to sign up for the new mailing list linked there.
pulls
@pylls
cool!
mjuarezm
@mjuarezm
awesome! thanks!
Giovanni Cherubin
@gchers
thank you!
Noah Vesely
@nvesely
@pylls Have any results to share as far as rounds in Wa-kNN / your attempt to find a heuristic for a good tradeoff between time & accuracy for different-size datasets?
Noah Vesely
@nvesely
Also, do you have a branch for that?
pulls
@pylls
@fowlslegs: didn't find anything useful, so instead I took the approach of calculating all weights for k-folds in parallel (one routine per fold, instead of many routines within wllcc)
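Something like this, one goroutine per fold (a minimal sketch; Fold and learnWeights are placeholder names, not the actual defector/go-knn code):

package sketch

import "sync"

// Fold and learnWeights stand in for the real types/functions.
type Fold struct{ Train [][]float64 }

func learnWeights(train [][]float64) []float64 { return nil } // WLLCC on one fold's training set

// One goroutine per fold, instead of many goroutines inside wllcc.
func weightsPerFold(folds []Fold) [][]float64 {
    weights := make([][]float64, len(folds))
    var wg sync.WaitGroup
    for i, f := range folds {
        wg.Add(1)
        go func(i int, f Fold) {
            defer wg.Done()
            weights[i] = learnWeights(f.Train)
        }(i, f)
    }
    wg.Wait()
    return weights
}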
thanks for the pull request btw! :) Slowly catching up with life after travel this week
the latest work we did on knn for the WF and DNS work is at https://github.com/pylls/defector/blob/master/cmd/defector/, will look at updating the go-knn project to incorporate that together with your pull requests
Noah Vesely
@nvesely
@pylls Yeah, I implemented rounds, and am really surprised how little difference there is between 1 and 10 or 100 rounds. Basically insignificant. I'm also surprised at how consistent the accuracy results are for 1 round despite randomizing the weight initialization. There's something I've been meaning to try in the weight adjustment process--it will probably cost a good bit computationally, but I think it has a possibility of improving results. I'll try to implement it sometime in the coming week, and point you at the branch / let you know how it goes.
pulls
@pylls
@fowlslegs I think we're doing rounds differently. If I understand your code right, you do rounds as repeating the full weightlearning run, each time training on all instances. If I understand rounds correctly, it's about how many times you train on an instance. Basically, rounds is here: https://github.com/fowlslegs/go-knn/blob/master/attack/knn.fixed/knn.fixed.go#L98
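To contrast the two readings (a rough sketch; the helper functions are placeholders, not actual go-knn code):

package sketch

// Placeholders, not actual go-knn code.
func fullWeightLearningRun() {} // one complete pass, training on all instances
func trainOnInstance(i int)  {} // one weight adjustment for instance i

// Reading A (how I read your rounds): repeat the entire weight-learning run.
func roundsAsRepeatedRuns(rounds int) {
    for r := 0; r < rounds; r++ {
        fullWeightLearningRun()
    }
}

// Reading B (what I think rounds means): train `rounds` times on each instance.
func roundsPerInstance(instances []int, rounds int) {
    for _, i := range instances {
        for r := 0; r < rounds; r++ {
            trainOnInstance(i)
        }
    }
}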
Noah Vesely
@nvesely
@pylls You're right. I did a search for "rounds" in Wang's thesis, and he could have really done a better job explaining what he meant. Anyway, going to try it that way and see if results are affected.
Noah Vesely
@nvesely
The way rounds are done now makes more sense in the context of having now read more ML lit.
Noah Vesely
@nvesely
https://3tmaadslguc72xc2.onion/ is no longer up @mjuarezm @gchers @jhayes14. I've been in a SecureDrop hole the last couple of months, but am working on WebFP again, and starting to collab w/ @mikeperry-tor (and some of y'all) on the adaptive padding defense soon. Anyway, I wanted to try to reproduce your results using ALPaCA and test its viability as a defense candidate for SD. If you would be willing to throw that code on GitHub (even in a private repo, which, if it would cost you money, FPF can provide and let you admin), I think it would be helpful to that process.
Noah Vesely
@nvesely
@pylls I've looked a little bit at your latest go-knn code and noticed that you put my present feature pre-computation optimization inside the rounds loop. Since present features don't change w/ weight updates, you should move that code outside (above) the rounds loop. See the fpsd-integration branch in my fork (as soon as I push later today).
Okay, just pushed. You probably don't need to look at it. I'm sure you know what I mean.
Specifically looking at go-knn/knn.go btw.
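The change, roughly (placeholder names below, not the literal knn.go code):

package sketch

// Placeholder names for whatever knn.go actually calls these.
func computePresentFeatures(feat [][]float64) []bool                   { return nil }
func adjustWeights(feat [][]float64, present []bool, weight []float64) {}

func weightLearning(feat [][]float64, weight []float64, rounds int) {
    // Present features depend only on the data, not on the weights,
    // so compute them once, above the rounds loop...
    present := computePresentFeatures(feat)
    for r := 0; r < rounds; r++ {
        // ...instead of recomputing them inside it on every round.
        adjustWeights(feat, present, weight)
    }
}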
Noah Vesely
@nvesely
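        // pylls/Wang's version of the weight-learning inner loop, as I read it (discussed below)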
        for j := 0; j < FeatNum; j++ {
            if weight[j] == 0 {
                badList[j] = 0
                panic("does this ever happen?")
                //continue
            }

            var maxGood float64
            var countBad int
            // find maxGood
            for k := 0; k < RecoPointsNum; k++ {
                n := math.Abs(feat[i][j] - feat[recoGoodList[k]][j])
                if feat[i][j] == -1 || feat[recoGoodList[k]][j] == -1 {
                    n = 0
                }
                if n >= maxGood {
                    maxGood = n
                }
            }

            for k := 0; k < RecoPointsNum; k++ {
                n := math.Abs(feat[i][j] - feat[recoBadList[k]][j])
                if feat[i][j] == -1 || feat[recoBadList[k]][j] == -1 {
                    n = 0
                }
                featDist[j] += n
                if n <= maxGood {
                    countBad++
                }
            }
            badList[j] = countBad
            if countBad < minBadList {
                minBadList = countBad
            }
        }
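            // My version of the same loop (from my fork), which I believe fixes the bug described below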
            for j := 0; j < featNum; j++ {
                // Find maxGoodFeatDist, the maximum distance between pTrain and all
                // points in recoGoodList
                var maxGoodFeatDist float64
                for _, v := range recoGoodList {
                    if xTrain[v][j] != 0.0 && pTrain[j] != 0.0 {
                        n = weight[j] * math.Abs(xTrain[v][j]-pTrain[j])
                        if n > maxGoodFeatDist {
                            maxGoodFeatDist = n
                        }
                    } else {
                        n = 0
                    }
                }
                // Find the relevant number of bad distances, badList[j], and the total
                // distance between pTrain[j] and p'[j] for p' in recoBadList,
                // featDist[j]. Side note: because Go garbage collection is slow, it's
                // faster to reset the values in a slice than to let Go garbage collect
                // it and initialize a new one with each round.
                badList[j] = 0
                featDist[j] = 0
                for _, v := range recoBadList {
                    if xTrain[v][j] != 0.0 && pTrain[j] != 0.0 {
                        n = math.Abs(xTrain[v][j] - pTrain[j])
                        if weight[j]*n <= maxGoodFeatDist {
                            badList[j]++
                        }
                        featDist[j] += n
                    } else {
                        n = 0
                    }
                }
            }

            _, minBadList := getMin(badList)
Noah Vesely
@nvesely
@pylls I wish we could just write this out on a whiteboard, but I'll give my best attempt to explain what I think is a bug in your and Wang's code (above--mine is below and I believe fixes the bug). The purpose of the featDist slice is to keep track of the cumulative, raw (i.e., unweighted) distance between pTrain (the point we're training on) and all points in recoBadList. This is so we can increase the weights after the weight reduction of the useful features, in order to keep the overall weighted distance (i.e., the weighted L^1 norm) between pTrain and recoBadList the same.

In your and Wang's code, you first find maxGood (maxGoodFeatDist in my version) as the maximum unweighted distance between pTrain and any point in recoGoodList. In section 3.2.5 of Wang's thesis this value is defined as the maximum weighted distance, which makes sense because that's the effective metric we care about when judging how well the weights separate classes. Then, when finding the relevant number of bad distances, you again compare against the unweighted feature norms.

In my code, in the second inner loop, I assign the raw norm as the initial value of n, because we do want featDist to track these raw values so we can take advantage of the simple equality (weight[j] - deltaW) * d = weight[j] * d - deltaW * d (which of course requires we know d, the raw feature norm). Then, for the comparison against maxGoodFeatDist, which is a weighted value, I multiply by weight[j]. I hope this makes sense. Should probably file a GitHub issue.
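To restate that in symbols (my shorthand, loosely following 3.2.5), the per-feature quantities should be:

maxGood_j = \max_{p' \in recoGoodList} w_j \, |pTrain_j - p'_j|   (weighted, as in the thesis)
badList_j = |\{ p' \in recoBadList : w_j \, |pTrain_j - p'_j| \le maxGood_j \}|
featDist_j = \sum_{p' \in recoBadList} |pTrain_j - p'_j|   (raw, so that (w_j - \Delta w) \, featDist_j = w_j \, featDist_j - \Delta w \, featDist_j)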
Noah Vesely
@nvesely
Side note: those else blocks setting n = 0 are useless.