I'm reworking the parser logic right now. Could you please let me know what the point is of
    if tag == "div":
        marker = self.url.find("?q=")
        if marker >= 0:
            self.url = self.url[marker + 3:]
        marker = self.url.find("&sa")
        if marker >= 0:
            self.url = self.url[:marker]
        if self.url != "":
            if self.url.find("://", 0, 12) >= 0:
                index = len(self.results) + 1
                self.results.append(Result(index, self.title, self.url, self.text))
            else:
                skipped += 1
The processing part was introduced in jarun/googler@c0808cf, but the commit message doesn't give any example, and after the modifications self.url surely won't be a URL anymore unless you percent-decode it, in which case there's a chance. I just don't see the point.
As for finding the protocol host separator, could you give me an example of it failing? The check was introduced in jarun/googler@d498c50 by the way, which says "Skip google news and images links", but even without this check Google News and Images results should be filtered out under Google's current layout, because they wouldn't survive our class guards.
http://www.thingiverse.com/search?q=Raspberry+pi&sa=. This was the format in which Google used to give some result URLs to its own domains or to the nearest eating places, based on your location (tracking :D). However, I believe not checking for the Google domain is crude.
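To make the effect of the quoted snippet concrete, here is a minimal sketch that copies its string handling into a standalone function. The name `post_process` and the redirect-style input are assumptions made for illustration; only the thingiverse URL comes from this thread.

```python
def post_process(url):
    """Standalone copy of the quoted URL handling.

    Returns the cleaned URL, or None when the original code would not
    append a result (empty URL, or no "://" within the first 12 characters).
    """
    marker = url.find("?q=")
    if marker >= 0:
        url = url[marker + 3:]   # keep only what follows "?q="
    marker = url.find("&sa")
    if marker >= 0:
        url = url[:marker]       # drop "&sa" and everything after it
    if url != "" and url.find("://", 0, 12) >= 0:
        return url
    return None


# Hypothetical redirect-style link: the target URL is recovered.
redirect = post_process("/url?q=http://example.com/page&sa=U&ved=abc")

# URL from this thread: everything up to "?q=" is cut off, so no "://" is left.
direct = post_process("http://www.thingiverse.com/search?q=Raspberry+pi&sa=")

print(redirect)
print(direct)
```

Under these assumptions the redirect-style link comes out as `http://example.com/page`, while the thingiverse URL is reduced to `Raspberry+pi` and therefore rejected by the `://` check.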
hello and you'll find results from YouTube, Google Maps and Google Images.
g mnr-c g-blk rather than g; i.e., the URL post-processing code has no effect on that (youtube.com certainly can't be a relative link). When I parse classes appropriately, the result does show up regardless.
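One way to "parse classes appropriately" is to treat the class attribute as a space-separated token list rather than comparing it against a single string. The following is only a sketch of that idea, not the code from the parser branch; `ResultDivFinder` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class ResultDivFinder(HTMLParser):
    """Hypothetical parser that collects the class sets of result divs."""

    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # Split the attribute into tokens so "g mnr-c g-blk" still contains "g".
        classes = set((dict(attrs).get("class") or "").split())
        if "g" in classes:
            self.matches.append(classes)


finder = ResultDivFinder()
finder.feed(
    '<div class="g">a</div>'
    '<div class="g mnr-c g-blk">b</div>'
    '<div class="gx">c</div>'
)
print(len(finder.matches))
```

With token matching, both the plain `g` div and the `g mnr-c g-blk` div are picked up, while `gx` is not, which a substring test would get wrong.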
"Currently parser logic is a mess." - I'm afraid it's true. The problem is that none of the guys who worked on the parser (including me and Narrat) had the time to rework it, so we kept building on top of it.
"I noticed that some tests can be dropped and some can be made more rigorous" - Awesome. We'll figure out regressions, if any, during testing.
No need to profile. Comment out resp_body = gzip.GzipFile and parser.feed(resp_body) for no IO and no parsing, and in Zsh:
> time ( repeat 10 googler --np hello >/dev/null )
( repeat 10; do; googler --np hello > /dev/null; done; ) 0.60s user 0.20s system 43% cpu 1.865 total
Restore resp_body = gzip.GzipFile for IO but no parsing:
( repeat 10; do; googler --np hello > /dev/null; done; ) 0.70s user 0.22s system 15% cpu 5.780 total
Also restore parser.feed(resp_body) for IO and parsing (this is from my new parser logic branch, which even has slightly more overhead):
( repeat 10; do; googler --np hello > /dev/null; done; ) 0.97s user 0.22s system 20% cpu 5.760 total
As you can see, parsing time is negligible; IO is not.
Note that I'm on Wi-Fi right now; might be better on Ethernet. But it's by no means slow Wi-Fi. According to speedtest-cli: latency 4.856 ms, Download: 189.38 Mbit/s, Upload: 165.39 Mbit/s.
Actually I made a mistake when measuring "no IO and no parsing" time. I only commented out the reading from socket (downloading) part; opening connection and waiting time were included. The correct time is something like
( repeat 10; do; googler --np hello > /dev/null; done; ) 0.53s user 0.21s system 93% cpu 0.797 total
so IO time is even longer, almost 500ms. In fact, I verified with Chromium's network tab, and indeed on my current Wi-Fi it takes about 500ms in total (waiting time included, ~100ms) to finish the GET request.
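For reference, the per-request figures can be derived from the totals above with a little arithmetic. This is a rough sketch: it takes the wall-clock difference as IO time and the user-CPU difference as parsing time, and the parsing run came from a different branch, so the second number is only indicative.

```python
RUNS = 10

# Wall-clock totals in seconds, copied from the measurements above.
no_io_total = 0.797      # corrected "no IO and no parsing" run
io_only_total = 5.780    # IO but no parsing

# User CPU totals in seconds: IO without parsing vs. IO with parsing.
user_io_only = 0.70
user_io_and_parsing = 0.97

io_per_run_ms = (io_only_total - no_io_total) / RUNS * 1000
parse_cpu_per_run_ms = (user_io_and_parsing - user_io_only) / RUNS * 1000

print(f"IO per run: ~{io_per_run_ms:.0f} ms")
print(f"Parsing CPU per run: ~{parse_cpu_per_run_ms:.0f} ms")
```

That works out to roughly 498 ms of IO against about 27 ms of parsing CPU per invocation.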
Did you test for any regressions?