    Terminator X
    @jarun
    nope. Seems like I need to.
    Terminator X
    @jarun
    gpg --recv-keys 95043DCB
    gpg: key 95043DCB: "Arun Prakash Jana <engineerarun@gmail.com>" not changed
    gpg: Total number processed: 1
    gpg:              unchanged: 1
    also
    gpg --keyserver hkp://keys.gnupg.net --recv-keys 95043DCB
    gpg: key 95043DCB: "Arun Prakash Jana <engineerarun@gmail.com>" not changed
    gpg: Total number processed: 1
    gpg:              unchanged: 1
    Zhiming Wang
    @zmwangx
    Got it this time.
    Terminator X
    @jarun
    @zmwangx I am thinking of sharing the gitter link on googler page. What do you think? I guess anyone can join and leave a message for queries, clarifications etc., basically stuff that's not strictly an issue or a PR.
    what's your opinion?
    Zhiming Wang
    @zmwangx
    No objection. Here's the badge if you like badges (I do): https://img.shields.io/gitter/room/jarun/googler.svg?maxAge=2592000.
    Zhiming Wang
    @zmwangx

    I'm reworking the parser logic right now. Could you please let me know what's the point of

            if tag == "div":
                marker = self.url.find("?q=")
                if marker >= 0:
                    self.url = self.url[marker + 3:]
                marker = self.url.find("&sa")
                if marker >= 0:
                    self.url = self.url[:marker]
    
                if self.url != "":
                    if self.url.find("://", 0, 12) >= 0:
                        index = len(self.results) + 1
                        self.results.append(Result(index, self.title,
                                                   self.url,
                                                   self.text))
                    else:
                        skipped += 1

    The processing was introduced in jarun/googler@c0808cf, but the commit message doesn't give any example, and after these modifications self.url surely won't be a URL anymore (unless you percent decode it, in which case there's a chance), so I just don't see the point.
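    For reference, here is what that stripping does, with the percent-decoding step added as suggested above. This is a sketch, not googler's actual code, and the sample URL assumes Google's /url?q=...&sa=... redirect-wrapper format:

```python
from urllib.parse import unquote

def clean_google_url(url):
    # Same stripping as the snippet above: keep what follows "?q=",
    # drop everything from "&sa" on, then percent-decode so the
    # result is an actual URL again.
    marker = url.find("?q=")
    if marker >= 0:
        url = url[marker + 3:]
    marker = url.find("&sa")
    if marker >= 0:
        url = url[:marker]
    return unquote(url)

# Assumed redirect-wrapper format:
# clean_google_url("/url?q=http%3A%2F%2Fexample.com%2Fa%3Fb%3D1&sa=U")
#   -> "http://example.com/a?b=1"
```

    Without the unquote() at the end, the stripped value stays percent-encoded, which is why the current code's output "surely won't be a URL anymore".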

    As for finding the protocol host separator, could you give me an example of it failing? The check was introduced in jarun/googler@d498c50 by the way, which says "Skip google news and images links", but even without this check Google News and Images results should be filtered out under Google's current layout, because they wouldn't survive our class guards.
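    The check in question reduces to something like this hypothetical helper:

```python
def looks_absolute(url):
    # Mirrors the snippet's test: a "://" scheme separator within the
    # first 12 characters marks an absolute URL; scheme-less paths
    # (e.g. Google-internal links) fail it.
    return url.find("://", 0, 12) >= 0

# looks_absolute("https://example.com")  -> True
# looks_absolute("search?q=hello")       -> False
```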

    Finally, can we just remove the "skipped" thing? First, I've personally never seen an ad in Google's response, even for queries that clearly should serve up ads. I think ads are rendered through JavaScript in the browser. Second, if "google news and images links" count towards skipped, then "... ads skipped" is obviously not an accurate description. Third, I've never seen "... ads skipped"...

    And I don't think people really care about how many results have been skipped...
    Terminator X
    @jarun
    Regarding the second commit, I have seen Google News, YouTube and Images "URL" results which do not have any http:// or https:// before them. Perhaps Google handles them internally when clicked. I wanted to remove those.
    Regarding the first commit, take the example of http://www.thingiverse.com/search?q=Raspberry+pi&sa=. This was the format in which Google used to give some result URLs pointing to its own domains or to nearby eating places, based on your location (tracking :D). However, I believe not checking for the Google domain is crude.
    In case of some ads, there were no links.
    The filtered ones included Google Maps results as well.
    Search for hello and you'll find results from YouTube, Google Maps and Google Images.
    Terminator X
    @jarun
    If you can work on a more elegant way to parse out the Google ads results, or if it works by default now, we should be good :)
    Zhiming Wang
    @zmwangx
    Google Maps and Google Images results do not match the structures we parse, so they are automatically filtered out even without post processing. Same goes for eating places and such (search "pizza", for instance).
    As for YouTube, when you google "hello" the first result is a music video. I think it's a valid result, because apparently it's a pretty well-known music video. I think we should keep it.
    And the only reason it doesn't show up currently is because its outermost div class is g mnr-c g-blk rather than g — i.e., the URL post processing code has no effect on that (youtube.com certainly can't be a relative link). When I parse classes appropriately, the result does show up regardless.
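    The class handling described here can be illustrated with a toy helper (not googler's actual code); attrs is the (name, value) list that html.parser hands to handle_starttag:

```python
def has_class(attrs, cls):
    # attrs is the [(name, value), ...] list html.parser passes to
    # handle_starttag. Splitting the class attribute matches "g"
    # inside "g mnr-c g-blk", which an exact string comparison misses.
    classes = dict(attrs).get("class") or ""
    return cls in classes.split()

# has_class([("class", "g mnr-c g-blk")], "g")  -> True
# whereas ("g mnr-c g-blk" == "g")              -> False
```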
    Zhiming Wang
    @zmwangx
    I spent more time investigating those top "card results", and the more I investigate, the more I think they should be included. The reason is that sometimes the card contains the "official result", and all further results are non-official (the card result is not duplicated below). For instance, if you google "gangnam style", the card result gives you the official MV, and the second result (the first one if you exclude the card result) is some ridiculous fanmade crap titled "PSY- Gangnam Style (Official Music Video)"... And if you google "star wars rogue one trailer", same story.
    Terminator X
    @jarun
    Sure thing! I'm OK with exploring new stuff.
    I think we can easily test these for an hour each and figure out what we want to do.
    I'd be really happy if we have to parse less. Currently it takes longer than it should, and the fewer the condition checks, the faster it would work. So please continue.
    As we are already fetching gzip compressed results, parsing is the place we should check out for performance improvements.
    Zhiming Wang
    @zmwangx
    The bottleneck here is most likely IO (networking), so I don't think parsing performance matters at all. It's just a straightforward one-pass parser.
    IMO we should simplify the logic for robustness and maintainability.
    Currently parser logic is a mess. If you actually follow the parser logic, you'll see that we basically have a blob of poorly connected GOTOs.
    What I've done so far is to rework the parser logic for extensibility and maintainability, and thoroughly document it. In that process I noticed that some tests can be dropped and some can be made more rigorous.
    Terminator X
    @jarun
    The bottleneck here is most likely IO (networking) - with my network speed, I don't think so. I guess I should do some profiling. We can add this to the task list as well.
    Currently parser logic is a mess. - I'm afraid it's true. The problem is that none of the guys who worked on the parser (including me or Narrat) had the time to rework it. So we built on top of it.
    I noticed that some tests can be dropped and some can be made more rigorous - awesome. We'll figure out regressions, if any, during testing.
    Zhiming Wang
    @zmwangx

    No need to profile. Comment out resp_body = gzip.GzipFile and parser.feed(resp_body) for no IO and no parsing, then in Zsh:

    > time ( repeat 10 googler --np hello >/dev/null )
    ( repeat 10; do; googler --np hello > /dev/null; done; )  0.60s user 0.20s system 43% cpu 1.865 total

    Uncomment resp_body = gzip.GzipFile for IO but no parsing:

    ( repeat 10; do; googler --np hello > /dev/null; done; )  0.70s user 0.22s system 15% cpu 5.780 total

    Now uncomment parser.feed(resp_body) for IO and parsing (this is from my new parser logic branch, which even has slightly more overhead):

    ( repeat 10; do; googler --np hello > /dev/null; done; )  0.97s user 0.22s system 20% cpu 5.760 total

    As you can see, parsing time is negligible; IO is not.

    Note that I'm on Wi-Fi right now; might be better on Ethernet. But it's by no means slow Wi-Fi. According to speedtest-cli: latency 4.856 ms, Download: 189.38 Mbit/s, Upload: 165.39 Mbit/s.
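    Instead of commenting lines out, the two phases could also be timed in-process with a tiny helper; a minimal sketch (the fetch/parse call names in the comment are placeholders, not googler's actual symbols):

```python
import time

def timed(label, func, *args):
    # Run one phase (e.g. fetch or parse) and report its wall-clock time.
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    print("%s: %.1f ms" % (label, elapsed * 1000))
    return result, elapsed

# Hypothetical usage around the two phases:
#   resp_body, _ = timed("fetch", conn.getresponse)
#   _, _ = timed("parse", parser.feed, resp_body)
```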

    Terminator X
    @jarun
    We make gzip compression optional.
    Today.
    Ahh sorry
    Can we have the numbers for fetching results without gzip compression?
    Zhiming Wang
    @zmwangx
    I'm in a meeting right now. Will get back to you later.
    Zhiming Wang
    @zmwangx
    Not much different without gzip:
    ( repeat 10; do; googler --np hello > /dev/null; done; )  0.86s user 0.22s system 19% cpu 5.572 total
    Zhiming Wang
    @zmwangx

    Actually I made a mistake when measuring the "no IO and no parsing" time. I only commented out the reading-from-socket (downloading) part; opening the connection and the waiting time were still included. The correct time is something like

    ( repeat 10; do; googler --np hello > /dev/null; done; )  0.53s user 0.21s system 93% cpu 0.797 total

    so IO time is even longer, almost 500ms. In fact, I verified with Chromium's network tab, and indeed on my current Wi-Fi it takes about 500ms in total (waiting time included, ~100ms) to finish the GET request.

    Terminator X
    @jarun
    OK... in that case we are fine with gzip.
    Zhiming Wang
    @zmwangx
    By the way, in the end I still omitted the "card result" (https://github.com/jarun/googler/blob/master/googler#L317-L319) because sometimes it is duplicated (a Wikipedia result could be duplicated, for instance), plus the fact that it doesn't have an abstract.
    Did you test for any regressions? I already did randomized tests, but given that Google doesn't serve the same thing to everyone, we might still need more testing.
    Terminator X
    @jarun
    By the way, in the end I still omitted the "card result"
    okies
    Did you test for any regressions?
    I will in the coming weekend.
    too busy with office due to an ongoing release :)
    Terminator X
    @jarun
    Can you check with -d switch if your query is being redirected?
    OK leave it...
    Zhiming Wang
    @zmwangx
    Redirected to?