    markopolo1992
    @markopolo1992:matrix.org
    [m]
    If maintaining that code and archive is too much work for you, I would be totally willing to manage that. (I don't mean to just dump work on you guys)
    1 reply
    npd
    @npd:mozilla.org
    [m]
    IETF and W3C maintain policies for their public archives, including when they will (infrequently) remove messages. I can assure you these are significant archives that really matter, and that spam, abuse and private information have all been issues in the past.
    but I agree that it certainly shouldn't be a frequent issue, especially in settings where the lists are access-controlled
    I think git-lfs may be effective for shared research settings (since we may need to update many, sometimes large, files frequently and efficiently), but I'm also open to seeing if IA would be a good place, if we can find a good metadata setup that works
    Christovis: do you have an update on your 3gpp crawling? do you think git-lfs would work for sharing archives, or would you be interested in an Internet Archive backup, or have any other ideas?
    3 replies
    npd
    @npd:mozilla.org
    [m]
    I've tried a different matrix handle to see if we can get a notification to him 🙂
    markopolo1992
    @markopolo1992:matrix.org
    [m]
    Is chris active on matrix? I could always send him a message on github.
    markopolo1992
    @markopolo1992:matrix.org
    [m]
    npd:
    npd
    @npd:mozilla.org
    [m]
    hmm, yeah, you might try email or github
    2 replies
    markopolo1992
    @markopolo1992:matrix.org
    [m]
    @Christovis: Are you using the stock bigbang scraper for the 3GPP archive? I tried to make it async and speed it up and it didn't work that well.
    2 replies
    markopolo1992
    @markopolo1992:matrix.org
    [m]
    You run into IP banning issues at a certain point. I tried a couple of different proxy solutions to distribute the requests, and none of them worked that well.
    markopolo1992
    @markopolo1992:matrix.org
    [m]
    The biggest issue with the proxies is that there are a lot of proxy connection failures, which in turn throw a lot of errors that are hard to handle. There are a couple of libraries that create a proxy server (that you point your requests library to) which distributes all requests randomly across a list of proxies you provide (the proxies are automatically tested before being added to the pool).
    I didn't get to the point of using any of the pools with aiohttp, because if you can't get logged in then the headers are all screwed up, so I didn't think it was worth it. Proxies + aiohttp (if the session is working) might work well though.
    After doing all that I re-read the wiki and it said that you had already scraped the listserv, so I figured it would be smarter to just ask you for it.
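A minimal sketch of the proxy-pool idea described above, assuming a hypothetical `fetch(url, proxy)` callable that is not part of bigbang. It picks a random proxy, drops it from the pool when the connection fails, and gives up with one clear exception instead of a stream of hard-to-handle errors. The function and exception names are illustrative only.

```python
import random
from typing import Callable, Optional, Sequence


class AllProxiesFailed(Exception):
    """Raised when no proxy in the pool produced a response."""


def fetch_with_proxy_pool(
    url: str,
    proxies: Sequence[str],
    fetch: Callable[[str, str], str],  # hypothetical: fetch(url, proxy) -> body
    max_attempts: int = 5,
    rng: Optional[random.Random] = None,
) -> str:
    """Try randomly chosen proxies until one works or attempts run out."""
    rng = rng or random.Random()
    pool = list(proxies)
    last_error: Optional[BaseException] = None
    for _ in range(max_attempts):
        if not pool:
            break
        proxy = rng.choice(pool)
        try:
            return fetch(url, proxy)
        except OSError as exc:
            # Built-in ConnectionError and the exceptions raised by the
            # requests library are all subclasses of OSError.
            last_error = exc
            pool.remove(proxy)  # do not retry a proxy that just failed
    raise AllProxiesFailed(f"no working proxy for {url}") from last_error
```

With the requests library, `fetch` could be something like `lambda url, proxy: requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10).text`. This does not address the login/session problem mentioned above.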
    npd
    @npd:mozilla.org
    [m]
    I think if slow crawling is feasible for our use cases, it's more polite, better for the archive maintainer and less likely to lead to banning/blocking
    2 replies
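A rough sketch of what "slow crawling" could look like: a generator that enforces a minimum interval between requests. The `fetch` callable and the 5-second default are assumptions for illustration, not values taken from bigbang or from any archive's policy.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def crawl_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],  # hypothetical: fetch(url) -> body
    min_interval: float = 5.0,  # assumed delay in seconds, tune per archive
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[Tuple[str, str]]:
    """Yield (url, body) pairs, waiting at least min_interval between requests."""
    last_request = None
    for url in urls:
        if last_request is not None:
            # Sleep only for whatever is left of the interval.
            wait = min_interval - (clock() - last_request)
            if wait > 0:
                sleep(wait)
        last_request = clock()
        yield url, fetch(url)
```

The `sleep` and `clock` parameters exist only so the pacing can be tested without real waiting.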
    markopolo1992
    @markopolo1992:matrix.org
    [m]
    @Christovis: One reason that scraping listservs takes so long is that you are really requesting each message URL twice, first for the headers and then for the message content, because the Python requests library can't render the iframe. It might make sense to use selenium and run it headless so that it can render the JS iframe with just one request. It might cut the run time in half.
    markopolo1992
    @markopolo1992:matrix.org
    [m]
    @Christovis: I just got back. Has the scraping finished yet? If so would you be willing to send me a copy of the archive?
    markopolo1992
    @markopolo1992:matrix.org
    [m]
    @Christovis: Magic wormhole is probably the easiest way to send the archive (pip install magic-wormhole). If you are good with that, then please send the code via email, since wormhole uses PAKE encryption and matrix is open. I would also throw in a --code-length 50 argument just for good measure 😀.
    I will have that PR submitted by the end of the day - Sarah
    2 replies
    markopolo1992
    @markopolo1992:matrix.org
    [m]
    @Christovis: Okay, will do. I just submitted the PR that you asked for. Sorry, I didn't mean to come across as though I was demanding that you send me that archive. I know how annoying it can be when open source projects become full-time jobs, and I don't want to put extra work on you. The only reason I followed up was that you had mentioned last week that you thought it would be finished by the end of the week. -Sarah
    1 reply
    markopolo1992
    @markopolo1992:matrix.org
    [m]
    Yeah, I didn't know if you were going to upload those there; that's why I said that. If they are uploaded, then yes, that flag wouldn't matter.
    npd
    @npd:mozilla.org
    [m]
    I would recommend against making these publicly accessible. especially if 3gpp is limiting scraping or requires authentication in order to see full email addresses, we wouldn't want to publish all the addresses for spam harvesting, etc.
    markopolo1992
    @markopolo1992:matrix.org
    [m]
    the mbox file format (with compression on top of that) will likely make it next to impossible for an automated scraper to find those addresses
    npd
    @npd:mozilla.org
    [m]
    if the host of the archive is using multiple measures to limit access, I believe that's a clear sign that they don't intend for entire archives to be easily publicly downloadable and email addresses accessed en masse. perhaps you know more about email spammer practices than I do (or than the archive hosts do), but I wouldn't consider an mbox file next to impossible to parse
    1 reply
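To illustrate the point about parsing: a short sketch using only Python's standard library (`gzip`, `mailbox`, `email.utils`) that reads sender addresses out of a gzip-compressed mbox. The file layout is an assumption; the takeaway is only that compression plus mbox is not a meaningful barrier, which is why public hosting deserves caution.

```python
import gzip
import mailbox
import os
import shutil
import tempfile
from email.utils import parseaddr
from typing import List


def sender_addresses(gz_path: str) -> List[str]:
    """Return the From address of every message in a gzip-compressed mbox."""
    with tempfile.TemporaryDirectory() as tmp:
        plain_path = os.path.join(tmp, "archive.mbox")
        # mailbox.mbox needs a real file on disk, so decompress first.
        with gzip.open(gz_path, "rb") as src, open(plain_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        box = mailbox.mbox(plain_path, create=False)
        try:
            # parseaddr splits "Name <addr>" into (name, addr).
            return [parseaddr(msg.get("From", ""))[1] for msg in box]
        finally:
            box.close()
```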
    markopolo1992
    @markopolo1992:matrix.org
    [m]
    @Christovis: If you have a sec, could you send me however much you have gotten so far?
    That way I can get started on the viz 🙂
    1 reply
    markopolo1992
    @markopolo1992:matrix.org
    [m]
    Also regarding my PR, I read through a ton of listserv threads last night and noticed a couple things: 1) Company logo images are treated as attachments and therefore my solution would also download all of those which we don't want. 2) At least in the case of the 3GPP listserv there are essentially no attachments, people generally just upload files to the 3GPP ftp server and then send a link in the listserv.
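One possible way to handle point 1, sketched with the standard library's `email` package rather than bigbang's own code: skip every attachment whose main content type is `image`. This is a blunt heuristic and an assumption on my part, since it would also drop images that were attached on purpose. The function name is hypothetical.

```python
from email.message import EmailMessage
from typing import List, Tuple


def non_image_attachments(msg: EmailMessage) -> List[Tuple[str, bytes]]:
    """Return (filename, payload) for each attachment that is not an image."""
    kept = []
    for part in msg.iter_attachments():
        # Signature logos arrive as image/* parts, so filter on the main type.
        if part.get_content_maintype() == "image":
            continue
        name = part.get_filename() or "unnamed"
        kept.append((name, part.get_payload(decode=True)))
    return kept
```

When parsing raw messages, `email.message_from_bytes(raw, policy=email.policy.default)` returns an `EmailMessage`, which is what `iter_attachments()` requires.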
    markopolo1992
    @markopolo1992:matrix.org
    [m]
    Another thought on this: at least with regard to 3GPP, all of the email addresses in question are freely available on the 3GPP FTP server (they are embedded in almost every document), which anybody can access. I am sure any automated scraper has already gotten them from there and therefore wouldn't need to get them from our mbox.
    markopolo1992
    @markopolo1992:matrix.org
    [m]
    I am a college student using the data in a paper about 3GPP for my telecom class.
    I would be fine signing some sort of release to get the data. I just want to limit the strain we put on 3GPP's servers. 🙂
    markopolo1992
    @markopolo1992:matrix.org
    [m]
    @Christovis: Is there any update on the legal advice you received?
    markopolo1992
    @markopolo1992:matrix.org
    [m]
    @Christovis: was my application received?
    Christovis
    @Christovis
    Yes, please let me know when you could shake hands through the wormhole.
    markopolo1992
    @markopolo1992:matrix.org
    [m]
    I am available right now if that works for you @Christovis
    (or for any of the next couple hours)
    If you are busy it might make sense to wait for the weekend.
    Christovis
    @Christovis
    Hi @markopolo1992:matrix.org, I am sorry that it is taking so long, and I hope you understand that this delicate topic takes a while to get right. If you are still on the bigbang mailing list, you know that we continue to work on the data access permit application. We will get back to you as soon as we have finalised it.
    npd:mozilla.org @npd:mozilla.org waves
    npd
    @npd:mozilla.org
    [m]
    @Christovis: ready to chat when you are
    npd
    @npd:mozilla.org
    [m]
    IETF is this week, and I'm in an overlapping session this morning, so I'll be late to our call (unless the session ends early)
    npd
    @npd:mozilla.org
    [m]
    no one is in the call and I don't see any notes from today
    but for the record I know I'm behind on integrating my notebook, and apologize for my tardiness
    npd
    @npd:mozilla.org
    [m]
    regrets, friends, I don't think I'm going to make the call today, I'm just too far behind
    I'm afraid I don't have any updates on my task for a PR; I apologize
    npd
    @npd:mozilla.org
    [m]
    small call today, check-in with Seb, me, Priyanka
    npd
    @npd:mozilla.org
    [m]
    regrets, I think it's unlikely I'm going to make the call in 9 hours