IETF and W3C maintain policies for their public archives, including when they will (infrequently) remove messages. I can assure you these are significant archives that really matter, and that spam, abuse and private information have all been issues in the past.
but I agree that it certainly shouldn't be a frequent issue, especially in settings where the lists are access-controlled
I think git-lfs may be effective for shared research settings (since we may need to frequently and efficiently update many, sometimes large, files), but I'm also open to seeing if IA would be a good place, if we can find a good metadata setup that works
Christovis: do you have an update on your 3GPP crawling? do you think git-lfs would work for sharing archives, or would you be interested in an Internet Archive backup, or have any other ideas?
You run into IP banning issues at a certain point. I tried a couple of different proxy solutions to distribute requests across, and none of them worked that well.
The biggest issue with the proxies is that there are a lot of proxy connection failures, which in turn throw a lot of errors that are hard to handle. There are a couple of libraries that create a proxy server (that you point your requests library at) and distribute all requests randomly across a list of proxies you provide (the proxies are automatically tested before being added to the pool).
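For what it's worth, here's a rough sketch of the rotate-and-retry idea those pools implement (not any particular library, and the proxy addresses are placeholders):

```python
# Sketch only: pick a random proxy per request and retry on failures.
# The proxy addresses below are placeholders, not real servers.
import random
import requests

PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
]

def fetch(url, attempts=3):
    for _ in range(attempts):
        proxy = random.choice(PROXIES)
        try:
            return requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=10,
            )
        except (requests.exceptions.ProxyError,
                requests.exceptions.ConnectionError,
                requests.exceptions.Timeout):
            continue  # bad proxy or flaky connection, try another one
    raise RuntimeError(f"all proxy attempts failed for {url}")
```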
I didn't get to the point of using any of the pools with aiohttp, because if you can't get logged in then the headers are all screwed up, so I didn't think it was worth it. Proxies + aiohttp (if the session is working) might work well, though.
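If someone wants to try that combination, a minimal sketch might look like this (the login URL, credentials and proxy list are all made up; the point is just that the session holds the cookies from the login while each request goes through a randomly chosen proxy):

```python
import asyncio
import random
import aiohttp

PROXIES = ["http://203.0.113.10:8080", "http://203.0.113.11:8080"]  # placeholders

async def crawl(urls, login_url, credentials):
    async with aiohttp.ClientSession() as session:
        # log in once; the session keeps the cookies afterwards
        async with session.post(login_url, data=credentials) as resp:
            resp.raise_for_status()
        pages = []
        for url in urls:
            async with session.get(url, proxy=random.choice(PROXIES)) as resp:
                pages.append(await resp.text())
        return pages

# asyncio.run(crawl(urls, "https://example.org/login", {"user": "...", "pw": "..."}))
```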
After doing all that I re-read the wiki and it said that you had already scraped the listserv, so I figured it would be smarter to just ask you for it.
I think if slow crawling is feasible for our use cases, it's more polite, better for the archive maintainer and less likely to lead to banning/blocking
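e.g. even something as simple as a fixed delay between requests and a descriptive User-Agent would already help (just a sketch; the contact address is a placeholder):

```python
# Polite-crawling sketch: pause between requests and identify the crawler
# so the archive maintainer can get in touch if it causes problems.
import time
import requests

HEADERS = {"User-Agent": "research-crawler (contact: maintainer@example.org)"}  # placeholder contact

def crawl_slowly(urls, delay_seconds=5):
    for url in urls:
        response = requests.get(url, headers=HEADERS, timeout=30)
        yield url, response
        time.sleep(delay_seconds)  # spread the load out instead of hammering the server
```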
@Christovis: One reason that scraping listservs takes so long is that you are really requesting each message URL twice: the headers first and then the message content, because the Python requests library can't render the iframe. It might make sense to use Selenium and run it headless so that it can render the JS iframe with just one request. That might cut the run length in half.
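A rough sketch of that idea (the message URL is a placeholder and the iframe index may differ on the real listserv pages):

```python
# Render a listserv message page, including its JS iframe, with headless Chrome
# so the headers and body come back from a single page load.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

driver.get("https://list.example.org/message/12345")  # placeholder URL
# wait for the iframe to appear, then switch into it
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it(0))
message_html = driver.page_source  # rendered message content
driver.quit()
```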
Magic Wormhole is probably the easiest way to send the archive (pip install magic-wormhole). If you are good with that, then please send the code via email, since wormhole uses PAKE encryption and Matrix is open. I would also throw in a --code-length 50 argument just for good measure 😀.
I will have that PR submitted by the end of the day - Sarah
@Christovis: Okay, will do. Just submitted that PR you asked for. Sorry, I didn't mean to come across like I was demanding you send me that archive. I know how annoying it can be for open-source projects to become full-time jobs, and I don't want to put extra work on you. The only reason I followed up was that you had mentioned last week that you thought it would be finished by the end of the week. -Sarah
I would recommend against making these publicly accessible. Especially if 3GPP is limiting scraping or requires authentication in order to see full email addresses, we wouldn't want to publish all the addresses for spam harvesting, etc.
if the host of the archive is using multiple measures to limit access, I believe that's a clear sign that they don't intend for entire archives to be easily publicly downloadable and email addresses accessed en masse. Perhaps you know more about email spammer practices than I do (or than the archive hosts do), but I wouldn't consider an mbox file next to impossible to parse
Also regarding my PR, I read through a ton of listserv threads last night and noticed a couple of things: 1) Company logo images are treated as attachments, and therefore my solution would also download all of those, which we don't want. 2) At least in the case of the 3GPP listserv there are essentially no attachments; people generally just upload files to the 3GPP FTP server and then send a link in the listserv.
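For point 1, something along these lines would skip the logo images while keeping real attachments (just a sketch of the filtering idea, not the actual PR code):

```python
# Skip image parts (how the company logos usually arrive) and keep only
# parts explicitly marked as attachments when walking an mbox.
import mailbox

def real_attachments(mbox_path):
    for message in mailbox.mbox(mbox_path):
        for part in message.walk():
            if part.get_content_maintype() == "image":
                continue  # logos and signature graphics
            if part.get_content_disposition() == "attachment":
                yield part.get_filename(), part.get_payload(decode=True)
```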
Another thought on this: at least with regard to 3GPP, all of the email addresses in question are freely available on the 3GPP FTP server, which anybody can access (they are embedded in almost every document). I am sure any automated scraper has already gotten those from there and therefore wouldn't need to get them from our mbox.
@Christovis
Hi @markopolo1992:matrix.org I am sorry that it is taking so long, and I hope you understand that this delicate topic takes a while to get right. If you are still on the bigbang mailing list, you know that we continue to work on the data access permit application. We will get back to you as soon as we have finalised it.