These are chat archives for rosshinkley/nightmare

25th Jul 2016
Antonello Pasella
@antonellopasella
Jul 25 2016 14:17
Hi, I have a problem with proxy+auth. Here is a gist with my setup; is there anyone who can help me?
https://gist.github.com/antonellopasella/0efa1fe935a4fbcc5b561b0699e809df
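The gist itself isn't reproduced in this archive, but a typical Nightmare proxy + authentication setup looks roughly like the sketch below; the proxy host, port, and credentials are placeholders, not values taken from the gist:

```js
// Minimal sketch of the usual Nightmare proxy + authentication pattern.
// Host, port, and credentials below are placeholders.
const Nightmare = require('nightmare');

const nightmare = Nightmare({
  show: true,
  switches: {
    'proxy-server': 'my-proxy-host:8080' // placeholder proxy address
  }
});

nightmare
  .authentication('proxy-user', 'proxy-pass') // credentials the proxy expects
  .goto('https://example.com')
  .evaluate(() => document.title)
  .end()
  .then(title => console.log(title))
  .catch(err => console.error('load failed:', err));
```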
Ross Hinkley
@rosshinkley
Jul 25 2016 14:37
@antonellopasella what's the behavior you're getting?
and what are you expecting?
Antonello Pasella
@antonellopasella
Jul 25 2016 14:37
White screen with no page.
Without authentication, using a free proxy, I get the correct page.
Ross Hinkley
@rosshinkley
Jul 25 2016 14:38
silly question, what does the debug output tell you?
your setup looks right at first blush
Antonello Pasella
@antonellopasella
Jul 25 2016 14:44
I'll activate debug and report back to you
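Nightmare's debug output comes from the debug module, so it can be enabled by setting the DEBUG environment variable when running the script (yourscript.js is a placeholder name):

```
DEBUG=nightmare* node yourscript.js
```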
Mingsterism
@mingsterism
Jul 25 2016 15:16
@rosshinkley hey ross. want to ask, let's say i want to grab data from a site every day. i run it at a certain time. but i do not want to recollect past data that i already have, only new data. how can i go about this?
is there a way to check efficiently? because as the data grows larger, it takes more resources.
Ross Hinkley
@rosshinkley
Jul 25 2016 15:18
uh, that depends... if what you are collecting has good key information, you should be able to reconcile what you do and don't have
Mingsterism
@mingsterism
Jul 25 2016 15:19
hmm. what do you mean by good key information?
let's say news articles.
Ross Hinkley
@rosshinkley
Jul 25 2016 15:19
gotcha
so in that example... you might use, say, the URL for the article to know whether or not you've processed it already
Mingsterism
@mingsterism
Jul 25 2016 15:20
so you mean i have to search my whole database? should i use certain search algos?
Ross Hinkley
@rosshinkley
Jul 25 2016 15:20
id existence queries are pretty cheap
Mingsterism
@mingsterism
Jul 25 2016 15:21
umm. what does that mean. sry.
Ross Hinkley
@rosshinkley
Jul 25 2016 15:21
well, ... i should walk that back
in this case you'd be doing an existence check on a field in your db
but still... something like select 1 from articles where url = $1 should be reasonably inexpensive
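A sketch of that existence check using the node-postgres (pg) client; the articles table and url column are assumptions carried over from the example query:

```js
// Check whether an article URL has already been stored, using node-postgres.
const { Pool } = require('pg');
const pool = new Pool(); // connection settings come from PG* environment variables

async function alreadySeen(url) {
  const { rows } = await pool.query(
    'select 1 from articles where url = $1 limit 1',
    [url]
  );
  return rows.length > 0; // true if this article was stored before
}
```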
Mingsterism
@mingsterism
Jul 25 2016 15:22
so am i searching by string? or ID
Ross Hinkley
@rosshinkley
Jul 25 2016 15:22
i don't know enough about your data
you'll be searching for whatever makes the data unique
if you need to get real fancy, you could use a bloom filter
Mingsterism
@mingsterism
Jul 25 2016 15:23
ooo. i heard about that before.
is that good?
Ross Hinkley
@rosshinkley
Jul 25 2016 15:23
but i suspect for your purposes, it's probably unnecessary
Mingsterism
@mingsterism
Jul 25 2016 15:24
it was done in the large-scale crawler
why unnecessary?
Ross Hinkley
@rosshinkley
Jul 25 2016 15:24
adds quite a bit of complexity for not a whole lot of payoff
.... if your dataset is smallish
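For completeness, a minimal Bloom filter sketch in Node; as noted above it is likely overkill for a smallish dataset, and a real project would reach for a tested library rather than this illustration:

```js
// Illustrative Bloom filter: k salted SHA-256 hashes over a fixed-size bit array.
const crypto = require('crypto');

class BloomFilter {
  constructor(bits = 1 << 20, hashes = 4) {
    this.bits = bits;
    this.hashes = hashes;
    this.buf = Buffer.alloc(Math.ceil(bits / 8)); // zero-filled bit array
  }
  // Derive `hashes` bit positions for a value.
  _positions(value) {
    const positions = [];
    for (let i = 0; i < this.hashes; i++) {
      const digest = crypto.createHash('sha256').update(i + ':' + value).digest();
      positions.push(digest.readUInt32BE(0) % this.bits);
    }
    return positions;
  }
  add(value) {
    for (const p of this._positions(value)) this.buf[p >> 3] |= 1 << (p & 7);
  }
  // false => definitely not seen; true => probably seen (false positives possible)
  mightContain(value) {
    return this._positions(value).every(p => (this.buf[p >> 3] & (1 << (p & 7))) !== 0);
  }
}

const seen = new BloomFilter();
seen.add('https://example.com/article-1');
console.log(seen.mightContain('https://example.com/article-1')); // true
console.log(seen.mightContain('https://example.com/article-2')); // almost certainly false
```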
Mingsterism
@mingsterism
Jul 25 2016 15:25
ah i see. let's say my bot grabs the title "War in Iraq", must i cross-reference this title against every document in my MongoDB collection?
e.g. i've got 1 million documents in the collection, so i must check 1 million times, or until i get a match. is that correct?
each document stores a title
Ross Hinkley
@rosshinkley
Jul 25 2016 15:26
titles are probably not the best way to do that
as they're not going to be unique
setting that aside
are you talking about query performance?
because you shouldn't have to do a check like that manually.
... that's why we have queries.
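A sketch of that kind of query with the official MongoDB Node.js driver; the database and collection names, and keying on url rather than title, are assumptions for illustration:

```js
// Let MongoDB do the duplicate check: a unique index on `url` plus one
// indexed findOne() replaces any manual scan over the collection.
const { MongoClient } = require('mongodb');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const articles = client.db('scraper').collection('articles');

  // One-time setup: unique index so the lookup is an index hit, not a scan.
  await articles.createIndex({ url: 1 }, { unique: true });

  const article = { url: 'https://example.com/article-1', title: 'War in Iraq' };

  // Indexed existence check: the server consults the index, not a million docs.
  const existing = await articles.findOne(
    { url: article.url },
    { projection: { _id: 1 } }
  );
  if (!existing) {
    await articles.insertOne(article); // new article, store it
  }

  await client.close();
}

main().catch(console.error);
```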
Mingsterism
@mingsterism
Jul 25 2016 15:28
but can i automate that whilst the bot is extracting data?
i don't want it to extract redundant data
Ross Hinkley
@rosshinkley
Jul 25 2016 15:28
sure
Mingsterism
@mingsterism
Jul 25 2016 15:30
any reference sites i can take a look at to implement this?
i'm still not sure how to know if a url is not in my database,
unless i check all documents one by one.
Ross Hinkley
@rosshinkley
Jul 25 2016 15:31
ha, fair enough. Let's take this offline
check your dms
Matthew Steedman
@knubie
Jul 25 2016 18:17
@rickmed thanks for the link!