    Dominik Ottenbreit
    @dottenbr
    hi there @coder46 any chance you are around?
    @phobik sorry any chance you are around?
    phobik
    @phobik
    supp
    Tharunkumar V
    @tharun148
    Hi
    I want to scrape data from an AJAX website.
    What should I do?
    D Rajiv Lochan Patra
    @rajiv13531
    hi
    I need a regex for a number (1, 51, 101, ...) at the end of a string.
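    A minimal sketch of one way to read that request, assuming the intended numbers are the sequence 1, 51, 101, ... (step 50): match the trailing digits with a regex and then check the arithmetic condition in Python, since modular conditions are awkward to express in a regex alone.

    import re

    def matches_sequence(text):
        """Return True if text ends in a number from the sequence 1, 51, 101, ... (step 50)."""
        m = re.search(r"(\d+)$", text)  # digits at the very end of the string
        if not m:
            return False
        n = int(m.group(1))
        return n % 50 == 1  # 1, 51, 101, ... all leave remainder 1 when divided by 50

    print(matches_sequence("item-101"))  # True
    print(matches_sequence("item-102"))  # False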
    Harikrishnan Shaji
    @har777
    Hey, how do I get scrapy's item JSON output to allow null values?
    Mike Winkelmann
    @codewinkel
    Hi, does anyone have experience deploying a scrapy spider with dependencies? I think I have done everything correctly according to the documentation. I want to deploy a spider with my own dependency plus external dependencies from PyPI. Everything from my setup.py is included in the EGG-INFO, but nothing gets downloaded automatically :/ Can anyone help?
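    For reference, a minimal sketch of the kind of setup.py the message describes, in the layout scrapyd-client generates; the project name, settings module, and dependency list below are placeholders rather than details from the original message.

    from setuptools import setup, find_packages

    setup(
        name="myproject",  # placeholder project name
        version="1.0",
        packages=find_packages(),
        # tells the deployment target which settings module the spiders use
        entry_points={"scrapy": ["settings = myproject.settings"]},
        # external dependencies from PyPI; these are recorded in the egg metadata,
        # but the host running the spider still needs them installed
        install_requires=[
            "requests",  # placeholder dependency
        ],
    )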
    Shashank Sharma
    @shashank-sharma
    Hey, can anyone help me get scrapy to return data in an organized way?
    Quentin Durantay
    @VonStruddle
    Hey everybody, does somebody know how to get a URL after a redirect? I'm currently scraping an href that redirects to another URL, and I would like to store the latter.
    Charles Green
    @charlesgreen
    @VonStruddle The response.url attribute in the callback should give you what you are looking for. https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response
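    A minimal sketch of that suggestion in a spider callback, with placeholder URL, selector, and field name; when Scrapy's redirect middleware follows the redirect, response.url in the callback is the final URL.

    import scrapy

    class RedirectUrlSpider(scrapy.Spider):
        name = "redirect_url"
        start_urls = ["https://example.com/some-page"]  # placeholder

        def parse(self, response):
            # follow each href that redirects elsewhere (placeholder selector)
            for href in response.css("a.out-link::attr(href)").getall():
                yield response.follow(href, callback=self.parse_target)

        def parse_target(self, response):
            # after the redirect has been followed, response.url is the final URL
            yield {"final_url": response.url}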
    Hi All
    Quentin Durantay
    @VonStruddle
    @charlesgreen Actually it doesn't, but I've found a trick: just use requests to GET the response.url and load it back, as requests follows redirects by default.
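    A minimal sketch of that workaround, assuming the requests library is available in the project; requests follows redirects by default, and the final URL is available on the returned response object.

    import requests

    def resolve_redirect(url):
        """Follow any redirects and return the final URL (requests follows them by default)."""
        resp = requests.get(url, timeout=10)
        return resp.url  # URL of the page the request ended up on

    # inside a Scrapy callback one could then do, for example:
    # final_url = resolve_redirect(response.url)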
    Charles Green
    @charlesgreen
    @VonStruddle good to know. Thanks for sharing.
    Charles Green
    @charlesgreen
    @poulinjoel please check stackoverflow. I posted a reply.
    Jeremy Jordan
    @jeremyjordan_twitter
    Hi, I am running a spider with Scrapy, but after it finishes crawling it can't seem to terminate. The log stats just repeatedly report that it is scraping 0 pages/minute. When I try to quit with Ctrl-C, it fails to shut down gracefully and I have to quit forcefully with Ctrl-C again. Any clue what is happening?
    Charles Green
    @charlesgreen
    @jeremyjordan_twitter did you find the issue? Sorry, late to reply. Sounds like the final step did not return or yield an Item or a request signaling the end of the run.
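    A minimal sketch of the pattern that suggestion points at: each callback yields its items and any follow-up requests, so that once no requests are outstanding the crawl can close on its own. The selectors and field names are placeholders.

    import scrapy

    class FinishingSpider(scrapy.Spider):
        name = "finishing"
        start_urls = ["https://example.com/list"]  # placeholder

        def parse(self, response):
            # yield an item for each row on the page (placeholder selector)
            for row in response.css("div.row"):
                yield {"title": row.css("h2::text").get()}

            # yield the next-page request, if any
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)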
    cupidon192
    @cupidon192
    Hello
    I'm trying to log in to Facebook with scrapy shell.
    Does anyone know how to do it?
    Jay Kim (Data Scientist)
    @bravekjh
    Hi everyone. I joined this room for the first time today; nice to meet you all.
    olivierognn
    @olivierognn
    Hi
    does someone use portia to scrape?
    Karmenzind
    @Karmenzind
    https://github.com/Karmenzind/fp-server
    Hey guys. I wrote a free proxy server based on Tornado and Scrapy. Try it if you need a proxy pool for your scrapy project.
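    A hedged sketch of how a proxy pool like that could be wired into a Scrapy project through a downloader middleware; the pool URL and response field below are hypothetical placeholders (check the fp-server README for its actual API), while request.meta["proxy"] is the standard hook that Scrapy's HttpProxyMiddleware honors.

    import json
    import urllib.request

    class ProxyPoolMiddleware:
        """Downloader middleware that asks an external proxy pool for a proxy per request."""

        # hypothetical endpoint; consult the proxy server's documentation for the real one
        POOL_URL = "http://127.0.0.1:8000/api/proxy"

        def process_request(self, request, spider):
            try:
                with urllib.request.urlopen(self.POOL_URL, timeout=5) as resp:
                    data = json.loads(resp.read().decode("utf-8"))
                proxy = data.get("proxy")  # hypothetical response field
                if proxy:
                    # HttpProxyMiddleware routes the request through this proxy
                    request.meta["proxy"] = proxy
            except Exception:
                spider.logger.warning("Proxy pool unavailable, sending request directly")

    The middleware would then be enabled through the DOWNLOADER_MIDDLEWARES setting in settings.py.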
    skyhawk
    @wsky03
    @Karmenzind
    @Karmenzind thanks for sharing
    Karmenzind
    @Karmenzind
    :smile:
    tiinanguyen
    @tiinanguyen
    Hello! I am using scrapy-cluster to web-scrape a very diverse, unstandardized set of websites. The current setup works for 90-95% of cases, but has issues with some specific sites. When I inspected a few of these problematic websites individually, the Google Chrome console showed some errors (in the web design?) such as “Uncaught ReferenceError: require is not defined”, “Uncaught TypeError: Cannot read property”, “A parser-blocking, cross site (i.e. different eTLD+1) script… is invoked”. I want to improve the accuracy of this method. Is there any middleware or tool I can use to bypass these errors and scrape these websites?
    farman32
    @farman32

    from bs4 import BeautifulSoup as soup  # HTML data structure
    from urllib.request import urlopen as uReq  # Web client

    # URL to web scrape from.
    # In this example we web scrape graphics cards from Newegg.com.
    page_url = "http://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"

    # Opens the connection and downloads the HTML page from the URL.
    uClient = uReq(page_url)

    # Parses the HTML into a soup data structure to traverse it
    # as if it were a JSON data type.
    page_soup = soup(uClient.read(), "html.parser")
    uClient.close()

    # Finds each product on the store page.
    containers = page_soup.findAll("div", {"class": "item-container"})

    # Name of the output file to write to local disk.
    out_filename = "graphics_cards.csv"

    # Header of the CSV file to be written.
    headers = "brand,product_name,shipping\n"

    # Opens the file and writes the headers.
    f = open(out_filename, "w")
    f.write(headers)

    # Loops over each product and grabs attributes about each product.
    for container in containers:

        # Finds all link tags "a" within the container's first div.
        make_rating_sp = container.div.select("a")

        # Grabs the title from the image title attribute,
        # then applies proper casing using .title().
        brand = make_rating_sp[0].img["title"].title()

        # Grabs the text of the third "a" tag within the container's first div.
        product_name = container.div.select("a")[2].text

        # Grabs the product shipping information by searching all "li" tags
        # with the class "price-ship", then strips whitespace
        # and removes "$" and " Shipping" to keep just the number.
        shipping = container.findAll("li", {"class": "price-ship"})[0].text.strip().replace("$", "").replace(" Shipping", "")

        # Prints the dataset to the console.
        print("brand: " + brand + "\n")
        print("product_name: " + product_name + "\n")
        print("shipping: " + shipping + "\n")

        # Writes the dataset to the file.
        f.write(brand + ", " + product_name.replace(",", "|") + ", " + shipping + "\n")

    f.close()  # Close the file

    Can anyone please run this code once and tell me why it raises the error 'NoneType' object is not subscriptable?
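    A hedged note on that error: the subscript most likely fails on a container whose first "a" tag has no <img> child, so .img returns None and None["title"] raises the exception; which element is actually missing depends on the live Newegg markup. A small guard, reusing the names from the snippet above:

    for container in containers:
        links = container.div.select("a")
        # some containers may have a first link without an <img>, so guard before subscripting
        if not links or links[0].img is None:
            continue  # skip products that don't match the expected layout
        brand = links[0].img["title"].title()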
    Edmondo Porcu
    @edmondo1984
    Guys, assuming I want to unit test my scrapers, how do I do it? Can I download the html page and test locally?
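    A minimal sketch of one way to do that, assuming the page has been saved to a local file: build a scrapy.http.HtmlResponse from the saved HTML and feed it to the spider's parse method in an ordinary unit test. The spider import, fixture path, and assertion are placeholders.

    import unittest
    from pathlib import Path

    from scrapy.http import HtmlResponse, Request

    from myproject.spiders.example import ExampleSpider  # placeholder spider


    def fake_response(file_path, url="https://example.com/page"):
        """Build an HtmlResponse from a locally saved HTML file."""
        body = Path(file_path).read_bytes()
        return HtmlResponse(url=url, request=Request(url=url), body=body, encoding="utf-8")


    class ExampleSpiderTest(unittest.TestCase):
        def test_parse_extracts_items(self):
            spider = ExampleSpider()
            response = fake_response("tests/fixtures/page.html")  # placeholder fixture
            items = list(spider.parse(response))
            self.assertTrue(items)  # placeholder assertion


    if __name__ == "__main__":
        unittest.main()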
    Anh Nguyen
    @anh072
    Yeah, I have the same question.
    Normally I use Beautiful Soup; I'm just starting out with scrapy.
    Eaves Cat
    @wendrewshay
    Hello, anyone here?
    I have a question about a Lua script I'm using with scrapy_splash. I can't resolve it; could someone help me, please?
    image.png
    This is my Lua script, and the error reported looks like this >>
    image.png
    _fantasticDev_
    @SnakeGeneral
    @wendrewshay
    Hello sir,
    I have a problem with web scraping.
    Could you help me?
    _fantasticDev_
    @SnakeGeneral
    Hello everyone!!!
    Please help me~~~
    I have a problem with web scraping.
    seuaCoder
    @seuaCoder

    Hello,

    Can someone help me with my scrapy spider? I'm a beginner in Python. I have written a scrapy spider which retrieves 100 URLs from a REST API, scrapes each URL and extracts the data, then posts the items to another REST endpoint through the pipelines.

    The problem is that it is very slow: for 100 URLs the job sometimes takes 1 minute but sometimes 10 minutes to finish. All the URLs are on different domains/websites, so there is no problem with getting banned. Each website receives only one request.

    What could be the possible issue ?

    Thank you.
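    A hedged sketch of the setup described above, with placeholder endpoints and field names; the custom_settings at the top are the Scrapy knobs that most directly govern how many of those 100 single-request domains are fetched in parallel, which is one place to look for run-to-run variance.

    import json

    import requests
    import scrapy


    class HundredUrlsSpider(scrapy.Spider):
        name = "hundred_urls"

        # placeholder values; per-spider overrides of the global Scrapy settings
        custom_settings = {
            "CONCURRENT_REQUESTS": 32,            # total parallel requests
            "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # each site only gets one request anyway
            "DOWNLOAD_TIMEOUT": 30,               # a few very slow sites can dominate the run
        }

        def start_requests(self):
            # placeholder REST endpoint that returns the 100 URLs to scrape
            urls = requests.get("https://api.example.com/urls").json()
            for url in urls:
                yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            # placeholder extraction
            yield {"url": response.url, "title": response.css("title::text").get()}


    class PostItemsPipeline:
        """Pipeline that posts each scraped item to another REST endpoint (placeholder URL)."""

        def process_item(self, item, spider):
            requests.post(
                "https://api.example.com/items",
                data=json.dumps(dict(item)),
                headers={"Content-Type": "application/json"},
                timeout=10,
            )
            return item

    Worth noting: a blocking requests.post inside the pipeline runs on the reactor thread, so it can throttle the whole crawl; that is another candidate for the variance.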

    Thiago Marcello
    @thiagocmarcello
    How do I log in with scrapy?
    Vishesh Mangla
    @XtremeGood
    Unable to log in through scrapy.