Becaree
@misterpilou
To present myself a bit: I'm from France, and I've been learning Python for three years now. Last year I worked on two crawlers: Sparkler (in Java) and Scrapy Cluster. I've read the code of the main components of ruia (amazing work btw, you rock!). I think I need to work a bit by writing some spiders and testing things by myself before contributing back to ruia, mostly because I'm still struggling with asyncio. I don't know if you've already used or heard of Scrapy Cluster, but I think its architecture could interest you.
howie.hu
@howie6879
@misterpilou Hi, so glad you could join Ruia. I'm from China; I've been using Python for three years, and I work for a gaming company as a machine learning engineer.
About Ruia: I've been working with asyncio, so I came up with the idea of writing an async web scraping framework.
This project started six months ago and Ruia is still under development, so feel free to open issues and pull requests.
Becaree
@misterpilou
@howie6879 So I wrote my first spider without difficulty; I just hit a bug early in the crawl because the website didn't like my concurrency. But after that everything worked perfectly. I was able to scrape 6,000 food recipes in 3 hours (I reduced the concurrency a lot). All thanks to you, sir =)
howie.hu
@howie6879
@misterpilou That is good news for Ruia.
As I said earlier, you need to consider the limitations of the target site,
like IP and User-Agent limitations, etc.
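For reference, here is a minimal sketch of a throttled spider. It assumes Ruia's Spider exposes a concurrency class attribute and a request_config dict, following the patterns in Ruia's docs; treat the specific values as placeholders:

    from ruia import Spider

    class PoliteSpider(Spider):
        start_urls = ['http://httpbin.org/get']
        # Assumed attribute: cap the number of concurrent workers,
        # as done above when the target site rejected the crawl
        concurrency = 2
        # Assumed config keys: retry failures and pause between requests
        request_config = {'RETRIES': 3, 'DELAY': 1, 'TIMEOUT': 20}

        async def parse(self, response):
            print(response.url)

    if __name__ == '__main__':
        PoliteSpider.start()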
Becaree
@misterpilou
@howie6879 Would making Request awaitable make sense? I have a spider with two parse functions: the first one is for iterating through the website, and the second is for downloading some files. The second is the purpose of my crawl, and yielding it makes the spider iterate a lot before downloading the files. I don't know if it makes sense; maybe I'm not doing things correctly.
howie.hu
@howie6879
@misterpilou
I don't quite understand what you mean; can you post detailed code?
Request.fetch() is awaitable.
Perhaps this function is what you need?
    async def parse(self, response):
        # Fetch all pages concurrently and hand each response to parse_item
        pages = [f'http://httpbin.org/get?p={i}' for i in range(10)]
        async for resp in self.multiple_request(pages):
            yield self.parse_item(response=resp)
Becaree
@misterpilou
I don't have the code anymore; I will try to reproduce it in another spider. But as you said, if fetch is awaitable, that will do the trick.
As a quick explanation, I've got a yield Request(url=url, callback=self.parse_page) and was hoping that awaiting it would call the parse_page function.
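In other words, something like this minimal sketch, assuming Request.fetch() returns a Response (the chat above confirms it is awaitable) and that the callback can simply be awaited by hand; parse_page and the URL are placeholders:

    from ruia import Request

    async def parse(self, response):
        # Instead of yielding the Request and letting the scheduler
        # run the callback later, fetch it inline...
        request = Request(url='http://httpbin.org/get', callback=self.parse_page)
        page_response = await request.fetch()
        # ...and invoke the callback directly with the fetched response
        await self.parse_page(page_response)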
Becaree
@misterpilou
@howie6879 I don't get how get_items works; every time I want to use it I get <Callback[parse_item]: target_item is expected
howie.hu
@howie6879
@misterpilou
Maybe you should define target_item in your Item subclass.
Hope this tutorial can help you: https://python-ruia.org/en/quickstart.html
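For example, following the quickstart's pattern (the selectors and field names here are hypothetical):

    from ruia import AttrField, Item, TextField

    class RecipeItem(Item):
        # target_item tells get_items which repeated element to iterate
        # over; leaving it out triggers the error quoted above
        target_item = TextField(css_select='div.recipe')
        title = TextField(css_select='a.title')
        url = AttrField(css_select='a.title', attr='href')

With target_item defined, get_items yields one RecipeItem per matched element, whereas get_item extracts a single item.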
Becaree
@misterpilou
Okay, I didn't notice that. Thank you!