Becaree
@misterpilou
To introduce myself a bit: I'm from France, and I've been learning Python for three years now. Last year I worked on two crawlers: Sparkler (in Java) and Scrapy Cluster. I've read the code of the main components of Ruia (amazing work btw, you rock!). I think I need to work a bit, writing some spiders and testing things by myself, before contributing back to Ruia, mostly because I'm still struggling with asyncio. I don't know if you've already used or heard of Scrapy Cluster; I think its architecture could interest you.
howie.hu
@howie6879
@misterpilou Hi, so glad you could join Ruia. I'm from China and I've been using Python for three years; I work for a gaming company as a machine learning engineer.
About Ruia: I've been working with asyncio, so I came up with the idea of writing an async web scraping framework.
The project started six months ago and Ruia is still under development, so feel free to open issues and pull requests.
Becaree
@misterpilou
@howie6879 So I wrote my first spider without difficulty; I just hit a bug early in the crawl because the website didn't like my concurrency. But after that everything worked perfectly. I was able to scrape 6000 food recipes in 3 hours (I reduced the concurrency a lot). All thanks to you, Sir =)
howie.hu
@howie6879
@misterpilou That is good news for Ruia.
As I said earlier, you need to consider the limitations of the target site,
such as IP and User-Agent restrictions, etc.
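A minimal sketch of throttling a spider to respect such limits, using the Spider attributes documented in Ruia's quickstart (the spider name and the values here are examples; check your Ruia version for the exact attribute names):

    from ruia import Spider

    class PoliteSpider(Spider):
        start_urls = ['http://httpbin.org/get']
        concurrency = 3                          # fewer simultaneous requests
        headers = {'User-Agent': 'Mozilla/5.0'}  # send a friendlier UA
        # per-request settings: retries, delay between requests, timeout
        request_config = {'RETRIES': 3, 'DELAY': 2, 'TIMEOUT': 20}

        async def parse(self, response):
            print(response.url)

    if __name__ == '__main__':
        PoliteSpider.start()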
Becaree
@misterpilou
@howie6879 Would making Request awaitable make sense? I have a spider with two parse functions: the first one is for iterating through the website, and the second is for downloading some files. The second is the purpose of my crawl, but yielding it makes the spider iterate a lot before downloading the files. I don't know if that makes sense; maybe I'm not doing things correctly.
howie.hu
@howie6879
@misterpilou
I don't quite understand what you mean; can you post detailed code?
Request.fetch() is awaitable.
Perhaps this function is what you need:
    async def parse(self, response):
        # build the page urls (use the loop variable, not a constant)
        pages = [f'http://httpbin.org/get?p={i}' for i in range(10)]
        # fetch them concurrently, handling each response as it arrives
        async for resp in self.multiple_request(pages):
            yield self.parse_item(response=resp)
Becaree
@misterpilou
I don't have the code anymore; I will try to reproduce it in another spider. But as you said, if fetch is awaitable, that would do the trick.
As a quick explanation, I've got a yield Request(url=url, callback=self.parse_page) and was hoping that awaiting it would call the parse_page function.
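To make the difference concrete, here is a minimal sketch of the two patterns inside a Spider subclass (FileSpider, parse_page, and the URLs are hypothetical names used only for illustration):

    from ruia import Request, Spider

    class FileSpider(Spider):
        start_urls = ['http://httpbin.org/get']

        async def parse(self, response):
            url = 'http://httpbin.org/get?download=1'
            # Pattern 1: yield the Request; the scheduler invokes
            # parse_page later, after parse has continued iterating.
            yield Request(url=url, callback=self.parse_page)

            # Pattern 2: await the fetch and call the callback inline,
            # so the download finishes before iteration resumes.
            resp = await Request(url=url).fetch()
            await self.parse_page(resp)

        async def parse_page(self, response):
            print(response.url)

    if __name__ == '__main__':
        FileSpider.start()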
Becaree
@misterpilou
@howie6879 I don't get how get_items works; every time I try to use it I get <Callback[parse_item]: target_item is expected
howie.hu
@howie6879
@misterpilou
Maybe you should define target_item in your Item subclass.
Hope this tutorial can help you: https://python-ruia.org/en/quickstart.html
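For reference, a minimal sketch of an Item subclass that satisfies get_items (the selectors, field names, and URL are made-up examples for an assumed page structure):

    import asyncio

    from ruia import AttrField, Item, TextField

    class RecipeItem(Item):
        # get_items requires target_item: the selector matching each
        # repeated block on the page; the other fields are extracted
        # relative to it.
        target_item = TextField(css_select='div.recipe')
        title = TextField(css_select='h2')
        link = AttrField(css_select='a', attr='href')

    async def main():
        async for item in RecipeItem.get_items(url='http://example.com/recipes'):
            print(item.title, item.link)

    if __name__ == '__main__':
        asyncio.get_event_loop().run_until_complete(main())

Without target_item defined on the subclass, get_items fails with the "target_item is expected" error quoted above.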
Becaree
@misterpilou
Okay, I didn't notice that. Thank you!