Becaree
@misterpilou
To introduce myself a bit: I'm from France, and I've been learning Python for 3 years now. I worked on two crawlers last year: Sparkler (in Java) and Scrapy Cluster. I've read the code of the main components of Ruia (amazing work btw, you rock!). I think I need to spend some time writing spiders and testing things myself before contributing back to Ruia, mostly because I'm still struggling with asyncio. I don't know if you have already used or heard of Scrapy Cluster, but its architecture could interest you, I think.
howie.hu
@howie6879
@misterpilou Hi, so glad you could join Ruia. I'm from China, I've been using Python for three years, I work for a gaming company, and I am a machine learning engineer.
About Ruia: I've been working with asyncio, so I came up with the idea of writing an async web scraping framework.
The project started six months ago and Ruia is still under development, so feel free to open issues and pull requests.
Becaree
@misterpilou
@howie6879 So I wrote my first spider without difficulty; I just hit a bug early in the crawl because the website didn't like my concurrency. But after that everything worked perfectly. I was able to scrape 6000 food recipes in 3 hours (I reduced the concurrency a lot). All thanks to you, Sir =)
howie.hu
@howie6879
@misterpilou That is good news for Ruia.
As I said earlier, you need to consider the limitations of the target site,
such as IP and UA limitations, etc.
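For reference, a minimal sketch of throttling a Ruia spider along those lines (the concurrency attribute and request_config keys follow the quickstart linked later in this chat; the spider name and URL are placeholders):

from ruia import Spider


class PoliteSpider(Spider):
    # placeholder start URL
    start_urls = ['http://httpbin.org/get']
    # keep the number of in-flight requests low so the site is not hammered
    concurrency = 3
    # retry failures, pause between requests, and give up on slow responses
    request_config = {'RETRIES': 3, 'DELAY': 1, 'TIMEOUT': 20}

    async def parse(self, response):
        # handle the fetched page here
        print(response.url)


if __name__ == '__main__':
    PoliteSpider.start()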
Becaree
@misterpilou
@howie6879 Would making Request awaitable make sense? I have a spider with two parse functions: the first one is for iterating through the website and the second is for downloading some files. The second is the purpose of my crawl, but yielding it means the spider iterates a lot before the files get downloaded. I don't know if that makes sense, maybe I'm not doing things correctly.
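Roughly the setup being described, reconstructed as a hedged sketch (spider name and URLs are placeholders; yielding Request(url=..., callback=...) is the pattern mentioned further down in this chat):

from ruia import Request, Spider


class FileSpider(Spider):
    # placeholder listing page
    start_urls = ['http://httpbin.org/links/10/0']

    async def parse(self, response):
        # first parse function: walk the site and queue the pages that hold the files
        file_url = 'http://httpbin.org/get?file=1'  # placeholder URL discovered on the page
        yield Request(url=file_url, callback=self.parse_file)

    async def parse_file(self, response):
        # second parse function: the actual point of the crawl, downloading the file
        print(response.url)


if __name__ == '__main__':
    FileSpider.start()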
howie.hu
@howie6879
@misterpilou
I don't quite understand what you mean, can you post detailed code?
Request.fetch() is awaitable
Perhaps this function is what you need?
    async def parse(self, response):
        # request each page and hand every response to parse_item
        pages = [f'http://httpbin.org/get?p={i}' for i in range(10)]
        async for resp in self.multiple_request(pages):
            yield self.parse_item(response=resp)
Becaree
@misterpilou
I don't have the code anymore, I will try to reproduce it in another spider. But as you said, if fetch is awaitable it will do the trick.
As a quick explanation, I've got a: yield Request(url=url, callback=self.parse_page) and was hoping that awaiting it would call the parse_page function.
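A hedged sketch of the awaitable route mentioned above, i.e. calling Request.fetch() directly and handing the response to the second parse function (class name and URLs are placeholders; only Request.fetch() being awaitable is taken from the conversation):

from ruia import Request, Spider


class RecipeSpider(Spider):
    # placeholder listing page
    start_urls = ['http://httpbin.org/get']

    async def parse(self, response):
        detail_url = 'http://httpbin.org/get?page=detail'  # placeholder URL found on the page
        # Request.fetch() is awaitable, so the detail page is available right here
        detail_response = await Request(url=detail_url).fetch()
        # hand the fetched page straight to the second parse function
        yield self.parse_page(detail_response)

    async def parse_page(self, response):
        # the actual download/extraction would go here
        print(response.url)


if __name__ == '__main__':
    RecipeSpider.start()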
Becaree
@misterpilou
@howie6879 I don't get how get_items works, every time I want to use it I get <Callback[parse_item]: target_item is expected>
howie.hu
@howie6879
@misterpilou
Maybe you should define target_item in your Item subclass.
Hope this tutorial can help you: https://python-ruia.org/en/quickstart.html
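For anyone hitting the same error, a minimal sketch of an Item with target_item defined, modelled on the quickstart linked above (selectors, field names, and the URL are placeholders):

import asyncio

from ruia import AttrField, Item, TextField


class RecipeItem(Item):
    # target_item marks the repeated node; get_items yields one RecipeItem per match
    target_item = TextField(css_select='div.recipe')  # placeholder selector
    title = TextField(css_select='h2.title')          # placeholder selector
    link = AttrField(css_select='a', attr='href')     # placeholder selector


async def show_items():
    # without target_item defined, get_items complains that target_item is expected
    async for item in RecipeItem.get_items(url='http://httpbin.org/html'):
        print(item.title, item.link)


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(show_items())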
Becaree
@misterpilou
Okay didn't notice that, thank you