Becaree
@misterpilou
To introduce myself a bit: I'm from France, and I've been learning Python for three years now. Last year I worked on two crawlers: Sparkler (in Java) and Scrapy Cluster. I've read the code of the main components of Ruia (amazing work, by the way, you rock!). I think I need to spend some time writing spiders and testing things myself before contributing back to Ruia, mostly because I'm still struggling with asyncio. I don't know if you've already used or heard of Scrapy Cluster, but I think its architecture could interest you.
howie.hu
@howie6879
@misterpilou Hi, so glad you could join Ruia. I'm from China, I've been using Python for three years, and I work as a machine learning engineer at a gaming company.
About Ruia: I've been working with asyncio, which is how I came up with the idea of writing an async web scraping framework.
The project started six months ago and Ruia is still under development, so feel free to open issues and pull requests.
Becaree
@misterpilou
@howie6879 So I wrote my first spider without difficulty. I hit a bug early in the crawl because the website didn't like my concurrency, but after that everything worked perfectly. I was able to scrape 6,000 food recipes in 3 hours (I reduced the concurrency a lot). All thanks to you, sir =)
howie.hu
@howie6879
@misterpilou That is good news for Ruia.
As I said earlier, you need to consider the limitations of the target site,
such as IP and User-Agent restrictions.
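For example, a spider along these lines keeps a crawl polite. This is only a rough sketch: the site is a placeholder, and the concurrency, request_config, and headers settings follow the Ruia docs rather than your actual spider:
    from ruia import Spider

    class RecipeSpider(Spider):
        # Placeholder site, not the one from this chat.
        start_urls = ['http://httpbin.org/get']
        # Assumed knobs from the Ruia docs: cap concurrent requests,
        # retry with a delay, and send a browser-like User-Agent so the
        # target site is less likely to throttle or block the crawl.
        concurrency = 3
        request_config = {'RETRIES': 3, 'DELAY': 2, 'TIMEOUT': 10}
        headers = {'User-Agent': 'Mozilla/5.0 (compatible; RecipeSpider)'}

        async def parse(self, response):
            # Extraction elided; the point here is the politeness settings.
            pass

    if __name__ == '__main__':
        RecipeSpider.start()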
Becaree
@misterpilou
@howie6879 Would making Requests awaitable make sense? I have a spider with two parse functions: the first is for iterating through the website, and the second is for downloading some files. The second one is the purpose of my crawl, but yielding it means the spider iterates a lot before downloading the files. I don't know if that makes sense; maybe I'm not doing things correctly.
howie.hu
@howie6879
@misterpilou
I don't quite understand what you mean; can you post detailed code?
Request.fetch() is awaitable.
Perhaps this function is what you need:
    async def parse(self, response):
        pages = [f'http://httpbin.org/get?p={i}' for i in range(10)]
        async for resp in self.multiple_request(pages):
            yield self.parse_item(response=resp)
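multiple_request takes the list of page URLs and yields a Response for each one, so parse_item is called once per page.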
Becaree
@misterpilou
I don't have the code anymore; I'll try to reproduce it in another spider. But as you said, if fetch is awaitable, that should do the trick.
As a quick explanation: I had a yield Request(url=url, callback=self.parse_page) and was hoping that awaiting it would call the parse_page function.
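Something like this is what I had in mind, just a rough sketch (FileSpider, the URLs, and parse_page's body are placeholders, and I haven't tested it):
    from ruia import Request, Spider

    class FileSpider(Spider):
        start_urls = ['http://httpbin.org/get']

        async def parse(self, response):
            # Fetch inline instead of yielding the Request back to the
            # scheduler, then pass the Response straight to the callback.
            request = Request(url='http://httpbin.org/get?file=1')
            file_response = await request.fetch()
            await self.parse_page(file_response)

        async def parse_page(self, response):
            # Placeholder: save the downloaded file here.
            ...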
Becaree
@misterpilou
@howie6879 I don't get how get_items works; every time I want to use it I get: <Callback[parse_item]: target_item is expected
howie.hu
@howie6879
@misterpilou
Maybe you should define target_item in your Item subclass.
Hope this tutorial can help you: https://python-ruia.org/en/quickstart.html
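Roughly like this; a minimal sketch following the quickstart pattern, where RecipeItem and the CSS selectors are made-up examples rather than your real pages:
    from ruia import AttrField, TextField, Item

    class RecipeItem(Item):
        # target_item marks the repeated node each item is built from;
        # without it, get_items raises the error you quoted.
        target_item = TextField(css_select='div.recipe')
        title = TextField(css_select='a.title')
        url = AttrField(css_select='a.title', attr='href')

    # Usage (hypothetical URL):
    #     async for item in RecipeItem.get_items(url='http://example.com/recipes'):
    #         print(item.title)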
Becaree
@misterpilou
Okay, I didn't notice that. Thank you!
SamAyala
@samayala22
@howie6879 Hello, do you still plan on maintaining Ruia?