    Federico Tolomei
    @s17t
    I am asking for sonatype access https://issues.sonatype.org/browse/OSSRH-1800
    rz
    @rzo1
    nice! that would be awesome. I know that it is possible...
    "Permissions granted." => looks nice
    Federico Tolomei
    @s17t
    I am going to untrack the .idea folder and IntelliJ project files.
    Federico Tolomei
    @s17t
    rz
    @rzo1
    ok nice!
    lumyus
    @lumyus
    hey guys. how can one make the crawler wait for a page with dynamic content to load, and then wait until it is visible?
    thanks for your help ;)
    rz
    @rzo1
    Dynamic Page Loading is done via JavaScript?
    Federico Tolomei
    @s17t
    thanks @rzo1, I will give your repository a review
    rz
    @rzo1
    @s17t ok, feel free ... it's quite heavy and (yet) incomplete, because it was some kind of proof-of-concept implementation showing that the db implementation can be extracted
    (I did not add a lot of documentation, which should be done in advance...)
    linbren
    @linbren
    @lumyus config.setMaxDepthOfCrawling(1);
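For context, a minimal sketch of where that setting fits in a standard crawler4j 4.x setup (storage folder and seed URL are placeholders; `MyCrawler` stands for any `WebCrawler` subclass):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class DepthOneCrawl {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl"); // placeholder path
        config.setMaxDepthOfCrawling(1);            // follow links only one level past the seeds

        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);

        controller.addSeed("https://example.com/"); // placeholder seed
        controller.start(MyCrawler.class, 1);       // MyCrawler extends WebCrawler
    }
}
```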
    zeyudu
    @zeyudu
    hi, do you know how to switch the proxy during the process?
    Federico Tolomei
    @s17t
    @zeyudu, indeed it is not a feature. I would suggest trying to switch the http.proxyHost JVM property value, but crawler4j has its own properties, which is bad.
    would you kindly file a feature request on GitHub?
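For what it is worth, CrawlConfig does accept a single proxy for the whole crawl (just not switching mid-crawl, which is what was asked); a sketch with placeholder host and credentials:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;

CrawlConfig config = new CrawlConfig();
// One proxy for the entire crawl session; there is no setter to change it mid-crawl.
config.setProxyHost("proxy.example.com"); // placeholder host
config.setProxyPort(8080);
// Only if the proxy requires authentication:
config.setProxyUsername("user");          // placeholder credentials
config.setProxyPassword("pass");
```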
    zeyudu
    @zeyudu
    alright~
    yunlong17568
    @YunlongYang
    <img src="/static/images/abc.jpg"/> does not work
    if src starts with '/', crawler4j does not handle it correctly
    Federico Tolomei
    @s17t
    please file a bug specifying the URL you are crawling
    yunlong17568
    @YunlongYang
    (attached image: image.png)
    crawler4j does not treat the img as a resource
    I have opened an issue on GitHub: "edu.uci.ics.crawler4j.parser.Parser parse WebURL not completely. #254"
    Federico Tolomei
    @s17t
    thank you
    yunlong17568
    @YunlongYang
    you are welcome
    I think when extracting URLs from HTML content, we should take the context into account; it is HTML, not plain text
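The expected behaviour can be checked with the JDK alone: a `src` starting with `/` is root-relative and resolves against the host, not the current directory. A small illustration (URLs are made up):

```java
import java.net.URL;

public class ResolveDemo {
    public static void main(String[] args) throws Exception {
        URL base = new URL("http://example.com/some/page.html");
        // Root-relative reference: keeps the scheme and host, replaces the path.
        URL img = new URL(base, "/static/images/abc.jpg");
        System.out.println(img); // http://example.com/static/images/abc.jpg
    }
}
```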
    Valery Yatsynovich
    @valfirst
    hi everyone,
    does crawler4j support preemptive basic authentication?
    Julius Ahenkora
    @jahenkor
    Is form authentication still functional for crawler4j?
    Federico Tolomei
    @s17t
    hello Julius
    it should be, but I'm not a user of it (and it lacks tests)
    did you encounter problems?
    Julius Ahenkora
    @jahenkor
    Yes, the crawler outputs a "successful login" message, for correct/incorrect login credentials. And doesn't seem to (from what I can see) save any cookies from the session. So the crawler just crawls the login page.
    Federico Tolomei
    @s17t
    Ok, I will look into it
    Julius Ahenkora
    @jahenkor
    Appreciate it
    Riccardo Tasso
    @raymanrt
    Hi, is it possible to customize or override the edu.uci.ics.crawler4j.parser.Parser class without overriding the WebCrawler.processPage method?
    probably in my case overriding the HtmlContentHandler class would be enough
    Federico Tolomei
    @s17t
    Hi @raymanrt, would you give more detail on your use case? The Parser in WebCrawler is the internal parser. Users should use their own parser (e.g. jsoup, or tika with a custom config) in their WebCrawler
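A sketch of what @s17t describes: leave crawler4j's internal Parser alone and re-parse the fetched HTML inside `visit()` with your own library (this assumes jsoup on the classpath; the img selection is just an example use):

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupCrawler extends WebCrawler {
    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            // Re-parse the raw HTML with jsoup instead of relying on
            // the output of crawler4j's internal Parser.
            Document doc = Jsoup.parse(html.getHtml(), page.getWebURL().getURL());
            doc.select("img[src]").forEach(img ->
                    System.out.println(img.absUrl("src")));
        }
    }
}
```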
    Riccardo Tasso
    @raymanrt
    Hi @s17t I don't understand how to use my own parser
    Parser is a private field in WebCrawler, and there is no setter
    the only method which uses the parser is processPage, but it's private
    Federico Tolomei
    @s17t
    @jahenkor See #291; when using FormAuthInfo, be sure to set a CookieStore and a CookiePolicy on the CrawlConfig. See FormAuthInfoTest
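A sketch of that setup based on @s17t's description (credentials, login URL, and form field names are placeholders; the exact setter names assume crawler4j 4.4 and Apache HttpClient's cookie spec constants):

```java
import org.apache.http.client.config.CookieSpecs;
import org.apache.http.impl.client.BasicCookieStore;
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo;

CrawlConfig config = new CrawlConfig();
// Persist session cookies across requests so the login survives the crawl.
config.setCookieStore(new BasicCookieStore());
config.setCookiePolicy(CookieSpecs.STANDARD);
config.addAuthInfo(new FormAuthInfo(
        "user", "secret",               // placeholder credentials
        "https://example.com/login",    // placeholder login page URL
        "username", "password"));       // name attributes of the login form fields
```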
    Julius Ahenkora
    @jahenkor
    @s17t thanks, I'll definitely try it out!
    Federico Tolomei
    @s17t
    I am trying to release 4.4.0. Sonatype's requirements for POMs are tight: <developer> is now required. If any committer wants to be included now, please send me a <developer> tag, or send a PR after the 4.4.0 release, in a couple of days :|
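For reference, the Maven element in question goes inside the POM's `<developers>` section and looks like this (all values are placeholders):

```xml
<developers>
  <developer>
    <id>github-handle</id>
    <name>Full Name</name>
    <email>you@example.org</email>
  </developer>
</developers>
```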
    rz
    @rzo1
    ok
    Federico Tolomei
    @s17t
    the 4.4.0 is on the repo
    Federico Tolomei
    @s17t
    about #310, the snapshots are on Sonatype's snapshot repo too now. A Travis sub-task uploads them after a successful build. I would keep using the CloudBees repo and leave Sonatype as a failover, for now.
    Julius Ahenkora
    @jahenkor
    Hey guys!
    I'm attempting to crawl past a login page with form-based authentication, but the crawler isn't going past the login page. I see that it grabs a cookie from a successful POST request, but is it persisting through the crawl session?