    linbren
    @linbren
    @lumyus config.setMaxDepthOfCrawling(1);
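    For context, that setter belongs to the standard CrawlConfig setup. A minimal sketch of a crawl limited to depth 1, assuming a standard crawler4j 4.x setup; the storage folder, seed URL, and crawler class name are placeholders:

        import edu.uci.ics.crawler4j.crawler.CrawlConfig;
        import edu.uci.ics.crawler4j.crawler.CrawlController;
        import edu.uci.ics.crawler4j.fetcher.PageFetcher;
        import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
        import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

        public class DepthLimitedCrawl {
            public static void main(String[] args) throws Exception {
                CrawlConfig config = new CrawlConfig();
                config.setCrawlStorageFolder("/tmp/crawler4j"); // placeholder path
                config.setMaxDepthOfCrawling(1);                // only the seed page and its direct links

                PageFetcher pageFetcher = new PageFetcher(config);
                RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
                CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

                controller.addSeed("https://example.com/"); // placeholder seed
                controller.start(MyCrawler.class, 1);       // MyCrawler extends WebCrawler (not shown)
            }
        }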
    zeyudu
    @zeyudu
    hi, do you know how to switch proxy during the crawl process?
    Federico Tolomei
    @s17t
    @zeyudu, indeed it is not a feature. I would suggest trying to switch the http.proxyHost JVM property value, but crawler4j has its own proxy properties, which is bad.
    would you kindly file a feature request on GitHub?
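    For reference, crawler4j's own proxy settings are fixed at configuration time (switching mid-crawl is the missing feature discussed above). A sketch of the two options mentioned, assuming a standard 4.x CrawlConfig; host and port are placeholders:

        import edu.uci.ics.crawler4j.crawler.CrawlConfig;

        public class ProxyConfigSketch {
            static CrawlConfig proxiedConfig() {
                // Option 1: crawler4j's own proxy properties, set once before the crawl starts.
                CrawlConfig config = new CrawlConfig();
                config.setProxyHost("proxy.example.com"); // placeholder host
                config.setProxyPort(8080);                // placeholder port

                // Option 2 (the JVM properties suggested above): crawler4j builds its own
                // HttpClient from CrawlConfig, so these may not take effect for fetching.
                System.setProperty("http.proxyHost", "proxy.example.com");
                System.setProperty("http.proxyPort", "8080");
                return config;
            }
        }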
    zeyudu
    @zeyudu
    alright~
    yunlong17568
    @YunlongYang
    <img src="/static/images/abc.jpg"/> does not work
    if src starts with '/', crawler4j does not handle it well
    Federico Tolomei
    @s17t
    please file a bug specifying the URL you are crawling
    yunlong17568
    @YunlongYang
    crawler4j does not treat the img as a resource
    I have opened an issue on GitHub: "edu.uci.ics.crawler4j.parser.Parser parse WebURL not completely. #254"
    Federico Tolomei
    @s17t
    thank you
    yunlong17568
    @YunlongYang
    you are welcome
    I think that when finding URLs in HTML content, we should take the surrounding context into account, treating it as HTML rather than as plain text
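    To illustrate the behaviour being asked for: a root-relative src such as /static/images/abc.jpg should be resolved against the page URL before being treated as a crawlable resource. A minimal sketch using java.net.URI (not crawler4j's actual parser code); the page URL is a placeholder:

        import java.net.URI;

        public class RelativeUrlSketch {
            public static void main(String[] args) {
                URI pageUrl = URI.create("https://example.com/articles/post.html"); // placeholder page
                String imgSrc = "/static/images/abc.jpg";                           // root-relative src from the report

                // Resolving against the page URL yields the absolute resource URL the parser should emit.
                URI resolved = pageUrl.resolve(imgSrc);
                System.out.println(resolved); // https://example.com/static/images/abc.jpg
            }
        }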
    Valery Yatsynovich
    @valfirst
    hi everyone,
    does crawler4j support preemptive basic authentication?
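    For reference, crawler4j's standard (non-preemptive) basic auth setup looks roughly like the sketch below; whether a preemptive variant is supported is exactly the open question here. URL and credentials are placeholders.

        import edu.uci.ics.crawler4j.crawler.CrawlConfig;
        import edu.uci.ics.crawler4j.crawler.authentication.BasicAuthInfo;

        public class BasicAuthSketch {
            static CrawlConfig basicAuthConfig() throws Exception {
                CrawlConfig config = new CrawlConfig();
                // Credentials apply to pages under the given URL; as far as the docs show,
                // this is challenge-response based rather than preemptive.
                config.addAuthInfo(new BasicAuthInfo("user", "secret", "https://example.com/protected/"));
                return config;
            }
        }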
    Julius Ahenkora
    @jahenkor
    Is form authentication still functional for crawler4j?
    Federico Tolomei
    @s17t
    hello Julius
    it should be, but I'm not a user of it (and it lacks tests)
    did you encounter problems ?
    Julius Ahenkora
    @jahenkor
    Yes, the crawler outputs a "successful login" message for both correct and incorrect login credentials, and doesn't seem to (from what I can see) save any cookies from the session. So the crawler just crawls the login page.
    Federico Tolomei
    @s17t
    Ok, I will look into it
    Julius Ahenkora
    @jahenkor
    Appreciate it
    Riccardo Tasso
    @raymanrt
    Hi, is it possible to customize or override the edu.uci.ics.crawler4j.parser.Parser class without overriding the WebCrawler.processPage method?
    probably in my case it would be enough to override the HtmlContentHandler class
    Federico Tolomei
    @s17t
    Hi @raymanrt, would you detail your use case a bit more? The Parser in WebCrawler is the internal parser. Users should use their own parser (e.g. jsoup, or Tika with a custom config) in their WebCrawler
    Riccardo Tasso
    @raymanrt
    Hi @s17t I don't understand how to use my own parser
    Parser is a private field in WebCrawler, and there is no setter
    the only method which uses the parser is processPage, but it's private
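    As an illustration of what "use your own parser in your WebCrawler" can mean in practice (rather than touching the private Parser field), a sketch that runs jsoup over the fetched HTML inside visit(); the crawler class name is hypothetical:

        import edu.uci.ics.crawler4j.crawler.Page;
        import edu.uci.ics.crawler4j.crawler.WebCrawler;
        import edu.uci.ics.crawler4j.parser.HtmlParseData;
        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;

        public class JsoupCrawler extends WebCrawler {
            @Override
            public void visit(Page page) {
                // crawler4j's internal parser still extracts links and fills HtmlParseData;
                // the custom parsing happens on top of the raw HTML it hands back.
                if (page.getParseData() instanceof HtmlParseData) {
                    HtmlParseData parseData = (HtmlParseData) page.getParseData();
                    Document doc = Jsoup.parse(parseData.getHtml(), page.getWebURL().getURL());
                    logger.info("Title of {}: {}", page.getWebURL().getURL(), doc.title());
                }
            }
        }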
    Federico Tolomei
    @s17t
    @jahenkor See #291; when using FormAuthInfo, be sure to set a CookieStore and a CookiePolicy on CrawlConfig, see FormAuthInfoTest
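    A hedged sketch of that advice, assuming the 4.4.x CrawlConfig exposes the cookie-related setters referenced in FormAuthInfoTest (setCookieStore / setCookiePolicy); the login URL, form field names, and credentials are placeholders:

        import edu.uci.ics.crawler4j.crawler.CrawlConfig;
        import edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo;
        import org.apache.http.client.config.CookieSpecs;
        import org.apache.http.impl.client.BasicCookieStore;

        public class FormAuthSketch {
            static CrawlConfig formAuthConfig() throws Exception {
                CrawlConfig config = new CrawlConfig();

                // Without an explicit CookieStore the session cookie from the login POST
                // is not kept, which matches the "only crawls the login page" symptom above.
                config.setCookieStore(new BasicCookieStore());
                config.setCookiePolicy(CookieSpecs.STANDARD);

                // username, password, login URL, and the form field names for user/password.
                config.addAuthInfo(new FormAuthInfo("user", "secret",
                        "https://example.com/login", "username", "password"));
                return config;
            }
        }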
    Julius Ahenkora
    @jahenkor
    @s17t thanks, I'll definitely try it out!
    Federico Tolomei
    @s17t
    I am trying to release 4.4.0. Sonatype requirements about POMs are tight: <developer> is now required. If some committer wants to be included now, please send me a <developer> tag, or send a PR after the 4.4.0 release, in a couple of days :|
    rz
    @rzo1
    ok
    Federico Tolomei
    @s17t
    4.4.0 is on the repo
    Federico Tolomei
    @s17t
    About #310, the snapshots are on Sonatype's snapshot repo too now. A Travis sub-task uploads them after a successful build. I would keep using the CloudBees repo and leave Sonatype as failover, for now.
    Julius Ahenkora
    @jahenkor
    Hey guys!
    I'm attempting to crawl past a login page with form-based authentication, but the crawler isn't going past the login page. I see that it grabs a cookie from a successful POST request, but is it persisting through the crawl session?
    Paul Galbraith
    @pgalbraith
    Does Yasser ever weigh in much anymore? Just curious how active he is.
    Paul Galbraith
    @pgalbraith
    @jahenkor did you create an issue for the login/cookie problem?
    Harshit Agarwal
    @harshitagarwal2
    hey, I am trying to get the most-searched keywords in a search engine; would I be able to do that with crawler4j?
    Federico Tolomei
    @s17t
    @pgalbraith thank you for your contributions. The Guava library is in the dependencies again. I would try to avoid including Guava 'just' for InternetDomainName. Is there any alternative implementation? Maybe from Apache?
    Paul Galbraith
    @pgalbraith
    @s17t Hi, I looked into this one quite a bit and wasn't able to find any satisfying alternatives other than the two I found (i.e. Guava for static lookup, and the https://github.com/whois-server-list/public-suffix-list lib for external download/compare) ... ultimately, though, I still think this is beyond what crawler4j should be doing ... just provide the URL and let the consumer decide if they need to do more work to determine public/private host.
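    For context, the Guava lookup in question boils down to something like the sketch below, which is what makes a pure-JDK or Apache replacement hard to find; the host is a placeholder:

        import com.google.common.net.InternetDomainName;

        public class PublicSuffixSketch {
            public static void main(String[] args) {
                InternetDomainName host = InternetDomainName.from("news.example.co.uk"); // placeholder host

                // Guava ships the public suffix list, so it can tell the registrable domain
                // ("example.co.uk") apart from the public suffix ("co.uk").
                System.out.println(host.isUnderPublicSuffix()); // true
                System.out.println(host.topPrivateDomain());    // example.co.uk
            }
        }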
    Federico Tolomei
    @s17t
    I can't merge in yasserg/crawler4j anymore. Does anybody know how to contact @yasserg? Thanks
    rz
    @rzo1
    Maybe via an issue on GitHub? It seems that he revoked your contributor rights...
    Federico Tolomei
    @s17t
    I opened it two days ago: yasserg/crawler4j#384
    still no luck via his LinkedIn
    Federico Tolomei
    @s17t
    I have created an organization, https://github.com/crawler4j/crawler4j, with an import of the main repo. I have added @yasserg as admin (we will see if he responds). @pgalbraith, @rzo1 and @Chaiavi have write permission in the repo.
    rz
    @rzo1
    Still no luck, @s17t? If so, we should proceed with the organizational repository and request Sonatype permissions for releases...
    Federico Tolomei
    @s17t
    sadly no