    Julius Ahenkora
    @jahenkor
    Yes, the crawler outputs a "successful login" message for both correct and incorrect login credentials, and doesn't seem (from what I can see) to save any cookies from the session. So the crawler just crawls the login page.
    Federico Tolomei
    @s17t
    Ok, I will look into it
    Julius Ahenkora
    @jahenkor
    Appreciate it
    Riccardo Tasso
    @raymanrt
    Hi, is it possible to customize or override the edu.uci.ics.crawler4j.parser.Parser class without overriding the WebCrawler.processPage method?
    probably in my case it would be enough to override the HtmlContentHandler class
    Federico Tolomei
    @s17t
    Hi @raymanrt, could you give more detail on your use case? The Parser in WebCrawler is the internal parser. Users should use their own parser (e.g. jsoup, or Tika with a custom config) in their WebCrawler
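    A minimal sketch of the approach @s17t describes: instead of touching the internal Parser, re-parse the page body with jsoup inside your own WebCrawler's visit(). The class name JsoupCrawler is illustrative; this assumes jsoup is on the classpath and uses crawler4j's HtmlParseData.getHtml() to get the raw markup.

    ```java
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;

    public class JsoupCrawler extends WebCrawler {

        @Override
        public void visit(Page page) {
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData parseData = (HtmlParseData) page.getParseData();
                // Re-parse the raw HTML with jsoup instead of relying on the
                // output of the internal Parser/HtmlContentHandler.
                Document doc = Jsoup.parse(parseData.getHtml(),
                                           page.getWebURL().getURL());
                String title = doc.title();
                logger.info("Parsed with jsoup: {} -> {}",
                            page.getWebURL().getURL(), title);
            }
        }
    }
    ```

    The internal parser still runs first (it drives link extraction), but all content handling your application cares about happens in your own parse pass.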
    Riccardo Tasso
    @raymanrt
    Hi @s17t, I don't understand how to use my own parser
    Parser is a private field in WebCrawler, and there is no setter
    the only method which uses the parser is processPage, but it's private
    Federico Tolomei
    @s17t
    @jahenkor See #291: when using FormAuthInfo, be sure to set a CookieStore and a CookiePolicy on CrawlConfig; see FormAuthInfoTest
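    A sketch of the configuration @s17t points to, assuming the FormAuthInfo constructor order (username, password, login URL, form field names) and the CookieStore/CookiePolicy setters mentioned above; all credentials, URLs, and paths are placeholders.

    ```java
    import org.apache.http.client.config.CookieSpecs;
    import org.apache.http.impl.client.BasicCookieStore;

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo;

    public class FormAuthConfig {
        public static CrawlConfig buildConfig() throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawl");  // placeholder path

            // Field names must match the name= attributes of the login form's inputs.
            config.addAuthInfo(new FormAuthInfo(
                    "myUser", "myPassword",          // placeholder credentials
                    "https://example.com/login",     // placeholder login URL
                    "username", "password"));        // placeholder form field names

            // Without these, the session cookie returned by the login POST may
            // not be carried over to subsequent requests (the symptom in #291).
            config.setCookieStore(new BasicCookieStore());
            config.setCookiePolicy(CookieSpecs.STANDARD);
            return config;
        }
    }
    ```

    Compare against FormAuthInfoTest in the repo for the exact, current API.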
    Julius Ahenkora
    @jahenkor
    @s17t thanks, I'll definitely try it out!
    Federico Tolomei
    @s17t
    I am trying to release 4.4.0. Sonatype's requirements for POMs are tight: <developer> is now required. If any committers want to be included, please send me a <developer> tag now, or send a PR after the 4.4.0 release, in a couple of days :|
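    For reference, a <developer> entry in a Maven POM looks roughly like this (all values are placeholders; Sonatype's OSSRH rules require the developers section along with name, description, url, license, and SCM information):

    ```xml
    <developers>
      <developer>
        <id>github-handle</id>
        <name>Full Name</name>
        <email>you@example.com</email>
        <url>https://github.com/github-handle</url>
      </developer>
    </developers>
    ```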
    rz
    @rzo1
    ok
    Federico Tolomei
    @s17t
    4.4.0 is on the repo
    Federico Tolomei
    @s17t
    About #310: the snapshots are on Sonatype's snapshot repo too now. A Travis sub-task uploads them after a successful build. I would keep using CloudBees' repo and leave Sonatype as failover, for now.
    Julius Ahenkora
    @jahenkor
    Hey guys!
    I'm attempting to crawl past a login page with form-based authentication, but the crawler isn't getting past the login page. I see that it grabs a cookie from a successful POST request, but is it persisting through the crawl session?
    Paul Galbraith
    @pgalbraith
    Does Yasser ever weigh in any more? Just curious how active he is.
    Paul Galbraith
    @pgalbraith
    @jahenkor did you create an issue for the login/cookie problem?
    Harshit Agarwal
    @harshitagarwal2
    Hey, I am trying to get the most-searched keywords from a search engine. Would I be able to do that with crawler4j?
    Federico Tolomei
    @s17t
    @pgalbraith thank you for your contributions. The Guava library is in the dependencies again. I would try to avoid including Guava 'just' for InternetDomainName. Is there any alternative implementation? Maybe from Apache?
    Paul Galbraith
    @pgalbraith
    @s17t Hi, I looked into this one quite a bit and wasn't able to find any satisfying alternatives other than the two I found (i.e. Guava for static lookup, and the https://github.com/whois-server-list/public-suffix-list lib for external download/compare) ... ultimately, though, I still think this is beyond what crawler4j should be doing ... just provide the URL and let the consumer decide if they need to do more work to determine public/private host.
    Federico Tolomei
    @s17t
    I can't merge in yasserg/crawler4j anymore. Does anybody know how to contact @yasserg? Thx
    rz
    @rzo1
    Maybe via an issue on GitHub? It seems that he revoked your contributor rights...
    Federico Tolomei
    @s17t
    I opened it two days ago: yasserg/crawler4j#384
    still no luck via his LinkedIn
    Federico Tolomei
    @s17t
    I have created an organization https://github.com/crawler4j/crawler4j with an import of the main repo. I have added @yasserg as admin (we will see if he will respond). @pgalbraith , @rzo1 and @Chaiavi have write permission in the repo.
    rz
    @rzo1
    Still no luck, @s17t? If so, we should proceed with the organizational repository and request Sonatype permissions for releases...
    Federico Tolomei
    @s17t
    sadly no
    For Nexus, at the time of the 4.0/4.5 releases I uploaded my GPG keys, so I still have authorization to push artifacts
    Federico Tolomei
    @s17t
    Concerns have been raised about keeping the name 'crawler4j' for the fork. I support still using the 'crawler4j' name as long as the original Yasser copyright statement is maintained in the README and in the documentation. The license is Apache 2, so the license is not an issue.
    rz
    @rzo1
    The question is whether we should change the package / groupId structure ...
    it would be a clean cut in terms of a real fork.
    Federico Tolomei
    @s17t
    that is another question
    I would like to avoid the change. I will try to contact the edu.uci.ics admin
    rz
    @rzo1
    ok.
    rz
    @rzo1
    any updates from Yasserg?
    Federico Tolomei
    @s17t
    nope :(
    I am talking with @pgalbraith by email about a hard fork these days. We will propose something in the next few days, I think.
    rz
    @rzo1
    alright :)
    I am happy to see the project going forward. There is a lot of open work :)
    rz
    @rzo1
    Okay. It seems the official repo is "dead" :/
    Federico Tolomei
    @s17t
    Hi, everybody interested in an evolutionary fork of crawler4j is invited to vote on the fork's name: http://www.strawpoll.me/17363919
    rz
    @rzo1
    ok
    rotis23
    @rotis23
    Hi all. Anyone still around on this? Was this successfully forked?
    Sai Aditya Harish
    @77aditya77
    Hi
    Anyone around?