I have opened an issue on GitHub: "edu.uci.ics.crawler4j.parser.Parser parse WebURL not completely. #254"
Federico Tolomei
@s17t
thank you
yunlong17568
@YunlongYang
you are welcome
I think that when finding URLs in HTML content, we should take the surrounding context into account, treating it as HTML rather than plain text
Valery Yatsynovich
@valfirst
hi everyone, does crawler4j support preemptive basic authentication?
Julius Ahenkora
@jahenkor
Is form authentication still functional for crawler4j?
Federico Tolomei
@s17t
hello Julius
it should be, but I'm not a user of it (and it lacks tests)
did you encounter problems?
Julius Ahenkora
@jahenkor
Yes, the crawler outputs a "successful login" message for both correct and incorrect login credentials. And it doesn't seem to (from what I can see) save any cookies from the session. So the crawler just crawls the login page.
Federico Tolomei
@s17t
Ok, I will look into it
Julius Ahenkora
@jahenkor
Appreciate it
Riccardo Tasso
@raymanrt
Hi, is it possible to customize or override the edu.uci.ics.crawler4j.parser.Parser class without overriding the WebCrawler.processPage method?
probably in my case it would be enough to override the HtmlContentHandler class
Federico Tolomei
@s17t
Hi @raymanrt, could you describe your use case in more detail? Parser in WebCrawler is the internal parser. Users should use their own parser (e.g. jsoup, or tika with a custom config) in their WebCrawler
Riccardo Tasso
@raymanrt
Hi @s17t I don't understand how to use my own parser
Parser is a private field in WebCrawler, and there is no setter
the only method which uses the parser is processPage, but it's private
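For reference, a minimal sketch of the "own parser" approach mentioned above, which does not touch the internal Parser at all: re-parse the fetched HTML inside visit() with jsoup (the JsoupCrawler name is hypothetical, and it assumes jsoup is on the classpath):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;

// Hypothetical example crawler: the internal Parser still runs, but link/content
// extraction for the application is done here with jsoup on the raw HTML.
public class JsoupCrawler extends WebCrawler {

    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlData = (HtmlParseData) page.getParseData();
            Document doc = Jsoup.parse(htmlData.getHtml(), page.getWebURL().getURL());
            // Extract absolute links, with the full HTML context available.
            doc.select("a[href]").forEach(link ->
                    System.out.println(link.attr("abs:href")));
        }
    }
}
```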
Federico Tolomei
@s17t
@jahenkor See #291; when using FormAuthInfo, be sure to set a CookieStore and a CookiePolicy on the CrawlConfig (see FormAuthInfoTest)
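For anyone landing here later, a minimal sketch of that setup (credentials, URLs and form field names are placeholders; it assumes the CrawlConfig setters named above, setCookieStore and setCookiePolicy, as exercised in FormAuthInfoTest):

```java
import org.apache.http.client.config.CookieSpecs;
import org.apache.http.impl.client.BasicCookieStore;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo;

public class FormAuthConfigSketch {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl"); // placeholder path

        // Form login: the last two arguments must match the name attributes
        // of the username/password fields in the login form.
        config.addAuthInfo(new FormAuthInfo(
                "myUser", "myPassword",             // placeholder credentials
                "http://example.com/login",         // placeholder login page
                "username", "password"));

        // As noted above: set an explicit cookie store and cookie policy so the
        // session cookie from the login POST survives for subsequent requests.
        config.setCookieStore(new BasicCookieStore());
        config.setCookiePolicy(CookieSpecs.STANDARD);
    }
}
```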
Julius Ahenkora
@jahenkor
@s17t thanks, I'll definitely try it out!
Federico Tolomei
@s17t
I am trying to release 4.4.0. Sonatype's requirements for POMs are strict: <developer> is now required. If any committer wants to be included now, please send me a <developer> tag, or send a PR after the 4.4.0 release, in a couple of days :|
About #310, the snapshots are on Sonatype's snapshot repo too now. A Travis sub-task uploads them after a successful build. I would keep using CloudBees' repo and leave Sonatype as a failover, for now.
Julius Ahenkora
@jahenkor
Hey guys!
I'm attempting to crawl past a login page with form-based authentication, but the crawler isn't going past the login page. I see that it grabs a cookie from a successful POST request, but is that cookie being persisted through the crawl session?
Paul Galbraith
@pgalbraith
Does Yasser ever weigh in much any more? Just curious how active he is.
Paul Galbraith
@pgalbraith
@jahenkor did you create an issue for the login/cookie problem?
Harshit Agarwal
@harshitagarwal2
hey, I am trying to get the most searched keywords in a search engine. Would I be able to do that with crawler4j?
Federico Tolomei
@s17t
@pgalbraith thank you for your contributions. The guava library is in the dependencies again. I would try to avoid including guava 'just' for InternetDomainName. Is there any alternative implementation? Maybe from Apache?
Paul Galbraith
@pgalbraith
@s17t Hi, I looked into this one quite a bit and wasn't able to find any satisfying alternatives other than the two I found (i.e. Guava for static lookup, and the https://github.com/whois-server-list/public-suffix-list lib for external download/compare) ... ultimately, though, I still think this is beyond what crawler4j should be doing ... just provide the URL and let the consumer decide if they need to do more work to determine the public/private host.
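For context, roughly what the Guava-based static lookup being discussed does (a sketch, not the crawler4j code itself):

```java
import com.google.common.net.InternetDomainName;

public class DomainLookupSketch {
    public static void main(String[] args) {
        // Guava ships a bundled snapshot of the public suffix list, so this
        // works offline, but it can go stale between Guava releases.
        InternetDomainName name = InternetDomainName.from("www.example.co.uk");
        System.out.println(name.publicSuffix());     // co.uk
        System.out.println(name.topPrivateDomain()); // example.co.uk
    }
}
```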
Federico Tolomei
@s17t
I can't merge anymore in yasserg/crawler4j. Does anybody know how to contact Yasserg? Thx
rz
@rzo1
Maybe via an issue on GitHub? It seems that he revoked your contributor rights...
I have created an organization https://github.com/crawler4j/crawler4j with an import of the main repo. I have added @yasserg as admin (we will see if he will respond). @pgalbraith , @rzo1 and @Chaiavi have write permission in the repo.
rz
@rzo1
still no luck @s17t? If so, we should proceed with the organizational repository and request Sonatype permissions for releases...
Federico Tolomei
@s17t
sadly no
for Nexus, at the time of the 4.0/4.5 release I uploaded my GPG keys, so I still have authorization to push artifacts
Federico Tolomei
@s17t
Concerns have been raised about keeping the name 'crawler4j' for the fork. I support still using the name 'crawler4j' as long as the original Yasser copyright statement is maintained in the README and in the documentation. The license is Apache 2, so licensing is not an issue.
rz
@rzo1
the question is whether we should change the package / groupId structure ...
it would be a clean cut in terms of a real fork.
Federico Tolomei
@s17t
that is another question
I would like to avoid the change. I will try to contact the edu.uci.ics admin