    Gabriel Volpe
    @gvolpe
    Good to know, thanks @ruippeixotog
    I was aware of the typed browser, but couldn't find ownText there
    huangbo
    @blueberrynotblue
    scala-scraper is async or sync?
    Rui Gonçalves
    @ruippeixotog
    all methods are sync
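Since every call blocks, one common way to get asynchrony is to wrap the blocking call in a Future yourself. A minimal sketch (the fetchTitle helper and the object name are made up for illustration; it assumes the standard scala-scraper DSL imports):

```scala
import scala.concurrent.{ExecutionContext, Future}

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

object AsyncScrape {
  implicit val ec: ExecutionContext = ExecutionContext.global
  val browser = JsoupBrowser()

  // browser.get blocks until the page is fetched and parsed;
  // running it inside a Future keeps the calling thread free
  def fetchTitle(url: String): Future[String] =
    Future(browser.get(url) >> text("title"))
}
```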
    Roberto Leibman
    @rleibman
    Hey... hi! I'm using your wonderful library to attempt to parse recipes from tons of places across the web.
    A bunch of the URLs I have fail because they're frankly hard to parse (they're not structured in any way that makes much sense).
    So I want to take the whole document and clean it up: remove ads, click trackers, Javascript, etc.
    Is there a way to use the library to filter out stuff I don't want? Something like:
    doc >> filter(element => goodTags.contains(element.tagName))
    Roberto Leibman
    @rleibman
    This extractor kind of does it, but it returns a list of elements, which is obviously not what I want; I want the original document modified.
    val filter: HtmlExtractor[Element, Iterable[Element]] = _.filter{ element => keepTags.contains(element.tagName.toLowerCase) }
    Rui Gonçalves
    @ruippeixotog
    hey @rleibman! scala-scraper currently doesn't offer that functionality
    it can be particularly tricky to work with the underlying HTML parsers, especially the ones that run Javascript code. In scala-scraper, a Document represents a possibly dynamic webpage with a known URL and (possibly) a Javascript codebase running on it
    creating a new Document with modifications in a functional style would be a bit strange, on that aspect
    but something like a DocumentView providing the semantics you want while making it explicit that it's backed by a full Document instance could be a solution
    Roberto Leibman
    @rleibman
    Is DocumentView a thing that exists or something you're proposing?
    Rui Gonçalves
    @ruippeixotog
    it doesn't exist yet, sorry
    Antonio Fijo
    @afijog

    Hi,
    First of all, nice library!! And now my question goes like this:

    I am trying to navigate through an AJAX 'pagination footer' with the typical 1, 2, 3... next... links. Everything works OK with the HtmlUnitBrowser and some specific features the API provides.
    I would like to make it robust in case of network failure, timeout, etc. If I switch off the network I can see the following message, but I have no clue how to catch that exception.

    abr 09, 2018 23:42:55 PM org.apache.http.impl.execchain.RetryExec execute
    INFORMACIÓN: I/O exception (java.net.NoRouteToHostException) caught when processing request to {}->http://www.myhost.com: No route to host: connect

    Is there any way to catch the exception? Or find out if the AJAX request failed with some other method.
    If I use the ready state, status code or status message I always get 'complete', 200 and 'OK'.

    Thanks in advance for your help.
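One way to surface failures like that is a plain scala.util.Try around the call: scala-scraper doesn't expose error callbacks, but browser.get throws on connection errors. This is only a sketch (the RobustFetch object is made up), and failures happening inside HtmlUnit's own AJAX machinery may still be retried or swallowed before they reach your code, as the RetryExec log above suggests:

```scala
import scala.util.{Failure, Success, Try}

import net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser

object RobustFetch {
  val browser = HtmlUnitBrowser()

  // browser.get throws (e.g. java.net.NoRouteToHostException) when the
  // network is down; Try turns that into a value we can inspect
  def fetch(url: String) = Try(browser.get(url)) match {
    case Success(doc) => Some(doc)
    case Failure(e)   => Console.err.println(s"fetch failed: ${e.getMessage}"); None
  }
}
```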

    Shaun Elliott
    @javamonkey79
    Is there any way to access the parent element of a selector?
    e.g. I have this:
    val browser = JsoupBrowser()
    println(text((browser.parseString(html) >> "h2:contains(Some text)")))
    Is it possible to get the h2's parent, in this case?
    Shaun Elliott
    @javamonkey79
    got it
    extractor("h2:contains(Some text)", element, asIs[Element]).parent
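For reference, Element also exposes a parent accessor directly, so the same thing can be done without going through extractor. A sketch (the sample HTML and the wrapper id are made up; parent returns an Option since the root element has no parent):

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._
import net.ruippeixotog.scalascraper.model.Element

object ParentExample {
  val html = """<div id="wrapper"><h2>Some text</h2></div>"""
  val doc = JsoupBrowser().parseString(html)

  // parent is Option[Element]: None for the document's root element
  val parent: Option[Element] = (doc >> element("h2:contains(Some text)")).parent
}
```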
    Roberto Leibman
    @rleibman

    Hey, hi!
    So the extractor elementList("script[type=application/ld+json]") is giving me a CSSException when I use HtmlUnitBrowser, but it did not when I was using JsoupBrowser. The exception is:

    org.w3c.css.sac.CSSException: Invalid selectors: //script[type=application/ld+json]
        at com.gargoylesoftware.htmlunit.html.DomNode.getSelectorList(DomNode.java:1898)
        at com.gargoylesoftware.htmlunit.html.DomNode.querySelectorAll(DomNode.java:1861)
        at net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser$HtmlUnitElement.selectUnderlying(HtmlUnitBrowser.scala:184)
        at net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser$HtmlUnitElement.$anonfun$select$1(HtmlUnitBrowser.scala:186)
        at net.ruippeixotog.scalascraper.model.LazyElementQuery.iterator(ElementQuery.scala:45)
        at net.ruippeixotog.scalascraper.model.ElementQuery$.$anonfun$apply$1(ElementQuery.scala:65)
        at net.ruippeixotog.scalascraper.model.LazyElementQuery.iterator(ElementQuery.scala:45)
        at scala.collection.IterableLike.foreach(IterableLike.scala:71)
        at scala.collection.IterableLike.foreach$(IterableLike.scala:70)

    any ideas?

    (don't pay attention to the //, I was trying something out)
    Roberto Leibman
    @rleibman
    Hopefully a quick question.... I want to extract all text (as a Seq[String]) from
    <div class="recipe-ingredients">
        <div class="field">
            <div class="field--label">Ingredients</div>
            <div class="field--items">
                <div class="field--item">
                    <div class="ingredient-group">For the Steak</div>
                    <ul>
                        <li>1/4 cup toasted sesame oil</li>
                        <li>3 tablespoons Tamari/soy sauce</li>
                    </ul>
                </div>
                <div class="field--item">
                    <div class="ingredient-group">For the Noodles</div>
                    <ul>
                        <li>2 tablespoons smooth peanut butter</li>
                        <li>1/4 cup chives, minced</li>
                    </ul>
                </div>
            </div>
        </div>
    </div>
    I got this far by doing:
    doc >> elementList( "[class='ingredients'], [class='recipe-ingredients'], [id='recipe-ingredients'], [class='ingredient_dish_header'], [class='ingred-list']" )
    But I'm a bit lost here.
    ... just noticed this gitter is probably dead, since my last question (from April) wasn't answered either.
    Rui Gonçalves
    @ruippeixotog
    hi, @rleibman! Sorry, lately I haven't had much time to visit Gitter
    if you don't have good HTML id or class attributes you can use, it's not very direct with scala-scraper either
    you can do something like this
    scala> doc
    res17: browser.DocumentType =
    JsoupDocument(<html>
     <head></head>
     <body>
      <div class="recipe-ingredients">
       <div class="field">
        <div class="field--label">
         Ingredients
        </div>
        <div class="field--items">
         <div class="field--item">
          <div class="ingredient-group">
           For the Steak
          </div>
          <ul>
           <li>1/4 cup toasted sesame oil</li>
           <li>3 tablespoons Tamari/soy sauce</li>
          </ul>
         </div>
         <div class="field--item">
          <div class="ingredient-group">
           For the Noodles
          </div>
          <ul>
           <li>2 tablespoons smooth peanut butter</li>
           <li>1/4 cup chives, minced</li>
          </ul>
         </div>
        </div>
       </div>
      </div>
     </body>
    </html>)
    
    scala> (doc >> elementList("*")).flatMap(_.childNodes.collect { case TextNode(content) => content })
    res18: List[String] = List(" ", " ", " ", " ", " ", " ", Ingredients, " ", " ", " ", " ", " ", " ", For the Steak, " ", " ", " ", 1/4 cup toasted sesame oil, 3 tablespoons Tamari/soy sauce, " ", " ", " ", For the Noodles, " ", " ", " ", 2 tablespoons smooth peanut butter, 1/4 cup chives, minced)
    Roberto Leibman
    @rleibman
    That's helpful, thanks!
    BONNEVALLE Vincent
    @n3f4s
    Hi, I'm scraping a website that has some issues in its code (but it is functional and I can extract what I want from it). scala-scraper outputs a lot of logs that aren't useful to me. Is there a way to tone down the logs?
    Rui Gonçalves
    @ruippeixotog
    What kind of logs does it log?
    I believe it depends on the Browser implementation you're using; you may want to look at jsoup or HtmlUnit docs to see how to turn them down
    BONNEVALLE Vincent
    @n3f4s

    Log like that:

    Dec 17, 2019 1:15:18 AM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
    AVERTISSEMENT: Obsolete content type encountered: 'text/javascript'.

    (AVERTISSEMENT = Warning in French)

    Rui Gonçalves
    @ruippeixotog
    Yep, that's HtmlUnit. HtmlUnit uses Log4J (http://htmlunit.sourceforge.net/logging.html); you can configure it as you would any other logging library
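Depending on which logging backend ends up on the classpath, HtmlUnit's messages may instead go through Apache Commons Logging and java.util.logging. A commonly cited recipe for silencing them, sketched below (the QuietHtmlUnit wrapper is made up; the package names are HtmlUnit's own, and commons-logging arrives transitively with HtmlUnit):

```scala
import java.util.logging.{Level, Logger}
import org.apache.commons.logging.LogFactory

object QuietHtmlUnit {
  def silence(): Unit = {
    // route Commons Logging to a no-op implementation...
    LogFactory.getFactory.setAttribute(
      "org.apache.commons.logging.Log",
      "org.apache.commons.logging.impl.NoOpLog")
    // ...and mute java.util.logging for HtmlUnit and its HTTP client
    Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF)
    Logger.getLogger("org.apache.http").setLevel(Level.OFF)
  }
}
```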
    David Biggs
    @davidgbiggs
    Looked through the Element.scala class for methods to set the value of an input element, but had no luck; is there a way to do that with this library? Or does it just parse HTML via GET, with no mechanization? If not, is there a good Scala lib you'd recommend that does?
    Rui Gonçalves
    @ruippeixotog
    Element, at least under JsoupBrowser, is immutable. However, you can easily submit the contents of a form by using the built-in extractors - something like:
    val (formData, signInAction) = browser.get(loginUrl) >> formDataAndAction
    val signInData = formData + ("userid" -> username) + ("pass" -> password)
    browser.post(signInAction, signInData)
    the idea of this library is not to mutate the DOM, but to extract content from HTML pages
    howellele
    @howelleleh_twitter
    Any chance anyone in here is able to help a 100% novice learn how to scrape an ecommerce site?
    Roberto Leibman
    @rleibman
    My first tip, before you go any further: capture their site first before you go whole hog and get blocked (i.e. don't use their live site for testing)... ask me how I know.
    Diretnan Domnan
    @deven96
    Hello everyone. Can anyone guide me on how to specify the userAgent for JsoupBrowser?
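For reference, JsoupBrowser takes the user agent as a constructor parameter; a sketch (the UA string below is just an example, and the default may vary by version):

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser

object CustomUA {
  // JsoupBrowser's default user agent is jsoup's own, which some sites
  // block; passing a browser-like string often helps
  val browser = new JsoupBrowser(userAgent = "Mozilla/5.0 (X11; Linux x86_64)")
}
```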
    Diretnan Domnan
    @deven96
    Also, can anyone please help me figure out why browser.get() keeps failing on Heroku with org.jsoup.HttpStatusException: HTTP error fetching URL
    Roberto Leibman
    @rleibman
    You got a bit more info? a stack trace? Can you put a breakpoint and debug, maybe get more info by poking around the status? What happens if you use curl, w3m or wget?
    Diretnan Domnan
    @deven96
    Yeah let me post the stack trace... funny thing is that it works fine locally
    Diretnan Domnan
    @deven96

    com.twitter.finagle.util.DefaultMonitor - Exception propagated to the default monitor (upstream address: /10.14.25.48:17443, downstream address: n/a, label: https).
    org.jsoup.HttpStatusException: HTTP error fetching URL
        at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:776)
        at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:722)
        at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:306)
        at net.ruippeixotog.scalascraper.browser.JsoupBrowser.executeRequest(JsoupBrowser.scala:73)
        at net.ruippeixotog.scalascraper.browser.JsoupBrowser.$anonfun$executePipeline$3(JsoupBrowser.scala:84)
        at scala.Function1.$anonfun$andThen$1(Function1.scala:52)
        at scala.Function1.$anonfun$andThen$1(Function1.scala:52)
        at net.ruippeixotog.scalascraper.browser.JsoupBrowser.get(JsoupBrowser.scala:36)
        at com.engines.scalic.RedThreeMPThreeEngine.search(RedThreeMPThree.scala:44)
        at com.commands.scalic.ApiController.$anonfun$new$2(Api.scala:80)
    Roberto Leibman
    @rleibman
    mhhh... not much more info there. It'd be nice to see the error status (e.g. 403 would tell you it's forbidden, which probably means the site you're trying to scrape is blocking whatever IP Heroku is coming from... not surprising). It's also hard to say without the URL you're trying to scrape.