    Milosz Kukla
    @miloszkukla

    Florian:

    The attribute will always reflect the attribute - this is according to the specs
    If we would just change this then we would violate the specs which violates one of the core principles of AngleSharp

    @FlorianRappl what did you mean by "The attribute will always reflect the attribute" ?
    Florian Rappl
    @FlorianRappl
    What I meant is that GetAttribute will always reflect the real / raw value, while properties (such as href) may be normalized / changed somehow.
    Milosz Kukla
    @miloszkukla
    oh so now I think I understand what Egil was asking about, thanks :)
    Rune Jacobsen
    @havremunken

    Hey guys, so I have this simple code

                var config = Configuration.Default.WithDefaultLoader();
                var context = BrowsingContext.New(config);
                var document = await context.OpenAsync("http://some.url/file.html");

    And I am using document.QuerySelectorAll() to parse some anchor tags. These have relative links, like "file2.html". document.BaseUri in this case is http://some.url/file.html - is there a simple way of creating the full URL for accessing file2.html in this case? In this trivial example that would mean removing file.html and substituting file2.html - but is there a way to say "create a full URL based on this BaseURI and this relative link" that will work like a browser does?
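
One possible way to do the resolution asked about above, sketched with the plain .NET `System.Uri` type rather than any AngleSharp-specific API. The base address and the relative link are the ones from the question; the class and variable names are only illustrative. Casting the anchor to its HTML anchor interface and reading its `Href` property may already give a resolved value, but that is not shown here.

```csharp
using System;

class ResolveRelativeLink
{
    static void Main()
    {
        // Base address of the loaded document (document.BaseUri in the question).
        var baseUri = new Uri("http://some.url/file.html");

        // Raw attribute value as found in the anchor tag.
        var relative = "file2.html";

        // The (Uri, string) constructor resolves the relative reference
        // against the base, replacing the last path segment.
        var full = new Uri(baseUri, relative);

        Console.WriteLine(full.AbsoluteUri); // http://some.url/file2.html
    }
}
```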

    8 replies
    Tom Hazell
    @The-Nutty
    Hi Guys,
    I'm doing some profiling on the HTML parsing, as we have noticed that it is sometimes very slow (we think on documents that are invalid or otherwise not quite right). HtmlDomBuilder#HeisenbergAlgorithm keeps coming up, but googling HeisenbergAlgorithm does not bring up anything related. Would someone be able to point me to something that explains what it's doing?
    In trying to understand what it's doing, I was looking through Chrome's Blink source code for the place where HeisenbergAlgorithm would have been called, and the logic they employ seems way simpler/quicker (code). The equivalent of what they are doing seems to be:
    1. Find the A tag, if there is one, in _formattingElements.
    2. Call ProcessFakeEndTag, which, when you are in the body, just calls InBodyEndTag on a new end-tag token with tag name a.
    3. Remove the active A tag from _formattingElements.
    4. If the open elements contain the A tag, remove it from there too.
      Is there some difference between the two parsers that I'm missing?
    8 replies
    Ryan Cleven
    @onehundredfeet
    Hi
    I was wondering if anyone had an example for how to efficiently substitute one HTML tree in for an element in another using AngleSharp. In other words, take all the tags inside the body of one HTML tree and place them underneath one of the elements in the other tree.
    stevozilik
    @stevozilik

    Hi Everyone. Great to see this tool. I'm evaluating it with the purpose of improving our Web UI test automation.

    We are considering adopting the ASP.NET Core TestServer (https://docs.microsoft.com/en-us/aspnet/core/test/integration-tests?view=aspnetcore-3.1), and I'm curious what the best way is to integrate AngleSharp into the TestServer pipeline. I can see that the MS example first gets the full response from the HttpClient created by WebApplicationFactory, which I imagine only works for the initial page load.

    We would like to include JavaScript and subsequent Ajax calls (it's an ASP.NET Core Angular SPA website) in the testing, which I assume needs a more sophisticated integration with AngleSharp (via the extensibility points, so that it internally always uses the HttpClient provided by the TestServer).

    I had a look at the extensibility points and don't know if that's the way to go, and where to start just based on the docs https://anglesharp.github.io/docs/API.html

    Any ideas appreciated => AngleSharp testing Asp Net Core Angular Spa

    Eric Vander Wal
    @ericvanderwal
    I am trying to do an auto scroll on Twitter for scraping. Can anyone point me in the right direction or to an example of how to do the scroll?
    (Twitter content is loaded dynamically.) I am a little light on JS skills.
    Rune Jacobsen
    @havremunken

    Maybe something like this would be a good starting point?

    https://developer.mozilla.org/en-US/docs/Web/API/Element/scrollIntoView

    Find the bottom of the list, last element, something like that, and scroll it into view.
    Eric Vander Wal
    @ericvanderwal
    @havremunken, thanks, I'll look into it
    Sebastian Loncar
    @arakis
    Is it currently possible to get the calculated position and size of DOM elements? Or is AngleSharp just a parser with lots of interfaces without implementation?
    While respecting CSS styles :-)
    Sebastian Loncar
    @arakis
    Ok, it seems getCalculatedStyle works, but there's no ClientWidth, OffsetWidth, or GetBoundingClientRect :-(
    Greg Bushnell
    @greg.computerden_gitlab
    Is it possible to open a local HTML file rather than an online one?
    1 reply
    System.AppDomain.CurrentDomain.BaseDirectory + "battery-report.html"
    is what I'm trying to open in var address
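
A minimal sketch of one way to load such a file, assuming the AngleSharp NuGet package is referenced: read the file from disk and pass its text to `OpenAsync` through `req.Content(...)`, the same call used with an in-memory string elsewhere in this log. The file name is the one from the question; the class name and the printed property are only illustrative.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using AngleSharp;

class OpenLocalFile
{
    static async Task Main()
    {
        // Path from the question: a file next to the running application.
        var path = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "battery-report.html");

        // Read the markup ourselves, so no network or file loader is needed.
        var html = File.ReadAllText(path);

        var context = BrowsingContext.New(Configuration.Default);

        // Parse the in-memory markup as the document's response body.
        var document = await context.OpenAsync(req => req.Content(html));

        Console.WriteLine(document.Title);
    }
}
```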
    Richard Thompson
    @richarth

    Hi. I'm trying to write some code to find an h1 on a page and get its computed style. The code I've got works fine for getting styles that are inline in the HTML document, but the styles that I have in an external style sheet don't seem to be applied.

    Are external style sheets supported and is there something I need to do to enable external stylesheets, perhaps in the config? I have tried Configuration.Default.WithDefaultLoader(new LoaderOptions { IsResourceLoadingEnabled = true }).WithCss()

    14 replies
    Aleksandr Asanov
    @dream811
    Hi, I am trying to scrape the taobao and tmall sites.
    But with the AngleSharp package I can't get the whole document, because they use JavaScript to render the DOM.
    How can I resolve this issue? Please give me some hints, thanks guys!
    srid
    @srid:matrix.org
    [m]

    Hi, I'm trying out AngleSharp in F#, and experiencing a very strange issue, with QuerySelectorAll basically working only for the top-level body element, but not for any elements inside it. Here's an example:

    > doc.QuerySelectorAll("body") |> Seq.tryHead |> Option.map (fun n -> n.TagName);;
    val it : string option = Some "BODY"
    
    > doc.QuerySelectorAll("div") |> Seq.tryHead |> Option.map (fun n -> n.TagName);;
    val it : string option = None
    
    >

    This is for the document loaded from this URL.

    Oddly enough, the HtmlBodyElement has the correct BaseUrl, but the value of OuterHtml is a mere "<body></body>". Hmm, that might be it.
    Or not, as I see the same behaviour with other URLs.
    1 reply
    srid
    @srid:matrix.org
    [m]
    @FlorianRappl: I followed the README example on github, to load the Wikipedia URL.

    But when I just tried with local content, it works:

    
    let getDoc (htmlContent: string) = 
        let cfg = Configuration.Default.WithDefaultLoader()
        let ctx = BrowsingContext.New(cfg)
        async { return! ctx.OpenAsync(fun req -> req.Content(htmlContent) |> ignore) |> Async.AwaitTask } |> Async.RunSynchronously
    
    [<EntryPoint>]
    let main argv =
        let doc = getDoc "<body><div>Hello</div></body>"
        let cells = doc.QuerySelectorAll("div")
        let titles = query {
                for cell in cells do
                    select cell.TextContent
            }
        printfn $"Printing {Seq.length titles} titles"
        for title in titles do
            printfn $"Title: {title}"
        0 // return an integer exit code

    ^ This works, which is good enough I guess, as I don't plan to load remote content anyway.

    srid
    @srid:matrix.org
    [m]
    By the way, as a new F# dev (and being completely new to dotnet ecosystem; but familiar with FP), the main reason I'm exploring AngleSharp is to figure out how useful it would be for writing my own HTML templating language (based on XML'ish stuff; so stuff like partials & layout and variables are defined the XML way).
    srid
    @srid:matrix.org
    [m]

    Is it possible to get the contents of a <template> tag (so as to query on it)?

    In JS, you would do it using the .content property.

    Eg: to query the td inside the <template> tag of https://developer.mozilla.org/en-US/docs/Web/HTML/Element/template#examples
    (or even get the whole contents of template as its own document)
    Florian Rappl
    @FlorianRappl
    Yes definitely.
    srid
    @srid:matrix.org
    [m]
    I tried let tmpl = doc.QuerySelectorAll("template"), which returns the template tag, but I can't drill down further than that.
    Florian Rappl
    @FlorianRappl
    Why - it's an HTMLTemplateElement - just cast the result to the right type
    Sorry, but C# / F# are not dynamic languages like JS. So you need to get the types right ;)
    srid
    @srid:matrix.org
    [m]
    fwiw, here's my current code:
        let templates = doc.QuerySelectorAll("template")
        let titles =
            query {
                for tmpl in templates do
                    for p in tmpl.QuerySelectorAll("p") do
                        select p.OuterHtml
            }
    titles is an empty seq, so the "p" querying didn't work, which isn't surprising I guess, because I'm supposed to get the fragment inside tmpl and then query on it.
    Florian Rappl
    @FlorianRappl
    Yes you need to query the fragment - your current query works against children of templates, not the content in the template
    srid
    @srid:matrix.org
    [m]
    My assumption is that I need a filler in here: tmpl.GetContentFragment().QuerySelectorAll("p") - but what would be GetContentFragment?
    Don't see anything relevant in auto-completion list for tmpl.
    Florian Rappl
    @FlorianRappl
    Not sure what GetContentFragment is / should be. I guess you want Content?
    srid
    @srid:matrix.org
    [m]
    Right! That must be it; now I gotta figure out casting in F# ...
    QuerySelectorAll returns a collection of IElement. I should cast that to IHtmlTemplateElement
    Wait, you can't downcast it. Looking for polymorphic querying ... (is that an anti-pattern in dotnet?)
    Florian Rappl
    @FlorianRappl
    1. Either use QuerySelector or iterate over all results (not sure if you're interested in a single one or all results).
    2. Check the type before casting it.
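
The two steps above might look roughly like this in C#; a sketch, assuming the AngleSharp NuGet package and its `IHtmlTemplateElement` interface from the `AngleSharp.Html.Dom` namespace. The sample markup and the class name are made up for illustration. `OfType<T>()` does the type check and the cast in one go, and `Content` is the template's document fragment, which is what gets queried.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using AngleSharp;
using AngleSharp.Html.Dom;

class QueryTemplateContent
{
    static async Task Main()
    {
        // Illustrative markup: one paragraph inside a <template>.
        var html = "<body><template><p>Hello</p></template></body>";

        var context = BrowsingContext.New(Configuration.Default);
        var doc = await context.OpenAsync(req => req.Content(html));

        var paragraphs = doc.QuerySelectorAll("template")
            // Keep only results that really are template elements (check + cast).
            .OfType<IHtmlTemplateElement>()
            // Query the template's content fragment, not its (empty) children.
            .SelectMany(t => t.Content.QuerySelectorAll("p"))
            .Select(p => p.OuterHtml);

        foreach (var p in paragraphs)
        {
            Console.WriteLine(p);
        }
    }
}
```

In F#, the same filtering could presumably be done with a type-test pattern (`:? IHtmlTemplateElement as t`) inside `Seq.choose`.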
    srid
    @srid:matrix.org
    [m]
    QuerySelector looks interesting - let me see how I can use it to pull multiple <template> tags
    Wait, I thought that's more polymorphic than QuerySelectorAll - but the difference is only in arity