    That is the full SO post
    ok, so is your question answered or is it still open? Couldn't quite follow the SO post
    Joseph Crawford

    Hello everyone, I am trying to use cheerio in a Node.js application to parse a document into a structure that I can use to index documents with Algolia. The developer consuming this can pass in a structure such as h1 h3 p, or h1 h2 h3 p, so it has to be dynamic. I then want to take the elements and create JSON objects which will be passed to Algolia for indexing. Below is a sample HTML structure and how the JSON objects are created:

    <h1>Article Title</h1>
    <h3>Section Title</h3>
    <h3>Section Two</h3>
    <p>Content 2</p>
    <h3>Section Three</h3>
    <p>Content 3</p>
    <h1>Secondary Title</h1>
    <p>Secondary Content</p>

    So for options.structure I would pass in h1 h3 p, which would then create the following structures to pass to Algolia:

      { "h1": "Article Title" }
      { "h1": "Article Title", "h3": "Section Title" }
      { "h1": "Article Title", "h3": "Section Title" }
      { "h1": "Article Title", "h3": "Section Two" }
      { "h1": "Article Title", "h3": "Section Two", "p": "Content 2" }
      { "h1": "Article Title", "h3": "Section Three" }
      { "h1": "Article Title", "h3": "Section Three", "p": "Content 3" }
      { "h1": "Secondary Title" }
      { "h1": "Secondary Title", "p": "Secondary Content" }

    Now this structure would have to be dynamic, meaning that instead of the above they could pass in h1 h2 h3 h4 p and it should be parsed accordingly.

    There could also be content under h1 such as <h1>Title</h1><p>Content</p> and this would receive a different importance as seen in the last example above.

    The trouble I am having is how to select all paragraphs under each tag in the structure. I know I can get all p tags under h1 and that will return them all, but it will not tell me which h1 the p's belong to.

    I am following this article trying to replicate this using cheerio.

    Joseph Crawford
    I guess the p could be left out of the structure, since it could occur anywhere under any other element and would have to be determined in code. I am just trying to figure out how I could end up with something like h1 > h3 > p p p, h1 > h3 > p, h1 > p p, etc. If I grab all p's under h1 it could be 10 p's, but I wouldn't know which h1 they belong to for indexing. So I would want to loop over each structure element, check for the existence of p and note it, then check for the next structure element provided, and so on, and create an object based on this. Then I can use that to create these indexes. Keeping track of which element a p belongs to is required though, and that is what I cannot figure out.
    Joseph Crawford
    I am thinking that I need to loop over each structure element passed in, then loop over each element found, and for each one grab any sibling p's and add them to a data structure. However, once I loop over the elements such as h1, I am finding it tough to get any sibling p's. I cannot use nextUntil because I do not know if the next tag would be an h1, h2, h3, etc. Is there a way in cheerio to get all sibling p's for an element while looping, without also getting the sibling p's that belong to other elements in the loop before I reach them?
    Correct me if I'm wrong:
    Your problem is that semantically, the <p> is a child of <h2> which in turn is a child of <h1>, so there is a semantic parent-child relationship h1 > h2 > p
    but this relationship is not transferred to the html
    in the html, h1 and h2 and p are just some tags and neither is fundamentally "senior" to the other.
    Joseph Crawford
    @Gei0r correct they are all siblings there is no parent/child with h elements and p elements
    I have never seen a parent child relationship with h tags and p tags always seen them as siblings
    yeah, you've never seen that relationship in html because html is not set up like this
    but that's also the reason why you won't get the parent->sibling structure from an html parser like cheerio
    So you will have to do it yourself. Starting at the top, read in the tags one by one and then remember the ones you've already seen
    nextUntil is not the right function for this use. You should use next
    Joseph Crawford
    @Gei0r thanks for that information
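
    The "read the tags one by one and remember what you've seen" approach could be sketched roughly as below. This is only an illustration under assumptions: `buildRecords` is a hypothetical helper (not a cheerio function), and it works on a plain, document-ordered list of `{ tag, text }` pairs so that it runs without cheerio. The comment shows one way such a list might be collected with cheerio.

```javascript
// Hypothetical helper, not part of cheerio.
// `elements` is a flat, document-ordered list of { tag, text } pairs.
// With cheerio such a list could be collected with something like:
//   $('body').children().map((i, el) => ({ tag: el.name, text: $(el).text() })).get()
function buildRecords(elements, structure) {
  const current = {}; // most recent text seen for each level of the structure
  const records = [];

  for (const { tag, text } of elements) {
    const depth = structure.indexOf(tag);
    if (depth === -1) continue; // tag is not part of the requested structure

    current[tag] = text;
    // A new heading invalidates everything remembered below it.
    for (const deeper of structure.slice(depth + 1)) delete current[deeper];

    // Emit the ancestors seen so far plus the element itself.
    const record = {};
    for (const level of structure.slice(0, depth + 1)) {
      if (level in current) record[level] = current[level];
    }
    records.push(record);
  }
  return records;
}

// The sample HTML from the question, flattened by hand.
const elements = [
  { tag: 'h1', text: 'Article Title' },
  { tag: 'h3', text: 'Section Title' },
  { tag: 'h3', text: 'Section Two' },
  { tag: 'p', text: 'Content 2' },
  { tag: 'h3', text: 'Section Three' },
  { tag: 'p', text: 'Content 3' },
  { tag: 'h1', text: 'Secondary Title' },
  { tag: 'p', text: 'Secondary Content' },
];

const records = buildRecords(elements, ['h1', 'h3', 'p']);
console.log(JSON.stringify(records, null, 2));
```

    Because deeper levels are cleared whenever a heading is seen, a p directly under an h1 ends up with only the h1 as its parent, which is the "different importance" case from the question. Passing a different structure such as ['h1', 'h2', 'h3', 'h4', 'p'] needs no code changes.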
    Christopher Shelley
    hey all, is it possible to 'next' twice?
    there is a UL that I want to get. The only identifier is on a sibling H1, and between the H1 and the UL is a SPAN tag. I can get the SPAN tag with 'next' on the H1 but don't know how to then get the UL after it
    Victor Aprea
    Hi gang, I'm getting runtime errors when trying to use cheerio in an Angular 7 project
    at first i was getting errors at build time about "can't find 'stream'"
    so i npm installed stream and the build-time error went away, only to be replaced by a very similar run-time error
    i didn't see anything quite like this in the issues... anyone have any ideas?
    can you try to reduce your problem to a minimal working example and post it?
    you use cheerio at client side? to do crawling?
    Hey, is anyone here? Can anyone help me a bit?
    I am unable to fetch <script> tag content using cheerio. Is there any setting/config I am missing?

    Hello guys, I've just started using CheerioJS and would like to know how to scrape a website. $("#bproducts div.p div.pname") will return an object with all the product names, but I want to get the div.pprice next to the div.pname.

    I was trying to make two arrays like:

    const pName = $("#bproducts div.p div.pname");
    const pPrice = $("#bproducts div.p div.pprice");

    and then glue it together into one object as key:value.

    What I want is to get a key:value pair like pname:pprice
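
    One way the "glue two arrays into key:value" step might look is sketched below. It is a minimal illustration under assumptions: `zipToObject` is a hypothetical helper, and the sample names and prices are made up to stand in for the text of each div.pname and div.pprice, so the snippet runs without cheerio.

```javascript
// Hypothetical helper, not part of cheerio: pairs keys[i] with values[i].
function zipToObject(keys, values) {
  return Object.fromEntries(keys.map((key, i) => [key, values[i]]));
}

// Made-up sample data. With cheerio, plain string arrays could be collected
// with .map(...).get(), for example:
//   $('#bproducts div.p div.pname').map((i, el) => $(el).text()).get()
const names = ['Widget', 'Gadget'];
const prices = ['$10', '$25'];

const products = zipToObject(names, prices);
console.log(products);
```

    Pairing by index assumes every product has exactly one name and one price, in the same order. If that might not hold, it is probably safer to loop over each div.p and read the .pname and .pprice inside that one element with .find(), so a missing price cannot shift the pairs.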

    Fausto A. Guerrero
    Even if you are using cheeriojs, this is not the place to ask a question that does not require specific cheeriojs knowledge. I would suggest you ask the question on Stack Overflow
    This is exactly the place to ask this question because there is no way to understand what the hell cheerio is doing. It says that children is an array, but it's an array that contains those weird objects that make no sense. I've now spent about 8h on cheerio and I'm still not able to get the children and create the product array. People say that CheerioJS is amazing, but there are no tutorials and the documentation is very poor and doesn't explain how the object represents the html elements.

    hi there, dear community, I have a question regarding some scraping issues I'm having. I'm trying to get the text of an 'a' tag, and each time I use $(element).text() I receive an empty string. Looking at the html I can see everything there, but there are also angular elements. My 'a' tag scraping result looks like this:

    { '0':
       { type: 'tag',
         name: 'a',
         namespace: 'http://www.w3.org/1999/xhtml',
         attribs:
          { 'ng-href': '{{treatPathProduct(product.Characteristics[\'Path\'])}}',
            'ng-bind-html': 'product.Characteristics[\'##ProductLabel\']' },
         'x-attribsNamespace': { 'ng-href': undefined, 'ng-bind-html': undefined },
         'x-attribsPrefix': { 'ng-href': undefined, 'ng-bind-html': undefined },
         children: [],
         parent:
          { type: 'tag',
            name: 'h2',
            namespace: 'http://www.w3.org/1999/xhtml',
            attribs: {},
            'x-attribsNamespace': {},
            'x-attribsPrefix': {},
            children: [Array],
            parent: [Object],
            prev: [Object],
            next: [Object] },
         prev: null,
         next: null },
      options:
       { withDomLvl1: true,
         normalizeWhitespace: false,
         xml: false,
         decodeEntities: true },
      length: 1 }

    My question is, how can I get the text and the href attribute? Both are empty, or seem empty

    Hello, could you please help me get a child node's index in cheerio? For example, I have one main DIV. Under that main DIV I am getting multiple child DIVs based on a database value. I want to find a child DIV by id/class and get its position (index) under the main DIV, for example whether it is in position 2 or 3. If anyone has an idea, could you please help me with this?
    Hi there! I'm new to cheerio and I just want to know if there is a fast way to make a json file
    Anyone here
    I have been using cheerio to extract data, which I am able to get in the terminal, but when I use the same code in the browser I get a cross-origin error
    I want to understand why I am getting the value in the terminal but not in the browser
    No one here?
    hello guys, do you have any spider community website?
    Hello, is there a way to use cheerio as an esmodule?
    so that way I can use it from Deno
    Nguyễn Mạnh Trung
    Hello guys, how do I use cheerio in a chrome extension?
    Jagdish Parihar

    I am trying

    $("iframe").each(function (_i, link) {
      const data = cheerio.html(link);
      // output: <iframe id="iframe"></iframe>
    });

    It gives me the markup of the tag only. But I want the whole html that is present inside the iframe tag.

    eg : <html><body><h1>hello world</h1></body></html>

    Sergei Stadnik
    Hello, looking for a bit of advice. I am using Postman to write UI tests, and need to get text from an element inside the DOM. The test passes every time, even with the wrong text.
    I would appreciate any examples.

    var cheerio = require('cheerio');
    const $ = cheerio.load('<ul class="cards-wrapper">...</ul>');
    $.html();
    pm.test('Test name', function () {
      $('.cards-wrapper').text("12345");
    });

    Hello! I just discovered cheerio! Looks awesome. Exactly what I was looking for.
    Question: can cheerio run on the frontend? If not, can I write backend code with cheerio and use webpack to substitute jQuery for cheerio on the frontend?
    how do you parse the dom using cheerio? I want to retrieve the title of webpages
    anyone here on this channel?