    Andrew Kaiser
    @andykais

    Hi, is it possible to limit the results of a selector? Say I have a very large html page, and there are 10000 items on the page. If I only want the first ten, is there a way to limit the loop in cheerio? I know I can mostly achieve what I am trying to do within the querySelector:

    const $ = cheerio.load(largeHtmlString)
    const items = []
    $('ul > li:nth-child(-n+10)').each((i, el) => items.push($(el).text()))

    but this seems slightly less clear, and not applicable to all situations. Is it possible to give a "max" to cheerio?

    Vojtěch Drábek
    @comodoro
    Would someone please be so kind and point out why the following does not iterate over all .table-responsive results on the page?
    https://gist.github.com/comodoro/3b0534f4e821e9055f3cb78fa4e5513b
    Vojtěch Drábek
    @comodoro
    No need anymore... the page returns something else for non-browsers and it did not occur to me, sorry for the post.
    Arun Kumar
    @arunkumar413
    Hi all,
    Can I edit an HTML file and save it using cheerio?
    Anup Dhakal
    @anuphunt
    var request = require('request');
    var cheerio = require('cheerio');
    var fs = require('fs');

    request("http://kathmandupost.ekantipur.com/news/2018-08-31/bimstec-summit-multilateral-meet-underway.html", function(error, response, body){
        if(error){
            console.log("Error: " + error);
            return; // response and body are undefined when the request fails
        }
        console.log("Status code: " + response.statusCode);
        var $ = cheerio.load(body);

        // `this` is not a DOM element here, so query the loaded document directly.
        // 'h1.title' matches an h1 with class "title"; 'h1 .title' would match a descendant of an h1.
        var title = $('h1.title').text();
        fs.appendFileSync('ekantipur.txt', title);
    });
    What am I doing wrong in this code? I just want to get the text from class "title" that has h1 tag.
    andreachiera
    @andreachiera
    Hi,
    I'd like to know whether all functions in the cheerio module are synchronous.
    Konstantin Bläsi
    @konstantinblaesi
    Using <h2>Headline</h2><p>Paragraph</p> as input to cheerio and running $(...).text() on it results in "HeadlineParagraph", but I'd expect a newline after "Headline", or at least a space. Does anyone know a solution?
    Danish S
    @medanishtheboy_twitter
    Has anyone used Selenium for scraping?
    What's faster, cheerio or Selenium (with Java)?
    Vojtěch Drábek
    @comodoro
    @medanishtheboy_twitter Cheerio is faster, since Selenium first renders the whole page, including running any JavaScript, and basically simulates user input. Cheerio only parses the HTML source, so it is faster but limited to what is in that source.
    Kevin Gonzalez
    @kgonzale
    Hello all
    Kevin Gonzalez
    @kgonzale
    Why does everything scraped from the p tags end up concatenated into a single object in this code?
      // cheerio passes (index, element) to the callback, in that order.
      // .each() is used because the return value of .map() was discarded.
      $("li.staff-directory-department").each((index, element) => {
        arrayInfo.push({
          id: $("h2.staff-directory-department", element)
            .text()
            .trim(),
          name: $("h3.staff-member-name", element)
            .text()
            .split("\n")
            .map(splitString => splitString.trim())
            .filter(trimString => trimString.length > 0),
          // .text() joins the text of every match inside this element
          title: $("p.staff-member-title", element).text(),
          email: $("p.staff-member-email", element).text(),
          phone: $("p.staff-member-phone", element).text()
        });
      });
     title:
       'Director of Institutional Research & EffectivenessAdministrative AssistantDirector of Online Learning and Educational TechnologyAssociate Vice President of Academics and Strategic InitiativesVice President for Academics & Student Life',
    example
    Gei0r
    @Gei0r
    @kgonzale please add html input
    That is the full SO post
    Gei0r
    @Gei0r
    ok, so is your question answered or is it still open? Couldn't quite follow the SO post
    Joseph Crawford
    @jcrawford

    Hello everyone, I am trying to use cheerio in a Node.js application to parse a document into a structure that I can use to index documents with Algolia. I am allowing the developer consuming this to pass in a structure such as h1 h3 p, or h1 h2 h3 p, but this will be dynamic. I am then trying to get the elements and create JSON objects, which will be passed to Algolia for indexing. Below is a sample HTML structure and how the JSON objects are created:

    <h1>Article Title</h1>
    <h3>Section Title</h3>
    <p>Content</p>
    <h3>Section Two</h3>
    <p>Content 2</p>
    <h3>Section Three</h3>
    <p>Content 3</p>
    <h1>Secondary Title</h1>
    <p>Secondary Content</p>

    So for options.structure I would pass in h1 h3 p, which would then create the following structures to pass to Algolia:

    {
      "link":"path/article#article-title",
      "importance":0,
      "objectID":"path/article#article-title-816d21744be4034a4bfa3323722024b4",
      "h1":"Article Title"
    }
    
    {
      "link":"path/article#section-title",
      "importance":3,
      "objectID":"path/article#section-title-816d21744be4034a4bfa3323722024b4",
      "h1":"Article Title",
      "h3":"Section Title"
    }
    
    {
      "link":"path/article#section-title-p0",
      "importance":5,
      "objectID":"path/article#section-title-p0-ae5e3efcdddc387616bbc5b1e5b1b134",
      "h1":"Article Title",
      "h3":"Section Title",
      "p":"Content"
    }
    
    {
      "link":"path/article#section-two",
      "importance":3,
      "objectID":"path/article#section-two-816d21744be4034a4bfa3323722024b4",
      "h1":"Article Title",
      "h3":"Section Two"
    }
    
    {
      "link":"path/article#section-two-p0",
      "importance":5,
      "objectID":"path/article#section-two-p0-ae5e3efcdddc387616bbc5b1e5b1b134",
      "h1":"Article Title",
      "h3":"Section Two",
      "p":"Content 2"
    }
    
    {
      "link":"path/article#section-three",
      "importance":3,
      "objectID":"path/article#section-three-816d21744be4034a4bfa3323722024b4",
      "h1":"Article Title",
      "h3":"Section Three"
    }
    
    {
      "link":"path/article#section-three-p0",
      "importance":5,
      "objectID":"path/article-title#section-three-p0-ae5e3efcdddc387616bbc5b1e5b1b134",
      "h1":"Article Title",
      "h3":"Section Three",
      "p":"Content 3"
    }
    
    {
      "link":"path/article#secondary-title",
      "importance":0,
      "objectID":"path/article#secondary-title-816d21744be4034a4bfa3323722024b4",
      "h1":"Secondary Title"
    }
    
    {
      "link":"path/article#secondary-title-p0",
      "importance":4,
      "objectID":"path/article#secondary-title-p0-816d21744be4034a4bfa3323722024b4",
      "h1":"Secondary Title",
      "p":"Secondary Content"
    }

    Now this structure would have to be dynamic meaning rather than the above they could pass in h1 h2 h3 h4 p and it should be parsed accordingly.

    There could also be content under h1 such as <h1>Title</h1><p>Content</p> and this would receive a different importance as seen in the last example above.

    The trouble I am having is how can I select all paragraphs under each tag in the structure. I know I can get all p tags under h1 and that will return them all but it will not let me know which h1 the p’s belong to, etc.

    I am following this article trying to replicate this using cheerio.
    https://blog.algolia.com/how-to-build-a-helpful-search-for-technical-documentation-the-laravel-example/

    Joseph Crawford
    @jcrawford
    I guess the p could be left out of the structure, since it can occur anywhere under any other element and would have to be determined in code. I am just trying to figure out how I could end up with something like h1 > h3 > p p p, h1 > h3 > p, h1 > p p, etc. If I grab all p's under h1 it could be 10 p's, but I wouldn't know which h1 they belong to for indexing. So I would want to loop over each structure element, check for the existence of p and note it, then check for the next structure element provided, and so on, and create an object based on this. Then I can use that to create these indexes. Keeping track of which element a p belongs to is required, though, and that is what I cannot figure out.
    Joseph Crawford
    @jcrawford
    I am thinking that I need to loop over each structure element passed in, then loop over each element found and for each grab any sibling p's and add them to a data structure, however once I loop over the elements such as h1 I am finding it tough to get any sibling p's. I cannot use nextUntil because I do not know if the next tag would be an h1, h2, h3, etc. Is there a way in cheerio that I could get all sibling p's for an element while looping over them and not get sibling p's for other elements in the loop until i loop over them?
    Gei0r
    @Gei0r
    Correct me if I'm wrong:
    Your problem is that semantically, the <p> is a child of <h2> which in turn is a child of <h1>, so there is a semantic parent-child relationship h1 > h2 > p
    but this relationship is not transferred to the html
    in the html, h1 and h2 and p are just some tags and neither is fundamentally "senior" to the other.
    Joseph Crawford
    @jcrawford
    @Gei0r correct they are all siblings there is no parent/child with h elements and p elements
    I have never seen a parent child relationship with h tags and p tags always seen them as siblings
    Gei0r
    @Gei0r
    yeah, you've never seen that relationship in html because html is not set up like this
    but that's also the reason why you won't get the parent->sibling structure from an html parser like cheerio
    So you will have to do it yourself. Starting at the top, read in the tags one by one and remember the ones you've already seen
    nextUntil is not the right function for this use. You should use next
    Joseph Crawford
    @jcrawford
    @Gei0r thanks for that information
    Christopher Shelley
    @basiclaser
    hey all is it possible to 'next' twice?
    there is a UL that i want to get - the only identifier is on a sibling H1 - between the H1 and the UL is a SPAN tag. I can get the SPAN tag with 'next' on the H1 but don't know how to then get the UL after it
    Victor Aprea
    @vicatcu
    Hi gang, I seem to be getting runtime errors when trying to use cheerio in an Angular 7 project
    at first i was getting errors at build time about "can't find 'stream'"
    so i npm installed stream and the build-time error went away, only to be replaced by a very similar run-time error
    i didn't see anything quite like this in the issues... anyone have any ideas?
    Gei0r
    @Gei0r
    can you try to reduce your problem to a minimal working example and post it?
    Dj-jom2x
    @Dj-jom2x
    you use cheerio at client side? to do crawling?
    Ashish117
    @Ashish117
    Hey, is anyone here? Can anyone help me a bit?
    mikku
    @mikku_g_twitter
    Hi
    I am unable to fetch <script> tag content using cheerio. Am I missing any settings/config?
    Damian Toczek
    @damiantoczek

    Hello guys, I've just started using CheerioJS and would like to know how to scrape a website. $("#bproducts div.p div.pname") will return an object with all the product names, but I want to get the div.pprice next to the div.pname.

    I was trying to make two arrays like:

    const pName = $("#bproducts div.p div.pname");
    const pPrice = $("#bproducts div.p div.pprice");

    and then glue it together into one object as key:value.

    What I want is to get a key:value pair like pname:pprice

    Fausto A. Guerrero
    @enjoythelive1
    Even if you are using cheeriojs, this is not the place to ask questions that do not require specific cheeriojs knowledge. I would suggest you ask the question on Stack Overflow.
    Damian Toczek
    @damiantoczek
    This is exactly the place to ask this question because there is no way to understand what the hell cheerio is doing. It says that children is an array, but it's an array that contains those weird objects that make no sense. I've now spent about 8h on cheerio and I'm still not able to get the children and create the product array. People say that CheerioJS is amazing, but there are no tutorials and the documentation is very poor and doesn't explain how the object represents the HTML elements.
    hackerunet
    @hackerunet

    Hi there, dear community. I have a question regarding some scraping issues I'm having. I'm trying to get the text of an 'a' tag, and each time I use $(element).text() I receive an empty string. Looking at the HTML I can see everything there, but there are also Angular elements. My 'a' tag scraping result looks like this:

    { '0':
       { type: 'tag',
         name: 'a',
         namespace: 'http://www.w3.org/1999/xhtml',
         attribs:
          { 'ng-href': '{{treatPathProduct(product.Characteristics[\'Path\'])}}',
            'ng-bind-html': 'product.Characteristics[\'##ProductLabel\']' },
         'x-attribsNamespace': { 'ng-href': undefined, 'ng-bind-html': undefined },
         'x-attribsPrefix': { 'ng-href': undefined, 'ng-bind-html': undefined },
         children: [],
         parent:
          { type: 'tag',
            name: 'h2',
            namespace: 'http://www.w3.org/1999/xhtml',
            attribs: {},
            'x-attribsNamespace': {},
            'x-attribsPrefix': {},
            children: [Array],
            parent: [Object],
            prev: [Object],
            next: [Object] },
         prev: null,
         next: null },
      options:
       { withDomLvl1: true,
         normalizeWhitespace: false,
         xml: false,
         decodeEntities: true },
      length: 1 }

    My question is, how can I get the text and the href attribute? Both are empty, or seem to be.