    Yuriy Musienko
    @musienkoyuriy
    but <meta /> is ok
    P. Doyle
    @pd164594
    Hey all, quick question: I am using cheerio to pull down the URL of a news article, and I then want to pass that URL to the client. I am having trouble finding info on how to do that. Any ideas?
    SuperOP535
    @SuperOP535
    @mhsjlw You do $('.pinmessage').text()
    kidrock
    @kidrock
    Hi, I have a problem with the À HTML code: cheerio always tries to convert it into a UTF-8 character.
    Ghost
    @ghost~559d399815522ed4b3e3a63a
    This is the first example from README:
    'use strict';
    
    const cheerio = require('cheerio')
    const $ = cheerio.load('<h2 class="title">Hello world</h2>')
    
    $('h2.title').text('Hello there!')
    $('h2').addClass('welcome')
    
    console.log($.html());
    //EXPECTED: <h2 class="title welcome">Hello there!</h2>
    //ACTUAL: <html><head></head><body><h2 class="title welcome">Hello there!</h2></body></html>
    See expected vs actual
    You should fix the docs
    Andrew Kaiser
    @andykais

    Hi, is it possible to limit the results of a selector? Say I have a very large HTML page with 10,000 items on it. If I only want the first ten, is there a way to limit the loop in cheerio? I know I can mostly achieve what I am trying to do within the selector itself:

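    const cheerio = require('cheerio')
    const $ = cheerio.load(largeHtmlString)
    const items = []
    // .each() only reads; a callback passed to .text() would replace each item's text with its return value
    $('ul > li:nth-child(-n+10)').each((i, el) => { items.push($(el).text()) })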

    but this seems slightly less clear, and it is not applicable to all situations. Is it possible to give a "max" to cheerio?

    Vojtěch Drábek
    @comodoro
    Would someone please be so kind and point out why the following does not iterate over all .table-responsive results on the page?
    https://gist.github.com/comodoro/3b0534f4e821e9055f3cb78fa4e5513b
    Vojtěch Drábek
    @comodoro
    No need anymore... the page returns something else for non-browsers and it did not occur to me, sorry for the post.
    Arun Kumar
    @arunkumar413
    Hi All
    Can I edit an HTML file and save it using cheerio?
    Anup Dhakal
    @anuphunt
    var request = require ('request');
    var cheerio = require('cheerio');
    var fs = require ('fs');
    
    request("http://kathmandupost.ekantipur.com/news/2018-08-31/bimstec-summit-multilateral-meet-underway.html", function(error, response, body){
        if(error){
            console.log("Error: "+ error);
        }
        console.log("Status code: " + response.statusCode);
        var $ = cheerio.load(body);
    
        var title = $(this).find('h1 .title').text();
        fs.appendFileSync('ekantipur.txt',title);
    });
    What am I doing wrong in this code? I just want to get the text of the h1 tag that has the class "title".
    andreachiera
    @andreachiera
    Hi,
    I'd like to know whether all functions in the cheerio module are synchronous.
    Konstantin Bläsi
    @konstantinblaesi
    Using <h2>Headline</h2><p>Paragraph</p> as input to cheerio and running $(...).text() on it results in "HeadlineParagraph", but I'd expect a newline, or at least a space, after "Headline". Does anyone know a solution?
    Danish S
    @medanishtheboy_twitter
    Has anyone used Selenium for scraping?
    What's faster, cheerio or Selenium (with Java)?
    Vojtěch Drábek
    @comodoro
    @medanishtheboy_twitter Cheerio is faster, since Selenium first renders the whole page, including any JavaScript, and basically simulates user input. Cheerio works with the HTML source only, so it is faster but limited to what is in that source.
    Kevin Gonzalez
    @kgonzale
    Hello all
    Kevin Gonzalez
    @kgonzale
    Why does everything that is scraped end up in a single object in this code (see the p tags)?
      $("li.staff-directory-department").map((item, index) => {
        arrayInfo.push({
          id: $("h2.staff-directory-department", index)
            .text()
            .trim(),
          name: $("h3.staff-member-name", index)
            .text()
            .split("\n")
            .map(splitString => splitString.trim())
            .filter(trimString => trimString.length > 0),
          title: $("p.staff-member-title", index).text(),
          email: $("p.staff-member-email", index).text(),
          phone: $("p.staff-member-phone", index).text()
        });
      });
     title:
       'Director of Institutional Research & EffectivenessAdministrative AssistantDirector of Online Learning and Educational TechnologyAssociate Vice President of Academics and Strategic InitiativesVice President for Academics & Student Life',
    example
    Gei0r
    @Gei0r
    @kgonzale please add html input
    That is the full SO post
    Gei0r
    @Gei0r
    ok, so is your question answered or is it still open? Couldn't quite follow the SO post
    Joseph Crawford
    @jcrawford

    Hello everyone, I am trying to use cheerio in a Node.js application to parse a document into a structure that I can use to index documents with Algolia. The developer consuming this can pass in a structure such as h1 h3 p or h1 h2 h3 p, so it will be dynamic. I then want to collect the elements and create JSON objects that will be passed to Algolia for indexing. Below is a sample HTML structure, followed by the JSON objects that should be created from it.

    <h1>Article Title</h1>
    <h3>Section Title</h3>
    <p>Content</p>
    <h3>Section Two</h3>
    <p>Content 2</p>
    <h3>Section Three</h3>
    <p>Content 3</p>
    <h1>Secondary Title</h1>
    <p>Secondary Content</p>

    So for options.structure I would pass in h1 h3 p, which would then create the following structures to pass to Algolia:

    {
      "link":"path/article#article-title",
      "importance":0,
      "objectID":"path/article#article-title-816d21744be4034a4bfa3323722024b4",
      "h1":"Article Title"
    }
    
    {
      "link":"path/article#section-title",
      "importance":3,
      "objectID":"path/article#section-title-816d21744be4034a4bfa3323722024b4",
      "h1":"Article Title",
      "h3":"Section Title"
    }
    
    {
      "link":"path/article#section-title-p0",
      "importance":5,
      "objectID":"path/article#section-title-p0-ae5e3efcdddc387616bbc5b1e5b1b134",
      "h1":"Article Title",
      "h3":"Section Title",
      "p":"Content"
    }
    
    {
      "link":"path/article#section-two",
      "importance":3,
      "objectID":"path/article#section-two-816d21744be4034a4bfa3323722024b4",
      "h1":"Article Title",
      "h3":"Section Two"
    }
    
    {
      "link":"path/article#section-two-p0",
      "importance":5,
      "objectID":"path/article#section-two-p0-ae5e3efcdddc387616bbc5b1e5b1b134",
      "h1":"Article Title",
      "h3":"Section Two",
      "p":"Content 2"
    }
    
    {
      "link":"path/article#section-three",
      "importance":3,
      "objectID":"path/article#section-three-816d21744be4034a4bfa3323722024b4",
      "h1":"Article Title",
      "h3":"Section Three"
    }
    
    {
      "link":"path/article#section-three-p0",
      "importance":5,
      "objectID":"path/article-title#section-three-p0-ae5e3efcdddc387616bbc5b1e5b1b134",
      "h1":"Article Title",
      "h3":"Section Three",
      "p":"Content 3"
    }
    
    {
      "link":"path/article#secondary-title",
      "importance":0,
      "objectID":"path/article#secondary-title-816d21744be4034a4bfa3323722024b4",
      "h1":"Secondary Title"
    }
    
    {
      "link":"path/article#secondary-title-p0",
      "importance":4,
      "objectID":"path/article#secondary-title-p0-816d21744be4034a4bfa3323722024b4",
      "h1":"Secondary Title",
      "p":"Secondary Content"
    }

    Now this structure would have to be dynamic, meaning that instead of the above they could pass in h1 h2 h3 h4 p and the document should be parsed accordingly.

    There could also be content under h1 such as <h1>Title</h1><p>Content</p> and this would receive a different importance as seen in the last example above.

    The trouble I am having is how to select all paragraphs under each tag in the structure. I know I can get all p tags under h1, and that will return them all, but it will not tell me which h1 each p belongs to, etc.

    I am following this article trying to replicate this using cheerio.
    https://blog.algolia.com/how-to-build-a-helpful-search-for-technical-documentation-the-laravel-example/

    Joseph Crawford
    @jcrawford
    I guess the p could be left out of the structure, since it can occur anywhere under any other element and would have to be determined in code. I am just trying to figure out how I could end up with something like h1 > h3 > p p p, h1 > h3 > p, h1 > p p, etc. If I grab all p's under h1 there could be 10 of them, but I wouldn't know which h1 they belong to for indexing. So I would want to loop over each structure element, check for the existence of a p and note it, then check for the next structure element provided, and so on, creating an object as I go that I can then use to build these indexes. Keeping track of which element a p belongs to is required, though, and that is what I cannot figure out.
    Joseph Crawford
    @jcrawford
    I am thinking that I need to loop over each structure element passed in, then loop over each element found, and for each one grab any sibling p's and add them to a data structure. However, once I loop over elements such as h1, I am finding it tough to get their sibling p's. I cannot use nextUntil because I do not know whether the next tag will be an h1, h2, h3, etc. Is there a way in cheerio to get all sibling p's for an element while looping, without also getting the sibling p's that belong to other elements in the loop until I reach them?
    Gei0r
    @Gei0r
    Correct me if I'm wrong:
    Your problem is that semantically, the <p> is a child of <h2> which in turn is a child of <h1>, so there is a semantic parent-child relationship h1 > h2 > p
    but this relationship is not transferred to the html
    in the html, h1 and h2 and p are just some tags and neither is fundamentally "senior" to the other.
    Joseph Crawford
    @jcrawford
    @Gei0r Correct, they are all siblings; there is no parent/child relationship between h elements and p elements.
    I have never seen a parent/child relationship between h tags and p tags; I have always seen them as siblings.
    Gei0r
    @Gei0r
    yeah, you've never seen that relationship in html because html is not set up like this
    but that's also the reason why you won't get the parent->sibling structure from an html parser like cheerio
    So you will have to do it yourself. Starting at the top, read in the tags one by one and remember the ones you've already seen.
    nextUntil is not the right function for this use. You should use next
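
The suggestion above (read the tags one by one and remember the ones already seen) can be sketched roughly as follows. This is a minimal sketch in plain JavaScript, not part of the cheerio API: `buildRecords` and the `{ tag, text }` node shape are hypothetical names chosen for illustration, and the sketch assumes the sibling elements have already been flattened into a document-ordered list (with cheerio, presumably by iterating over the siblings and reading each element's tag name and text). The link, importance, and objectID fields from the question are left out.

```javascript
// Hypothetical helper: walks a flat, document-ordered list of nodes and
// attaches to each one the most recent heading seen at every higher level.
function buildRecords(nodes, structure) {
  const context = {}; // tag -> text of the most recent element at that level
  const records = [];

  for (const { tag, text } of nodes) {
    const level = structure.indexOf(tag);
    if (level === -1) continue; // tag is not part of the requested structure

    // A new element at this level invalidates whatever was remembered
    // at the same level or any deeper one.
    for (const staleTag of structure.slice(level)) {
      delete context[staleTag];
    }

    context[tag] = text;
    records.push({ ...context }); // copy, so later changes do not leak in
  }
  return records;
}

// The sample HTML from the question, flattened by hand (assumed shape).
const nodes = [
  { tag: 'h1', text: 'Article Title' },
  { tag: 'h3', text: 'Section Title' },
  { tag: 'p', text: 'Content' },
  { tag: 'h3', text: 'Section Two' },
  { tag: 'p', text: 'Content 2' },
  { tag: 'h3', text: 'Section Three' },
  { tag: 'p', text: 'Content 3' },
  { tag: 'h1', text: 'Secondary Title' },
  { tag: 'p', text: 'Secondary Content' },
];

const records = buildRecords(nodes, ['h1', 'h3', 'p']);
console.log(records);
```

Because the remembered headings are copied into every record, each p carries the headings it belongs to, and a p directly under an h1 simply has no h3 key.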
    Joseph Crawford
    @jcrawford
    @Gei0r thanks for that information
    Christopher Shelley
    @basiclaser
    Hey all, is it possible to 'next' twice?
    There is a UL that I want to get, and the only identifier is on a sibling H1. Between the H1 and the UL is a SPAN tag. I can get the SPAN tag with 'next' on the H1, but I don't know how to then get the UL after it.
    Victor Aprea
    @vicatcu
    Hi gang, I am getting runtime errors when trying to use cheerio in an Angular 7 project
    at first i was getting errors at build time about "can't find 'stream'"
    so i npm installed stream and the build-time error went away, only to be replaced by a very similar run-time error
    i didn't see anything quite like this in the issues... anyone have any ideas?
    Gei0r
    @Gei0r
    can you try to reduce your problem to a minimal working example and post it?