    Vojtěch Drábek
    @comodoro
    @medanishtheboy_twitter Cheerio is faster because Selenium first renders the whole page, including anything produced by JavaScript. It basically simulates user input. Cheerio works with the HTML source only, so it is faster, but limited to what that source contains
    Kevin Gonzalez
    @kgonzale
    Hello all
    Kevin Gonzalez
    @kgonzale
    Why is everything scraped from the p tags lumped into a single object in this code?
      $("li.staff-directory-department").map((item, index) => {
        arrayInfo.push({
          id: $("h2.staff-directory-department", index)
            .text()
            .trim(),
          name: $("h3.staff-member-name", index)
            .text()
            .split("\n")
            .map(splitString => splitString.trim())
            .filter(trimString => trimString.length > 0),
          title: $("p.staff-member-title", index).text(),
          email: $("p.staff-member-email", index).text(),
          phone: $("p.staff-member-phone", index).text()
        });
      });
     title:
       'Director of Institutional Research & EffectivenessAdministrative AssistantDirector of Online Learning and Educational TechnologyAssociate Vice President of Academics and Strategic InitiativesVice President for Academics & Student Life',
    example
    Gei0r
    @Gei0r
    @kgonzale please add the HTML input
    That is the full SO post
    Gei0r
    @Gei0r
    ok, so is your question answered or is it still open? Couldn't quite follow the SO post
    Joseph Crawford
    @jcrawford

    Hello everyone, I am trying to use cheerio in a Node.js application to parse a document into a structure that I can use to index documents with Algolia. I am allowing the developer consuming this to pass in a structure such as h1 h3 p, or h1 h2 h3 p, but this will be dynamic. I am then trying to get the elements and create JSON objects which will be passed to Algolia for indexing. Below is a sample HTML structure and how the JSON objects are created

    <h1>Article Title</h1>
    <h3>Section Title</h3>
    <p>Content</p>
    <h3>Section Two</h3>
    <p>Content 2</p>
    <h3>Section Three</h3>
    <p>Content 3</p>
    <h1>Secondary Title</h1>
    <p>Secondary Content</p>

    so for the options.structure I would pass in h1 h3 p, this would then create the following structures to pass to algolia

    {
      "link":"path/article#article-title",
      "importance":0,
      "objectID":"path/article#article-title-816d21744be4034a4bfa3323722024b4",
      "h1":"Article Title"
    }
    
    {
      "link":"path/article#section-title",
      "importance":3,
      "objectID":"path/article#section-title-816d21744be4034a4bfa3323722024b4",
      "h1":"Article Title",
      "h3":"Section Title"
    }
    
    {
      "link":"path/article#section-title-p0",
      "importance":5,
      "objectID":"path/article#section-title-p0-ae5e3efcdddc387616bbc5b1e5b1b134",
      "h1":"Article Title",
      "h3":"Section Title",
      "p":"Content"
    }
    
    {
      "link":"path/article#section-two",
      "importance":3,
      "objectID":"path/article#section-two-816d21744be4034a4bfa3323722024b4",
      "h1":"Article Title",
      "h3":"Section Two"
    }

    {
      "link":"path/article#section-two-p0",
      "importance":5,
      "objectID":"path/article#section-two-p0-ae5e3efcdddc387616bbc5b1e5b1b134",
      "h1":"Article Title",
      "h3":"Section Two",
      "p":"Content 2"
    }

    {
      "link":"path/article#section-three",
      "importance":3,
      "objectID":"path/article#section-three-816d21744be4034a4bfa3323722024b4",
      "h1":"Article Title",
      "h3":"Section Three"
    }

    {
      "link":"path/article#section-three-p0",
      "importance":5,
      "objectID":"path/article-title#section-three-p0-ae5e3efcdddc387616bbc5b1e5b1b134",
      "h1":"Article Title",
      "h3":"Section Three",
      "p":"Content 3"
    }
    
    {
      "link":"path/article#secondary-title",
      "importance":0,
      "objectID":"path/article#secondary-title-816d21744be4034a4bfa3323722024b4",
      "h1":"Secondary Title"
    }

    {
      "link":"path/article#secondary-title-p0",
      "importance":4,
      "objectID":"path/article#secondary-title-p0-816d21744be4034a4bfa3323722024b4",
      "h1":"Secondary Title",
      "p":"Secondary Content"
    }

    Now this structure would have to be dynamic meaning rather than the above they could pass in h1 h2 h3 h4 p and it should be parsed accordingly.

    There could also be content under h1 such as <h1>Title</h1><p>Content</p> and this would receive a different importance as seen in the last example above.

    The trouble I am having is how can I select all paragraphs under each tag in the structure. I know I can get all p tags under h1 and that will return them all but it will not let me know which h1 the p’s belong to, etc.

    I am following this article trying to replicate this using cheerio.
    https://blog.algolia.com/how-to-build-a-helpful-search-for-technical-documentation-the-laravel-example/

    Joseph Crawford
    @jcrawford
    I guess the p could be left out of the structure, since it could occur anywhere under any other element and would have to be determined in code. I am just trying to figure out how I could end up with something like h1 > h3 > p p p, h1 > h3 > p, h1 > p p, etc. If I grab all p's under h1 it could be 10 p's, but I wouldn't know which h1 they belong to for indexing. So I would want to loop over each structure element, check for the existence of p and note it, then check for the next structure element provided, etc., and create an object based on this. Then I can use that to create these indexes. Keeping track of which element a p belongs to is required though, and that is what I cannot figure out.
    Joseph Crawford
    @jcrawford
    I am thinking that I need to loop over each structure element passed in, then loop over each element found and, for each, grab any sibling p's and add them to a data structure. However, once I loop over the elements such as h1, I am finding it tough to get any sibling p's. I cannot use nextUntil because I do not know if the next tag would be an h1, h2, h3, etc. Is there a way in cheerio to get all sibling p's for an element while looping over them, and not get sibling p's for other elements in the loop until I loop over them?
    Gei0r
    @Gei0r
    Correct me if I'm wrong:
    Your problem is that semantically, the <p> is a child of <h2> which in turn is a child of <h1>, so there is a semantic parent-child relationship h1 > h2 > p
    but this relationship is not transferred to the html
    in the html, h1 and h2 and p are just some tags and neither is fundamentally "senior" to the other.
    Joseph Crawford
    @jcrawford
    @Gei0r correct they are all siblings there is no parent/child with h elements and p elements
    I have never seen a parent child relationship with h tags and p tags always seen them as siblings
    Gei0r
    @Gei0r
    yeah, you've never seen that relationship in html because html is not set up like this
    but that's also the reason why you won't get the parent->sibling structure from an html parser like cheerio
    So you will have to do it yourself. Starting at the top, read in the tags one by one and remember the ones you've already seen
    nextUntil is not the right function for this use. You should use next
    Joseph Crawford
    @jcrawford
    @Gei0r thanks for that information
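
The tag-by-tag walk suggested above can be sketched in plain JavaScript. This is only an illustration under stated assumptions: `buildRecords` is a hypothetical helper, the flat `sample` list stands in for whatever you collect while stepping through the sibling elements in document order, and the `link`, `importance` and `objectID` fields from the question are left out. The idea is to remember the most recent heading at each level and forget the less senior ones whenever a more senior heading appears.

```javascript
// headings: heading tags in order of seniority, e.g. ["h1", "h3"].
// nodes: flat list of { tag, text } in document order.
function buildRecords(nodes, headings) {
  const context = {}; // most recent text seen for each heading level
  const records = [];

  for (const { tag, text } of nodes) {
    const level = headings.indexOf(tag);
    if (level !== -1) {
      // A new heading invalidates every less senior heading remembered so far.
      for (const lower of headings.slice(level + 1)) {
        delete context[lower];
      }
      context[tag] = text;
      records.push({ ...context });
    } else if (tag === "p") {
      // A paragraph belongs to whatever headings are currently remembered.
      records.push({ ...context, p: text });
    }
  }
  return records;
}

// The sample document from the question, flattened into sibling order.
const sample = [
  { tag: "h1", text: "Article Title" },
  { tag: "h3", text: "Section Title" },
  { tag: "p", text: "Content" },
  { tag: "h3", text: "Section Two" },
  { tag: "p", text: "Content 2" },
  { tag: "h3", text: "Section Three" },
  { tag: "p", text: "Content 3" },
  { tag: "h1", text: "Secondary Title" },
  { tag: "p", text: "Secondary Content" },
];

const records = buildRecords(sample, ["h1", "h3"]);
console.log(records);
```

Because the heading list is a parameter, passing `["h1", "h2", "h3", "h4"]` instead needs no code changes, which is the dynamic behaviour asked for.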
    Christopher Shelley
    @basiclaser
    hey all is it possible to 'next' twice?
    there is a UL that I want to get. The only identifier is on a sibling H1, and between the H1 and the UL is a SPAN tag. I can get the SPAN tag with 'next' on the H1 but don't know how to then get the UL after it
    Victor Aprea
    @vicatcu
    Hi gang, I'm getting runtime errors when trying to use cheerio in an Angular 7 project
    at first i was getting errors at build time about "can't find 'stream'"
    so i npm installed stream and the build-time error went away, only to be replaced by a very similar run-time error
    i didn't see anything quite like this in the issues... anyone have any ideas?
    Gei0r
    @Gei0r
    can you try to reduce your problem to a minimal working example and post it?
    Dj-jom2x
    @Dj-jom2x
    you use cheerio at client side? to do crawling?
    Ashish117
    @Ashish117
    Hey, is anyone here? Can anyone help me a bit?
    mikku
    @mikku_g_twitter
    Hi
    I am unable to fetch <script> tag content using cheerio. Am I missing any settings/config?
    Damian Toczek
    @damiantoczek

    Hello guys, I've just started using CheerioJS and would like to know how to scrape a website. $("#bproducts div.p div.pname") will return an object with all the product names, but I want to get the div.pprice next to the div.pname.

    I was trying to make two arrays like:

    const pName = $("#bproducts div.p div.pname");
    const pPrice = $("#bproducts div.p div.pprice");

    and then glue it together into one object as key:value.

    What I want is to get a key:value pair like pname:pprice

    Fausto A. Guerrero
    @enjoythelive1
    Even if you are using cheerio, this is not the place to ask questions that do not require specific cheerio knowledge. I would suggest you ask the question on Stack Overflow
    Damian Toczek
    @damiantoczek
    This is exactly the place to ask this question because there is no way to understand what the hell cheerio is doing. It says that children is an array, but it's an array that contains those weird objects that make no sense. I've now spent about 8h on cheerio and I'm still not able to get the children and create the product array. People say that CheerioJS is amazing, but there are no tutorials, and the documentation is very poor and doesn't explain how the objects represent the HTML elements.
    hackerunet
    @hackerunet

    hi there, dear community, I have a question regarding some scraping issues I'm having. I'm trying to get the text of an 'a' tag, and each time I use $(element).text() I receive an empty string. Looking at the HTML I can see everything there, but there are also Angular elements. The scraping result for my a tag looks like this:

    { '0':
       { type: 'tag',
         name: 'a',
         namespace: 'http://www.w3.org/1999/xhtml',
         attribs:
          { 'ng-href': '{{treatPathProduct(product.Characteristics[\'Path\'])}}',
            'ng-bind-html': 'product.Characteristics[\'##ProductLabel\']' },
         'x-attribsNamespace': { 'ng-href': undefined, 'ng-bind-html': undefined },
         'x-attribsPrefix': { 'ng-href': undefined, 'ng-bind-html': undefined },
         children: [],
         parent:
          { type: 'tag',
            name: 'h2',
            namespace: 'http://www.w3.org/1999/xhtml',
            attribs: {},
            'x-attribsNamespace': {},
            'x-attribsPrefix': {},
            children: [Array],
            parent: [Object],
            prev: [Object],
            next: [Object] },
         prev: null,
         next: null },
      options:
       { withDomLvl1: true,
         normalizeWhitespace: false,
         xml: false,
         decodeEntities: true },
      length: 1 }

    My question is, how can I get the text and the href attribute? Both are empty, or seem to be

    radi001
    @radi001
    Hello, could you please help me get a child node's index in cheerio? For example, I have one main DIV. Under that main DIV I get multiple child DIVs based on database values. I want to find a child DIV by id/class and get its position (index) under the main DIV, for example whether it is in position 2 or 3. If anyone has an idea, could you please help me with this?
    chivomerezi
    @chivomerezi
    Hi there! I'm new to cheerio and I just want to know if there is a fast way to make a JSON file
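
Cheerio itself does not write files, but once the scraped values are in a plain array or object, Node's built-in `fs` module and `JSON.stringify` are enough. A minimal sketch, where the `scraped` array and the output path are placeholders for illustration:

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Stand-in for whatever records the scraping step produced.
const scraped = [
  { name: "Widget", price: "9.99" },
  { name: "Gadget", price: "19.50" },
];

// Hypothetical output location; any writable path works.
const outFile = path.join(os.tmpdir(), "scraped.json");

// The third argument to JSON.stringify pretty-prints with two-space indentation.
fs.writeFileSync(outFile, JSON.stringify(scraped, null, 2));

console.log("wrote " + outFile);
```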
    Rajat
    @rajataudichya
    Hello
    Anyone here
    I have been using cheerio to extract data, which I am able to get in the terminal, but when I use the same code in the browser I get a cross-origin error
    I want to understand why is it that I am getting the value in the terminal but not in the browser
    No one here?
    MikeBison
    @MikeBison
    hello guys, do you have any spider community website?
    Hampton
    @hamptonmoore
    Hello, is there a way to use cheerio as an ES module?
    so that way I can use it from Deno