    Tom Canac
    @tmos
    hi,
    I'm looking for a crawler capable of interpreting front-end JavaScript (to crawl AJAX content). I'm not sure this can be achieved with node-crawler yet, but the zombie integration may solve this. Do you have any news about this? Is the project still developed?
    Mike Chen
    @mike442144
    I commented on the issue you mentioned.
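    A side note on the question above: node-crawler parses static HTML with cheerio and does not execute page scripts, so AJAX-rendered content normally needs a headless browser. A minimal sketch using Puppeteer (a separate library, assumed here purely for illustration, not a node-crawler feature):

    const puppeteer = require('puppeteer');

    // Render the page in a headless browser so client-side
    // JavaScript (AJAX content) runs before the HTML is read
    async function fetchRendered(url) {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: 'networkidle0' });
        const html = await page.content();
        await browser.close();
        return html;
    }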
    Abdul Diaz
    @diazabdulm
    Hey I have a question
    Mike Chen
    @mike442144
    Wow, can't believe I haven't logged in to Gitter for half a year
    Orlin Bobchev
    @Bobchev
    Thank you all for your work :)
    Mike Chen
    @mike442144
    You're welcome
    Fábio Ap. Oliveira Silva
    @TcheORfabio
    hi guys
    can you help me with a question about the crawler?
    Fábio Ap. Oliveira Silva
    @TcheORfabio
    const Crawler = require('crawler');
    const debug = require('debug')('amazonScraper:crawler');
    // res.$ is a cheerio-loaded document, so no separate cheerio require is needed

    /**
     * Returns the URI for the scraper
     * @param {String} asin
     */
    const URI = (asin) => `https://www.amazon.com/gp/video/detail/${asin}`;

    /**
     * Scraper class
     *
     * Class with the methods scrapMovieById and scrapShowById,
     * which look up movies or series by ASIN and return the
     * data for the given movie or series
     */
    class Scraper {
        /**
         * Scrapes the page
         * @param {String} asin
         */
        scrapMovieById(asin) {
            const self = this;
            return new Promise(function(resolve, reject) {
                const crawler = new Crawler({
                    rateLimit: 500,
                    retries: 3,
                });
                crawler.direct({
                    uri: URI(asin),
                    callback: function(error, res) {
                        if (error) {
                            // return so we don't keep parsing after a failure
                            return reject(error);
                        }
                        const $ = res.$;
                        const movie = self.parseMovieData($);
                        movie.program.asin = asin;
                        debug('Movie inside Crawler:scrapById: %O', movie);
                        resolve(movie);
                    }
                });
            });
        }

        parseMovieData($) {
            const description = $('div[data-automation-id="synopsis"]').text();
            const releaseYear = $('span[data-automation-id="release-year-badge"]').text();
            // cheerio's .map() returns a cheerio object, so .get() converts it to a
            // plain array; cheerio also has no $.trim helper, so use String#trim
            const genres = $('dt[data-automation-id="meta-info-genres"]').next().children('a')
                .map(function() {
                    return $(this).text().trim();
                }).get();
            const title = $('h1[data-automation-id="title"]').text();
            const images = $('div.av-fallback-packshot > img')
                .map(function() {
                    return ($(this).attr('src') || '').trim();
                }).get();
            const keywords = $('meta[name="keywords"]').attr('content');
            const cast = $('th:contains("Starring"), th:contains("Supporting actors")')
                .next().text().split(',').map((name) => name.trim());
            const duration = parseInt($('div.av-badges > span').eq(2).text().split(' ')[0], 10);
            const movie = {
                program: {
                    description,
                    releaseYear,
                    genres,
                    title,
                    images,
                    keywords,
                    cast,
                    duration,
                },
            };
            debug('Movie in crawler:parseMovieData: %O', movie);
            return movie;
        }
    }

    module.exports = Scraper;
    the method scrapMovieById isn't returning the parsed info correctly. Does anyone have an idea? Am I using it correctly?
    Mike Chen
    @mike442144
    please do not use direct
    use queue instead
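    For reference, a minimal sketch of the same lookup rewritten with queue instead of direct, assuming the selectors from the snippet above still apply; a queue callback receives a third argument, done, which must be called to release the connection slot:

    const Crawler = require('crawler');

    const crawler = new Crawler({
        rateLimit: 500,
        retries: 3,
    });

    function scrapMovieById(asin) {
        return new Promise((resolve, reject) => {
            crawler.queue({
                uri: `https://www.amazon.com/gp/video/detail/${asin}`,
                callback: (error, res, done) => {
                    if (error) {
                        done();
                        return reject(error);
                    }
                    const $ = res.$;
                    // parse with the same selectors as in parseMovieData above
                    resolve($('h1[data-automation-id="title"]').text());
                    done();
                },
            });
        });
    }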
    spikespiegel5112
    @spikespiegel5112
    hello??
    is anyone here?
    Mike Chen
    @mike442144
    ?
    what's the matter
    Douglas Ferguson
    @thedug
    Howdy, I'm doing a simple test, and crawler seems to stop on the first page I give it; I must be missing something really simple. Is there a toggle for it to recurse? Or do I need to enable JavaScript execution or something?
    I'm using a simple instance of Crawler() with a callback that just prints the URL to the console: console.log(res.options.uri + " " + $("title").text())
    and then I have c.queue('http://www.amazon.com');
    And it stops after printing out http://www.amazon.com
    Mike Chen
    @mike442144
    @thedug what else do you expect?
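    For the record, node-crawler does not follow links on its own; recursion has to be implemented by hand, e.g. by extracting anchors from each crawled page and queueing them. A minimal sketch (the seen set and URL resolution are illustrative additions, not crawler options):

    const Crawler = require('crawler');
    const { URL } = require('url');

    const seen = new Set(['http://www.amazon.com/']);

    const c = new Crawler({
        maxConnections: 10,
        callback: (error, res, done) => {
            if (error) { return done(); }
            const $ = res.$;
            console.log(res.options.uri + ' ' + $('title').text());
            // node-crawler does not recurse: extract links and queue them manually
            $('a[href]').each(function() {
                try {
                    const link = new URL($(this).attr('href'), res.options.uri).href;
                    if (!seen.has(link)) {
                        seen.add(link);
                        c.queue(link);
                    }
                } catch (e) { /* ignore malformed hrefs */ }
            });
            done();
        },
    });

    c.queue('http://www.amazon.com/');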