    zootooz
    @zootooz
    Do I have to repeat the yarn deploy steps @adampash laid out all over again, or is there some automated way to handle this? I'm finding the documentation available across the internet to be poorly focused, to say the least.
    Bryan Hackett
    @BryanHackett_twitter
    Node 8.10 is losing support on AWS after 12/31. Are there any plans to update to support a newer version?
    zootooz
    @zootooz
    @BryanHackett_twitter This Gitter space seems to have gone dark, which is a concern as I don't know where we are supposed to get information. Perhaps there is a space attached to the GitHub repository?
    singularita-zz
    @singularita-zz

    Hi, could you please help me get past these errors when running the ./preview script? (PowerShell, Win10 64-bit)

    Martin

    node ./preview https://archiweb.cz/n/domaci/v-opave-se-bude-stavet-novy-bazen-za-350-milionu-korun
    Rebuilding Mercury
    'MERCURY_TEST_BUILD' is not recognized as an internal or external command,
    operable program or batch file.
    child_process.js:649
        throw err;
        ^
    
    Error: Command failed: MERCURY_TEST_BUILD=true npm run build
    'MERCURY_TEST_BUILD' is not recognized as an internal or external command,
    operable program or batch file.
    
        at checkExecSyncError (child_process.js:610:11)
        at execSync (child_process.js:646:15)
        at Object.<anonymous> (C:\app.martin\mercury-parser\mercury-parser\preview:20:3)
        at Module._compile (internal/modules/cjs/loader.js:1139:30)
        at Object.Module._extensions..js (internal/modules/cjs/loader.js:1159:10)
        at Module.load (internal/modules/cjs/loader.js:988:32)
        at Function.Module._load (internal/modules/cjs/loader.js:896:14)
        at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:71:12)
        at internal/main/run_main_module.js:17:47 {
      status: 1,
      signal: null,
      output: [
        null,
        <Buffer >,
        <Buffer 27 4d 45 52 43 55 52 59 5f 54 45 53 54 5f 42 55 49 4c 44 27 20 69 73 20 6e 6f 74 20 72 65 63 6f 67 6e 69 7a 65 64 20 61 73 20 61 6e 20 69 6e 74 65 72 ... 59 more bytes>
      ],
      pid: 26728,
      stdout: <Buffer >,
      stderr: <Buffer 27 4d 45 52 43 55 52 59 5f 54 45 53 54 5f 42 55 49 4c 44 27 20 69 73 20 6e 6f 74 20 72 65 63 6f 67 6e 69 7a 65 64 20 61 73 20 61 6e 20 69 6e 74 65 72 ... 59 more bytes>
    }
    Adam Pash
    @adampash
    @singularita-zz I don't have a Windows machine to test on, but I'm guessing that declaring the environment variable in the command MERCURY_TEST_BUILD=true npm run build isn't supported on PowerShell? You may have to edit the preview script to play nicely with PowerShell. It assumes a *nix shell.
    @zootooz Sorry for missing this: like you suggested, you would have to re-deploy.
    @BryanHackett_twitter Apologies for the slow response. A couple of weeks ago, we updated the parser API to a newer Node version :thumbsup:
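    On the Windows failure reported above: the VAR=value command prefix is *nix shell syntax, and on Windows Node's execSync hands the command string to cmd.exe by default, which matches the "not recognized as an internal or external command" message. One possible workaround, sketched here under assumptions rather than taken from the actual preview script, is to pass the variable through the env option instead of the command string. The helper names envWith and runBuild are made up for illustration.

```javascript
const { execSync } = require('child_process');

// Hypothetical helper: copy the current environment and add extra variables.
function envWith(extra) {
  return { ...process.env, ...extra };
}

// Hypothetical replacement for `MERCURY_TEST_BUILD=true npm run build`.
// The variable travels in `env`, so no shell-specific syntax is involved.
function runBuild() {
  execSync('npm run build', {
    env: envWith({ MERCURY_TEST_BUILD: 'true' }),
    stdio: 'inherit', // show the build output in the current terminal
  });
}

// runBuild() is deliberately not called here; it is meant to be called from
// the directory that contains package.json.
```

    The cross-env package is another common way to set variables portably in package.json scripts.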
    waplay
    @waplay
    Hi, how do I make a custom extractor with the Mercury API on AWS Lambda?
    waplay
    @waplay
    @zootooz Ok, how can I then transfer this to my lambda? I use: https://github.com/postlight/mercury-parser-api
    zootooz
    @zootooz
    I'm no expert here, but assuming you've already set up your lambda/mercury aws server, I believe you just have to re-deploy the files up to AWS.
    So for me that would be yarn deploy:prod
    waplay
    @waplay
    @zootooz thank you
    Beyza
    @beyzacevik___twitter
    hey
    Dan Taylor
    @dantaylorseo
    Hey all, any advice on how to create a customExtractor to get the date on this page? https://www.90min.com/posts/afc-bournemouth-must-replace-key-stars-for-serious-promotion-push
    Matthew Krieger
    @matthewkrieger
    I get an enormous number of HTTP 502 Bad Gateway responses. Is a 502 Bad Gateway generated by the parser API when it fails to download and/or extract the web page, or is the 502 just passed from an upstream server to the parser API, which then returns it to me? In my individual testing I never have an issue browsing directly to the URLs that Mercury reports a 502 Bad Gateway for, but when they go through Mercury I get that error.
    Thomas Ladd
    @TLadd
    Is it possible to add a generic custom extractor? I want to pull out some additional info for every domain that I parse. From perusing the docs and source code I think the answer is no.
    robomantis98
    @robomantis98
    I'm having trouble with Mercury.parser(url)...
    robomantis98
    @robomantis98
    message: "The url parameter passed does not look like a valid URL. Please check your URL and try again."
    Jelv🎴
    @Jelv:matrix.org
    @robomantis98: maybe state the version you are running. There is not much activity in this room so not sure if anyone can help xD
    robomantis98
    @robomantis98
    "@postlight/mercury-parser": "^2.2.0",
    subrandom
    @subrandom

    @robomantis98 what URL are you trying to parse, and what method are you using? Syntax might be to blame. A curl call to my installation, just for example, looks like: curl -H "x-api-key: redacted-api-key" "https://teopbfzvmg.execute-api.us-east-1.amazonaws.com/prod/parser?url=https://trackchanges.postlight.com/building-awesome-cms-f034344d8ed"

    If you provide the URL to parse, I'm happy to try it against my own installation to make sure it isn't something weird about the URL itself.
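    Two things worth checking for the "does not look like a valid URL" message are whether the string is an absolute http(s) URL and, when going through an API deployment like the one in the curl call above, whether the target URL is encoded as a query parameter. A minimal sketch using Node's built-in URL class follows; isAbsoluteHttpUrl and buildParserRequest are illustrative names, and this is not Mercury's own validation logic.

```javascript
// Returns true when `value` parses as an absolute http or https URL.
function isAbsoluteHttpUrl(value) {
  try {
    const { protocol } = new URL(value);
    return protocol === 'http:' || protocol === 'https:';
  } catch (err) {
    return false; // new URL() throws on relative or malformed input
  }
}

// Builds the request URL for a parser API endpoint, percent-encoding the target.
function buildParserRequest(endpoint, target) {
  const request = new URL(endpoint);
  request.searchParams.set('url', target);
  return request.href;
}

const endpoint = 'https://teopbfzvmg.execute-api.us-east-1.amazonaws.com/prod/parser';
const target = 'https://trackchanges.postlight.com/building-awesome-cms-f034344d8ed';

console.log(isAbsoluteHttpUrl(target));
console.log(isAbsoluteHttpUrl('trackchanges.postlight.com/building-awesome-cms-f034344d8ed'));
console.log(buildParserRequest(endpoint, target));
```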

    Baadier Sydow
    @Baadier-Sydow

    I've been struggling to get mercury-parser installed on Firebase Cloud Functions. It fails when firebase-tools packages the function and attempts to deploy. Simply including mercury-parser in the package.json is enough to make it fail.

    Any ideas on how to get past this?

    miguelbelmar
    @miguelbelmar
    Hi, my name is Miguel, from Chile. I am trying to create a custom parser for the domain "www.reforma.com" in a local project, but the request always sends me back to the login page. Please test: http://www.reforma.com/reducera-oma-emissions-de-carbono-para-2025/ar2389259
    subrandom
    @subrandom
    @miguelbelmar Miguel, for me that site redirects at least once, maybe twice. Assuming you have the parser set up correctly (test it on another page), it is probably the redirects that are messing it up.
    miguelbelmar
    @miguelbelmar
    @subrandom Yes, you are right, there is an automatic redirection in the browser. How can I capture the data from the final URL?
    miguelbelmar
    @miguelbelmar

    Well, here I have the URL https://www.reforma.com/aplicacioneslibre/preacceso/articulo/default.aspx?__rval=1&urlredirect=/inmovilizan-10-mil-productos-por-incumplir-nom-051/ar2384885

    but the returned content is "<body> </body>". I'm trying to create a custom parser, but my project doesn't call my custom parser :(
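    The pre-access URL above carries an article path in its urlredirect query parameter. A small sketch for reading that path out and resolving it against the site's origin, using Node's built-in URL class, is shown below. The function name redirectTarget is made up for illustration, and whether the site serves the article at that path without a login is not established here.

```javascript
// Reads the `urlredirect` query parameter and resolves it against the page's origin.
// Returns null when the parameter is absent.
function redirectTarget(pageUrl) {
  const page = new URL(pageUrl);
  const path = page.searchParams.get('urlredirect');
  return path ? new URL(path, page.origin).href : null;
}

const preaccess =
  'https://www.reforma.com/aplicacioneslibre/preacceso/articulo/default.aspx' +
  '?__rval=1&urlredirect=/inmovilizan-10-mil-productos-por-incumplir-nom-051/ar2384885';

console.log(redirectTarget(preaccess));
```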

    subrandom
    @subrandom

    When you say you are trying to create a custom parser, do you mean that you are modifying the Mercury Parser, or creating one from scratch?

    If you are modifying the existing parser, what changes are you trying to make?

    Does the URL work in the regular Mercury Parser, or have you not tried it?

    subrandom
    @subrandom

    @miguelbelmar I tried your URL in my parser and it doesn't work. The www.reforma.com site is a mess: it's full of iframes and divs but lacks the standard website structure. What I mean is that, for example, the main text of the page is not inside a <body> tag or <article> tag. It's in a div with class "texttoMovil", which is inside a div with class "ParaDesktop".

    If you want to parse pages from this site, you'll have to look at a few pages and see if they use a consistent layout that you can extract data from.
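    As a starting point for that kind of site-specific layout, a custom extractor object could target the nested divs directly. This is only a sketch: the content selector is built from the two class names mentioned above, the title selector is a pure guess, and both would need checking against several real pages.

```javascript
// Sketch of a custom extractor object for www.reforma.com.
// All selectors are assumptions and must be verified against real pages.
const reformaExtractor = {
  domain: 'www.reforma.com',
  title: {
    selectors: ['h1'], // assumed; adjust to wherever the headline really lives
  },
  content: {
    // Main text: the "texttoMovil" div nested inside the "ParaDesktop" div.
    selectors: ['div.ParaDesktop div.texttoMovil'],
  },
};

console.log(reformaExtractor.content.selectors);
```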

    miguelbelmar
    @miguelbelmar
    Hi, I cloned the project and then followed "https://github.com/postlight/mercury-parser/tree/master/src/extractors/custom", but in the end the call to my custom extractor doesn't extract the data I want either (title and description).
    subrandom
    @subrandom

    Can you post your extractor somewhere? I can link you a folder to upload into if you want. If the software is installed correctly then the problem is in how you are parsing it. Can you parse a normal site with Mercury? Like this one: https://www.cnn.com/style/article/sr-71-blackbird-spy-plane-design/index.html

    Just to make sure that your parser is properly installed.

    miguelbelmar
    @miguelbelmar
    I get this error when I add my custom extractor with add-extractor: image.png
    subrandom
    @subrandom
    @miguelbelmar Are you sure about that syntax? I thought the custom extractors had to start with an export statement?
    miguelbelmar
    @miguelbelmar
    @subrandom This is the custom parser; I only need 4 data tests. image.png
    subrandom
    @subrandom
    @miguelbelmar Ok, so you’ve got the correct syntax in there and it’s partially working. Which tests failed? I can’t tell from that screenshot.
    And are you sure all the tests are valid/passable? It’s easy enough to write a test that would fail, regardless of the code being correct.
    miguelbelmar
    @miguelbelmar
    @subrandom It is returning the result to me as if the custom parser did not exist. I don't know how to make it call my custom parser. image.png
    miguelbelmar
    @miguelbelmar
    @subrandom it is reading addExtractor, but not applying it. image.png
    subrandom
    @subrandom
    It should choose the custom extractor automatically based on the URL. I’m willing to take a look at it if you want, but you’ll have to upload me the contents of that folder (the index.js and index.tests.js) because doing it over screenshots is too hard.
    You can upload here, if you want
    miguelbelmar
    @miguelbelmar
    @subrandom I appreciate your time friend. I uploaded the complete project in .RAR
    subrandom
    @subrandom
    Ok. I’ll take a look at it soon and get back to you
    subrandom
    @subrandom

    @miguelbelmar OK this isn't a perfect answer. I haven't really worked with the parser in a while because I have not needed it. But I was able to get the page to extract data this way:

    mercury-parser "https://www.reforma.com/aplicacioneslibre/preacceso/articulo/default.aspx?__rval=1&urlredirect=/inmovilizan-10-mil-productos-por-incumplir-nom-051/ar2384885"

    The results are definitely different than if I parse it with the default, so your parser is working. It just looks like there's some more cleanup that you can do in terms of removing the crazy amount of DIVS that this site uses. This site (reforma) is very badly formatted and designed, maybe to actively prevent clipping of this kind.

    You can use clean and transforms in your extractor to remove some of those useless divs or turn them into other things.

    It looks like the custom extractors in src/extractors/custom are only for ones that are being packaged and built-in - so your first method of creating the extractor with var customextractor was correct. Work on that one, not the second one that starts with the export command unless you are planning on submitting it for inclusion into the project.
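    To make the clean and transforms suggestion concrete, here is a hedged sketch of a content section that uses both. The class names and ids are copied from markup this site produces in parsed output; which ones are actually worth removing or converting is a guess, as is the top-level selector.

```javascript
// Sketch only: selectors are assumptions based on markup seen for this site.
const reformaWithCleanup = {
  domain: 'www.reforma.com',
  content: {
    selectors: ['div.ParaDesktop'],
    // transforms: map a selector to a new tag name, or to a function that
    // receives the matched node and returns a tag name (or null to leave it).
    transforms: {
      'div.txt_pie': 'figcaption',
      'div.img100': $node => ($node.find('img').length > 0 ? 'figure' : null),
    },
    // clean: selectors for elements to drop from the extracted content.
    clean: ['.container-ver-comentarios', '.container-comentarios'],
  },
};

console.log(Object.keys(reformaWithCleanup.content.transforms));
```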

    subrandom
    @subrandom
    You might get better results than me because every page requires a login, and since I can't read the language, I haven't signed up for an account!
    miguelbelmar
    @miguelbelmar
    @subrandom In my custom parser I extract "date_published" (from the meta tag), but the result is always NULL. From that I deduce that my custom parser is not being used and the default one is used instead. What result did you get?
    subrandom
    @subrandom
    @miguelbelmar Your selector is malformed. You are putting the selector and attribute inside the [ ] of selectors but not including the brackets for the selector pair itself. Same problem on the author field. They should look like this:
        author: {
            selectors: [
                ['meta[name="cXenseParse:ref-parentsite"]', 'value']
            ]
        },
        date_published: {
            selectors: [
                ['meta[name="cXenseParse:recs:publishtime"]', 'value']
            ]
        },
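    To illustrate why the extra brackets matter: a plain string entry in selectors is a selector whose text is taken, while a [selector, attribute] pair reads an attribute. The helper below is hypothetical and not part of Mercury; it only spells out how each entry would be read under that description. The flat form is an assumption about what the malformed version looked like, since the screenshot is not available here.

```javascript
// Hypothetical helper (not part of Mercury): describes how each entry of a
// `selectors` array would be read, per the explanation above.
function describeSelectors(selectors) {
  return selectors.map(entry =>
    Array.isArray(entry)
      ? `attribute "${entry[1]}" of ${entry[0]}`
      : `text of ${entry}`
  );
}

// Assumed malformed form: two separate string selectors.
const flat = ['meta[name="cXenseParse:recs:publishtime"]', 'value'];
// Corrected form: one [selector, attribute] pair.
const nested = [['meta[name="cXenseParse:recs:publishtime"]', 'value']];

console.log(describeSelectors(flat));
console.log(describeSelectors(nested));
```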
    miguelbelmar
    @miguelbelmar
    image.png
    @subrandom In the first command I put a space and my custom parser works. In the second it treats urlredirect as a command. Is it a bug? I'm on Windows using PowerShell; what do you use? I think that's the problem.
    subrandom
    @subrandom
    @miguelbelmar It could be something in your network setup that is preventing the redirection. I'm on macOS and I installed the parser via npm for this experiment. If I call the site with the specific extractor, I don't have the redirection issue:
    stephenbradley@bradleys-mbp ~ % mercury-parser "https://www.reforma.com/aplicacioneslibre/preacceso/articulo/default.aspx?__rval=1&urlredirect=/inmovilizan-10-mil-productos-por-incumplir-nom-051/ar2384885" --add-extractor /Users/stephenbradley/Downloads/mercury-parser/src/extractors/fixtures/www.reforma.com/index.js
    
    
    {
      "title": "Inmovilizan 10 mil productos por incumplir NOM en etiquetado",
      "content": "<div><div class=\"ParaDesktop\"><div class=\"cont_movil\" id=\"cuerpo\"><div class=\"img100\" id=\"imagenart\"><div><img alt class=\"imgH\" src=\"https://img.gruporeforma.com/imagenes/960x640/6/235/5234346.jpg\"><div class=\"txt_pie\">Los productos fueron inmovilizados debido a que incumplen con la NOM-051.<br> </div></div><div><img alt class=\"imgH\" src=\"https://img.gruporeforma.com/imagenes/960x640/6/235/5234347.jpg\"><div class=\"txt_pie\">Los productos fueron inmovilizados debido a que incumplen con la NOM-051.<br> </div></div><div><img alt class=\"imgH\" src=\"https://img.gruporeforma.com/imagenes/960x640/6/235/5234348.jpg\"><div class=\"txt_pie\">Los productos fueron inmovilizados debido a que incumplen con la NOM-051.<br> </div></div></div><div class=\"img100\" id=\"solofoto\"><center><ul><li><img alt class=\"imgH\" src=\"https://img.gruporeforma.com/imagenes/960x640/6/235/5234346.jpg\"><div class=\"txt_pie\">Los productos fueron inmovilizados debido a que incumplen con la NOM-051.<br> </div></li></ul></center></div><div class=\"container-ver-comentarios\" id=\"vercomentarios\"><div class=\"ver-comentarios\"><div class=\"bt_vercomentarios\"><div class=\"items_bt_comm\"><div class=\"ico_comm2\"></div></div></div></div></div><section class=\"container-comentarios\"></section>\n                        \n                          <div class=\"redess\" id=\"redessocialesart\">\n                        \n                      <blockquote class=\"twitter-tweet\"><p>Los productos fueron asegurados ya que presentan irregularidades en su etiquetado, tales como: omitir el sello de exceso calor&#xFFFD;as o exceso az&#xFFFD;cares, omitir leyendas de al&#xFFFD;rgenos presentar im&#xFFFD;genes interactivas en productos con sellos de advertencia o etiquetado derogado. 
<a href=\"https://t.co/1rqIshJTFV\">pic.twitter.com/1rqIshJTFV</a></p>&#x2014; COFEPRIS (@COFEPRIS) <a href=\"https://twitter.com/COFEPRIS/status/1514321491432816643?ref_src=twsrc%5Etfw\">April 13, 2022</a></blockquote> \n\n                        \n                          </div>\n                        \n                      </div></div></div>",
      "author": "reforma.com",
      "date_published": "2022-04-13T19:37:13.000Z",
      "lead_image_url": "https://reforma.com/aplicacioneslibre/compartir/ImageTransformer.jpg?wm=1&img=https://img.gruporeforma.com/imagenes/960x640/6/235/5234346.jpg&ang=0",
      "dek": null,
      "next_page_url": null,
      "url": "https://www.reforma.com/inmovilizan-10-mil-productos-por-incumplir-nom-en-etiquetado/ar2384885",
      "domain": "www.reforma.com",
      "excerpt": "Cofepris y Profeco inmovilizaron 10 mil productos de marcas como D'Gari, Oreo o Doritos por incumplir la NOM-051 en etiquetado de alimentos.",
      "word_count": 88,
      "direction": "ltr",
      "total_pages": 1,
      "rendered_pages": 1
    }
    stephenbradley@bradleys-mbp ~ %