A place to discuss using Mercury Web Parser, an open source utility that extracts content from the chaos of the web. https://github.com/postlight/mercury-parser https://github.com/postlight/mercury-parser-api
mercury-parser cli tool as described in the README here https://github.com/postlight/mercury-parser/#usage
@MastaBaba The import would happen in your JavaScript code. If you're not writing JavaScript/Node, that's not something you'd need or want.
Re the API: No, the only step to deploying to AWS using that repo is making sure your AWS credentials are set up as described in the README. Other than that, it uses the Serverless framework to provision and deploy your code to AWS. It doesn't run on a server; it runs on AWS Lambda, with API Gateway providing you with a URL to make requests.
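As a rough illustration of what "a URL to make requests" looks like in practice, here is a minimal sketch that builds the request URL for a deployed endpoint. It assumes the API exposes a `parser` path that takes a `url` query parameter; the base URL in the usage comment is a made-up placeholder, and `buildParserUrl` is a hypothetical helper, not part of the repo.

```javascript
// Builds the request URL for a deployed parser endpoint.
// Assumption: the endpoint is "<base>/parser?url=<article url>".
function buildParserUrl(baseUrl, articleUrl) {
  // Make sure the base ends with a slash so "parser" is appended, not substituted
  const base = baseUrl.endsWith('/') ? baseUrl : baseUrl + '/';
  const endpoint = new URL('parser', base);
  // searchParams.set percent-encodes the article URL for us
  endpoint.searchParams.set('url', articleUrl);
  return endpoint.toString();
}

// Usage (placeholder base URL; use the one Serverless prints after deploying):
// const requestUrl = buildParserUrl(
//   'https://abc123.execute-api.us-east-1.amazonaws.com/dev',
//   'https://en.wikipedia.org/wiki/John_von_Neumann'
// );
```

Letting `URL` and `searchParams` do the encoding avoids broken requests when the article URL itself contains `?` or `&`.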
I finally got this to work:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "iam:GetPolicyVersion",
        "logs:*",
        "sns:Unsubscribe",
        "dynamodb:*",
        "iam:CreateRole",
        "cloudformation:DescribeStackResource",
        "xray:PutTraceSegments",
        "iot:DescribeEndpoint",
        "kinesis:ListStreams",
        "cognito-sync:SetCognitoEvents",
        "cloudformation:DescribeStackEvents",
        "iam:ListAttachedRolePolicies",
        "cloudformation:UpdateStack",
        "sns:Subscribe",
        "cloudformation:DescribeChangeSet",
        "iam:ListRolePolicies",
        "events:*",
        "cloudformation:ListStackResources",
        "iam:GetRole",
        "sns:ListSubscriptionsByTopic",
        "iam:GetPolicy",
        "iot:GetTopicRule",
        "sqs:SendMessage",
        "cloudformation:DescribeStacks",
        "iot:CreateTopicRule",
        "kinesis:PutRecord",
        "cloudwatch:*",
        "iot:ListPolicies",
        "ec2:DescribeSubnets",
        "iot:ListThings",
        "iam:GetRolePolicy",
        "cloudformation:ValidateTemplate",
        "iot:ReplaceTopicRule",
        "tag:GetResources",
        "xray:PutTelemetryRecords",
        "iot:AttachThingPrincipal",
        "cognito-identity:ListIdentityPools",
        "sns:ListTopics",
        "iot:CreatePolicy",
        "iam:CreateUser",
        "iam:PassRole",
        "sns:Publish",
        "cognito-sync:GetCognitoEvents",
        "iot:CreateKeysAndCertificate",
        "cloudformation:ListStacks",
        "sqs:ListQueues",
        "iot:ListTopicRules",
        "iot:CreateThing",
        "s3:*",
        "iot:AttachPrincipalPolicy",
        "iam:ListRoles",
        "kinesis:DescribeStream",
        "sns:ListSubscriptions",
        "ec2:DescribeSecurityGroups",
        "cloudformation:CreateStack",
        "ec2:DescribeVpcs",
        "kms:ListAliases",
        "lambda:*"
      ],
      "Resource": "*"
    },
    {
      "Sid": "VisualEditor1",
      "Effect": "Allow",
      "Action": "apigateway:*",
      "Resource": "arn:aws:apigateway:*::*"
    }
  ]
}
But it feels like there are probably more permissions than needed.
@kinngh That's correct. If you don't want to keep your function running, you can remove the scheduled events by commenting out these lines:
https://github.com/postlight/mercury-parser-api/blob/master/serverless.yml#L43-L45
And these:
https://github.com/postlight/mercury-parser-api/blob/master/serverless.yml#L57-L59
import Mercury from "./mercuryFiles/src/mercury.js"
and then I can do something like Mercury.parse(url).then(result => console.log(result))?
Hi there!
I found the information on the GitHub page to be a little incomplete. Although I'm a full-time IT professional and part-time developer, it was not obvious to me how to proceed with the Mercury installation on AWS Lambda.
Making the installation work requires some software that is not installed on any OS by default.
So, I made a note with the instructions that I used to create a working API on Amazon. Feel free to share and to get in touch if you have questions or need assistance.
There are two links, so you can choose your poison.
Evernote
https://www.evernote.com/l/AAPoJR49OThHu5lBZLt8b1fyKomly8gRz6Q
Notion
npm install @postlight/mercury-parser. Then inside one of the folders in @postlight/mercury-parser is a Mercury.web.js file. Add a line at the end: export default Mercury. Afterwards, I could just import it regularly into my extension and use it without any issues.
I'm using the mercury-parser in node, and it's working beautifully:
const Mercury = require('@postlight/mercury-parser');
const url = 'https://en.wikipedia.org/wiki/John_von_Neumann';
Mercury.parse(url).then(result => { console.log(result); } );
but it is unclear to me how to use the custom extractors. There is one for Wikipedia here, but I don't see instructions for how to incorporate it into the above code.
Mercury.parse()? If so, that doesn't seem to be the case for me. In the example I posted above, the wikipedia.org parser doesn't seem to be executed, just the default Mercury parser. Is there some way to ensure that a custom parser is run that I'm missing?
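One way to explore the question above is to register an extractor explicitly and see whether the result changes. The sketch below assumes the installed version of the parser exposes an `addExtractor` method, which may not be true for older releases. `exampleExtractor` and `registerExtractor` are hypothetical names, and the selectors are illustrative guesses rather than anything taken from the Mercury repository.

```javascript
// Hypothetical custom extractor: a domain plus CSS selectors per field.
// The selectors here are placeholders, not the real Wikipedia extractor.
const exampleExtractor = {
  domain: 'example.com',
  title: { selectors: ['h1'] },
  content: { selectors: ['article'] },
};

// Registers the extractor only if the parser object has an addExtractor
// function (an assumption about the installed version).
// Returns true when registration was attempted, false otherwise.
function registerExtractor(parser, extractor) {
  if (parser && typeof parser.addExtractor === 'function') {
    parser.addExtractor(extractor);
    return true;
  }
  return false;
}

// Usage (assumes @postlight/mercury-parser is installed):
// const Mercury = require('@postlight/mercury-parser');
// if (!registerExtractor(Mercury, exampleExtractor)) {
//   console.log('This version does not expose addExtractor');
// }
// Mercury.parse('https://example.com/post').then(result => console.log(result.title));
```

Checking for the method first keeps the script from throwing on versions that lack it, which also tells you whether an upgrade is needed.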
Hello, I can't seem to pass the headers option through to the request successfully. I have it implemented as in the examples:
Mercury.parse(url, {
headers: {
dnt: '1',
cookie: '__cfduid=vqwo24522env9832hgo23gh3g23gkewe',
'accept-language': 'en-US,en;q=0.9,tr;q=0.8',
'accept-encoding': 'gzip, deflate, br',
accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
'upgrade-insecure-requests': '1',
'cache-control': 'max-age=0,no-cache',
authority: 'medium.com'
},
contentType: 'markdown',
}).then(result => console.log(result));
Am I missing something? In the meantime, I have edited the source to fix it for myself:
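function parse(_x) {
  // Guard against calls without an options argument before reading headers
  if (arguments[1] && arguments[1].headers) {
    // Override the module-level default headers with the caller's headers
    REQUEST_HEADERS = arguments[1].headers;
  }
  return _parse.apply(this, arguments);
}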
– mercury.js:6537
If this is an actual issue, I will submit a PR.
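To check whether the headers option actually reaches the outgoing request, one option is to point the parser at a small local server that logs whatever headers it receives. This is only a debugging sketch using Node's built-in `http` module; `logHeaders`, `startEchoServer`, and port 3999 are hypothetical choices, not part of Mercury.

```javascript
const http = require('http');

// Request handler that logs the incoming request headers and replies
// with a tiny HTML page so the client has something to parse.
function logHeaders(req, res, log = console.log) {
  log(JSON.stringify(req.headers, null, 2));
  res.writeHead(200, { 'Content-Type': 'text/html' });
  res.end('<html><body><article><p>ok</p></article></body></html>');
}

// Starts the logging server; the port is an arbitrary placeholder.
function startEchoServer(port = 3999) {
  return http.createServer((req, res) => logHeaders(req, res)).listen(port);
}

// Usage (assumes @postlight/mercury-parser is installed):
// const server = startEchoServer();
// const Mercury = require('@postlight/mercury-parser');
// Mercury.parse('http://localhost:3999/', { headers: { dnt: '1' } })
//   .finally(() => server.close());
```

If `dnt` does not show up in the logged output, the headers are being dropped before the request is made.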