slashbit/spider-less
Web spider as a service, spider on serverless, the engine behind kmppp.com
repo name | slashbit/spider-less |
repo link | https://github.com/slashbit/spider-less |
homepage | https://kmppp.com |
language | JavaScript |
size (curr.) | 1182 kB |
stars (curr.) | 148 |
created | 2018-10-10 |
license | MIT License |
spider-less
Web spider on Serverless!
About Spiderless
Spiderless is the backend layer of KMPPP, a web-spider-as-a-service application that lets you monitor nearly anything on the web and get notified when it changes. It is built on top of these technologies:
Technology | Used For |
---|---|
Bulma, Buefy | UI |
Vue.js | Front-end logic |
AWS S3 | Website hosting |
AWS Lambda | Backend API |
AWS SNS | Message queue |
AWS DynamoDB | Database |
AWS API Gateway | API gateway |
AWS CloudFront | CDN |
AWS Route 53 | DNS |
Architecture
API Endpoints
GET
subscriptions
Description
Get a list of subscriptions. The response holds at most 1 MB of data, which is the DynamoDB limit for a single Scan or Query call.
Parameters
None
Request
curl /api/subscriptions
Response
[
{
"createdAt": 1544833435070,
"targets": [
{
"selector":"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span",
"label":"ratingCount"
}
],
"id": "b4d98de0-ffff-11e8-a4c9-9b9ee9089058",
"url": "https://www.imdb.com/title/tt0111161/",
"interval": 60
}
]
POST
subscriptions
Description
Create a new subscription to feed the spider.
Parameters
- url (required) - Target website URL
- targets (required) - List of objects, each with a label and a CSS selector whose text content is extracted
- interval (required) - The interval (in minutes) between scrapes
Request
curl -X POST /api/subscriptions -d '{"url":"https://www.imdb.com/title/tt0111161/","targets":"[{\"label\":\"ratingCount\",\"selector\":\"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span\"}]","interval":"60"}' -H "Content-Type: application/json"
Response
{
"id": "ef417d30-ffff-11e8-a4c9-9b9ee9089058",
"url": "https://www.imdb.com/title/tt0111161/",
"targets": [
{
"label":"ratingCount",
"selector":"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span"
}
],
"interval": 60,
"createdAt": 1544833533059,
"updatedAt": 1544833533059
}
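The same request can be sketched in JavaScript. This is a minimal sketch, not the project's client code: `buildCreateSubscriptionRequest` and the `baseUrl` argument are hypothetical, since the README only shows the relative path `/api/subscriptions`. The payload mirrors the curl example above, which sends `targets` as a JSON-encoded string and `interval` as a string.

```javascript
// Hypothetical helper: builds the arguments for fetch() to create a subscription.
// baseUrl is an assumption; the README only gives the relative path.
function buildCreateSubscriptionRequest(baseUrl, { url, targets, interval }) {
  if (!url || !Array.isArray(targets) || targets.length === 0 || !interval) {
    throw new Error("url, targets and interval are all required");
  }
  return {
    endpoint: `${baseUrl}/api/subscriptions`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // Mirrors the curl payload: targets is a JSON string, interval is a string.
      body: JSON.stringify({
        url,
        targets: JSON.stringify(targets),
        interval: String(interval),
      }),
    },
  };
}

const request = buildCreateSubscriptionRequest("https://example.invalid", {
  url: "https://www.imdb.com/title/tt0111161/",
  targets: [{ label: "ratingCount", selector: "div.imdbRating > a > span" }],
  interval: 60,
});
console.log(request.endpoint);
```

To send it from Node 18+ or a browser, pass both parts to `fetch(request.endpoint, request.options)`.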
DELETE
subscriptions
Description
Delete a subscription.
Parameters
- id (required) - Subscription id
Request
curl -X DELETE /api/subscriptions/:id
Response
{
"id": "d72c05d0-ffff-11e8-a4c9-9b9ee9089058"
}
Functions List
scrape
Description
Scrape target websites and extract target contents.
Invoke
yarn invoke:local scrape -d '{"createdAt":1544833435070,"updatedAt":1544833435070,"targets":[{"selector":"#title-overview-widget > div.vital > div.title_block > div > div.ratings_wrapper > div.imdbRating > a > span","label":"ratingCount"}],"id":"b4d98de0-ffff-11e8-a4c9-9b9ee9089058","url":"https://www.imdb.com/title/tt0111161/","interval":60}'
Response
[
{
"label": "ratingCount",
"content": "2,025,796"
}
]
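The mapping from a subscription's targets to this response shape can be sketched as follows. This is an illustration of the input/output contract only, not the actual scrape function: `extractTargets` and `queryText` are hypothetical names, and `queryText` stands in for whatever HTML library performs the real selector lookup.

```javascript
// Hypothetical sketch: turn a list of targets into [{ label, content }].
// queryText(selector) is assumed to return the matched element's text, or undefined.
function extractTargets(targets, queryText) {
  return targets.map(({ label, selector }) => ({
    label,
    content: String(queryText(selector) ?? "").trim(),
  }));
}

// A lookup table stands in for a parsed page in this example.
const fakePage = new Map([["span.rating", " 2,025,796 "]]);
const result = extractTargets(
  [{ label: "ratingCount", selector: "span.rating" }],
  (selector) => fakePage.get(selector)
);
console.log(JSON.stringify(result));
// prints [{"label":"ratingCount","content":"2,025,796"}]
```

A selector that matches nothing yields an empty `content` string in this sketch; how the real function reports a miss is not stated here.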
cron
Description
Fetch subscriptions from the database and select the ones that are due to be executed.
Invoke
yarn invoke:local cron
Response
None
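One way the "due" check could work is sketched below. The rule is an assumption, not taken from the source: it supposes that cron runs once per minute and that a subscription is due whenever a whole number of intervals has elapsed since `createdAt`. The names `isDue` and `selectDue` are hypothetical.

```javascript
const MINUTE_MS = 60 * 1000;

// Assumed rule: due when the whole minutes elapsed since createdAt
// are a positive multiple of the subscription's interval (in minutes).
function isDue(subscription, nowMs) {
  const elapsedMinutes = Math.floor((nowMs - subscription.createdAt) / MINUTE_MS);
  return elapsedMinutes > 0 && elapsedMinutes % subscription.interval === 0;
}

// Keep only the subscriptions that should be scraped on this run.
function selectDue(subscriptions, nowMs) {
  return subscriptions.filter((subscription) => isDue(subscription, nowMs));
}

const subscriptions = [
  { id: "a", createdAt: 0, interval: 60 },
  { id: "b", createdAt: 0, interval: 45 },
];
const due = selectDue(subscriptions, 120 * MINUTE_MS);
console.log(due.map((subscription) => subscription.id));
// prints [ 'a' ]
```

A rule based on a stored last-run timestamp would be more robust to missed runs, but the subscription records shown above carry only `createdAt` and `updatedAt`.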
Development
# install dependencies
yarn install
# start api server on port 8090
yarn start
# invoke function locally
yarn invoke:local function_name
# invoke remote function
yarn invoke function_name
Deploy
# first set up your AWS credentials: https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html
yarn deploy