gaojiuli/gain
Web crawling framework based on asyncio.
| field | value |
| --- | --- |
| repo name | gaojiuli/gain |
| repo link | https://github.com/gaojiuli/gain |
| homepage | |
| language | Python |
| size (curr.) | 219 kB |
| stars (curr.) | 1961 |
| created | 2017-05-31 |
| license | GNU General Public License v3.0 |
Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp.
Requirements
- Python 3.5+
Installation
```bash
pip install gain
pip install uvloop  # optional, Linux only
```
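uvloop swaps in a faster event loop for asyncio. If you want to verify that uvloop is installed and can drive asyncio, here is a quick check, independent of how gain uses it internally:

```python
import asyncio

import uvloop

# Ask asyncio to build uvloop-backed event loops from now on.
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())

# Prints <class 'uvloop.Loop'> when uvloop is active.
print(type(asyncio.new_event_loop()))
```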
Usage
- Write `spider.py`:
```python
from gain import Css, Item, Parser, Spider
import aiofiles

class Post(Item):
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'])

class MySpider(Spider):
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    start_url = 'https://blog.scrapinghub.com/'
    # Raw strings keep the regex backslashes intact.
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]

MySpider.run()
```
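Before wiring selectors into an `Item`, it can help to sanity-check them against the target page. The snippet below is a gain-independent sketch using aiohttp plus lxml (with the `cssselect` package installed); the URL and selectors are just the ones from the example above:

```python
import asyncio

import aiohttp
import lxml.html

async def count_matches(url, css_selectors):
    # Fetch one page and report how many nodes each selector matches.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            html = await resp.text()
    doc = lxml.html.fromstring(html)
    for sel in css_selectors:
        print(sel, '->', len(doc.cssselect(sel)), 'match(es)')

loop = asyncio.get_event_loop()
loop.run_until_complete(
    count_matches('https://blog.scrapinghub.com/', ['.entry-title', '.entry-content']))
```

The same approach works for XPath expressions via `doc.xpath(...)`.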
Or use XPathParser:
```python
from gain import Css, Item, XPathParser, Spider

class Post(Item):
    title = Css('.breadcrumb_last')

    async def save(self):
        print(self.title)

class MySpider(Spider):
    start_url = 'https://mydramatime.com/europe-and-us-drama/'
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    parsers = [
        XPathParser('//span[@class="category-name"]/a/@href'),
        XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
        XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post)
    ]
    proxy = 'https://localhost:1234'

MySpider.run()
```
You can add a proxy setting to the spider as shown above.
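For context on what the `proxy` attribute means at the HTTP level: with plain aiohttp (the client library gain is built on), a request through an HTTP proxy looks like the sketch below. The proxy URL is a placeholder; aiohttp's client accepts HTTP proxies via the `proxy` argument.

```python
import asyncio

import aiohttp

async def fetch_status(url, proxy):
    # Route a single GET request through the given HTTP proxy.
    async with aiohttp.ClientSession() as session:
        async with session.get(url, proxy=proxy) as resp:
            return resp.status

loop = asyncio.get_event_loop()
# 'http://localhost:1234' is a placeholder proxy address.
print(loop.run_until_complete(
    fetch_status('https://blog.scrapinghub.com/', 'http://localhost:1234')))
```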
- Run `python spider.py`.
- Result:
Example
The examples are in the `/example/` directory.
Contribution
- Open a pull request.
- Open an issue.