YoongiKim/AutoCrawler
Google, Naver multiprocess image web crawler (Selenium)
repo name | YoongiKim/AutoCrawler |
repo link | https://github.com/YoongiKim/AutoCrawler |
homepage | |
language | Python |
size (curr.) | 114385 kB |
stars (curr.) | 719 |
created | 2018-11-21 |
license | Apache License 2.0 |
AutoCrawler
Google, Naver multiprocess image crawler (High Quality & Speed & Customizable)
How to use
- Install Chrome
- pip install -r requirements.txt
- Write search keywords in keywords.txt
- Run "main.py"
- Files will be downloaded to the 'download' directory.
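As a rough illustration of the keywords step, the sketch below reads a keywords file into a list. It assumes one keyword per line, skips blank lines, and drops duplicates; the README does not spell out the file format, and `read_keywords` is a hypothetical helper, not a function from this repository.

```python
import os
import tempfile


def read_keywords(path):
    """Return unique, non-empty keywords in file order (assumes one per line)."""
    keywords = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            keyword = line.strip()
            # Skip blank lines and keywords already seen
            if keyword and keyword not in keywords:
                keywords.append(keyword)
    return keywords


# Demo against a throwaway file rather than the real keywords.txt
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "keywords.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("cat\ndog\n\ncat\n")
    demo_keywords = read_keywords(demo_path)

print(demo_keywords)
```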
Arguments
usage:
python3 main.py [--skip true] [--threads 4] [--google true] [--naver true] [--full false] [--face false]
--skip true Skips keyword if downloaded directory already exists. This is needed when re-downloading.
--threads 4 Number of threads to download.
--google true Download from google.com (boolean)
--naver true Download from naver.com (boolean)
--full false Download full resolution image instead of thumbnails (slow)
--face false Face search mode
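The options above could be parsed roughly as follows. This is a minimal sketch using the standard `argparse` module, not necessarily how main.py does it: the `str_to_bool` helper is hypothetical, and the defaults are assumed to be the values shown in the usage line.

```python
import argparse


def str_to_bool(value):
    """Interpret 'true'/'false' style strings as booleans (hypothetical helper)."""
    lowered = value.strip().lower()
    if lowered in ("true", "t", "yes", "1"):
        return True
    if lowered in ("false", "f", "no", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def build_parser():
    # Defaults mirror the values in the usage line; this is an assumption.
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--skip", type=str_to_bool, default=True)
    parser.add_argument("--threads", type=int, default=4)
    parser.add_argument("--google", type=str_to_bool, default=True)
    parser.add_argument("--naver", type=str_to_bool, default=True)
    parser.add_argument("--full", type=str_to_bool, default=False)
    parser.add_argument("--face", type=str_to_bool, default=False)
    return parser


# Parse an explicit argument list so the demo does not depend on sys.argv
args = build_parser().parse_args(["--threads", "8", "--full", "true"])
print(args.threads, args.full, args.skip)
```

Passing booleans as `true`/`false` strings needs a converter like this because `type=bool` in `argparse` would treat any non-empty string, including "false", as true.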
Full Resolution Mode
You can download full-resolution JPG, GIF, and PNG images by specifying --full true
Data Imbalance Detection
Detects data imbalance based on number of files.
When crawling ends, a message shows which directories contain fewer than 50% of the average number of files.
I recommend removing those directories and re-downloading.
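The check described above can be sketched as follows, assuming each keyword gets its own subdirectory under the download directory. `count_files` and `find_imbalanced` are hypothetical names for illustration, not the repository's actual functions.

```python
import os
import tempfile


def count_files(download_dir):
    """Map each immediate subdirectory name to the number of files it contains."""
    counts = {}
    for entry in sorted(os.listdir(download_dir)):
        path = os.path.join(download_dir, entry)
        if os.path.isdir(path):
            counts[entry] = sum(
                1 for name in os.listdir(path)
                if os.path.isfile(os.path.join(path, name))
            )
    return counts


def find_imbalanced(counts, ratio=0.5):
    """Return names whose file count is below `ratio` times the average count."""
    if not counts:
        return []
    average = sum(counts.values()) / len(counts)
    return [name for name, n in counts.items() if n < average * ratio]


# Demo with a temporary directory tree: two full classes and one sparse class
with tempfile.TemporaryDirectory() as tmp:
    for keyword, n_files in (("cat", 10), ("dog", 10), ("bird", 2)):
        os.makedirs(os.path.join(tmp, keyword))
        for i in range(n_files):
            open(os.path.join(tmp, keyword, f"{i}.jpg"), "w").close()
    demo_counts = count_files(tmp)
    flagged = find_imbalanced(demo_counts)

print(flagged)
```

In the demo the average is 22 / 3, about 7.3 files, so the threshold is about 3.7 and only the directory with 2 files is flagged.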
Remote crawling through SSH on your server
sudo apt-get install xvfb <- This is a virtual display.
sudo apt-get install screen <- This lets you close the SSH terminal while the crawler keeps running.
screen -S s1
Xvfb :99 -ac & DISPLAY=:99 python3 main.py
Customize
You can make your own crawler by changing collect_links.py