Website Scraper
scraperWebsite
Extracts links and specified data from a website and saves them into the database.
The data to extract is specified by CSS selectors.
Links and extracted data (extracts) are saved into two different database collections.
1. Links are saved into the collection defined by input.mo_coll_links.
2. Extracts are saved into the collection defined by input.mo_coll_extracts.
What can this scraper do that others cannot?
- Scrape a website simultaneously from many different IP addresses
- No need for external, 3rd party proxy companies
- Change the robot's IP address with a single click if the old IP gets banned
- START, STOP and PAUSE the scraping process
Input parameters:
name | type | description | default | example |
---|---|---|---|---|
seed_urls | string[] | first URLs to be crawled (the scraper starting point) | '' | www.adsuu.com |
http_client | string | HTTP client: dex8 (dex8-sdk::httpClient), chrome (puppeteer) | dex8 | chrome |
concurrency | number | maximum number of concurrently opened pages (e.g. browser tabs) | 1 | 5 |
timeout | number | how long the scraper should wait for a website response (in milliseconds) | 400 | 3000 |
device_name | string | Puppeteer device names plus "Desktop Windows", "Desktop Linux", "Desktop Macintosh", "DEX8bot", "Googlebot", "Bingbot", "Baiduspider", "YandexBot", "Slurp", "DuckDuckBot" | Desktop Windows | Desktop Linux |
robots_txt | boolean | respect robots.txt or not | false | true |
resources_block | string[] | do not load "image", "stylesheet", "script", "font", "other", "manifest" resources; only valid when http_client is 'chrome' | [] | ["image", "font", "other", "manifest"] |
headers | object | Add extra headers | {} | {'X-MyHeader': 55} |
links_flush | boolean | empty input.mo_coll_links collection before crawler starts | false | true |
links_containText | string[] | save and follow only links whose anchor text contains any of these words | [] | ['pets', 'cat'] |
links_follow | string[] | save and follow only links whose URL contains any of these patterns (regular expressions can be used) | [] | ['pets', 'cat'] |
links_unfollow | string[] | do not save or follow links whose URL contains any of these patterns (regular expressions can be used) | [] | ['dog'] |
links_max_depth | number | maximum crawl depth; depth 0 corresponds to the links defined in "input.seed_urls" | 5 | |
extracts_flush | boolean | empty input.mo_coll_extracts collection before crawler starts | false | true |
extractors | any[] | array of extractor definitions. Each extractor has: "tip" - 'text', 'html' or 'attr'; "attribute" - HTML tag attribute (only when tip is 'attr'); "selector" - CSS selector; "var_name" - field name in the mo_coll_extracts collection; "var_type" - 'string', 'number' or 'array' | [] | [{ "tip": "text", "selector": "title", "var_name": "title", "var_type": "string" }, { "tip": "attr", "attribute": "content", "selector": "meta[name=\"keywords\"]", "var_name": "meta_keywords", "var_type": "string" }] |
preProcess | false \| string | custom function executed before the scraping process starts; typical uses: login, cookie management, etc. Function parameters are (x, lib) | false | "console.log(lib); return x;" |
preSaveLinks | false \| string | custom function executed before extracted links are saved; typical uses: filter, parser, modifier, corrector, etc. Function parameters are (x, lib) (see the sketch after this table) | false | "console.log(lib); return x;" |
preSaveExtracts | false \| string | custom function executed before extracted data (extracts) is saved; typical uses: filter, parser, modifier, corrector, etc. Function parameters are (x, lib) | false | "console.log(lib); return x;" |
mo_uri | string | MongoDB URI | mongodb://dex8_freeuser:freedom@5.189.161.70:27017/dex8-freepool01 | |
mo_coll_links | string | MongoDB collection name where links are saved | ads_com_links | |
mo_coll_extracts | string | MongoDB collection name where extracted data is saved | ads_com_extracts | |
debug | boolean | debugging output for developers | false | true |
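The preProcess, preSaveLinks and preSaveExtracts hooks are passed as strings containing the body of a JavaScript function that receives (x, lib) and returns x. Below is a minimal sketch of a preSaveLinks hook; it assumes x is an array of URL strings about to be saved into mo_coll_links, which is an assumption rather than a documented fact, so confirm the actual shape of x and lib first (for example by running with "debug": true and a hook that only logs them).

```js
// Minimal preSaveLinks sketch: keep only links that end with ".html".
// ASSUMPTION: x is an array of URL strings about to be saved into mo_coll_links;
// the real shape of x (and of the helper object lib) is not documented here.
const filtered = x.filter(url => /\.html$/.test(url));
return filtered;
```

In the input JSON this function body is supplied as a single string, for example: "preSaveLinks": "const filtered = x.filter(url => /\\.html$/.test(url)); return filtered;".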
Input example
```json
{
  "seed_urls": ["https://www.adsuu.com"],
  "http_client": "dex8",
  "concurrency": 10,
  "timeout": 21000,
  "device_name": "Desktop Windows",
  "robots_txt": false,
  "resources_block": ["image", "font", "other", "manifest"],
  "extra_headers": {},
  "links_flush": true,
  "links_containText": [],
  "links_follow": ["alati"],
  "links_unfollow": [],
  "links_max_depth": 5,
  "extracts_flush": true,
  "extractors": [
    {
      "tip": "text",
      "selector": "title",
      "var_name": "title",
      "var_type": "string"
    },
    {
      "tip": "attr",
      "attribute": "content",
      "selector": "meta[name=\"keywords\"]",
      "var_name": "meta_keywords",
      "var_type": "string"
    },
    {
      "tip": "text",
      "selector": "h1",
      "var_name": "h1",
      "var_type": "string"
    },
    {
      "tip": "text",
      "selector": ".ClassifiedDetailSummary-priceDomestic",
      "var_name": "price",
      "var_type": "string"
    },
    {
      "tip": "text",
      "selector": ".ClassifiedDetailDescription-text",
      "var_name": "opis",
      "var_type": "string"
    }
  ],
  "preProcess": false,
  "preSaveLinks": false,
  "preSaveExtracts": false,
  "mo_uri": "mongodb://dex8_freeuser:freedom@5.189.161.70:27017/dex8-freepool01",
  "mo_coll_links": "links",
  "mo_coll_extracts": "extracts",
  "debug": false
}
```
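Once the task finishes, the results are plain MongoDB documents, so they can be read with any MongoDB client. The snippet below is only an illustrative sketch using the official Node.js driver together with the mo_uri and mo_coll_extracts values from the example above; apart from the fields named by var_name in the extractors array, any other fields mentioned are assumptions, not part of this specification.

```js
// Illustrative sketch: read scraped extracts with the official MongoDB Node.js driver.
// The connection URI and collection name are taken from the input example above.
const { MongoClient } = require('mongodb');

async function readExtracts() {
  const uri = 'mongodb://dex8_freeuser:freedom@5.189.161.70:27017/dex8-freepool01';
  const client = new MongoClient(uri);
  await client.connect();

  // Each document is expected to carry the fields named by "var_name" in the
  // extractors array (title, meta_keywords, h1, price, opis). Any additional
  // fields (e.g. the source URL or a timestamp) are an assumption here.
  const extracts = await client.db().collection('extracts').find().limit(10).toArray();
  console.log(extracts);

  await client.close();
}

readExtracts().catch(console.error);
```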
PRICE: 80.00 EUR/month
To buy this product you need to sign up for a free account and log in.