
Extract data from an entire website.

  Website Scraper

scraperWebsite

Extracts links and specified data from a website and saves them into the database.
The data to extract is specified by CSS selectors.
Links and extracted data (extracts) are saved into two different database collections:
    1. Links are saved into the collection defined by input.mo_coll_links.
    2. Extracts are saved into the collection defined by input.mo_coll_extracts.
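To make the two collections concrete, here is a sketch of the kinds of documents each one might hold. The field names below are illustrative assumptions based on the parameters described in this page, not the task's actual schema:

```javascript
// Hypothetical document shapes for the two collections.
// ASSUMPTION: field names are illustrative, not the task's real schema.
const linkDoc = {
  url: 'https://www.adsuu.com/pets/cat-toys', // a crawled URL (mo_coll_links)
  depth: 1                                    // crawl depth relative to seed_urls
};

const extractDoc = {
  url: 'https://www.adsuu.com/pets/cat-toys', // page the data was scraped from
  title: 'Cat toys',                          // var_name "title" from the extractors
  meta_keywords: 'cat, toys'                  // var_name "meta_keywords"
};

console.log(linkDoc.url, extractDoc.title);
```

Each extractor's var_name becomes a field on the extract document, with its value coerced to the configured var_type.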

What can this scraper do that others cannot?

  1. Scrape a website simultaneously from many different IP addresses
  2. No need for external, third-party proxy companies
  3. Change the robot's IP address with a single click if the old IP gets banned
  4. START, STOP and PAUSE the scraping process


Input parameters:

| name | type | description | default | example |
|------|------|-------------|---------|---------|
| seed_urls | string[] | first URLs to be crawled (the scraper's starting point) | '' | www.adsuu.com |
| http_client | string | HTTP client: dex8 (dex8-sdk::httpClient) or chrome (puppeteer) | dex8 | chrome |
| concurrency | number | maximum number of concurrently opened pages, e.g. browser tabs | 1 | 5 |
| timeout | number | how long the scraper waits for a website response (in milliseconds) | 400 | 3000 |
| device_name | string | puppeteer device names plus "Desktop Windows", "Desktop Linux", "Desktop Macintosh", "DEX8bot", "Googlebot", "Bingbot", "Baiduspider", "YandexBot", "Slurp", "DuckDuckBot" | Desktop Windows | Desktop Linux |
| robots_txt | boolean | whether to respect robots.txt | false | true |
| resources_block | string[] | do not load "image", "stylesheet", "script", "font", "other", "manifest" resources; only valid when http_client is 'chrome' | [] | ["image", "font", "other", "manifest"] |
| headers | object | add extra HTTP headers | {} | {'X-MyHeader': 55} |
| links_flush | boolean | empty the input.mo_coll_links collection before the crawler starts | false | true |
| links_containText | string[] | save and follow links whose anchor text contains the words | [] | ['pets', 'cat'] |
| links_follow | string[] | save and follow links whose URL matches the words (regular expressions) | [] | ['pets', 'cat'] |
| links_unfollow | string[] | do not save and do not follow links whose URL matches the words (regular expressions) | [] | ['dog'] |
| links_max_depth | number | maximal crawl depth; depth 0 is the links defined in input.seed_urls | 5 | |
| extracts_flush | boolean | empty the input.mo_coll_extracts collection before the crawler starts | false | true |
| extractors | any[] | array of CSS selector definitions: tip ('text' \| 'html' \| 'attr'), attribute (HTML tag attribute, used when tip is 'attr'), selector (CSS selector), var_name (field name in the mo_coll_extracts collection), var_type ('string', 'number', 'array') | [] | [{ "tip": "text", "selector": "title", "var_name": "title", "var_type": "string" }, { "tip": "attr", "attribute": "content", "selector": "meta[name=\"keywords\"]", "var_name": "meta_keywords", "var_type": "string" }] |
| preProcess | false \| string | custom function executed before the scraping process starts: login, cookie management, etc.; function parameters are (x, lib) | false | "console.log(lib); return x;" |
| preSaveLinks | false \| string | custom function executed before saving extracted links: filter, parser, modifier, corrector, etc.; function parameters are (x, lib) | false | "console.log(lib); return x;" |
| preSaveExtracts | false \| string | custom function executed before saving extracted data (extracts): filter, parser, modifier, corrector, etc.; function parameters are (x, lib) | false | "console.log(lib); return x;" |
| mo_uri | string | MongoDB URI | | mongodb://dex8_freeuser:freedom@5.189.161.70:27017/dex8-freepool01 |
| mo_coll_links | string | MongoDB collection name where links are saved | | ads_com_links |
| mo_coll_extracts | string | MongoDB collection name where extracted data is saved | | ads_com_extracts |
| debug | boolean | debugging output for developers | false | true |
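To illustrate how the links_follow and links_unfollow parameters interact, here is a minimal sketch that treats each entry as a regular expression tested against the URL, as the table describes. The precedence rule (unfollow wins, an empty follow list keeps everything) is an assumption about the implementation, not confirmed behaviour:

```javascript
// Minimal sketch of the links_follow / links_unfollow filtering.
// ASSUMPTION: unfollow patterns take precedence, and an empty follow
// list means "keep every link that was not unfollowed".
const input = {
  links_follow: ['pets', 'cat'],
  links_unfollow: ['dog']
};

function keepLink(url, { links_follow, links_unfollow }) {
  // drop the link if any unfollow pattern matches the URL
  if (links_unfollow.some(p => new RegExp(p).test(url))) return false;
  // with no follow patterns, keep everything that was not unfollowed
  if (links_follow.length === 0) return true;
  // otherwise keep only links whose URL matches a follow pattern
  return links_follow.some(p => new RegExp(p).test(url));
}

const urls = [
  'https://example.com/pets/toys',
  'https://example.com/dog/food',
  'https://example.com/about'
];
const kept = urls.filter(u => keepLink(u, input));
console.log(kept); // → [ 'https://example.com/pets/toys' ]
```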


Input example


{
  "seed_urls": ["https://www.adsuu.com"],

  "http_client": "dex8",
  "concurrency": 10,
  "timeout": 21000,
  "device_name": "Desktop Windows",
  "robots_txt": false,

  "resources_block": ["image", "font","other", "manifest"],
  "extra_headers": {},

  "links_flush": true,
  "links_containText": [],
  "links_follow": ["alati"],
  "links_unfollow": [],
  "links_max_depth": 5,


  "extracts_flush": true,
  "extractors": [
    {
      "tip": "text",
      "selector": "title",
      "var_name": "title",
      "var_type": "string"
    },
    {
      "tip": "attr",
      "attribute": "content",
      "selector": "meta[name=\"keywords\"]",
      "var_name": "meta_keywords",
      "var_type": "string"
    },
    {
      "tip": "text",
      "selector": "h1",
      "var_name": "h1",
      "var_type": "string"
    },
    {
      "tip": "text",
      "selector": ".ClassifiedDetailSummary-priceDomestic",
      "var_name": "price",
      "var_type": "string"
    },
    {
      "tip": "text",
      "selector": ".ClassifiedDetailDescription-text",
      "var_name": "opis",
      "var_type": "string"
    }
  ],


  "preProcess": false,

  "preSaveLinks": false,

  "preSaveExtracts": false,


  "mo_uri": "mongodb://dex8_freeuser:freedom@5.189.161.70:27017/dex8-freepool01",
  "mo_coll_links": "links",
  "mo_coll_extracts": "extracts",

  "debug": false
}
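The preProcess, preSaveLinks and preSaveExtracts hooks above are supplied as function-body strings with the (x, lib) parameters documented in the table. A minimal sketch of how such a string could be turned into a callable hook; building it with `new Function` is an assumption about the platform, not confirmed behaviour:

```javascript
// Minimal sketch of evaluating a preSaveLinks function-body string.
// ASSUMPTION: the platform compiles the string with the documented
// (x, lib) signature; `new Function` is used here only to illustrate.
const preSaveLinks = "return x.filter(link => link.startsWith('https://'));";

// Build a function with the documented (x, lib) parameters from the string body.
const hook = new Function('x', 'lib', preSaveLinks);

const links = ['https://www.adsuu.com/a', 'http://insecure.example/b'];
const filtered = hook(links, {}); // lib would be the DEX8 helper library
console.log(filtered); // → [ 'https://www.adsuu.com/a' ]
```

Returning the (possibly modified) x from the hook is what lets it act as a filter or corrector before the links are saved.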

PRICE: 80.00 EUR /month

To buy this product, sign up for a free account and log in.