The term “search engine” (SE) is often misused to describe both directories and pure search engines. In fact, they are not the same; the difference lies in how result listings are generated.
There are four major search engine types you should know about. They are:
· crawler-based (traditional, common) search engines;
· directories (mostly human-edited catalogs);
· hybrid engines (META engines and those using other engines’ results);
· pay-per-performance and paid inclusion engines.
Crawler-based SEs, sometimes themselves called spiders or Web crawlers after the software they rely on, use special programs to visit websites automatically and regularly in order to build and update their giant Web page repositories.
This software is referred to as a “bot”, “robot”, “spider”, or “crawler”. All these terms denote the same concept. These programs run on the search engine’s own servers. They browse pages that already exist in the engine’s repository and find your site by following links from those pages. Alternatively, after you have submitted pages to a search engine, those pages are queued for scanning; a spider finds your page by working through the list of pages pending review in this queue.
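The two discovery routes described above (following links from pages already in the repository, and working through a queue of submitted pages) can be sketched in a few lines of Python. This is only an illustrative model, not how any particular engine works: the `LINKS` table, the `.example` URLs, and the `crawl` function are all made up for the example, and the "Web" is an in-memory dictionary so nothing is actually fetched.

```python
from collections import deque

# Hypothetical miniature "Web": each URL maps to the links found on that page.
LINKS = {
    "http://known.example/": ["http://known.example/about", "http://yoursite.example/"],
    "http://known.example/about": ["http://known.example/"],
    "http://yoursite.example/": ["http://yoursite.example/products"],
    "http://yoursite.example/products": [],
    "http://submitted.example/": [],
}


def crawl(repository, submitted):
    """Return URLs in the order a simple breadth-first spider would visit them."""
    queue = deque()
    seen = set()
    # Start from pages already in the repository, then pages submitted by site owners.
    for url in list(repository) + list(submitted):
        if url not in seen:
            seen.add(url)
            queue.append(url)

    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        # Follow links from this page; unseen pages join the end of the queue.
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


if __name__ == "__main__":
    for url in crawl(["http://known.example/"], ["http://submitted.example/"]):
        print(url)
```

In this sketch, http://yoursite.example/ is never submitted; the spider reaches it only because a page already in the repository links to it, while http://submitted.example/ is reached only through the submission queue.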
After a spider has found a page to scan, it retrieves this page via HTTP (like any ordinary Web surfer who types a URL into a browser’s address field and presses Enter). Just like any human visitor, the crawling software leaves a record of its visit on your server. It is therefore possible to tell from your server log when a search engine has dropped in on your online estate.
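As a rough illustration of spotting such visits, the sketch below scans access-log lines and keeps those whose User-Agent looks like a crawler. It rests on several assumptions not stated above: that the server writes the widely used Combined Log Format, that the crawler identifies itself in its User-Agent string, and that matching on the words "bot", "spider", or "crawler" is good enough. The log lines, the "ExampleBot" agent, and the names `LOG_LINE`, `BOT_HINTS`, and `crawler_visits` are invented for the example.

```python
import re

# Combined Log Format (assumed):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Crude heuristic: substrings that commonly appear in crawler User-Agent strings.
BOT_HINTS = ("bot", "spider", "crawler")


def crawler_visits(lines):
    """Return (time, request, user-agent) for each line that looks like a crawler visit."""
    visits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in the assumed format
        agent = match.group("agent")
        if any(hint in agent.lower() for hint in BOT_HINTS):
            visits.append((match.group("time"), match.group("request"), agent))
    return visits


# Made-up sample entries: one crawler visit and one ordinary browser visit.
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "ExampleBot/1.0 (+http://bot.example/info)"',
    '192.0.2.77 - - [10/Oct/2023:14:02:11 +0000] "GET /index.html HTTP/1.1" '
    '200 5120 "http://known.example/" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    for when, request, agent in crawler_visits(SAMPLE_LOG):
        print(when, request, agent)
```

A User-Agent string can be set to anything by the client, so a match here is a hint rather than proof that a given search engine visited.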
Your Web server returns the HTML source code of your page to the spider. The spider then reads it (this process is referred to as “crawling” or “spidering”), and this is where the difference between a human visitor and crawling software begins.
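To make "reading the HTML source" concrete, here is a small sketch using Python's standard `html.parser` module. It does not render anything: it simply walks the markup, collects the visible text, and records the link targets. What a real spider extracts is not specified above, so the choice of text plus links is an assumption, and `PageReader`, `read_page`, and `SAMPLE_HTML` are hypothetical names for this example.

```python
from html.parser import HTMLParser


class PageReader(HTMLParser):
    """Collects visible text and link targets from raw HTML, without rendering it."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []
        self._skip = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not script or stylesheet content.
        if not self._skip and data.strip():
            self.text.append(data.strip())


def read_page(html):
    """Return (text, links) as a simple spider might see them."""
    reader = PageReader()
    reader.feed(html)
    reader.close()
    return " ".join(reader.text), reader.links


# Made-up page source for illustration.
SAMPLE_HTML = """<html><head><title>Widgets</title><style>body{color:red}</style></head>
<body><h1>Our widgets</h1><p>See the <a href="/products">product list</a>.</p></body></html>"""

if __name__ == "__main__":
    text, links = read_page(SAMPLE_HTML)
    print(text)
    print(links)
```

The stylesheet that changes how the page looks to a human contributes nothing here; the program sees only words and link targets.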
