World Wide Web Robots
Let’s start with robots: what exactly are they, and how do these web robots, which we also call spiders, crawlers, or bots, work?
“A robot is a program that automatically traverses the Web’s hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced. Note that “recursive” here doesn’t limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time, it is still a robot. Normal Web browsers are not robots, because they are operated by a human, and don’t automatically retrieve referenced documents (other than inline images). Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading as they give the impression the software itself moves between sites like a virus; this not the case, a robot simply visits sites by requesting documents from them.”
For more details on WWW robots, see the FAQ at http://www.robotstxt.org/wc/faq.html
Today I was browsing my Parenting Forum and saw that there were 5 active users. When I checked who the 4 other guests were, there they were, all browsing along with me: two Googlebots, one MSNBot, and one Yahoo! Slurp. Well, I think they do love my site; they are my forum’s regular visitors, always on the lookout for new posts and content to spider. See the attached screenshot.

Now, let’s see how useful these Robots are.
Web robots reach out and grab pages from the Internet, and if a page is new or has been updated since their last visit, they take a copy of the data. They find these pages either because the web author has gone to a search engine and asked for the site to be indexed, or because the robot has reached the site by following a link from another page. As a result, if the author doesn’t tell the engines about a particular page and nothing links to it, it’s highly unlikely that the page will be found.
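To make the link-following idea concrete, here is a minimal sketch in Python. It does not fetch anything from the network; the LINKS dictionary is a made-up site map (every page name in it is hypothetical) and crawl is an illustrative helper, not how any real search engine robot is written.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical site: each page maps to the pages it links to.
LINKS: Dict[str, List[str]] = {
    "/index.html": ["/forum.html", "/about.html"],
    "/forum.html": ["/index.html", "/post-1.html"],
    "/about.html": [],
    "/post-1.html": [],
    "/orphan.html": ["/index.html"],  # links out, but nothing links to it
}


def crawl(start: str, links: Dict[str, List[str]]) -> Set[str]:
    """Return every page reachable from `start` by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:  # visit each page only once
                seen.add(target)
                queue.append(target)
    return seen


found = crawl("/index.html", LINKS)
print(sorted(found))
print("/orphan.html" in found)  # False: no link leads to this page
```

The orphan page is never discovered, which is the situation described above: a page that nobody links to and nobody submits stays invisible to the robot.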
Robots are working all the time; the ones employed by AltaVista, for example, will spider about 10,000,000 pages a day. If your website has been indexed by a search engine, you can be assured that at some point a robot has visited your site and, by following all your links, copied all the pages that it could find. So it is entirely up to the webmaster whether to allow indexing of the pages or not, and to write the instructions in the robots.txt file accordingly. For more details on robots.txt content, check out the Robots FAQ.
If a webmaster wants to restrict the visits of these robots, they can add directions to their robots.txt file, and all well-behaved robots will follow the directions specified there.
If you noticed, I said well-behaved, which means there are also some not-so-well-behaved robots out there. As MSN has pointed out, webmasters can check whether the spiders and bots coming to their site are authentic. As mentioned in the Live Search blog:
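As a rough illustration of how a well-behaved robot might consult these directions, the sketch below uses Python’s standard urllib.robotparser module. The robots.txt content, the “BadBot” name, and the example.com URLs are all made up for this example; a real robot would download the file from the site instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Any robot may read the forum, but not the /private/ area.
print(parser.can_fetch("Googlebot", "http://www.example.com/forum/index.html"))
print(parser.can_fetch("Googlebot", "http://www.example.com/private/notes.html"))

# The robot named BadBot is asked to stay away from the whole site.
print(parser.can_fetch("BadBot", "http://www.example.com/forum/index.html"))
```

Keep in mind that robots.txt is only a request: the check happens inside the robot, so a badly behaved one can simply skip it.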
“But what about crawlers that aren’t so well-behaved? After all, anyone could call themselves ‘MSNBot’, and proceed to be as rude and aggressive as they like. Fortunately, there is a way you can catch these impersonators. Here is how it works:
1. When you get a page view request, it specifies a user-agent and an IP address. As I described above, all requests from Live Search use a user agent starting with the word ‘MSNBot’.
2. If you see the MSNBot user-agent, it’s time to check the identity of the bot. Starting with the IP address (i.e. 207.46.98.149), you can use reverse DNS lookup to find out the registered name of the machine.
3. Once you have the host name (in this case, livebot-207-46-98-149.search.live.com), you can check that it really is coming from Live Search. The name of all live search crawlers will end with ‘search.live.com’. If the name doesn’t end with ‘search.live.com’, you know it’s not really our crawler.
4. Finally, you need to verify that the name is accurate. In order to do this, you can use Forward DNS to see the IP address associated with the host name. This should match the IP address you used in Step 2 – if it doesn’t, it means the name was fake.”
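The four steps quoted above can be sketched in Python roughly as follows. This is only an illustration under a few assumptions: the function and parameter names (is_genuine_crawler, ua_prefix, domain, reverse, forward) are invented here, the lookups use the standard socket module, and the resolvers are passed in as parameters so the logic can be tried out with stand-in functions instead of live DNS.

```python
import socket
from typing import Callable, List


def reverse_lookup(ip: str) -> str:
    """Reverse DNS: IP address -> registered host name."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host: str) -> List[str]:
    """Forward DNS: host name -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_genuine_crawler(
    ip: str,
    user_agent: str,
    ua_prefix: str = "msnbot",
    domain: str = "search.live.com",
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    # Step 1: the request must claim to be the crawler at all.
    if not user_agent.lower().startswith(ua_prefix):
        return False
    # Step 2: reverse DNS lookup on the requesting IP address.
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:
        return False
    # Step 3: the host name must end with the crawler's domain.
    if not (host == domain or host.endswith("." + domain)):
        return False
    # Step 4: forward DNS on that name must lead back to the same IP.
    try:
        return ip in forward(host)
    except OSError:
        return False


# Stand-in resolvers that mimic the example from the quote (no network used).
def fake_reverse(ip: str) -> str:
    return "livebot-207-46-98-149.search.live.com"


def fake_forward(host: str) -> List[str]:
    return ["207.46.98.149"]


print(is_genuine_crawler("207.46.98.149", "msnbot/1.0",
                         reverse=fake_reverse, forward=fake_forward))
```

Leaving out the reverse and forward arguments makes the function perform real DNS lookups. The check in step 3 requires a dot before the domain, which is slightly stricter than the plain “ends with” wording in the quote.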
So this gives a fair idea of how to check whether the crawlers visiting your web pages are fake or real. You have probably seen those CAPTCHA images on web forms (to fight spam), which test whether the one filling in the form is a human being or a robot. We will discuss CAPTCHA in another post.