Bothersome search engines
Creepy Crawlers
Published in: Articles | Mar 2010

With the overwhelming amount of content on the Internet today, it would be next to useless without a way to search through it quickly. But how can search engines know what is out there? They have to release the spiders.

The largest search engine, Google, reportedly handles over a billion searches per day (source: "Google unveils top political searches of 2009", CNN), leading to twenty petabytes (1 petabyte = 1000 terabytes, 1 terabyte = 1000 gigabytes) of user-generated data every day. What makes Google such a popular search engine is partly the number of different sites it has in its index, and how quickly it registers updates to these sites.

In an earlier blog post, called Google reconsideration, I talked a little about how a search engine works. In this post I will mostly talk about the bot that moves around the Internet and downloads pages to index. In that post I wrote:

..robot that crawls the web, looking for pages, called Googlebot. The bot requests thousands of pages simultaneously, and ships the pages to the indexer. Any links the crawler finds will be added to a list of future sites to check out.
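
To make the idea of a "list of future sites to check out" concrete, here is a minimal sketch in Python. It is not how Googlebot is actually built: it fetches one page at a time instead of thousands simultaneously, the `fetch` callable and the `FAKE_WEB` dictionary are made up for illustration, and "indexing" is just appending the URL to a list.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl. `fetch` is a hypothetical callable that
    returns the HTML for a URL, or None if the page is unavailable."""
    frontier = deque([start_url])  # the list of future pages to check out
    seen = {start_url}             # avoid queueing the same URL twice
    indexed = []                   # stand-in for shipping pages to an indexer

    while frontier and len(indexed) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        indexed.append(url)

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return indexed


# A tiny in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a> <a href="/missing">?</a>',
}

order = crawl("http://example.com/", FAKE_WEB.get)
print(order)
```

The `seen` set is what keeps the spider from walking in circles when pages link back to each other, which they almost always do.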

This principle is the same for any search engine. They send out these crawlers, or spiders, to find websites and store whatever information on them seems relevant.

Updating the information

Google and other search engines regularly visit this site, checking for new links that they can index. When I write a new blog post, it will show up after a while if you search for it. This is very nice, and I don't really mind the extra traffic at all. You can put a special file on your server, called robots.txt, in which you can specify where the spiders are not to go, and how often they can visit. You can even tell some search engines to stop visiting completely. I feel that Google, Yahoo and MSN will probably cover nearly all searches that are relevant to my site, and Google much more than the other two.
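
As a rough illustration of what such rules look like and how a well-behaved spider reads them, here is a small Python sketch using the standard library's `urllib.robotparser`. The robots.txt contents, the paths and the delay are made up for the example and are not the rules used on this site. Note that `Crawl-delay` is a non-standard extension, so not every search engine honours it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the paths and the delay are made up.
ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite spider asks before it fetches a page.
print(parser.can_fetch("Baiduspider", "http://example.com/blog/"))
print(parser.can_fetch("Googlebot", "http://example.com/blog/"))
print(parser.can_fetch("Googlebot", "http://example.com/private/x"))

# Seconds the spider is asked to wait between requests, if any.
print(parser.crawl_delay("Googlebot"))
```

The important point is that all of this is voluntary: the file only states wishes, and it is up to each spider to read and respect them.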

Evil spiders

But not all spiders are good spiders, like the ones from Google. Some are out there to do evil. They collect all email addresses on a site and start sending spam to them. They take your content and use it without permission. They can even automatically buy the best seats at a concert the moment the tickets are released. These malicious spiders will of course not bother with the rules you've written in your robots.txt file either.

About a week ago I noticed a large amount of traffic from China. I quickly found that it was the largest Chinese search engine, Baidu, that was visiting my site perhaps 20 times a day. Since I'm not interested in being listed in Chinese search engines (my content is not Chinese), I added a disallow statement for these spiders. To my surprise this did not stop them. They blatantly did not care about the rules. In my eyes, this makes them evil spiders, no matter how popular the engine is in China.

How can you stop them, then? Every visitor, whether a good spider, a bad spider or a human browsing the web, sends some information in a header with each request: the user agent. For me, the user agent says (according to whatsmyuseragent.com):

Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.8) Gecko/20100214 Ubuntu/9.10 (karmic) Firefox/3.5.8

It tells me that I'm using Firefox 3.5.8 on a Linux system, with Ubuntu 9.10 (Karmic) installed. When a Baidu spider enters, it also sends some distinct information, and one can block these user agents from your site. After doing this my visitor count has dropped, but at least it's mostly humans and good spiders now. The graph below shows what happened after I banned the Baidu spiders and restricted some others.
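
Blocking by user agent can be done in many places, for instance in the web server configuration. The Python sketch below shows the idea as a small WSGI middleware. It is not the setup used on this site: the names `BLOCKED_AGENT_TOKENS`, `is_blocked`, `block_bad_spiders`, `hello_app` and `status_for` are all made up for the example, and it assumes the spider identifies itself with the token `Baiduspider` in its user agent.

```python
# Substrings to look for in the User-Agent header (lower case).
# Assumption: the Baidu spider announces itself as "Baiduspider".
BLOCKED_AGENT_TOKENS = ("baiduspider",)


def is_blocked(user_agent):
    """Return True if the user agent contains a blocked token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


def block_bad_spiders(app):
    """Hypothetical WSGI middleware: answer 403 to blocked user agents."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


def status_for(user_agent):
    """Call the wrapped app once and return the status line it sent."""
    statuses = []
    app = block_bad_spiders(hello_app)
    app({"HTTP_USER_AGENT": user_agent},
        lambda status, headers: statuses.append(status))
    return statuses[0]


print(status_for("Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.8) "
                 "Gecko/20100214 Ubuntu/9.10 (karmic) Firefox/3.5.8"))
print(status_for("Baiduspider+(+http://www.baidu.com/search/spider.htm)"))
```

Keep in mind that the user agent is simply whatever the visitor chooses to send, so this only stops spiders that are honest about who they are.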

 