03-11-2008, 01:20 PM
bsgcorp
*** Talk About IO Management - Different Queues

Crawler Reliability
The WWW is very heterogeneous, which is a delight to surfers who like variety but quite a burden on a program which must handle anything. In our crawls, we encountered infinite web pages, infinite URLs, many varied kinds of communication errors, and anything else one might imagine. As an amusing example, a number of hosts had their IP address resolve to 127.0.0.1 - the local host. As a result, during early runs, we were surprised how many web pages matched terms from our own home page.
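
One way to guard against the 127.0.0.1 surprise described above is to check every resolved address before fetching. The following is only a minimal sketch using Python's standard library; the function names `is_loopback_address` and `resolve_non_loopback` are made up for illustration and are not from the original crawler.

```python
import ipaddress
import socket
from typing import List


def is_loopback_address(ip_text: str) -> bool:
    """Return True if the address points back at the local machine."""
    # Drop an IPv6 zone suffix such as "%eth0" before parsing.
    ip = ipaddress.ip_address(ip_text.split("%")[0])
    return ip.is_loopback or ip.is_unspecified


def resolve_non_loopback(host: str) -> List[str]:
    """Resolve a host name and keep only addresses that are safe to crawl.

    An empty result means the crawler would only be talking to itself.
    """
    infos = socket.getaddrinfo(host, None)
    addresses = {info[4][0] for info in infos}
    return sorted(a for a in addresses if not is_loopback_address(a))


if __name__ == "__main__":
    for candidate in ("127.0.0.1", "::1", "93.184.216.34"):
        print(candidate, "loopback:", is_loopback_address(candidate))
```

A crawler could skip any host for which `resolve_non_loopback` returns an empty list, so that pages served by its own machine never enter the index.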

Social Issues
It turns out that running a crawler which connects to more than half a million servers and generates tens of millions of log entries generates a fair amount of email and phone calls. Because of the vast number of people coming online, there are always users who do not know what a crawler is, because this is the first one they have seen. Almost daily, we receive an email something like, "Wow, you looked at a lot of pages from my web site. How did you like it?" There are also some people who do not know about the robots exclusion protocol and think their page should be protected from indexing by a statement like, "This page is copyrighted and should not be indexed", which, needless to say, is difficult for web crawlers to understand.

Also, because of the huge amount of data involved, unexpected things will happen. For example, our system tried to crawl an online game. This resulted in lots of garbage messages in the middle of their game! It turns out this was an easy problem to fix, but it had not come up until we had downloaded tens of millions of pages. Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on a large part of the Internet. Invariably, there are hundreds of obscure problems which may occur on only one page on the whole web and cause the crawler to crash or, worse, to behave unpredictably or incorrectly.
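
For readers who have not seen the robots exclusion protocol mentioned above, here is a small sketch of how a crawler reads it, using Python's standard `urllib.robotparser` module. The rules, the crawler name `ExampleCrawler`, and the example.com URLs are all invented for illustration. The rules are parsed from a string, so nothing is fetched from the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content a site owner might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler asks before fetching each URL.
blocked = parser.can_fetch("ExampleCrawler", "http://example.com/private/page.html")
allowed = parser.can_fetch("ExampleCrawler", "http://example.com/index.html")

print("private page may be fetched:", blocked)
print("index page may be fetched:", allowed)
```

A sentence in the page text such as "should not be indexed" has no effect on this check. Only rules in the machine-readable format are visible to the parser.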

Since such large numbers of people are looking at their web logs every day, even if only one out of ten thousand contacts us, we will be drowning in email. As a result, systems which access large parts of the Internet need to be designed to be very robust and carefully tested. Since large complex systems such as crawlers will invariably cause problems, significant resources need to be devoted to reading the email and dealing with these problems as they come up.
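
As a rough illustration of the robustness point, a crawl loop can isolate each page so that one obscure failure is logged instead of crashing the whole run. This is only a sketch under assumptions: `crawl_all` and the injected `fetch` callable are hypothetical names, and this is not how the original crawler was written.

```python
import logging
from typing import Callable, Dict, Iterable, List, Tuple

logger = logging.getLogger("crawler_sketch")


def crawl_all(
    urls: Iterable[str], fetch: Callable[[str], str]
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Fetch every URL, recording failures instead of stopping on them."""
    pages: Dict[str, str] = {}
    failures: List[Tuple[str, str]] = []
    for url in urls:
        try:
            pages[url] = fetch(url)
        except Exception as exc:  # deliberately broad: any single page may misbehave
            logger.warning("skipping %s: %r", url, exc)
            failures.append((url, repr(exc)))
    return pages, failures
```

The list of failures gives whoever handles the resulting email something concrete to look at when a site owner reports a problem.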