*** Talk About IO Management - Different Queues ***

Crawler Reliability

The WWW is very heterogeneous, which is a delight to surfers who like variety but quite a burden on a program which must handle anything. In our crawls, we encountered infinite web pages, infinite URLs, many varied kinds of communication errors, and anything else one might imagine. As an amusing example, a number of hosts had their IP address resolve to 127.0.0.1 - the local host. As a result, during early runs, we were surprised how many web pages matched terms from our own home page.

Social Issues

It turns out that running a crawler which connects to more than half a million servers and generates tens of millions of log entries also generates a fair amount of email and phone calls. Because of the vast number of people coming online, there are always users who do not know what a crawler is, because this is the first one they have seen. Almost daily, we receive an email something like, "Wow, you looked at a lot of pages from my web site. How did you like it?" There are also some people who do not know about the robots exclusion protocol, and think their page should be protected from indexing by a statement like, "This page is copyrighted and should not be indexed", which needless to say is difficult for web crawlers to understand.

Also, because of the huge amount of data involved, unexpected things will happen. For example, our system was trying to crawl an online game. This resulted in lots of garbage messages in the middle of their game! It turns out this was an easy problem to fix. But this problem had not come up until we had downloaded tens of millions of pages. Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on a large part of the Internet. Invariably, there are hundreds of obscure problems which may only occur on one page on the whole web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior. Since such large numbers of people are looking at their web logs every day, if only one out of ten thousand people contacts us we will be drowning in email. As a result, systems which access large parts of the Internet need to be designed to be very robust and carefully tested. Since large complex systems such as crawlers will invariably cause problems, significant resources need to be devoted to reading the email and dealing with these problems as they come up.
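To make two of the points above concrete (hosts that resolve to 127.0.0.1, and the robots exclusion protocol), here is a rough Python sketch of the kind of guard a crawler might run before fetching a page. It is only an illustration and not how the crawler described above was built. The function names `is_crawlable_address` and `allowed_by_robots`, the bot name `ExampleBot`, and the example.com URLs are all made up for this example.

```python
import ipaddress
from urllib.robotparser import RobotFileParser


def is_crawlable_address(ip_text: str) -> bool:
    """Return False for addresses a crawler should skip (hypothetical helper).

    Loopback addresses such as 127.0.0.1 point back at the crawler's own
    machine, so fetching them would index our own pages by mistake.
    """
    try:
        addr = ipaddress.ip_address(ip_text)
    except ValueError:
        # Not a valid IP address at all: refuse rather than guess.
        return False
    return not (
        addr.is_loopback
        or addr.is_private
        or addr.is_link_local
        or addr.is_unspecified
    )


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against the text of a robots.txt file (hypothetical helper).

    The robots.txt content is passed in as a string, so no network access
    happens here; downloading the file is left to the caller.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(is_crawlable_address("127.0.0.1"))
    print(allowed_by_robots(rules, "ExampleBot", "http://example.com/private/a.html"))
```

A site owner who wants pages left out of an index needs a robots.txt rule like the `Disallow` line in the example; a sentence on the page itself is not something this kind of check can see.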