Web crawling is a resource-intensive process, both in processing and in communication. Distributing the crawl across multiple machines spreads the processing load, and distributing those machines geographically can significantly reduce the communication cost, for two reasons:
  • By assigning each web server to the crawler nearest to it, the HTTP fetch of that server's content travels a shorter distance.
  • Each crawler can index and compress content locally and send only the compact result back to the central indexing location, rather than shipping the uncompressed content the full distance over HTTP.
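The two ideas above can be sketched briefly. This is a minimal illustration, not the system described here: the crawler locations, coordinates, and function names are all hypothetical, proximity is approximated with squared lat/lon distance (a real deployment would use great-circle distance or network latency), and compression uses Python's standard `zlib`.

```python
import zlib

# Hypothetical regional crawlers with illustrative (lat, lon) coordinates.
CRAWLERS = {
    "us-east": (38.9, -77.0),
    "eu-west": (51.5, -0.1),
    "ap-south": (19.1, 72.9),
}

def nearest_crawler(server_coords):
    """Pick the crawler geographically closest to the target web server.

    Squared Euclidean distance on (lat, lon) is a rough proxy for
    proximity; it is enough to show the assignment step.
    """
    lat, lon = server_coords
    return min(
        CRAWLERS,
        key=lambda name: (CRAWLERS[name][0] - lat) ** 2
                         + (CRAWLERS[name][1] - lon) ** 2,
    )

def compress_for_indexing(page_html: str) -> bytes:
    """Compress fetched content before shipping it to the central indexer."""
    return zlib.compress(page_html.encode("utf-8"))

# A server near London is assigned to the eu-west crawler, and the
# payload sent back to the indexer is far smaller than the raw HTML.
assigned = nearest_crawler((51.4, -0.3))
page = "<html>" + "web crawling example text " * 400 + "</html>"
blob = compress_for_indexing(page)
print(assigned, len(page), len(blob))
```

In practice the assignment step would also account for politeness constraints and crawler load, but the core saving is the same: short-haul fetches, long-haul transfers only of compressed data.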
