Essay on The Crawling Module and Web Pages

790 Words4 Pages
With the first one, a collection can have various copies of web pages grouped according to the crawl in which they were found. For the second one, only the most recent copy of web pages is to be saved. For this, one has to maintain records of when the web page changed and how frequently it was changed. This technique is more efficient than the previous one but it requires an indexing module to be run with the crawling module. The authors conclude that an incremental crawler can bring brand new copies of web pages more quickly and maintain the storage area fresher than a periodic crawler.
III. CRAWLING TERMINOLOGY
The web crawler keeps a list of unvisited URLs which is called as frontier. The list is initiate with start URLs which may be
…show more content…
There must have timeouts of particular we page or web server to make sure that an unnecessary amount of time is not spent on web servers which is slow or in reading large web pages.
Parsing:
When a web page is obtained, then content of web pages is parsed to extract information that will provide and possibly direct the prospect path of the web crawler. Parsing involves the URL extraction from HTML pages or it may involve the more difficult process of meshing up the HTML content.
IV. PROPOSED WORK
The functioning of Web crawler [10] is beginning with a set of URLs which is called as seed URLs. They download web pages with the help of seed URLs and take out new links which is present in the downloaded pages. The retrieved web pages are stored and well indexed on the storage area so that by the help of these indexes they can later be retrieved as and when required. The URLs which is extracted from the downloaded web page are confirmed to know whether their associated documents have already been downloaded or not. If associated document are not downloaded, the URLs are again allocated to web crawlers for further downloading. The same process is repeated till no more URLs are missing for downloading. Millions of web pages are downloaded daily by a crawler to complete the target. Fig. 1 illustrates the proposed crawling processes.

Fig. 1 Proposed Crawling
Get Access