The General Search Engine Architecture

Search engines such as Google are complex, sophisticated, distributed systems. We follow the general search engine architecture discussed in “Searching the Web” by Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan (Stanford University), ACM Transactions on Internet Technology (TOIT), Volume 1, Issue 1 (August 2001). The main components are parallel crawlers and crawler control (which decides when and where to crawl), a page repository, an indexer, an analysis module, a collection of data structures (index tables, structure, utility), and a query engine with a ranking module. Such a general architecture would take a significant amount of time to code. In this course, we implement stripped down versions of the main…
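To make the roles of the page repository, indexer, and query engine concrete, here is a minimal Python sketch. It is not the architecture from the paper or the course code: the URLs, page texts, and the function names build_index and search are all hypothetical, the repository is a plain dictionary standing in for crawler output, and ranking is left out entirely.

```python
from collections import defaultdict

# Hypothetical page repository: URL -> page text (stands in for crawler output).
repository = {
    "http://example.com/a": "web search engines crawl the web",
    "http://example.com/b": "an indexer builds index tables",
    "http://example.com/c": "the query engine searches the index",
}

def build_index(repo):
    """Indexer: map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in repo.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    """Query engine: return URLs containing every query term (Boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

index = build_index(repository)
matches = search(index, "the index")
```

A real engine would tokenize more carefully, store the index on disk, and pass the matching pages to a ranking module instead of returning an unordered set.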
Besides that, there are other difficulties. A simple query may match too many relevant pages. It is also difficult to compare two search engines, because both are continuously improved.

Yuwono and Lee propose three ranking algorithms: Boolean spread, vector spread, and most-cited. The first two are the usual ranking algorithms of the Boolean and vector models, extended to include pages pointed to by a page in the answer or pages that point to a page in the answer. The third, most-cited, is based only on the terms included in pages that have a link to pages in the answer (Yuwono, 1996).

There are other approaches as well. WebQuery allows visual browsing of Web pages; its ranking algorithm is based on how connected each Web page is.

The PageRank algorithm uses the following equation:

PR(a) = q + (1 - q) * [PR(p1)/C(p1) + ... + PR(pn)/C(pn)]

Here a is the page that we want to rank; PR(a) is the PageRank of page a; q is the probability that the user jumps to a random page instead of following a link; p1 to pn are the pages that point to page a; and C(a) is the number of outgoing links on page a. The Google search engine uses the PageRank algorithm. It simulates a user navigating the Web at random and applies the equation to rank the Web pages. Google builds a citation graph containing 518 million links, which allows rapid calculation: PageRank for 26 million Web pages can be computed in a few hours. A page has a high rank when many other pages point to it or when some high-ranking pages point to it (Brin, 1998).
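The PageRank description can be illustrated with a small iterative sketch in Python. It assumes the commonly cited form PR(a) = q + (1 - q) * sum of PR(p_i)/C(p_i) over the pages p_i that point to a, with q = 0.15 as an assumed value. The function name pagerank, the four-page example graph, and the fixed iteration count are all hypothetical choices for illustration, not Google's implementation.

```python
def pagerank(links, q=0.15, iterations=50):
    """Iterate PR(a) = q + (1 - q) * sum(PR(p) / C(p)) over pages p linking to a.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # C(p): number of outgoing links on page p.
    out_count = {p: len(links.get(p, [])) for p in pages}

    # incoming[a]: the pages p1..pn that point to page a.
    incoming = {p: [] for p in pages}
    for src, targets in links.items():
        for dst in targets:
            incoming[dst].append(src)

    pr = {p: 1.0 for p in pages}  # assumed starting value
    for _ in range(iterations):
        pr = {
            a: q + (1 - q) * sum(pr[p] / out_count[p] for p in incoming[a])
            for a in pages
        }
    return pr

# Hypothetical four-page Web: every page has at least one outgoing link.
example_links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(example_links)
ranking = sorted(ranks, key=ranks.get, reverse=True)
```

In this toy graph, page c is pointed to by three pages and ends up ranked first, while page d has no incoming links and keeps the base value q. With this form of the equation the scores average to 1 rather than summing to 1; other formulations normalize them into a probability distribution.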