2.1 PAGE CHANGE DETECTION ALGORITHM
2.1.1 Introduction: About 60% of the content on the web is dynamic. It is quite possible that, after a particular web page has been downloaded, the local copy residing in the repository of web pages becomes obsolete compared to the copy on the web. A need therefore arises to update the database of web pages. Once a decision has been taken to update the pages, it should be ensured that minimal resources are used in the process. This can be done by updating only those elements of the database that have actually undergone a change. The importance of the web pages to be downloaded has been discussed in the preceding section, which also checks whether a page is already present in the database and lowers its priority value if it is referred to rather frequently. In this section, we discuss algorithms to derive certain parameters that help in deciding whether a page has changed or not. These parameters are calculated at the time of page parsing. When the client encounters the same URL again, it simply recalculates the code by parsing the page, without downloading the page again into the repository, and compares it with the stored parameters. If a change in the parameters is detected, it is concluded that the page has changed and needs to be downloaded again; otherwise the URL is discarded immediately without further processing (a minimal sketch of this comparison is given after the list below). The following changes are of importance when considering changes in a web page:
• Change in page structure.
• Change in text contents.
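For concreteness, the fragment below sketches one way such parameters might be computed and compared, assuming a simple pair of checksums: one over the tag structure and one over the visible text. The function and parameter names are illustrative assumptions, not the exact code derived later in this section.

import hashlib
from html.parser import HTMLParser

class PageCodeParser(HTMLParser):
    """Collects the tag sequence (structure) and the visible text of a page."""
    def __init__(self):
        super().__init__()
        self.tags = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)

def page_parameters(html):
    """Derive two illustrative change-detection parameters from a page."""
    parser = PageCodeParser()
    parser.feed(html)
    structure_code = hashlib.md5(" ".join(parser.tags).encode("utf-8")).hexdigest()
    text_code = hashlib.md5(" ".join(parser.text_parts).encode("utf-8")).hexdigest()
    return structure_code, text_code

def page_changed(stored_params, html):
    """Compare freshly computed parameters with the ones stored at the last visit."""
    return page_parameters(html) != stored_params

On the first visit, the pair returned by page_parameters is stored against the URL; on a later visit, page_changed decides whether the page must be re-downloaded into the repository.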
The first versions of the WWW (what most people call “The Web”) provided means for people around the world to exchange information, to work together, to communicate, and to share documentation more efficiently. Tim Berners-Lee wrote the first browser (called the WWW browser) and web server in March 1991, allowing hypertext documents to be stored, fetched, and viewed. The Web can be seen as a tremendous document store from which these documents (web pages) can be fetched by typing their address into a web browser. To make that possible, two important techniques were developed. First, a language called Hypertext Markup Language (HTML) tells computers how to display documents that contain text, photos, sounds, video, animation, and interactive content.
In this paper the authors focus on the gap in understanding how complex individual Web sites are and how this complexity affects users' performance. They characterize Web sites both at the content level (e.g., number and size of images) and at the service level (e.g., number of servers/origins). Some categories, such as 'News', turn out to be more complex than others. Out of a hundred Web sites, 60% fetched content from at least five non-origin sources, and these sources contributed more than 35% of the bytes downloaded. In addition, the authors examine which metrics are most suitable for predicting page render and load times and find that the number of objects requested is the most important factor. With respect to variability in load times, however, they find that the number of servers is the best indicator.
In the present web-savvy era, URL is a fairly basic abbreviation that is widely used as a word in itself, without much thought for what it stands for or what it comprises. In this paper, the fundamental ideas of URLs and Internet cookies are discussed, with a focus on their significance from an analytics perspective.
A crawler must avoid overloading Web sites or network links while doing its task. Unless it has unlimited computing resources and unlimited time, it must carefully decide which URLs to scan and in what order, since it deals with huge volumes of data. A crawler must also decide how frequently to revisit pages it has already seen, in order to keep its client informed of changes on the Web.
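To make the ordering and revisiting decisions concrete, the sketch below models the crawl frontier as a priority queue keyed by a due time and an importance score; the scoring and the revisit delay are illustrative assumptions, not a policy prescribed by the cited work.

import heapq
import time

class Frontier:
    """Toy crawl frontier: URLs are popped in order of due time, then priority."""
    def __init__(self):
        self._heap = []

    def schedule(self, url, priority, delay=0.0):
        """Queue a URL with an importance score; a delay > 0 schedules a revisit."""
        heapq.heappush(self._heap, (time.time() + delay, -priority, url))

    def next_url(self):
        """Return the most urgent URL that is due, or None if nothing is due yet."""
        if self._heap and self._heap[0][0] <= time.time():
            _, _, url = heapq.heappop(self._heap)
            return url
        return None

After a page is fetched, it can be re-scheduled with a delay reflecting how often it is expected to change, so that frequently changing pages are revisited more often than stable ones.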
HTML is the basic language understood by all WWW (World Wide Web) clients. It can be rendered on a PC under any operating system such as Windows, Mac, or Linux, or on a Unix workstation. However, it is intentionally limited in its computational power, which helps prevent the execution of dangerous programs on the client machine. Web programmers, who are now much more sophisticated in their applications, provide different types of services to meet a growing demand for interactive content. Today, most users have competent client machines that are capable of doing much more than HTML allows. Fortunately, there is steady development in the field, and the number of capable applications is expanding; database-driven websites can now be built easily with various available tools and technologies.
Nevertheless, it has attracted enormous attention only in recent years [41-58, 60-64]. Targeted crawlers focus the crawling process on a certain set of topics that characterize a narrow area of the web. A focused or topical web crawler attempts to download pages relevant to a set of pre-defined topics. Hyperlink context forms an important part of web-based information retrieval. Topical crawlers follow the hyperlinked structure of the web, using the available sources of information to direct themselves towards topically relevant pages. To derive the required information, they mine the contents of pages that have already been fetched in order to prioritize the fetching of unvisited pages. Topical crawlers depend especially on contextual information, because they need to predict the benefit of downloading unvisited pages based on the information derived from pages that have already been downloaded. One of the most common predictors is the anchor text of the hyperlinks [59]. Domain-specific search engines use these targeted crawlers to download selected pages.
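As an illustration of how anchor text can act as such a predictor, the sketch below scores an unvisited link by the overlap between its anchor text and a set of topic keywords; this scoring heuristic is assumed for the example and is not the specific predictor used in the cited works.

def anchor_score(anchor_text, topic_keywords):
    """Fraction of topic keywords that appear in the link's anchor text."""
    words = set(anchor_text.lower().split())
    if not topic_keywords:
        return 0.0
    hits = sum(1 for kw in topic_keywords if kw.lower() in words)
    return hits / len(topic_keywords)

# Example: prioritize links whose anchor text mentions the topic.
topic = {"solar", "energy", "photovoltaic"}
links = [("http://example.org/a", "cheap solar energy panels"),
         ("http://example.org/b", "contact us")]
ranked = sorted(links, key=lambda link: anchor_score(link[1], topic), reverse=True)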
Focused crawlers “seek, acquire, index, and maintain pages on a specific set of topics that represent a narrow segment of the web” (Chakrabarti et al. 1999). The need to collect high-quality, domain-specific content gives rise to several important characteristics for such crawlers. Some of these characteristics are specific to focused and/or hidden web crawling, while others are relevant to all types of spiders. Important considerations for hidden web spiders include accessibility, collection type and content richness, URL ordering features and techniques, and collection update procedures.
The Internet Archive preserves the live web by saving snapshots of websites taken at specific dates, which can be browsed or searched for various reasons. Its objective is to save the whole web without favoring any particular language, domain, or geographical location. The importance of archiving makes it necessary to check its coverage. In this paper, we try to determine how well Arabic websites are archived and indexed, and whether the number of archived and indexed websites is affected by country code top-level domain, geographic location, creation date, and depth. We also crawled Arabic hyperlinks and checked their archiving and indexing.
A URL (Uniform Resource Locator) is defined as human-readable text designed to be used in place of IP addresses. Computers use these text-based addresses to communicate with servers. Entering a URL in a web browser is the mechanism for retrieving an identified resource. A URL has many important aspects, but perhaps the most important is its ease of discovery. Visitors on the web have to be able to find a website based on the URL name. All major search engines (Google, Bing, etc.) return search results extracted from millions of web pages based on what the search engine considers most relevant to the user. Search results listed on a search engine are ranked based on relevancy, and how well the content on a website corresponds with its URL is part of that ranking. A search engine optimization (SEO) analyst's job is to find, attract, and engage internet users. To make sure a website is easily discoverable, a URL should be tailored to the content of the website. There are a number of factors that should be considered when creating a URL, and they are discussed in this report.
In 1990, Tim Berners-Lee, who invented the World Wide Web and provided the theoretical and technological background for a new hypertext-based linked information system, pointed out the problem of keywords: searching for a particular piece of information, document, or web page is a far more complex and lengthier process than it should be, mainly because two people never seem to choose the same keyword for the same concept (Berners-Lee, 1990). This problem becomes more and more acute as we enter the age of the Social Web, characterized by collaborative and continuous creation, adaptation, and alteration of content. The first generation of web tools, between 1990 and 2003, allowed users to publish information on static pages which could be read using a web browser.
A slightly different implementation is employed in the content-based approach, where the main idea is to compare the contents of the site rather than its URL. The content-based approach is also known as the visual-similarity-based approach, in which contents such as text, images, and styles are compared with the contents of the original site and the similarity is evaluated. This process is lengthier and more time-consuming because the entire content is compared before the decision is made.
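A minimal sketch of such a comparison on the textual part of two pages is given below, using Jaccard similarity over word sets and an assumed decision threshold; real content-based systems also compare images and styles, which this sketch omits.

def jaccard_similarity(text_a, text_b):
    """Jaccard similarity of the word sets of two pages' visible text."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

def looks_like_copy(original_text, candidate_text, threshold=0.8):
    """Flag the candidate page if its text is very similar to the original site's text."""
    return jaccard_similarity(original_text, candidate_text) >= threshold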
INTRODUCTION: In this section we propose a method to derive a code for images to determine whether they have undergone a change or not. Ideally, a change in a hyperlink to an image will be reflected in the label of that hyperlink, and the same will be captured by the formula proposed above. However, if the text does not change but the image itself is replaced, the change will still go undetected. We propose the following method for image change detection:
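Before the method itself, the fragment below illustrates the general idea of fingerprinting an image so that a replaced image can be noticed even when its surrounding text is unchanged; the byte-level hash used here is only an assumed illustration and is not necessarily the code derived by the proposed method.

import hashlib
import urllib.request

def image_fingerprint(image_url):
    """Hypothetical fingerprint: a hash of the raw image bytes at image_url."""
    with urllib.request.urlopen(image_url) as response:
        return hashlib.sha1(response.read()).hexdigest()

def image_changed(stored_fingerprint, image_url):
    """An image is considered changed when its current fingerprint differs."""
    return image_fingerprint(image_url) != stored_fingerprint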
Although association rule methods have advantages, they also have limitations that may cause information loss. Typical association rules concentrate on the co-occurrence of items, such as purchased products or visited web pages, within the transaction set. A single transaction can be a payment for purchased products or services, an order containing a set of items, or a historical session in a web portal. The assumed independence of items (products, web pages) is one of the most significant hypotheses of the technique, but it is not fulfilled in the web domain. Web pages are linked to each other by hyperlinks, and these largely determine all potential navigational paths. A user can enter the required web page address (URL) directly into a browser, but most navigation is completed with the help of hyperlinks created by site administrators. Hence, the web structure strongly constrains the lists of visited pages (user sessions), which are not independent of one another in the way products in a typical store are. To access a page, the user is usually forced to follow a path of intermediate pages.
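To make the co-occurrence idea concrete, the toy fragment below counts how often pairs of pages appear together in user sessions, which is the raw statistic from which the support of association rules is computed; the sample sessions are invented purely for illustration.

from itertools import combinations
from collections import Counter

# Invented example sessions: each is the set of pages visited by one user.
sessions = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/about"},
]

pair_counts = Counter()
for session in sessions:
    for pair in combinations(sorted(session), 2):
        pair_counts[pair] += 1

# Support of a pair = co-occurrence count divided by the number of sessions.
support = {pair: count / len(sessions) for pair, count in pair_counts.items()}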
A Web crawler is a type of computer program that browses the World Wide Web in a methodical, automated manner. Cothey (2004) affirms that Web crawlers are used to generate a copy of all the visited web pages (p. 1230). These pages are later processed by a search engine that indexes the downloaded pages to provide quick searches. Crawlers can also be applied to automating maintenance tasks on a Web site, such as checking links or verifying HTML code. Crawlers are also employed to gather specific types of data from Web pages, such as e-mail addresses. Web search engines are gradually becoming more essential as the main means of tracking relevant information.
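As a rough illustration of this browse-and-copy behaviour, the sketch below fetches a page, stores a copy, extracts its hyperlinks, and continues breadth-first from a seed URL; the library choices (urllib, a regular expression for href extraction) and the page limit are assumptions made for the sake of a short example.

import re
import urllib.request
from collections import deque

def crawl(seed_url, max_pages=10):
    """Breadth-first toy crawler: returns a dict mapping URL -> downloaded HTML."""
    frontier = deque([seed_url])
    copies = {}
    while frontier and len(copies) < max_pages:
        url = frontier.popleft()
        if url in copies:
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip pages that cannot be fetched
        copies[url] = html
        # Naive link extraction; a real crawler would use an HTML parser.
        for link in re.findall(r'href="(http[^"]+)"', html):
            frontier.append(link)
    return copies

A production crawler would additionally respect robots.txt, rate-limit requests per host, and hand the downloaded pages to an indexer, as described above.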