Wills CE, Mikhailov M. Towards a better understanding of Web resources and server responses for improved caching. Proceedings of the Eighth World-Wide Web Conference, 1999.
....explains some factors in the survival and change dynamics of documents. Cho et al. [9] computed the lifespan of pages in five different domains, namely .gov, .net, .org, .edu, and .com, and showed that it varies widely. Smaller studies on how often web pages change were performed by Wills et al. [19] and Douglis et al. [11]. Huberman et al. [1] presented a theory for the growth dynamic of the Web that takes into account the growth rates in the number of pages per site, as well as the fact that new sites are created at different times. Brewington [7] developed a different model of web page ....
C. Wills and M. Mikhailov. Towards a better understanding of web resources and server responses for improved caching. In Proc. 8th WWW Conf, 1999.
....of terabytes. The growth rate of the Web is even more dramatic. According to [41, 42] the size of the Web has doubled in less than two years, and this growth rate is projected to continue for the next two years. Aside from these newly created pages, the existing pages are continuously updated [52, 58, 24, 17]. For example, in our own study of over half a million pages over 4 months [17] we found that about 23% of pages changed daily. In the .com domain 40% of the pages changed daily, and the half life of pages is about 10 days (in 10 days half of the pages are gone, i.e. their URLs are no longer ....
....is more meaningful. 2. How should the crawler refresh pages? Once the crawler has downloaded a significant number of pages, it has to start revisiting the downloaded pages in order to detect changes and refresh the downloaded collection. Because Web pages are changing at very different rates [18, 58], the crawler needs to carefully decide what page to revisit and what page to skip, because this decision may significantly impact the freshness of the downloaded collection. For example, if a certain page rarely changes, the crawler may want to revisit the page less often, in order to visit ....
Craig E. Wills and Mikhail Mikhailov. Towards a better understanding of web resources and server responses for improved caching. In Proceedings of the Eighth International World-Wide Web Conference, 1999.
....and the time of day that modifications occur. Since people generally determine what they will access based on categories (e.g. news, business) rather than content type (e.g. text, video, GIF) it would be more informative to measure the data from that realistic perspective. Wills and Mikhailov [15] did a similar study to determine the nature and rate of change of documents using URLs rather than testing log samples. They use MD5 Checksum to obtain the difference in documents [15]. The MD5 checksum algorithm has been found to produce collisions when computing the hash functions [5] ....
....GIF) it would be more informative to measure the data from that realistic perspective. Wills and Mikhailov [15] did a similar study to determine the nature and rate of change of documents using URLs rather than testing log samples. They use MD5 Checksum to obtain the difference in documents [15]. The MD5 checksum algorithm has been found to produce collisions when computing the hash functions [5]. Therefore, our study will make use of the UNIX DIFF and CMP commands to determine the difference among documents. These commands provide less possibility of error than the computation of a ....
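The two change-detection approaches contrasted in the excerpt above can be sketched as follows. This is a minimal illustration rather than the method of either study: the function names are hypothetical, and the byte-for-byte comparison stands in for what a tool such as `cmp` reports.

```python
import hashlib


def md5_digest(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a document body."""
    return hashlib.md5(data).hexdigest()


def changed_by_checksum(old: bytes, new: bytes) -> bool:
    """Checksum-based detection: only the two digests are compared.

    Cheap to store (one digest per snapshot), but two different bodies
    could in principle share a digest (a collision).
    """
    return md5_digest(old) != md5_digest(new)


def changed_by_bytes(old: bytes, new: bytes) -> bool:
    """Byte-for-byte detection, analogous to what `cmp` reports.

    Cannot be fooled by a collision, but both full bodies must be kept.
    """
    return old != new


# Hypothetical snapshots of the same page taken at two points in time.
snapshot_1 = b"<html><body>Headlines for Monday</body></html>"
snapshot_2 = b"<html><body>Headlines for Tuesday</body></html>"
```

For ordinary pages both functions agree. They can differ only when two distinct bodies hash to the same digest, which is the error case the excerpt is concerned with.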
C. E. Wills and Mikhail Mikhailov. Toward a better understanding of web resources and server responses for improved caching. In 8th International World-wide Web Conference, 1999. ~mikhail/papers/www8.ps.gz>.
....search engines has not improved much over the past few years [20, 21]. Even with increasing hardware and bandwidth resources at their disposal, search engines cannot keep up with the growth of the Web. The retrieval challenge is further compounded by the fact that Web pages also change frequently [36, 7, 4]. For example Cho and Molina, in their daily crawl of 720,000 pages for a period of about 4 months, found that it takes only about 50 days for 50% of the Web to change [7]. They also found that the rate of change varies across domains. For example it takes only 11 days in the .com domain versus 4 ....
CE Wills and M Mikhailov. Towards a better understanding of Web resources and server responses for improved caching. In Proc. 8th Intl. World Wide Web Conference, 1999.
....reachable. Brewster Kahle in 1997 estimated that 600GB of the Web changes every month [65] and that the average lifetime of a URL is 44 days. Moreover, Cho 17 and Garcia Molina find that different pages change at different rates [43]. In general, where links don't rot pages change at a high rate [95, 105, 42]. 5 How Much of the Web is Indexed Some of the Web numbers became widely known due to this famous paper by Lawrence and Giles [74] on the growth of the Web and difficulty search engines have in finding anything (only 30 to 40% of the Web is indexed). For an update of this study, see [75]. As of ....
C. E. Wills and M. Mikhailov. Towards a better understanding of Web resources and server responses for improved caching. In Proceedings of the 8th International World Wide Web Conference (WWW8), 1999.
....transfers and other kinds of waste. However, redundant transfers can also occur if mechanisms introduced into HTTP/1.1 to improve cache correctness are used in strange but compliant ways. For instance, identical payloads served by a single site are sometimes accompanied by different entity tags [55], causing new-style If-None-Match revalidation attempts to fail where old-fashioned If-Modified-Since requests might succeed. In this case, the server is compliant with the specification, but not with the most efficient possible implementation. full trace reduced trace Clients 37,201 37,165 ....
....different entity tags. This curious phenomenon can cause If-None-Match revalidation attempts to fail needlessly, resulting in redundant payload transfers. Other researchers have explained this problem, which arises when large server farms fail to harmonize entity tags across server replicas [55]. Mogul investigated erroneous HTTP timestamps in a large trace and reported that 38% of responses contained impossible Date header values, and 0.3% had impossible Last-Modified values [34]. Some timestamp errors might cause transparency failures; others might cause needless revalidations. Wills ....
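The revalidation failure described in the excerpts above can be illustrated with a small sketch. It is a deliberately simplified model of conditional requests, not a full HTTP/1.1 implementation: entity tags are compared as exact strings, lists and weak validators are ignored, and the `validate` function, the tag values, and the dates are all hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def validate(request_headers: dict, etag: str, last_modified: datetime) -> int:
    """Return 304 if the client's cached copy validates, else 200.

    Simplified model: If-None-Match takes precedence when present;
    otherwise If-Modified-Since is checked against the modification time.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        return 304 if if_none_match == etag else 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        since = parsedate_to_datetime(if_modified_since)
        return 304 if last_modified <= since else 200

    return 200  # unconditional request: send the full payload


# Two replicas serve the same unchanged payload with the same modification
# time but different (hypothetical) entity tags.
modified = datetime(1999, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
etag_replica_a = '"a1"'
etag_replica_b = '"b7"'

# The client cached the response from replica A, then revalidates at replica B.
status_etag = validate({"If-None-Match": etag_replica_a}, etag_replica_b, modified)
status_date = validate(
    {"If-Modified-Since": format_datetime(modified, usegmt=True)},
    etag_replica_b,
    modified,
)
```

Under these assumptions `status_etag` is 200 (a redundant full transfer) while `status_date` is 304, which mirrors the behaviour the excerpts describe for unharmonized entity tags.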
C. E. Wills and M. Mikhailov. Towards a better understanding of Web resources and server responses for improved caching. In Proc. 8th WWW Conf., May 1999.
....waste. Even fully compliant behavior can lead to redundant payload transfers as mechanisms introduced into HTTP/1.1 to improve cache correctness sometimes lead to reduced performance. For instance, identical data payloads served by a single site are sometimes accompanied by different entity tags [55], causing new-style If-None-Match revalidation attempts to fail where old-fashioned If-Modified-Since requests might have succeeded. In this case, the server is compliant with the specification, but not with the most efficient possible implementation. Furthermore, several common practices ....
....different entity tags. This curious phenomenon can cause If-None-Match revalidation attempts to fail needlessly, resulting in redundant payload transfers. Other researchers have explained this problem, which arises when large server farms fail to harmonize entity tags across server replicas [55]. Mogul investigated erroneous HTTP timestamps in a large trace and reported that 38% of responses contained impossible Date header values, and 0.3% had impossible Last-Modified values [34]. Some timestamp errors might cause transparency failures; others might cause needless revalidations. Wills ....
C. E. Wills and M. Mikhailov. Towards a better understanding of Web resources and server responses for improved caching. In Proc. 8th WWW Conf., May 1999.
....2.2 and having 100 Mbps network connections. Tables 3 and 4 show the distribution of responses sent by the server proxy to the client proxies and, respectively, the bandwidth savings from using compression and delta encoding. We noticed a strange phenomenon during this experiment (also noted by [4]) in which servers returned full responses for resources that had not changed. This behavior can be attributed to cases where the last modification times, etags or important headers (such as Expires) have changed, but the actual resources did not. Also, we noticed that most of these responses ....
....accesses was 25.4 hours, although spikes in access probability were present at certain intervals such as 1 minute and 1 day, and 5) 22% of the resources referenced were accessed more than once, but about half of the references were to those multiply referenced resources. In a different study [4], the researchers retrieved a set of popular resources (HTML and images) at fixed intervals for a period of time, and determined if resources have changed based on the content MD5 checksums. They found that for a significant portion of the resources retrieved the content does not change, but the ....
Craig E. Wills and Mikhail Mikhailov. "Towards a Better Understanding of Web Resources and Server Responses for Improved Caching." Technical Report WPI-CSTR-98-27, Computer Science Department, Worcester Polytechnic Institute, December 1998.
....Eventually, the repository may need to support ordered streams, where pages can be returned at high speed in some order. For instance, a data mining application may wish to examine pages by increasing modified date, or in decreasing page rank. Large updates: The web changes rapidly [12][8][16]. Therefore, the repository needs to handle a high rate of modifications. As new versions of web pages arrive, the space occupied by old versions must be reclaimed (unless a history is maintained, which we do not consider here). This means that there will be substantially more space compaction or ....
Craig E. Wills and Mikhail Mikhailov, Towards a better understanding of web resources and server responses for improved caching, Proc. of the 8th Intl. WWW Conference, May 1999.
....the UpdateModule, to improve freshness of the collection. We believe these references are complementary to our work, because we present an incremental crawler architecture, which can use any of the algorithms in these papers. References [13] and [6] experimentally study how often web pages change. Reference [11] studies the relationship between the desirability of a page and its lifespan. However, none of these studies are as extensive as ours in terms of the scale and the length of the experiment. Also, their focus is different ....
....study how often web pages change. Reference [11] studies the relationship between the desirability of a page and its lifespan. However, none of these studies are as extensive as ours in terms of the scale and the length of the experiment. Also, their focus is different from ours. Reference [13] investigates page changes to improve web caching policies, and reference [11] studies how page changes are related to access patterns. 7 Conclusion In this paper we have studied how to build an effective incremental crawler. To understand how the web evolves over time, we first described a ....
C. E. Wills and M. Mikhailov. Towards a better understanding of web resources and server responses for improved caching. In Proceedings of the 8th World-Wide Web Conference, 1999.
....use of the model and some of its variants on simulated but realistic data. 2. PREVIOUS WORK There have been several studies of web crawling in its relatively short history, but most of them have had a focus rather different from ours. Some have concentrated on aspects relating to caching, e.g. [13] and [9]. Others have been principally interested in the most efficient and effective way to update a fixed size database extracted from the web, often for some specific function, such as data mining, see e.g. the work of Cho et al. [5, 6, 7]. These studies were performed over time periods ranging from a ....
....the study. They show that the rates of change of pages they crawled can be approximated by a Poisson distribution, with the proviso that the figures for pages which change more often than daily or less often than four monthly are inaccurate. Using different collection procedures, Wills and Mikhailov [13] derive similar conclusions. A disadvantage of all these models is that they deal only with a fixed size repository of a limited subset of the web. In contrast, our model is flexible, adaptive, based upon the whole web and caters gracefully for its growth. Table 1: Cumulative Probability ....
C. Wills and M. Mikhailov. Towards a better understanding of web resources and server responses for improved caching. In Proceedings of the 8th World Wide Web Conference (WWW8), 1999.
....It is impossible for caches to deterministically know when cached objects EO1 EO5 become stale because servers cannot accurately predict object expiration times and heuristic TTLs are imprecise by definition. Also, servers can inadvertently provide misleading expiration and last modification times [15]. As a result, caches may serve stale objects to their clients. Caches also generate unnecessary traffic and place additional load on the origin servers when they validate objects that have expired in the cache but are unchanged at the origin server. Studies show that such validation requests ....
Craig E. Wills and Mikhail Mikhailov. Towards a better understanding of Web resources and server responses for improved caching. In Eighth International World Wide Web Conference, Toronto, Canada, May 1999.
....The algorithms described in these references can be used for the UpdateModule, to improve freshness of the collection. We believe these references are complementary to our work, because we present an incremental crawler architecture, which can use any of the algorithms in these papers. References [WM99] and [DFK99] experimentally study how often web pages change. Reference [PP97] studies the relationship between the desirability of a page and its lifespan. However, none of these studies are as extensive as ours in terms of the scale and the length of the experiment. Also, their focus is ....
....study how often web pages change. Reference [PP97] studies the relationship between the desirability of a page and its lifespan. However, none of these studies are as extensive as ours in terms of the scale and the length of the experiment. Also, their focus is different from ours. Reference [WM99] investigates page changes to improve web caching policies, and reference [PP97] studies how page changes are related to access patterns. 7 Conclusion In this paper we have studied how to build an effective incremental crawler. To understand how the web evolves over time, we first described a ....
Craig E. Wills and Mikhail Mikhailov. Towards a better understanding of web resources and server responses for improved caching. In Proceedings of the 8th World-Wide Web Conference, 1999.
No context found.
Craig E. Wills and Mikhail Mikhailov. Towards a better understanding of web resources and server responses for improved caching. In Proceedings of the Eighth International World Wide Web Conference, May 1999.
....It is impossible for caches to deterministically know when cached objects EO1 EO5 become stale because servers cannot accurately predict object expiration times and heuristic TTLs are imprecise by definition. Also, servers can inadvertently provide misleading expiration and last modification times [15]. As a result, caches may serve stale objects to their clients. Caches also generate unnecessary traffic and place additional load on the origin servers when they validate objects that have expired in the cache but are unchanged at the origin server. Studies show that such validation requests ....
Craig E. Wills and Mikhail Mikhailov. Towards a better understanding of Web resources and server responses for improved caching. In Eighth International World Wide Web Conference, Toronto, Canada, May 1999.
No context found.
Wills CE, Mikhailov M. Towards a better understanding of Web resources and server responses for improved caching. Proceedings of the Eighth World-Wide Web Conference, 1999.
No context found.
Wills, C. E. and Mikhailov, M. 1999. Towards a better understanding of web resources and server responses for improved caching. In Proceedings of the 8th World-Wide Web Conference.
No context found.
C. E. Wills and M. Mikhailov. Towards a better understanding of web resources and server responses for improved caching. In Proceedings of the International World-Wide Web Conference, May 1999.
No context found.
C. E. Wills and M. Mikhailov. Towards a better understanding of web resources and server responses for improved caching. In WWW8, 1999.
No context found.
Craig E. Wills, Mikhail Mikhailov, "Towards a Better Understanding of Web Resources and Server Responses for Improved Caching", Computer Networks, Vol. 31, No. 11--16 (Proc. of WWW8), pp. 1231--1243, May 1999.
No context found.
C. Wills, and M. Mikhailov, "Towards a Better Understanding of Web Resources and Server Responses for Improved Caching," The Eighth International World Wide Web Conference, Toronto, Canada, May 1999
No context found.
C. E. Wills and M. Mikhailov. Towards a better understanding of web resources and server responses for improved caching. In Proceedings of WWW Conference, May 1999.
No context found.
Craig E. Wills, Mikhail Mikhailov, "Towards a Better Understanding of Web Resources and Server Responses for Improved Caching", Computer Networks, Vol. 31, No. 11--16 (Proc. of WWW8), pp. 1231--1243, May 1999.
No context found.
C. E. Wills and M. Mikhailov. Towards a better understanding of Web resources and server responses for improved caching. Computer Networks (Amsterdam, Netherlands: 1999.
No context found.
C. E. Wills and M. Mikhailov. Towards a better understanding of web resources and server responses for improved caching. In Proceedings of the International World-Wide Web Conference, May 1999.