| Surajit Chaudhuri, Rajeev Motwani, and Vivek R. Narasayya. On random sampling over joins. In Proc. of SIGMOD, pages 263--274, Philadelphia, PA, 1999. |
....to choose the cheapest of the possible equivalent expressions, it requires a simple cost model that can predict the cardinalities of intermediate results. Such a cardinality estimate draws heavily on a selectivity model. As such, selectivity estimation has been subject to extensive research [9, 66, 42, 30, 39, 11, 32, 20, 45, 10]. 9 In [25, 38] predicate reordering techniques have been proposed that are even more sophisticated. NF databases require a more sophisticated optimizer in general since the NF paradigm allows nesting of relations [52] This also holds for OO databases. As mentioned in Section 2 the modern ....
S. Chaudhuri, R. Motwani, and V. Narasayya. On Random Sampling over Joins. In Data, pages 263--274. ACM Press, 1999.
....employ uniform random sampling, to pick a set of points and guide the subsequent clustering e ort. In data mining, Toivonen [Toi96] examined the problem of using sampling during the discovery of association rules. Sampling has also been recently successfully applied in query optimization [CMN99, CMN98] as well as approximate query answering [GM98, AGPR99] Independently Palmer and Faloutsos [PF00] developed an algorithm to sample for clusters by using density information, under the assumption that clusters have a zip an distribution. Their technique is designed to nd clusters when ....
....is small. Biased sampling according to data density can use any density estimation technique. There is a number of methods suggested in the literature employing di erent ways to nd the density estimator for multi dimensional datasets. These include, computing multi dimensional histograms [PI97] CMN99] KJJ99] APR99] using various transforms, like the wavelet transformation [VWI98] MVW98] SVD [PI97] or the discrete cosine transform [LKC99] on the data, using kernel estimators [BKS99] Sco92] Sil86] as well as sampling [OR90] LNS90] HS92] Although our biased sampling technique can use ....
S. Chaudhuri, R. Motwani, and V. Narasayya. On Random Sampling Over Joins. Proceedings of SIGMOD, Philadelphia, PA, pages 263-274, June 1999.
.... for planning global queries in a Web mediator [27] 28] This broad importance of statistics management has led to a plethora of approximation techniques, for which [15] have coined the general term data synopses : advanced forms of histograms [30, 16, 20] spline synopses [22, 23] sampling [6, 17, 14], and parametric curve fitting techniques [34, 9] all the way to highly sophisticated methods based on kernel estimators [2] or Wavelets and other transforms [26, 25, 4] However, most of these techniques take the local viewpoint of optimizing the approximation error for a single data distribution ....
....result approximation) For join queries that access attributes from multiple datasets R 1 , R l it is conceivable to construct a result approximation or result size estimation from multiple synopses. On the other hand, it is known that this approach may lead to unbounded approximation errors [6]. Therefore, we have adopted the approach of [1] to use special join synopses for this purpose. A join synopsis can be viewed as a regular synopsis that is derived from a virtual dataset that materializes the full result of the join. Such a materialized join view does not really have to be stored, ....
[Article contains additional citation context not shown here]
S. Chaudhuri, R. Motwani, and V. R. Narasayya. On Random Sampling over Joins. In Proceedings of the ACM SIGMOD Conference, pages 263--274, 1999.
.... random sampling from the inputs R and S of a join, or biased sampling from R and S without taking the distribution of the other relation into account, can greatly skew the output of the join, and lead in the worst case to an empty join output even though the actual size of the join is very large [8]. Semantic Load Shedding. In this paper, we address the problems outlined above by introducing the notion of semantic load shedding. In semantic load shedding, we approximate the output of an operator by maximizing a userdefined similarity measure between the exact answer and the (approximate) ....
....arrival rates. Load is shed by simple random eviction. Our work addresses more complex memory allocation problems based on the values of single tuples (hence the notion of semantic load shedding introduced in this paper) In that respect our work is also related to uniform sampling over joins [8]. However, our goal is to maximize the accuracy of the output, not its statistical properties (e.g. being a uniform sample) Adaptive query processing systems like Telegraph [18] NiagaraCQ [9] sensor database systems [5] and adaptive techniques as proposed in [7, 20, 24] aim at providing the ....
S. Chaudhuri, R. Motwani, and V. R. Narasayya. On random sampling over joins. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 263--274, 1999.
.... 0 and 1) #: number of pages to sample in each iteration Procedure [1] S = # ## # ##### # Set of sites to be sampled [2] Loop while we have download resources [3] For each site # # S [4] Sample # pages from # # [5] # # = Estimate of # value for # # base on the samples so far [6] ## # ## # # = ### ## confidence interval for # # [7] Compute threshold # # from the distribution of estimated # # s [8] For each Web site # # in S [9] If (# # ## # ) S = S # # # # too low. We do not download from # # [10] If (# # ## # ) download all pages in # # and S = S # # ....
....) S = S # # # # too low. We do not download from # # [10] If (# # ## # ) download all pages in # # and S = S # # # # very high. We download pages from # # Figure 3: Algorithm of the adaptive sampling policy value and its ### ## confidence interval for each site (Steps [3] through [6]) Given the distribution of the estimated # # values, it can predict the threshold # # value (Step [7] For example, if we can download about half of the sites in each download cycle, and if half of the estimated # # s are above ###, # # ####. After estimating # # , it compares the confidence ....
[Article contains additional citation context not shown here]
S. Chaudhuri, R. Motwani, and V. R. Narasayya. On random sampling over joins. In Proc. of SIGMOD conf., 1999.
....or other precomputed summaries of the data. Another example is that data streams are generated as intermediate results of pipelined operators during evaluation of a query plan in an SQL database without materializing some or all of the temporary result, only one pass on the data is possible [3]. In most applications, the goal is to make decisions based on the statistics or models gathered over the recently observed data elements. For example, one might be interested in gathering statistics about packets processed by a set of routers over the last day. Moreover, we would like to ....
S. Chaudhuri, R. Motwani, V. R. Narasayya. On random sampling over joins. In Proc.
....of the sampling operation in the access plan. The naive sampling strategy applies sampling after all other operations have been executed. This in most cases is inefficient, because the time consuming operations like joins work on all the data even if the desired sample size is relatively small. In [2] further join sample strategies were proposed. These also need statistical information from histograms. 3 Reducing the Data for Fast Query Approximation In this section two methodologies of reducing the data are introduced. Afterwards the idea of combining them to a compound technique is ....
S. Chaudhuri, R. Motwani, and V.R. Narasayya. On Random Sampling over Joins. In A. Delis, C. Faloutsos, and S. Ghandeharizadeh, editors, SIGMOD
....of the sampling operation in the access plan. The naive sampling strategy applies sampling after all other operations have been executed. This in most cases is inefcient, because the time consuming operations like joins work on all the data even if the desired sample size is relatively small. In [2] further join sample strategies were proposed. These also need statistical information from histograms. 3 Reducing the Data for Fast Query Approximation In this section two methodologies of reducing the data are introduced. Afterwards the idea of combining them to a compound technique is ....
S. Chaudhuri, R. Motwani, and V.R. Narasayya. On Random Sampling over Joins. In A. Delis, C. Faloutsos, and S. Ghandeharizadeh, editors, SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadephia, Pennsylvania,USA, pages 263--274. ACM Press, 1999.
....the predicted selectivity values match the measured values. 1. 1 Related work In the literature, a large number of e#orts has been reported on the prediction of selectivity factors in di#erent contexts and under di#erent assumptions [Car75, Yao77, IB86, LNS90, CR94, IP95, GGMS96, PIHS96, CMN98, CMN99] Roughly two directions can be distinguished in the prediction of selectivity factors. Research in the first direction has been focussed to the prediction of the number of page or block accesses, to retrieve # tuples from R tuples which are randomly distributed on B blocks. This problem has ....
Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya, On Random Sampling over Joins, Proceedings of the 1999 ACM SIGMOD International Conference on the Management of Data, ACM Press, 1999, pp. 263--274.
....[20] and hierarchical [8] clustering techniques employ uniform random sampling to improve the efficiency of the techniques. Toivonen [28] examined the problem of using sampling during the discovery of association rules. Sampling has also been recently successfully applied in query optimization [6, 5], as well as approximate query answering [7, 1] The focus of this paper is in the design of alternative forms of sampling that can be used to expedite data mining tasks such as cluster and outlier detection. A natural question arises regarding the potential benefits of such sampling techniques. ....
....technique for estimating the density of a dataset is paramount for the success of our method. There is a number of methods suggested in the literature employing different ways to find the density estimator for multi dimensional datasets. These include, computing multi dimensional histograms [23] [6] [16] 2] using various transforms, like the wavelet transformation [30] 19] SVD [23] or the discrete cosine transform [17] on the data, using kernel estimators [4] as well as sampling [21] 18] 10] Although our biased sampling technique can use any density estimation method, kernel density ....
S. Chaudhuri, R. Motwani, and V. Narasayya. On Random Sampling Over Joins. In Proc. of ACM SIGMOD, Philadelphia, PA, pages 263--274, June 1999.
....DBMS. The typical query is also called the all pairs query or spatial distance join , as in the example, Estimate the number of schools that are within 5 miles from libraries . Spatial distance joins are considered to be among the most exxential joins in application areas, like data mining [CMN99] NH 94] They are useful in multiple settings, such as the following. In geographic information systems (GIS) under the nameof overlay queries: for example, Find all houses within 2 miles of a river . In urban planning, business planning, commercial intelligence: Howmany households are ....
S. Chaudhuri, R. Motwani, V. R. Narasayya - "On Random Sampling over Joins". SIGMOD 1999, pp. 263-274.
....not reduce the resource requirements of , it will reduce the output rate. We can also take an existing sample operator and push it down the query plan. However, we must be careful to ensure that we don t introduce biased sampling when we do so, especially in the presence of joins as discussed in [8]. 5.2 Dynamic Approximation In our second and more challenging approach, dynamic approximation, queries are unchanged, but the system may not always provide precise query answers. Dynamic approximation has some important advantages over static approximation: The level of approximation can ....
S. Chaudhuri, R. Motwani, and V.R. Narasayya. On random sampling over joins. In Proc. ACM SIGMOD Intl. Conf. on Management of Data, pages 263--274, Philadelphia, Pennsylvania, June 1999.
....with D at query processing time. This example generalizes to a wider class of queries in which multiple dimension tables are joined to the same fact table which contains the aggregate column. It is known that sampling based methods (including ours) do not work well for more general join queries [5]. 5 Handling low selectivity and small groups: Exploiting workload information In this section, we examine the problem of low selectivity queries and small groups in group by queries. Our approach to this problem is to use weighted sampling (instead of uniform sampling) by leveraging information ....
Chaudhuri S., Motwani R., and Narasayya V. On random sampling over joins. In Proceedings of the ACM SIGMOD Conference, 1998.
No context found.
Surajit Chaudhuri, Rajeev Motwani, and Vivek R. Narasayya. On random sampling over joins. In Proc. of SIGMOD, pages 263--274, Philadelphia, PA, 1999.
No context found.
S. Chaudhuri, R. Motwani, and N. Narasayya. On random sampling over joins. In ACM SIGMOD, 1999.
No context found.
S. Chaudhuri, R. Motwani, and V. Narasayya. On random sampling over joins. In SIGMOD, pages 263--274, 1999.
No context found.
S. Chaudhuri, R. Motwani, and V. Narasayya. On random sampling over joins. In SIGMOD, pages 263--274, 1999.
No context found.
S. Chaudhuri, R. Motwani, and V. Narasayya. On random sampling over joins. In SIGMOD, pages 263--274, 1999.
No context found.
S. Chaudhuri, R. Motwani, and V. R. Narasayya. On random sampling over joins. In Proc. of ACM SIGMOD, 1999.
No context found.
S. Chaudhuri, R. Motwani, and V. Narasayya. On random sampling over joins. In Proc. of the 1999.
No context found.
Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya. On random sampling over joins. In Proceedings of the ACM SIGMOD International Conference on Managment of Data, pages 263--274, Philadelphia, PA, June 1999.
No context found.
S. Chaudhuri, R. Motwani, and V. R. Narasayya. On random sampling over joins. In Proc. of ACM SIGMOD, 1999.
No context found.
S. Chaudhuri, R. Motwani, and V. Narasayya, "On Random Sampling over Joins," Proc. SIGMOD, pp. 263-274, June 1999.
No context found.
S. Chaudhuri, R. Motwani, and V. Narasayya. On random sampling over joins. In SIGMOD Proceedings, 1999.
No context found.
S. Chaudhuri, R. Motwani, V. R. Narasayya. On random sampling over joins. In Proc. 1999.
First 50 documents
Online articles have much greater impact More about CiteSeer.IST at NUS Add search form to your site Submit documents Feedback
CiteSeer.IST at NUS - Copyright Penn State and NEC. Hosted by the School of Computing, National University of Singapore.