1. Field of the Invention
The embodiments of the invention generally relate to sorting data streams and, more particularly, to a method of obtaining uniform data samples from selected intervals in a data stream and of estimating the distance from monotonicity (i.e., sortedness) of the data stream based on the samples.
2. Description of the Related Art
In applications that access data in a streaming fashion (e.g. large data sets and internet packet routing), it is often desirable to estimate the sortedness of the data stream; however, is difficult to estimate sortedness without actually sorting the data stream.
More particularly, a sequence a of length n over an alphabet Σ={1, . . . , m}, is said to be montone (or in increasing sorted order) if:
σ(1)≦σ(2) . . . ≦σ(n)
The distance from monotonicity of a sequence σ denoted by Ed(σ) is the minimum number of edit operations needed to make it monotone. A single edit operation consists of deleting a character and inserting it in a new position. If m=n and σ consists of n distinct characters, then Ed(σ) corresponds to the so-called Ulam distance between σ and the identity permutation. If we think of σ as a deck of cards, then this is the minimum number of moves needed to sort the deck. Thus, it is a natural measure of the degree of sortedness of a sequence.
An estimation of sortedness may be useful in data streams corresponding to network routing. For example, a router is a computer networking device that forwards data packets across a network toward their destinations. In several protocols, including, internet protocols (IP), a packet flow (i.e., a sequence of packets that is sent from a single source to a single destination) is not guaranteed to maintain its order. That is, the packets are not guaranteed to arrive in the order in which they were sent. Typically, packets that arrive out of order indicate that the path used to route the flow is suboptimal. For example, the flow may be routed using multiple paths in the network, and if one of these paths is significantly more congested than the others, packets using this path will be routed much slower than the other packets. Typically, the sender annotates the packets in a flow with increasing numeric identifiers and, therefore, the destination node (and also routers along the way) can estimate the quality of the current routing policy by measuring the sortedness of the received packets.
An estimation of sortedness may also be useful when comparing very long rankings (i.e., ordered lists of distinct items). For example, a ranking may describe all the pages on the web ordered by some score function and there may be a need to compare today's ranking with that of yesterday. In this case, one of the rankings will play the role of the increasing sequence and the crucial issue is to be able to determine the order of two items (according to the first ranking). Clearly, if the first ranking is assumed to be increasing, then an estimation of sortedness can be used immediately to compare the second ranking with the first ranking. Otherwise, this first ranking requires some computation which may be provided by a suitable service (or a server). Even though the ranking is very large, accessing this service may be very fast if it is actually implemented using a large-scale distributed system with many servers (which is a common infrastructure for web search engines).
However, as mentioned above, estimating sortedness in a feasible manner (i.e., without having to scan the data stream more than once and/or without having to actually sort the entire sequence) is difficult. Therefore, there is a need in the art for a method that requires only one pass over a sequence of data elements, adequately samples data elements from the sequence during that one pass and estimates how close the sequence is to being sorted based on the samples.
SUMMARYIn view of the foregoing, disclosed herein is a method of scanning a data stream one time in order to obtain uniform data samples from selected intervals in that data stream as well as a method of using the obtained data samples to estimate the degree of sortedness of that data stream. The degree of sortedness is also referred to as the distance from monotonicity or the number of elements that must be deleted from the data stream in order to place the elements in the data stream in a sorted order (e.g., an increasing value order). More particularly, disclosed herein are two embodiments of a method of sampling elements from a data stream. In each of these embodiments multiple samples are obtained (either all from a single data bucket or each from a corresponding smaller data bucket) and each of the samples obtained comprises a uniform sample from a specified interval immediately prior to a selected point in time. These multiple data samples are then used to estimate the degree of sortedness of the data stream. Also, disclosed are exemplary algorithms that may be used to implement the various embodiments of the invention.
In one embodiment of the sampling method of the invention, the elements of the data stream are scanned in one pass. As the elements are scanned, a predetermined number of the scanned elements are randomly selected for storage in a single data bucket. The single data bucket can be a single storage area within memory, where a collection of the scanned elements, including their attributes (e.g., time scanned), are recorded, stored and maintained. This selection process is random; however, later scanned elements are selected for storage in the data bucket with a greater probability than earlier scanned elements. Once an element is selected, it is stored in the data bucket and the order in which it was scanned-in is recorded. Furthermore, as the scanning process progresses and the data bucket is filled, the elements in the data bucket are periodically modified (i.e., some elements are deleted from the data bucket and others are added). Again this modification process is random; however, earlier scanned elements are selected for removal from storage with a smaller probability than later scanned elements. As the modification process proceeds, the predetermined number of scanned elements in the bucket is maintained.
In addition to storing scanned elements in the single data bucket, at multiple selected times during the scanning process, a sample of the scanned elements from the data bucket is obtained. That is, for each sample associated with a selected time, some of the scanned elements from the data bucket are randomly selected for inclusion in a sample. Thus, multiple samples (i.e., a different sample associated with each of the selected times) are obtained. While the selection of the scanned elements is random, the process is such that each of the samples comprises a uniform sample from a specified interval immediately prior to its associated selected time. More particularly, each of the samples are obtained by selecting a point in time during the scanning process, identifying a specified interval for the sample immediately prior to that selected point in time (e.g., a sample of elements chosen from the last 100 elements scanned) and randomly selecting a second predetermined number of the scanned elements that were stored in the data bucket during that specified interval. Contrary, to the process of selecting scanned elements for inclusion in the bucket, later scanned elements are selected from the data bucket with a lesser probability than earlier scanned elements. Thus, there is a balance of probabilities between selection of scanned elements for inclusion in the bucket and selection of scanned elements for inclusion in the sample, so that any of the elements that were scanned during the specified interval, whether earlier in the interval or later, are included in the sample with equal (i.e., uniform) probability.
In another embodiment of the sampling method of the invention, the elements of the data stream are similarly scanned in one pass. However, as the elements are scanned, a predetermined number of the scanned elements are randomly selected for storage in multiple smaller data buckets. The multiple data buckets can be separate storage areas within memory, where corresponding separate collections of scanned elements, including their attributes (e.g., time scanned) are recorded, stored and maintained. The selection process for each of the buckets is independent and, thus, each of the buckets may contain some of the same data as well as different data, by chance. However, later scanned elements are selected for storage in each of the data buckets with a greater probability than earlier scanned elements. Once an element is selected for storage in a particular data bucket, it is stored in that data bucket and the order in which it was scanned-in is recorded. Furthermore, as the scanning process progresses and each of the data buckets are filled, the elements in the different data buckets are periodically modified (i.e., some elements are deleted from the data bucket and others are added). Again this modification process is random; however, earlier scanned elements are selected for removal from storage with a smaller probability than later scanned elements. As the modification process proceeds, the predetermined number of scanned elements in each of the data buckets is maintained.
In addition to storing scanned elements in the multiple data buckets, at selected times during the scanning process, a sample of the scanned elements is obtained by selecting one or more elements from each of the multiple data buckets. That is, at a selected point in time one or more of the scanned elements from each of the multiple data buckets are randomly selected for inclusion in one of the samples. At another selected point in time one or more of the scanned elements from each of the multiple data buckets are selected for inclusion in another one of the samples, and so on. Thus, multiple samples are obtained. While the selection of the scanned elements is random, the process is such that each of the samples comprises a uniform sample for a specified interval immediately prior to the selected time.
More particularly, each of the samples are obtained by selecting a point in time during the scanning process, identifying a specified interval for the sample immediately prior to that selected point in time (e.g., a sample of elements chosen from the last 100 elements scanned) and randomly selecting a second predetermined number of the scanned elements that were stored in the multiple data buckets during that specified interval. The selected points in time associated with each of the multiple samples may be the same or different. However, if the selected points in time for any of the samples are the same, then the specified interval for those samples should be different. Thus, each of the different samples will be associated with a different interval and, possibly, with a different selected point in time. Furthermore, it is anticipated that the same number of elements (e.g., one) will be selected from each of the multiple data buckets.
Contrary, to the process of selecting scanned elements for inclusion in the data buckets, later scanned elements are selected from the multiple data buckets with a lesser probability than earlier scanned elements. Thus, there is a balance of probabilities between selection of scanned elements for inclusion in each of the buckets and selection of scanned elements for inclusion in each of the samples, so that any of the elements that were scanned during the specified interval, whether earlier in the interval or later, are included in the samples with equal (i.e., uniform) probability.
Once the multiple samples are obtained and, specifically, once the uniform samples associated with specified intervals in the data stream are obtained, then based on these samples, the number of elements that must be deleted from the data stream in order to sort the data stream can be estimated (i.e., the number of elements that must be deleted to place the elements in the data stream in an increasing value order can be estimated). The process of estimating this sortedness number (i.e., the distance from monotonicity) can be accomplished by first identifying the right-most element in each of the multiple samples. Then, for each sample, a determination is made as to whether or not a majority of other elements in the sample have a value that is greater than the right-most element. If so, then the right-most element is placed in a set of right most-elements. The number of elements that must be deleted to place the elements of the data stream in an increasing value order is then estimated as being between half and twice the size of the set.
These and other aspects of the embodiments of the invention will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following descriptions, while indicating preferred embodiments of the invention and numerous specific details thereof, are given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments of the invention without departing from the spirit thereof, and the embodiments of the invention include all such modifications.
BRIEF DESCRIPTION OF THE DRAWINGSThe embodiments of the invention will be better understood from the following detailed description with reference to the drawings, in which:
The embodiments of the invention and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments of the invention. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments of the invention may be practiced and to further enable those of skill in the art to practice the embodiments of the invention. Accordingly, the examples should not be construed as limiting the scope of the embodiments of the invention.
As mentioned above, there is a need in the art for a method that requires only one pass over a sequence of data elements, samples data elements from the sequence during that one pass and estimates how close the sequence is to being sorted. Therefore, disclosed herein is a method of scanning a data stream one time in order to obtain uniform data samples from selected intervals in that data stream as well as a method of using the obtained data samples to estimate the degree of sortedness of that data stream. The degree of sortedness is also referred to as the distance from monotonicity or the number of elements that must be deleted from the data stream in order to place the elements in the data stream in a sorted order (e.g., an increasing value order). More particularly, disclosed herein are two embodiments of a method of sampling elements from a data stream.
In each of these embodiments multiple samples are obtained (either all from a single large data bucket (see
Referring to
In addition to storing scanned elements in the single data bucket (112), at multiple selected times during the scanning process, a sample of the scanned elements from the data bucket is obtained (124-132). That is, for a sample associated with a selected time, some of the scanned elements from the data bucket are randomly selected for inclusion in the sample (128). Thus, multiple samples (i.e., a different sample associated with each of the selected times) are obtained. While the selection of the scanned elements is random, the process is such that each of the samples comprises a uniform sample for a specified interval immediately prior to its associated selected time. More particularly, each of the samples are obtained by selecting a point in time during the scanning process, identifying a specified interval for the sample immediately prior to that selected point in time (e.g., a sample of elements within the last 100 elements scanned) (126) and randomly selecting (128) a second predetermined number (130) of the scanned elements that were stored in the data bucket during that specified interval. Contrary, to the process of selecting scanned elements for inclusion in the bucket, later scanned elements are selected from the data bucket with a lesser probability than earlier scanned elements (132). Thus, there is a balance of probabilities between selection of scanned elements for inclusion in the bucket and selection of scanned elements for inclusion in the sample, so that any of the elements that were scanned during the specified interval, whether earlier in the interval or later, are included in the sample with equal (i.e., uniform) probability.
Referring to
In addition to storing scanned elements in the multiple data buckets, at selected times during the scanning process, a sample of the scanned elements is obtained by selecting one or more elements from each of the multiple data buckets (224-232). That is, at a selected point in time one or more scanned elements from each of the multiple data buckets are randomly selected for inclusion in one of the samples (228). At another selected point in time one or more of the scanned elements from each of the multiple data buckets are similarly randomly selected for inclusion in another one of the samples, and so on. Thus, multiple samples are obtained. While the selection of the scanned elements for inclusion in the sample is random, the process is such that each of the samples comprises a uniform sample for a specified interval immediately prior to the selected time.
More particularly, each of the samples are obtained by selecting a point in time during the scanning process, identifying a specific interval for the sample that is immediately prior to that selected point in time (e.g., a sample of elements chosen from the last 100 elements scanned) (226) and randomly selecting a second predetermined number of the scanned elements that were stored in the multiple data buckets during that specified interval (228-230).
It should be noted that the selected points in time associated with each of the multiple samples may be the same or different. However, if the selected points in time for any of the samples are the same, then the specified interval for those samples should be different. Thus, each of the different samples will be associated with a different interval and, possibly, with a different selected point in time. Furthermore, it is anticipated that the same number of elements (e.g., one) will be selected from each of the multiple data buckets.
Contrary, to the process of selecting scanned elements for inclusion in the data buckets, later scanned elements are selected from the data buckets with a lesser probability than earlier scanned elements (232). Thus, there is a balance of probabilities between selection of scanned elements for inclusion in each of the buckets and selection of scanned elements for inclusion in each of the samples, so that any of the elements that were scanned during the specified interval, whether earlier in the interval or later, are included in the samples with equal (i.e., uniform) probability.
Referring to
While the embodiments of the sampling techniques illustrated in
More particularly, a sequence σ of length n over an alphabet Σ={1, . . . , m}, is said to be montone (or increasing) if:
σ(1)≦σ(2) . . . ≦σ(n)
The distance from monotonicity of a sequence σ denoted by Ed(σ) is the minimum number of edit operations needed to make it monotone. That is, the minimum number of edit operations needed to sort the elements in increasing value order. A single edit operation consists of deleting a character and inserting it in a new position. If m=n and σ consists of n distinct characters, then Ed(σ) corresponds to the Ulam distance between σ and the identity permutation. If we think of σ as a deck of cards, then this is the minimum number of moves needed to sort the deck. Thus, it is a natural measure of the degree of sortedness of a sequence. Equivalently, we can define a single edit operation as modifying the value of σ at a particular location. The minimum number of edits required to make σ montone is still Ed(σ). But this way, a permutation may not stay a permutation.
A subsequence (i1, . . . , ik) of σ where i1<i2 . . . <ik is said to be monotone or increasing if σ(i1)≦σ(i2) . . . ≦σ(ik). Let LIS(σ) denote the length of the largest monotone subsequence of σ. The least number of edit operations required to make σ monotone is to identify a longest increasing subsequence and insert all the other elements into their correct position in this subsequence. So it holds that:
LIS(σ)+Ed(σ)=n
Thus, the embodiments of the method of the invention illustrated in
The problem of computing various metrics on permutations in a streaming fashion has been studied extensively and efficient algorithms are known for several metrics other than edit distance. Ajtai et al. (see reference [1]) and Gupta and Zane (see reference [2]) considered the problem of counting the number of inversions in a stream of numbers and addressed the need for an efficient algorithm for estimating the distance from monotonicity. Cormode et al. (see reference [3]) provided data-stream algorithms for transposition distance and inversion distance. Some results for the related problem of finding the longest increasing sequence in a data stream are given by Liben-Nowel et al. (see reference [4]), but these results do not provide a data-stream algorithm using sub-linear space. The techniques used in the embodiments of the present invention build on a body of work (e.g., see references [5]-[7]) on the subject of property testing algorithms. Informally, property testing algorithms are highly efficient algorithms that estimate a function without reading the entire input, just by probing the input data at a few randomly chosen locations. However, the quality of the estimate is usually quite low.
A first approach to computing ED(σ) could be to relate it to the number of inversions in σ. However, it is well known that these quantities can be far apart. Ergun et al. [5] showed that a suitable variation of this idea can be used to give a lower-bound on ED(σ). Specifically, Ergun et al. considered the set of indices that are endpoints of an interval where the majority of elements are inverted with respect to the endpoint and showed that the cardinality of this set is a lower bound on ED(σ). Extending this observation, Ailon et al. [8] showed that this lower bound actually gives a factor-2 approximation to ED(σ).
The embodiments of the present invention, and particularly, the technique of estimating the distance from monotonicity, illustrate that it actually suffices to consider the set of indices that are right end-points of such an interval (namely, where the majority of elements are inverted with respect to the endpoint), since the cardinality of this set provides a factor-4 approximation to ED(σ) (see processes 306-312). This technical difference from [5] and [8] is in fact crucial to the design of a data-stream algorithm, since when passing over the data from left to right, it is not known how to check if the index is the left endpoint of such an interval. In contrast, embodiments of the present invention disclose a novel sampling scheme that can actually test whether the index is the right endpoint of such an interval. The embodiments of the sampling scheme of the present invention (see
The following discussion provides a more detailed description of the embodiments of the sampling schemes of the present invention (as set out in
The technique used to estimate the degree of sortedness can be characterized via inversions. That is, a pair of indices (i,j) is said to be inverted in σ if i>j but σ(i)<σ(j). For a given index i let Inv(i) denote the set of indices j that are inverted with respect to i. A set R can be defined which consists of all indices i that are the right end-points of intervals where the majority of elements are inverted with respect to i (see items 306-310 of
Interpret the majority as being a strict majority. In particular, if (i−1, i) is an inversion, then i∈R. More generally, for δ≦½ define Rδ to consist of all indices i that are the right end-points of an interval where a δ fraction of elements are inverted with respect to i. Rδ={i∈[n]|∃ j s.t. more than δ fraction of indices in [j, i−1] lie in Inv(i)}
Lemma 1: The following bound holds for all δ≦½: Ed ( σ ) 2 ≤ R ≤ R δ ≤ Ed ( σ ) δ
First, to see that Ed(σ)≦2|R|, it suffices to give an algorithm that deletes at most 2|R| indices and returns an increasing subsequence of σ is shown (see 312 of
Proposition 2: The algorithm deletes at most 2|R| indices and returns an increasing sequence (see 312 of
Proof of Proposition 2: A majority of the indices that are deleted at any step lie in R. To see this, let i∉R and let j be the largest index such that j<i and j does not belong to either of Inv(i) or R. Every element in [j+1, i−1] lies in Inv(i) or R. However, since i∉R, at least half the indices from [j+1, i−1] do not lie in Inv(i) hence they lie in R.
The algorithm returns a subsequence (i1, . . . , ik) so that (il−1, il) is not an inversion. Thus, consecutive elements are in the right order, so the entire sequence is monotone. The inclusion R⊂Rδ follows from the definition. Thus, |R|≦|Rδ|. To prove the upper bound on |Rδ|, fix a set D⊂[n] of indices of size Ed(σ) so that deleting D leaves a monotone sequence. Note that the set D may not unique. Define Dc to be the complement of D, and Sδ to consist of all indices i∈DC that are the right end-points of a interval where a δ fraction of elements lie in D. Sδ={i∈[n]|i∈Dc,∃ j s.t. more than δ fraction of indices in [j, i−1]lie in D}
This j is a witness for the membership of i in Sδ. The algorithm then scans left to right and computes the set Sδ iteratively: Start with the smallest index j∈D. Find the smallest index k>j so that at most δ fraction of [j, k−1] lies in D. Add the indices in [j, k−1]∩Dc to Sδ. Let l be the smallest index greater than k that lies in D. Set j=l and repeat. Proposition 3 _ : For every δ ≤ 1 2 , S δ ≤ Ed ( σ ) ( 1 δ - 1 )
Proof of Proposition 3: Assuming that the set output by the algorithm is in fact Sδ, it is clear that D + S δ ≤ 1 δ D .
The bound follows since |D|=Ed(σ). Sδ must be computed correctly. Furthermore, the algorithm must correctly compute the set Sδ∩[1, l−1]. It is clear that all indices in [1, j] do not lie in Sδ. Fix an index i∈[j, k−1]∩Dc. Since i<k so by the choice of k, at least a δ fraction of [j, i−1] lies in D which shows i∈Sδ.
To show k∉Sδ, let j′ be a potential witness. If j′∈[j, k−1], partition the interval [j, k−1] into [j, j′−1] and [j′, k−1]. By the choice of k, more than δ fraction of [j, j′−1] lies in D. If the same holds for [j′, k−1], then it also holds for [j, k−1] but this contradicts the choice of k. So j′ cannot be a witness. On the other hand, if j′<i, then the ratio of elements from D only decreases, so k∉Sδ. Similarly, any i∈[k, l−1] does not lie in Sδ. Hence, the algorithm correctly identifies the set Sδ∩[1, l−1].
Also, if i>l lies in Sδ, there is always a witness j≧l. This is because if j<l is a witness, the interval [j, i−1] can be partitioned into [j, l−1] and [l, i−1]. The first interval has at most a δ fraction from D since l−1∉Sδ. Hence [l, i−1] contains more than δ fraction from D so l serves as a witness. Proposition 4 _ : For every δ ≤ 1 2 , R δ ≤ Ed ( σ ) δ
Proof of Proposition 4: Rδ can be partitioned into Rδ∩D and Rδ∩Dc. Clearly, |Rδ∩D|≦|D|=Ed(σ) . The size of R∩Dc can be bounded. Note that the set Dc forms an increasing sequence. Thus, if i, j∈Dc, then they are not inverted. Hence, for any i∈Dc, Inv(i)⊂D. Thus, if i∈Rδ∩Dc, then i∈Sδ. Hence, R δ ⋂ D c ≤ S δ ≤ Ed ( σ ) ( 1 δ - 1 )
Both bounds are asymptotically tight. For the first bound, let k<n/4 and take the permutation π=k+1, . . . , n/2, n, . . . ,n−k+1,1, . . . ,k, n/2+1, . . . , n−k . Here Ed(π)=2k, whereas |R|=k. For the second bound, let k<δn. Consider the permutation σ=n,n−1, . . . , n−k+1,1,2 . . . , n−k. In this case it can be verified that Ed(σ)=k, whereas |Rδ|=k/δ−2.
Consequently, the above characterization via inversions, suggests a naive algorithm that may be used to implement the estimation technique of
Referring particularly to
This probability is denoted by p(j, i) and defined to be 1 for j=i. Note that for every j, p(j, i)≧p(j, i+1). Assume a right distribution at time i. To achieve it at time i+1, add the element σ(i) to the bucket, and for each element σ(j) already in the bucket (j<i), retain it with probability p(j, i)/p(j, i+1).
The following proposition upper bounds the size of the sample that is retained, which immediately implies a similar bound on the space (storage requirement) of the algorithm (see process 122 of
Proposition 5: At time i, E[|B|]≦C log2(2i). Further |B|=O(log2i) with probability i−C.
Proof of Proposition 5: Let Xj be the indicator variable for index j being in the bucket at time i. Note that the various Xis are independent and |B|=Σj<iXj since Pr [ X j ] = min ( 1 , C log i ) i - j ) , and E [ B ] = ∑ j < 1 min ( 1 , C log ( 2 i ) i - j ) ≤ ∑ j < 1 C log ( 2 i ) i - j ≤ C log 2 ( 2 i )
The high probability bound can be proved by a Chernoff-type argument.
Furthermore, described below is a procedure using B to test whether a near-majority of elements from Ij=[j, i−1] lie in Inv(i) (see processes 306-308 of
(1) Set Sj←Ø.
(2) For k∈[j,i−1],
(3) If k∈B, add it to Sj with probability p(j,i)/p(k,i).
(4) If at least ( 1 2 - ɛ )
fraction of Sj lies in Inv (i), return Fail.
(5) Else return Pass.
The set Sj is our set of samples from Ij. It is easy to see that for all k∈Ij, Pr[k∈Sj]=p(j, i). Furthermore, the events for different k (but the same j and i) are independent. The ratio S j ⋂ Inv ( i ) S j
is a fairly good approximation to I j ⋂ Inv ( i ) I j .
The error probability of the test is bound by (2i)−O(C), where the constant in the O(C) depends on ε. However, this can be compensated for by choosing C appropriately.
Lemma 6: If a majority of Ij lies in Inv(i), i.e. I j ⋂ Inv ( i ) I j > 1 / 2 ,
then the probability TestBucket (j,i) returns Fail is at least 1−(2i)−O(C).
Proof of Lemma 6: Suppose I j ⋂ Inv ( i ) > I j 2 . Hence , E [ S j ] = ∑ k ∈ I j C log ( 2 i ) j - i = C log ( 2 i ) and E [ | S j ⋂ Inv ( i ) ] = ∑ k ∈ I j ⋂ ( i ) C log ( 2 i ) j - i ≥ 1 2 C log ( 2 i ) .
Then, it can be shown using Chernoff bounds that with probability 1−i−O(C), |Sj|≦(1+ε/2)C log(2i) and |Sj∩Inv(i)|≧(½−ε/2)C log(2i). Thus, |Sj∩Inv(i)|≧(½−ε)|Sj| and Test (B,i) will return Fail.
Lemma 7: If less than (½−3ε) fraction of Ij lies in Inv (i), i.e. I j ⋂ Inv ( i ) I j < ( 1 / 2 - 3 ɛ ) ,
then the probability TestBucket(j,i) returns Pass is at least 1−(2i)−O(C).
Proof of Lemma 7: Suppose |Ij∩Inv(i)|<(½−3ε)|Ij|. Hence , E [ S j ] = ∑ k ∈ I j C log ( 2 i ) j - i = C log ( 2 i ) and E [ | S j ⋂ Inv ( i ) ] = ∑ k ∈ I j ⋂ Inv ( i ) C log ( 2 i ) j - i ≤ ( 1 / 2 - 3 ɛ ) C log ( 2 i ) .
Then, it can be shown using Chernoff bounds that with probability 1−(2i)−O(C) the following bounds hold: |Sj|≧(1−ε)C log(2i) and |Sj∩Inv(i)|≦(½−2ε)C log(2i). Thus, |Sj∩Inv(i)|≦(½−ε)|Sj| and Test (B, i) will return Pass.
An exemplary algorithm to estimate the distance from monotonicity can then be described using Test B (i, j). An estimate d is maintained which is initialized to d=0. For each element i, Test B (i,j) is run for j<i. If one of them returns fail, increment d. The bucket B is updated (at process 116 of
Lemma 8: For every i, with probability 1−Σi(2i)−O(C) the following inclusion holds R⊂{circumflex over (R)}⊂{dot over (R)}1/2−ε.
Proof of Lemma 8: Assume that i∈R and let j be a witness to this. Then, by Lemma 6 running Test (i, j) will return Fail with probability 1−(2i)−O(C). Hence, Pr[i∉{circumflex over (R)}]≦(2i)−O(C). Assume on the other hand that i∉R1/2−3ε. Then, for every j<i, fewer than ½−3ε elements in [j, i−1] belong to (i). By applying Lemma 7 and taking union bound over all i such intervals, the chance that Test B (i, j) returns Fail on any of these intervals is (2i)−O(C). Hence, Pr[i∈{circumflex over (R)}]≦(2i)−O(C).
Hence, the inclusions hold with probability 1−Σi(2i)−O(C), which can be






