Abstract
Most of the histograms, maintained by the actual DBMSs, make the uniform frequency assumption and most commonly approximate all frequencies in a bucket by their average. Thus, these histograms require storing the average frequency for each bucket. Hence, the accuracy of any estimation performed using the histogram depends highly on the technique used for approximating values into each bucket. Several approaches for approximating the set of attribute values with in a bucket have been studied in the literature. Some of histograms record every distinct value that appears in each bucket and other ones make crude assumptions about it. The most significant are the continuous values assumption, the uniform spread assumption and finally, the point value assumption. Other existing approaches are based on sampling techniques to approximate values inside a histogram bucket. The problem here is that all the proposed techniques assume that attribute values have equal spreads. Motivated by the inaccuracy of previous approaches in approximating value sets with non uniform spreads and by the significant estimation error that can be reached with the various assumptions, we need to compute d distinct values v(1), v(2), ..., v(d) that lie between the lowest and highest values in the range of each bucket without making any assumption about the values spreadsheet. For this reason, we propose an efficient algorithm for calculating these d values dynamically as new values are inserted into the attribute. The problem can be returned to calculate values of (d-2) quantiles; namely, the 1/d-, 2/d-, ..., (d-2)/d-quantiles, along with the lowest and highest values in the bucket. For each quantile to be estimated, we maintain a set of five markers that are updated after every new value inserted in the attribute. The results of a set of experiments comparing the accuracy of the proposed algorithm to the uniform spread assumption using various sets of values, over different types of histograms, show the effectiveness of our technique especially when values have non-equal spreads.