Massive Algorithms: kth Quantiles - CLRS Exercises 9.3-6

算法导论 Exercises 9.3-6 - GoAgent - 博客园
The kth quantiles of an n-element set are the k - 1 order statistics that divide the sorted set into
k equal-sized sets (to within 1). Give an O(n lg k)-time algorithm to list the kth quantiles of a set.
问题描述：
一个集合的第k分位数是指这样k - 1个数：
若将一个大小为n的集合从小到大排好序，这k - 1个数能将这个有序的数列分为k组，并且每组元素的个数相差不超过1。
现给出一个长为n的数组（无序），要求以O(n lg k)的时间复杂度找到其 k 分位数。
比如：
8 4 5 3 2 6 1 7 9 排序后是 1 2 3 4 5 6 7 8 9
其 3 分位数是 3 6，将排序后的数列分成 1 2 3； 4 5 6； 7 8 9 三段，每段元素个数均为3。
2 9 3 4 3 3 9 1 3 8 1 排序后是 1 1 2 3 3 3 3 4 8 9 9
其 4 分位数是 2 3 8，将排序后的数列分成 1 1 2； 3 3 3； 3 4 8； 9 9 四段，每段元素个数分别为 3 3 3 2。

解决方案：
首先，我们知道在一个长度为n的无序数组里找到第 i 个数的时间复杂度是O(n)，具体见这篇文章里的算法 ithSmallestLinear 。
所以最直观的方法是调用k次ithSmallestLinear，但这样总时间复杂度是O(kn)而不是题目要求的O(n lg k)。
为了实现从 k 到 lgk 的突破，必然采用分治的方法（我们先假定 n/k 刚好整除）：
0、如果 k == 1 ，return
1、i = k / 2     ……O(1)
    将数组 a 划分成两部分 A 和 B，使得 A 中所有元素不大于 B，且 A 中元素的个数为 tmpSize = (n / k) * i。    ……O(n)
2、在 A 中找第 i 分位数    ……规模减小为 k / 2
3、输出a[tmpSize - 1]    ……O(1)
4、在 B 中找第 k - i 分位数    ……规模减小为 k / 2
由上面的分析，这个算法的时间复杂度满足题设要求O(n lg k)。

如果 n/k 不是刚好整除（假设 x = n%k，x != 0），那么我们定义k分位数的分组为：
前 x 组每组的元素个数为⌊n/k⌋ + 1，后 k - x 组每组元素个数为⌊n/k⌋。
比如15个元素分为4组，则每组为 4 4 4 3 个元素。

对应的，将tmpSize更改为 (n / k) * i + (n % k < i ? n % k : i)
（ PS：若 k==n ，这实际上是一个O(n lg n)的排序算法。）

 3 void kthQuantiles(int a[], int beg, int end, int k)
 4 {
 5     if (k == 1)
 6     {
 7         return;
 8     }
 9 
10     int len = end - beg + 1;
11     int i = k / 2;
12     int tmpSize = (len / k) * i + (len % k < i ? len % k : i);
13     int pivotLoc = ithSmallestLinear(a, beg, end, beg + tmpSize);
14     pivotLoc = partitionSpecifyPivot(a, beg, end, pivotLoc);
15 
16     kthQuantiles(a, beg, beg + tmpSize - 1, i);
17     std::cout << a[beg + tmpSize - 1] << " ";
18     kthQuantiles(a, beg + tmpSize, end, k - i);
19 }

http://clrs.skanev.com/09/03/06.html

If k=1 we return an empty list.
If k is even, we find the median, partition around it, solve two similar subproblems of size ⌊n/2⌋ and return their solutions plus the median.
If k is odd, we find the ⌊k/2⌋ and ⌈k/2⌉ boundaries and the we reduce to two subproblems, each with size less than n/2. The worst case recurrence is:

T (n, k) = 2 T (⌊ n / 2 ⌋, k / 2) + O (n)

Which is the desired bound

O(nlgk).

This works easily when the number of elements is

ak+k−1 for a positive integer

a. When they are a different number, some care with rounding needs to be taken in order to avoid creating two segments that differ by more than 1.

https://github.com/gzc/CLRS/blob/master/C09-Medians-and-Order-Statistics/exercise_code/k-quantile.py
https://github.com/gzc/CLRS/blob/master/C09-Medians-and-Order-Statistics/exercise_code/k-close2median.py

def k_quantiles(items, k):
    index = median_index(len(items))

    if k == 1:
        return []
    elif k % 2:
        n = len(items)
        left_index  = math.ceil((k // 2) * (n / k)) - 1
        right_index = n - left_index - 1

        left  = select(items, left_index)
        right = select(items, right_index)

        partition(items, left)
        lower = k_quantiles(items[:left], k // 2)
        partition(items, right)
        upper = k_quantiles(items[right + 1:], k // 2)

        return lower + [left, right] + upper
    else:
        index = median_index(len(items))
        median = select(items, index)
        partition(items, median)

        return k_quantiles(items[:index], k // 2) + \
                    [median] + \
                    k_quantiles(items[index + 1:], k // 2)

def median_index(n):
    if n % 2:
        return n // 2
    else:
        return n // 2 - 1

def partition(items, element):
    i = 0

    for j in range(len(items) - 1):
        if items[j] == element:
            items[j], items[-1] = items[-1], items[j]

        if items[j] < element:
            items[i], items[j] = items[j], items[i]
            i += 1

    items[i], items[-1] = items[-1], items[i]

    return i

def select(items, n):
    if len(items) <= 1:
        return items[0]

    medians = []

    for i in range(0, len(items), 5):
        group = sorted(items[i:i + 5])
        items[i:i + 5] = group
        median = group[median_index(len(group))]
        medians.append(median)

    pivot = select(medians, median_index(len(medians)))
    index = partition(items, pivot)

    if n == index:
        return items[index]
    elif n < index:
        return select(items[:index], n)
    else:
        return select(items[index + 1:], n - index - 1)

http://blog.csdn.net/z84616995z/article/details/18903483

/*
 *获取随机数，在start，end范围内
*/
int getRank(int start, int end)
{
    int i=-1;
    while(i<start)
    {
        i = rand()%(end+1);
    }
    return i;
}
/*
 *以随机数作为分隔依据
*/
int partition(int *a, int start, int end)
{

    int q = getRank(start,end);
    int p=start,r=end;
    int temp = a[q];
    a[q]=a[p];
    while(p<r)
    {
        while(a[r]>=temp&&p<r)r--;
        a[p]=a[r];
        while(a[p]<temp&&p<r)p++;
        a[r]=a[p];
    }
    a[p]=temp;
    return r;
}

/*
 *k:中位数
*/
int select(int *a, int start, int end, int k)
{

    int q=partition(a,start,end);
    int p=start,r=end;
    int i =q-p+1;
    while(i!=k)
    {
        if(i>k)
        {
            r = q-1;
        }
        else
        {
            p = q+1;
            k = k - i;
        }
        q = partition(a,p,r);
        i = q-p+1;
    }
    return i;
}

/*
 *start,end数组开始和结束
 *numk 该数组中的中位数的个数
 *k 位置
*/
void kth_select(int *a,int *b,int start, int end)
{
   //int q;
   if(end-start+1<=t)
   {
       return;
   }
   int sub_k= (end-start+1)/t;//在子数组中有几个段

   int mid_k = (sub_k+1)/2;//在子数组中 计算出中间位置的k
   int actual_k = t*mid_k; //在子数组中，k的实际位置
   int golbal_k = start-1+actual_k;//k在完整数组中的位置
   //cout<<end<<" "<<start<<" "<<start-1+t*((tempk+1)/2)<<" "<<tempk<<endl;
   select(a,start,end,actual_k);
   b[(golbal_k)/t] = a[golbal_k];//根据golbal_k，可以知道该中位数在哪个段中
   //cout<<a[t*((tempk+1)/2)]<<" "<<t*((tempk+1)/2)<<endl;
   //以中位数为分割点
   kth_select(a,b,start,golbal_k);
   kth_select(a,b,golbal_k+1,end);

}

http://code2.riaos.com/?p=5056537

https://classes.soe.ucsc.edu/cmps102/Fall05/hw/hw5sol.pdf
The k quantiles of an n-element sorted array A are:
A[b1 · n/kc], A[b2 · n/kc], . . . , A[b(k − 1) · n/kc].
We have as inputs an unsorted array A of distinct keys, an integer k, and an
empty array Q of length k − 1 and we want to find the kth quantiles of A.
QUANTILES(A, k, Q)
1. if k = 1 then return
2. else
3. n ← length[A]
4. i ← b k
2
c
5. x ← SELECT (A, bi · n/kc)
6. PARTITION (A, x)
7. Add to list Q: QUANTILES(A[1]..A[bi · n/kc], bk/2c, Q)
8. Add to list Q: QUANTILES(A[bi · n/kc + 1]..A[n], dk/2e, Q)
9. return x
The output Q of the recursive algorithm QUANTILES contains the kth quantiles
of A. The algorithm first finds the key that turns out to be the lower
median of Q, and then partitions the input array about this key. So we have
two smaller arrays that each contain at most (k − 1)/2 of the k − 1 order statistics
of the original array A. At each level of the recursion the number of order
statistics in the array is at most half the number of the previous array.
Consider a recurrsion tree for this algorithm. At the top level we need to find
k − 1 order statistics, and it costs O(n) to find one. The root has two children,
one contains at most b(k−1)/2c order statistics, and the other d(k−1)/2e order
statistics. The sum of the costs for these two nodes is O(n).

At depth i we find 2i order statistics. The sum of the costs of all nodes at
depth i is O(n), for 0 ≤ i ≤ log2
(k − 1), because the total number of elelements
at any depth is n. The depth of the tree is d = log2(k − 1). Hence, the worstcase
running time of QAUNTILES is Θ(n lg k).
https://github.com/lgrcyanny/Algorithm/blob/master/src/com/algorithm/orderstatistics/Quantile.java
Read full article from 算法导论 Exercises 9.3-6 - GoAgent - 博客园

kth Quantiles - CLRS Exercises 9.3-6

Labels

Popular Posts