## Monday, November 30, 2015

### Sampling - Programming Pearls

http://www.hawstein.com/posts/make-thiner-programming-pearls.html

``````void GenKnuth(int m, int n) {
for(int i=0; i<n; ++i) {
if((bigrand()%(n-i)) < m) {
cout<<i<<endl;
--m;
}
}
}``````

``````void GenSets(int m, int n) {
set<int> s;
while(s.size() < m)
s.insert(bigrand() % n);
set<int>::iterator i;
for(i=s.begin(); i!=s.end(); ++i)
cout<<*i<<endl;
}``````

``````void GenShuf(int m, int n) {
int x[n];
for(int i=0; i<n; ++i)
x[i] = i;
for(int i=0; i<m; ++i) {
int j = randint(i, n-1);
swap(x[i], x[j]);
}
sort(x, x+m);
for(int i=0; i<m; ++i)
cout<<x[i]<<endl;
}``````
`http://dongxicheng.org/data-mining/hadoop-sampling/`
(1) 第一种方法来自Knuth的《The art of Computer Programming, Volume 2: Seminumerical Algorithms》

 1 2 3 4 5 6 7 8 9 10 11 12 13 `select = m`   `remaining = n`   `for` `I = [0 n )`   `  ``if` `(bigrand() % remaining) < select`   `    ``print i`   `    ``select—`   `  ``remaining—`

C++的实现如下：
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 `void` `genKnuth(``int` `m, ``int` `n) {`   `  ``for``( ``int` `i = 0; i < n; i++) {`   `   ``if``(bigrand() % (n - i) < m） {`   `    ``cout <

（2）第二种方法的复杂度只与m有关，采用了set（实际上是红黑树）节省时间。代码如下：
 1 2 3 4 5 6 7 8 9 10 11 12 13 `void` `gensets(``int` `m, ``int` `n) {`   `  ``set<``int``> S;`   `  ``while``(S.size() < m) {`   `   ``S.insert(bigrand() % n);`   `  ``}`   `  ``// print S`   `}`

（3）第三种方法克服了（2）的缺点，代码如下：
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 `void` `genshuf(``int` `m, ``int` `n) {`   `  ``int` `i, j;`   `  ``int` `*x = ``new` `int``[n];`   `  ``for``(i = 0; i < n; i++) {`   `   ``x[i] = i;`   `  ``}`   `  ``for``(i = 0; i < m; i++) {`   `   ``j = randint(i, n-1);`   `   ``int` `t = x[i]; x[i] = x[j]; x[j] = x;`   `  ``}`   `  ``sort(x, x+m);`   `  ``//print result`   `}`

（4）当m接近n时，基于集合的算法生成的很多随机数都要丢掉，因为之前的数已经存在于集合中了，为了改进这一点，算法如下：
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 `void` `genfloyd(``int` `m, ``int` `n)`   `{`   `  ``set<``int``> S;`   `  ``set<``int``>::iterator i;`   `  ``for``(``int` `j=n-m; j < n; j++) {`   `   ``int` `t = bigrand()%（j+1);`   `   ``if``(S.find(t) == S.end()){`   `   ``S.insert(t); ``// t not in S`   `  ``} ``else` `{`   `   ``S.insert(j); ``// t in S`   `  ``}`   ` ``}`   `  ``//print results`   `}`