程序员编程艺术:第十章、如何给10^7个数据量的磁盘文件排序



第一节、如何给磁盘文件排序
问题描述:
输入:给定一个文件,里面最多含有n个不重复的正整数(也就是说可能含有少于n个不重复正整数),且其中每个数都小于等于n,n=10^7。
输出:得到按从小到大升序排列的包含所有输入的整数的列表。
条件:最多有大约1MB的内存空间可用,但磁盘空间足够。且要求运行时间在5分钟以下,10秒为最佳结果

1、归并排序。你可能会想到把磁盘文件进行归并排序,但题目要求你只有1MB的内存空间可用,所以,归并排序这个方法不行。
    2、位图方案。熟悉位图的朋友可能会想到用位图来表示这个文件集合。例如正如编程珠玑一书上所述,用一个20位长的字符串来表示一个所有元素都小于20的简单的非负整数集合,边框用如下字符串来表示集合{1,2,3,5,8,13}:
0 1 1 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0
上述集合中各数对应的位置则置1,没有对应的数的位置则置0。
采取这个位图的方案是因为我们面对的这个问题的特殊性:1、输入数据限制在相对较小的范围内,2、数据没有重复,3、其中的每条记录都是单一的整数,没有任何其它与之关联的数据。
    所以,此问题用位图的方案分为以下三步进行解决:
  • 第一步,将所有的位都置为0,从而将集合初始化为空。
  • 第二步,通过读入文件中的每个整数来建立集合,将每个对应的位都置为1。
  • 第三步,检验每一位,如果该位为1,就输出对应的整数。
不过很快,我们就将意识到,用此位图方法,严格说来还是不太行,空间消耗10^7/8还是大于1M(1M=1024*1024空间,小于10^7/8)。
    既然如果用位图方案的话,我们需要约1.25MB(若每条记录是8位的正整数的话,则10000000/(1024*1024*8) ~= 1.2M)的空间,而现在只有1MB的可用存储空间,那么究竟该作何处理呢?
updated && correct:
   @yansha: 上述的位图方案,共需要扫描输入数据两次,具体执行步骤如下:
  • 第一次,只处理1—4999999之间的数据,这些数都是小于5000000的,对这些数进行位图排序,只需要约5000000/8=625000Byte,也就是0.625M,排序后输出。
  • 第二次,扫描输入文件时,只处理4999999-10000000的数据项,也只需要0.625M(可以使用第一次处理申请的内存)。
    因此,总共也只需要0.625M
位图的的方法有必要强调一下,就是位图的适用范围为针对不重复的数据进行排序,若数据有重复,位图方案就不适用了。

多路归并
1、内存排序由于要求的可用内存为1MB,那么每次可以在内存中对250K的数据进行排序,然后将有序的数写入硬盘。
那么10M的数据需要循环40次,最终产生40个有序的文件。
2、归并排序
  1. 将每个文件最开始的数读入(由于有序,所以为该文件最小数),存放在一个大小为40的first_data数组中;
  2. 选择first_data数组中最小的数min_data,及其对应的文件索引index;
  3. 将first_data数组中最小的数写入文件result,然后更新数组first_data(根据index读取该文件下一个数代替min_data);
  4. 判断是否所有数据都读取完毕,否则返回2。
http://www.cnblogs.com/harryshayne/archive/2011/07/02/2096196.html
在编程珠玑中,描述了三种解决方法,分别是外排多路归并法、多通道排序法和位图排序法,在待排序文件中不含重复数的情况下,位图排序法是最高效的,但在更一般的情况下,外排多路归并法具有通用性.

当多路归并的路数比较小时,败者树的优势体现不出来,但是当路数达到一定规模时,败者树可以显著地减少排序时间,当然,由于败者树只作用于内存的最小关键字选择,所以直接提高的也只是内存的速度而已。但是别忘了,外排时所需读写外存的次数是和归并的次数成正比的,路数越多,归并的次数越少,也就可以间接减少外存读写次数了,所以说,败者树的优势是相当强大的~~

下面是算法的具体实现,这是针对JULY算法的修改,因为在JULY的算法中,并没有利用败者树来进行归并排序,在每次选择多路文件在数组中的元素的最小值的时候,只是简单地遍历数组来进行选择,没有利用败者树来进行选择,因此在效率上会有一定的差别(对磁盘的访问效率是一样的,不同的是内部选择的时候)。
class ExternSort 31 { 32 public: 33 void sort() 34 { 35 time_t start = time(NULL);
37 //将文件内容分块在内存中排序,并分别写入临时文件
38 k = memory_sort(); //
40 //归并临时文件内容到输出文件 41 //merge_sort(file_count);
42 ls=new int[k]; 43 b=new int[k+1]; 44 K_Merge(); 45 delete []ls; 46 delete []b;
48 time_t end = time(NULL); 49 printf("total time:%f\n", (end - start) * 1000.0/ CLOCKS_PER_SEC); 50 }
52 //input_file:输入文件名 53 //out_file:输出文件名 54 //count: 每次在内存中排序的整数个数
55 ExternSort(const char *input_file, const char * out_file, int count) 56 { 57 m_count = count; 58 m_in_file = new char[strlen(input_file) + 1]; 59 strcpy(m_in_file, input_file); 60 m_out_file = new char[strlen(out_file) + 1]; 61 strcpy(m_out_file, out_file); 62 }
69 private: 70 int m_count; //数组长度
71 char *m_in_file; //输入文件的路径
72 char *m_out_file; //输出文件的路径
73 int k;//归并数,此数必须要内排序之后才能得到,所以下面的ls和b都只能定义为指针(注意和书上区别)
74 LoserTree ls;//定义成为指针,之后动态生成数组
75 External b;//定义成为指针,在成员函数中可以把它当成数组使用 76 //int External[k];
77 protected: 78 int read_data(FILE* f, int a[], int n) 79 { 80 int i = 0; 81 while(i < n && (fscanf(f, "%d", &a[i]) != EOF)) i++; 82 printf("read:%d integer\n", i); 83 return i; 84 } 85 void write_data(FILE* f, int a[], int n) 86 { 87 for(int i = 0; i < n; ++i) 88 fprintf(f, "%d ", a[i]); 89 fprintf(f,"%d",MAX);//在最后写上一个最大值
90 } 91 char* temp_filename(int index) 92 { 93 char *tempfile = new char[100]; 94 sprintf(tempfile, "temp%d.txt", index); 95 return tempfile; 96 } 97 static int cmp_int(const void *a, const void *b) 98 { 99 return *(int*)a - *(int*)b; 100 }
102 int memory_sort() 103 { 104 FILE* fin = fopen(m_in_file, "rt"); 105 int n = 0, file_count = 0; 106 int *array = new int[m_count]; 107 108 //每读入m_count个整数就在内存中做一次排序,并写入临时文件
109 while(( n = read_data(fin, array, m_count)) > 0) 110 { 111 qsort(array, n, sizeof(int), cmp_int); 112 //这里,调用了库函数阿,在第四节的c实现里,不再调用qsort。
113 char *fileName = temp_filename(file_count++); 114 FILE *tempFile = fopen(fileName, "w"); 115 free(fileName); 116 write_data(tempFile, array, n); 117 fclose(tempFile); 118 } 119 120 delete [] array; 121 fclose(fin); 122 123 return file_count; 124 }
126 void Adjust(int s)127 {//沿从叶子节点b[s]到根节点ls[0]的路径调整败者树
128 int t=(s+k)/2;//ls[t]是b[s]的双亲节点
129 while(t>0)130 {131 if(b[s]>b[ls[t]])//如果失败,则失败者位置s留下,s指向新的胜利者
132 {133 int tmp=s;134 s=ls[t];135 ls[t]=tmp;136 }137 t=t/2;138 }139 ls[0]=s;//ls[0]存放调整后的最大值的位置
140 }
142 void CreateLoserTree()143 {144 b[k]=MIN;//额外的存储一个最小值
145 for(int i=0;i<k;i++)ls[i]=k;//先初始化为指向最小值,这样后面的调整才是正确的146 //这样能保证非叶子节点都是子树中的“二把手”
147 for(i=k-1;i>=0;i--)148 Adjust(i);//依次从b[k-1],b[k-2]...b[0]出发调整败者树
149 }150
151 void K_Merge()152 {//利用败者数把k个输入归并段归并到输出段中153 //b中前k个变量存放k个输入段中当前记录的元素154 //归并临时文件
155 FILE *fout = fopen(m_out_file, "wt"); 156 FILE* *farray = new FILE*[k]; 157 int i; 158 for(i = 0; i < k; ++i) //打开所有k路输入文件
159 { 160 char* fileName = temp_filename(i); 161 farray[i] = fopen(fileName, "rt"); 162 free(fileName); 163 } 164 165 for(i = 0; i < k; ++i) //初始读取
166 { 167 if(fscanf(farray[i], "%d", &b[i]) == EOF)//读每个文件的第一个数到data数组
168 {169 printf("there is no %d file to merge!",k);170 return;171 }172 } 173 // for(int i=0;i<k;i++)input(b[i]);
174
175 CreateLoserTree();176 int q;177 while(b[ls[0]]!=MAX)//178 {179 q=ls[0];//q用来存储b中最小值的位置,同时也对应一路文件180 //output(q);
181 fprintf(fout,"%d ",b[q]);182 //input(b[q],q);
183 fscanf(farray[q],"%d",&b[q]);184 Adjust(q);185 }186 //output(ls[0]);
187 fprintf(fout,"%d ",b[ls[0]]);188 //delete [] hasNext; 189 //delete [] data;
190 191 for(i = 0; i < k; ++i) //清理工作
192 { 193 fclose(farray[i]); 194 } 195 delete [] farray; 196 fclose(fout); 197 }

Labels

LeetCode (1432) GeeksforGeeks (1122) LeetCode - Review (1067) Review (882) Algorithm (668) to-do (609) Classic Algorithm (270) Google Interview (237) Classic Interview (222) Dynamic Programming (220) DP (186) Bit Algorithms (145) POJ (141) Math (137) Tree (132) LeetCode - Phone (129) EPI (122) Cracking Coding Interview (119) DFS (115) Difficult Algorithm (115) Lintcode (115) Different Solutions (110) Smart Algorithm (104) Binary Search (96) BFS (91) HackerRank (90) Binary Tree (86) Hard (79) Two Pointers (78) Stack (76) Company-Facebook (75) BST (72) Graph Algorithm (72) Time Complexity (69) Greedy Algorithm (68) Interval (63) Company - Google (62) Geometry Algorithm (61) Interview Corner (61) LeetCode - Extended (61) Union-Find (60) Trie (58) Advanced Data Structure (56) List (56) Priority Queue (53) Codility (52) ComProGuide (50) LeetCode Hard (50) Matrix (50) Bisection (48) Segment Tree (48) Sliding Window (48) USACO (46) Space Optimization (45) Company-Airbnb (41) Greedy (41) Mathematical Algorithm (41) Tree - Post-Order (41) ACM-ICPC (40) Algorithm Interview (40) Data Structure Design (40) Graph (40) Backtracking (39) Data Structure (39) Jobdu (39) Random (39) Codeforces (38) Knapsack (38) LeetCode - DP (38) Recursive Algorithm (38) String Algorithm (38) TopCoder (38) Sort (37) Introduction to Algorithms (36) Pre-Sort (36) Beauty of Programming (35) Must Known (34) Binary Search Tree (33) Follow Up (33) prismoskills (33) Palindrome (32) Permutation (31) Array (30) Google Code Jam (30) HDU (30) Array O(N) (29) Logic Thinking (29) Monotonic Stack (29) Puzzles (29) Code - Detail (27) Company-Zenefits (27) Microsoft 100 - July (27) Queue (27) Binary Indexed Trees (26) TreeMap (26) to-do-must (26) 1point3acres (25) GeeksQuiz (25) Merge Sort (25) Reverse Thinking (25) hihocoder (25) Company - LinkedIn (24) Hash (24) High Frequency (24) Summary (24) Divide and Conquer (23) Proof (23) Game Theory (22) Topological Sort (22) Lintcode - Review (21) Tree - Modification (21) Algorithm Game (20) CareerCup (20) Company - Twitter (20) DFS + Review (20) DP - Relation (20) Brain Teaser (19) DP - Tree (19) Left and Right Array (19) O(N) (19) Sweep Line (19) UVA (19) DP - Bit Masking (18) LeetCode - Thinking (18) KMP (17) LeetCode - TODO (17) Probabilities (17) Simulation (17) String Search (17) Codercareer (16) Company-Uber (16) Iterator (16) Number (16) O(1) Space (16) Shortest Path (16) itint5 (16) DFS+Cache (15) Dijkstra (15) Euclidean GCD (15) Heap (15) LeetCode - Hard (15) Majority (15) Number Theory (15) Rolling Hash (15) Tree Traversal (15) Brute Force (14) Bucket Sort (14) DP - Knapsack (14) DP - Probability (14) Difficult (14) Fast Power Algorithm (14) Pattern (14) Prefix Sum (14) TreeSet (14) Algorithm Videos (13) Amazon Interview (13) Basic Algorithm (13) Codechef (13) Combination (13) Computational Geometry (13) DP - Digit (13) LCA (13) LeetCode - DFS (13) Linked List (13) Long Increasing Sequence(LIS) (13) Math-Divisible (13) Reservoir Sampling (13) mitbbs (13) Algorithm - How To (12) Company - Microsoft (12) DP - Interval (12) DP - Multiple Relation (12) DP - Relation Optimization (12) LeetCode - Classic (12) Level Order Traversal (12) Prime (12) Pruning (12) Reconstruct Tree (12) Thinking (12) X Sum (12) AOJ (11) Bit Mask (11) Company-Snapchat (11) DP - Space Optimization (11) Dequeue (11) Graph DFS (11) MinMax (11) Miscs (11) Princeton (11) Quick Sort (11) Stack - Tree (11) 尺取法 (11) 挑战程序设计竞赛 (11) Coin Change (10) DFS+Backtracking (10) Facebook Hacker Cup (10) Fast Slow Pointers (10) HackerRank Easy (10) Interval Tree (10) Limited Range (10) Matrix - Traverse (10) Monotone Queue (10) SPOJ (10) Starting Point (10) States (10) Stock (10) Theory (10) Tutorialhorizon (10) Kadane - Extended (9) Mathblog (9) Max-Min Flow (9) Maze (9) Median (9) O(32N) (9) Quick Select (9) Stack Overflow (9) System Design (9) Tree - Conversion (9) Use XOR (9) Book Notes (8) Company-Amazon (8) DFS+BFS (8) DP - States (8) Expression (8) Longest Common Subsequence(LCS) (8) One Pass (8) Quadtrees (8) Traversal Once (8) Trie - Suffix (8) 穷竭搜索 (8) Algorithm Problem List (7) All Sub (7) Catalan Number (7) Cycle (7) DP - Cases (7) Facebook Interview (7) Fibonacci Numbers (7) Flood fill (7) Game Nim (7) Graph BFS (7) HackerRank Difficult (7) Hackerearth (7) Inversion (7) Kadane’s Algorithm (7) Manacher (7) Morris Traversal (7) Multiple Data Structures (7) Normalized Key (7) O(XN) (7) Radix Sort (7) Recursion (7) Sampling (7) Suffix Array (7) Tech-Queries (7) Tree - Serialization (7) Tree DP (7) Trie - Bit (7) 蓝桥杯 (7) Algorithm - Brain Teaser (6) BFS - Priority Queue (6) BFS - Unusual (6) Classic Data Structure Impl (6) DP - 2D (6) DP - Monotone Queue (6) DP - Unusual (6) DP-Space Optimization (6) Dutch Flag (6) How To (6) Interviewstreet (6) Knapsack - MultiplePack (6) Local MinMax (6) MST (6) Minimum Spanning Tree (6) Number - Reach (6) Parentheses (6) Pre-Sum (6) Probability (6) Programming Pearls (6) Rabin-Karp (6) Reverse (6) Scan from right (6) Schedule (6) Stream (6) Subset Sum (6) TSP (6) Xpost (6) n00tc0d3r (6) reddit (6) AI (5) Abbreviation (5) Anagram (5) Art Of Programming-July (5) Assumption (5) Bellman Ford (5) Big Data (5) Code - Solid (5) Code Kata (5) Codility-lessons (5) Coding (5) Company - WMware (5) Convex Hull (5) Crazyforcode (5) DFS - Multiple (5) DFS+DP (5) DP - Multi-Dimension (5) DP-Multiple Relation (5) Eulerian Cycle (5) Graph - Unusual (5) Graph Cycle (5) Hash Strategy (5) Immutability (5) Java (5) LogN (5) Manhattan Distance (5) Matrix Chain Multiplication (5) N Queens (5) Pre-Sort: Index (5) Quick Partition (5) Quora (5) Randomized Algorithms (5) Resources (5) Robot (5) SPFA(Shortest Path Faster Algorithm) (5) Shuffle (5) Sieve of Eratosthenes (5) Strongly Connected Components (5) Subarray Sum (5) Sudoku (5) Suffix Tree (5) Swap (5) Threaded (5) Tree - Creation (5) Warshall Floyd (5) Word Search (5) jiuzhang (5)

Popular Posts