Sort Big File - 小土刀的面试刷题笔记
Imagine you have a 20 GB file with one string per line. Explain how you would sort the file
As we can't store everything in the memory, we need to just store part of the data. Suppose we have 2 GB memory, we will split the data into 10 sets and each time we use in-place sorting algorithm to sort the 2GB data.
After 10 times of sorting, now the 20GB data is partially in-order. What we need to do is merge these 10 sets one by one to get the whole data sorted.
As we can't store everything in the memory, we need to just store part of the data. Suppose we have 2 GB memory, we will split the data into 10 sets and each time we use in-place sorting algorithm to sort the 2GB data.
After 10 times of sorting, now the 20GB data is partially in-order. What we need to do is merge these 10 sets one by one to get the whole data sorted.
We'll divide the file into chunks, which are x megabytes each, where x is the amount of memory we have available. Each chunk is sorted separately and then saved back to the file system.
Once all the chunks are sorted, we merge the chunks, one by one. At the end, we have a fully sorted file.
This algorithm is known as external sort.
Read full article from Sort Big File - 小土刀的面试刷题笔记