Until a few weeks ago, something I've never needed to do was sort a file that was huge - like unable to fit in memory huge. I think the basic algorithm for an external merge sort is easy enough, but it did take some thought and I didn't find much useful in a web search, so I decided it was probably worthy of posting even though it turns out to be rather simple.
Here's the basic algorithm for an external sort in English (I can provide it in Java on request, since that's what I wrote it in, but I'm just posting it in English to keep it generally useful).
-
Until finished reading the large file
-
Read a large chunk of the file into memory (large enough so that you get a lot of records, but small enough such that it will comfortably fit into memory).
-
Sort those records in memory.
-
Write them to a (new) file
-
Open each of the files you created above
-
Read the top record from each file
-
Until no record exists in any of the files (or until you have read the entirety of every file)
-
Write the smallest record to the sorted file
-
Read the next record from the file that had the smallest record
Does that make sense? I kept it in very high level language, but I'm happy to answer any questions regarding smaller details.
Update: I noticed a slight bug in the algorithm. The line "Read one record from each file"
was inside the last loop, but should have
been outside of it. The post was changed to reflect the correct way to do it.
Update 2: Here's the
Java source code for external merge sort.
Hey! Why don't you make your life easier and subscribe to the full post
or short blurb RSS feed? I'm so confident you'll love my smelly pasta plate
wisdom that I'm offering a no-strings-attached, lifetime money back guarantee!
Leave a comment
This sounds really cool. I would love to give it a go in ColdFusion and then maybe we could compare notes (for fun)?
Posted by
Ben Nadel
on May 10, 2007 at 12:36 PM UTC - 5 hrs
Sounds good to me!
Posted by
Sam
on May 10, 2007 at 04:46 PM UTC - 5 hrs
Although, I thought CF read the entire file anyway... ?
Posted by
Sam
on May 10, 2007 at 04:46 PM UTC - 5 hrs
You'll just have to wait and see ;)
Posted by
Ben Nadel
on May 10, 2007 at 04:47 PM UTC - 5 hrs
Ok, for extra credit then, do it on a CSV and sort by an arbitrary column. Then, incorporate it into all that work you were doing some time ago... =)
Posted by
Sam
on May 10, 2007 at 06:00 PM UTC - 5 hrs
Actually, I was in the middle of reading it as you sent this, but I keep getting interrupted =)
Posted by
Sam
on May 10, 2007 at 08:07 PM UTC - 5 hrs
Hi Sam, can you please post the source-code in java please as I am trying to implement the same as well.
Posted by Shankar Vasudevan
on May 13, 2007 at 12:32 PM UTC - 5 hrs
Sure Shankar, I'll dig it up tomorrow and post its location here.
Posted by
Sam
on May 13, 2007 at 12:52 PM UTC - 5 hrs
Hi Sam, could you please post a java source of this sorting algorithm? or could you send it to my e-mail: nofxdenk@gmail.com I'm trying to implement the same thing.
Thank you.
Denk.
Posted by denk
on May 13, 2007 at 04:18 PM UTC - 5 hrs
Hi Sam, could you please post a java source of this sorting algorithm? or could you send it to my e-mail: tomerab@hotmail.com I'm trying to implement the same thing.
Thank you.
Tomer.
Posted by Tomer
on May 20, 2007 at 12:35 PM UTC - 5 hrs
thanks a lot
Posted by Nasim
on Oct 16, 2007 at 07:14 PM UTC - 5 hrs
Thx !
Posted by
Linage pvp
on Jan 31, 2010 at 11:58 AM UTC - 5 hrs
Sam,
Really nice Post. Was helpful
Thanks
Vinod
Posted by Vinod
on Mar 16, 2010 at 03:17 AM UTC - 5 hrs
Here is a Scala program that implements this idea
// MMergeSort.scala, Hakon G, 23 mars 2010
// Deletion of temporary *.ord files not implemented here
import scala.util.Sorting.quickSort
def fileLines(file: java.io.File) = scala.io.Source.fromFile(file,"utf-8", scala.io.Source.DefaultBufSize*100).getLines // toList
abstract class rowSource {
def hasNext : Boolean
def getRow() : String
}
class singleFileSource(file : java.io.File) extends rowSource {
val fileSource : scala.Iterator[String] = fileLines(file)
def hasNext : Boolean = fileSource.hasNext
def getRow() : String = if (hasNext) fileSource.next else null
}
class duoSource(leftList : List[java.io.File], rightList : List[java.io.File]) extends rowSource {
val leftsource = new multiFileSource(leftList)
val rightsource = new multiFileSource(rightList)
var myNext : String = null
var rightHasNext : Boolean = rightsource.hasNext
var leftHasNext : Boolean = leftsource.hasNext
var rightMyNext : String = if (rightHasNext) rightsource.getRow() else null
var leftMyNext : String = if (leftHasNext) leftsource.getRow() else null
def takeFromLeft() = {
myNext = leftMyNext
leftHasNext = leftsource.hasNext
leftMyNext = if (leftHasNext) leftsource.getRow() else null
}
def takeFromRight() = {
myNext = rightMyNext
rightHasNext = rightsource.hasNext
rightMyNext = if (rightHasNext) rightsource.getRow() else null
}
def getRow() : String = {
if (leftHasNext && rightHasNext)
if (leftMyNext > rightMyNext) takeFromRight() else takeFromLeft()
else if (leftHasNext) takeFromLeft()
else if (rightHasNext) takeFromRight()
else myNext = null
return myNext
}
def hasNext : Boolean = ( leftHasNext || rightHasNext )
}
class multiFileSource(fileList: List[java.io.File]) extends rowSource {
var rowsource : rowSource = null
if (fileList.length < 2) rowsource = new singleFileSource(fileList.head)
else {
val (lf, rf) = fileList splitAt (fileList.length / 2)
rowsource = new duoSource(lf,rf)
}
def hasNext : Boolean = rowsource.hasNext
def getRow() : String = rowsource.getRow()
}
var lines = 0
var fileNum = 1
val batch = 500000
val inputFileSource : scala.Iterator[String] = fileLines(new java.io.File(args(0)))
var inputArray = new Array[String](batch)
var ordFileList : List[java.io.File] = List()
val s0=System.currentTimeMillis
while (inputFileSource.hasNext) {
val line = inputFileSource.next
inputArray(lines)=line
lines += 1
if (lines == batch) {
val s1=System.currentTimeMillis
quickSort(inputArray)
println("Done quick sorting "+ (System.currentTimeMillis - s1).toInt)
val outputFile = args(0)+"_"+fileNum+".ord"
val bout = new java.io.BufferedWriter(new java.io.OutputStreamWriter(new java.io.FileOutputStream(outputFile)),1024*1000)
val s2 = System.currentTimeMillis
inputArray.foreach(x => bout.write(x))
bout.close()
println("Done writing "+ (System.currentTimeMillis - s2).toInt)
fileNum += 1
lines = 0
inputArray = new Array[String](batch)
println("Wrote to file "+outputFile)
ordFileList = new java.io.File(outputFile) :: ordFileList
}
}
if (lines > 0) {
println("Left overs "+lines)
val finalArray = new Array[String](lines)
for (i <- 0 to lines-1) finalArray(i) = inputArray(i)
quickSort(finalArray)
println("Final quicksort done")
val outputFile = args(0)+"_"+fileNum+".ord"
val bout = new java.io.BufferedWriter(new java.io.OutputStreamWriter(new java.io.FileOutputStream(outputFile)),1024*1000)
finalArray.foreach(x => bout.write(x))
bout.close()
ordFileList = new java.io.File(outputFile) :: ordFileList
}
val out = new java.io.BufferedWriter(new java.io.OutputStreamWriter(new java.io.FileOutputStream(args(0)+".mmsort")),1024*100);
println("Now merging the files")
val rowsource = new multiFileSource(ordFileList)
while (rowsource.hasNext) {
val aline = rowsource.getRow();
out.write(aline)
}
out.close
println("Done multi merge sorting "+ (System.currentTimeMillis - s0).toInt)
Posted by hakon
on Mar 25, 2010 at 09:43 AM UTC - 5 hrs
@hakon - Nice, Thanks for sharing!
Posted by
Sammy Larbi
on Mar 25, 2010 at 10:48 AM UTC - 5 hrs
excellent sam!
Posted by alejandro varela
on Jan 10, 2011 at 06:57 AM UTC - 5 hrs
excellent sam!
Posted by alejandro varela
on Jan 10, 2011 at 06:58 AM UTC - 5 hrs
You can sort a file larger than 1GB with PilotEdit.
Posted by Draco
on Oct 07, 2011 at 11:17 PM UTC - 5 hrs
nice stuff.....
Posted by amarkant
on Sep 24, 2012 at 09:28 PM UTC - 5 hrs
Good post!
I have just built some abstract structures called big queue and big array to simplify big data sorting and searching task on a single machine with limited memory. Basically, the algorithm used is similar to yours - external merge sort.
I can sort 128GB data in 9 hours on a single machine, and then binary search the sorted data with almost no time.
Here is a post about how to search big data by using the big queue and big array structures:
http://bulldog2011.github.com/blog/2013/01/25/merg...
Posted by bulldog
on Jan 26, 2013 at 01:31 AM UTC - 5 hrs
@bulldog - Awesome! I had a read and enjoyed it quite a bit. Very impressive!
Posted by
Sammy Larbi
on Jan 26, 2013 at 05:04 AM UTC - 5 hrs
You can sort the many huge files (the sorted result can be terabytes and bigger) with [ZZZServer][1] it is free for non-commercial use:
ZZZServer -sortinit -sort file1.txt
ZZZServer -sort file2.txt
ZZZServer -sort file3.txt
...
ZZZServer -sortsave sorted.txt
After sorting the result is saved in
sorted.txt
P.S. Your input files must be encoded in the UTF-8 or ASCII format!
The ZZZServer using about 1MB RAM on sorting big files!
You can use ZZZServer from Java through its [NoSQL ZZZBase Java driver][2].
[1]:
http://demo.zzz.bg/en/#download [2]:
https://github.com/alphasoftwarebg/zzzserver-java-...
Posted by
Zlatin Georgiev
on May 07, 2018 at 01:48 PM UTC - 5 hrs
Leave a comment