Until a few weeks ago, I had never needed to sort a file that was huge, as in too large to fit in memory. The basic algorithm for an external merge sort is easy enough, but it took some thought and I didn't find much useful in a web search, so I decided it was worth posting even though it turns out to be rather simple.
Here's the basic algorithm for an external sort in English (I can provide it in Java on request, since that's what I wrote it in, but I'm posting it in English to keep it generally useful):
- Until finished reading the large file:
  - Read a large chunk of the file into memory (large enough so that you get a lot of records, but small enough that it will comfortably fit into memory).
  - Sort those records in memory.
  - Write them to a (new) file.
- Open each of the files you created above.
- Read the top record from each file.
- Until no record exists in any of the files (or until you have read the entirety of every file):
  - Write the smallest record to the sorted file.
  - Read the next record from the file that had the smallest record.
Does that make sense? I kept it at a very high level, but I'm happy to answer any questions about the smaller details.
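The steps above can be sketched in Java as a split phase followed by a priority-queue merge. This is a minimal illustration, not the post's actual source: the class names, the tunable chunk size, and the assumption of newline-delimited records sorted lexicographically are all mine.

```java
import java.io.*;
import java.util.*;

public class ExternalSort {

    // Phase 1: read fixed-size chunks, sort each in memory, write each as a "run".
    // Phase 2: merge all runs, always emitting the smallest head record.
    public static void sort(String inputPath, String outputPath, int chunkSize) throws IOException {
        List<File> runs = splitIntoSortedRuns(inputPath, chunkSize);
        mergeRuns(runs, outputPath);
    }

    static List<File> splitIntoSortedRuns(String inputPath, int chunkSize) throws IOException {
        List<File> runs = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(inputPath))) {
            List<String> chunk = new ArrayList<>(chunkSize);
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == chunkSize) {
                    runs.add(writeRun(chunk));
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) runs.add(writeRun(chunk)); // leftover records
        }
        return runs;
    }

    static File writeRun(List<String> chunk) throws IOException {
        Collections.sort(chunk); // in-memory sort of one chunk
        File run = File.createTempFile("run", ".tmp");
        run.deleteOnExit();
        try (BufferedWriter out = new BufferedWriter(new FileWriter(run))) {
            for (String record : chunk) {
                out.write(record);
                out.newLine();
            }
        }
        return run;
    }

    // Wraps one run file plus its current smallest (head) record.
    static class RunReader implements Closeable {
        final BufferedReader reader;
        String head;
        RunReader(File f) throws IOException {
            reader = new BufferedReader(new FileReader(f));
            head = reader.readLine();
        }
        void advance() throws IOException {
            head = reader.readLine();
        }
        public void close() throws IOException {
            reader.close();
        }
    }

    static void mergeRuns(List<File> runs, String outputPath) throws IOException {
        // Priority queue keyed on each run's head record.
        PriorityQueue<RunReader> heads =
                new PriorityQueue<>(Comparator.comparing((RunReader r) -> r.head));
        for (File f : runs) {
            RunReader r = new RunReader(f);
            if (r.head != null) heads.add(r); else r.close();
        }
        try (BufferedWriter out = new BufferedWriter(new FileWriter(outputPath))) {
            while (!heads.isEmpty()) {
                RunReader smallest = heads.poll(); // run with the smallest head record
                out.write(smallest.head);
                out.newLine();
                smallest.advance();                // read next record from that run
                if (smallest.head != null) heads.add(smallest);
                else smallest.close();
            }
        }
    }
}
```

With the priority queue, each output record costs O(log k) comparisons for k runs, instead of scanning every run's head on each step.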
Update: I noticed a slight bug in the algorithm. The line "Read one record from each file" was inside the last loop, but should have been outside of it. The post was changed to reflect the correct way to do it.
Update 2: Here's the Java source code for external merge sort.
This sounds really cool. I would love to give it a go in ColdFusion and then maybe we could compare notes (for fun)?
Posted by
Ben Nadel
on May 10, 2007 at 12:36 PM UTC - 6 hrs
Sounds good to me!
Posted by
Sam
on May 10, 2007 at 04:46 PM UTC - 6 hrs
Although, I thought CF read the entire file anyway... ?
Posted by
Sam
on May 10, 2007 at 04:46 PM UTC - 6 hrs
You'll just have to wait and see ;)
Posted by
Ben Nadel
on May 10, 2007 at 04:47 PM UTC - 6 hrs
Ok, for extra credit then, do it on a CSV and sort by an arbitrary column. Then, incorporate it into all that work you were doing some time ago... =)
Posted by
Sam
on May 10, 2007 at 06:00 PM UTC - 6 hrs
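For the CSV extra credit above, the only change needed is the comparison: both the in-memory sort and the merge can take a comparator keyed on the chosen column. A minimal sketch, with the caveat that `CsvComparator` and the naive comma split are my own illustration (a real CSV parser must handle quoted fields):

```java
import java.util.Comparator;

public class CsvComparator {
    // Orders CSV lines by the given zero-based column.
    // Naive split on commas: does not handle quoted fields containing commas.
    public static Comparator<String> byColumn(int col) {
        return Comparator.comparing((String line) -> line.split(",", -1)[col]);
    }
}
```

Passing such a comparator to `Collections.sort` and to the merge's priority queue sorts the whole file by that column.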
Actually, I was in the middle of reading it as you sent this, but I keep getting interrupted =)
Posted by
Sam
on May 10, 2007 at 08:07 PM UTC - 6 hrs
Hi Sam, can you please post the source code in Java? I am trying to implement the same as well.
Posted by Shankar Vasudevan
on May 13, 2007 at 12:32 PM UTC - 6 hrs
Sure Shankar, I'll dig it up tomorrow and post its location here.
Posted by
Sam
on May 13, 2007 at 12:52 PM UTC - 6 hrs
Hi Sam, could you please post the Java source of this sorting algorithm, or send it to my e-mail: nofxdenk@gmail.com? I'm trying to implement the same thing.
Thank you.
Denk.
Posted by denk
on May 13, 2007 at 04:18 PM UTC - 6 hrs
Hi Sam, could you please post the Java source of this sorting algorithm, or send it to my e-mail: tomerab@hotmail.com? I'm trying to implement the same thing.
Thank you.
Tomer.
Posted by Tomer
on May 20, 2007 at 12:35 PM UTC - 6 hrs
thanks a lot
Posted by Nasim
on Oct 16, 2007 at 07:14 PM UTC - 6 hrs
Thx !
Posted by
Linage pvp
on Jan 31, 2010 at 11:58 AM UTC - 6 hrs
Sam,
Really nice post. It was helpful.
Thanks
Vinod
Posted by Vinod
on Mar 16, 2010 at 03:17 AM UTC - 6 hrs
Here is a Scala program that implements this idea:
// MMergeSort.scala, Hakon G, 23 mars 2010
// Deletion of temporary *.ord files not implemented here
import scala.util.Sorting.quickSort

def fileLines(file: java.io.File) =
  scala.io.Source.fromFile(file, "utf-8", scala.io.Source.DefaultBufSize * 100).getLines // toList

abstract class rowSource {
  def hasNext: Boolean
  def getRow(): String
}

class singleFileSource(file: java.io.File) extends rowSource {
  val fileSource: scala.Iterator[String] = fileLines(file)
  def hasNext: Boolean = fileSource.hasNext
  def getRow(): String = if (hasNext) fileSource.next else null
}

class duoSource(leftList: List[java.io.File], rightList: List[java.io.File]) extends rowSource {
  val leftsource = new multiFileSource(leftList)
  val rightsource = new multiFileSource(rightList)
  var myNext: String = null
  var rightHasNext: Boolean = rightsource.hasNext
  var leftHasNext: Boolean = leftsource.hasNext
  var rightMyNext: String = if (rightHasNext) rightsource.getRow() else null
  var leftMyNext: String = if (leftHasNext) leftsource.getRow() else null

  def takeFromLeft() = {
    myNext = leftMyNext
    leftHasNext = leftsource.hasNext
    leftMyNext = if (leftHasNext) leftsource.getRow() else null
  }

  def takeFromRight() = {
    myNext = rightMyNext
    rightHasNext = rightsource.hasNext
    rightMyNext = if (rightHasNext) rightsource.getRow() else null
  }

  def getRow(): String = {
    if (leftHasNext && rightHasNext)
      if (leftMyNext > rightMyNext) takeFromRight() else takeFromLeft()
    else if (leftHasNext) takeFromLeft()
    else if (rightHasNext) takeFromRight()
    else myNext = null
    return myNext
  }

  def hasNext: Boolean = (leftHasNext || rightHasNext)
}

class multiFileSource(fileList: List[java.io.File]) extends rowSource {
  var rowsource: rowSource = null
  if (fileList.length < 2) rowsource = new singleFileSource(fileList.head)
  else {
    val (lf, rf) = fileList splitAt (fileList.length / 2)
    rowsource = new duoSource(lf, rf)
  }
  def hasNext: Boolean = rowsource.hasNext
  def getRow(): String = rowsource.getRow()
}

var lines = 0
var fileNum = 1
val batch = 500000
val inputFileSource: scala.Iterator[String] = fileLines(new java.io.File(args(0)))
var inputArray = new Array[String](batch)
var ordFileList: List[java.io.File] = List()
val s0 = System.currentTimeMillis

while (inputFileSource.hasNext) {
  val line = inputFileSource.next
  inputArray(lines) = line
  lines += 1
  if (lines == batch) {
    val s1 = System.currentTimeMillis
    quickSort(inputArray)
    println("Done quick sorting " + (System.currentTimeMillis - s1).toInt)
    val outputFile = args(0) + "_" + fileNum + ".ord"
    val bout = new java.io.BufferedWriter(
      new java.io.OutputStreamWriter(new java.io.FileOutputStream(outputFile)), 1024 * 1000)
    val s2 = System.currentTimeMillis
    inputArray.foreach(x => bout.write(x))
    bout.close()
    println("Done writing " + (System.currentTimeMillis - s2).toInt)
    fileNum += 1
    lines = 0
    inputArray = new Array[String](batch)
    println("Wrote to file " + outputFile)
    ordFileList = new java.io.File(outputFile) :: ordFileList
  }
}

if (lines > 0) {
  println("Left overs " + lines)
  val finalArray = new Array[String](lines)
  for (i <- 0 to lines - 1) finalArray(i) = inputArray(i)
  quickSort(finalArray)
  println("Final quicksort done")
  val outputFile = args(0) + "_" + fileNum + ".ord"
  val bout = new java.io.BufferedWriter(
    new java.io.OutputStreamWriter(new java.io.FileOutputStream(outputFile)), 1024 * 1000)
  finalArray.foreach(x => bout.write(x))
  bout.close()
  ordFileList = new java.io.File(outputFile) :: ordFileList
}

val out = new java.io.BufferedWriter(
  new java.io.OutputStreamWriter(new java.io.FileOutputStream(args(0) + ".mmsort")), 1024 * 100)
println("Now merging the files")
val rowsource = new multiFileSource(ordFileList)
while (rowsource.hasNext) {
  val aline = rowsource.getRow()
  out.write(aline)
}
out.close
println("Done multi merge sorting " + (System.currentTimeMillis - s0).toInt)
Posted by hakon
on Mar 25, 2010 at 09:43 AM UTC - 6 hrs
@hakon - Nice, Thanks for sharing!
Posted by
Sammy Larbi
on Mar 25, 2010 at 10:48 AM UTC - 6 hrs
excellent sam!
Posted by alejandro varela
on Jan 10, 2011 at 06:57 AM UTC - 6 hrs
You can sort a file larger than 1GB with PilotEdit.
Posted by Draco
on Oct 07, 2011 at 11:17 PM UTC - 6 hrs
nice stuff.....
Posted by amarkant
on Sep 24, 2012 at 09:28 PM UTC - 6 hrs
Good post!
I have just built some abstract structures, called big queue and big array, to simplify big-data sorting and searching tasks on a single machine with limited memory. Basically, the algorithm used is similar to yours: external merge sort.
I can sort 128GB of data in 9 hours on a single machine, and then binary search the sorted data in almost no time.
Here is a post about how to search big data by using the big queue and big array structures:
http://bulldog2011.github.com/blog/2013/01/25/merg...
Posted by bulldog
on Jan 26, 2013 at 01:31 AM UTC - 6 hrs
@bulldog - Awesome! I had a read and enjoyed it quite a bit. Very impressive!
Posted by
Sammy Larbi
on Jan 26, 2013 at 05:04 AM UTC - 6 hrs
You can sort many huge files (the sorted result can be terabytes and bigger) with ZZZServer (http://demo.zzz.bg/en/#download); it is free for non-commercial use:
ZZZServer -sortinit -sort file1.txt
ZZZServer -sort file2.txt
ZZZServer -sort file3.txt
...
ZZZServer -sortsave sorted.txt
After sorting, the result is saved in sorted.txt.
P.S. Your input files must be encoded in UTF-8 or ASCII format!
ZZZServer uses about 1MB of RAM when sorting big files!
You can use ZZZServer from Java through its NoSQL ZZZBase Java driver (https://github.com/alphasoftwarebg/zzzserver-java-...).
Posted by
Zlatin Georgiev
on May 07, 2018 at 01:48 PM UTC - 6 hrs