Due to a couple of requests in the comments about
Sorting really BIG files,
I'm posting the Java source code that does the job.
First, we have the main
externalSort
algorithm. It expects parameters
relation
(the name of a table, which in this case is the .csv file to sort without the ".csv" in the file name)
and the column to sort on (it checks the first row of the .csv file for the value passed). Now, of course
that is easy enough to change to fit your own needs - it worked fine for me and my CSVs.
There is a magic number in there that you might want to pull out: 10000. That is the number of rows to
read at a time.
Clearly, ten thousand rows is way smaller than we would want to sort on - depending on how many characters
are in each row, we could probably quite easily fit 10 million rows into memory. But, I didn't want to
spend time generating 100 million row files to test this on - so I just tried it with 100 thousand or so.
For a real use scenario, you'd probably want to go with something bigger - if not even calculate the size
and average row length.
private static void externalSort(String relation, String attribute)
{
try
{
FileReader intialRelationInput = new FileReader(relation + ".csv");
BufferedReader initRelationReader = new BufferedReader(intialRelationInput);
String [] header = initRelationReader.readLine().split(",");
String [] row = header;
int indexToCompare = getIndexForColumn(header,attribute);
ArrayList<Integer[]> tenKRows = new ArrayList<Integer[]>();
int numFiles = 0;
while (row!=null)
{
// get 10k rows
for(int i=0; i<10000; i++)
{
String line = initRelationReader.readLine();
if (line==null)
{
row = null;
break;
}
row = line.split(",");
tenKRows.add(getIntsFromStringArray(row));
}
// sort the rows
tenKRows = mergeSort(tenKRows, indexToCompare);
// write to disk
FileWriter fw = new FileWriter(relation + "_chunk" + numFiles + ".csv");
BufferedWriter bw = new BufferedWriter(fw);
bw.write(flattenArray(header,",")+"\n");
for(int i=0; i<tenKRows.size(); i++)
{
bw.append(flattenArray(tenKRows.get(i),",")+"\n");
}
bw.close();
numFiles++;
tenKRows.clear();
}
mergeFiles(relation, numFiles, indexToCompare);
initRelationReader.close();
intialRelationInput.close();
}
catch (Exception ex)
{
ex.printStackTrace();
System.exit(-1);
}
}
Now, there were a couple of functions that the
externalSort()
method depended on.
First we see the
getIndexForColumn
. This method just wraps code that finds the index
where the value of
attribute
occurs in the array
header
. That's
easy enough to write, so I won't add to the clutter and post it here.
Next, we have
getIntsFromStringArray
. This is also simple and just uses
parseInt
on each index in the array and returns a new one for
int
s. I'm assuming these are all integers,
so you may need to change that line to deal with
String
s or other data types.
Then we have the
mergeSort
algorithm, which is shown below.
flattenArray
is
also omitted here because it simply takes an array, puts a comma between each element, and returns
a
String
. Finally,
mergeFiles
is also shown below.
// sort an arrayList of arrays based on the ith column
private static ArrayList<Integer[]> mergeSort(ArrayList<Integer[]> arr, int index)
{
ArrayList<Integer[]> left = new ArrayList<Integer[]>();
ArrayList<Integer[]> right = new ArrayList<Integer[]>();
if(arr.size()<=1)
return arr;
else
{
int middle = arr.size()/2;
for (int i = 0; i<middle; i++)
left.add(arr.get(i));
for (int j = middle; j<arr.size(); j++)
right.add(arr.get(j));
left = mergeSort(left, index);
right = mergeSort(right, index);
return merge(left, right, index);
}
}
// merge the the results for mergeSort back together
private static ArrayList<Integer[]> merge(ArrayList<Integer[]> left, ArrayList<Integer[]> right, int index)
{
ArrayList<Integer[]> result = new ArrayList<Integer[]>();
while (left.size() > 0 && right.size() > 0)
{
if(left.get(0)[index] <= right.get(0)[index])
{
result.add(left.get(0));
left.remove(0);
}
else
{
result.add(right.get(0));
right.remove(0);
}
}
if (left.size()>0)
{
for(int i=0; i<left.size(); i++)
result.add(left.get(i));
}
if (right.size()>0)
{
for(int i=0; i<right.size(); i++)
result.add(right.get(i));
}
return result;
}
There is nothing special about the
mergeSort
and
merge
methods above. They simply
follow the well known
merge sort algorithm, accounting
for the fact that we are sorting on one element in an array of arrays, rather that an array with only one
dimension.
Finally, we have the
mergeFiles
function, which simply reads all the files, one row at a time,
and puts the smallest row at the top of the sorted file.
private static void mergeFiles(String relation, int numFiles, int compareIndex)
{
try
{
ArrayList<FileReader> mergefr = new ArrayList<FileReader>();
ArrayList<BufferedReader> mergefbr = new ArrayList<BufferedReader>();
ArrayList<Integer[]> filerows = new ArrayList<Integer[]>();
FileWriter fw = new FileWriter(relation + "_sorted.csv");
BufferedWriter bw = new BufferedWriter(fw);
String [] header;
boolean someFileStillHasRows = false;
for (int i=0; i<numFiles; i++)
{
mergefr.add(new FileReader(relation+"_chunk"+i+".csv"));
mergefbr.add(new BufferedReader(mergefr.get(i)));
// get each one past the header
header = mergefbr.get(i).readLine().split(",");
if (i==0) bw.write(flattenArray(header,",")+"\n");
// get the first row
String line = mergefbr.get(i).readLine();
if (line != null)
{
filerows.add(getIntsFromStringArray(line.split(",")));
someFileStillHasRows = true;
}
else
{
filerows.add(null);
}
}
Integer[] row;
int cnt = 0;
while (someFileStillHasRows)
{
Integer min;
int minIndex = 0;
row = filerows.get(0);
if (row!=null) {
min = row[compareIndex];
minIndex = 0;
}
else {
min = null;
minIndex = -1;
}
// check which one is min
for(int i=1; i<filerows.size(); i++)
{
row = filerows.get(i);
if (min!=null) {
if(row!=null && row[compareIndex] < min)
{
minIndex = i;
min = filerows.get(i)[compareIndex];
}
}
else
{
if(row!=null)
{
min = row[compareIndex];
minIndex = i;
}
}
}
if (minIndex < 0) {
someFileStillHasRows=false;
}
else
{
// write to the sorted file
bw.append(flattenArray(filerows.get(minIndex),",")+"\n");
// get another row from the file that had the min
String line = mergefbr.get(minIndex).readLine();
if (line != null)
{
filerows.set(minIndex,getIntsFromStringArray(line.split(",")));
}
else
{
filerows.set(minIndex,null);
}
}
// check if one still has rows
for(int i=0; i<filerows.size(); i++)
{
someFileStillHasRows = false;
if(filerows.get(i)!=null)
{
if (minIndex < 0)
{
puts ("mindex lt 0 and found row not null" + flattenArray(filerows.get(i)," "));
System.exit(-1);
}
someFileStillHasRows = true;
break;
}
}
// check the actual files one more time
if (!someFileStillHasRows)
{
//write the last one not covered above
for(int i=0; i<filerows.size(); i++)
{
if (filerows.get(i) == null)
{
String line = mergefbr.get(i).readLine();
if (line!=null)
{
someFileStillHasRows=true;
filerows.set(i,getIntsFromStringArray(line.split(",")));
}
}
}
}
}
// close all the files
bw.close();
fw.close();
for(int i=0; i<mergefbr.size(); i++)
mergefbr.get(i).close();
for(int i=0; i<mergefr.size(); i++)
mergefr.get(i).close();
}
catch (Exception ex)
{
ex.printStackTrace();
System.exit(-1);
}
}
Ideally, you'd want to do better abstraction on a lot of this code and put it into different methods.
Like Ben Nadel did in his
ColdFusion/Java version of this,
I'm keeping it inline so you don't need to track over to a function to see exactly what it does (though
a good name would certainly tell you without the details - which is what abstraction is about!). In
particular, I'd move all the file-specific operations into abstractions which better describe what
they're doing in this particular domain. It would really help to better understand what's going on
in the
mergeFiles
method at a glance, rather than reading all the details.
Questions? Comments?
Hey! Why don't you make your life easier and subscribe to the full post
or short blurb RSS feed? I'm so confident you'll love my smelly pasta plate
wisdom that I'm offering a no-strings-attached, lifetime money back guarantee!
Leave a comment
can you please send me the full code with all the methods?
tomerab@hotmail.com
Posted by Tomer
on May 20, 2007 at 03:02 PM UTC - 5 hrs
Are you missing the code for the method "getIndexForColumn" :) ?
A part this, Great Work!
Posted by Andrew
on Jun 13, 2007 at 02:08 PM UTC - 5 hrs
Hi Andrew - indeed, I left it out because it didn't add much value for the clutter it provided. That method takes a string array and a string, and just returns the index where it found the value. Basically, something like (I haven't tested this, but I think you'll get the idea):
private int getIndexForColumn(String [] arr, String val)
{
int result = -1;
for(int i=0; i<arr.length; i++)
{
if (val==arr[i]) { result=i; break; }
}
return result;
}
Posted by
Sam
on Jun 14, 2007 at 12:14 PM UTC - 5 hrs
Hello,
Can you please explain, how is the implementation of "getIntsFromStringArray" method,
"private static Integer[] getIntsFromStringArray(String[] row)"
Posted by Rohidas
on Jun 19, 2007 at 05:57 AM UTC - 5 hrs
If I understod correctly this "getIntsFromStringArray()" method only deal with the integer datatype if I am using all the datatypes then how to deal with this method.will it affect the sorting techniques.
Posted by Rohidas
on Jun 19, 2007 at 07:33 AM UTC - 5 hrs
Well, I think you can really only sort on one data type per column, or else it has to be a string. You might be able to think of some way around that, but I don't think it follows that dates are less than or greater than strings or numbers or ...
The input is expected to be in string format, so if you want to sort on strings, just leave out the getIntsFromStringArray(). If you want to sort on doubles, floats, dates, or whatever, you just need to write a function that converts the string representation to the correct data type. That's all getIntsFromStringArray does.
Basically, it does this (not tested, so you'll need to make it work if it doesn't):
private int[] getIntsFromStringArray(String[] arr)
{
int[] result = new int[arr.length];
for (int i=0; i<arr.length; i++) {
result[i]=Integer.parseInt(arr[i]);
}
return result;
}
Of course, you'll probably want to do some error checking to make sure it doesn't crash if arr[i] is not able to be converted to an int.
Posted by
Sam
on Jun 19, 2007 at 12:41 PM UTC - 5 hrs
who knows how to sort date using quick sort???? send it to my email add..... blueglenn3@yahoo.com
Posted by glenn
on Jul 08, 2007 at 11:05 PM UTC - 5 hrs
Hi Glenn. Assuming the date is originally in string format, you'll need to parse it to extract a Date or (more likely) Calendar object. Then sort as normal, using comparisons on the dates.
Posted by
Sam
on Jul 08, 2007 at 11:13 PM UTC - 5 hrs
any source code for date pal???? the date sorting?????
Posted by glenn
on Jul 10, 2007 at 07:46 AM UTC - 5 hrs
input int and then display string u know that ????
Posted by glenn
on Jul 12, 2007 at 12:13 AM UTC - 5 hrs
Sorry, I don't understand what you are asking.
Posted by
Sam
on Jul 12, 2007 at 06:26 AM UTC - 5 hrs
a prgram that you can input integer then it outputs a string like for example input 1 then the computer outputs "one" a string pal. from int it become a string you know that??????
Posted by glenn
on Jul 13, 2007 at 11:09 AM UTC - 5 hrs
How about if(input==1) output("one") ... and so forth?
Posted by
Sam
on Jul 13, 2007 at 12:36 PM UTC - 5 hrs
that's to long pal im looking for an altenative for that hehe... how about the user inputs 2,000,000 do i need to write a code until 2 million " if(input==2,000,000)output("two million") it so long and hard to write can i use a loop for that??/
Posted by glenn
on Jul 13, 2007 at 11:43 PM UTC - 5 hrs
Oh I see. I was thinking it might output "two zero zero..."
I suppose you could do it up to 15, then it becomes x+"teen" for 5<x<20. Then "twenty"+x for x betwwen 1 and 9, and so on for each twenty, thirty, forty, ... ninety. Then hundreds, and so on. Basically how we learn to count - we only need to know 1-15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 1000, 1 million, 1 billion, 1 trillion, and so on.
Of course, you cannot encode all of it, but you could stop at some really high number and just call it one-thousand-trillion instead of the next highest word for it.
So really there is only 28 cases if you did that. That's still a lot of cases, so you might look to replace the conditionals with a table lookup.
Posted by
Sam
on Jul 14, 2007 at 07:34 AM UTC - 5 hrs
thanks pal somewhat i can see the logic.... Are you a IT graduate?? if not what is your course.....
Posted by glenn
on Jul 14, 2007 at 10:19 AM UTC - 5 hrs
Yes, I'm working on my masters in computer science at the moment.
Posted by
Sam
on Jul 14, 2007 at 10:45 AM UTC - 5 hrs
Tigris,
The above should work just fine - you might need to tailor some things to get it to work, but it should work fine.
If you'd like an overview of the algorithm in pseudocode, see
http://www.codeodor.com/index.cfm/2007/5/10/Sortin...
Posted by
Sam
on Nov 02, 2007 at 05:40 AM UTC - 5 hrs
Thanks Sam
I started write code, I'm a bit slow.
What is flattenArray's job in the code.
Can you write the whole code?
Special Thanks again.
Posted by Tigris07
on Nov 05, 2007 at 04:06 AM UTC - 5 hrs
Flatten array just takes an array and joins each element, separated by the 2nd parameter, into a string. Something like:
String flattenArray(Array a, String separator)
{
String result = "";
for(i=0; i<a.length; i++)
result+=a[i] + separator;
return result;
}
Posted by
Sam
on Nov 05, 2007 at 06:05 AM UTC - 5 hrs
Unix sort provides for out-of-memory sort, but it's been lacking in the Java world. Thanks.
Would you be willing to contribute your code to the public domain so lots of people can use it?
Posted by
George
on Nov 30, 2007 at 09:13 AM UTC - 5 hrs
Bugfix: getIndexForColumn()
Instead of testing for pointer equality with "val == arr[i]" it should test for string equality with "val.equals(arr[i])".
Optimization: mergeFiles()
For each line, it performs a linear search to determine which file has the minimum line. And we're facing files with trillions of lines and probably many files to merge. Instead, it'd be good to use a heap. Lotta code to write in the world.
-George
Posted by
George
on Nov 30, 2007 at 09:42 AM UTC - 5 hrs
Hi George, thanks for pointing those things out -- I often forget the equality and normally catch it in the test, but for the comment I didn't look at the original code, just wrote what I thought it would be, so I never ever ran it, much less tested it =).
Re: opening it up to public domain: I'm happy to let whoever wants to use it use it, but to make it completely useful I'd need to reconstruct it from this post as the computer it resided on is no longer with me. That's ok though, because it was all just in the main class, so it'd need to be rewritten to make it useful and I'd have to add sorting for other data types as well.
So, consider it public domain, but I wouldn't want to actually start a project because I wouldn't be available to fix it to get it to a releasable state. Now if /you/ want to do all that, I certainly have no qualms about it. =)
Thanks again for the important comments!
Posted by
Sammy Larbi
on Nov 30, 2007 at 10:28 AM UTC - 5 hrs
Of course, if someone were to do that and base it on the code here, I wouldn't particularly mind if they gave credit to where it came from. =)
Posted by
Sammy Larbi
on Nov 30, 2007 at 10:30 AM UTC - 5 hrs
how to sort the result that i get from the file. example:
in the file contain data as below:
----------------------------------
uname1,no1
uname2,no2
the result display as in the file, but how to produce the data randomly for example:
-----------------------------------------------------------
uname2,no2
uname1,no1
or
uname1,no2
uname2,no1
help me....
thank you..
Posted by wannie
on Dec 27, 2009 at 09:56 PM UTC - 5 hrs
I just copied the code and made it compile. I understand that header line with column names is needed, and sort is done by specifying single column.
1. What if no header line (how could I work by column numbers instead of by column names)? How to sort not only for single specified column, but for all columns?
2. [How to sort for n specified columns?]
3. Could I get the whole code file for it?
Posted by Michael Lev
on Feb 02, 2011 at 03:47 PM UTC - 5 hrs
Hi Michael,
1) You can just remove the parts that read the header and change the parts that find the index of the header column by the name to just use the index you want.
2) To sort for n specified columns is way outside the bounds of what I'd have time to get in depth about right now, but you might look at determining if the primary columns are equal in the merge() step, then sort on the next column, and if the same then the next, and so on. I'm not sure how well that would work out, but I'd start thinking about it from there first and see where it leads.
3) I don't have the whole code file any more (that I've been able to find anyway). I do get a lot of requests for it. One of these days I might have to see if I can find that old hard drive, and then see if I can find it. If I do, I might toss it up on github. I'll leave a comment here if and when I find it.
Posted by
Sammy Larbi
on Feb 28, 2011 at 01:18 PM UTC - 5 hrs
I finally got a hold of the old computer that code was on, and luckily it was still there!
I've posted the working main function along with all the code above and a test.csv to my github miscellany project:
https://github.com/codeodor/miscellany/blob/master...
Posted by
Sammy Larbi
on Mar 18, 2011 at 08:20 AM UTC - 5 hrs
can you send me all code...
Posted by rahul
on May 23, 2011 at 02:53 AM UTC - 5 hrs
Can I use this for sorting based on multiple columns at a time ? e.g. B ascending and C descending ?
Posted by Abhay
on Aug 04, 2011 at 04:30 AM UTC - 5 hrs
Abhay:
You might look at determining if the primary columns are equal in the merge() step, then sort on the next column, and if the same then the next, and so on. I'm not sure how well that would work out, but I'd start thinking about it from there first and see where it leads.
Posted by
Sammy Larbi
on Aug 04, 2011 at 05:56 AM UTC - 5 hrs
Guys,
I need to sort based on two different keys(columns). First one is a numeric and the second one is a string.
Data contains duplicates so whenever the two numeric are the same I need to sort based on string.
Cane somebody provide help to me.
Posted by shaklasah
on Nov 07, 2011 at 01:27 AM UTC - 5 hrs
Shaklasah,
As I mentioned above: "look at determining if the primary columns are equal in the merge() step, then sort on the next column, and if the same then the next, and so on. I'm not sure how well that would work out, but I'd start thinking about it from there first and see where it leads."
Posted by
Sammy Larbi
on Nov 07, 2011 at 05:45 AM UTC - 5 hrs
This is a great idea. However, one thing that came to mind as I was browsing the comments. Since Java already has a TreeMap that implements SortedMap why not construct the key and simply use a TreeMap to build the individual sorted files prior to the merge? This would be simple for ascending keys. It might be more challenging if it had mixed keys. I would write both the sort key and the record content and use the sort key when merging the files back together. I have a need for this sort of thing in my current development. Maybe I'll take a look.
Posted by GP
on Jan 06, 2012 at 04:41 PM UTC - 5 hrs
Admittedly, it's been a while since I've worked in Java in any capacity, but I don't see a reason you couldn't do it, assuming you could guarantee unique values in the sorted field.
To my knowledge, if you had duplicate values in the field on which you're sorting, you won't end up with the right output -- specifically, you'll only get the last encountered duplicate key's row.
Do you have a plan to deal with that? (Or is it unnecessary for some reason?)
Anyway, give it a shot and let me know how it goes!
Posted by
Sammy Larbi
on Jan 06, 2012 at 06:27 PM UTC - 5 hrs
Can you send me a whole code to my email?
I want to take a look at it. I am just beginner of programming. Want to learn more. Thank you
green_goblin013@yahoo.com
Posted by Amy
on Apr 17, 2012 at 07:16 PM UTC - 5 hrs
Thank you , Sammy
Can I ask you question?
When did you work for this one? and what class is that?
You work on this program for your master degree?
green_goblin013@yahoo.com
Posted by Amy
on Apr 24, 2012 at 04:03 PM UTC - 5 hrs
Yeah, I think it was for a course on database theory. I don't remember for sure though.
Posted by
Sammy Larbi
on Apr 25, 2012 at 07:41 PM UTC - 5 hrs
Thanks for sharing your idea and your code, I'm working on the very same job now and luckily I found this, what you have done is particularly useful for me. BTW I'm on my master degree of computer science at the moment too ;-D
Posted by Sherlock Wesker
on Oct 11, 2012 at 02:37 AM UTC - 5 hrs
Cool Sherlock, glad you found it helpful. Best of luck in your MS!
Posted by
Sammy Larbi
on Oct 11, 2012 at 07:04 AM UTC - 5 hrs
Can any one explain, why we are using "// check if one still has rows" this block?
tell me in which scenario System exist will occur in this block.
Thanks,
Posted by Mahamood
on Nov 07, 2014 at 09:28 AM UTC - 5 hrs
@Mahamood I don't recall the reasoning at the time, but looking back through it I think it may be to cover the unexpected case that another process opened the file and added a row while we were sorting.
Posted by
Sammy Larbi
on Nov 09, 2014 at 11:09 AM UTC - 5 hrs
Leave a comment