Grant’s Grunts
Random thoughts on programming, photography, triathlon, life and work

Processing a large number of Files in Java

So I was working on indexing 15+ million documents using Lucene, and it struck me that I would like somewhat better low-level support for dealing with files in Java. These files are stored in several hundred directories, each containing 30K docs.

The first approach I took was to recurse on the directories, invoke File.listFiles() on each one, and then loop over the resulting array and add the files to the index. Pretty standard, and it works fine, but it has a double loop in it (actually triple when you look at the implementation of listFiles()): one at the OS level when Java requests the list of files, and then the loop I do over the results of listFiles().

OK, I thought, perhaps I could (hidden in my library, unbeknownst to my library users) use File.listFiles(FileFilter) but have the FileFilter’s accept method actually do the indexing. This works and eliminates the second loop, but… In looking into the Java source on Windows, I noticed that the implementation just relies on File.list(), so I am not saving as much as I had hoped, although I am still saving, which is good. Additionally, since I have my “fake” FileFilter return false, nothing gets added to the internal List that holds accepted Files, and the toArray() call that converts that List into a File array has nothing to copy.
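Here is a minimal sketch of that "fake" FileFilter trick. The class name IndexingFilter and the indexed list are made up for illustration; the list stands in for whatever the real indexing call would be (for example, adding a document to a Lucene index).

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: a FileFilter that does the work itself and rejects every file. */
class IndexingFilter implements FileFilter {
    // Stand-in for the real indexing call (assumption: the real code would index here).
    final List<String> indexed = new ArrayList<>();

    @Override
    public boolean accept(File file) {
        if (file.isDirectory()) {
            // Recurse into subdirectories; the return value is ignored because
            // accept() rejects everything, so there is nothing useful in it.
            file.listFiles(this);
        } else {
            indexed.add(file.getPath());
        }
        // Always reject, so listFiles() never collects any File objects.
        return false;
    }
}
```

Kicking it off is just a matter of calling listFiles(new IndexingFilter()) on the root directory and throwing away the (empty) result.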

Still, it seems like there is a better way to do this. What I would really like is a low-level hook provided by Java that doesn’t require the array to be built at all, something like:
File.processFiles(FileProcessor) (it really shouldn’t be on File, but in some class that deals with the File System, probably) where FileProcessor looks something like FileFilter except the accept method is something like:
void process(File)

This method should not create any of the arrays/lists associated with listFiles() (or at least it should minimize them), and it would be hooked into the FileSystem implementation at a low level so that it does its file operations as it gets entries back from the OS, or at least is optimized to work with the OS. Unfortunately, I don’t know enough about the low-level details of the OS, but I imagine something like this is possible.
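For what it's worth, Java 7 and later ship java.nio.file.DirectoryStream, which iterates over directory entries instead of returning an array, and that may come close to the hook described above. The following is only a sketch under that assumption: FileProcessor and FileWalker are hypothetical names taken from or invented for this post, not part of the JDK.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;

/** Hypothetical callback interface, mirroring the process(File) idea. */
interface FileProcessor {
    void process(File file);
}

/** Hypothetical helper; not a JDK class. */
class FileWalker {
    /** Hands every non-directory entry under dir to the processor, recursing into subdirectories. */
    static void processFiles(Path dir, FileProcessor processor) throws IOException {
        // DirectoryStream is iterated entry by entry; no File[] is built here.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                // NOFOLLOW_LINKS: symbolic links to directories are not descended into.
                if (Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS)) {
                    processFiles(entry, processor);
                } else {
                    processor.process(entry.toFile());
                }
            }
        }
    }
}
```

With this in place, a call like FileWalker.processFiles(root, file -> index(file)) would do the work as entries come back, where index is whatever indexing routine you already have.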

Java, File, listFiles(), FileFilter, large file processing, Lucene, indexing
