Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsFree MagazinesWhite PapersSubmit Content
Discussion GroupsASP.NETWindows FormsLanguages.NET FrameworkVisual Studio.NET
Articles.NET FrameworkASP.NETToolsWindows Forms
.NET DirectoryOpen Source ProjectsUser GroupsWeb Resources
Related Topics
Visual Basic 6SQL ServerMS AccessOther DB ProductsMS Server ProductsMore Topics ...

.NET Forum / .NET Framework / New Users / June 2007

Tip: Looking for answers? Try searching our database.

File searching is really slow, how do I speed it up?

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
janderson - 27 May 2007 04:01 GMT
Hi gang,

I have a win32 C# app that needs to recursively search for a particular
file type really fast.  It always searches the same place for these files.

I'm using  Directory.GetDirectories and Directory.GetFiles to do it
currently.  Because I know the location is the same every time I build a
cache of all the files and directories.  If a directory changes I update
its contents on startup of the app.

Unfortunately when I do a last-date-of-modification check on a directory
it only reports on changes of files that are directly under it.  So I
have to check every single directly rather then simply checking the
parent directory.

While I get a significant improvement when none of the directories
change (its much faster to load a file with all the entries in it then
to do a directory search).  In particular the first time the app loads
up and has to search every file (around 100,000 and growing) which is
painfully slow (2 or 3 minutes to start up the app).

What techniques/APIs can I use to speed up file searching?

Is there a way to more directly look at the raw data that make up the
file-tables?

Cheers
Peter Duniho - 27 May 2007 07:03 GMT
> [...]
> What techniques/APIs can I use to speed up file searching?

Well, one thing that comes to mind is that perhaps there is an API, .NET  
or native Windows, that has access to the Windows index-based searching.  
I don't know whether there is or not, but it's worth considering.

> Is there a way to more directly look at the raw data that make up the  
> file-tables?

Surely there is, but I'm sure you can't do it via .NET directly, nor am I  
at all convinced it would significantly speed up your search, since I  
suspect that Windows is already pulling the "raw data" off the disk about  
as fast as is possible, and feeding it to you with relatively little  
overhead.

Beyond that, I have to question a design that requires the search of  
100,000 files or more during startup.  At the very least, if you feel  
that's necessary, I think you'll simply have to take the performance hit.  
That's a *lot* of files, and a lot of i/o.  I/o is slow, and highly  
dependent on the device.  The best way to speed up your design is to  
change it so that you don't have to look at every file in every directory  
in the first place.

How you might do that, I don't know.  I don't know why you have to search  
for this particular file on startup, so I can't suggest ways you could fix  
the design so that's not required.

On that point, keep in mind that with such a lengthy search, you really  
have no guarantee when the search is over that you have accurate results.  
For example, once you've checked a directory, it could be that the file  
gets created after that, while you're checking the rest of the  
directories.  Or maybe you detect a file that gets removed before your  
search is done.

Again, since I don't know exactly how the file gets created or deleted, or  
why it is you want to look for it at startup, I can't really comment on  
the exact nuances of searching for it.  But the above issues would apply  
in the general case, at least.

If in the end you decide you don't want to change the design, you may want  
to consider including some sort of auxiliary background process that  
essentially indexes the interesting directories, and can tell your main  
process immediately upon startup whether the file you're interested in  
exists.  That would be sort of like using the built-in Windows indexing,  
but wouldn't require that you figure out how to get at it (assuming it's  
possible at all).

Pete
janderson - 27 May 2007 11:44 GMT
>> [...]
>> What techniques/APIs can I use to speed up file searching?
[quoted text clipped - 8 lines]
>
> Surely there is, but I'm sure you can't do it via .NET directly, nor am

If its indirect, I'm fine with that.

> I at all convinced it would significantly speed up your search, since I
> suspect that Windows is already pulling the "raw data" off the disk
> about as fast as is possible, and feeding it to you with relatively
> little overhead.

Maybe, however I think net may be pulling in things a simply don't need.

> Beyond that, I have to question a design that requires the search of
> 100,000 files or more during startup.  At the very least, if you feel
[quoted text clipped - 3 lines]
> to change it so that you don't have to look at every file in every
> directory in the first place.

It shouldn't be that slow.  I know the file-system is all over the place
but the amount of data for 100,000 is only about 2mbs worth of
filenames. But maybe windows is doing a good job of the search and its
just how the table is layed out.

> How you might do that, I don't know.  I don't know why you have to
> search for this particular file on startup, so I can't suggest ways you
[quoted text clipped - 6 lines]
> directories.  Or maybe you detect a file that gets removed before your
> search is done.

I'm certain the user won't change things while they are using the tool.

> Again, since I don't know exactly how the file gets created or deleted,
> or why it is you want to look for it at startup, I can't really comment
> on the exact nuances of searching for it.  But the above issues would
> apply in the general case, at least.

In this design its is necessary.  Have you ever used visual assist?  It
needs to work kinda like that.  The whole idea is that the fastest way
to get to something is to simply type it and see the list of
possibilities shrink as you get closer to what you want.  Its a optimal
design thing.

> If in the end you decide you don't want to change the design, you may
> want to consider including some sort of auxiliary background process
[quoted text clipped - 3 lines]
> indexing, but wouldn't require that you figure out how to get at it
> (assuming it's possible at all).

This is a good idea.  I would probably need to run the service all the
time.  There is a way to monitor file-system changes.  I may give that a
shot.  I hope it doesn't slow down computer time by much like
visualstudio does.  Actually I noticed the google desktop searcher does
this.

Thanks for your thoughts.

> Pete
Peter Duniho - 27 May 2007 18:26 GMT
>>> Is there a way to more directly look at the raw data that make up the  
>>> file-tables?
>>  Surely there is, but I'm sure you can't do it via .NET directly, nor am
>
> If its indirect, I'm fine with that.

Well, there's always p/invoke.  If you can find a way in the native Win32  
API, you can use p/invoke to call outside of managed code.

>> I at all convinced it would significantly speed up your search, since I  
>> suspect that Windows is already pulling the "raw data" off the disk  
>> about as fast as is possible, and feeding it to you with relatively  
>> little overhead.
>
> Maybe, however I think net may be pulling in things a simply don't need.

I would be surprised if it was accounting for even 10% overhead.

> It shouldn't be that slow.  I know the file-system is all over the place  
> but the amount of data for 100,000 is only about 2mbs worth of  
> filenames. But maybe windows is doing a good job of the search and its  
> just how the table is layed out.

You can't just consider the total amount of data.  In fact, for the kind  
of search you're doing, the total amount of data is probably the  
least-relevant piece of information.  Most important is how the data is  
distributed on the disk, and how many seeks it takes to get at it all.  It  
all depends on how the directory data is laid out.  If it's not all in one  
place, seeks and cache misses are going to really slow things down.

> I'm certain the user won't change things while they are using the tool.

Then why not just run your search in the background, using the most recent  
previous results until you have updated the data?  The user will get  
nearly 100% functionality during that window after startup before the  
search has completed, without causing a delay in the application starting  
up.

> In this design its is necessary.  Have you ever used visual assist?  It  
> needs to work kinda like that.  The whole idea is that the fastest way  
> to get to something is to simply type it and see the list of  
> possibilities shrink as you get closer to what you want.  Its a optimal  
> design thing.

I have never heard of "virtual assist".  I found this link using Google:  
http://virtualassistoncall.com/  I don't see what searching 100,000 files  
has to do with that though.  Your explanation also doesn't actually give  
me any insight into why it is that searching 100,000 files when you start  
your application is necessary.  "Optimal design" and "search 100,000 files  
on startup" don't strike me as correlated, at least not at first glance.

> This is a good idea.  I would probably need to run the service all the  
> time.  There is a way to monitor file-system changes.  I may give that a  
> shot.  I hope it doesn't slow down computer time by much like  
> visualstudio does.  Actually I noticed the google desktop searcher does  
> this.

Yes, all of the various desktop search utilities have to do something  
similar.  Of course, it begs the question as to whether it really makes  
sense to have a half-dozen different applications installed, all of which  
are trying to do some sort of indexing on the disk.  :)  But it's  
technically possible.

Pete
janderson - 27 May 2007 19:11 GMT
>>>> Is there a way to more directly look at the raw data that make up
>>>> the file-tables?
[quoted text clipped - 47 lines]
> "search 100,000 files on startup" don't strike me as correlated, at
> least not at first glance.

Sorry I meant visual assist.

http://www.wholetomato.com/

>> This is a good idea.  I would probably need to run the service all the
>> time.  There is a way to monitor file-system changes.  I may give that
[quoted text clipped - 9 lines]
>
> Pete
Peter Duniho - 27 May 2007 19:27 GMT
> Sorry I meant visual assist.
>
> http://www.wholetomato.com/

Actually, that's what you wrote.  I must be getting dyslexic in my old  
age.  :)

Sorry for the confusion.  Still having seen that, it doesn't seem a lot  
different than Visual Studio's built-in "Intellisense" (though it does add  
to that).  VS doesn't have to search through 100,000 files to accomplish  
what it does, nor does it prevent you from using VS before Intellisense  
has caught up (in the rare cases when it hasn't...I usually only see the  
"still building database" message on my really slow PIII 550Mhz PC anyway).

I do remain unconvinced that you're going to make the search of 100,000  
files much faster (unless the code you're using now is *really* broken  
somehow, and it doesn't sound like it is).  It's more likely that you'll  
have success with an approach that simply moves the search out of the  
user's way so it's not a bottleneck (whether that's doing it in a  
different process, or doing in the background the way Intellisense does).

And of course, I still wonder at the need to do the search in the first  
place, but without more details I can only take as granted that you're  
correct about that need.

Pete
janderson - 24 Jun 2007 21:53 GMT
>> Sorry I meant visual assist.
>>
[quoted text clipped - 10 lines]
> only see the "still building database" message on my really slow PIII
> 550Mhz PC anyway).

intellisense in visual assist is nice, but thats not what I'm talking
about.  You are able to type alt+O partial name of the file u want,
enter.  Bam your at that file.  Rather then navigating a slow tree.  A
mouse movement takes 1-4 seconds to navigate each branch, thats if you
don't have to scroll (which is extremely inefficient) and read 20
different names 3 levels deep.  Maybe I'm slow at navigating trees but
it takes me about 20-30 seconds to find a file in VS.

A keyboard click takes (at 60 words a minute) 1 second or less, normally
you only need to type 3 or 4 letters to get what you want.  So 7
keystrokes/seconds in total is a lot more efficient. (Visual assist
takes me on average 3 seconds to get to a file).

I guess its a case of you have be be there.

> I do remain unconvinced that you're going to make the search of 100,000
> files much faster (unless the code you're using now is *really* broken
[quoted text clipped - 8 lines]
>
> Pete

I came across this today, thought it may be useful.

http://video.google.com/videoplay?docid=-6856727143023456694&hl=en
Peter Duniho - 24 Jun 2007 22:14 GMT
> I came across this today, thought it may be useful.
>
> http://video.google.com/videoplay?docid=-6856727143023456694&hl=en

At 90 minutes, no.  Not useful at all.

From your previous description, you seem to be talking about an indexed  
file search.  Which is fine, and a common enough practice.  But you don't  
get that for free.  If you intend to index even just the filenames of  
100,000 files, you have to read that data from the disk, and that will  
take a non-trivial amount of time.

Note that all along, I have assumed that in spite of my failure to  
understand the exact application you're trying to write, you really do  
need to do what you're asking about.  I haven't been questioning that.  I  
have only questioned whether your expectation that you can do it in some  
unnoticeable amount of time is a reasonable one.

I could be wrong about that, but the last post in this thread was four  
weeks ago.  That seems like plenty of time to have explored your options  
and found a faster way to get the file information, if one exists.  So  
that begs the question: have you in fact found any significant improvement  
on the file i/o aspect of your problem?

Absent any progress on the performance aspect, I stand by my previous  
suggestion: the way to do this is to do it as other indexing  
implementations handle it.  Index in the background, and offer the user  
whatever functionality you can, however minimal at any given moment, with  
whatever has been indexed so far.  Once an initial index has been built,  
this will provide nearly perfect results immediately, and should only take  
a few minutes or so (maybe 10 or 15 if you throttle your indexing so that  
the user's other work isn't impeded noticeably) for the index to be  
brought up to date.

Pete
Ray Cassick - 28 May 2007 00:59 GMT
Can you split the app up into two parts?

1) The part that uses the files

2) A service that is in charge of keeping track of new files? This service
can always run in the background and use the file system watcher to be
triggered on new/changed files even if the other app is not running.

The 1st app can query the second one when it starts for a list of any
changed files and will only need to go directly to those files on startup
instead of having to search all of them.

> Hi gang,
>
[quoted text clipped - 23 lines]
>
> Cheers
janderson - 28 May 2007 05:36 GMT
> Can you split the app up into two parts?
>
[quoted text clipped - 7 lines]
> changed files and will only need to go directly to those files on startup
> instead of having to search all of them.

This is probably what I'm going to end up doing.  Thanks.

>> Hi gang,
>>
[quoted text clipped - 23 lines]
>>
>> Cheers

Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.