Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsFree MagazinesWhite PapersSubmit Content
Discussion GroupsASP.NETWindows FormsLanguages.NET FrameworkVisual Studio.NET
Articles.NET FrameworkASP.NETToolsWindows Forms
.NET DirectoryOpen Source ProjectsUser GroupsWeb Resources
Related Topics
Visual Basic 6SQL ServerMS AccessOther DB ProductsMS Server ProductsMore Topics ...

.NET Forum / Languages / C# / March 2008

Tip: Looking for answers? Try searching our database.

Reading & Searching a Huge file

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
Ahmad Jalil Qarshi - 14 Mar 2008 18:35 GMT
Hi,

I have a text file having size about 2 GB. The text file format is like:

Numeric value[space]AlphaNumeric values
Numeric value[space]AlphaNumeric values
Numeric value[space]AlphaNumeric values

For example consider following chunk of actual data:

52 SANZP3118.00Q3126T070533N16CEPXT
96 SANZP3118.00Q7294T070534N17CEP
100 SBHPP3200.00Q16000T070534N18CEPXT
150 SBHPP3200.00Q1000T070534N19CEP
153 SBHPP3200.00Q5000070534N20CEP
159 SBHPP3200.00Q3000070534N21CEP
168 SBHPP3200.00Q111000T070534N22CEP
168 SLHGP375.00Q4000T070534N26CEP
210 SBHPP3200.0Q000T070534N23CEP
569 SLHGP375.00Q23000T070534N24CEP
1000 SLHGP375.00Q162000T070534N25CEP
1010 SLHGP375.00Q4000T070534N26CEP

Now what I want to is to find a specific numeric value at the beginning of
the line. For example in the above few lines the numeric values are
52,96,100,150,153,159,168,210,569,1000,1010 (these values are always
sorted). Suppose I have to search 168. Since the values are sorted I can use
the binary search technique to search a specific numeric value. But for that
I will have to read all the numeric values sequentially into some array in
memory. The other way is sequential search i.e to read line by line and
fetch the numeric value before [space] and compare it with required one.

What is the best possible way to perform searching in the above mentioned
file format.

Thanks in anticipation,

Regards,

Ahmad Jalil Qarshi
Peter Duniho - 14 Mar 2008 19:37 GMT
> [...]
> Now what I want to is to find a specific numeric value at the beginning  
[quoted text clipped - 12 lines]
> What is the best possible way to perform searching in the above mentioned
> file format.

How many times do you need to perform this search?

If it's just once, then I think you could take advantage of the fact that  
the lines are all _almost_ the same length and do a binary search on the  
file itself.  You'll have to make a guess, then search for the  
previous/next end-of-line, then check the numeric value at the beginning  
of that line.  That's not as efficient as it could be if you had identical  
line lengths, but compared to scanning through a 2GB file it should still  
be plenty fast.

If you're going to perform the search on a regular basis, I'd recommend  
creating an index for the file.  Scan it once, storing just the initial  
numeric value and a byte offset within the file representing where that  
record's line starts.  Then later you can do your binary search on the  
index instead of the file.  This is, I think, what you're suggesting when  
you wrote "read all the numeric values sequentially into some array"; you  
didn't mention also storing the file offsets for each record, but  
hopefully you realize that's implied since otherwise keeping the values in  
an array wouldn't be useful.

Of course, keep in mind that this is the sort of thing that databases are  
really good at.  If you have a way to redirect the data into a database  
rather than keeping it in this huge text file, I think that would be the  
best way to handle it.

Pete
Ahmad Jalil Qarshi - 15 Mar 2008 10:36 GMT
Thanks Peter & Jon,

I have to read the file on user request. Now the problem is that I can't
create index as the format of file is fixed. So I think the only option left
is binary search.

Let me explain one more thing which I missed in my previous post. The
numeric value before the [space] can be repeated thousands of time
consequtively. Like.
....... ..............................
2073 SANZP311800Q729T070534N17CEP
2345 SBHPP3200.00Q16000T070534N18CEPXT
2345 SBHPP3200.00Q1000T070534N19CEP
2345 SBHPP3200.00Q5000070534N20CEPTP
....... ..............................
....... ..............................
2345 SLHGP375.00Q4000T070534N26CEP
2567 SBHPP3200.0Q000T070534N23CEP
........ ..............................

Now if I have to find the first occurance of a numeric value 2345. In that
case binary search will help or not?

Regards,

>> [...]
>> Now what I want to is to find a specific numeric value at the beginning
[quoted text clipped - 39 lines]
>
> Pete
Peter Duniho - 15 Mar 2008 19:10 GMT
> Thanks Peter & Jon,
>
> I have to read the file on user request. Now the problem is that I can't
> create index as the format of file is fixed. So I think the only option  
> left
> is binary search.

Well, a linear scan is always an option.  Whether that performs well  
enough to justify avoiding the work of implementing a binary search,  
that's up to you or whoever's making the decision about how your time is  
best used.

If the text file uses an encoding that has fixed-sized characters, I think  
that a binary search isn't actually going to be very hard to implement.  
But as Jon points out, if you have to support multiple encodings and/or  
encodings with variable-length characters (e.g. UTF-8), then it gets more  
complicated (a little...I'm no character-encoding expert, but I'm under  
the impression that in any of the encodings, finding an end-of-line marker  
while scanning backwards would not be difficult; a LF or CR/LF pair isn't  
a valid sequence in any other character's data, is it?).

In either case, it's certainly possible.  The question is what's the best  
use of your time.

I still wonder at the requirement that this has to be done "on demand",  
without an opportunity to index the file.  That's an awful lot of data to  
be search just once.  But if "the powers that be" deem it more sensible to  
make you implement a binary search on a variable-line-length text file  
than to fix the problem space to better incorporate an indexing- or  
database-based solution, I guess that's what you're stuck with.

I have to admit: if these happen infrequently enough that indexing a file  
doesn't make sense, I wonder if they don't also happen infrequently enough  
that just scanning linearly through the file wouldn't be okay.  Why does  
this need to be fast?

> Let me explain one more thing which I missed in my previous post. The
> numeric value before the [space] can be repeated thousands of time
[quoted text clipped - 3 lines]
> that
> case binary search will help or not?

It will help just as much as it would in any other situation where a  
binary search can be used.  Binary search doesn't imply unique elements,  
but if you have duplicated elements that does mean that you have to do  
some extra work to scan and find the first matching element, since the  
first element the search finds may not be the actual first element in the  
sequence.

Pete
Jon Skeet [C# MVP] - 14 Mar 2008 20:36 GMT
<snip>

> What is the best possible way to perform searching in the above mentioned
> file format.

What are your criteria for "best" here? If you don't mind scanning
through the whole file I'd just read it line by line.

If it needs to be quick, you could go for the binary chop while hunting
around for line boundaries - but that'll be much more complex.

I'd guess that the simple solution would be about 5 lines of code,
whereas the complex one could easily be 100, and easy to get wrong. It
gets worse if the file is in a variable-length encoding (e.g. UTF-8).

Signature

Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet   Blog: http://www.msmvps.com/jon.skeet
World class .NET training in the UK: http://iterativetraining.co.uk


Rate this thread:







Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.