File Stream - Performance is getting slower and slower - why?
m00nm0nkey - 21 Apr 2006 12:18 GMT
Hello. I am trying to split a file with 334,386 lines into separate files of 50,000 lines each.
This is the code I am running:
Dim intFragmentRawIndex As Integer
Dim swRawDataFile As StreamWriter
Dim intNewRawIndex As Integer
Dim strNewRawDataFileName As String
Dim intFragmentCallCount As Integer = 0
Dim strHeaderLine As String
Dim blnFileClosed As Boolean = False

strHeaderLine = colRawDataFile(1)

CreateRawDataFragment(intParentRawIndex, intNewRawIndex, strNewRawDataFileName)

Dim myFileStream As New System.IO.FileStream(strTempDirectoryPath & strNewRawDataFileName, _
    FileMode.OpenOrCreate, FileAccess.Write, FileShare.None)
swRawDataFile = New StreamWriter(myFileStream)
swRawDataFile.WriteLine(strHeaderLine)

For i As Integer = 2 To colRawDataFile.Count
    If intFragmentCallCount = 50000 Then
        'Clear Stream Writer Buffer
        swRawDataFile.Flush()
        'Close file
        swRawDataFile.Close()
        'Set Call Count against raw data file
        SetFragmentCallCount(intNewRawIndex, intFragmentCallCount)
        'Reset call count
        intFragmentCallCount = 0
        'If not on final line of raw data file....
        If i <> colRawDataFile.Count Then
            CreateRawDataFragment(intParentRawIndex, intNewRawIndex, strNewRawDataFileName)
            myFileStream = New System.IO.FileStream(strTempDirectoryPath & strNewRawDataFileName, _
                FileMode.OpenOrCreate, FileAccess.Write, FileShare.None)
            swRawDataFile = New StreamWriter(myFileStream)
            swRawDataFile.WriteLine(strHeaderLine)
        Else
            blnFileClosed = True
        End If
    End If
    swRawDataFile.WriteLine(colRawDataFile(i))
    intFragmentCallCount += 1
Next

If Not blnFileClosed Then
    'Close last fragment
    swRawDataFile.Close()
    'Set call count against last fragment
    SetFragmentCallCount(intNewRawIndex, intFragmentCallCount)
End If
The first file is created in 3 minutes. The second file takes 11 minutes. The third file takes 18 minutes. I am still waiting for the fourth file to be created.
I am writing the same number of records to each file, so why does writing a file of the same size take longer each time?
I thought that calling the Flush method of the stream would maintain performance, but this does not seem to be the case! What am I doing wrong?
 Signature welcome to the mooon !
olrt - 21 Apr 2006 14:09 GMT
What are these routines:
- SetFragmentCallCount
- CreateRawDataFragment
What is this variable: colRawDataFile?
m00nm0nkey - 21 Apr 2006 14:20 GMT
>>SetFragmentCallCount
>>CreateRawDataFragment
Don't worry about these - basic database operations.
>>colRawDataFile
This is the key point in this routine - it's basically the entire file, loaded into a collection - therefore, this collection has 334,386 entries in it.
 Signature welcome to the mooon !
> What are these routines :
> [quoted text clipped - 5 lines]
>
> ??
Markus - 21 Apr 2006 17:30 GMT
>>> colRawDataFile
> This is the key point in this routine - it's basically the entire
> file, loaded into a collection - therefore, this collection has
> 334,386 entries in it.
And why do you load the entire file into a collection? This doesn't make sense to me (but maybe I don't have a complete understanding of your solution)... I would try an approach like this (pseudo code):
FileReader input = ...;
FileWriter output = ...;
string data;
int i = 0;  // lines written to the current output
while (input is not EOF)
{
    if (i == 50000)
    {
        // close old output
        // create new output
        i = 0;
    }
    output.write(input.ReadLine());
    i++;
}
no need to fetch all lines into memory (a collection).
hth Markus
Jon Skeet [C# MVP] - 21 Apr 2006 19:42 GMT
> >>SetFragmentCallCount
> >>CreateRawDataFragment
[quoted text clipped - 4 lines]
> loaded into a collection - therefore, this collection has 334,386 entries in
> it.
What kind of collection? If it's some kind of linked list, it would get horribly slow.
Have you tried removing pieces of the routine (such as the database operations) and seeing whether that makes a difference?
If this doesn't help, could you post a short but complete program which demonstrates the problem?
See http://www.pobox.com/~skeet/csharp/complete.html for details of what I mean by that. (Ignore the fact that it talks about C# - the same can be done in VB.NET easily.)
If the database calls aren't the problem, then stripping those out to produce a short but complete program shouldn't be an issue, and you can generate random strings to put into the collection.
 Signature Jon Skeet - <skeet@pobox.com> http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet If replying to the group, please do not mail me too
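A minimal C# sketch of the linked-list concern, assuming colRawDataFile behaves like a list without random access (the thread never states the collection type, so this is only a guess). ItemAt, StepsForIndexedLoop and StepsForEnumeration are hypothetical names used for illustration; they count how many nodes are visited when every item is fetched by index compared with enumerating the list once.

```csharp
using System;
using System.Collections.Generic;

static class IndexedWalkCost
{
    // Returns the 1-based item at 'index' by walking from the head and adds
    // the number of nodes visited to 'steps'. This mimics an indexer on a
    // collection with no random access (an assumption, see above).
    public static string ItemAt(LinkedList<string> list, int index, ref long steps)
    {
        LinkedListNode<string> node = list.First;
        steps++;
        for (int i = 1; i < index; i++)
        {
            node = node.Next;
            steps++;
        }
        return node.Value;
    }

    // Builds a list of 'count' dummy lines.
    private static LinkedList<string> Build(int count)
    {
        LinkedList<string> list = new LinkedList<string>();
        for (int i = 0; i < count; i++)
        {
            list.AddLast("line " + i);
        }
        return list;
    }

    // Nodes visited when every item is fetched by index, as in
    // "For i ... : colRawDataFile(i)".
    public static long StepsForIndexedLoop(int count)
    {
        LinkedList<string> list = Build(count);
        long steps = 0;
        for (int i = 1; i <= count; i++)
        {
            ItemAt(list, i, ref steps);
        }
        return steps;
    }

    // Nodes visited when the same list is enumerated once.
    public static long StepsForEnumeration(int count)
    {
        LinkedList<string> list = Build(count);
        long steps = 0;
        foreach (string line in list)
        {
            steps++;
        }
        return steps;
    }
}
```

Under that assumption, fetching all n items by index visits n(n+1)/2 nodes, and items near the end of the collection cost the most, which would fit fragments that take longer and longer to write. Enumerating the collection (or reading the file line by line) visits each item once.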
m00nm0nkey - 21 Apr 2006 14:33 GMT
OK, well, I thought I'd try a different approach, so what I'm now trying is appending 50,000 lines from the collection to a StringBuilder, and then writing that entire StringBuilder to a file.
However, look at this log:
21/04/2006 14:09:06: Building String Start
21/04/2006 14:09:14: appended 10,000 lines to the stringbuilder
21/04/2006 14:09:39: appended 10,000 lines to the stringbuilder
21/04/2006 14:10:20: appended 10,000 lines to the stringbuilder
21/04/2006 14:11:20: appended 10,000 lines to the stringbuilder
21/04/2006 14:12:36: appended 10,000 lines to the stringbuilder
21/04/2006 14:12:36: append of 50,000 lines to file from stringbuilder complete
21/04/2006 14:12:36: Building String Start
21/04/2006 14:14:05: appended 10,000 lines to the stringbuilder
21/04/2006 14:16:00: appended 10,000 lines to the stringbuilder
21/04/2006 14:18:36: appended 10,000 lines to the stringbuilder
21/04/2006 14:21:18: appended 10,000 lines to the stringbuilder
21/04/2006 14:23:58: appended 10,000 lines to the stringbuilder
21/04/2006 14:23:59: append of 50,000 lines to file from stringbuilder complete
21/04/2006 14:23:59: Building String Start
After each append to the file, I clear the StringBuilder using this code:
sbFileContent = New StringBuilder
However, there's still obviously a big slowdown. Why is this?
 Signature welcome to the mooon !
> Hello. I am trying to split a file with 334,386 lines into separate files of
> 50,000 each.
[quoted text clipped - 88 lines]
> I thought that calling the flush method of the stream would maintain
> performance but this does not seem to be the case! What am i doing wrong?
Patrice - 21 Apr 2006 15:15 GMT
Not sure why you are using a StringBuilder to split a file into 3...
I would just read directly from the source file and would write to the "current" file just switching to a new file when appropriate (just perhaps playing with buffered streams to improve performance).
 Signature Patrice
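The read-and-switch approach described above could look roughly like this in C#. It is only a sketch under assumptions not stated in the thread: the first line is a header that should be repeated in each fragment (as the original VB code does), the files are UTF-8, and a 64 KB buffer is a reasonable starting point. FileSplitter.Split and its parameters are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

static class FileSplitter
{
    // Splits 'inputPath' into fragments of at most 'linesPerFile' data lines.
    // The first input line is treated as a header and written at the top of
    // every fragment. 'outputFormat' is a format string such as "out.{0}.txt",
    // where {0} receives the 1-based fragment number.
    // Returns the number of fragments written.
    public static int Split(string inputPath, string outputFormat, int linesPerFile)
    {
        if (linesPerFile <= 0)
            throw new ArgumentOutOfRangeException("linesPerFile");

        const int BufferSize = 64 * 1024; // assumed value, tune by measuring
        int fragments = 0;

        using (StreamReader reader = new StreamReader(inputPath, Encoding.UTF8, true, BufferSize))
        {
            string header = reader.ReadLine();
            if (header == null)
                return 0; // empty input, nothing to write

            StreamWriter writer = null;
            try
            {
                int linesInCurrent = 0;
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // Open the first fragment lazily, then roll over to a new
                    // one whenever the current fragment is full.
                    if (writer == null || linesInCurrent == linesPerFile)
                    {
                        if (writer != null)
                            writer.Dispose();
                        fragments++;
                        writer = new StreamWriter(string.Format(outputFormat, fragments),
                                                  false, new UTF8Encoding(false), BufferSize);
                        writer.WriteLine(header);
                        linesInCurrent = 0;
                    }
                    writer.WriteLine(line);
                    linesInCurrent++;
                }
            }
            finally
            {
                if (writer != null)
                    writer.Dispose();
            }
        }
        return fragments;
    }
}
```

For example, FileSplitter.Split(inputPath, inputPath + ".{0}", 50000) would write inputPath.1, inputPath.2 and so on. Only one line is held in memory at a time, so the cost per line should not depend on how far into the file it is.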
> Ok well i thought i'd try a different approach, so what I'm now trying is
> appending 50,000 lines from the collection to a stringbuilder, and then
[quoted text clipped - 121 lines]
>> I thought that calling the flush method of the stream would maintain
>> performance but this does not seem to be the case! What am i doing wrong?
Helge Jensen - 21 Apr 2006 20:26 GMT
> Hello. I am trying to split a file with 334,386 lines into separate files of
> 50,000 each.
How about something along the lines of: read lines, one by one, write them to current output. Every N lines open a new output.
using System.IO;

void SplitIntoFiles(TextReader r, ulong limit, string nameFormat)
{
    ulong count = 0;      // lines read so far
    TextWriter w = null;  // current output, null until the first line arrives
    try
    {
        for (string l = r.ReadLine(); l != null; l = r.ReadLine())
        {
            // Every 'limit' lines, close the current output and open a new
            // one named after the line offset.
            if (count % limit == 0)
            {
                if (w != null)
                    w.Dispose();
                w = new StreamWriter(string.Format(nameFormat, count));
            }
            ++count;
            w.WriteLine(l);
        }
    }
    finally
    {
        if (w != null)
            w.Dispose();
    }
}

SplitIntoFiles(new StreamReader(input_path), 50000, input_path + ".{0}");
I haven't compiled the code, but you should be able to get the idea.
Note that the code above will add a newline to the end of the last output file, even if none was present in input_path. Also, the code will not work as expected on files longer than ulong.MaxValue lines.
 Signature Helge Jensen mailto:helge.jensen@slog.dk sip:helge.jensen@slog.dk -=> Sebastian cover-music: http://ungdomshus.nu <=-
Jon Skeet [C# MVP] - 21 Apr 2006 22:45 GMT
> Also, the code will not work as expected on files longer than
> ulong.MaxValue lines.
When you find a disk capable of storing a file with 18,446,744,073,709,551,615 lines, let me know :) At one byte per line, that's still 16 exabytes.
http://en.wikipedia.org/wiki/Exabyte has some interesting stats on exabytes.
 Signature Jon Skeet - <skeet@pobox.com> http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet If replying to the group, please do not mail me too
Helge Jensen - 22 Apr 2006 07:54 GMT
>>Also, the code will not work as expected on files longer than
>>ulong.MaxValue lines.
>
> When you find a disk capable of storing a file with
> 18,446,744,073,709,551,615 lines, let me know :) At one byte per line,
> that's still 16 exabytes.
It's not that I'm concerned about it, it just happens to be so :)
Since some streams are infinite (or at least supposedly infinite) and line oriented, it makes sense to just note the fact that there is a limit on the expected behaviour.
The code could be rewritten to work on arbitrary-length input (provided the FS allows arbitrary-length paths) but I don't think it's worth the effort, and it's nice to have the line offset in the file name so...
 Signature Helge Jensen mailto:helge.jensen@slog.dk sip:helge.jensen@slog.dk -=> Sebastian cover-music: http://ungdomshus.nu <=-
Eric - 04 May 2006 10:27 GMT
I am struggling with the same sort of problem. The program I'm building needs to split a file containing up to 100,000 XML records. I need to split those into 2 or 3 separate files. I'm using different XmlTextWriters to create those files, and I clean them up and release the resources used by them.
Somehow, the writing of the files slows down as time goes on, and I cannot find a solution for this problem.
Code and pseudocode on request ;)