.NET Forum / .NET Framework / Performance / September 2005
StreamReader->regex->Win32 api marshalling
|
|
Thread rating:  |
Tim - 31 Aug 2005 15:35 GMT I've written a C# .Net console app that reads massive (upto 2Gb) text files of csv data using a StreamReader with UTF7 in configurable-size char[] blocks, parses the read data using the .Net regex facilities and re-formats it before sending it to a Managed C++ .Net dll that wraps the SQLServer bcp api and so squirts the data into a SQLServer db. It all works fine, but the performance is just not good enough - I need to at least halve the time its taking to process the data. I've profiled the code and it looks like the overhead is more or less evenly split between the reading/parsing on the one hand and the data loading/bcp/Win32 api marshalling on the other hand - so I'm looking for improvements across the whole functionality. The problem is that I need the Regex sophistication to parse the data because its very rough - so I need to turn the text read from the files into .Net Strings so I can Regex them. But then I need to marshall the output Strings from the Regex down to the BCP api because bcp is the only SQLServer i/f that provides anything like the performance we need. However, I gather that two big .Net performance overheads are String splitting/manipulation and .Net to Win32 ansi string marshalling :-(. Any suggestions for optimisation tweaks for these parts of the FCL or ways to circumvent the problem areas much appreciated.
David Browne - 31 Aug 2005 16:02 GMT > I've written a C# .Net console app that reads massive (upto 2Gb) text > files [quoted text clipped - 24 lines] > these parts of the FCL or ways to circumvent the problem areas much > appreciated. Two ideas. First is to use buffering so your parsing code doesn't wait on your BCP loads, and vice versa. Introducing a buffer between the two can help keep the processor busy while the BCP component is waiting on network and SQL Server work.
An excelent candidate for buffering is a file. Just run you .NET code and output a clean file. Then take that file and BCP it into SQL Server. You really aren't gaining anything by pipelining this stuff into SQL.
David
Tim - 31 Aug 2005 17:56 GMT I did initially try writing an intermediate re-formatted text file and then processing that using the bcp utility as a separate process (great minds ...), but the problem with that is the size of the files - they are huge and we can't afford to replicate them as intermediate files. The performance wasn't any better either - the overhead of writing to the intermediate text file was just as great. Regarding buffering - the profiling seemed to indicate that there wasn't much slack in the processor, so I don't think there's a lot to be made up there (its a local server so there's no network overhead). I didn't want to complicate things by mentioning this, but a further complication is that there is a separate thread to the file parsing component that uses a .Net FileSystemWatcher object to look for the csv files arriving and fires off the reading/parsing/bcping process as separate threads - so in practice there will be many of these file processing threads running in parallel, so I'd expect any slack in the processor to be taken up by that anyway. So for the minute I'm just trying to optimise the file processing code itself.
Tim
> > I've written a C# .Net console app that reads massive (upto 2Gb) text > > files [quoted text clipped - 35 lines] > > David Scott - 31 Aug 2005 16:37 GMT > db. It all works fine, but the performance is just not good enough - > I need to at least halve the time its taking to process the data. > I've profiled the code and it looks like the overhead is more or less > evenly split between the reading/parsing on the one hand and the data > loading/bcp/Win32 api marshalling on the other hand - so I'm looking Are you using the Compile flag on the regex?
I would also try to insert to the database directly from your code by batching the statements: insert xxx xxxx xxx;insert xxx xxxx xxx;insert xxx xxxx xxx;... I would start by batching into groups of a 100 and then try to tweak from there.
I've had very good performance success with these two strategies.
Scott C.
David Browne - 31 Aug 2005 16:44 GMT >> db. It all works fine, but the performance is just not good enough - >> I need to at least halve the time its taking to process the data. [quoted text clipped - 8 lines] > xxx xxxx xxx;... I would start by batching into groups of a 100 and > then try to tweak from there. BCP is significantly faster than using "batched" inserts.
David
shriop - 31 Aug 2005 16:51 GMT Lots of info here, but I'll try to split it out.
Regex's are a notoriously bad performance wise way to accomplish what you're trying to do. http://www.codeproject.com/cs/database/CsvReader.asp http://www.csvreader.com/csv_benchmarks.html
I would suggest you either implement the code yourself using raw char array handling to find the commas, etc, etc, or purchase a commercial csv parser like the one I sell, http://www.csvreader.com .
bcp is not what I would recommend for this because it does not properly handle escape sequences, and does not necessarily have a performance benefit that I've seen over dts. I would recommend calling a dts job instead to import this data. If you don't need to do any data manipulation to the original csv file, then there's no reason to even do any csv parsing, just let dts do it for you. If you do want to try the batching of insert statements like one of the previous comments suggested, make sure you start a transaction before doing the chunks of inserts, then commit at the end of a chunk. Otherwise, you'll get a performance hit as the db tries to commit after every single insert.
> > db. It all works fine, but the performance is just not good enough - > > I need to at least halve the time its taking to process the data. [quoted text clipped - 12 lines] > > Scott C. Tim - 01 Sep 2005 10:21 GMT When I said the csv data was rough, I meant it! As an example: it uses a specific configurable field delimiter for all the fields in a record except the last field in the record which uses a \r\n. BUT ... the last field (and in fact any other field) may also contain \r\n in its data! My solution to this was to use the regex definition of the format of the first field in each record to find the end of the last field. Would I be able to handle this with your csv parser? Thanks Tim
> Lots of info here, but I'll try to split it out. > [quoted text clipped - 34 lines] > > > > Scott C. shriop - 01 Sep 2005 16:15 GMT I'm pretty sure this is right in line with what my parser is made to handle. If you find it can't handle it, then I'd like to see an example of the data so I can add the functionality to handle it. You can set the delimiter to anything you want. And end of row is commonly ended with a /r/n, so it handles that. the /r/n in the data is no problem as long as there are double quotes around the field itself, so the parser has something to base the logic on and know that this is data, and not the end of a row. I've got a free demo version that you can try it out with. You can also feel free to email me directly with any questions. This kind of data is why I generally recommend people not try to use regular expressions to parse the data, the expression just gets too hairy, leading to poor performance.
Here's what I'm picturing you're data looking like, and what it can handle, although I don't know what specific delimiter you're using.
1,Bruce,Dunwiddie,"in here, I can have commas and \r\n"\r\n2,Bob,Jones,some other quote here
> When I said the csv data was rough, I meant it! As an example: it uses a > specific configurable field delimiter for all the fields in a record except [quoted text clipped - 44 lines] > > > > > > Scott C. Tim - 01 Sep 2005 17:39 GMT Thats the problem - the text isn't quoted. So I'm using meta-data definitions of the SQLServer table columns that the csv data is destined for to decide what sort of char patterns to expect in each field. But where the field is text all bets are off - you just keep going 'til you reach whatever's defined as the delimiter char; which is fine unless the last field in the row is text and potentially includes the row delimiter (\r\n) in its data. As far as I can see the only way to semi-reliably determine the end of the row, therefore, is to find the start of the next row. Originally I was using the entire regex definition of a record to find the start of the next row, but I've optimised it to just look for the first column of the next row to save time. I can't think of anything else to do - apart from drowning the person responsible for the data format. Won't solve the problem but I'll feel a lot better.
> I'm pretty sure this is right in line with what my parser is made to > handle. If you find it can't handle it, then I'd like to see an example [quoted text clipped - 63 lines] > > > > > > > > Scott C. shriop - 01 Sep 2005 18:34 GMT Can I see an example of one of these problematic rows? You're pretty much out of luck from what you say. I saw a post a while back about exporting events from the event viewer when the description contains newline characters and the description is not quoted. I tried to find something, anything, to base the logic off of. But if you can't be certain of anything to deal with this file, then it's awful hard to parse. This is one of the reasons why I try to push using my CsvWriter class to create csv files, so people who don't know any better don't try to come up with what they think a csv format is and end up leaving out handling of newlines for example.
Tim - 02 Sep 2005 09:54 GMT Well as I say the regex solution gives you some hope of being able to parse the data, but I can't see any other way to do it. The other good thing about using regex in messy cases like this is that any bits of text that don't get recognised for whatever reason just don't get extracted - they don't interfere with the parsing of any subsequent 'valid' rows of data. The data would look something like ...
1|1/9/2005|Bruce|Dunwiddie|in here, I can have commas and \r\n anything except the bar char thats the designated field delimiter for this particular table "\r\n 2|28/7/2005|Bob|Jones|some other text
... so you would look for the pattern of newline followed by an int follwed by the field delimiter as the row terminator (not forgetting the end of the file ofcourse!). I think the only solution is to turbocharge the regex class or give it some steroids or something. Or maybe put some limited optional regex functionality into your csv parser? If there was some way to submit a regex expression as the row delimiter in your csv parser that might be a good half-way solution?
> Can I see an example of one of these problematic rows? You're pretty > much out of luck from what you say. I saw a post a while back about [quoted text clipped - 6 lines] > try to come up with what they think a csv format is and end up leaving > out handling of newlines for example. shriop - 02 Sep 2005 15:44 GMT Because of the way the parser is built for performance, there's no way to include regular expressions into the parsing. It'd have to be a totally different version of the parser.
What you're describing possibly working with regular expressions has nothing to do with regular expressions. You can, and I think should, do the exact same thing just doing normal parsing. I'm fine with what you're seeing as a pattern for a row delimiter. If you stayed with using a normal csv parser, you could read in a row, and save the results off somewhere. Then, read in the next row. If the first column's value is not an integer, then you know that the first column's value should be appended to the last column's value from the last row and that you're still in the middle of a row. Rinse, and repeat.
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|