Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsFree MagazinesWhite PapersSubmit Content
Discussion GroupsASP.NETWindows FormsLanguages.NET FrameworkVisual Studio.NET
Articles.NET FrameworkASP.NETToolsWindows Forms
.NET DirectoryOpen Source ProjectsUser GroupsWeb Resources
Related Topics
Visual Basic 6SQL ServerMS AccessOther DB ProductsMS Server ProductsMore Topics ...

.NET Forum / .NET Framework / Performance / September 2005

Tip: Looking for answers? Try searching our database.

StreamReader->regex->Win32 api marshalling

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
Tim - 31 Aug 2005 15:35 GMT
I've written a C# .Net console app that reads massive (upto 2Gb) text files
of csv data using a StreamReader with UTF7 in configurable-size char[]
blocks, parses the read data using the .Net regex facilities and re-formats
it before sending it to a Managed C++ .Net dll that wraps the SQLServer bcp
api and so squirts the data into a SQLServer db. It all works fine, but the
performance is just not good enough - I need to at least halve the time its
taking to process the data. I've profiled the code and it looks like the
overhead is more or less evenly split between the reading/parsing on the one
hand and the data loading/bcp/Win32 api marshalling on the other hand - so
I'm looking for improvements across the whole functionality. The problem is
that I need the Regex sophistication to parse the data because its very rough
- so I need to turn the text read from the files into .Net Strings so I can
Regex them. But then I need to marshall the output Strings from the Regex
down to the BCP api because bcp is the only SQLServer i/f that provides
anything like the performance we need. However, I gather that two big .Net
performance overheads are String splitting/manipulation and .Net to Win32
ansi string marshalling :-(. Any suggestions for optimisation tweaks for
these parts of the FCL or ways to circumvent the problem areas much
appreciated.
David Browne - 31 Aug 2005 16:02 GMT
> I've written a C# .Net console app that reads massive (upto 2Gb) text
> files
[quoted text clipped - 24 lines]
> these parts of the FCL or ways to circumvent the problem areas much
> appreciated.

Two ideas.  First is to use buffering so your parsing code doesn't wait on
your BCP loads, and vice versa.  Introducing a buffer between the two can
help keep the processor busy while the BCP component is waiting on network
and SQL Server work.

An excelent candidate for buffering is a file.  Just run you .NET code and
output a clean file.  Then take that file and BCP it into SQL Server.  You
really aren't gaining anything by pipelining this stuff into SQL.

David
Tim - 31 Aug 2005 17:56 GMT
I did initially try writing an intermediate re-formatted text file and then
processing that using the bcp utility as a separate process (great minds
...), but the problem with that is the size of the files - they are huge and
we can't afford to replicate them as intermediate files. The performance
wasn't any better either - the overhead of writing to the intermediate text
file was just as great.
Regarding buffering - the profiling seemed to indicate that there wasn't
much slack in the processor, so I don't think there's a lot to be made up
there (its a local server so there's no network overhead).
I didn't want to complicate things by mentioning this, but a further
complication is that there is a separate thread to the file parsing component
that uses a .Net FileSystemWatcher object to look for the csv files arriving
and fires off the reading/parsing/bcping process as separate threads - so in
practice there will be many of these file processing threads running in
parallel, so I'd expect any slack in the processor to be taken up by that
anyway. So for the minute I'm just trying to optimise the file processing
code itself.

Tim

> > I've written a C# .Net console app that reads massive (upto 2Gb) text
> > files
[quoted text clipped - 35 lines]
>
> David
Scott - 31 Aug 2005 16:37 GMT
> db. It all works fine, but the performance is just not good enough -
> I need to at least halve the time its taking to process the data.
> I've profiled the code and it looks like the overhead is more or less
> evenly split between the reading/parsing on the one hand and the data
> loading/bcp/Win32 api marshalling on the other hand - so I'm looking

Are you using the Compile flag on the regex?

I would also try to insert to the database directly from your code by
batching the statements: insert xxx xxxx xxx;insert xxx xxxx xxx;insert
xxx xxxx xxx;... I would start by batching into groups of a 100 and
then try to tweak from there.

I've had very good performance success with these two strategies.

Scott C.
David Browne - 31 Aug 2005 16:44 GMT
>> db. It all works fine, but the performance is just not good enough -
>> I need to at least halve the time its taking to process the data.
[quoted text clipped - 8 lines]
> xxx xxxx xxx;... I would start by batching into groups of a 100 and
> then try to tweak from there.

BCP is significantly faster than using "batched" inserts.

David
shriop - 31 Aug 2005 16:51 GMT
Lots of info here, but I'll try to split it out.

Regex's are a notoriously bad performance wise way to accomplish what
you're trying to do.
http://www.codeproject.com/cs/database/CsvReader.asp
http://www.csvreader.com/csv_benchmarks.html

I would suggest you either implement the code yourself using raw char
array handling to find the commas, etc, etc, or purchase a commercial
csv parser like the one I sell, http://www.csvreader.com .

bcp is not what I would recommend for this because it does not properly
handle escape sequences, and does not necessarily have a performance
benefit that I've seen over dts. I would recommend calling a dts job
instead to import this data. If you don't need to do any data
manipulation to the original csv file, then there's no reason to even
do any csv parsing, just let dts do it for you. If you do want to try
the batching of insert statements like one of the previous comments
suggested, make sure you start a transaction before doing the chunks of
inserts, then commit at the end of a chunk. Otherwise, you'll get a
performance hit as the db tries to commit after every single insert.

> > db. It all works fine, but the performance is just not good enough -
> > I need to at least halve the time its taking to process the data.
[quoted text clipped - 12 lines]
>
> Scott C.
Tim - 01 Sep 2005 10:21 GMT
When I said the csv data was rough, I meant it! As an example: it uses a
specific configurable field delimiter for all the fields in a record except
the last field in the record which uses a \r\n. BUT ... the last field (and
in fact any other field) may also contain \r\n in its data! My solution to
this was to use the regex definition of the format of the first field in each
record to find the end of the last field. Would I be able to handle this with
your csv parser?
Thanks
Tim

> Lots of info here, but I'll try to split it out.
>
[quoted text clipped - 34 lines]
> >
> > Scott C.
shriop - 01 Sep 2005 16:15 GMT
I'm pretty sure this is right in line with what my parser is made to
handle. If you find it can't handle it, then I'd like to see an example
of the data so I can add the functionality to handle it. You can set
the delimiter to anything you want. And end of row is commonly ended
with a /r/n, so it handles that. the /r/n in the data is no problem as
long as there are double quotes around the field itself, so the parser
has something to base the logic on and know that this is data, and not
the end of a row. I've got a free demo version that you can try it out
with. You can also feel free to email me directly with any questions.
This kind of data is why I generally recommend people not try to use
regular expressions to parse the data, the expression just gets too
hairy, leading to poor performance.

Here's what I'm picturing you're data looking like, and what it can
handle, although I don't know what specific delimiter you're using.

1,Bruce,Dunwiddie,"in here, I can have commas and
\r\n"\r\n2,Bob,Jones,some other quote here

> When I said the csv data was rough, I meant it! As an example: it uses a
> specific configurable field delimiter for all the fields in a record except
[quoted text clipped - 44 lines]
> > >
> > > Scott C.
Tim - 01 Sep 2005 17:39 GMT
Thats the problem - the text isn't quoted. So I'm using meta-data definitions
of the SQLServer table columns that the csv data is destined for to decide
what sort of char patterns to expect in each field. But where the field is
text all bets are off - you just keep going 'til you reach whatever's defined
as the delimiter char; which is fine unless the last field in the row is text
and potentially includes the row delimiter (\r\n) in its data. As far as I
can see the only way to semi-reliably determine the end of the row,
therefore, is to find the start of the next row. Originally I was using the
entire regex definition of a record to find the start of the next row, but
I've optimised it to just look for the first column of the next row to save
time. I can't think of anything else to do - apart from drowning the person
responsible for the data format. Won't solve the problem but I'll feel a lot
better.

> I'm pretty sure this is right in line with what my parser is made to
> handle. If you find it can't handle it, then I'd like to see an example
[quoted text clipped - 63 lines]
> > > >
> > > > Scott C.
shriop - 01 Sep 2005 18:34 GMT
Can I see an example of one of these problematic rows? You're pretty
much out of luck from what you say. I saw a post a while back about
exporting events from the event viewer when the description contains
newline characters and the description is not quoted. I tried to find
something, anything, to base the logic off of. But if you can't be
certain of anything to deal with this file, then it's awful hard to
parse. This is one of the reasons why I try to push using my CsvWriter
class to create csv files, so people who don't know any better don't
try to come up with what they think a csv format is and end up leaving
out handling of newlines for example.
Tim - 02 Sep 2005 09:54 GMT
Well as I say the regex solution gives you some hope of being able to parse
the data, but I can't see any other way to do it. The other good thing about
using regex in messy cases like this is that any bits of text that don't get
recognised for whatever reason just don't get extracted - they don't
interfere with the parsing of any subsequent 'valid' rows of data.
The data would look something like ...

1|1/9/2005|Bruce|Dunwiddie|in here, I can have commas and
\r\n anything except the bar char thats the designated field delimiter for
this particular table "\r\n
2|28/7/2005|Bob|Jones|some other text

... so you would look for the pattern of newline followed by an int follwed
by the field delimiter as the row terminator (not forgetting the end of the
file ofcourse!).
I think the only solution is to turbocharge the regex class or give it some
steroids or something. Or maybe put some limited optional regex functionality
into your csv parser? If there was some way to submit a regex expression as
the row delimiter in your csv parser that might be a good half-way solution?

> Can I see an example of one of these problematic rows? You're pretty
> much out of luck from what you say. I saw a post a while back about
[quoted text clipped - 6 lines]
> try to come up with what they think a csv format is and end up leaving
> out handling of newlines for example.
shriop - 02 Sep 2005 15:44 GMT
Because of the way the parser is built for performance, there's no way
to include regular expressions into the parsing. It'd have to be a
totally different version of the parser.

What you're describing possibly working with regular expressions has
nothing to do with regular expressions. You can, and I think should, do
the exact same thing just doing normal parsing. I'm fine with what
you're seeing as a pattern for a row delimiter. If you stayed with
using a normal csv parser, you could read in a row, and save the
results off somewhere. Then, read in the next row. If the first
column's value is not an integer, then you know that the first column's
value should be appended to the last column's value from the last row
and that you're still in the middle of a row. Rinse, and repeat.

Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.