Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsFree MagazinesWhite PapersSubmit Content
Discussion GroupsASP.NETWindows FormsLanguages.NET FrameworkVisual Studio.NET
Articles.NET FrameworkASP.NETToolsWindows Forms
.NET DirectoryOpen Source ProjectsUser GroupsWeb Resources
Related Topics
Visual Basic 6SQL ServerMS AccessOther DB ProductsMS Server ProductsMore Topics ...

.NET Forum / .NET Framework / Performance / June 2005

Tip: Looking for answers? Try searching our database.

Huge string arrays in C#

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
Emil _ DK - 24 May 2005 23:04 GMT
Hi guys

What would be the most effcient way to use a huge in-memory string array?
I have a need for around 1.000.000+ strings of more or less the same size;
30-120 bytes each. I do not know the size in advance.

The code adds rapidly strings to the array, so access to the array should be
the key point to performance.

I have pleanty of time to create the array, so if an expensive
initialization is needed  - it's fine.

I'm used to old-school C, and have no idea how strings and C# perform in the
.NET runtime.

Rgds,
Emil
Bob Grommes - 25 May 2005 02:21 GMT
The most efficient approach is a string array, but you can't dynamically
resize arrays so you either need to know the size in advance, or you need to
use some sort of wrapper class that allocates and coordinates arrays of a
known size, adding a new array of some reasonable number of elements (100K
or whatever is appropriate for you scenario) as needed.

That's a lot of work, though, and violates the KISS principle.  So in the
current CLR (1.1) you would use an ArrayList.  An ArrayList can grow
dynamically, but it only holds the System.Object type and so you have
casting overhead to get the strings back out.  Even so, you ought to give it
a whirl and see what the real world performance will be before deciding
against it.

The ideal in CLR 2.0 would be a generic List<string>.  This gives you
effectively a dynamically sizeable strongly typed string array.  But
production bits are still a few months away.

Another possibility if the string size doesn't vary much is to determine the
max string size and just put everything into a great big unsafe buffer of
MaxSize * NumberOfStrings within a wrapper class that has an indexer that
can return strings by index using pointers.  I've never gone there, but it's
certainly possible to do.  I would only look into this as a last resort
though.  ArrayList is remarkably performant, despite the casting.

Another thing to look at alongside all this is the often-overlooked
technique of string interning.  Your strings may have a lot of repeating
data -- for example, the city, state and zip in addresses often repeat;
there are finite numbers of first names.  You can sometimes cut memory
consumption by 90% or so by using the string interning pool, so that you
have one copy in memory of each unique string value.  See the docs on
String.Intern() for details.  In my experience this will slow down loading
quite a bit but it also saves a great deal of memory for other uses and I
haven't seen a noticeable performance penalty for access once all the
strings are loaded.  The nice thing about interning is it's not an
all-or-nothing thing; if you know something about your data you can use it
selectively where it will do you the most good while minimizing the
performance penalty.  Returning to the name & address example, you might
intern city and state, probably zip (especially if the data is from one
locale), maybe first or even last name, but generally there wouldn't be much
point in interning address or phone number.

--Bob

> Hi guys
>
[quoted text clipped - 15 lines]
> Rgds,
> Emil
Philipp Schumann - 25 May 2005 02:54 GMT
Hi Bob & Emil,

in addition, string is a reference type, not a value type, so the casting
"overhead" should in fact be minimal, if at all, noticable (as compared to
the "boxing" that would occur with value types when cast to a reference
type, Object). As you know, reference types are all representationally
equivalent for the purposes of the ArrayList, they are just pointers. I
believe it is no casting at all in this case, it's simple
inheritance---classB inherits classA, if classA is the type of a method
parameter and I pass a classB, there is no casting involved, is there?? Same
here, since any type _inherits_ from Object. 'Boxing' is the special case
for non-reference (ie value) types, but does not apply to strings.

I would try ArrayList.

Best,
Phil

> The most efficient approach is a string array, but you can't dynamically
> resize arrays so you either need to know the size in advance, or you need
[quoted text clipped - 59 lines]
>> Rgds,
>> Emil
Jon Shemitz - 25 May 2005 03:13 GMT
> in addition, string is a reference type, not a value type, so the casting
> "overhead" should in fact be minimal, if at all, noticable (as compared to
[quoted text clipped - 6 lines]
> here, since any type _inherits_ from Object. 'Boxing' is the special case
> for non-reference (ie value) types, but does not apply to strings.

There is no cast to assign a string to an object variable, but you do
have to cast - as Bob mentioned - "to get the strings back out."

The casting is not insanely expensive, but it is required whenever you
assign an object from an ArrayList to a string variable. Since string
descends directly from object, the test isn't as expensive as when you
cast an object to some class with several ancestors between it and the
root, but it's not free.

Signature

www.midnightbeach.com

Philipp Schumann - 25 May 2005 15:03 GMT
True ...  =)

>> in addition, string is a reference type, not a value type, so the casting
>> "overhead" should in fact be minimal, if at all, noticable (as compared
[quoted text clipped - 17 lines]
> cast an object to some class with several ancestors between it and the
> root, but it's not free.
Emil _ DK - 25 May 2005 08:31 GMT
Hi Phillipp

I didn't know about string reference types. I'll look into it togther with
the ArrayList.

Currently my code does something like:
 storeStrInBuffer(data1 + ";" + data2 + ";");

Ie. a lot of concatination - is this a performance bottleneck (sorry for
asking - I could just make the different codes and test myself - but any
experience is appreciated).

The main reason for the string 'buffer' is that i plan to write every
100.000 rows at a time to a file on the disk to minimize too many disk IO
requests.

I was think of creating a byte array array, but I am insure if there to much
overhead in string.ToBytes().

BR
Emil

> Hi Bob & Emil,
>
[quoted text clipped - 77 lines]
> >> Rgds,
> >> Emil
Philipp Schumann - 25 May 2005 15:06 GMT
> Ie. a lot of concatination - is this a performance bottleneck (sorry for
> asking - I could just make the different codes and test myself - but any
> experience is appreciated).

It is somewhat optimized I think, in some cases it is replaced with
string.Concat, though I'm not sure whether this is done during C#->IL
compilation, or during JITing.

StringBuilder is the recommended practice for huge string manipulations,
similar to StringBuffer in Java.

> The main reason for the string 'buffer' is that i plan to write every
> 100.000 rows at a time to a file on the disk to minimize too many disk IO
[quoted text clipped - 3 lines]
> much
> overhead in string.ToBytes().

I think this is probably a typical case of trading one overhead for another,
but I'm no expert, so I suggest you try both and measure.  =)
john conwell - 26 May 2005 01:22 GMT
If you use + style concatenation in C# for up to 4 strings, the compiler will
write the IL equivalent of String.Concat("blah" + "blah" + "blah" + "blah"),
which is good.  Only one string object gets created.  If you use more than 4
strings, then it defaults back to + style concatenation and you get a bunch
of temporary string objects created in memory.
   
If your concatenation is inside a loop, then you probably should use the
stringbuilder class and initialize it with a reasonable number of chars that
you think it will hold.  This way you can cut down on many tempory string
objects being created

> > Ie. a lot of concatination - is this a performance bottleneck (sorry for
> > asking - I could just make the different codes and test myself - but any
[quoted text clipped - 17 lines]
> I think this is probably a typical case of trading one overhead for another,
> but I'm no expert, so I suggest you try both and measure.  =)
Jon Skeet [C# MVP] - 26 May 2005 07:22 GMT
> If you use + style concatenation in C# for up to 4 strings, the compiler will
> write the IL equivalent of String.Concat("blah" + "blah" + "blah" + "blah"),
> which is good.  Only one string object gets created.  If you use more than 4
> strings, then it defaults back to + style concatenation and you get a bunch
> of temporary string objects created in memory.

No, it uses String.Concat(params string[]). In other words,

a + b + c + d + e

is equivalent to

String.Concat (a, b, c, d, e)

which is equivalent to

String.Concat (new String[]{a, b, c, d, e})

There's a temporary string array object, but no temporary strings.
   
> If your concatenation is inside a loop, then you probably should use the
> stringbuilder class and initialize it with a reasonable number of chars that
> you think it will hold.  This way you can cut down on many tempory string
> objects being created

Yes.

Signature

Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Sherif El-Metainy - 05 Jun 2005 20:58 GMT
Hello

If the reason for concatination and using too many strings in memory is just
to save disk IO then using filestream, which has buffering, should be
enough. Let .NET handle the buffering for you, most probably this will be
most efficient.

You can use async streams, so that the data is written in a background IO
thread and your application doesn't have to wait for IO to complete. Also
note that keeping a lot of strings in memory will increase the probability
that virtual memory is required, and therefore you would also have extra
unnessary disk IO when Windows swap pages from memory to disk and vice
versa.

Also note that very big strings or byte arrays (>85K) are allocated on the
large object heap (LOH) and are collected only with a generation 2
collection which means they will live a lot longer after there are no
reference to them. Also the LOH is not compacted after the GC. This means
that you may have memory fragmentation.

Best regards,
Sherif

> Hi Phillipp
>
[quoted text clipped - 122 lines]
>> >> Rgds,
>> >> Emil
Jon Shemitz - 25 May 2005 03:17 GMT
> Another thing to look at alongside all this is the often-overlooked
> technique of string interning.  Your strings may have a lot of repeating
[quoted text clipped - 12 lines]
> locale), maybe first or even last name, but generally there wouldn't be much
> point in interning address or phone number.

Yes, this is a good suggestion. The only drawback to interning is that
interned strings will stay in memory until the app terminates, which
can be expensive if the array needs to go away. A Hashtable (or, in
2.0, a Dictionary<string,string>) can be used to detect repeats while
still allowing the memory to be reclaimed when done.

Signature

www.midnightbeach.com

Bob Grommes - 25 May 2005 04:38 GMT
Hey Jon,

I hadn't noticed this because when I've used interning it's been in batch
processing scenarios where the program will exit when done with the strings.

Is this something you've observed, or read about, or ... ?  Do you have a
reference on this topic?  It intrigues me.  I assumed interning would have
been done with something like a Dictionary<string,string> internally anyway
... odd that they would not take a string with no references out of the
pool.  I can understand that with string constants, which are always
interned at compile time, but not with dynamic data that's interned at
runtime.

Thanks,

--Bob

> The only drawback to interning is that
> interned strings will stay in memory until the app terminates, which
> can be expensive if the array needs to go away.
Jon Skeet [C# MVP] - 25 May 2005 07:12 GMT
> I hadn't noticed this because when I've used interning it's been in batch
> processing scenarios where the program will exit when done with the strings.
[quoted text clipped - 6 lines]
> interned at compile time, but not with dynamic data that's interned at
> runtime.

That would involve more work at GC time, I suspect - and making the
intern pool able to shrink as well as grow would probably make it more
complicated.

I *think* it's done on an AppDomain basis rather than a process basis,
but unfortunately the discussion I remember having about it was on a
private newsgroup which isn't archived :(

Signature

Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

john conwell - 26 May 2005 01:26 GMT
Yes, interning is done at the AppDomain level.  There is a internal method in
the AppDomain that takes care of this.  Here is its method structure

internal extern string GetOrInternString(string str);

> > I hadn't noticed this because when I've used interning it's been in batch
> > processing scenarios where the program will exit when done with the strings.
[quoted text clipped - 14 lines]
> but unfortunately the discussion I remember having about it was on a
> private newsgroup which isn't archived :(
Jon Shemitz - 25 May 2005 22:47 GMT
> Is this something you've observed, or read about, or ... ?  Do you have a
> reference on this topic?  It intrigues me.  I assumed interning would have
[quoted text clipped - 3 lines]
> interned at compile time, but not with dynamic data that's interned at
> runtime.

Fwiw, I also assume that interning is done with a Hashtable in 1.x and
a Dictionary<string,string> in 2.x - efficient lookup, why reinvent
the wheel, and all that.

As for permanence, I'm pretty sure I did read it somewhere, but can't
remember where, and the dox sure don't explicitly mention it. Otoh,
there is no "String.Unintern" and no mention of the possibility that a
once-interned string would ever be reclaimed. (For that matter, as Jon
Skeet suggests, how would the system detect that a given string is
referred to only by the intern pool? That would be a comparatively
expensive addition to gc.)

Signature

www.midnightbeach.com

Emil _ DK - 25 May 2005 08:37 GMT
Hi Jon

I will look at it anyhow. The server got 4 gb ram of which this program can
get ½-1 gb. I'm not concerned about this - for now :-)

BR
Emil

> Yes, this is a good suggestion. The only drawback to interning is that
> interned strings will stay in memory until the app terminates, which
> can be expensive if the array needs to go away. A Hashtable (or, in
> 2.0, a Dictionary<string,string>) can be used to detect repeats while
> still allowing the memory to be reclaimed when done.
Emil _ DK - 25 May 2005 08:23 GMT
Hi Bob

Thanks for your reply. I will look at String.Intern() idea, since about a
quarter of each string is known in advace.
I also looked at the ArrayList, but wasn't sure it was the best solution
hence this post. My concern was the casting and boxing issues.

I thougt about making a managed code class, and just to it in C++, but due
to a tight schedule, I estimated that it would require to much time to read
about this technique.

Cheers,
Emil

> The most efficient approach is a string array, but you can't dynamically
> resize arrays so you either need to know the size in advance, or you need to
[quoted text clipped - 38 lines]
>
> --Bob
Prashant Majhwar - 25 May 2005 10:41 GMT
If you are concating huge strings then instead of using
for( int i = 0 ; i < arraylist.count; i++)
   teststring  += "this is a test" + "\n";

use
StringBuilder teststring = new StringBuilder(specify your size);
for( int i = 0 ; i < arraylist.count; i++)
   teststring.Append("this is a test" + "\n");
Later is 1000 times faster...

> Hi guys
>
[quoted text clipped - 15 lines]
> Rgds,
> Emil
Rafal Gwizdala - 01 Jun 2005 15:13 GMT
If you just want to build a large string (consisting of 100k of strings) and
write it in one chunk to a file you don't have to do anything but use
StringBuilder as a write buffer. Just append the strings to it until it
grows enough and then save it to a file. You could also use MemoryStream +
TextWriter for this purpose.

Best Regards
RG

> If you are concating huge strings then instead of using
> for( int i = 0 ; i < arraylist.count; i++)
[quoted text clipped - 26 lines]
>> Rgds,
>> Emil

Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.