.NET Forum / .NET Framework / Performance / June 2005
Huge string arrays in C#
|
|
Thread rating:  |
Emil _ DK - 24 May 2005 23:04 GMT Hi guys
What would be the most effcient way to use a huge in-memory string array? I have a need for around 1.000.000+ strings of more or less the same size; 30-120 bytes each. I do not know the size in advance.
The code adds rapidly strings to the array, so access to the array should be the key point to performance.
I have pleanty of time to create the array, so if an expensive initialization is needed - it's fine.
I'm used to old-school C, and have no idea how strings and C# perform in the .NET runtime.
Rgds, Emil
Bob Grommes - 25 May 2005 02:21 GMT The most efficient approach is a string array, but you can't dynamically resize arrays so you either need to know the size in advance, or you need to use some sort of wrapper class that allocates and coordinates arrays of a known size, adding a new array of some reasonable number of elements (100K or whatever is appropriate for you scenario) as needed.
That's a lot of work, though, and violates the KISS principle. So in the current CLR (1.1) you would use an ArrayList. An ArrayList can grow dynamically, but it only holds the System.Object type and so you have casting overhead to get the strings back out. Even so, you ought to give it a whirl and see what the real world performance will be before deciding against it.
The ideal in CLR 2.0 would be a generic List<string>. This gives you effectively a dynamically sizeable strongly typed string array. But production bits are still a few months away.
Another possibility if the string size doesn't vary much is to determine the max string size and just put everything into a great big unsafe buffer of MaxSize * NumberOfStrings within a wrapper class that has an indexer that can return strings by index using pointers. I've never gone there, but it's certainly possible to do. I would only look into this as a last resort though. ArrayList is remarkably performant, despite the casting.
Another thing to look at alongside all this is the often-overlooked technique of string interning. Your strings may have a lot of repeating data -- for example, the city, state and zip in addresses often repeat; there are finite numbers of first names. You can sometimes cut memory consumption by 90% or so by using the string interning pool, so that you have one copy in memory of each unique string value. See the docs on String.Intern() for details. In my experience this will slow down loading quite a bit but it also saves a great deal of memory for other uses and I haven't seen a noticeable performance penalty for access once all the strings are loaded. The nice thing about interning is it's not an all-or-nothing thing; if you know something about your data you can use it selectively where it will do you the most good while minimizing the performance penalty. Returning to the name & address example, you might intern city and state, probably zip (especially if the data is from one locale), maybe first or even last name, but generally there wouldn't be much point in interning address or phone number.
--Bob
> Hi guys > [quoted text clipped - 15 lines] > Rgds, > Emil Philipp Schumann - 25 May 2005 02:54 GMT Hi Bob & Emil,
in addition, string is a reference type, not a value type, so the casting "overhead" should in fact be minimal, if at all, noticable (as compared to the "boxing" that would occur with value types when cast to a reference type, Object). As you know, reference types are all representationally equivalent for the purposes of the ArrayList, they are just pointers. I believe it is no casting at all in this case, it's simple inheritance---classB inherits classA, if classA is the type of a method parameter and I pass a classB, there is no casting involved, is there?? Same here, since any type _inherits_ from Object. 'Boxing' is the special case for non-reference (ie value) types, but does not apply to strings.
I would try ArrayList.
Best, Phil
> The most efficient approach is a string array, but you can't dynamically > resize arrays so you either need to know the size in advance, or you need [quoted text clipped - 59 lines] >> Rgds, >> Emil Jon Shemitz - 25 May 2005 03:13 GMT > in addition, string is a reference type, not a value type, so the casting > "overhead" should in fact be minimal, if at all, noticable (as compared to [quoted text clipped - 6 lines] > here, since any type _inherits_ from Object. 'Boxing' is the special case > for non-reference (ie value) types, but does not apply to strings. There is no cast to assign a string to an object variable, but you do have to cast - as Bob mentioned - "to get the strings back out."
The casting is not insanely expensive, but it is required whenever you assign an object from an ArrayList to a string variable. Since string descends directly from object, the test isn't as expensive as when you cast an object to some class with several ancestors between it and the root, but it's not free.
 Signature www.midnightbeach.com
Philipp Schumann - 25 May 2005 15:03 GMT True ... =)
>> in addition, string is a reference type, not a value type, so the casting >> "overhead" should in fact be minimal, if at all, noticable (as compared [quoted text clipped - 17 lines] > cast an object to some class with several ancestors between it and the > root, but it's not free. Emil _ DK - 25 May 2005 08:31 GMT Hi Phillipp
I didn't know about string reference types. I'll look into it togther with the ArrayList.
Currently my code does something like: storeStrInBuffer(data1 + ";" + data2 + ";");
Ie. a lot of concatination - is this a performance bottleneck (sorry for asking - I could just make the different codes and test myself - but any experience is appreciated).
The main reason for the string 'buffer' is that i plan to write every 100.000 rows at a time to a file on the disk to minimize too many disk IO requests.
I was think of creating a byte array array, but I am insure if there to much overhead in string.ToBytes().
BR Emil
> Hi Bob & Emil, > [quoted text clipped - 77 lines] > >> Rgds, > >> Emil Philipp Schumann - 25 May 2005 15:06 GMT > Ie. a lot of concatination - is this a performance bottleneck (sorry for > asking - I could just make the different codes and test myself - but any > experience is appreciated). It is somewhat optimized I think, in some cases it is replaced with string.Concat, though I'm not sure whether this is done during C#->IL compilation, or during JITing.
StringBuilder is the recommended practice for huge string manipulations, similar to StringBuffer in Java.
> The main reason for the string 'buffer' is that i plan to write every > 100.000 rows at a time to a file on the disk to minimize too many disk IO [quoted text clipped - 3 lines] > much > overhead in string.ToBytes(). I think this is probably a typical case of trading one overhead for another, but I'm no expert, so I suggest you try both and measure. =)
john conwell - 26 May 2005 01:22 GMT If you use + style concatenation in C# for up to 4 strings, the compiler will write the IL equivalent of String.Concat("blah" + "blah" + "blah" + "blah"), which is good. Only one string object gets created. If you use more than 4 strings, then it defaults back to + style concatenation and you get a bunch of temporary string objects created in memory. If your concatenation is inside a loop, then you probably should use the stringbuilder class and initialize it with a reasonable number of chars that you think it will hold. This way you can cut down on many tempory string objects being created
> > Ie. a lot of concatination - is this a performance bottleneck (sorry for > > asking - I could just make the different codes and test myself - but any [quoted text clipped - 17 lines] > I think this is probably a typical case of trading one overhead for another, > but I'm no expert, so I suggest you try both and measure. =) Jon Skeet [C# MVP] - 26 May 2005 07:22 GMT > If you use + style concatenation in C# for up to 4 strings, the compiler will > write the IL equivalent of String.Concat("blah" + "blah" + "blah" + "blah"), > which is good. Only one string object gets created. If you use more than 4 > strings, then it defaults back to + style concatenation and you get a bunch > of temporary string objects created in memory. No, it uses String.Concat(params string[]). In other words,
a + b + c + d + e
is equivalent to
String.Concat (a, b, c, d, e)
which is equivalent to
String.Concat (new String[]{a, b, c, d, e})
There's a temporary string array object, but no temporary strings.
> If your concatenation is inside a loop, then you probably should use the > stringbuilder class and initialize it with a reasonable number of chars that > you think it will hold. This way you can cut down on many tempory string > objects being created Yes.
 Signature Jon Skeet - <skeet@pobox.com> http://www.pobox.com/~skeet If replying to the group, please do not mail me too
Sherif El-Metainy - 05 Jun 2005 20:58 GMT Hello
If the reason for concatination and using too many strings in memory is just to save disk IO then using filestream, which has buffering, should be enough. Let .NET handle the buffering for you, most probably this will be most efficient.
You can use async streams, so that the data is written in a background IO thread and your application doesn't have to wait for IO to complete. Also note that keeping a lot of strings in memory will increase the probability that virtual memory is required, and therefore you would also have extra unnessary disk IO when Windows swap pages from memory to disk and vice versa.
Also note that very big strings or byte arrays (>85K) are allocated on the large object heap (LOH) and are collected only with a generation 2 collection which means they will live a lot longer after there are no reference to them. Also the LOH is not compacted after the GC. This means that you may have memory fragmentation.
Best regards, Sherif
> Hi Phillipp > [quoted text clipped - 122 lines] >> >> Rgds, >> >> Emil Jon Shemitz - 25 May 2005 03:17 GMT > Another thing to look at alongside all this is the often-overlooked > technique of string interning. Your strings may have a lot of repeating [quoted text clipped - 12 lines] > locale), maybe first or even last name, but generally there wouldn't be much > point in interning address or phone number. Yes, this is a good suggestion. The only drawback to interning is that interned strings will stay in memory until the app terminates, which can be expensive if the array needs to go away. A Hashtable (or, in 2.0, a Dictionary<string,string>) can be used to detect repeats while still allowing the memory to be reclaimed when done.
 Signature www.midnightbeach.com
Bob Grommes - 25 May 2005 04:38 GMT Hey Jon,
I hadn't noticed this because when I've used interning it's been in batch processing scenarios where the program will exit when done with the strings.
Is this something you've observed, or read about, or ... ? Do you have a reference on this topic? It intrigues me. I assumed interning would have been done with something like a Dictionary<string,string> internally anyway ... odd that they would not take a string with no references out of the pool. I can understand that with string constants, which are always interned at compile time, but not with dynamic data that's interned at runtime.
Thanks,
--Bob
> The only drawback to interning is that > interned strings will stay in memory until the app terminates, which > can be expensive if the array needs to go away. Jon Skeet [C# MVP] - 25 May 2005 07:12 GMT > I hadn't noticed this because when I've used interning it's been in batch > processing scenarios where the program will exit when done with the strings. [quoted text clipped - 6 lines] > interned at compile time, but not with dynamic data that's interned at > runtime. That would involve more work at GC time, I suspect - and making the intern pool able to shrink as well as grow would probably make it more complicated.
I *think* it's done on an AppDomain basis rather than a process basis, but unfortunately the discussion I remember having about it was on a private newsgroup which isn't archived :(
 Signature Jon Skeet - <skeet@pobox.com> http://www.pobox.com/~skeet If replying to the group, please do not mail me too
john conwell - 26 May 2005 01:26 GMT Yes, interning is done at the AppDomain level. There is a internal method in the AppDomain that takes care of this. Here is its method structure
internal extern string GetOrInternString(string str);
> > I hadn't noticed this because when I've used interning it's been in batch > > processing scenarios where the program will exit when done with the strings. [quoted text clipped - 14 lines] > but unfortunately the discussion I remember having about it was on a > private newsgroup which isn't archived :( Jon Shemitz - 25 May 2005 22:47 GMT > Is this something you've observed, or read about, or ... ? Do you have a > reference on this topic? It intrigues me. I assumed interning would have [quoted text clipped - 3 lines] > interned at compile time, but not with dynamic data that's interned at > runtime. Fwiw, I also assume that interning is done with a Hashtable in 1.x and a Dictionary<string,string> in 2.x - efficient lookup, why reinvent the wheel, and all that.
As for permanence, I'm pretty sure I did read it somewhere, but can't remember where, and the dox sure don't explicitly mention it. Otoh, there is no "String.Unintern" and no mention of the possibility that a once-interned string would ever be reclaimed. (For that matter, as Jon Skeet suggests, how would the system detect that a given string is referred to only by the intern pool? That would be a comparatively expensive addition to gc.)
 Signature www.midnightbeach.com
Emil _ DK - 25 May 2005 08:37 GMT Hi Jon
I will look at it anyhow. The server got 4 gb ram of which this program can get ½-1 gb. I'm not concerned about this - for now :-)
BR Emil
> Yes, this is a good suggestion. The only drawback to interning is that > interned strings will stay in memory until the app terminates, which > can be expensive if the array needs to go away. A Hashtable (or, in > 2.0, a Dictionary<string,string>) can be used to detect repeats while > still allowing the memory to be reclaimed when done. Emil _ DK - 25 May 2005 08:23 GMT Hi Bob
Thanks for your reply. I will look at String.Intern() idea, since about a quarter of each string is known in advace. I also looked at the ArrayList, but wasn't sure it was the best solution hence this post. My concern was the casting and boxing issues.
I thougt about making a managed code class, and just to it in C++, but due to a tight schedule, I estimated that it would require to much time to read about this technique.
Cheers, Emil
> The most efficient approach is a string array, but you can't dynamically > resize arrays so you either need to know the size in advance, or you need to [quoted text clipped - 38 lines] > > --Bob Prashant Majhwar - 25 May 2005 10:41 GMT If you are concating huge strings then instead of using for( int i = 0 ; i < arraylist.count; i++) teststring += "this is a test" + "\n";
use StringBuilder teststring = new StringBuilder(specify your size); for( int i = 0 ; i < arraylist.count; i++) teststring.Append("this is a test" + "\n"); Later is 1000 times faster...
> Hi guys > [quoted text clipped - 15 lines] > Rgds, > Emil Rafal Gwizdala - 01 Jun 2005 15:13 GMT If you just want to build a large string (consisting of 100k of strings) and write it in one chunk to a file you don't have to do anything but use StringBuilder as a write buffer. Just append the strings to it until it grows enough and then save it to a file. You could also use MemoryStream + TextWriter for this purpose.
Best Regards RG
> If you are concating huge strings then instead of using > for( int i = 0 ; i < arraylist.count; i++) [quoted text clipped - 26 lines] >> Rgds, >> Emil
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|