> A colleague put me onto string.Intern, but this won't help as by the time
> I'm calling that method, I've already allocated the string.
Also, there is no way to "un-Intern" - an interned string lasts until
the process terminates.
> Note that these strings are very short lived. After deserialisation, they
> will be processed and (for the most part) garbage collected before they get
> promoted to generation 1. This happens several thousand times a second under
> normal conditions, giving the garbage collector (what I assume is) a lot of
> work. I'm seeing the classic sawtooth pattern in a heap timeline but with
> very high frequency.
Aside from the frequency of garbage-collection, this is not a bad
pattern: Short-lived strings do not need to be relocated, and gc costs
will be low.
> I'd like to know whether this is a situation in which I can improve
> performance. I can envisage some sort of structure (perhaps a Trie) that
> hones in on the stored string as we progress through the byte sequence.
> However this structure cannot be pre-populated (the strings will be
> determined at runtime).
One consequence of any scheme that caches strings - whether a trie, a
hashtable (Dictionary<string,bool>), or string interning - is that
each string (even the non-repeats) lasts a long time, and needs to be
promoted and relocated from Gen 0 to Gen 1 to Gen 2.
> The big question is: do the benefits of reducing string allocation justify
> the overhead in finding a stored string? This, no doubt, depends upon the
> implementation.
Seems to me you have three broad ranges, here:
* A trie scheme (or the equivalent) would immortalize all strings, but
would reduce allocations, and hence the number of garbage collections.
Fewer, more expensive garbage collections; higher memory consumption.
* Any interning scheme immortalizes strings, without reducing
allocations. No real impact on gc; just a bit slower over-all, and
higher memory consumption.
* The current scheme has many, comparatively cheap garbage
collections, and doesn't have high fixed memory use,
I'd guess that whether 1 or 3 is faster is depends on the cost of each
gc - how many roots you have, and how many of them change from gc to
gc. If most gc-s are triggered by the creation of temp strings that
are already garbage, I'd expect each gc to be pretty cheap.
That is, my best GUESS is that a trie may save you a few percentage
points of runtime costs (maybe 10% to as much as 25%) at the cost of
several hours work and higher memory use.

Signature
www.midnightbeach.com
drewnoakes - 06 Sep 2005 09:24 GMT
Hi Jon,
Thanks for your reply. I think you understand my conundrum.
> Also, there is no way to "un-Intern" - an interned string lasts until
> the process terminates.
I was under the impression that a reference count is maintained for interned
strings, and that this applies across AppDomains. This allows strings to be
uninterned. However this is trivia to me as I doubt I'll use interning.
What might be useful is to understand some details of how interning performs
it's string lookup based upon value. Surely that's a similar problem to the
one I'm facing.
> Aside from the frequency of garbage-collection, this is not a bad
> pattern: Short-lived strings do not need to be relocated, and gc costs
> will be low.
Yeah it looks like a textbook example, though the frequency seems quite
high. I haven't done much work optimising for GC at this level before. Can
anyone recommend some approaches or tools for the job?
> One consequence of any scheme that caches strings - whether a trie, a
> hashtable (Dictionary<string,bool>), or string interning - is that
> each string (even the non-repeats) lasts a long time, and needs to be
> promoted and relocated from Gen 0 to Gen 1 to Gen 2.
I don't mind promoting the most common strings, as this is a long-lived
process and memory isn't such an issue. However as this list needs to be
built at runtime there is likely to be a list of candidate strings used to
ensure the lookup structure contains only those which are most common. These
will be around long enough to make it out of generation 0 I expect, and a
potential performance problem.
I envisage a candidate list holding strings awaiting inclusion in the lookup
structure. Whenever a string does not reside in the lookup, it's added to
the candidate list -- a short list of the most recent strings, sorted by the
number of times they're seen. Items get knocked off the list (and gc'd) over
time if they don't appear often enough. Those which work up the list are
added to the lookup structure.
> Seems to me you have three broad ranges, here:
>
[quoted text clipped - 8 lines]
> * The current scheme has many, comparatively cheap garbage
> collections, and doesn't have high fixed memory use,
You're right. I essentially want to determine the viability of #1, given
that #2 is unsuitable and #3 is where I am presently.
> I'd guess that whether 1 or 3 is faster is depends on the cost of each
> gc - how many roots you have, and how many of them change from gc to
> gc. If most gc-s are triggered by the creation of temp strings that
> are already garbage, I'd expect each gc to be pretty cheap.
Why do the number of roots impact the time taken for GC? By roots I take it
you mean objects referenced directly from any level of any thread's stack.
> That is, my best GUESS is that a trie may save you a few percentage
> points of runtime costs (maybe 10% to as much as 25%) at the cost of
> several hours work and higher memory use.
If I can get an improvement of at least 15% then it may be worth the labour.
This process will run as a service on a server machine where processing
throughput is more important than memory usage.
Has no-one else tried their hand at this problem before?
Thanks again Jon.
Drew.
Jon Shemitz - 06 Sep 2005 17:26 GMT
> > the process terminates.
>
> I was under the impression that a reference count is maintained for interned
> strings, and that this applies across AppDomains. This allows strings to be
> uninterned.
No, I don't think this is right. (From memory) there's no "unintern"
calls. Fwiw, interning does apply across AppDomains - interned strings
are serialized as a table index, not as strings.
> What might be useful is to understand some details of how interning performs
> it's string lookup based upon value. Surely that's a similar problem to the
> one I'm facing.
I'd be way surprised if the intern table wasn't a Hashtable.
> I don't mind promoting the most common strings, as this is a long-lived
> process and memory isn't such an issue. However as this list needs to be
[quoted text clipped - 9 lines]
> time if they don't appear often enough. Those which work up the list are
> added to the lookup structure.
I think perhaps I'd add every string, and keep a usage count. When
you've seen "enough" strings (10,000?) you could prune the structure -
discarding any strings that haven't seen at least (say) five repeats.
After that, you might stop adding new strings.
> Why do the number of roots impact the time taken for GC? By roots I take it
> you mean objects referenced directly from any level of any thread's stack.
More roots equals more objects to scan for live data. Roots also
include static variables.

Signature
www.midnightbeach.com
> Right now, I'm using BinaryReader.ReadString() which gives the correct
> result, however it creates a new instance of System.String for each byte
> sequence. What I'd really like is to detect the repeated byte sequence, and
> return a reference to an existing deserialised version.
Read the bytes instead.
Hash the read bytes. return "deserialized" objects from an ICollection
based on a hash of the read bytes (note, byte[].GetHashCode() wont work).
For good hash-functions, check out Knuth's "The art of programming" or
search the net.
When you read a byte[] that you have no object ready for, call
deserialize and insert it into the ICollection beffore returning it.

Signature
Helge Jensen
mailto:helge.jensen@slog.dk
sip:helge.jensen@slog.dk
-=> Sebastian cover-music: http://ungdomshus.nu <=-
Helge Jensen - 06 Sep 2005 12:51 GMT
I can see in other threads on other groups that you are concerned about
the performance of the read byte[]'s.
Please try the solution first, and verify -- using a profiler -- that
your performance-problem is actually where you think it is.
If you need still more performance, you can checkout a data-structure
called a "prefix-tree" which should give you some ideas to allow you to
only store every read byte-sequence once, reading bytes one at a time.
However it is not certain that prefix-trees will benefit your
performance, it all depends on the GC efficiency vs. your implementation
of the data-structure you arrive at.

Signature
Helge Jensen
mailto:helge.jensen@slog.dk
sip:helge.jensen@slog.dk
-=> Sebastian cover-music: http://ungdomshus.nu <=-
drewnoakes - 06 Sep 2005 16:57 GMT
Hi Helge,
Thanks for your posts.
I like this idea of calculating a hash value from the byte sequence and
using that as a key in some sort of dictionary. I can see a way in which
that would work where I wouldn't need to allocate any objects unless the
string doesn't yet exist. I'll think about that for a while...
I'd previously been stuck on a different idea. I guess it's the same
concept as the prefix tree you mention -- a structure where common subsets
form nodes in a tree which can be traversed top-down when deserialising to
reach the string value at a leaf.
Anyway, I've a few things to try now and it seems to be quite challenging.
Thanks again.