Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsFree MagazinesWhite PapersSubmit Content
Discussion GroupsASP.NETWindows FormsLanguages.NET FrameworkVisual Studio.NET
Articles.NET FrameworkASP.NETToolsWindows Forms
.NET DirectoryOpen Source ProjectsUser GroupsWeb Resources
Related Topics
Visual Basic 6SQL ServerMS AccessOther DB ProductsMS Server ProductsMore Topics ...

.NET Forum / Languages / Managed C++ / October 2004

Tip: Looking for answers? Try searching our database.

One more question on TSTs

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
Bonj - 05 Oct 2004 09:03 GMT
I almost understand TSTs, to the point where I just need to know the answer
to this:
When making a TST (in C++) that will have as its leaf nodes words that make
up SQL language and an categorising identifier for each one, and each layer
of the tree will represent comparison of a further letter within the search
string, what will happen when a particular node is a leaf node itself, but
also has leaf nodes of its own? i.e. specifically, as far as the code goes,
for this particular scenario.

for instance, the node "sp_he" has leaf nodes "sp_hel" (and possibly others)
but "sp_help" is a leaf nodes, as "sp_help" is a SQL word, BUT it also has
leaf nodes of its own, "sp_helpindex" for example.
What would I want to have going on in the C++ code to identify when this is
the case?
Carl Daniel [VC++ MVP] - 05 Oct 2004 14:52 GMT
> I almost understand TSTs, to the point where I just need to know the
> answer to this:
[quoted text clipped - 11 lines]
> What would I want to have going on in the C++ code to identify when
> this is the case?

You'd store something in the data of each node that identifies it as a legal
end-node.  For example, store your keyword classifier value in all potential
end-nodes, and -1 in all others.  Of course, all leaf nodes are end-nodes.

If the data in your tree is dynamically allocated (I wouldn't think it would
be in your case), then you could simply store null for non-end nodes.

-cd

PS: I'm glad you stuck with it!  I was going to put together a small TST
demo to post here, but just haven't had the time.
Bonj - 05 Oct 2004 17:19 GMT
> You'd store something in the data of each node that identifies it as a legal
> end-node.  For example, store your keyword classifier value in all potential
> end-nodes, and -1 in all others.  Of course, all leaf nodes are end-nodes.

Right, got that bit. The bit I'm still a bit muddy on is how to decide
whether the 'incoming' (i.e. in-parameter) string that's finding its way
through the tree should *use* this end-node (if it exists), or make another
comparison. For instance, if "sp_helpindex" was the 'incoming' string, then
the node that it wants to get to (its target) is the "sp_helpindex" node. It
will go via the "sp_help" node, but will somehow have to choose to make
another comparison and go down the route towards the node for "sp_helpindex",
rather than stop at "sp_help".
Further, "sp_help" will also have to go via the "sp_help" node, but will
have to stop there.
"sp_helpgobbledegook" will have to go via "sp_help" aswell, but will have to
take the decision to make another comparison with its 'g' character, and then
fall out.
My current thinking is to also pass in the length of the string aswell
(which will be known in the calling algorithm anyway) and make it so each
node knows how long a string must be to stop at it. I'm pretty sure this will
work, although I haven't got the algorithm written yet, I like to have a
clear path for the alrogithm worked out before I write it, as I program best
when I program fast!

> If the data in your tree is dynamically allocated (I wouldn't think it would
> be in your case), then you could simply store null for non-end nodes.

That's the other bit I really want to know. How can I construct an algorithm
that contains word nodes like this, but *isn't* dynamically generated? The
only way I can think how I'm going to do it at the moment is if I have a
'node' class (or struct). Which means populating it at runtime. Which means
loading the data when the program starts. Is there another way?

> -cd
>
> PS: I'm glad you stuck with it!  

Cheers! I always find it's best when you do. The only thing that puts me off
is "this'll be useless", I'm never put off by "it'll never work" - because if
anybody can get it to work, there's no reason why I can't. And this is
something that I actually *want* as a product to use in my own day to day
work, so it can't possibly be pointless. Otherwise, if I can't see myself
using it, I couldn't possibly have the motivation to have even started it! I
think the best programs are the ones that you actually specify yourself and
write for yourself...
BTW cheers for the help on the RICHEDIT by the way, I think I've finally
mastered that (although I shouldn't speak too soon...)

> I was going to put together a small TST
> demo to post here, but just haven't had the time.

Let me know if/ when you do...

Cheers
Igor Tandetnik - 05 Oct 2004 17:34 GMT
>> You'd store something in the data of each node that identifies it as
>> a legal end-node.  For example, store your keyword classifier value
[quoted text clipped - 10 lines]
> the route towards the node for "sp_helpindex", rather than stop at
> "sp_help".

If the next character is whitespace or end-of-text, you stop at sp_help.
Otherwise, it sp_help can't possibly match, so you continue on.
Signature

With best wishes,
   Igor Tandetnik

"On two occasions, I have been asked [by members of Parliament], 'Pray,
Mr. Babbage, if you put into the machine wrong figures, will the right
answers come out?' I am not able to rightly apprehend the kind of
confusion of ideas that could provoke such a question." -- Charles
Babbage

Alexander Nickolov - 05 Oct 2004 17:39 GMT
Your criterion is very simple - do you have unconsumed letters
in your input word. If the answer is yes, you must continue -
you havent' found your word yet. Sounds logical, no?

Signature

=====================================
Alexander Nickolov
Microsoft MVP [VC], MCSD
email: agnickolov@mvps.org
MVP VC FAQ: http://www.mvps.org/vcfaq
=====================================

>> You'd store something in the data of each node that identifies it as a
>> legal
[quoted text clipped - 68 lines]
>
> Cheers
Bonj - 05 Oct 2004 17:45 GMT
Yes, pretty much. That effectively compounds my logic about the length of the
word.

Cheers
Bonj - 05 Oct 2004 17:49 GMT
Thanks all, for the great logic and suggestions.
But can anyone shed any light on how the data storage could be hard-coded
such that it didn't have to dynamically load the data? Or if it does, such
that the time is minimal.
My only thoughts on how this could be done at the moment are an
algorithm-generating algorithm, which spits out c++ code, which I'm sure
isn't the best way to do it.

Cheers
Alexander Nickolov - 05 Oct 2004 18:25 GMT
A hint: if you want efficient storage, you have to think as a C
programmer (not C++). You can do it either with pointers
to the first child and to the next sibling (NULL for child means
no children, NULL for sibling means no more siblings), or you
can use flat portable layout using offsets. Flat layout may allow
you to optimize the size for small trees, since you won't need
32 bits for offsets (with even greater impact on the upcoming
64-bit platforms!). Then depending on the approach, you either
generate a bunch of struct instances with pointers within them,
or you generate a single byte chunk (can be comprised of
WORDs/DWORDs instead if you align your data with the
offsets) with the entire tree. In either case you'd use some
code to generate your source from the original tree format.

Note: I'd advise the no-sibling pointer/offset optimization. E.g.
you have all siblings in a continuous chunk of memory and
use some tag (an invalid input char as an entry after the last
sibling for example, or a bit flag in each node) to determine the
end of the chunk. This is natural for flat memory representation.

Another advantage of the flat representation is you can load
and save the data to disk files for an alternative storage method.
Then you simply map a view of the file in your process and
use it as embedded data.

Another optimization, only possible in flat layout, is there you
can use variable size entries. If a node doesn't have children,
you don't need the offset after all. If you use the optimization
suggested to compress consecutive children with no siblings
in a single node, the data (the substring) can be varibale length.
Otherwise you'd either use a pointer for the string, or have all
nodes allocate memory sufficient for the largest string.

A word on offsets: you can use either absolute offsets from
the beginning of the whole structure, or relative offsets from
the beginning of the current node. The latter are more efficient
since they may allow you greater trees with narrower offsets.
Furthermore, even though it's natural to generate the tree in
left to right order, you can randomize the internal representation
since you use offsets to locate each chunk anyway. This may
allow you even greater trees when using narrower relative
offsets.

The top of the icing, useful mostly for larger trees where memory
is more important than speed, is that you can use progressive
Huffman encoding on each chunk to potentially minimize the
size even further.

Of course, the logic to traverse the tree will go into a C++
class, you only need C for the data itself.

As you may notice, I've done these and other embedded data
structures quite a bit in my past... :)

Signature

=====================================
Alexander Nickolov
Microsoft MVP [VC], MCSD
email: agnickolov@mvps.org
MVP VC FAQ: http://www.mvps.org/vcfaq
=====================================

> Thanks all, for the great logic and suggestions.
> But can anyone shed any light on how the data storage could be hard-coded
[quoted text clipped - 5 lines]
>
> Cheers
Bonj - 06 Oct 2004 16:33 GMT
> A hint: if you want efficient storage, you have to think as a C
> programmer (not C++). You can do it either with pointers
> to the first child and to the next sibling (NULL for child means
> no children, NULL for sibling means no more siblings),

What in this specific instance are you referring to as a sibling?
For instance, "sp_hel" has as one of its children, "sp_help". "sp_help" has
"sp_helpi" (leading to "sp_helpindex") as one of its children, but what would
be a sibling? Can you give me an example, 'cos I can't picture it. Or is it
just an 'extra' child, i.e. after the first one?

>  or you
> can use flat portable layout using offsets.

Can you explain in any way how I would store and retrieve this?

> Flat layout may allow
> you to optimize the size for small trees

Without worrying about optimization - I can understand how I would generate
this tree of structs, when each one has a pointer to another one (or more
than one) in order to represent children. But I seem to have a mental block
figuring out how to have a flat layout when the structs contain pointers. You
see, a pointer is simply a long integer, that's pointing to a different
memory location. But that memory location is only valid for the life of the
current process (at most). So, the memory needed to store the tree of structs
may not necessarily be contigious.
What is the usual way a C programmer would get round this, in order to
persist the tree of structures as a 'block'?

> , since you won't need
> 32 bits for offsets

I don't know what you mean by an offset...... I think I would probably
understand it if you told me how to store the tree as a 'flat' set of
structures, in a contigious memory space?

> (with even greater impact on the upcoming
> 64-bit platforms!). Then depending on the approach, you either
[quoted text clipped - 3 lines]
> offsets) with the entire tree. In either case you'd use some
> code to generate your source from the original tree format.

You mean using an algorithm to actually achieve a level of indirection with
respect to compiling code, i.e. write an algorithm to produce C++ code? This
is what I originally tried, but when a word was a subset of another word,
like "sp_help" is of "sp_helpindex", it completely ignored "sp_help".
Probably just the way I wrote it... but is this what you would recommend I
do? If so, any tips on how I might do it better?

> Note: I'd advise the no-sibling pointer/offset optimization. E.g.
> you have all siblings in a continuous chunk of memory and
[quoted text clipped - 24 lines]
> allow you even greater trees when using narrower relative
> offsets.

oh... right ! I think I almost understand what you're going on about... is
it that you simply force all the structs to be next to each other and instead
of storing actual pointers, you just store the distance from the current one,
hence 'offset'  ?

> The top of the icing, useful mostly for larger trees where memory
> is more important than speed, is that you can use progressive
[quoted text clipped - 16 lines]
> >
> > Cheers
Alexander Nickolov - 06 Oct 2004 18:09 GMT
Sibling is as same as in English - your brother, sister, and in this
case a letter of an alternative word in the set. E.g. if AB and
AC are words in your language, B and C are siblings when
treated as children of A. Same with intermediate letters: for
ABC and ADE, B and D are again siblings.

I guess you grasped the concept of offset instead of a pointer
at the end of your post, so I won't go into further details (which
is a relief considering there aren't many...)

What you produce is C data, not code. With pointers it would be
a long list of:

const struct blah c_NodeN {
   "data",
   &c_NodeK, /* child (can be NULL instead) */
   &c_NodeL /* next sibling (can be NULL instead) */
};

Note this is the least efficient representation, I'm just showing the
concept.

With flat layout it would be much simpler:

const BYTE data[] = {
   // the linearized tree goes here
};

Signature

=====================================
Alexander Nickolov
Microsoft MVP [VC], MCSD
email: agnickolov@mvps.org
MVP VC FAQ: http://www.mvps.org/vcfaq
=====================================

>> A hint: if you want efficient storage, you have to think as a C
>> programmer (not C++). You can do it either with pointers
[quoted text clipped - 117 lines]
>> >
>> > Cheers
Bonj - 07 Oct 2004 08:25 GMT
Thanks very much, to all, for your help, I'm confident I can have a bash at
it now!

> Sibling is as same as in English - your brother, sister, and in this
> case a letter of an alternative word in the set. E.g. if AB and
[quoted text clipped - 145 lines]
> >> >
> >> > Cheers

Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.