.NET Forum / Languages / Managed C++ / October 2004
One more question on TSTs
|
|
Thread rating:  |
Bonj - 05 Oct 2004 09:03 GMT I almost understand TSTs, to the point where I just need to know the answer to this: When making a TST (in C++) that will have as its leaf nodes words that make up SQL language and an categorising identifier for each one, and each layer of the tree will represent comparison of a further letter within the search string, what will happen when a particular node is a leaf node itself, but also has leaf nodes of its own? i.e. specifically, as far as the code goes, for this particular scenario.
for instance, the node "sp_he" has leaf nodes "sp_hel" (and possibly others) but "sp_help" is a leaf nodes, as "sp_help" is a SQL word, BUT it also has leaf nodes of its own, "sp_helpindex" for example. What would I want to have going on in the C++ code to identify when this is the case?
Carl Daniel [VC++ MVP] - 05 Oct 2004 14:52 GMT > I almost understand TSTs, to the point where I just need to know the > answer to this: [quoted text clipped - 11 lines] > What would I want to have going on in the C++ code to identify when > this is the case? You'd store something in the data of each node that identifies it as a legal end-node. For example, store your keyword classifier value in all potential end-nodes, and -1 in all others. Of course, all leaf nodes are end-nodes.
If the data in your tree is dynamically allocated (I wouldn't think it would be in your case), then you could simply store null for non-end nodes.
-cd
PS: I'm glad you stuck with it! I was going to put together a small TST demo to post here, but just haven't had the time.
Bonj - 05 Oct 2004 17:19 GMT > You'd store something in the data of each node that identifies it as a legal > end-node. For example, store your keyword classifier value in all potential > end-nodes, and -1 in all others. Of course, all leaf nodes are end-nodes. Right, got that bit. The bit I'm still a bit muddy on is how to decide whether the 'incoming' (i.e. in-parameter) string that's finding its way through the tree should *use* this end-node (if it exists), or make another comparison. For instance, if "sp_helpindex" was the 'incoming' string, then the node that it wants to get to (its target) is the "sp_helpindex" node. It will go via the "sp_help" node, but will somehow have to choose to make another comparison and go down the route towards the node for "sp_helpindex", rather than stop at "sp_help". Further, "sp_help" will also have to go via the "sp_help" node, but will have to stop there. "sp_helpgobbledegook" will have to go via "sp_help" aswell, but will have to take the decision to make another comparison with its 'g' character, and then fall out. My current thinking is to also pass in the length of the string aswell (which will be known in the calling algorithm anyway) and make it so each node knows how long a string must be to stop at it. I'm pretty sure this will work, although I haven't got the algorithm written yet, I like to have a clear path for the alrogithm worked out before I write it, as I program best when I program fast!
> If the data in your tree is dynamically allocated (I wouldn't think it would > be in your case), then you could simply store null for non-end nodes. That's the other bit I really want to know. How can I construct an algorithm that contains word nodes like this, but *isn't* dynamically generated? The only way I can think how I'm going to do it at the moment is if I have a 'node' class (or struct). Which means populating it at runtime. Which means loading the data when the program starts. Is there another way?
> -cd > > PS: I'm glad you stuck with it! Cheers! I always find it's best when you do. The only thing that puts me off is "this'll be useless", I'm never put off by "it'll never work" - because if anybody can get it to work, there's no reason why I can't. And this is something that I actually *want* as a product to use in my own day to day work, so it can't possibly be pointless. Otherwise, if I can't see myself using it, I couldn't possibly have the motivation to have even started it! I think the best programs are the ones that you actually specify yourself and write for yourself... BTW cheers for the help on the RICHEDIT by the way, I think I've finally mastered that (although I shouldn't speak too soon...)
> I was going to put together a small TST > demo to post here, but just haven't had the time. Let me know if/ when you do...
Cheers
Igor Tandetnik - 05 Oct 2004 17:34 GMT >> You'd store something in the data of each node that identifies it as >> a legal end-node. For example, store your keyword classifier value [quoted text clipped - 10 lines] > the route towards the node for "sp_helpindex", rather than stop at > "sp_help". If the next character is whitespace or end-of-text, you stop at sp_help. Otherwise, it sp_help can't possibly match, so you continue on.
 Signature With best wishes, Igor Tandetnik
"On two occasions, I have been asked [by members of Parliament], 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able to rightly apprehend the kind of confusion of ideas that could provoke such a question." -- Charles Babbage
Alexander Nickolov - 05 Oct 2004 17:39 GMT Your criterion is very simple - do you have unconsumed letters in your input word. If the answer is yes, you must continue - you havent' found your word yet. Sounds logical, no?
 Signature ===================================== Alexander Nickolov Microsoft MVP [VC], MCSD email: agnickolov@mvps.org MVP VC FAQ: http://www.mvps.org/vcfaq =====================================
>> You'd store something in the data of each node that identifies it as a >> legal [quoted text clipped - 68 lines] > > Cheers Bonj - 05 Oct 2004 17:45 GMT Yes, pretty much. That effectively compounds my logic about the length of the word.
Cheers
Bonj - 05 Oct 2004 17:49 GMT Thanks all, for the great logic and suggestions. But can anyone shed any light on how the data storage could be hard-coded such that it didn't have to dynamically load the data? Or if it does, such that the time is minimal. My only thoughts on how this could be done at the moment are an algorithm-generating algorithm, which spits out c++ code, which I'm sure isn't the best way to do it.
Cheers
Alexander Nickolov - 05 Oct 2004 18:25 GMT A hint: if you want efficient storage, you have to think as a C programmer (not C++). You can do it either with pointers to the first child and to the next sibling (NULL for child means no children, NULL for sibling means no more siblings), or you can use flat portable layout using offsets. Flat layout may allow you to optimize the size for small trees, since you won't need 32 bits for offsets (with even greater impact on the upcoming 64-bit platforms!). Then depending on the approach, you either generate a bunch of struct instances with pointers within them, or you generate a single byte chunk (can be comprised of WORDs/DWORDs instead if you align your data with the offsets) with the entire tree. In either case you'd use some code to generate your source from the original tree format.
Note: I'd advise the no-sibling pointer/offset optimization. E.g. you have all siblings in a continuous chunk of memory and use some tag (an invalid input char as an entry after the last sibling for example, or a bit flag in each node) to determine the end of the chunk. This is natural for flat memory representation.
Another advantage of the flat representation is you can load and save the data to disk files for an alternative storage method. Then you simply map a view of the file in your process and use it as embedded data.
Another optimization, only possible in flat layout, is there you can use variable size entries. If a node doesn't have children, you don't need the offset after all. If you use the optimization suggested to compress consecutive children with no siblings in a single node, the data (the substring) can be varibale length. Otherwise you'd either use a pointer for the string, or have all nodes allocate memory sufficient for the largest string.
A word on offsets: you can use either absolute offsets from the beginning of the whole structure, or relative offsets from the beginning of the current node. The latter are more efficient since they may allow you greater trees with narrower offsets. Furthermore, even though it's natural to generate the tree in left to right order, you can randomize the internal representation since you use offsets to locate each chunk anyway. This may allow you even greater trees when using narrower relative offsets.
The top of the icing, useful mostly for larger trees where memory is more important than speed, is that you can use progressive Huffman encoding on each chunk to potentially minimize the size even further.
Of course, the logic to traverse the tree will go into a C++ class, you only need C for the data itself.
As you may notice, I've done these and other embedded data structures quite a bit in my past... :)
 Signature ===================================== Alexander Nickolov Microsoft MVP [VC], MCSD email: agnickolov@mvps.org MVP VC FAQ: http://www.mvps.org/vcfaq =====================================
> Thanks all, for the great logic and suggestions. > But can anyone shed any light on how the data storage could be hard-coded [quoted text clipped - 5 lines] > > Cheers Bonj - 06 Oct 2004 16:33 GMT > A hint: if you want efficient storage, you have to think as a C > programmer (not C++). You can do it either with pointers > to the first child and to the next sibling (NULL for child means > no children, NULL for sibling means no more siblings), What in this specific instance are you referring to as a sibling? For instance, "sp_hel" has as one of its children, "sp_help". "sp_help" has "sp_helpi" (leading to "sp_helpindex") as one of its children, but what would be a sibling? Can you give me an example, 'cos I can't picture it. Or is it just an 'extra' child, i.e. after the first one?
> or you > can use flat portable layout using offsets. Can you explain in any way how I would store and retrieve this?
> Flat layout may allow > you to optimize the size for small trees Without worrying about optimization - I can understand how I would generate this tree of structs, when each one has a pointer to another one (or more than one) in order to represent children. But I seem to have a mental block figuring out how to have a flat layout when the structs contain pointers. You see, a pointer is simply a long integer, that's pointing to a different memory location. But that memory location is only valid for the life of the current process (at most). So, the memory needed to store the tree of structs may not necessarily be contigious. What is the usual way a C programmer would get round this, in order to persist the tree of structures as a 'block'?
> , since you won't need > 32 bits for offsets I don't know what you mean by an offset...... I think I would probably understand it if you told me how to store the tree as a 'flat' set of structures, in a contigious memory space?
> (with even greater impact on the upcoming > 64-bit platforms!). Then depending on the approach, you either [quoted text clipped - 3 lines] > offsets) with the entire tree. In either case you'd use some > code to generate your source from the original tree format. You mean using an algorithm to actually achieve a level of indirection with respect to compiling code, i.e. write an algorithm to produce C++ code? This is what I originally tried, but when a word was a subset of another word, like "sp_help" is of "sp_helpindex", it completely ignored "sp_help". Probably just the way I wrote it... but is this what you would recommend I do? If so, any tips on how I might do it better?
> Note: I'd advise the no-sibling pointer/offset optimization. E.g. > you have all siblings in a continuous chunk of memory and [quoted text clipped - 24 lines] > allow you even greater trees when using narrower relative > offsets. oh... right ! I think I almost understand what you're going on about... is it that you simply force all the structs to be next to each other and instead of storing actual pointers, you just store the distance from the current one, hence 'offset' ?
> The top of the icing, useful mostly for larger trees where memory > is more important than speed, is that you can use progressive [quoted text clipped - 16 lines] > > > > Cheers Alexander Nickolov - 06 Oct 2004 18:09 GMT Sibling is as same as in English - your brother, sister, and in this case a letter of an alternative word in the set. E.g. if AB and AC are words in your language, B and C are siblings when treated as children of A. Same with intermediate letters: for ABC and ADE, B and D are again siblings.
I guess you grasped the concept of offset instead of a pointer at the end of your post, so I won't go into further details (which is a relief considering there aren't many...)
What you produce is C data, not code. With pointers it would be a long list of:
const struct blah c_NodeN { "data", &c_NodeK, /* child (can be NULL instead) */ &c_NodeL /* next sibling (can be NULL instead) */ };
Note this is the least efficient representation, I'm just showing the concept.
With flat layout it would be much simpler:
const BYTE data[] = { // the linearized tree goes here };
 Signature ===================================== Alexander Nickolov Microsoft MVP [VC], MCSD email: agnickolov@mvps.org MVP VC FAQ: http://www.mvps.org/vcfaq =====================================
>> A hint: if you want efficient storage, you have to think as a C >> programmer (not C++). You can do it either with pointers [quoted text clipped - 117 lines] >> > >> > Cheers Bonj - 07 Oct 2004 08:25 GMT Thanks very much, to all, for your help, I'm confident I can have a bash at it now!
> Sibling is as same as in English - your brother, sister, and in this > case a letter of an alternative word in the set. E.g. if AB and [quoted text clipped - 145 lines] > >> > > >> > Cheers
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|