Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsFree MagazinesWhite PapersSubmit Content
Discussion GroupsASP.NETWindows FormsLanguages.NET FrameworkVisual Studio.NET
Articles.NET FrameworkASP.NETToolsWindows Forms
.NET DirectoryOpen Source ProjectsUser GroupsWeb Resources
Related Topics
Visual Basic 6SQL ServerMS AccessOther DB ProductsMS Server ProductsMore Topics ...

.NET Forum / Languages / VB.NET / March 2008

Tip: Looking for answers? Try searching our database.

Anagram Algorithm - How can I improve it?

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
Andrew - 04 Mar 2008 19:13 GMT
Hi all

I am looking for your advice on a project of mine to create a fully-
featured dictionary class.

I have a source file with around 75000 words of UK English. These are
sorted, then loaded into a Dictionary object, each with a long as its
key.

I have methods (code at the end of this message) as follows:

1) CheckWordExists: Check if a given word exists. If so, returns its
position in the dictionary, as a long. If not, returns the compliment
of its nearest neighbour

2) ScrambleWord: Takes a string, and returns the letters randomly
scrambled

3) ReturnRandomWord: Returns a random word from the dictionary. (Code
not listed below)

4) FindAnagram: Given a string, invokes a loop, calling ScrambleWord.
Then passes this to CheckWordExists to determine if this is a genuine
word rather than a random string of characters. If not, passes the
string to ScrambleWord again, to retrieve another anagram and checks
if this is a real word. This loops either until a valid word is
returned (which is not the original word!) or until 100000 attempts
have been made, at which point it gives up!

Now, the first 3 routines work nicely - taking usually less than a
millisecond to return their values.

However, the FindAnagram method is rather slow. I created a test loop
which does the following:
1) Calls ReturnRandomWord to generate a random 8-character word
2) Passes this to FindAnagram to try to find an anagram
3) If this fails to find an anagram, generate another random word and
try again.

Using a couple of datetime variables and a timespan variable, I tested
this a few times, and gained the following information:

1) Each time FindAnagram runs, it takes an average of 4.3 seconds to
either find an anagram, or report failure.
2) Using 8-letter words results in finding an anagram for, on average,
1 word in 23 taking an average of 1:42 minutes.

Now, I'm wondering if I'm missing something - is there a better way to
approach this than the "brute force" method I'm using to find
anagrams?

If anyone has any thoughts or input, I'd be really grateful!

Thanks in advance
Andrew

Code: (dictionaryList is the name of the Dictionary object which
contains the list of words)
-----------------------------------------------------------
   Function CheckWordExists(ByVal wordToCheck As String) As Long
       'Function takes a word to find in the dictionary
       'If found, will return the value of its position in the
dictionary
       'If not, will return the compliment of its nearest neighbour
       'NOTE- All values start at 1 - ie the first word in the
dictionary is word 1, etc
       ' This is so that if a word cannot be found, we never return a
-0 where the first word
       ' is the closest match.
       Dim firstWord As String, lastWord As String
       Dim firstVal As Double, lastVal As Double

       wordToCheck = wordToCheck.ToLower
       firstVal = 0
       lastVal = dictionaryList.Count - 1
       firstWord = dictionaryList.Item(CLng(firstVal)).ToLower
       lastWord = dictionaryList.Item(CLng(lastVal)).ToLower
       If wordToCheck >= firstWord And wordToCheck <= lastWord Then
           Do Until firstWord = wordToCheck Or lastWord = wordToCheck
               'Check if first and last vals are already adjacent -
we have no match
               If (lastVal - firstVal) < 1 Then Exit Do
               If firstWord.CompareTo(wordToCheck) < 0 Then 'first
word is before wordtocheck
                   If lastWord.CompareTo(wordToCheck) > 0 Then
'Last word is after wordtocheck
                       lastVal = firstVal + ((lastVal - firstVal) /
2) 'Move last word back to check again
                   Else 'Move first and last words up to check again
                       Dim diff As Double
                       diff = lastVal - firstVal
                       firstVal = lastVal
                       lastVal += diff
                   End If
               Else 'First word is after wordtocheck
                   'Move firstval up by half difference between first
and last vals
                   firstVal = firstVal + (lastVal - firstVal) / 2
               End If
               firstWord =
dictionaryList.Item(CLng(firstVal)).ToLower
               lastWord = dictionaryList.Item(CLng(lastVal)).ToLower
           Loop
       End If
       'By this point, we are as close as possible to a match
       If firstWord = wordToCheck Then
           Return CLng(firstVal) + 1
       ElseIf lastWord = wordToCheck Then
           Return CLng(lastVal) + 1
       Else
           Dim i As Integer
           On Error Resume Next

           Do Until (i = Math.Min(firstWord.Length,
Math.Min(lastWord.Length, wordToCheck.Length)) - 1)
               'Find whether firstWord or lastWord is the closest
match
               'Do this by comparing each, letter by letter, with
wordToCheck
               'until one is further away than the other.
               'Stop if we run out of letters to compare!
               If
Math.Abs(String.CompareOrdinal(wordToCheck.Chars(i),
firstWord.Chars(i))) < _
               Math.Abs(String.CompareOrdinal(wordToCheck.Chars(i),
lastWord.Chars(i))) Then
                   Return -firstVal - 1

               ElseIf
Math.Abs(String.CompareOrdinal(wordToCheck.Chars(i),
firstWord.Chars(i))) > _

Math.Abs(String.CompareOrdinal(wordToCheck.Chars(i),
lastWord.Chars(i))) Then
                   Return -lastVal - 1
               End If
               i = i + 1

           Loop

           Return -firstVal - 1 'By this point the two words are
identical to the given word,
           'simply having more letters, which differ. Just return the
first one - no point checking further
       End If
   End Function

--------------------

   Public Function ScrambleWord(ByVal OriginalWord As String) As
String
       Dim NewWord(OriginalWord.Length - 1) As Char
       Dim rtnString As String = String.Empty
       Dim newPos As Integer

       Randomize()
       For i As Integer = 0 To OriginalWord.Length - 1
           Do
               newPos = Int(Rnd() * (OriginalWord.Length))
           Loop Until NewWord(newPos) = Nothing
           NewWord(newPos) = OriginalWord.Chars(i)
       Next
       For i As Integer = 0 To OriginalWord.Length - 1
           rtnString &= NewWord(i)
       Next
       Return rtnString
   End Function

-----------------------------------------------

   Public Function FindAnagram(ByVal OriginalWord As String) As
String
       Dim testWord As String = String.Empty
       Dim i As Integer

       Do
           i = i + 1
           If i > 100000 Then Return "-" 'Run 362880 checks, then
give up (value is 9 factorial)
           testWord = Me.ScrambleWord(OriginalWord)
       Loop Until Me.CheckWordExists(testWord) > 0 AndAlso
testWord.ToLower <> OriginalWord.ToLower

       Return testWord

   End Function
----------------------------------------------
amdrit - 04 Mar 2008 19:27 GMT
This sounds like something very similar to a scrabble clone I am working on.
It seems that B-star search algorithm is the key to my needs and perhaps
yours as well.  I haven't implemented it yet, nor have I found decent
articles on the inner workings of the algorithm.

> Hi all
>
[quoted text clipped - 183 lines]
>    End Function
> ----------------------------------------------
Just_a_fan@home.net - 15 Mar 2008 01:06 GMT
Well, I have been reticent to post this but a long time ago, in the days
of VB6 (or maybe even 5 of 4, I don't remember), I wrote a little
program to "solve" the daily "Jumble" puzzles.  It is limited to 6
letters so it runs fairly fast.  I had a dictionary that I took all of
the words of 6 letters of less and put them into an Access database (if
I remember correctly).

I then used the code below to find all permutations of the word I wanted
to figure out.  I then listed all of the matches from the dictionary on
the screen.  Frequently there was only one and that was the answer to
the Jumble puzzle word.

The routine makes sure that a single letter is not used twice in the
test word and then tries to find it.

Brute force, indeed, but it was quite fast.  I never had a speed
complaint.  The Findword routine is the one which looked up the test
word in the database.  The duplicate business was so that the same word
would not be listed on screen twice.  This could happen with a word with
a repeated letter.

Here it is.  **Try not to laugh at it.**  It is quite old but still
works.

Mike

 For p1 = 1 To 6
   For p2 = 1 To 6
     For p3 = 1 To 6
       For p4 = 1 To 6
         For p5 = 1 To 6
           For p6 = 1 To 6
             If p2 = p1 Or p3 = p2 Or p3 = p1 Or p4 = p3 Or p4 = p2 Or
p4 = p1 Or p5 = p4 Or p5 = p3 Or p5 = p2 Or p5 = p1 Or p6 = p5 Or p6 =
p4 Or p6 = p3 Or p6 = p2 Or p6 = p1 Then
             Else
               created_word = ""
               
               testchar = Mid$(txtJumble, p1, 1)
               If testchar <> " " Then created_word = created_word &
testchar
               testchar = Mid$(txtJumble, p2, 1)
               If testchar <> " " Then created_word = created_word &
testchar
               testchar = Mid$(txtJumble, p3, 1)
               If testchar <> " " Then created_word = created_word &
testchar
               testchar = Mid$(txtJumble, p4, 1)
               If testchar <> " " Then created_word = created_word &
testchar
               testchar = Mid$(txtJumble, p5, 1)
               If testchar <> " " Then created_word = created_word &
testchar
               testchar = Mid$(txtJumble, p6, 1)
               If testchar <> " " Then created_word = created_word &
testchar
               
               If FindWord(created_word) Then
                 duplicate = False
                 For i = 0 To lstFound.ListCount - 1
                   If lstFound.List(i) = created_word Then
                     duplicate = True
                     Exit For
                   End If
                 Next i
                 
                 If Not duplicate Then
                   lstFound.AddItem created_word
                   DoEvents
                 End If  ' not duplicate
               End If  ' word found
             End If
           Next
         Next
       Next
     Next
   Next
 Next

---------------------------------------

On Tue, 4 Mar 2008 13:27:32 -0600, in
microsoft.public.dotnet.languages.vb "amdrit" <amdrit@hotmail.com>
wrote:

>This sounds like something very similar to a scrabble clone I am working on.
>It seems that B-star search algorithm is the key to my needs and perhaps
[quoted text clipped - 188 lines]
>>    End Function
>> ----------------------------------------------
Mathias Wührmann - 04 Mar 2008 20:15 GMT
Hi Andrew,

Andrew schrieb:
> However, the FindAnagram method is rather slow. I created a test loop
>  which does the following: 1) Calls ReturnRandomWord to generate a
> random 8-character word 2) Passes this to FindAnagram to try to find
> an anagram 3) If this fails to find an anagram, generate another
> random word and try again.

do you really want to find an anagram or a similar word? For similar
words, maybe also for an anagram you will have more success with a
combination of algorithms like Double Metaphone and Levensthein.

Regards,

Mathias Wuehrmann
Andrew - 04 Mar 2008 20:33 GMT
> Hi Andrew,
>
[quoted text clipped - 13 lines]
>
> Mathias Wuehrmann

No - I really want an anagram! In other words, I'm after another word
(Word B) which can be made using only and all the letters of a given
Word A. Don't mind how dissimilar they are (as long as they're the
same language... :-)  )

Andrew
AMercer - 04 Mar 2008 20:36 GMT
> 4) FindAnagram: Given a string, invokes a loop, calling ScrambleWord.
> Then passes this to CheckWordExists to determine if this is a genuine
[quoted text clipped - 6 lines]
> However, the FindAnagram method is rather slow. I created a test loop
> which does the following:

For anagrams, make a second dictionary whose keys are the words of your
dictionary with letters arranged alphabetically and whose values are all the
words that map to the same key.

In other words, if your dictionary contains both 'pear' and 'pare', then the
new dictionary will contain key 'aepr' and value 'pear' and 'pare'.  The
value could be the concatenation of the words, or a collection of words, or
whatever is convenient.  The main idea is to clump together words that are
anagrams.  You can produce this anagram dictionary with a simple loop over
your dictionary.  For words that have no other anagrams, it is up to you
whether you delete them or not from the anagram dictionary.
Family Tree Mike - 05 Mar 2008 01:07 GMT
> > 4) FindAnagram: Given a string, invokes a loop, calling ScrambleWord.
> > Then passes this to CheckWordExists to determine if this is a genuine
[quoted text clipped - 18 lines]
> your dictionary.  For words that have no other anagrams, it is up to you
> whether you delete them or not from the anagram dictionary.

Now that is nice!  I hadn't thought of approaching it that way.
Kerry Moorman - 04 Mar 2008 20:44 GMT
Andrew,

Have you Googled for anagram algorithm?

Kerry Moorman

> Hi all
>
[quoted text clipped - 183 lines]
>     End Function
> ----------------------------------------------
BobRoyAce - 05 Mar 2008 01:33 GMT
One obvious weakness to your FindAnagram method is that it could
conceivably be checking the same combinations of letters many times.
So, it's worse that brute force really.

AMercer's suggestion is a good one if all you're looking to do is find
words that can be made from ALL of a series of letters (e.g. find six
letter word that can be made from six letters). And that seems to be
what you're looking for here.

That being said, if you wanted to generate a function that could find
the words, of any length, that could be made by using the letters in a
series of letters (something that would certainly be of value in
Scrabble like game), you could write an algorithm that systematically
generated words of varying lengths, then looked them up, after
scrambling letters to be in alphabetical order, using AMercer's
approach.
Cor Ligthert[MVP] - 05 Mar 2008 05:14 GMT
Andrew,

This is in fact outside my knowledge as it is about language, but as I had
to do it without a real educated one to help me in that part, then for
English and Dutch dictionary I would start removing all vowels on both side
and replacing the c, s, z by s.

In Dutch I would than replace all "ch" by g, etc. Then I would search the
words while using inside the word string the Find because that is the fasted
for that.

Cor

> Hi all
>
[quoted text clipped - 183 lines]
>    End Function
> ----------------------------------------------
HSalim[MVP] - 05 Mar 2008 14:44 GMT
This is interesting.

There is a site called a word a day which provides anagrams - which is
similar to what you want.
http://wordsmith.org/anagram/index.html  (I enjoy their daily a-word-a-day
mailings )
But since you are doing this for scrabble, wouldn't it be cool to have a
blank tile?

> Hi all
>
[quoted text clipped - 183 lines]
>    End Function
> ----------------------------------------------
HSalim[MVP] - 05 Mar 2008 15:23 GMT
And let that blank tile be the Eighth letter!

> This is interesting.
>
[quoted text clipped - 192 lines]
>>    End Function
>> ----------------------------------------------
Alex Clark - 05 Mar 2008 21:41 GMT
Firstly, it sounds very inefficient to just be looping and calling the
FindAnagram method up to a maximum of 10,000 times.  Surely you're likely to
get repeated hits on the same jumbled array of letters that way?  If it were
me, I'd design a routine that iterates through every possible combination of
arrangement of the letters in the word and I'd do it in alphabetical order
as well.

Secondly, have a SortedList of the words and check for the existence of the
word in that.  The framework should find it much easier to search through
the SortedList using binary chops so you should see better performance.
Watch your heap though, with 75k+ strings loaded into two different
collection classes you'll probably start paging and that won't be good news
for lookup speed.

-Alex

> Hi all
>
[quoted text clipped - 183 lines]
>    End Function
> ----------------------------------------------
Andrew - 06 Mar 2008 12:05 GMT
> Firstly, it sounds very inefficient to just be looping and calling the
> FindAnagram method up to a maximum of 10,000 times.  Surely you're likely to
[quoted text clipped - 201 lines]
>
> - Show quoted text -

Okay. Thanks to everyone who responded - I really appreciate all your
input and you've certainly given me some things to think about.

One or two specific thoughts:
1) I'm not actually trying to create a scrabble clone, as has been
suggested, so while the ideas of finding all words (including shorter
words) that can be made out of a given set of letters, or of including
"blanks" is not directly relevant to the project. But it might be an
interesting thing to have a play with later....

2)  In response to AMercer - that sounds like a good way to tackle it,
and I may give that a go as my next step...

3) In response to Kerry Moorman - no, I hadn't googled it (have now!)
because part of the challenge was to find a way without (initially)
relying on someone else's algorithm. I know you could argue that
posting for help here amounts to the same thing, but I wanted to have
a go and then learn from my mistakes rather than simply tap into some
pre-determined solution.

4) Thanks to all those who pointed out that simply looping over a
"random scrambler" routine will result in much wasted effort as the
same non-words are regenerated multiple times. This was something that
had occurred to me too, so was one of the first areas I was going to
look at in coming back to my code.

Incidentally, this seems to have struck a bit of a common chord with
people saying "I'm working on something similar..." or "I've done that
with..." . I'm sure this is an obvious question that I could just
google again (I think I just prefer the human interaction!) but are
there sites out there where (say) a few people get together to work
interactively on stuff like this, such as developing Scrabble clones,
or other similar things, just for fun and as a way of learning more
from each other?

Thanks again for your input all.

Andrew
HSalim[MVP] - 06 Mar 2008 12:59 GMT
Andrew,
To a great extent, the open source movement is just that - a project where
many developers collaborate to develop software.
Sourceforge http://Sourceforge.net is the meeting place for open source
projects.  A search for scrabble generated 37 matches.

HS

Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.