.NET Forum / .NET Framework / New Users / June 2007
RegEx: How to ignore the number of whitespaces?
Florian Haag - 15 Jun 2007 00:29 GMT
Hi,
I'm not sure whether this is the right group; I'm trying to achieve the following with .NET's RegEx class:
I want to match strings while ignoring the number of whitespaces. In a simple case, this would of course mean something like
a\s+b
which would match not only "a b", but also "a  b", "a   b", etc.
However, a case like
a\s+b?\s+b
already doesn't work for me any more, as it would only match "a  c" (two spaces in between), not "a c" (one space in between), if the "b" is omitted. I can override this by using an expression like
a\s+(b\s+)?b
, which would already require some modifications from the input, though, as users of the target application will not bother to include the 2nd whitespace into the optional part of the string when they input the expression (using a very simplified and otherwise limited syntax, which I'd like to convert to RegEx).
Things get even more complicated in cases like this:
(a|b\s+)(c|\s+d)
It seems to me that I cannot evaluate this directly but instead have to replace it with
ac|a\s+d|b\s+c|b\s+d
in order to make it match "b d" (one space in between), too, not only "b  d" (two spaces in between).
I wonder whether this can be done by putting each \s+ into a named group and then using an alternation construct that refers to the groups of any possibly adjacent whitespace, thereby determining whether another \s+ is required to match. But maybe there's another, simpler (and maybe even faster?) way to achieve this?
Thanks in advance, Florian
Chris Diver - 15 Jun 2007 09:46 GMT
> Hi,
> I'm not sure whether this is the right group; I'm trying to achieve the
> following with .NET's RegEx class:
>
> I want to match strings while ignoring the number of whitespaces.
> In a simple case, this would of course mean something like
a\s*b would completely ignore the whitespace, unless you want at least one.
> a\s+b
> [quoted text clipped - 7 lines]
> (two spaces in between), not "a c" (one space in between), if the "b"
> is omitted. I can override this by using an expression like
This expression will never match "a c". Are you trying to match "a b b" as well as "a b"?
> a\s+(b\s+)?b
> [quoted text clipped - 12 lines]
>
> ac|a\s+d|b\s+c|b\s+d
Going by the above, the rule is that your strings can be any of:
ac
a (at least one space) d
b (at least one space) c
b (at least one space) d
> in order to make it match "b d" (one space in between), too, not only
> "b  d" (two spaces in between).
[quoted text clipped - 8 lines]
> Thanks in advance,
> Florian
Sounds like your homework to me. I don't understand what format the strings are supposed to match.
Chris
Florian Haag - 15 Jun 2007 18:33 GMT
Hi! First of all, thanks for your response!
> > I want to match strings while ignoring the number of whitespaces.
> > In a simple case, this would of course mean something like
>
> a\s*b would completely ignore the whitespaces, unless you want at least
> one.
Yes, I do need at least one whitespace character. I don't want to ignore whitespace altogether; I just want to ignore how many consecutive whitespace characters there are.
> > a\s+b
> > [quoted text clipped - 10 lines]
> This expression will never match a c are you trying to match a b b
> as well as a b ?
Oops, sorry - the last "b" should have been a "c", as in
a\s+b?\s+c
However, this won't match "a c" (with one space in between).
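To make the difference between the two expressions concrete, here is a minimal sketch. It uses Python's `re` module purely for illustration (the thread targets .NET's Regex class, where these particular patterns should behave the same way); the sample strings and variable names are assumptions, not part of the original post.

```python
import re

# Pattern from the post: both \s+ are mandatory, so leaving out "b"
# still leaves two separate whitespace runs to satisfy.
naive = re.compile(r"a\s+b?\s+c")

# Folding the second \s+ into the optional part ties it to "b".
grouped = re.compile(r"a\s+(?:b\s+)?c")

samples = ["a c", "a  c", "a b c", "a   b   c"]
naive_results = {s: bool(naive.fullmatch(s)) for s in samples}
grouped_results = {s: bool(grouped.fullmatch(s)) for s in samples}

for s in samples:
    print(repr(s), naive_results[s], grouped_results[s])
```

Only the first sample differs: "a c" with a single space is rejected by the naive pattern and accepted by the grouped one.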
> > Things get even more complicated in cases like this:
> > [quoted text clipped - 12 lines]
> b (at least one space) c
> b (at least one space) d
Yes, that's correct - my question is whether there is another way than resolving (a|b\s+)(c|\s+d) to ac|a\s+d|b\s+c|b\s+d (which would obviously mean creating all possible combinations of the (...|...) parts). If each bracket holds more than two alternatives, this would mean an enormous increase in the size of the RegEx, which I'd like to avoid, if possible.
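The expansion described here can be sketched mechanically, which also shows where the growth comes from. This is only an illustration in Python; `merge_ws`, `groups` and `branches` are hypothetical names, and the input is the two-bracket example from the post.

```python
import re
from itertools import product

def merge_ws(regex: str) -> str:
    # Collapse adjacent \s+ tokens so a single whitespace run satisfies both.
    while r"\s+\s+" in regex:
        regex = regex.replace(r"\s+\s+", r"\s+")
    return regex

# Each inner list holds the alternatives of one bracket pair.
groups = [["a", r"b\s+"], ["c", r"\s+d"]]

# Pick one alternative per bracket, concatenate, then merge whitespace tokens.
branches = [merge_ws("".join(choice)) for choice in product(*groups)]
expanded = "|".join(branches)

print(expanded)  # ac|a\s+d|b\s+c|b\s+d
```

The number of branches is the product of the bracket sizes, so three brackets with four alternatives each would already produce 4 × 4 × 4 = 64 branches.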
> Sounds like your homework to me, I don't understand what the format
> the strings they are supposed to match.
It's definitely not my homework; it's actually for a vocabulary training programme, the first version of which you can find here:
http://VocDB.de.vu
The input strings are supposed to have the following format:
a and b may be replaced with any characters (or chains thereof) except \[]()|.
\ preceding any of \[]()| escapes the respective symbol; otherwise it'll have a special meaning, as described below.
[a] means "a" is optional.
[a|b] means either "a" or "b" or nothing may be written.
(a|b) means either "a" or "b" must be written.
There can be more than one | within each pair of brackets, delimiting more than two alternatives.
i.e. the whole thing is something slightly Regex-like for non-programmers.
In version 1 of the above programme, I use my own evaluator for this. However, for the sake of maintainability, I hoped I could eventually switch to simply converting those input patterns into RegEx strings. If only there were a way to ignore the number of consecutive whitespace characters without ignoring that there _are_ whitespace characters at certain places in the word.
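One possible way to do such a conversion, sketched under several assumptions: every whitespace run in the input pattern becomes a token meaning "consume whitespace here, or confirm that whitespace (or the start or end of the string) is directly adjacent". The lookbehind trick is not something proposed in this thread, the names `WS`, `to_regex` and `matches` are hypothetical, and the sketch is in Python rather than .NET, so an equivalent Regex version is untested here. It does no validation of unbalanced brackets.

```python
import re

# Hypothetical token: consume a whitespace run, or accept that one was just
# consumed, or that we are at the very start or end of the (trimmed) string.
WS = r"(?:\s+|(?<=\s)|\A|\Z)"

def to_regex(pattern: str) -> str:
    """Translate the simplified [a|b] / (a|b) syntax into a regex string."""
    out = []
    closers = []  # regex text to emit when the matching bracket closes
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Backslash escapes the next character: emit it literally.
            out.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch.isspace():
            # Any run of whitespace in the pattern becomes one WS token.
            while i < len(pattern) and pattern[i].isspace():
                i += 1
            out.append(WS)
            continue
        if ch in "([":
            out.append("(?:")
            closers.append(")" if ch == "(" else ")?")
        elif ch in ")]":
            out.append(closers.pop())  # assumes brackets are balanced
        elif ch == "|":
            out.append("|")
        else:
            out.append(re.escape(ch))
        i += 1
    return "".join(out)

def matches(pattern: str, text: str) -> bool:
    # Leading and trailing whitespace in the answer is ignored.
    return re.fullmatch(to_regex(pattern), text.strip()) is not None

print(matches("a [b] c", "a c"))        # True
print(matches("a [b] c", "ac"))         # False
print(matches("(a|b )(c| d)", "b d"))   # True
```

Because the token can be satisfied by whitespace that an earlier token already consumed, two tokens around an omitted optional part need only one space between them, and no expansion into all combinations is required.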
Kind regards, Florian
Kevin Spencer - 18 Jun 2007 11:52 GMT
If you can explain the requirements of the pattern you're trying to match, without using any regular expression terminology, I can help. A regular expression is a sequence of characters that represents a pattern, or a set of rules regarding what is to be matched in text. Since you're having trouble creating the regular expression, using regular expression symbol terminology to explain the rules only confuses the issue.
Here's an example of what I mean:
"I want to match any number (greater than 0) of sequences of 1 or more alphanumeric (only) characters with no spaces between them. Each sequence is separated from the others by a single space, which may be any white space character except for a line break. Any non-alpha-numeric character other than a non-line-break white space character terminates a matching sequence."
Note that no regular expression terminology is used in the above description. It describes the rules for a matching character sequence, including what is required, how many of what is required are required, what is NOT required, and what is prohibited.
HTH,
Kevin Spencer Microsoft MVP
Printing Components, Email Components, FTP Client Classes, Enhanced Data Controls, much more. DSI PrintManager, Miradyne Component Libraries: http://www.miradyne.net
> Hi! First of all, thanks for your response!
> [quoted text clipped - 86 lines]
> Kind regards,
> Florian
Florian Haag - 19 Jun 2007 23:37 GMT
> If you can explain the requirements of the pattern you're trying to
> match, without using any regular expression terminology, I can help.
Hi, thanks for your response!
Hope this is something like what you meant: "Users of my programme input sequences of arbitrary Unicode characters (from now on, referred to as "patterns"). These patterns are supposed to match other given sequences of Unicode characters (from now on, referred to as "strings").
Certain subsequences of a pattern may be marked as optional. These may be found in the string, but need not.
Certain subsequences of a pattern may be marked as a set of alternatives. Exactly one of them must be found in the string, neither more nor less.
A pattern will never require more than one space character without any other characters in between to be found in a string.
A pattern will accept any number of space characters (greater than zero) without any other characters in between in the string at a position where a space character is expected.
A pattern will ignore any space characters at the beginning and at the end of a string.
A pattern will never require any space characters at the beginning and at the end of a string."
I'm looking for the easiest way to quickly convert the pattern into a standard regular expression.
Thanks in advance, Florian
Kevin Spencer - 20 Jun 2007 12:14 GMT
That is helpful, but I still have a few questions.
> "Users of my programme input sequences of arbitrary Unicode characters
> (from now on, referred to as "patterns"). These patterns are supposed
> to match other given sequences of Unicode characters (from now on,
> referred to as "strings").
<snip>
> I'm looking for the easiest way to quickly convert the pattern into a
> standard regular expression.
This sounds like the "patterns" are performing the work of regular expressions, matching character sequences in strings. What I don't understand is why you want to create a new regular expression syntax which your users must learn, then convert it to the original, rather than using the original? Or perhaps I'm misunderstanding your intention altogether?
Second, what are the limitations of the "arbitrary Unicode characters?" The Unicode code space has over a million code points, and even if we confine ourselves to a single character set, we are still talking about alphanumeric characters, punctuation, diacritical characters, and non-printing characters. I will assume that some of these are not within the set of "arbitrary" characters you're referencing. But I don't know which ones are allowed, and which ones are not.
> Certain subsequences of a pattern may be marked as optional. These may
> be found in the string, but need not.
> Certain subsequences of a pattern may be marked as a set of
> alternatives. Exactly one of them must be found in the string, neither
> more nor less.
Okay, we've discussed "arbitrary," but now you will need to define the term "marked." As the "patterns" are pure text, the "marks" must also be text. But what constitutes a "text" character and a "mark" character, and how do you escape text characters to create marks?
HTH,
Kevin Spencer Microsoft MVP
>> If you can explain the requirements of the pattern you're trying to
>> match, without using any regular expression terminology, I can help.
[quoted text clipped - 28 lines]
> Thanks in advance,
> Florian
Florian Haag - 20 Jun 2007 14:36 GMT
Hi!
> This sounds like the "patterns" are performing the work of regular
> expressions, matching character sequences in strings.
That's right.
> What I don't
> understand is why you want to create a new regular expression syntax
> which your users must learn, then convert it to the original, rather
> than using the original?
Some 95% of my users won't have any programming experience whatsoever, or any computer science background. I doubt that regular expressions with all their features would be suitable for those inexperienced users. I'd expect it to be very hard to explain, for example, why they must write \. and \? instead of simply writing a full stop or a question mark. What's more, my "space character problem" would remain, since my users would not understand why the pattern "personal computer" only matches "personal computer" but not "personal  computer" (two spaces in between), as the words are the same. At the same time, they'd consider writing patterns like "personal *computer" (or even "personal\s*computer") far too unintuitive to use my programme.
That's why I offer another pattern syntax with a very limited set of a few special characters which denote very few pattern features (optional pattern parts, alternative pattern parts) and everything else one could possibly write into a pattern will be evaluated just as it's been input.
> Second, what are the limitations of the "arbitrary Unicode
> characters?"
Actually, that means all Unicode characters except spaces. By "arbitrary", I wanted to express that any characters may appear in any order without any restrictions in a pattern and should match just like that. Sorry for not describing it very accurately :-$
> Okay, we've discussed "arbitrary," but now you will need to define
> the term "marked." As the "patterns" are pure text, the "marks" must
> also be text. But what constitutes a "text" character and a "mark"
> character, and how do you escape text characters to create marks?
Right - there are a few Unicode characters which have to be escaped (which were chosen so that they don't appear in regular vocabulary, anyway). These are: \ ( ) [ ] |
If any of these characters is meant to actually be found in the string, it has to be preceded by a backslash. Otherwise, pairs of ( and ) as well as [ and ] "mark" a part of a pattern.
Within such a marked part, there may be any number (greater than zero) of alternative patterns, each separated by a | character. If ( and ) are used to denote the part of the pattern, exactly one of the alternative patterns must appear in the string. If [ and ] are used to denote the part of the pattern, at most one of the alternative patterns may appear in the string.
Such marked parts may be nested to an unlimited depth, that is, each of the above alternative patterns may contain marked parts of its own.
That should be all about the syntax of my patterns, as they are already used in version 1 of my programme.
Regards, Florian
Kevin Spencer - 21 Jun 2007 13:19 GMT
Hi Florian,
I must admit your situation is confusing, and I do find the idea of creating a "simpler" regular expression syntax is likely to bite you eventually, one way or another, but requirements are requirements, and my job is to help you solve your problem. So.....
I'm still a little in the dark as to the full scope of what you're doing, but it may not be necessary to understand the whole thing in order to solve this particular problem. If I understand you fully, you're looking for a way to require at least one space between separate character sequences in a string, but that some of these character sequences may be "marked" as optional, in which case no white spaces would be necessary.
If so, I believe this can be solved using a conditional expression:
this(?(?=.)\s+)
This is a regular expression "if" conditional statement, which is a regular expression "if/else" conditional statement without an "else." The syntax of a regular expression "if/else" conditional statement is:
(?(?=regex)then|else)
This means that when the regular expression is matched, the "then" expression is used. When not matched, the "else" expression is used. So the first expression above means "look for 'this'". If anything follows it, it must be followed by at least 1 white space character (otherwise, not).
For optional matches, you would use the optional operator as you've illustrated before:
(?:this(?(?=.)\s+))?
In the following, "this," "that," or "other" will match in any combination, as long as it ends in "other":
(?:this(?(?=\s.)\s+))?(?:that(?(?=\s.)\s+))?(?:other)
matches:
other
this other
that other
this other
It does NOT match:
this
this that
HTH,
Kevin Spencer Microsoft MVP
> Hi!
> [quoted text clipped - 62 lines]
> Regards,
> Florian
florianhaag@freenet.de - 24 Jun 2007 15:23 GMT
Hi, Kevin, thanks for all your answers.
> I must admit your situation is confusing, and I do find the idea of creating
> a "simpler" regular expression syntax is likely to bite you eventually, one
> way or another, but requirements are requirements, and my job is to help you
> solve your problem. So.....
Well, it's not really my idea, rather somewhat common practice. Most bilingual dictionaries feature a syntax where "green photo(graph)" or "green photo[graph]" denotes the words "green photo" as well as "green photograph". I haven't ever seen a dictionary which uses actual regular expression syntax to print its words (i.e. "green \sphoto(graph)?").
Anyway, thanks for your explanations regarding conditional statements in regular expressions. I think I now have enough information to consider the alternatives and decide how to implement my pattern matching :-)
Best regards, Florian
florianhaag@freenet.de - 25 Jun 2007 12:42 GMT
florianh...@freenet.de wrote:
> regular expression syntax to print its words (i.e. "green
> \sphoto(graph)?").
Sorry, that should of course have been: "^\s*green\s+photo(graph)?\s*$"
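As a quick check of the corrected expression, here is a small sketch in Python (used only for illustration; the candidate strings are made up and `entry` is a hypothetical name):

```python
import re

# Corrected expression from the post above.
entry = re.compile(r"^\s*green\s+photo(graph)?\s*$")

candidates = ["green photo", "  green   photograph ", "greenphoto", "green photos"]
accepted = [s for s in candidates if entry.match(s)]

print(accepted)  # ['green photo', '  green   photograph ']
```

The first two candidates are accepted regardless of how many spaces they contain; the last two are rejected because the space is missing or extra characters follow.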
Kevin Spencer - 15 Jun 2007 11:35 GMT
^a\s*b?\s*c?\s*d?$
0 or more spaces indicated. Each letter between spaces is optional. The beginning and end of line characters prevent mis-ordered matches, such as "ba", as the entire string must match.
HTH,
Kevin Spencer Microsoft MVP
> Hi,
> I'm not sure whether this is the right group; I'm trying to achieve the
[quoted text clipped - 44 lines]
> Thanks in advance,
> Florian
Florian Haag - 15 Jun 2007 18:33 GMT
> ^a\s*b?\s*c?\s*d?$
>
> 0 or more spaces indicated.
Sorry, but I want a pattern which requires at least one whitespace character but which, at the same time, does not _require_ more than one consecutive whitespace character.
Regards, Florian