Only allowing alphanumeric characters and '_' and '-'

DotNetNewbie - 26 Feb 2008 21:59 GMT
Hi,

I want to parse a string, ONLY allowing alphanumeric characters and
also the underscore '_' and dash '-' characters.

Anything else in the string should be removed.

I think my regex is looking like:

^([\w\d_-])*$

Now if I have this code:

string username = "mrcsharpis_so_cool!!!";

How can I strip all the characters that I don't want?
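
A minimal sketch of what the pattern above does on its own, assuming it is passed to Regex.IsMatch: because of the ^ and $ anchors it only answers "is the whole string already clean?" and removes nothing. The class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ValidateVersusStrip
{
    // The pattern from the post, as a verbatim string so the
    // backslashes reach the regex engine untouched.
    public const string WholeStringPattern = @"^([\w\d_-])*$";

    // True only when every character is already allowed.
    public static bool IsClean(string input)
    {
        return Regex.IsMatch(input, WholeStringPattern);
    }

    public static void Main()
    {
        Console.WriteLine(IsClean("mrcsharpis_so_cool!!!")); // False
        Console.WriteLine(IsClean("mrcsharpis_so_cool"));    // True
    }
}
```

So this form is suited to validation; stripping needs either a loop or a replace with a negated character class.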
KH - 26 Feb 2008 22:42 GMT
Regex is a bit overkill for that; you could...

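// Requires: using System.Text;
string str = "AB&*^#Cabc(#&123--__";

// Pre-size the builder; the result can never be longer than the input.
StringBuilder sb = new StringBuilder(str.Length);

foreach (char ch in str)
{
    // Keep letters, digits, '-' and '_'; drop everything else.
    // Note: Char.IsLetterOrDigit also accepts non-ASCII letters and digits.
    if (Char.IsLetterOrDigit(ch) || ch == '-' || ch == '_')
    {
        sb.Append(ch);
    }
}

str = sb.ToString();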
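
If "alphanumeric" is meant as ASCII only, the same loop can use an explicit range check instead of Char.IsLetterOrDigit, which also accepts letters such as 'ø'. This is a sketch under that assumption; AsciiSanitizer and IsAllowed are hypothetical names.

```csharp
using System;
using System.Text;

public static class AsciiSanitizer
{
    // Hypothetical helper: true for a-z, A-Z, 0-9, '-' and '_' only.
    private static bool IsAllowed(char ch)
    {
        return (ch >= 'a' && ch <= 'z')
            || (ch >= 'A' && ch <= 'Z')
            || (ch >= '0' && ch <= '9')
            || ch == '-' || ch == '_';
    }

    public static string Clean(string input)
    {
        StringBuilder sb = new StringBuilder(input.Length);
        foreach (char ch in input)
        {
            if (IsAllowed(ch))
            {
                sb.Append(ch);
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Clean("AB&*^#Cabc(#&123--__")); // ABCabc123--__
        Console.WriteLine(Clean("Vajh\u00F8j_1"));        // Vajhj_1
    }
}
```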

> Hi,
>
[quoted text clipped - 12 lines]
>
> How can I strip all the characters that I don't want?
Arne Vajhøj - 26 Feb 2008 23:30 GMT
> Regex is a bit overkill for that; you could...
>
[quoted text clipped - 12 lines]
>
> str = sb.ToString();

I think that code is an overkill compared to a simple Regex.Replace !

Arne
KH - 27 Feb 2008 01:09 GMT
I usually avoid regexes because of performance. In this case I haven't tested,
but I would imagine the difference is approximately "who cares" ... nonetheless
I just think of regexes as overkill in many situations where people try to
use them.

A great way to use them though is to put the pattern in a config file so it
can be easily changed when requirements change or for different customers w/o
recompiling the app.
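
A rough sketch of the config-file idea. To keep it self-contained, a dictionary stands in for the real configuration source, and the key name "StripPattern" is invented for this example; in a real app the value would come from wherever the app keeps its settings.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ConfigurablePattern
{
    // 'settings' is a stand-in for a real configuration source;
    // the "StripPattern" key is made up for this sketch.
    public static string Clean(IDictionary<string, string> settings, string input)
    {
        string pattern = settings["StripPattern"];
        return Regex.Replace(input, pattern, "");
    }

    public static void Main()
    {
        Dictionary<string, string> settings = new Dictionary<string, string>();
        settings["StripPattern"] = @"[^\w-]"; // would normally be read from a file

        Console.WriteLine(Clean(settings, "mrcsharpis_so_cool!!!")); // mrcsharpis_so_cool
    }
}
```

Changing the allowed characters then means editing the stored pattern, not the code.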

> > Regex is a bit overkill for that; you could...
> >
[quoted text clipped - 16 lines]
>
> Arne
Peter Duniho - 27 Feb 2008 01:39 GMT
>> I think that code is an overkill compared to a simple Regex.Replace !
>
[quoted text clipped - 5 lines]
> to
> use them.

It's funny.  I agree with both statements, sort of.  (Do you smell an  
essay coming on?  You should...  :) )

Fundamentally, I think that Regex is a good thing.  It's a concise,  
reliable way to represent various string interpretations and  
manipulations.  As far as performance goes, I don't think there's a  
reliable way to say that Regex is always better- or worse-performing than  
an equivalent explicit algorithm.

However, I do think that it's likely that Regex performs better for at  
least a broad variety of possible applications, if not the majority.  As a  
framework class, it's got the potential to be well-optimized and there's  
good justification for it to be.  On the other hand, explicit algorithms  
may or may not be well-optimized, depending on who wrote the code and how  
often it's likely to be used.

In addition, every time you write an explicit algorithm, you risk writing  
it wrong.  With Regex, yes there's the possibility of writing an incorrect  
expression, but it's more likely in that case that it just won't work.  
It's much harder to get those subtle "happens once in a while with only
this very specific input" bugs.  Not impossible, but IMHO more difficult.

So those are all things in favor of Regex.  I think that in general,  
anything that allows you to specify an operation in a concise, error-free  
way and then perform that operation with reasonable, or even optimal  
speed, that's a good thing.

But with Regex, the conciseness is IMHO a bit overboard.  I recognize that  
there are folks out there who have used regular expressions so much that  
it's just like writing regular programming code to them.  They know it  
inside and out.

But for the rest of us, using Regex is an exercise in frustration as we  
skip back and forth in the MSDN documentation trying to find just the  
right syntax for representing some goal.  There's an incredible amount of  
capability there, and with that comes a fairly extensive grammar that  
needs to be learned to use it effectively.  But the syntax of that grammar  
is pretty arcane IMHO, and has been very hard to learn, at least for me.

I wish we had something like Regex, but with a more natural-language-like  
way to program it.  Maybe something like a RegexBuilder class or something  
that you can use to construct an appropriate regular expression.  Or maybe  
just a syntax that looks more like C# than like APL.  Or maybe something  
that takes actual C# code expressions and converts it into a suitable  
regular expression.  Or some alternative I've yet to consider.
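
To make the builder idea concrete, here is a very small sketch. CharSetBuilder is hypothetical and not part of the .NET Framework; it only covers the "strip everything outside a set" case from this thread and simply produces an ordinary pattern string for Regex.Replace.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the .NET Framework.
public sealed class CharSetBuilder
{
    private readonly StringBuilder members = new StringBuilder();

    public CharSetBuilder WordCharacters()
    {
        members.Append(@"\w");
        return this;
    }

    public CharSetBuilder Literal(char ch)
    {
        // A \uXXXX escape is safe for any character, including ']' and '-'.
        members.Append(@"\u").Append(((int)ch).ToString("X4"));
        return this;
    }

    // Pattern that matches any single character outside the set.
    // Assumes at least one member has been added.
    public string AnythingElse()
    {
        return "[^" + members + "]";
    }
}

public static class BuilderDemo
{
    public static string Clean(string input)
    {
        string pattern = new CharSetBuilder()
            .WordCharacters()
            .Literal('-')
            .AnythingElse();
        return Regex.Replace(input, pattern, "");
    }

    public static void Main()
    {
        Console.WriteLine(Clean("mrcsharpis_so_cool!!!")); // mrcsharpis_so_cool
    }
}
```

The calls read more like C# than like the raw pattern, at the cost of being longer and covering far less of the regex grammar.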

I don't know what the actual solution is.  All I know is that Regex itself  
can be very trying to use if you're inexperienced with it, to a _much_  
greater extent than, say, VB or C# might be.  So in the end, for simple  
operations I find myself thinking "well, some explicit C# code will be  
clearer, and it should be easy to make it bug-free", and so I wind up not  
using Regex there.  And then for more complex operations, where the  
conciseness and precision of Regex would be a benefit, I find myself  
thinking "I just don't get how to do this in Regex and the docs aren't  
helping me figure it out", and so I wind up not using Regex.

Which means that either way, I don't use Regex.  I've posted questions  
here asking how to write Regex expressions to do what I want, and to the  
credit of the newsgroup experts who do know Regex, they've always come  
through.  For me, and for others who ask similar questions.  Jesse Houwing  
in particular deserves major kudos for his Regex "kung fu" and his  
willingness to share it with others.  But in the end, if I can't be  
self-reliant on a technology, I tend not to use it.

Maybe if I had greater need to do string pattern matching, I'd take the
time and really learn regular expressions and then it'd be useful.  But I  
don't, and for the occasional moments when it'd be useful to me, it's just  
not worth the time and effort to figure out that specific case.

I'd love to see someone fix that problem.  :)

Pete
KWienhold - 27 Feb 2008 07:11 GMT
On 27 Feb., 02:39, "Peter Duniho" <NpOeStPe...@nnowslpianmk.com>
wrote:

> >> I think that code is an overkill compared to a simple Regex.Replace !
>
[quoted text clipped - 78 lines]
>
> Pete

While I do use Regex from time to time (input field validation,
parsing Sql-Connection-strings etc.), I totally agree with Peter.
Whenever I do use regular expressions, it would have been quite trivial
to achieve the same thing in code. But when the pattern matching becomes
complex enough to really make you want the power the Regex engine
offers, I often find I just can't get the expression to work right in
all circumstances.
A library that would offer a more natural way of constructing regular
expressions would be great, but given the complexity of the syntax
(let alone the fact that there are several different implementations),
I don't quite see how that could be done...

Kevin Wienhold
Stefan Nobis - 27 Feb 2008 09:32 GMT
> Fundamentally, I think that Regex is a good thing.

Fundamentally a RegEx is a type 3 grammar, equivalent to a finite
automaton. :)

So a RegEx is more like an upper bound to a class of pattern matching
problems. Sometimes a RegEx is not enough; then you need to go up in
the hierarchy to type 2 grammars and write parsers. But in many cases
you don't need all of the expressiveness of a RegEx, so you can use
much simpler constructs.

BTW: In the class of parsing problems where regular expressions
suffice, using a RegEx parser is the most costly (sane) way to do the
job. Simple comparisons like Char.IsLetterOrDigit (traversing the input
string only once, without the overhead of parser generation) are
always (much) faster and need (much) less memory.
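
For anyone who wants to measure this on their own machine, a rough timing sketch follows. The iteration count and names are arbitrary choices of mine, and the numbers will vary by runtime, so no expected timings are given. Note that the two methods only agree on ASCII input: \w and Char.IsLetterOrDigit do not cover exactly the same Unicode characters.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class StripTiming
{
    // Built once and reused, so pattern parsing is not timed in the loop.
    private static readonly Regex Unwanted =
        new Regex(@"[^\w-]", RegexOptions.Compiled);

    public static string WithLoop(string input)
    {
        StringBuilder sb = new StringBuilder(input.Length);
        foreach (char ch in input)
        {
            if (Char.IsLetterOrDigit(ch) || ch == '-' || ch == '_')
            {
                sb.Append(ch);
            }
        }
        return sb.ToString();
    }

    public static string WithRegex(string input)
    {
        return Unwanted.Replace(input, "");
    }

    public static void Main()
    {
        const string sample = "AB&*^#Cabc(#&123--__";
        const int iterations = 100000;

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) { WithLoop(sample); }
        sw.Stop();
        Console.WriteLine("loop:  " + sw.ElapsedMilliseconds + " ms");

        sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) { WithRegex(sample); }
        sw.Stop();
        Console.WriteLine("regex: " + sw.ElapsedMilliseconds + " ms");
    }
}
```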

Some problems need full regular expression expressiveness, so in these
cases the cost and overhead of a RegEx are unavoidable.

> As far as performance goes, I don't think there's a reliable way to
> say that Regex is always better- or worse-performing than an
> equivalent explicit algorithm.

This class of problems is well studied and understood. There are quite
reliable ways to say when a RegEx is needed, what performance and
memory characteristics follow, and when other ways are needed or more
efficient.

These and much more are the basics of computer science. There's more
to programming than just trial and error.

> other hand, explicit algorithms may or may not be well-optimized,

But a regular expression may also be badly written and as such induce
much more overhead and worse performance than a better written RegEx
on the same regular expression engine. A regular expression is a
simple language, but still complex enough to say the same thing in
different ways.
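
As a small illustration of "the same thing in different ways", the two patterns below accept the same strings: \d and _ are already covered by \w, so the alternation and the capturing group add nothing. I have not measured the cost difference, so treat that part as an assumption; the names are made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EquivalentPatterns
{
    // Alternation inside a capturing group, with redundant branches.
    public const string Verbose = @"^(\w|\d|_|-)*$";

    // One character class, nothing captured.
    public const string Compact = @"^[\w-]*$";

    // True when both patterns give the same verdict for the input.
    public static bool Agree(string input)
    {
        return Regex.IsMatch(input, Verbose) == Regex.IsMatch(input, Compact);
    }

    public static void Main()
    {
        string[] samples = { "", "abc_123-x", "mrcsharpis_so_cool!!!", "a b" };
        foreach (string s in samples)
        {
            Console.WriteLine(Agree(s)); // True for each sample
        }
    }
}
```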

If you do a basic comparison of algorithms, you always have to assume
that the implementations are written as well as possible (for example, a
routine to copy a 10 character long string should not need 50MB RAM
and several minutes of runtime to do its job; it's always possible
to do worse, we are only interested in whether it's possible to do better).

> In addition, every time you write an explicit algorithm, you risk
> writing it wrong.  With Regex, yes there's the possibility of
> writing an incorrect expression, but it's more likely in that case
> that it just won't work.  It's much harder to get those subtle
> "happens once in awhile with only this very specific input".  Not
> impossible, but IMHO more difficult.

You haven't written many complex regular expressions, have you? It is
quite easy to get those subtle problems with a RegEx. But you are not
wrong. A regular expression is a type 3 grammar, while C# has (more or
less) a type 2 grammar (and the language is even Turing complete), so it
is much more expressive and there is much more potential for errors.

> But for the rest of us, using Regex is an exercise in frustration as
> we skip back and forth in the MSDN documentation trying to find just
[quoted text clipped - 3 lines]
> syntax of that grammar is pretty arcane IMHO, and has been very hard
> to learn, at least for me.

The concept of regular expressions is not that difficult. The most
common representation in today's languages is purely artificial. Other
representations and syntaxes are possible and do exist; for Common Lisp
there is a library called cl-ppcre implementing a quite efficient
regular expression engine (for some examples even faster than the C
engine) -- this engine understands the common representation but also
allows another syntax:

CL-USER> (ppcre::parse-string "^([\w\d_-])*$")
(:SEQUENCE :START-ANCHOR (:GREEDY-REPETITION 0 NIL (:REGISTER (:CHAR-CLASS #\w #\d #\_ #\-))) :END-ANCHOR)

It's quite a long representation, and maybe to some eyes even worse, but
it shows that other ways to notate a RegEx are quite possible.

> questions here asking how to write Regex expressions to do what I
> want

Maybe have a look at

 http://weitz.de/regex-coach/

an IMHO quite useful tool to learn regular expressions and to
experiment with them.

Stefan.

Stefan Nobis - 27 Feb 2008 09:51 GMT
> CL-USER> (ppcre::parse-string "^([\w\d_-])*$")
> (:SEQUENCE :START-ANCHOR (:GREEDY-REPETITION 0 NIL (:REGISTER (:CHAR-CLASS #\w #\d #\_ #\-))) :END-ANCHOR)

Oops, bad example. The simple translator doesn't convert \w and
\d. Sorry. It should read more like this (to put everything except \w,
- and _ in the register):

(:SEQUENCE :START-ANCHOR
          (:GREEDY-REPETITION 0 NIL
                              (:REGISTER
                               (:INVERTED-CHAR-CLASS :WORD-CHAR-CLASS
                                            #\_
                                            #\-)))
          :END-ANCHOR)

The first two parameters to :GREEDY-REPETITION are the minimum and
maximum allowed number of repetitions (the 0 NIL above corresponds to
*, and something like (:GREEDY-REPETITION 3 5 ...) corresponds to
...{3,5}). The syntax #\_ is Common Lisp syntax for the single
character _.
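
The same min/max idea in the usual .NET notation, as a small sketch: * behaves like {0,} (zero or more, no upper bound) and {3,5} means three to five repetitions. The character class and test strings are my own choice.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    public static bool Matches(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    public static void Main()
    {
        // "*" and "{0,}" both allow zero or more repetitions.
        Console.WriteLine(Matches("", @"^[\w-]*$"));          // True
        Console.WriteLine(Matches("", @"^[\w-]{0,}$"));       // True

        // "{3,5}" requires between three and five repetitions.
        Console.WriteLine(Matches("ab", @"^[\w-]{3,5}$"));     // False
        Console.WriteLine(Matches("abcd", @"^[\w-]{3,5}$"));   // True
        Console.WriteLine(Matches("abcdef", @"^[\w-]{3,5}$")); // False
    }
}
```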

Here is a handwritten example using the verbose syntax (I
don't have the perl-like version at hand, sorry):

(:sequence :start-anchor (:alternation #\# ";;;")
                   (:positive-lookahead :word-char-class)
                   (:register (:greedy-repetition 0 nil :word-char-class))
                   (:positive-lookahead
                    (:alternation :end-anchor
                                  (:sequence
                                   (:greedy-repetition 1 nil
                                                       :whitespace-char-class)
                                   :non-whitespace-char-class)))
                   (:greedy-repetition 0 1
                    (:sequence
                     (:greedy-repetition 1 nil :whitespace-char-class)
                     (:register (:greedy-repetition 0 nil :everything)))))

Stefan.

Arne Vajhøj - 28 Feb 2008 02:53 GMT
> I usually avoid regexes because of performance. In this case I haven't tested,
> but I would imagine the difference is approximately "who cares" ... nonetheless
> I just think of regexes as overkill in many situations where people try to
> use them.

Usually fewer lines of code is what is most cost effective overall.

Regex is simple code (and if the reader knows regex as a general concept
it is even easy to read) and code that is easy to modify to different
requirements.

It does come with a certain overhead. It may not be suited for
being called billions or trillions of times. But I doubt that was
the case here (the variable was named 'username').

Arne
Jesse Houwing - 26 Feb 2008 23:32 GMT
Hello KH,

> Regex is a bit overkill for that; you could...
>
[quoted text clipped - 11 lines]
> }
> str = sb.ToString();

Though, should your requirements become more complex, a regex solution like
the following can be used:

// Verbatim string (@"...") so that \w and \d are not treated as C# escape sequences.
string cleaned = Regex.Replace("string to clean", @"[^\w\d_-]", "", RegexOptions.None);

Just put all the characters you want to keep into the character class above.
Everything else will be removed.
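
A complete version of the replace approach, as a sketch using the username from the original question. The pattern is shortened to [^\w-] because \w already covers digits and the underscore. In .NET, \w also matches non-ASCII letters by default; the second method uses RegexOptions.ECMAScript, which restricts \w to [a-zA-Z_0-9]. The class and method names are mine.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexStrip
{
    // Keeps Unicode word characters and '-'.
    public static string Clean(string input)
    {
        return Regex.Replace(input, @"[^\w-]", "");
    }

    // Keeps only a-z, A-Z, 0-9, '_' and '-'.
    public static string CleanAscii(string input)
    {
        return Regex.Replace(input, @"[^\w-]", "", RegexOptions.ECMAScript);
    }

    public static void Main()
    {
        Console.WriteLine(Clean("mrcsharpis_so_cool!!!"));   // mrcsharpis_so_cool
        Console.WriteLine(Clean("Vajh\u00F8j").Length);      // 6 (the 'ø' is kept)
        Console.WriteLine(CleanAscii("Vajh\u00F8j"));        // Vajhj
    }
}
```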

Jesse

>> Hi,
>>
[quoted text clipped - 12 lines]
>>
>> How can I strip all the characters that I don't want?

--
Jesse Houwing
jesse.houwing at sogeti.nl
