.NET Forum / Languages / C# / April 2008

Regex Question

AMP - 21 Apr 2008 17:24 GMT
Hello,
I am coming back to a project and I don't remember what the following
Regex does.
I do know it removes all \r\n from the string, but I don't see how.
Can someone explain this one?

Regex re = new Regex(@"([\x00-\x1F\x7E-\xFF]+)",
RegexOptions.Compiled);
string op = re.Replace(FileToParse, "");

Thanks
Mike
Gilles Kohl [MVP] - 21 Apr 2008 18:36 GMT
>Hello,
>I am coming back to a project and I dont remember what the following
[quoted text clipped - 5 lines]
>RegexOptions.Compiled);
>string op = re.Replace(FileToParse, "");

How does it work? The outer parentheses are redundant IMHO: they only
create a capture group that is never used. The regex boils down to a
positive character group with two ranges, whose endpoints are written
as hexadecimal escapes: \x00-\x1F (0 to 31 in decimal) and \x7E-\xFF
(126 to 255 in decimal). With the appended "+", it means "one or more
characters in the range 0-31 or 126-255".
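
To make the two ranges concrete, here is a minimal sketch that counts how many character codes a pattern matches. The class and method names (CharClassDemo, CountMatching) are made up for illustration and are not part of the original project.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
static class CharClassDemo
{
    // Counts how many character codes in 0..limit-1 are matched by the pattern
    // when each code is tested as a one-character string.
    public static int CountMatching(string pattern, int limit)
    {
        Regex re = new Regex(pattern);
        int count = 0;
        for (int code = 0; code < limit; code++)
        {
            if (re.IsMatch(((char)code).ToString()))
            {
                count++;
            }
        }
        return count;
    }
}
```

Calling CharClassDemo.CountMatching(@"([\x00-\x1F\x7E-\xFF]+)", 512) should give 162, i.e. 32 codes from the first range plus 130 from the second, and the same number comes back when the outer parentheses are dropped.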

Replacing all these occurrences with nothing (empty string) removes
\r (13) and \n (10) because both fall in the first range, but it does
far more than that - it removes all characters in the range 0-31 and
126-255. The intention is probably to kill anything that is not in the
printable "ASCII" range. Unfortunately, it also kills the tilde "~"
(126).
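
A quick way to see this side effect is the sketch below. StripDemo and Strip are hypothetical names, and the pattern is the one from the question with the redundant parentheses left out.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
static class StripDemo
{
    // Removes every run of characters with codes 0-31 or 126-255.
    public static string Strip(string input)
    {
        return Regex.Replace(input, @"[\x00-\x1F\x7E-\xFF]+", "");
    }
}
```

With this, StripDemo.Strip("line1\r\nline2\t~end") returns "line1line2end": the \r\n pair and the tab (9) go, but so does the tilde.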

It will also remove e.g. accents and umlaut characters in the range
128-255. What it will NOT remove are Unicode characters from 256
upwards.

Try e.g.

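        // Needs: using System.Text.RegularExpressions; using System.Windows.Forms;
        // \u00e7 (231) lies inside \x7E-\xFF, \u0107 (263) lies above \xFF.
        string originalString = "Testing <\u00e7> <\u0107> ";

        Regex re = new Regex(@"([\x00-\x1F\x7E-\xFF]+)", RegexOptions.Compiled);
        string replacedString = re.Replace(originalString, "");

        MessageBox.Show(originalString);
        MessageBox.Show(replacedString);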

The first "special" character, a lowercase c with cedilla (U+00E7,
231), will be removed. The second one, a lowercase c with acute accent
(U+0107, 263), will not be affected.

(My suggestion, if your intention is to remove anything not in the
range 32-126, would be to use this:

Regex re = new Regex(@"[^\x20-\x7E]+", RegexOptions.Compiled);

instead.)
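
To compare the two patterns side by side, a small sketch along these lines may help. The names PatternComparison, StripOriginal and StripSuggested are invented for the example; only the two patterns come from this thread.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
static class PatternComparison
{
    // Pattern from the question: removes codes 0-31 and 126-255.
    public static string StripOriginal(string input)
    {
        return Regex.Replace(input, @"([\x00-\x1F\x7E-\xFF]+)", "");
    }

    // Suggested pattern: removes everything outside 32-126.
    public static string StripSuggested(string input)
    {
        return Regex.Replace(input, @"[^\x20-\x7E]+", "");
    }
}
```

For the input "a~b\r\n<\u00e7><\u0107>", StripOriginal gives "ab<><\u0107>" (tilde lost, c with acute kept), while StripSuggested gives "a~b<><>" (tilde kept, both accented characters removed).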

  Regards,
  Gilles.
