.NET Forum / Languages / C# / August 2007
Help with replacement pattern
Flomo Togba Kwele - 21 Aug 2007 03:12 GMT
I'm looking to replace all commas in a single string not contained within a pair of double-quotes, with a tab. I don't know where to begin.
e.g., string line = "a,",,"b" would be changed to "a,"\t\t"b"
Can someone suggest a regex pattern to use, or any other way that will accomplish this?
TIA Flomo --
Niels Ull - 21 Aug 2007 09:06 GMT
> I'm looking to replace all commas in a single string not contained
> within a pair of double-quotes, with a tab. I don't know where to
[quoted text clipped - 4 lines]
> Can someone suggest a regex pattern to use, or any other way that will
> accomplish this?
You want to replace all commas preceded by an even number of double quotes. Try using a look-behind pattern for that, e.g. something like
Regex.Replace(str, @"(?<=^([^""]*""[^""]*""[^""]*)*),", "\t");
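For readers who want to try the look-behind idea, here is a minimal sketch. It is not the code posted in the thread: the class and method names are hypothetical, and the pattern is a slight variant. As posted, the pattern appears to require at least one quoted pair before a comma unless the comma is the first character, so a comma in a line such as x,y would be skipped. The variant adds a leading [^"]* so that a prefix with no quotes is also accepted. It assumes quotes are balanced.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookBehindDemo
{
    // Variant of the posted pattern (an assumption, not the original):
    // a comma qualifies when the text from the start of the line up to it
    // contains an even number of double quotes, including zero.
    private static readonly Regex CommaOutsideQuotes =
        new Regex(@"(?<=^[^""]*(?:""[^""]*""[^""]*)*),");

    // Hypothetical helper: replaces every comma outside quotes with a tab.
    public static string CommasToTabs(string line)
    {
        return CommaOutsideQuotes.Replace(line, "\t");
    }

    public static void Main()
    {
        // The example from the question: "a,",,"b"
        Console.WriteLine(CommasToTabs("\"a,\",,\"b\""));
        // A line that starts with unquoted text: x,y,"p,q"
        Console.WriteLine(CommasToTabs("x,y,\"p,q\""));
    }
}
```

The look-behind rescans the line from its start for each comma, so the cost grows faster than linearly with line length.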
> TIA Flomo

Jialiang Ge [MSFT] - 21 Aug 2007 10:12 GMT
Hello Niels,
Thank you for the suggestion.
Hello Flomo,
Niels's regex is a third way to solve the problem, but it is still much slower than operating on the chars directly. I ran a test to compare the three methods (see Code Listing 1). The results for 100000 iterations are:
#direct operation on chars: 39ms
#my regex: 6280ms
#Niels's regex: 8997ms
Therefore, I think the direct operation on chars is the best way for now.
Code Listing 1:

using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string test = "\"a,,,,,\",\"j,dd,\"b\",\",\"";
        Test1(test);
        Test2(test);
        Test3(test);
        return;
    }

    // Method 1: direct operation on chars.
    public static void Test1(string test)
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();

        for (int times = 0; times < 100000; times++)
        {
            char[] str = test.ToCharArray();
            bool isInQuotes = false;
            for (int i = 0; i < str.Length; i++)
            {
                if (!isInQuotes && str[i] == ',')
                {
                    str[i] = '\t';
                    continue;
                }
                if (str[i] == '\"')
                    isInQuotes = !isInQuotes;
            }
            //Console.WriteLine(str);
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }

    // Method 3: Niels's look-behind regex.
    public static void Test3(string test)
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();

        for (int times = 0; times < 100000; times++)
        {
            Regex regex = new Regex("(?<=^([^\"]*\"[^\"]*\"[^\"]*)*),");
            regex.Replace(test, "\t");
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }

    // Method 2: regex with named groups and a MatchEvaluator.
    public static void Test2(string test)
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();

        for (int times = 0; times < 100000; times++)
        {
            Regex regex = new Regex("(?<head>\".*?\")*(?<remove>,*)(?<tail>\".*?\")*");
            MatchEvaluator myEvaluator = new MatchEvaluator(Program.ReplaceFunction);
            regex.Replace(test, myEvaluator);
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }

    public static string ReplaceFunction(Match m)
    {
        return m.Groups["head"].Value
            + m.Groups["remove"].Value.Replace(',', '\t')
            + m.Groups["tail"].Value;
    }
}
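One detail of the listing is that each regex test constructs a new Regex object inside the timed loop, so the figures may include construction cost as well as matching cost. The sketch below is an assumption about how the comparison could be rerun with the Regex built once. The names HoistedRegexTiming, ScanChars and TimeMs are hypothetical, and no timing result is claimed here, since the thread did not measure this variant.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class HoistedRegexTiming
{
    // Niels's pattern as posted, constructed once instead of once per iteration.
    public static readonly Regex Posted =
        new Regex("(?<=^([^\"]*\"[^\"]*\"[^\"]*)*),");

    // Same single-pass character scan as Test1, returned as a string.
    public static string ScanChars(string s)
    {
        char[] chars = s.ToCharArray();
        bool inQuotes = false;
        for (int i = 0; i < chars.Length; i++)
        {
            if (chars[i] == '"') inQuotes = !inQuotes;
            else if (!inQuotes && chars[i] == ',') chars[i] = '\t';
        }
        return new string(chars);
    }

    // Runs f on the input the given number of times and returns elapsed milliseconds.
    public static long TimeMs(Func<string, string> f, string input, int iterations)
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) f(input);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        string test = "\"a,,,,,\",\"j,dd,\"b\",\",\"";
        Console.WriteLine("chars: " + TimeMs(ScanChars, test, 100000) + "ms");
        Console.WriteLine("regex: " + TimeMs(s => Posted.Replace(s, "\t"), test, 100000) + "ms");
    }
}
```

Whether hoisting the construction changes the ranking depends on the runtime and the input, so it is best treated as something to measure rather than assume.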
Sincerely, Jialiang Ge (jialge@online.microsoft.com, remove 'online.') Microsoft Online Community Support
================================================= When responding to posts, please "Reply to Group" via your newsreader so that others may learn and benefit from your issue. ================================================= This posting is provided "AS IS" with no warranties, and confers no rights.
Jialiang Ge [MSFT] - 21 Aug 2007 09:50 GMT
Hello Flomo,
From your post, my understanding of the issue is that you want to use a regex to replace the commas that are not contained within a pair of quotes. If I'm off base, please feel free to let me know.
I think you can use the following regular expression to replace the commas. (But regex is not the recommended approach for this problem; see the performance comparison at the end of my reply.)
(?<head>".*?")*(?<remove>,*)(?<tail>".*?")*
Here is an explanation: the first (".*?")* tries to match any "" pairs in front of the commas to be replaced. The last (".*?")* tries to match any "" pairs behind the commas. Once the "" pairs are matched, any commas in the remaining text should be replaced with '\t'.
The complete C# code is listed below:

static void Main(string[] args)
{
    string test = "\"a,,,,,\",\"j,dd,\"b\",\",\"";
    Regex regex = new Regex("(?<head>\".*?\")*(?<remove>,*)(?<tail>\".*?\")*");
    MatchEvaluator myEvaluator = new MatchEvaluator(Program.ReplaceFunction);
    Console.WriteLine(regex.Replace(test, myEvaluator));
}

public static string ReplaceFunction(Match m)
{
    return m.Groups["head"].Value
        + m.Groups["remove"].Value.Replace(',', '\t')
        + m.Groups["tail"].Value;
}
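A caveat on the pattern above: when a quantified group such as (?<head>...)* matches more than once, Groups["head"].Value returns only the last capture, so an input with two adjacent quoted runs (for example "a""b",c) may lose the earlier run. The sketch below shows a different, simpler pattern than the one posted. It matches either a whole quoted run or a single comma and decides per match. The names are hypothetical, and it assumes quotes are balanced, because a comma after an unclosed quote would still be replaced.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedOrCommaDemo
{
    // Alternation: a complete quoted run, or a single comma outside one.
    private static readonly Regex QuotedOrComma = new Regex("\"[^\"]*\"|,");

    // Hypothetical helper: quoted runs are copied unchanged, bare commas become tabs.
    public static string CommasToTabs(string line)
    {
        return QuotedOrComma.Replace(
            line,
            m => m.Value[0] == '"' ? m.Value : "\t");
    }

    public static void Main()
    {
        // The test string used in the thread.
        Console.WriteLine(CommasToTabs("\"a,,,,,\",\"j,dd,\"b\",\",\""));
        // Two adjacent quoted runs followed by a comma: "a""b",c
        Console.WriteLine(CommasToTabs("\"a\"\"b\",c"));
    }
}
```

Because the quoted alternative is tried first at each position, commas inside a quoted run are consumed as part of that run and never reach the comma alternative.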
An alternative way to accomplish the task is to operate purely on the chars of the string. By iterating over the characters in the string, the task can be done in O(n), where n is the length of the string.

string test = "\"a,,,,,\",\"j,dd,\"b\",\",\"";
char[] str = test.ToCharArray();
bool isInQuotes = false;
for (int i = 0; i < str.Length; i++)
{
    if (!isInQuotes && str[i] == ',')
    {
        str[i] = '\t';
        continue;
    }
    if (str[i] == '\"')
        isInQuotes = !isInQuotes;
}
Console.WriteLine(str);

In the code above, isInQuotes is a flag indicating whether the current char is contained within a pair of quotes. If isInQuotes is false and the char is a comma, then we replace it with a '\t'.
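The same scan can be wrapped in a reusable method. This is a minimal sketch with hypothetical names (CharScanDemo, CommasToTabs), not code from the thread. It keeps the same rule: every double quote toggles the flag, and only commas seen while the flag is false are replaced.

```csharp
using System;

public static class CharScanDemo
{
    // Hypothetical helper: single pass over the chars, O(n) in the line length.
    public static string CommasToTabs(string line)
    {
        char[] chars = line.ToCharArray();
        bool inQuotes = false;
        for (int i = 0; i < chars.Length; i++)
        {
            if (chars[i] == '"')
            {
                inQuotes = !inQuotes;   // entering or leaving a quoted run
            }
            else if (!inQuotes && chars[i] == ',')
            {
                chars[i] = '\t';        // comma outside quotes
            }
        }
        return new string(chars);
    }

    public static void Main()
    {
        // The example from the question: "a,",,"b"
        Console.WriteLine(CommasToTabs("\"a,\",,\"b\""));
    }
}
```

A doubled quote inside a quoted run toggles the flag twice and so leaves it unchanged. If a quote is never closed, every comma after it is treated as quoted and left alone.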
Here is a performance comparison of the two approaches. I ran both methods 100000 times on the test string:
string test = "\"a,,,,,\",\"j,dd,\"b\",\",\"";
The result is that the regex takes 6380ms, but the char-based method takes only 39ms. Therefore, I recommend the latter. Regex is useful in complicated cases such as matching an email address, but it can sometimes be resource-consuming. When a problem can be solved in a single pass over the string, a direct operation on chars is recommended.
Please feel free to let me know if you have any other concerns.
Sincerely, Jialiang Ge (jialge@online.microsoft.com, remove 'online.') Microsoft Online Community Support
Jialiang Ge [MSFT] - 23 Aug 2007 04:14 GMT
Hi Flomo,
Would you mind letting me know the result of the suggestions? If you need further assistance, feel free to let me know. I will be more than happy to be of assistance.
Have a great day!
Sincerely, Jialiang Ge (jialge@online.microsoft.com, remove 'online.') Microsoft Online Community Support