
.NET Forum / .NET Framework / New Users / April 2005

Regex parsing e-mail question.

recoil@community.nospam - 14 Apr 2005 16:18 GMT
I am trying to match something like "keyword: ", where the colon and the space
act as a marker. I want everything after that marker, all the way up to the
next "Keyword:", where the keyword HAS to begin a new line. In other words, I
want everything before the next keyword.

(\: ).*(\n[a-zA-z])  comes extremely close, except for two things.
The most important is that it cannot match when the "content" that I want
spans multiple lines. For example, in an e-mail it would skip over
Received: from unknown (HELO barracuda.domain.com) (127.0.0.1)

    by 192.168.2.195 with SMTP; 24 Feb 2005 19:16:52 -0000

Also, I am wondering whether there is a way to specify that I want only the
text "after" the \: and before the \n.
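
One way to get only the text after the ": " marker and before the line break is to wrap that part in a capturing group and read the group instead of the whole match. The following is a minimal sketch, assuming .NET's System.Text.RegularExpressions; the class name, the method name and the group names "name" and "value" are made up for illustration and are not from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AfterColonDemo
{
    // Returns the text after "Name: " on the first header-like line,
    // or null if no line matches.
    public static string FirstValue(string input)
    {
        // RegexOptions.Multiline makes ^ match at the start of every line.
        // [^\r\n]* stops at the line break, so neither \r nor \n is captured.
        Match m = Regex.Match(
            input,
            @"^(?<name>[\w-]+): (?<value>[^\r\n]*)",
            RegexOptions.Multiline);
        return m.Success ? m.Groups["value"].Value : null;
    }

    public static void Main()
    {
        // Prints: New User Signup
        Console.WriteLine(FirstValue("Subject: New User Signup\r\nDate: Wed, 16 Mar 2005"));
    }
}
```

The marker itself still takes part in the match, but only the named group is returned, so the caller never sees the colon, the space or the line break.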

Any help would be greatly appreciated. Below are the sample regexes and the
sample input that I am using. And yes, Google may have bastardized some of
the input.

Regexes that I have tried. The first one has produced the closest results.

(\: ).*(\n[a-zA-z])
(\: ).*[(\n\s)].*(\n[a-zA-z])
(\: ).*[(\n\s)].*[^\n[a-zA-Z]]*(\n?[a-zA-Z])

----------------------------------------

Input

-------------------------------------------

Return-Path: <u...@domain.com>
Delivered-To: u...@domain.com
Received: (qmail 21118 invoked from network); 16 Mar 2005 20:41:33
-0000
Received: from unknown (HELO barracuda.domains.com) (192.168.192.194)
 by 192.168.2.195 with SMTP; 16 Mar 2005 20:41:33 -0000
X-ASG-Debug-ID: 1111005918-25079-3-0
X-Barracuda-URL: http://barracuda.domains.com:8?000/cgi-bin/mark.cgi
X-ASG-Whitelist:  Sender
X-ASG-Whitelist:  Sender
X-ASG-Whitelist:  Sender
Received: from domaindev1.domain.local (192-168-1-100.generator.isp.c?om
[192.168.1.100])
       by barracuda.domains.com (Spam Firewall) with ESMTP
       id AFE2D20A2F39; Wed, 16 Mar 2005 14:45:18 -0600 (CST)
Received: from tetco634 ([192.168.5.193]) by domaindev1.domain.local
with Microsoft SMTPSVC(6.0.3790.211);
        Wed, 16 Mar 2005 14:45:44 -0600
From: "user bleah" <u...@domain.com>
To: <u...@domain.com>
Cc: <user.not...@domain.com>,
       <user.supp...@domain.com>,
       <anotherper...@somebodyelse.co?m>
X-ASG-Orig-Subj: New User Signup
Subject: New User Signup
Date: Wed, 16 Mar 2005 14:45:44 -0600
MIME-Version: 1.0
Content-Type: multipart/alternative;
       boundary="----=_NextPart_000_0?0100_01C52A36.D545F720"
X-Mailer: Microsoft Office Outlook, Build 11.0.6353
Thread-Index: AcUqaR/LUDk7Lu7bQdu3vY6SjqLPAQ?==
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2527
Message-ID: <domainDEV1RapPodByDl00000...@?domaindev1.domain.local>
X-OriginalArrivalTime: 16 Mar 2005 20:45:44.0462 (UTC)
FILETIME=[1FDFCAE0:01C52A69]
X-Virus-Scanned: by Barracuda Spam Firewall at domains.com
X-Barracuda-Spam-Score: 0.00
X-Barracuda-Spam-Status: No, SCORE=0.00 using global scores of
TAG_LEVEL=3.5 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=6.0

This is a multi-part message in MIME format.

------=_NextPart_000_00100_01C?52A36.D545F720
Content-Type: text/plain;
       charset="us-ascii"
Content-Transfer-Encoding: 7bit

Content-Transfer-Encoding: 7bit
recoil@community.nospam - 14 Apr 2005 16:34 GMT
\: .*((\n[^\<\>a-zA-Z\.\@-])|.*).*(\n[a-zA-Z])
I got this to work, except that I need some way to tell it that the \n[^....]
part must be able to occur multiple times.
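
The usual way to let a sub-pattern occur any number of times is to put it in a non-capturing group and add * after the group. The sketch below is one possible pattern for folded headers, not the pattern from this thread. It assumes that continuation lines start with a space or a tab, and the class, method and group names are hypothetical. As an aside, the range A-z used in the first attempts also covers the punctuation characters that sit between Z and a in ASCII, so a-zA-Z is the safer spelling.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FoldedHeaderDemo
{
    // name:  header keyword at the start of a line
    // value: rest of that line plus any following lines
    //        that begin with a space or a tab
    private static readonly Regex HeaderPattern = new Regex(
        @"^(?<name>[\w-]+):[ \t]*(?<value>[^\r\n]*(?:\r?\n[ \t]+[^\r\n]*)*)",
        RegexOptions.Multiline);

    public static List<KeyValuePair<string, string>> Parse(string headerText)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in HeaderPattern.Matches(headerText))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups["name"].Value, m.Groups["value"].Value));
        }
        return result;
    }

    public static void Main()
    {
        string sample =
            "Received: from unknown (HELO barracuda.domains.com)\r\n" +
            " by 192.168.2.195 with SMTP; 16 Mar 2005 20:41:33 -0000\r\n" +
            "X-ASG-Debug-ID: 1111005918-25079-3-0\r\n";
        foreach (var header in Parse(sample))
        {
            Console.WriteLine("[" + header.Key + "] " + header.Value);
        }
    }
}
```

Every pass through the repeated group has to consume a line break, so there is only one way for the engine to divide the text between repetitions. That keeps backtracking small compared with patterns that stack several .* runs next to each other.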

recoil@community.nospam - 15 Apr 2005 22:11 GMT
If you look at the end of the original message, there is a raw e-mail header.
Basically, I need to be able to parse all of the contents of the e-mail.
Some of the Received: headers are multi-line headers; that is, the actual
contents of that specific header span multiple lines.
I have tried putting a * at various positions; however, the results have
either been a) the same or b) multiple headers being grabbed at the same
time.

There is a specific pattern that marks the contents of a header.
Each header keyword (this identifies what the header is) is a full word
at the beginning of a line, followed by a colon and a space.
The contents then follow. If the contents span multiple lines, there
is whitespace at the beginning of each continuation line. Therefore my
regex needs to start at the beginning of a line.

My pattern, if you take the raw contents of an e-mail message and match
against it, will return an array of headers. Unfortunately, any header that
spans more than 2 lines will not be returned.

You can try testing it here. I have an actual application that I use to test
the regex, but sadly I have not been able to find a way to get it to keep
matching repeated lines that begin with whitespace and then stop at the
first line that does not begin with whitespace.

http://www.regexlib.com/RETester.aspx
Regex of
\: .*((\n[^\<\>a-zA-Z\.\@-])|.*).*(\n[a-zA-Z])
Input of
Return-Path: <u...@domain.com>
Delivered-To: u...@domain.com
Received: (qmail 21118 invoked from network); 16 Mar 2005 20:41:33
-0000
Received: from unknown (HELO barracuda.domains.com) (192.168.192.194)
by 192.168.2.195 with SMTP; 16 Mar 2005 20:41:33 -0000
X-ASG-Debug-ID: 1111005918-25079-3-0
X-Barracuda-URL: http://barracuda.domains.com:8?000/cgi-bin/mark.cgi
X-ASG-Whitelist:  Sender
X-ASG-Whitelist:  Sender
X-ASG-Whitelist:  Sender
Received: from domaindev1.domain.local (192-168-1-100.generator.isp.c?om
[192.168.1.100])
      by barracuda.domains.com (Spam Firewall) with ESMTP
      id AFE2D20A2F39; Wed, 16 Mar 2005 14:45:18 -0600 (CST)
Received: from tetco634 ([192.168.5.193]) by domaindev1.domain.local
     with Microsoft SMTPSVC(6.0.3790.211);
       Wed, 16 Mar 2005 14:45:44 -0600
From: "user bleah" <u...@domain.com>
To: <u...@domain.com>
Cc: <user.not...@domain.com>,
      <user.supp...@domain.com>,
      <anotherper...@somebodyelse.co?m>
X-ASG-Orig-Subj: New User Signup
Subject: New User Signup
Date: Wed, 16 Mar 2005 14:45:44 -0600
MIME-Version: 1.0
Content-Type: multipart/alternative;
      boundary="----=_NextPart_000_0?0100_01C52A36.D545F720"
X-Mailer: Microsoft Office Outlook, Build 11.0.6353
Thread-Index: AcUqaR/LUDk7Lu7bQdu3vY6SjqLPAQ?==
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2527
Message-ID: <domainDEV1RapPodByDl00000...@?domaindev1.domain.local>
X-OriginalArrivalTime: 16 Mar 2005 20:45:44.0462 (UTC)
FILETIME=[1FDFCAE0:01C52A69]
X-Virus-Scanned: by Barracuda Spam Firewall at domains.com
X-Barracuda-Spam-Score: 0.00
X-Barracuda-Spam-Status: No, SCORE=0.00 using global scores of
TAG_LEVEL=3.5 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=6.0

This is a multi-part message in MIME format.

------=_NextPart_000_00100_01C?52A36.D545F720
Content-Type: text/plain;
      charset="us-ascii"
Content-Transfer-Encoding: 7bit

Content-Transfer-Encoding: 7bit

"Peter Huang" [MSFT] - 16 Apr 2005 08:53 GMT
Hi

Based on my understanding, you want to extract all of the content after each
xxxxx: marker.
Here is some code for your reference.

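using System;
using System.IO;
using System.Text.RegularExpressions;

string mstr;
// Read the whole raw message; the using block closes the reader.
using (StreamReader sr = new StreamReader(@"..\..\test.txt"))
    mstr = sr.ReadToEnd();
// Split on every "Name:" token that starts a line; the pieces in between are the values.
string[] strs = Regex.Split(mstr, @"^[\w-]+:", RegexOptions.Multiline);
using (StreamWriter sw = new StreamWriter(@"..\..\result.txt"))
{
    foreach (string str in strs)
    {
        if (str == "")
            continue;
        // Join folded lines by removing the CRLF line breaks.
        string s = str.Replace("\r\n", string.Empty);
        sw.WriteLine(s);
        Console.WriteLine(s);
    }
}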

Result:
<u...@domain.com>
u...@domain.com
(qmail 21118 invoked from network); 16 Mar 2005 20:41:33-0000
from unknown (HELO barracuda.domains.com) (192.168.192.194) by
192.168.2.195 with SMTP; 16 Mar 2005 20:41:33 -0000
1111005918-25079-3-0
http://barracuda.domains.com:8?00/cgi-bin/mark.cgi
 Sender
 Sender
 Sender
from domaindev1.domain.local
(192-168-1-100.generator.isp.com[192.168.1.100])       by
barracuda.domains.com (Spam Firewall) with ESMTP       id AFE2D20A2F39;
Wed, 16 Mar 2005 14:45:18 -0600 (CST)
from tetco634 ([192.168.5.193]) by domaindev1.domain.local      with
Microsoft SMTPSVC(6.0.3790.211);        Wed, 16 Mar 2005 14:45:44 -0600
"user bleah" <u...@domain.com>
<u...@domain.com>
<user.not...@domain.com>,       <user.supp...@domain.com>,      
<anotherper...@somebodyelse.com>
New User Signup
New User Signup
Wed, 16 Mar 2005 14:45:44 -0600
1.0
multipart/alternative;      
boundary="----=_NextPart_000_0?100_01C52A36.D545F720"
Microsoft Office Outlook, Build 11.0.6353
AcUqaR/LUDk7Lu7bQdu3vY6SjqLPAQ?=
Produced By Microsoft MimeOLE V6.00.2900.2527
<domainDEV1RapPodByDl00000...@comaindev1.domain.local>
16 Mar 2005 20:45:44.0462 (UTC)FILETIME=[1FDFCAE0:01C52A69]
by Barracuda Spam Firewall at domains.com
0.00
No, SCORE=0.00 using global scores ofTAG_LEVEL=3.5 QUARANTINE_LEVEL=1000.0
KILL_LEVEL=6.0This is a multi-part message in MIME
format.------=_NextPart_000_00100_01CD2A36.D545F720
text/plain;       charset="us-ascii"
7bit
7bit
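
The Replace call in the code above removes the line breaks but keeps the indentation of the continuation lines, which is why runs of spaces show up inside some of the values in the result. A possible refinement, sketched here under the assumption that a continuation line always starts with spaces or tabs, is to collapse each line break together with the indentation that follows it into a single space. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnfoldDemo
{
    // Replaces each line break plus the indentation that follows it
    // with one space, then trims whitespace from both ends.
    public static string Unfold(string rawValue)
    {
        return Regex.Replace(rawValue, @"\r?\n[ \t]+", " ").Trim();
    }

    public static void Main()
    {
        string raw = " <user.not...@domain.com>,\r\n       <user.supp...@domain.com>";
        // Prints: <user.not...@domain.com>, <user.supp...@domain.com>
        Console.WriteLine(Unfold(raw));
    }
}
```

A line break that is not followed by whitespace is left alone, so text from a following unindented line is not glued onto the value.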

Best regards,

Peter Huang
Microsoft Online Partner Support

Get Secure! - www.microsoft.com/security
This posting is provided "AS IS" with no warranties, and confers no rights.

recoil@community.nospam - 16 Apr 2005 13:44 GMT
I will keep that method in mind. I guess one of the reasons I had not come
across that approach is that it would require me to first extract a copy of
just the e-mail headers, since the split would otherwise run over the entire
contents of the e-mail, and I only want the e-mail header.

Thanks.

"Peter Huang" [MSFT] - 19 Apr 2005 04:32 GMT
Hi

In the Split method, the regex matches the delimiter. To address your
concern, I think we just need to clip the header from the content first.
If you still have any concerns, please feel free to post here.
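
Clipping the header can be done without a regex, because the header block of an Internet message ends at the first blank line. The sketch below is one assumed way to do it with IndexOf and Substring; the class and method names are made up for illustration. The clipped text can then be passed to Regex.Split as before.

```csharp
using System;

public static class HeaderClipDemo
{
    // Returns the text before the first blank line,
    // or the whole input if there is no blank line.
    public static string ClipHeader(string rawMessage)
    {
        int crlf = rawMessage.IndexOf("\r\n\r\n", StringComparison.Ordinal);
        int lf = rawMessage.IndexOf("\n\n", StringComparison.Ordinal);

        int end;
        if (crlf >= 0 && lf >= 0)
            end = Math.Min(crlf, lf);   // both styles present: take the earlier one
        else
            end = Math.Max(crlf, lf);   // whichever was found, or -1 if neither

        return end >= 0 ? rawMessage.Substring(0, end) : rawMessage;
    }

    public static void Main()
    {
        string raw =
            "Subject: New User Signup\r\nMIME-Version: 1.0\r\n\r\n" +
            "This is a multi-part message in MIME format.";
        // Prints the two header lines only.
        Console.WriteLine(ClipHeader(raw));
    }
}
```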

Best regards,

Peter Huang
Microsoft Online Partner Support


recoil@community.nospam - 25 Apr 2005 15:52 GMT
I finally solved it. I took a mixture of your idea and my original idea,
then merged and altered them, so I ended up using IndexOf instead of
regex. It turned out to be about 4-10 times faster when parsing
1k+ e-mails, and for some reason certain input would cause the
Regex to hang at 100% CPU usage indefinitely, which turned out to be a
real show stopper.

Glad for the help.
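
The actual IndexOf code was not posted. The following is only a guess at what such an approach could look like: it walks the message line by line, treats lines that start with a space or tab as continuations of the previous header, and stops at the first blank line. All names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class IndexOfHeaderParser
{
    public static List<KeyValuePair<string, string>> Parse(string rawMessage)
    {
        var headers = new List<KeyValuePair<string, string>>();
        string[] lines = rawMessage.Replace("\r\n", "\n").Split('\n');

        foreach (string line in lines)
        {
            if (line.Length == 0)
                break; // the first blank line ends the header block

            if ((line[0] == ' ' || line[0] == '\t') && headers.Count > 0)
            {
                // Continuation line: append it to the previous header's value.
                KeyValuePair<string, string> last = headers[headers.Count - 1];
                headers[headers.Count - 1] = new KeyValuePair<string, string>(
                    last.Key, last.Value + " " + line.Trim());
                continue;
            }

            int colon = line.IndexOf(':');
            if (colon <= 0)
                continue; // not a "Name: value" line; skipped in this sketch

            headers.Add(new KeyValuePair<string, string>(
                line.Substring(0, colon), line.Substring(colon + 1).Trim()));
        }
        return headers;
    }

    public static void Main()
    {
        string raw =
            "Cc: <user.not...@domain.com>,\r\n" +
            "       <user.supp...@domain.com>\r\n" +
            "Subject: New User Signup\r\n\r\nBody: not a header";
        foreach (KeyValuePair<string, string> h in Parse(raw))
            Console.WriteLine(h.Key + " => " + h.Value);
    }
}
```

Each line is examined once and nothing is ever retried, so the running time grows linearly with the size of the message and there is no input that can make it hang the way a backtracking regex can.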
"Peter Huang" [MSFT] - 15 Apr 2005 07:11 GMT
Hi Recoil,

Have you tried adding a * at the end of the \n[^....] part?
(\n[^\<\>a-zA-Z\.\@-])*

Also, I still do not fully understand your scenario.
Could you post a simple one- or two-line input and the output you want?

Best regards,

Peter Huang
Microsoft Online Partner Support


