If you look at the end of the original message there is a raw e-mail header.
Basically I need to be able to parse all of the contents of the e-mail.
Some of the Received: headers are multi-line headers.
That is, the actual contents of that specific header span multiple lines.
I have tried putting a * at various positions; however, the results have
either been a) the same or b) the regex starts grabbing multiple headers
at the same time.
There is a specific pattern that marks the contents of a header.
Each header name (this identifies what the header is) will be a full word
at the beginning of a line. It will be followed by a colon and a space.
The contents then follow. If the contents span multiple lines, each
continuation line will begin with one or more whitespace characters.
Therefore my regex needs to start at the beginning of a line. If you take
the raw contents of an e-mail message and match my pattern against it, it
returns an array of headers. Unfortunately, any header that spans more
than 2 lines is not returned.
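For illustration, the folding rule described above (a name at the start of
a line, a colon, then the value plus any following lines that begin with
whitespace) can be written as a single pattern. This is only a sketch, not
the pattern from this post: the class name FoldedHeaderRegex and the method
Parse are made up for the example, and it assumes the input is the header
block only, not the whole message.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FoldedHeaderRegex
{
    // Name at the start of a line, a colon, then the rest of that line plus
    // any continuation lines (lines that begin with a space or a tab).
    // [^\r\n] is used instead of "." so that a trailing \r is never captured.
    private static readonly Regex HeaderPattern = new Regex(
        @"^(?<name>[\w-]+):[ \t]*(?<value>[^\r\n]*(?:\r?\n[ \t]+[^\r\n]*)*)",
        RegexOptions.Multiline);

    // Assumption: headerBlock holds only the headers, not the message body.
    public static List<KeyValuePair<string, string>> Parse(string headerBlock)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in HeaderPattern.Matches(headerBlock))
        {
            // Unfold: each line break plus its leading whitespace becomes one space.
            string value = Regex.Replace(m.Groups["value"].Value, @"\r?\n[ \t]+", " ");
            result.Add(new KeyValuePair<string, string>(m.Groups["name"].Value, value));
        }
        return result;
    }
}
```

The continuation group can repeat any number of times, so a header that
spans three or more lines is captured as one match, and the match stops at
the first line that does not begin with whitespace.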
You can try testing it here. I have an actual application that I use to
test the regex, but sadly I have not found a way to get it to keep
matching repeated lines that begin with whitespace and then stop at the
first line that does not begin with whitespace.
http://www.regexlib.com/RETester.aspx
Regex of
\: .*((\n[^\<\>a-zA-Z\.\@-])|.*).*(\n[a-zA-Z])
Input of
Return-Path: <u...@domain.com>
Delivered-To: u...@domain.com
Received: (qmail 21118 invoked from network); 16 Mar 2005 20:41:33
-0000
Received: from unknown (HELO barracuda.domains.com) (192.168.192.194)
by 192.168.2.195 with SMTP; 16 Mar 2005 20:41:33 -0000
X-ASG-Debug-ID: 1111005918-25079-3-0
X-Barracuda-URL: http://barracuda.domains.com:8000/cgi-bin/mark.cgi
X-ASG-Whitelist: Sender
X-ASG-Whitelist: Sender
X-ASG-Whitelist: Sender
Received: from domaindev1.domain.local (192-168-1-100.generator.isp.com
[192.168.1.100])
by barracuda.domains.com (Spam Firewall) with ESMTP
id AFE2D20A2F39; Wed, 16 Mar 2005 14:45:18 -0600 (CST)
Received: from tetco634 ([192.168.5.193]) by domaindev1.domain.local
with Microsoft SMTPSVC(6.0.3790.211);
Wed, 16 Mar 2005 14:45:44 -0600
From: "user bleah" <u...@domain.com>
To: <u...@domain.com>
Cc: <user.not...@domain.com>,
<user.supp...@domain.com>,
<anotherper...@somebodyelse.com>
X-ASG-Orig-Subj: New User Signup
Subject: New User Signup
Date: Wed, 16 Mar 2005 14:45:44 -0600
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="----=_NextPart_000_00100_01C52A36.D545F720"
X-Mailer: Microsoft Office Outlook, Build 11.0.6353
Thread-Index: AcUqaR/LUDk7Lu7bQdu3vY6SjqLPAQ==
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2527
Message-ID: <domainDEV1RapPodByDl00000...@domaindev1.domain.local>
X-OriginalArrivalTime: 16 Mar 2005 20:45:44.0462 (UTC)
FILETIME=[1FDFCAE0:01C52A69]
X-Virus-Scanned: by Barracuda Spam Firewall at domains.com
X-Barracuda-Spam-Score: 0.00
X-Barracuda-Spam-Status: No, SCORE=0.00 using global scores of
TAG_LEVEL=3.5 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=6.0
This is a multi-part message in MIME format.
------=_NextPart_000_00100_01C52A36.D545F720
Content-Type: text/plain;
charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Transfer-Encoding: 7bit
> \: .*((\n[^\<\>a-zA-Z\.\@-])|.*).*(\n[a-zA-Z])
> I got this to work except I need some way to tell it that the \n[^....]
"Peter Huang" [MSFT] - 16 Apr 2005 08:53 GMT
Hi
Based on my understanding, you want to extract all the content after the
"xxxxx:" header name.
Here is some code for your reference.
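using System;
using System.IO;
using System.Text.RegularExpressions;

// Read the whole raw message into one string.
string mstr;
using (StreamReader sr = new StreamReader(@"..\..\test.txt"))
{
    mstr = sr.ReadToEnd();
}
// Split on every "Header-Name:" that appears at the start of a line.
string[] strs = Regex.Split(mstr, @"^[\w-]+:", RegexOptions.Multiline);
using (StreamWriter sw = new StreamWriter(@"..\..\result.txt"))
{
    foreach (string str in strs)
    {
        if (str == "")
            continue; // skip the empty piece before the first header
        // Join multi-line values by removing the CRLF line breaks.
        string s = str.Replace("\r\n", string.Empty);
        sw.WriteLine(s);
        Console.WriteLine(s);
    }
}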
Result:
<u...@domain.com>
u...@domain.com
(qmail 21118 invoked from network); 16 Mar 2005 20:41:33-0000
from unknown (HELO barracuda.domains.com) (192.168.192.194) by
192.168.2.195 with SMTP; 16 Mar 2005 20:41:33 -0000
1111005918-25079-3-0
http://barracuda.domains.com:8000/cgi-bin/mark.cgi
Sender
Sender
Sender
from domaindev1.domain.local
(192-168-1-100.generator.isp.com[192.168.1.100]) by
barracuda.domains.com (Spam Firewall) with ESMTP id AFE2D20A2F39;
Wed, 16 Mar 2005 14:45:18 -0600 (CST)
from tetco634 ([192.168.5.193]) by domaindev1.domain.local with
Microsoft SMTPSVC(6.0.3790.211); Wed, 16 Mar 2005 14:45:44 -0600
"user bleah" <u...@domain.com>
<u...@domain.com>
<user.not...@domain.com>, <user.supp...@domain.com>,
<anotherper...@somebodyelse.com>
New User Signup
New User Signup
Wed, 16 Mar 2005 14:45:44 -0600
1.0
multipart/alternative;
boundary="----=_NextPart_000_00100_01C52A36.D545F720"
Microsoft Office Outlook, Build 11.0.6353
AcUqaR/LUDk7Lu7bQdu3vY6SjqLPAQ==
Produced By Microsoft MimeOLE V6.00.2900.2527
<domainDEV1RapPodByDl00000...@domaindev1.domain.local>
16 Mar 2005 20:45:44.0462 (UTC)FILETIME=[1FDFCAE0:01C52A69]
by Barracuda Spam Firewall at domains.com
0.00
No, SCORE=0.00 using global scores ofTAG_LEVEL=3.5 QUARANTINE_LEVEL=1000.0
KILL_LEVEL=6.0This is a multi-part message in MIME
format.------=_NextPart_000_00100_01C52A36.D545F720
text/plain; charset="us-ascii"
7bit
7bit
Best regards,
Peter Huang
Microsoft Online Partner Support

Get Secure! - www.microsoft.com/security
This posting is provided "AS IS" with no warranties, and confers no rights.
recoil@community.nospam - 16 Apr 2005 13:44 GMT
I will keep that method in mind. I guess one reason I had not considered
that approach is that it would require me to first extract a copy of just
the e-mail headers, since the split would otherwise run over the entire
contents of the e-mail, and I only want the e-mail header.
Thanks.
"Peter Huang" [MSFT] - 19 Apr 2005 04:32 GMT
Hi
In the Split method, the regex matches the delimiter. To address your
concern, I think we just need to clip the header block from the rest of
the message before splitting.
If you have any further concerns, please feel free to post here.
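A minimal sketch of what clipping the header from the content could look
like, assuming the usual mail layout in which the first empty line
separates the headers from the body. The names HeaderClipper and
ClipHeaderBlock are made up for this example.

```csharp
using System;

public static class HeaderClipper
{
    // Returns everything before the first empty line of the raw message.
    public static string ClipHeaderBlock(string rawMessage)
    {
        int crlf = rawMessage.IndexOf("\r\n\r\n", StringComparison.Ordinal);
        // Assumption: also tolerate bare-LF input, e.g. a file saved on Unix.
        int lf = rawMessage.IndexOf("\n\n", StringComparison.Ordinal);

        // Use whichever separator occurs first; -1 means "not found".
        int end = crlf < 0 ? lf : (lf < 0 ? crlf : Math.Min(crlf, lf));

        // No empty line at all: treat the whole input as headers.
        return end < 0 ? rawMessage : rawMessage.Substring(0, end);
    }
}
```

The result can then be passed to Regex.Split in place of the full message
text, so that lines in the body that happen to look like "Name:" are never
split on.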
Best regards,
Peter Huang
Microsoft Online Partner Support

recoil@community.nospam - 25 Apr 2005 15:52 GMT
I finally solved it. I took a mixture of your idea and my original idea,
then merged and altered them so that I ended up using IndexOf instead of
a regex. It turned out to be about 4-10 times faster when parsing more
than 1,000 e-mails, and for some reason certain input data would cause
the Regex to hang at 100% CPU usage indefinitely, which turned out to be
a real show stopper.
Glad for the help.
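The final code was not posted, so the following is only a guess at what an
IndexOf-based parser along these lines might look like. The class name
IndexOfHeaderParser and its Parse method are hypothetical. It walks the
text one line at a time, treats a line that begins with whitespace as a
continuation of the previous header, and stops at the first empty line.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class IndexOfHeaderParser
{
    public static List<KeyValuePair<string, string>> Parse(string rawMessage)
    {
        var headers = new List<KeyValuePair<string, string>>();
        string name = null;
        var value = new StringBuilder();
        int pos = 0;
        while (pos < rawMessage.Length)
        {
            // Find the end of the current line with IndexOf instead of a regex.
            int eol = rawMessage.IndexOf('\n', pos);
            if (eol < 0) eol = rawMessage.Length;
            string line = rawMessage.Substring(pos, eol - pos).TrimEnd('\r');
            pos = eol + 1;

            if (line.Length == 0) break; // empty line: end of the headers

            if (line[0] == ' ' || line[0] == '\t')
            {
                // Continuation line: append it to the header in progress.
                if (name != null) value.Append(' ').Append(line.Trim());
                continue;
            }

            int colon = line.IndexOf(':');
            if (colon <= 0) continue; // not a header line; skip it

            // A new header starts, so store the one collected so far.
            if (name != null)
                headers.Add(new KeyValuePair<string, string>(name, value.ToString()));
            name = line.Substring(0, colon);
            value.Clear();
            value.Append(line.Substring(colon + 1).Trim());
        }
        if (name != null)
            headers.Add(new KeyValuePair<string, string>(name, value.ToString()));
        return headers;
    }
}
```

Each character is visited a bounded number of times and nothing is ever
retried, so unlike a backtracking regex this kind of loop cannot hang on
unusual input.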