I am doing some research for which I need to parse data from the SEC web site.
The data is not in XML format and is largely unstructured.
Can someone recommend a way to parse this data?
I need to gather a lot of filings of this sort, which I would rather not do
manually.
How can I programmatically parse text files like these?
http://www.sec.gov/Archives/edgar/data/1074272/0001074272-08-000001.txt
http://www.sec.gov/Archives/edgar/data/1428793/000121465908000555/0001214659-08-000555.txt
http://www.sec.gov/Archives/edgar/data/791191/0000791191-08-000001.txt
Thank you,
> i am doing some research where i need to parse some data from SEC web site.
> the data is not in xml format and sort of unstructured.
[quoted text clipped - 5 lines]
>
> http://www.sec.gov/Archives/edgar/data/1074272/0001074272-08-000001.txt
As that document seems to be a mixture of XML and plain text, I would
consider a mixed approach: use an XML parser to parse the XML parts, then
regular-expression-based parsing for the plain-text parts.
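As a rough illustration of that mixed approach, here is a minimal sketch in
Python (chosen only for brevity; the same idea carries over to other
platforms). Everything specific in it is an assumption: the SAMPLE text is
made up rather than copied from a real filing, the <XML>...</XML> wrapper and
the "NAME: value" header lines are guesses at the layout, and parse_filing is
a hypothetical helper name. The patterns would need adjusting after looking
at the actual files.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample, NOT copied from a real filing: plain-text header
# lines followed by one embedded XML fragment.
SAMPLE = """\
COMPANY NAME: EXAMPLE CORP
FILED AS OF DATE: 20080312
<XML>
<report>
  <holder>Example Fund</holder>
  <shares>1500</shares>
</report>
</XML>
"""

# Assumption: XML fragments are wrapped in <XML>...</XML> markers.
XML_BLOCK = re.compile(r"<XML>(.*?)</XML>", re.DOTALL)
# Assumption: header fields look like "UPPERCASE NAME: value" on one line.
HEADER_FIELD = re.compile(r"^([A-Z][A-Z ]*?):[ \t]*(.+)$", re.MULTILINE)


def parse_filing(text):
    """Return (header_fields, xml_roots) for the text of one filing."""
    # Step 1: hand every embedded XML fragment to a real XML parser.
    xml_roots = [
        ET.fromstring(match.group(1).strip())
        for match in XML_BLOCK.finditer(text)
    ]
    # Step 2: remove the XML, then use regular expressions on the rest.
    plain = XML_BLOCK.sub("", text)
    fields = {
        name.strip(): value.strip()
        for name, value in HEADER_FIELD.findall(plain)
    }
    return fields, xml_roots


if __name__ == "__main__":
    fields, roots = parse_filing(SAMPLE)
    print(fields)
    for root in roots:
        print(root.tag, root.findtext("holder"), root.findtext("shares"))
```

The regular expression is used only to locate the XML fragments and the
header lines; the content of each fragment goes through the XML parser, so
malformed XML raises an error instead of being silently misread.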
XSLT 1.0 is certainly not a language that is suitable for that task. If
you use XSLT 2.0, however, you have support for regular expressions.
There are currently three XSLT 2.0 processors: Saxon 9, which has a Java
and a .NET version (http://saxon.sourceforge.net/); AltovaXML, a COM
solution (http://www.altova.com/altovaxml.html); and Gestalt, an Eiffel
implementation (http://gestalt.sourceforge.net/).
If you want to do it with tools available in the .NET framework class
library, combine XmlReader or XPathDocument/XPathNavigator with the
regular expression support in the .NET framework.

Signature
Martin Honnen --- MVP XML
http://JavaScript.FAQTs.com/
raj@aol.com - 12 Mar 2008 01:11 GMT