.NET Forum / .NET Framework / XML / December 2004

Querying Very Large XML

Greg - 27 Dec 2004 19:48 GMT
I am working on a project that will have about 500,000 records in an XML
document.  This document will need to be queried with XPath, and records
will need to be updated. I was thinking about splitting up the XML into
several XML documents (perhaps 50,000 per document) to be more efficient but
this will make things a lot more complex because the searching needs to go
across all 500,000 records.  Can anyone point me to some best practices /
performance techniques for handling large XML documents?   Obviously, the
XmlDocument object is probably not a good choice...
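
One way to keep cross-file searching manageable is to run the same query against each chunk in turn and merge the results. A minimal sketch, in Python rather than .NET, assuming a hypothetical <records>/<record> layout with a <city> field (none of these names come from the actual data). Note that Python's ElementTree supports only a subset of XPath:

```python
import xml.etree.ElementTree as ET

# Two small in-memory "chunks"; in practice each would be a separate file
# loaded with ET.parse(path).getroot(). The layout is hypothetical.
CHUNKS = [
    "<records>"
    "<record id='1'><city>Austin</city></record>"
    "<record id='2'><city>Boston</city></record>"
    "</records>",
    "<records>"
    "<record id='3'><city>Austin</city></record>"
    "</records>",
]

def query_chunks(chunks, xpath):
    """Run one XPath expression against every chunk and yield all matches."""
    for text in chunks:
        root = ET.fromstring(text)  # only one chunk is held in memory at a time
        yield from root.findall(xpath)

matches = [e.get("id") for e in query_chunks(CHUNKS, "./record[city='Austin']")]
print(matches)
```

Only one chunk is parsed at a time, so peak memory is bounded by the largest chunk rather than by the whole data set.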
Christoph Schittko [MVP] - 27 Dec 2004 20:55 GMT
Greg,

The recommended store to query large XML documents in .NET is the
XPathDocument. However, the XPathDocument, just as the XmlDocument, will
keep all data from the document plus all the DOM-related information in
memory, i.e. you will need sufficient memory in your server. On top of
that, you have to deal with whatever query optimizing the XPathDocument
does under the covers. If you wanted to add any custom indexing, you
would first have to walk the entire document to build your custom index.
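
As a rough illustration of that "walk the entire document once" step, here is a sketch in Python rather than .NET. The element names, the `id` attribute and the choice of `city` as the indexed field are all hypothetical. It streams the document and builds a lookup from a field value to the ids of the matching records:

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_index(source, record_tag, key_tag):
    """Stream the document once and map each key value to matching record ids."""
    index = defaultdict(list)
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            key = elem.findtext(key_tag)
            if key is not None:
                index[key].append(elem.get("id"))
            elem.clear()  # drop the record's content once it has been indexed
    return index

# Hypothetical sample data standing in for the real file.
SAMPLE = b"""<records>
  <record id="1"><city>Austin</city></record>
  <record id="2"><city>Boston</city></record>
  <record id="3"><city>Austin</city></record>
</records>"""

index = build_index(io.BytesIO(SAMPLE), "record", "city")
print(index["Austin"])
```

The index only pays off for fields that are queried repeatedly; an ad hoc query on an unindexed field still needs a full scan.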

Would you be able to add a SQL Server database (MSDE might do, but
preferably SQL 2005 Express, currently in Beta 2) to your environment?
Is your XML format strongly structured, so it's easily shredded into
relational tables? If that's the case, you'd save yourself the headache
of managing memory and indexes and let SQL Server do the work for you.
With SQL 2005 you can even store the XML document as a whole in a column
and let SQL Server do the indexing.
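
To make "shredding" concrete, here is a minimal sketch that uses Python with SQLite as a stand-in for SQL Server (a deliberate substitution so the example is self-contained). The record layout, table name and column names are hypothetical:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, strongly structured sample data.
SAMPLE = """<records>
  <record id="1"><name>Ann</name><city>Austin</city></record>
  <record id="2"><name>Bob</name><city>Boston</city></record>
</records>"""

def shred(xml_text, conn):
    """Copy each <record> element into one row of a relational table."""
    conn.execute(
        "CREATE TABLE record (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
    )
    rows = [
        (int(r.get("id")), r.findtext("name"), r.findtext("city"))
        for r in ET.fromstring(xml_text).iter("record")
    ]
    conn.executemany("INSERT INTO record VALUES (?, ?, ?)", rows)
    # The database maintains this index; no custom index code is needed.
    conn.execute("CREATE INDEX idx_record_city ON record (city)")
    conn.commit()

conn = sqlite3.connect(":memory:")
shred(SAMPLE, conn)
names = [row[0] for row in
         conn.execute("SELECT name FROM record WHERE city = ?", ("Austin",))]
print(names)
```

Once the data is in tables, queries and updates become ordinary SQL statements, and the engine handles memory and index maintenance.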

HTH,
Christoph Schittko
MS MVP XML
http://weblogs.asp.net/cschittko

Greg - 27 Dec 2004 21:20 GMT
Thanks for the info, Chris.  I was thinking along the same lines w/ the XML
objects.  Unfortunately, a database isn't really an option for us due to
the cost (or perceived cost... and databases need DBAs...).  A big reason
for using XML is to avoid having to use and maintain a database.  We are
phasing out an old VAX program that currently does things completely file
based, and trying to do a similar thing with XML on the .NET platform.  The
data tends to be relatively simple: the general process is going from a
fixed flat file, converting to XML, and then allowing the user to build
queries for tweaking some of the data.  The queries would be XPath (of
course, built with a nice UI)...  Perhaps one of the biggest challenges is
eliminating duplicate records across the entire data set.  I'll probably
have to come up with an interesting data structure to do it efficiently in
conjunction with XML since I won't be loading everything into the
XPathDocument at once.  I would think everything else that has to be done
should be relatively doable by chunking out the files and using
XPathDocuments and XPath queries.
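
One possible data structure for the cross-file duplicate check is a set of fixed-size record hashes that is carried from chunk to chunk. The sketch below is in Python rather than .NET. It assumes a hypothetical layout where each <record> is a direct child of the root, and it assumes that "duplicate" means identical child tags and text in the same order:

```python
import hashlib
import xml.etree.ElementTree as ET

def record_key(elem):
    """Hash a record's child tags and text (an assumed definition of 'duplicate')."""
    parts = [f"{child.tag}={(child.text or '').strip()}" for child in elem]
    return hashlib.sha1("\x1f".join(parts).encode("utf-8")).digest()

def dedupe_chunks(chunks, record_tag="record"):
    """Drop records already seen in this or any earlier chunk."""
    seen = set()  # 20 bytes of digest per distinct record, shared across chunks
    cleaned = []
    for text in chunks:
        root = ET.fromstring(text)
        for rec in list(root):  # copy the child list because we remove while looping
            if rec.tag != record_tag:
                continue
            key = record_key(rec)
            if key in seen:
                root.remove(rec)
            else:
                seen.add(key)
        cleaned.append(ET.tostring(root, encoding="unicode"))
    return cleaned

# Hypothetical sample chunks; "Bob" appears in both.
CHUNK_A = "<records><record><name>Ann</name></record><record><name>Bob</name></record></records>"
CHUNK_B = "<records><record><name>Bob</name></record><record><name>Cy</name></record></records>"

cleaned = dedupe_chunks([CHUNK_A, CHUNK_B])
counts = [len(ET.fromstring(c).findall("record")) for c in cleaned]
print(counts)
```

Only the hashes stay in memory between chunks, so the full 500,000 records never need to be loaded at once.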

Christoph Schittko [MVP] - 28 Dec 2004 04:41 GMT
Greg,

I was hoping that MSDE (or SQL 2005 Express) might let you get around
the "we don't want to run a database" argument. Both versions are free
and shouldn't require much maintenance. Yet they provide the same XML
support as the full version of SQL Server. The only downside is that
they are not really built for concurrent access by a bigger number of
users simultaneously.

You sound like you know what you're in for with not using a database in
terms of concurrency management, access control, indexing across the
individual chunks, transactional integrity, etc, i.e. all those reasons
why databases are popular ;).

If you determined that it's still more economical to build that
functionality then that's hard to argue with. The trickiest piece is
figuring out which file to add new XML to and how to perform
any updates that span multiple files, but again ... you sound like
you're well aware of what you're in for.

HTH,
Christoph Schittko
MVP XML
http://weblogs.asp.net/cschittko

Greg - 28 Dec 2004 14:26 GMT
Chris, please see my responses inline.

> Greg,
>
[quoted text clipped - 4 lines]
> they are not really built for concurrent access by a bigger number of
> users simultaneously.

MSDE is a good alternative but there are definitely some costs associated
with running it (at least that is what my manager will tell me).   MSDE is
vulnerable to many of the same exploits that SQL Server is, so that means it
will have to be updated periodically.  With my particular application,
that's probably the only real maintenance cost that would need to be
considered since I will be reloading the entire data set frequently.
However, I will definitely have to think about it as an alternative.  It
would be interesting to estimate what it would take to do an MSDE
solution vs. an XML solution.  Even if it were cheaper to initially develop,
I think I could be challenged with the "what about maintenance and security"
concerns.  Concurrency is definitely not an issue because it is only a
single-user application.  The only technical issue would be if there is a
limit to how much data you can store in MSDE, and I don't believe there is one.

> You sound like you know what you're in for with not using a database in
> terms of concurrency management, access control, indexing across the
> individual chunks, transactional integrity, etc, i.e. all those reasons
> why databases are popular ;).

Transactional integrity and indexing are other good points.  With spanning
multiple files, I'll probably need to be able to roll back changes if an
update on one of them fails.  That may mean having to create new files, then
deleting the old ones when they are all successful.   I'm not that concerned
about indexing since most of the searching I'm doing will be on just about
any field.  XPath seems to do a pretty good job since most everything is
loaded in memory (at least for the file I'm searching...)
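
The "create new files, then delete the old ones" idea can be sketched as a two-phase update. This is in Python rather than .NET, and the function name and the path-to-content mapping are hypothetical. Every new file is staged first, and the originals are replaced only after all writes have succeeded:

```python
import os
import tempfile

def update_files(updates):
    """Stage every new file first; swap them in only if all writes succeed.

    `updates` maps a target path to its new text content.
    """
    staged = []
    try:
        for path, content in updates.items():
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".",
                                       suffix=".tmp")
            staged.append((tmp, path))
            with os.fdopen(fd, "w", encoding="utf-8") as f:
                f.write(content)
    except Exception:
        # A write failed: discard the staged copies, leave the originals alone.
        for tmp, _path in staged:
            os.remove(tmp)
        raise
    for tmp, path in staged:
        os.replace(tmp, path)  # atomic for each individual file
```

Each replacement is atomic per file, but the swap phase as a whole is not: a crash partway through the final loop could still leave some files old and some new, so a real implementation would need a recovery step for that case.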

> If you determined that it's still more economical to build that
> functionality then that's hard to argue with. The trickiest piece is
> figuring out which file to add new XML to and how to perform
> any updates that span multiple files, but again ... you sound like
> you're well aware of what you're in for.

I won't actually need to add new XML; I'll just need to update certain
records in my particular case.  That definitely simplifies things.
Regardless, I think I'm going to take a look at what it may take to do an
MSDE solution. Thanks for the suggestion.

Greg

Mujtaba Syed - 27 Dec 2004 21:45 GMT
Dare Obasanjo wrote this article about efficient ways to handle (read,
update) large XML files:

http://msdn.microsoft.com/webservices/building/xmldevelopment/api/default.aspx?pull=/library/en-us/dnxmlnet/html/largexml.asp


Mujtaba.

Greg - 28 Dec 2004 14:08 GMT
Mujtaba, thanks for the link to the article.  Those are some interesting
ideas!

Greg

