.NET Forum / Languages / C# / October 2007
reading xml into database
|
|
Thread rating:  |
mark_drewersback@yahoo.co.uk - 16 Oct 2007 10:14 GMT I've an XML document which is in the region of 20meg which comes in weekly, the contents of the document need to be read and some fields tweaked and then the content imported into a SQL Server database. Anybody got any recommendations on the best way of doing this? Any online articles on the best approach would be really useful.
MTIA
Marc Gravell - 16 Oct 2007 10:35 GMT What database version? In 2005, I wouldn't rule out (without testing) using the xml data-type; unlike XmlDocument, this shreds the file into a relational database model, meaning you can query and modify portions of it (with xpath etc) without having to load the entirety into memory. You can also query it quiet efficiently if your next step is to extract it to a relational set of tables.
The tricky bit would be getting the data over to the database; perhaps SqlBulkCopy would be the best option. The other question is: are there lots of first-level elements (directly under the document-element)? If so, I do something similar, but I treat each child of the document element as a new row in to database, and I have a cunning custom IDataReader implementation (since SqlBulkCopy can consume IDataReader very efficiently) that wraps an XmlReader for this purpose (i.e. IDataReader.Read() advances XmlReader to the next sibling, getting the entire inner[or outer, can't remember] xml in the process). This allows me to upload my uber-xml in a number of chunks, say 200 at a time. Makes it maneagable, and keeps the size of each record in the database acceptable. Let me know if you want more detail, or perhaps the custom IDataReader implementation.
I haven't read any performance details for the performance of very large xml fragments as a single row - but I have a setup that would allow me to investigate if you really want.
Marc
mark_drewersback@yahoo.co.uk - 16 Oct 2007 11:46 GMT > What database version? In 2005, I wouldn't rule out (without testing) > using the xml data-type; unlike XmlDocument, this shreds the file into [quoted text clipped - 22 lines] > > Marc Hi Marc
Thanks for the reply. The database would be 2005.
The xml is records for offices so I'm probably only looking at one table for the time being.
xml format is something like the following:-
<office> <OfficeName>officename</OfficeName> <Address> <Street>street</Street> <Town>town</Town> <PostCode>postcode</PostCode> </Address> <field1>xxxx</field1> <field10>xxxx</field10> </office>
There may be in the region of 5000 offices which are likely to need updated regularly (don't ask ;). At the time of reading the XML I need to do some manipulation on the content in fields 1 - 10 before it goes to the database. I've looked into sqlBulkCopy but thought there was no way to manipulate the data being copied? Is this incorrect?
Thanks
Marc Gravell - 16 Oct 2007 12:47 GMT > The xml is records for offices so I'm probably only looking at one > table for the time being. I agree; one table, but that isn't what I am suggesting.
Suppose your xml for 5000 offices is:
<xml><office ... >...</office><!-- lots and lots in here --><office ...>...</office></xml>
What I am saying is that rather than dealing with this as a single bulk, you bulk copy into a table where each row is a different office - rather than a single CLOB of all the offices in one row. The advantage of this is that each record describes a single entity, and is easy to manage.
> I've looked into sqlBulkCopy but thought there > was no way to manipulate the data being copied? If you are the data-source, you can do whatever you like. I've checked my code, and my current IDataReader implementation contains too much project-specific junk to be easily postable - so I've tried to illustrate from your xml taking your requirement to edit the xml on the fly into account (although you'd normally start the XmlReader from a Stream; the StringReader is just for quick demonstration); the only bit you'd really need to change is the XmlDataReader.DoRead() method; the point being that you can hand this IDataReader to SqlBulkCopy, and it will use it as the data source. Note that at any point in time we only have the data for a single office in memory.
The SqlBulkCopy class can be used to copy this into a staging table (the ctor sets it up for a single column, but that can be changed); once at the database, you can use all the SQL2005 methods to move that into a relational model, or just leave as xml. Note that you can do all sorts of fun things like persisted calculated columns on xml data, so it isn't always (depending on scenario) necessary to make it relational at all.
The other approach you can take is to use the same SqlBulkCopy model, but to forget about xml at the database, and simple use XmlReader / XmlDocument to parse each office into columns (using SetValues() inside DoRead() to update the current row each time). This only works if the schema stays simple; if you start getting multi-valued, it gets very tricky to do this efficiently. My assumption (when working with xml) is always worst-case (i.e. assume the data might get more complex), which is why I have suggested working with xml first, and suggested parsing to columns second. But either would work.
Marc
(long code post)
using System; using System.Data; using System.Xml; using System.IO; using System.Diagnostics;
static class Program { const string xml = @"<offices> <office> <OfficeName>officename1</OfficeName> <Address> <Street>street</Street> <Town>town</Town> <PostCode>postcode</PostCode> </Address> <field1>xxxx</field1> <field10>xxxx</field10> </office><office> <OfficeName>officename2</OfficeName> <Address> <Street>street</Street> <Town>town</Town> <PostCode>postcode</PostCode> </Address> <field1>xxxx</field1> <field10>xxxx</field10> </office><office> <OfficeName>officename3</OfficeName> <Address> <Street>street</Street> <Town>town</Town> <PostCode>postcode</PostCode> </Address> <field1>xxxx</field1> <field10>xxxx</field10> </office><!-- comment test--><office> <OfficeName>officename4</OfficeName> <Address> <Street>street</Street> <Town>town</Town> <PostCode>postcode</PostCode> </Address> <field1>xxxx</field1> <field10>xxxx</field10> </office></offices>"; static void Main() { using (IDataReader reader = GetReader()) { while (reader.Read()) { Trace.WriteLine(reader.GetValue(0)); } } } static IDataReader GetReader() { return new XmlDataReader(XmlReader.Create(new StringReader(xml))); } }
class XmlDataReader : SimpleDataReader { private XmlReader reader; readonly int targetDepth; public XmlDataReader(XmlReader reader) : base( new string[] {"Xml"}, new Type[] {typeof(string)}) {
if (reader == null) throw new ArgumentNullException("reader"); this.reader = reader; reader.MoveToContent(); targetDepth = reader.Depth + 1;
} protected override void DoClose() { if(reader!=null) reader.Close(); reader = null; } protected override bool DoRead() { while (reader.Read() && reader.Depth == targetDepth) { if (reader.NodeType == XmlNodeType.Element) { string xml; using (XmlReader subReader = reader.ReadSubtree()) { XmlDocument doc = new XmlDocument(); doc.Load(subReader); doc.DocumentElement.SelectSingleNode("field1").InnerText += " *"; xml = doc.OuterXml; } SetValues(xml); return true; } } return false; }
}
public abstract class SimpleDataReader : IDataReader { readonly string[] columnNames; readonly Type[] columnTypes; readonly object[] values;
protected SimpleDataReader(string[] names, Type[] types) { if (names == null) throw new ArgumentNullException("names"); if (types == null) throw new ArgumentNullException("types"); if (names.Length != types.Length) throw new ArgumentException("Names / types lengths must match");
columnNames = (string[])names.Clone(); columnTypes = (Type[])types.Clone(); values = new object[names.Length]; } protected void SetValues(params object[] values) { if (values == null) throw new ArgumentNullException("values"); if (values.Length != this.values.Length) throw new ArgumentException("Values length is invalid"); values.CopyTo(this.values, 0); } private bool isClosed = false; void IDisposable.Dispose() { Close(); } abstract protected void DoClose(); public void Close() { try { DoClose(); } finally { isClosed = true; } } public bool IsClosed { get { return isClosed; } } public int RecordsAffected { get { return -1; } }
#region IDataReader Members
public int Depth { get { return 0; } }
public DataTable GetSchemaTable() { return null; }
public bool NextResult() { Close(); return false; } public bool Read() { return DoRead(); } protected abstract bool DoRead();
#endregion
#region IDataRecord Members
public int FieldCount { get { return columnNames.Length; } }
public bool GetBoolean(int i) { return Convert.ToBoolean(GetValue(i)); }
public byte GetByte(int i) { return Convert.ToByte(GetValue(i)); }
public long GetBytes(int i, long fieldOffset, byte[] buffer, int bufferoffset, int length) { byte[] data = (byte[])GetValue(i); int max = data.Length + 1 - (int)fieldOffset; if (length > max) length = max; if (length > 0) { Buffer.BlockCopy(data, (int)fieldOffset, buffer, bufferoffset, length); return length; } return 0; }
public char GetChar(int i) { return Convert.ToChar(GetValue(i)); }
public long GetChars(int i, long fieldoffset, char[] buffer, int bufferoffset, int length) { string data = GetString(i); int max = data.Length + 1 - (int)fieldoffset; if (length > max) length = max; if (length > 0) { data.CopyTo((int)fieldoffset, buffer, bufferoffset, length); return length; } return 0; }
public IDataReader GetData(int i) { throw new NotSupportedException(); }
public string GetDataTypeName(int i) { return GetFieldType(i).Name; }
public DateTime GetDateTime(int i) { return Convert.ToDateTime(GetValue(i)); }
public decimal GetDecimal(int i) { return Convert.ToDecimal(GetValue(i)); }
public double GetDouble(int i) { return Convert.ToDouble(GetValue(i)); }
public Type GetFieldType(int i) { return columnTypes[i]; }
public float GetFloat(int i) { return Convert.ToSingle(GetValue(i)); }
public Guid GetGuid(int i) { object obj = GetValue(i); byte[] binary = obj as byte[]; if (binary != null) { return new Guid(binary); } else { return new Guid(Convert.ToString(obj)); } }
public short GetInt16(int i) { return Convert.ToInt16(GetValue(i)); }
public int GetInt32(int i) { return Convert.ToInt32(GetValue(i)); }
public long GetInt64(int i) { return Convert.ToInt64(GetValue(i)); }
public string GetName(int i) { return columnNames[i]; }
public int GetOrdinal(string name) { for (int i = 0; i < columnNames.Length; i++) { if (StringComparer.InvariantCultureIgnoreCase.Equals(name, columnNames[i])) return i; } return -1; }
public string GetString(int i) { return Convert.ToString(GetValue(i)); }
public object GetValue(int i) { return values[i]; }
public int GetValues(object[] values) { if (values == null) throw new ArgumentNullException("values"); int fields = values.Length < FieldCount ? values.Length : FieldCount; Array.Copy(this.values, values, fields); return fields; }
public bool IsDBNull(int i) { object obj = GetValue(i); return obj == null || obj is DBNull; }
public object this[string name] { get { return this[GetOrdinal(name)]; } }
public object this[int i] { get { return GetValue(i); } }
#endregion }
Marc Gravell - 16 Oct 2007 13:00 GMT > The xml is records for offices so I'm probably only looking at one > table for the time being. I agree; one table, but that isn't what I am suggesting.
Suppose your xml for 5000 offices is:
<xml><office ... >...</office><!-- lots and lots in here --><office ...>...</office></xml>
What I am saying is that rather than dealing with this as a single bulk, you bulk copy into a table where each row is a different office - rather than a single CLOB of all the offices in one row. The advantage of this is that each record describes a single entity, and is easy to manage.
> I've looked into sqlBulkCopy but thought there > was no way to manipulate the data being copied? If you are the data-source, you can do whatever you like. I've checked my code, and my current IDataReader implementation contains too much project-specific junk to be easily postable - so I've tried to illustrate from your xml taking your requirement to edit the xml on the fly into account (although you'd normally start the XmlReader from a Stream; the StringReader is just for quick demonstration); the only bit you'd really need to change is the XmlDataReader.DoRead() method; the point being that you can hand this IDataReader to SqlBulkCopy, and it will use it as the data source. Note that at any point in time we only have the data for a single office in memory.
The SqlBulkCopy class can be used to copy this into a staging table (the ctor sets it up for a single column, but that can be changed); once at the database, you can use all the SQL2005 methods to move that into a relational model, or just leave as xml. Note that you can do all sorts of fun things like persisted calculated columns on xml data, so it isn't always (depending on scenario) necessary to make it relational at all.
The other approach you can take is to use the same SqlBulkCopy model, but to forget about xml at the database, and simple use XmlReader / XmlDocument to parse each office into columns (using SetValues() inside DoRead() to update the current row each time). This only works if the schema stays simple; if you start getting multi-valued, it gets very tricky to do this efficiently. My assumption (when working with xml) is always worst-case (i.e. assume the data might get more complex), which is why I have suggested working with xml first, and suggested parsing to columns second. But either would work.
Marc
(long code post)
using System; using System.Data; using System.Xml; using System.IO; using System.Diagnostics;
static class Program { const string xml = @"<offices> <office> <OfficeName>officename1</OfficeName> <Address> <Street>street</Street> <Town>town</Town> <PostCode>postcode</PostCode> </Address> <field1>xxxx</field1> <field10>xxxx</field10> </office><office> <OfficeName>officename2</OfficeName> <Address> <Street>street</Street> <Town>town</Town> <PostCode>postcode</PostCode> </Address> <field1>xxxx</field1> <field10>xxxx</field10> </office><office> <OfficeName>officename3</OfficeName> <Address> <Street>street</Street> <Town>town</Town> <PostCode>postcode</PostCode> </Address> <field1>xxxx</field1> <field10>xxxx</field10> </office><!-- comment test--><office> <OfficeName>officename4</OfficeName> <Address> <Street>street</Street> <Town>town</Town> <PostCode>postcode</PostCode> </Address> <field1>xxxx</field1> <field10>xxxx</field10> </office></offices>"; static void Main() { using (IDataReader reader = GetReader()) { while (reader.Read()) { Trace.WriteLine(reader.GetValue(0)); } } } static IDataReader GetReader() { return new XmlDataReader(XmlReader.Create(new StringReader(xml))); } }
class XmlDataReader : SimpleDataReader { private XmlReader reader; readonly int targetDepth; public XmlDataReader(XmlReader reader) : base( new string[] {"Xml"}, new Type[] {typeof(string)}) {
if (reader == null) throw new ArgumentNullException("reader"); this.reader = reader; reader.MoveToContent(); targetDepth = reader.Depth + 1;
} protected override void DoClose() { if(reader!=null) reader.Close(); reader = null; } protected override bool DoRead() { while (reader.Read() && reader.Depth == targetDepth) { if (reader.NodeType == XmlNodeType.Element) { string xml; using (XmlReader subReader = reader.ReadSubtree()) { XmlDocument doc = new XmlDocument(); doc.Load(subReader); doc.DocumentElement.SelectSingleNode("field1").InnerText += " *"; xml = doc.OuterXml; } SetValues(xml); return true; } } return false; }
}
public abstract class SimpleDataReader : IDataReader { readonly string[] columnNames; readonly Type[] columnTypes; readonly object[] values;
protected SimpleDataReader(string[] names, Type[] types) { if (names == null) throw new ArgumentNullException("names"); if (types == null) throw new ArgumentNullException("types"); if (names.Length != types.Length) throw new ArgumentException("Names / types lengths must match");
columnNames = (string[])names.Clone(); columnTypes = (Type[])types.Clone(); values = new object[names.Length]; } protected void SetValues(params object[] values) { if (values == null) throw new ArgumentNullException("values"); if (values.Length != this.values.Length) throw new ArgumentException("Values length is invalid"); values.CopyTo(this.values, 0); } private bool isClosed = false; void IDisposable.Dispose() { Close(); } abstract protected void DoClose(); public void Close() { try { DoClose(); } finally { isClosed = true; } } public bool IsClosed { get { return isClosed; } } public int RecordsAffected { get { return -1; } }
#region IDataReader Members
public int Depth { get { return 0; } }
public DataTable GetSchemaTable() { return null; }
public bool NextResult() { Close(); return false; } public bool Read() { return DoRead(); } protected abstract bool DoRead();
#endregion
#region IDataRecord Members
public int FieldCount { get { return columnNames.Length; } }
public bool GetBoolean(int i) { return Convert.ToBoolean(GetValue(i)); }
public byte GetByte(int i) { return Convert.ToByte(GetValue(i)); }
public long GetBytes(int i, long fieldOffset, byte[] buffer, int bufferoffset, int length) { byte[] data = (byte[])GetValue(i); int max = data.Length + 1 - (int)fieldOffset; if (length > max) length = max; if (length > 0) { Buffer.BlockCopy(data, (int)fieldOffset, buffer, bufferoffset, length); return length; } return 0; }
public char GetChar(int i) { return Convert.ToChar(GetValue(i)); }
public long GetChars(int i, long fieldoffset, char[] buffer, int bufferoffset, int length) { string data = GetString(i); int max = data.Length + 1 - (int)fieldoffset; if (length > max) length = max; if (length > 0) { data.CopyTo((int)fieldoffset, buffer, bufferoffset, length); return length; } return 0; }
public IDataReader GetData(int i) { throw new NotSupportedException(); }
public string GetDataTypeName(int i) { return GetFieldType(i).Name; }
public DateTime GetDateTime(int i) { return Convert.ToDateTime(GetValue(i)); }
public decimal GetDecimal(int i) { return Convert.ToDecimal(GetValue(i)); }
public double GetDouble(int i) { return Convert.ToDouble(GetValue(i)); }
public Type GetFieldType(int i) { return columnTypes[i]; }
public float GetFloat(int i) { return Convert.ToSingle(GetValue(i)); }
public Guid GetGuid(int i) { object obj = GetValue(i); byte[] binary = obj as byte[]; if (binary != null) { return new Guid(binary); } else { return new Guid(Convert.ToString(obj)); } }
public short GetInt16(int i) { return Convert.ToInt16(GetValue(i)); }
public int GetInt32(int i) { return Convert.ToInt32(GetValue(i)); }
public long GetInt64(int i) { return Convert.ToInt64(GetValue(i)); }
public string GetName(int i) { return columnNames[i]; }
public int GetOrdinal(string name) { for (int i = 0; i < columnNames.Length; i++) { if (StringComparer.InvariantCultureIgnoreCase.Equals(name, columnNames[i])) return i; } return -1; }
public string GetString(int i) { return Convert.ToString(GetValue(i)); }
public object GetValue(int i) { return values[i]; }
public int GetValues(object[] values) { if (values == null) throw new ArgumentNullException("values"); int fields = values.Length < FieldCount ? values.Length : FieldCount; Array.Copy(this.values, values, fields); return fields; }
public bool IsDBNull(int i) { object obj = GetValue(i); return obj == null || obj is DBNull; }
public object this[string name] { get { return this[GetOrdinal(name)]; } }
public object this[int i] { get { return GetValue(i); } }
#endregion }
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|