> Tom
>
[quoted text clipped - 7 lines]
> in "one big chunk". I don't know if this is correct or not, but it certainly
> reduced the time.
Unfortunately, std::map doesn't sit in memory in one large chunk - there
is one chunk for each entry in the map, so there is no way to write out
the map without iterating over the entries.
> Now your suggestion goes reading one record at a time...
> mind me, your suggested code does work and takes only a few seconds to read
> the data. Still, I wonder if those few seconds could still be somehow reduced
> say from 8 to 4... I know I'm being ambitious, but I'd like to optimize this
> part of the program as much as possible. If not, I'll be happy with this
> solution.
I'm sure it is possible to reduce the time further. One approach is to
remove the calls to "read" and "write" and replace them with calls like
this:
FileOut.rdbuf()->sputn(same params as for write);
FileIn.rdbuf()->sgetn(same params as for read);
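sputn/sgetn can be quite a bit faster than write/read, because they go
straight to the stream buffer and skip the per-call checks that the
stream itself performs. The price is that they don't set the stream's
error flags, so you have to check their return values yourself.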
Another approach is to take the map and transfer its contents to a
vector, which can be written out in one chunk. I've posted two different
approaches, one legal but a bit slower, the other illegal, but likely to
work on most platforms:
#include <algorithm>
#include <iostream>
#include <iterator>
#include <map>
#include <utility>
#include <vector>

typedef std::map<int, double> IMAP;

// Plain struct mirroring one map entry; it is a POD type,
// so its bytes can legally be written and read directly.
struct IMAP_POD
{
    int key;
    double value;
};

// Converts in both directions between map entries and IMAP_POD.
struct IMAPConverter
{
    IMAP_POD operator()(IMAP::const_reference val) const
    {
        IMAP_POD p = {val.first, val.second};
        return p;
    }
    std::pair<int, double> operator()(IMAP_POD const& val) const
    {
        return std::pair<int, double>(val.key, val.value);
    }
};

void writeIMAP(IMAP const& m, std::ostream& os)
{
    std::vector<IMAP_POD> v(m.size());
    std::transform(m.begin(), m.end(), v.begin(), IMAPConverter());
    // write the size:
    std::vector<IMAP_POD>::size_type size = v.size();
    os.write(reinterpret_cast<char const*>(&size), sizeof size);
    // write the map as a single vector
    // (v[0] must not be touched if the vector is empty):
    if (!v.empty())
        os.write(reinterpret_cast<char const*>(&v[0]), v.size() * sizeof v[0]);
}

void readIMAP(IMAP& m, std::istream& is)
{
    std::vector<IMAP_POD>::size_type size = 0;
    // read the size:
    is.read(reinterpret_cast<char*>(&size), sizeof size);
    if (!is || size == 0)
        return; // nothing to read, or the read failed
    std::vector<IMAP_POD> v(size);
    // read the map as a single vector:
    is.read(reinterpret_cast<char*>(&v[0]), v.size() * sizeof v[0]);
    std::vector<std::pair<int, double> > typedV;
    typedV.reserve(size);
    std::transform(v.begin(), v.end(), std::back_inserter(typedV), IMAPConverter());
    // range insert for a sorted range
    // is much faster than inserting one by one
    m.insert(typedV.begin(), typedV.end());
}
Illegal approach:
#include <iostream>
#include <map>
#include <utility>
#include <vector>

typedef std::map<int, double> IMAP;

void writeIMAP(IMAP const& m, std::ostream& os)
{
    typedef std::pair<int, double> non_const_value_type;
    // copy the entries into contiguous storage
    std::vector<non_const_value_type> v;
    v.reserve(m.size());
    v.insert(v.begin(), m.begin(), m.end());
    // write the size:
    std::vector<non_const_value_type>::size_type size = v.size();
    os.write(reinterpret_cast<char const*>(&size), sizeof size);
    // write the map as a single vector
    // (v[0] must not be touched if the vector is empty):
    if (!v.empty())
        os.write(reinterpret_cast<char const*>(&v[0]), v.size() * sizeof v[0]);
}

void readIMAP(IMAP& m, std::istream& is)
{
    typedef std::pair<int, double> non_const_value_type;
    std::vector<non_const_value_type>::size_type size = 0;
    // read the size:
    is.read(reinterpret_cast<char*>(&size), sizeof size);
    if (!is || size == 0)
        return; // nothing to read, or the read failed
    std::vector<non_const_value_type> v(size);
    // read the map as a single vector:
    is.read(reinterpret_cast<char*>(&v[0]), v.size() * sizeof v[0]);
    m.insert(v.begin(), v.end());
}
The reason the second approach is illegal is that you can only copy
bytes into and out of POD types, and std::pair<int, double> is not a POD
type (it has user-declared constructors). However, pair<int, double> is
close to being a POD type (it doesn't have any base classes or virtual
functions, and the destructor is basically a no-op), so the above is
very likely to work on most platforms.
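If your compiler supports C++11 (which is newer than this thread), you can ask it directly. The sketch below assumes the <type_traits> header is available; in C++11 terms the requirement for byte-wise copying is "trivially copyable" rather than POD. The answer for std::pair may differ between standard library implementations and versions, so no particular result is assumed here. pairIsTriviallyCopyable is a name invented for the example.

```cpp
#include <type_traits>
#include <utility>

struct IMAP_POD
{
    int key;
    double value;
};

// A plain struct of an int and a double is trivially copyable,
// so this assertion always compiles.
static_assert(std::is_trivially_copyable<IMAP_POD>::value,
              "IMAP_POD must be safe to copy as raw bytes");

// Hypothetical helper: reports what this particular standard library
// says about the pair type used in the second approach.
bool pairIsTriviallyCopyable()
{
    return std::is_trivially_copyable<std::pair<int, double> >::value;
}
```

If pairIsTriviallyCopyable() returns false on your platform, the first approach with IMAP_POD is the one that stays within the rules.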
> The second question is related to your comment about portability. A file
> saved as binary with this code in Windows cannot be read in UNIX?
The problem here is the format used by the CPU and compiler to hold ints
and doubles, and the sizes of those types. For example, some CPUs use a
big endian 64-bit two's complement format for "int", while Windows (and
UNIX) compilers for x86 use a little endian 32-bit two's complement format.
Basically, the bits for a particular int value (such as 1234567) are
quite different on some platforms.
There's a bit more on it here:
http://www.eskimo.com/~scs/C-faq/q20.5.html
If you want portable binary, you need to decide on exactly the binary
format you want, and then make sure that your code writes out bytes in
the right format (byte-order swapping and padding as necessary).
> I thought
> binary files could be read anywhere... Can this problem be solved? For
> example, should I leave the data file as text (ASCII) and load it as binary
> in the same amount of time? Can then the same file be read both in Windows
> and UNIX?
It should be possible to optimize the code you use to read it as text so
that it operates much faster. If you need portability, this may be the
best option. If you want to do this, I'd suggest posting the code you
have (in a new thread) and asking for help in speeding it up.
Tom
knapak - 08 Jun 2005 20:04 GMT
Tom
Thanks again for your invaluable help. As for the alternative to write and
read, yes, it improved the loading time... by about 0.4 of a sec (4.2 to
3.8)... which to me is quite good. I have to admit that your methods were
completely unknown to me (remember I'm an amateur). I guess my only question
would be whether there is any room for problems in using reinterpret_cast.
As for the portability problem, you actually suggested exploring some
alternatives among standardized binary formats, including netCDF. Actually I've
tried using netCDF but didn't quite follow the procedures, and it is very
difficult to find people with the expertise to provide assistance. For now
I'll try to work with your solution, and eventually, when my files get bigger
and do require switching between Windows and UNIX, I'll come back and ask
directly if anyone knows how to work with netCDF.
I very much appreciate the time you took to help me.
Carlos