> _Usually_, if there's a broken connection, you'll see exceptions shortly
> after you start trying to send data, if not as soon as you try to send
> data. It does depend somewhat on how the connection is broken and where.
> Depending on the network configuration, it is theoretically possible to
> _never_ get an exception.
For whatever reason (based on my testing in our development environment), I
see about a minute go by before a SocketException gets thrown and the
disconnect is recognized.
> * You may want to put an upper bound on how many send operations you
> perform for any connection without getting a response, or at least a
[quoted text clipped - 6 lines]
> available memory before the network layer can detect the broken
> connection.
The associated error at the time of crash tends to be
EVENT_SRV_NO_NONPAGED_POOL in Event Viewer (aka: we ran out of nonpaged pool
memory).
Is setting the socket buffer to 0 a non-default setting? (Right now we do not
explicitly set it to that value.) If set that way, would it maybe use our
application memory (which obviously has a much larger pool to allocate from)
rather than nonpaged pool, and possibly give the socket enough time to
recognize the disconnect?
The socket traffic is XML messages (typically one-way to the client). They
range from a couple hundred bytes to several hundred KB for the largest. I
don't remember what the cap is on nonpaged memory, but we added counters to
compare the number of calls to begin send against the number of callbacks
received, and the difference quickly grew into the thousands during this
minute or so.
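One way to picture the cap on outstanding sends suggested in the quoted text: count each send when it starts, count it off again in the completion callback, and refuse to start new ones past a limit. This is only a minimal sketch, written in Python rather than .NET; `PendingSendLimiter`, `try_begin`, `complete`, and the limit value are hypothetical names chosen for illustration, not anything from the application or the Socket class.

```python
import threading


class PendingSendLimiter:
    """Tracks sends that were started but have not completed yet."""

    def __init__(self, max_pending):
        if max_pending < 1:
            raise ValueError("max_pending must be at least 1")
        self._max_pending = max_pending
        self._pending = 0
        # Completion callbacks may run on other threads, so guard the count.
        self._lock = threading.Lock()

    def try_begin(self):
        """Reserve a slot before starting a send; False means the cap is hit."""
        with self._lock:
            if self._pending >= self._max_pending:
                return False
            self._pending += 1
            return True

    def complete(self):
        """Release a slot; call this from the send-completion callback."""
        with self._lock:
            if self._pending == 0:
                raise RuntimeError("complete() called with no pending send")
            self._pending -= 1

    @property
    def pending(self):
        """Number of sends currently started but not yet completed."""
        with self._lock:
            return self._pending
```

When `try_begin()` returns False, the caller could drop the message, hold it in application memory, or treat the connection as dead; which of those is right depends on the application.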
> * You may have a bug where you don't handle an out-of-memory condition
> gracefully. Whether this is really a bug depends on your intended
[quoted text clipped - 15 lines]
> error. It seems like on a modern computer, you shouldn't be able to
> allocate memory fast enough to cause that to happen.
In our test setup it takes about a minute. However, our test setup also
doesn't crash. Watching nonpaged pool memory with perfmon, I do see a spike
begin after I yank the cord, but it does not crash. A minute goes by, a
"logout" (socket disconnection) gets logged by our application, and memory
falls back to normal. It seems one of two things is happening in the
production setup:
1) They have FAR more data flowing than we do in our test setup, causing the
spike to be of greater magnitude and big enough to crash the application
before the minute goes by (this is almost certainly true - they DO have more
data); and/or:
2) Their network configuration is not registering the disconnection for
longer than a minute - I am still working to verify exactly how long it took
the application to crash after uncompleted operations started stacking up.
- Adam
Peter Duniho - 23 May 2008 18:55 GMT
> For whatever reason (based on my testing in our development
> environment), I
> see about a minute's worth of time go by before a socketexception gets
> thrown and the disconnect recognized.
Yuck. For what it's worth, I've never seen disconnects take that long to
detect when actually sending data (obviously, they can take indefinitely
longer if you don't try to send anything :) ). I typically see the
disconnect within a second or two.
It might be worth trying to explore what makes it take so long. I have
little enough experience with the lowest levels of networking that I can't
suggest specifics in that regard.
The only higher-level thing that comes to mind is the possibility that
there's some thread hogging all the CPU time, which is limiting how
quickly your i/o thread(s) get to process things. In this latter
scenario, the network driver itself would be detecting the disconnect
almost immediately, but wouldn't get a chance to report it until much
later.
But it'd be hard for me to say for sure even with a code sample. Without
one, it's just pure speculation. That said, if you have any code that's
raising thread priorities, you might consider disabling it to see if that
helps (hopefully you don't...it's almost never the right thing to do :)
). And if you have a thread that is compute-intensive, you might consider
_lowering_ that thread's priority so that in times of high i/o load, it
doesn't get in the way.
> [...]
> The associated error at the time of crash tends to be
[quoted text clipped - 9 lines]
> time
> to recognize the disconnect?
Maybe. However, I'm not really sure why the non-paged pool is involved.
Typically, the network driver is going to have a fixed-size buffer. I
wouldn't expect it to try to expand that buffer or add new ones. Instead,
it will either reject an attempt to queue new data (non-blocking i/o) or
it will force the attempt to wait until there is space (blocking i/o).
AFAIK a 0-sized buffer for your socket is not the default, and it has the
effect of telling the driver to not buffer at all, but rather to use the
buffer you provide. This is common for IOCP implementations of sockets,
and since the async Socket API uses IOCP, it's something to try. The main
advantage is actually one of performance -- it avoids one copy of the data
-- but I suppose if there's something about the network layer where it's
trying to allocate non-paged memory as you queue data, telling it not to
buffer might improve things.
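For reference, a zero-sized send buffer is requested through the SO_SNDBUF socket option; in .NET the same option is exposed as the `Socket.SendBufferSize` property. The sketch below uses Python only to show the call, and `request_unbuffered_sends` is a hypothetical helper name. The no-buffering behavior described above is Winsock's; other platforms may clamp the value to a minimum or reject it outright, so the helper simply reports what the stack says afterwards.

```python
import socket


def request_unbuffered_sends(sock):
    """Request a zero-byte send buffer and return the size the stack reports.

    Returns None if the platform rejects the request.
    """
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 0)
    except OSError:
        # Assumption: some stacks refuse a zero-sized buffer entirely.
        return None
    # Read the option back; the stack may have substituted its own minimum.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
```

Reading the value back is worth doing in any language, since it is the only way to tell whether the request was actually honored.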
Again, I'm not actually clear myself why non-paged memory would be getting
allocated at this point. But then, that's as likely just a gap in my
knowledge as it is an indication that that's abnormal and/or unrelated to
your problem. :)
I apologize for the vagueness in my comments. The bulk of my socket
programming experience is with the unmanaged Winsock API. Inasmuch as the
.NET Socket class is built on that, my previous knowledge is applicable,
but there may be details specific to .NET that I'm unaware of.
Pete