
TCP Listener: can it fail?



Has anyone ever encountered a case where a TCP Listener is created and used, but at some point becomes invalid, throwing an error at the "TCP Wait on Listener" node?  I'm trying to understand a rare issue on a deployed system that might be explained by that.

11 hours ago, ShaunR said:

What's the error?

Sadly, I don't know, as such an error would get lost in the code as written.  This is a rare error in deployed code on a low-powered single-board computer.  It appears to be a loss of TCP connection, followed by the Client not being able to reconnect (which is why I suspect the Listener dying).  An added clue is that an additional third-party non-LabVIEW TCP server seems to fail and restart itself at the same time (according to entries in its log file).

1 hour ago, drjdpowell said:

Sadly, I don't know, as such an error would get lost in the code as written.  This is a rare error in deployed code on a low-powered single-board computer.  It appears to be a loss of TCP connection, followed by the Client not being able to reconnect (which is why I suspect the Listener dying).  An added clue is that an additional third-party non-LabVIEW TCP server seems to fail and restart itself at the same time (according to entries in its log file).

If the socket library or one of its TCP/IP provider sub components resets itself, for whatever reason, it is definitely possible that a listener could report an error. This could happen because the library detected an unrecoverable error (TCP/IP is considered such an essential service on modern platforms that a simple crash is absolutely not acceptable whenever it can be avoided somehow) or even when you or some system component reconfigures the TCP/IP configuration somehow.

My TCP/IP listeners are actually a loop that sits there and waits on incoming connections as long as the wait returns only a timeout error. Any other error will close the listener refnum and loop back to the Create Listener before going again into the Wait Listener state. The Wait on Listener doesn't return an error cluster just to report that there is no new connection yet (timeout error 56) but effectively can return other errors from the socket library, even though that is rare. 

In case of any other errors than timeout, I immediately close the refnum, do a short delay to not let the loop monopolize the thread if the socket library should have another condition than a temporary hiccup, and then go back to Create Listener state until that succeeds. It's a fairly simple state machine but essential to continuous TCP/IP operation.

Technical details: the Wait on Listener basically does a select() (or possibly poll()) on the underlying listener socket, and this is the function that can fail if the socket library gets into a hiccup.

Edited by Rolf Kalbermatter

Without knowing what the problem (error) is with the underlying socket, you are on a hiding to nothing. The fact it is rare and affects other software points to a hardware or OS issue, not your code (except that your code doesn't report an error). That doesn't mean to say other software or the OS isn't responsible for crowbarring the socket underneath you but you need errors to figure out what is happening and what the state of the socket is when it fails. 
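As an illustration of capturing the socket state instead of losing it, here is a small POSIX C sketch that reads a socket's pending error with getsockopt(SO_ERROR) and writes one log line. The helper names socket_pending_error and log_socket_state are hypothetical, and the log format is only an assumption; in LabVIEW the equivalent would be logging the error cluster rather than dropping it.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Return the pending error on fd (0 if none). If the query itself fails,
 * return the errno of that failure (for example EBADF for a closed fd). */
static int socket_pending_error(int fd)
{
    int err = 0;
    socklen_t len = sizeof err;
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return errno;
    return err;
}

/* Write one log line describing the socket state and return the error code. */
static int log_socket_state(FILE *out, const char *where, int fd)
{
    int err = socket_pending_error(fd);
    fprintf(out, "%s: fd=%d error=%d (%s)\n",
            where, fd, err, err ? strerror(err) : "none");
    return err;
}
```

Calling something like this at the point where the wait fails would at least record which error code was seen and when.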

2 hours ago, Rolf Kalbermatter said:

Technical details: the Wait on Listener basically does a select() (or possibly poll()) on the underlying listener socket, and this is the function that can fail if the socket library gets into a hiccup.

Another technical detail. The LabVIEW Listener uses SO_EXCLUSIVEADDRUSE and assumes it has sole ownership of the socket. It then uses the "Internecine Avoider" to choose ports. If no net address is defined it binds to all network adapters.

Does the device have multiple network interfaces? What is the OS?

1 hour ago, ShaunR said:

Another technical detail. The LabVIEW Listener uses SO_EXCLUSIVEADDRUSE and assumes it has sole ownership of the socket. It then uses the "Internecine Avoider" to choose ports. If no net address is defined it binds to all network adapters.

Does the device have multiple network interfaces? What is the OS?

The part about the Internecine Avoider is only true if you use the high-level TCP Listen.vi. I usually use the low-level primitives Create Listener and Wait on Listener instead (and always close the refnum if I detect any error other than timeout). The SO_EXCLUSIVEADDRUSE flag is in principle a good thing: you do not usually want someone else to be able to capture your port number.

Edited by Rolf Kalbermatter
24 minutes ago, Rolf Kalbermatter said:

The SO_EXCLUSIVEADDRUSE is in principle a good thing, you do not usually want someone else to be able to capture your port number.

Indeed....but!

Quote

A socket with SO_EXCLUSIVEADDRUSE set cannot always be reused immediately after socket closure. ...

Even after the socket is closed, the system must send all of the buffered data, transmit a graceful disconnect to the peer, and wait for a graceful disconnect from the peer. It is therefore possible that the underlying transport may never release the connection,

We are looking at a 0.1% case so it's things like this we have to bear in mind.

24 minutes ago, Rolf Kalbermatter said:

The part about the Internecine Avoider is only true if you use the high-level TCP Listen.vi.

Error messages are not being reported, so my money is on it being used. You are probably in the 0.00001% of LabVIEW programmers who don't use it.

Edited by ShaunR
17 minutes ago, ShaunR said:

Indeed....but!

We are looking at a 0.1% case so it's things like this we have to bear in mind.

Yes, it takes some time after closing the refnum until the socket has gone through the entire RST, SYN, FIN handshaking cycle with its associated timeouts. And that is true even if nobody has been connecting to the listener at that point to request a new connection. So with the SO_EXCLUSIVEADDRUSE flag you can end up with the listener failing multiple times to create a new socket on the specified port. The alternative of not using exclusive mode is, however, in my opinion not really a good option.

And the Internecine Avoider actually is a potential culprit in the problem the OP observed. It doesn't really close the socket but rather tries to reuse it. The internal check of whether the refnum is valid does not really check that the socket has not been in error, just that LabVIEW still has a valid refnum. The socket this refnum refers to may still be in an unrecoverable error state and keep failing. To recover from an (admittedly rarely occurring) socket library error on the listener socket, the socket needs to be closed. And that means that a socket that has been opened with SO_EXCLUSIVEADDRUSE may actually be blocked from being reopened for up to a minute or more. But trying to reuse the failed socket is even worse, as that will never recover.

If Wait on Listener fails with any other error than a timeout error, you should close the listener refnum and try to reopen it until it succeeds or the user exits the application/operation.
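The "reopen until it succeeds" step could look roughly like the following POSIX C sketch. It is an assumption-laden analogy, not LabVIEW code: create_listener and create_listener_retry are hypothetical names, the listener binds to 127.0.0.1, and SO_REUSEADDR is deliberately left unset so that a port still held by another socket makes the attempt fail, loosely mirroring the exclusive-use behaviour discussed above.

```c
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Try once to create a listener on 127.0.0.1:port.
 * Returns the socket fd, or -1 with errno set. */
static int create_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 || listen(fd, 8) < 0) {
        int saved = errno;   /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Retry until a listener is created, max_attempts is used up, or *quit is set.
 * Sleeps delay_ms after each failed attempt so a persistent failure cannot
 * monopolize the thread. Returns the fd, or -1 if no attempt succeeded. */
static int create_listener_retry(uint16_t port, int max_attempts,
                                 long delay_ms, const volatile int *quit)
{
    for (int attempt = 0; attempt < max_attempts && !*quit; attempt++) {
        int fd = create_listener(port);
        if (fd >= 0)
            return fd;
        struct timespec ts = { .tv_sec = delay_ms / 1000,
                               .tv_nsec = (delay_ms % 1000) * 1000000L };
        nanosleep(&ts, NULL);
    }
    return -1;
}
```

In a long-running server, max_attempts would typically be very large (or the loop unbounded), with each failure logged so the cause is visible afterwards.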

Edited by Rolf Kalbermatter
45 minutes ago, drjdpowell said:

More details: I'm not using the "Internecine Avoider" and I'm using a "net address" of 127.0.0.1, which I believe means I'm not going through any network card (all three apps are on the same computer).

The Internecine Avoider should only come into play when you use the high-level TCP Listen.vi. If you use the Create Listener and Wait on Listener primitives directly, there is no Internecine Avoider unless you add it yourself. But Wait on Listener CAN return errors other than timeout, and that usually means something got seriously messed up with the underlying listener socket; the most prudent action is almost always to close that socket and open a new one. Except, of course, that when you close a listener socket it doesn't just go out of existence in a blink. It usually stays present in the underlying socket library for a certain timeout period to catch potential late-arriving connection requests and respond to them with a RST/NACK response to let the remote side know that it is not valid anymore.

And together with the SO_EXCLUSIVEADDRUSE flag this makes new requests to create a socket on the same port fail with a corresponding error, since the port is technically still in use by that half-dead socket. That socket eventually gets deleted, and then a new Create Listener call on that port will succeed, unless someone else was able to grab it first.

And even if you stay entirely within the same system and there is no actual network card packet driver involved, the socket library can still reset itself, for instance when a system service or the user changes the network configuration.

But if your code doesn't do something like this:

do
{
    err = CreateListener(&listenRefnum);
    if (!err)
    {
        do
        {
            // returns a timeout error while no client has connected yet
            err = WaitOnListener(listenRefnum, waitInterval, &connectionRefnum);
            if (!err)
            {
                CreateNewConnectionHandler(connectionRefnum);
            }
            else if (err != timeout)
            {
                // if we have any other error than timeout, leave the loop
                // which will close the listener and go back to create a new one
                LogError(err);
                break;
            }
        } while (!quit);
        Close(listenRefnum);
        if (err && err != timeout)
        {
            // short pause so a persistent failure cannot spin the loop
            Delay(someWaitTime);
        }
    }
    else
    {
        LogError(err);
        Delay(someWaitTime);
    }
} while (!quit);

it will keep trying to listen on a socket that might long since have gone into an error condition.

Edited by Rolf Kalbermatter
4 hours ago, Rolf Kalbermatter said:

And together with the SO_EXCLUSIVEADDRUSE flag this makes new requests to create a socket on the same port fail with a corresponding error, since the port is technically still in use by that half-dead socket. That socket eventually gets deleted, and then a new Create Listener call on that port will succeed, unless someone else was able to grab it first.

Are Listener ports affected by this "half-dead" issue?  I would have thought this is just an issue of TCP Connections (with a connected remote party) rather than a Listener.

43 minutes ago, drjdpowell said:

Are Listener ports affected by this "half-dead" issue?  I would have thought this is just an issue of TCP Connections (with a connected remote party) rather than a Listener.

According to this https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/ it would seem that listener sockets are not supposed to linger around. Still you should probably be prepared that a Create Listener right after a Close Listener can fail just to be on the safe side!

