Stobber

TCP errors 56 and 66 on VxWorks sbRIO


I have a small LV code library that wraps the STM and VISA/TCP APIs to make a "network actor". My library dynamically launches a reentrant VI using the ACBR node that establishes a TCP connection as client or server, then uses the connection to poll for inbound messages and send outbound ones.

 

When I try to establish a connection using my Windows 7 PC as the client and my sbRIO (VxWorks) as the server, it connects and pushes 100 messages from the client if the server was listening first. If the client spins up first, sometimes it works and sometimes I get error 56 from the "TCP Open Connection" primitive repeatedly. Other times I've seen error 66.

 

When I try to make Windows act as the server, the sbRIO client connects without error, but it returns error 56 from "TCP Read" on the first call thereafter (inside "STM.lvlib:Read Meta Data (TCP Clst).vi"). This happens in whichever order I run them.

 

This test, and all others I've written for the API, works just fine when both client and server are on the local machine.

 

-------------------------------

 

I'm tearing my hair out trying to get a reliable connection-establishment routine going. What don't I know? What am I missing? Is there a "standard" way to code the server and client so they'll establish a connection reliably between an RT and a non-RT peer?

 


A listener has to be running for a client to connect via TCP/IP; otherwise you should get error 63 (connection refused). That's just the way TCP/IP works. It is up to the client to retry [open] connection attempts in the event of a connect failure, but the general assumption is that there is an existing listener (service) on the target port of the server that the client is attempting to connect to.

 

N.B.

Regardless of the purpose of the data to be sent or received: in TCP/IP, Client = Open and Server = Listen.
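That retry-until-the-listener-is-up pattern can be sketched outside LabVIEW. Here is a minimal Python illustration of a client that keeps retrying its open while the connection is refused; the function name connect_with_retry, the port number, and the deliberately late listener thread are all invented for the demo.

```python
import socket
import threading
import time

def connect_with_retry(host, port, attempts=20, delay=0.2):
    """Open a TCP connection as a client, retrying while the server's
    listener is not up yet (connection refused, i.e. LabVIEW error 63)."""
    for i in range(attempts):
        try:
            return socket.create_connection((host, port), timeout=2.0)
        except OSError:
            if i == attempts - 1:
                raise
            time.sleep(delay)

def late_listener(port):
    # Simulate a server whose listener starts *after* the client.
    time.sleep(0.5)
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    conn, _ = srv.accept()
    conn.close()
    srv.close()

t = threading.Thread(target=late_listener, args=(50961,), daemon=True)
t.start()
conn = connect_with_retry("127.0.0.1", 50961)   # retries until listener is up
conn.close()
t.join()
```

The client's first attempts fail fast with "connection refused"; once the listener binds, the next retry succeeds.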

Edited by ShaunR


Network streams?

I used them for two years, and they repeatedly failed on their promise. Lots of errors thrown by the API that the application had to be protected from (including error codes documented only for Shared Variables?!), lots of issues with reconnection from the same app instance after an unexpected disconnection, lots of issues with namespacing endpoints, element buffering, etc. I never kept detailed notes on all of it to share with others, but the decision to use raw TCP was actually a decision to get the hell off Network Streams.

 

A listener has to be running for a client to connect via TCP/IP; otherwise you should get error 63 (connection refused). That's just the way TCP/IP works. It is up to the client to retry [open] connection attempts in the event of a connect failure, but the general assumption is that there is an existing listener (service) on the target port of the server that the client is attempting to connect to.

 

N.B.

Regardless of the purpose of the data to be sent or received: in TCP/IP, Client = Open and Server = Listen.

 

Right, thanks. Glad to know that observation makes sense. Now to debug the part where a connection is closed on the VxWorks target between "Listen" and "Read" for no apparent reason...

Edited by Stobber


Error 56 - a timeout - isn't really an error, it just means there wasn't any data. The connection is still valid, and data could arrive later. In most of my TCP code I ignore error 56, especially if I'm polling multiple connections and expect that there won't be data on most of them most of the time. I bundle my TCP references with a timestamp indicating the last time that data was received on that connection, and if it's ever been longer than an hour since I last received data on that connection, then I close that connection. I've used this same approach with the STM (adapting existing code that used it) and it worked fine there too.
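A rough Python analogue of this polling pattern: treat a read timeout as "no data yet" (the moral equivalent of LabVIEW's error 56), remember when data last arrived on each connection, and only close the connection after a long idle period. The poll_once helper and the state dict are invented for illustration and are not part of STM.

```python
import socket
import time

IDLE_LIMIT = 3600.0   # close after an hour with no data, as in the post

def poll_once(state, nbytes=1, timeout=0.05):
    """Read up to nbytes from state['sock']. A timeout is not an error:
    it just means nothing arrived, and the connection stays open unless
    it has been idle longer than IDLE_LIMIT."""
    sock = state["sock"]
    sock.settimeout(timeout)
    try:
        data = sock.recv(nbytes)
    except socket.timeout:
        data = b""                         # no data this poll; not a fault
    if data:
        state["last_rx"] = time.monotonic()
    elif time.monotonic() - state["last_rx"] > IDLE_LIMIT:
        sock.close()                       # peer presumed dead
        state["sock"] = None
    return data

a, b = socket.socketpair()
state = {"sock": b, "last_rx": time.monotonic()}
first = poll_once(state)                   # nothing sent yet -> b""
a.sendall(b"x")
time.sleep(0.05)
second = poll_once(state)                  # -> b"x"
```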



I used them for two years, and they repeatedly failed on their promise. Lots of errors thrown by the API that the application had to be protected from (including error codes documented only for Shared Variables?!), lots of issues with reconnection from the same app instance after an unexpected disconnection, lots of issues with namespacing endpoints, element buffering, etc. I never kept detailed notes on all of it to share with others, but the decision to use raw TCP was actually a decision to get the hell off Network Streams.

 

Certainly the buffering has its oddities, but I have not come across the other issues you mention. I still think you are brave for going back to pure TCP/IP, though!


Error 56 - a timeout - isn't really an error, it just means there wasn't any data. The connection is still valid, and data could arrive later. In most of my TCP code I ignore error 56, especially if I'm polling multiple connections and expect that there won't be data on most of them most of the time. I bundle my TCP references with a timestamp indicating the last time that data was received on that connection, and if it's ever been longer than an hour since I last received data on that connection, then I close that connection. I've used this same approach with the STM (adapting existing code that used it) and it worked fine there too.

Timestamps are very useful. That's the reason I added one to Transport.lvlib, so that I could ascertain the clock skew between the sender and receiver.


Thank you all for the help. It turns out that some of the TCP Read calls inside STM.lvlib's VIs were causing error 56 or returning junk bytes when I set the timeout too low (e.g. 0 ms, in an attempt to create a fast poller that was throttled by a different timeout elsewhere in the loop). When I increased the timeouts on all TCP Read functions, my problems went away. Well, that's how it looks right now, anyway.

 

Incidentally, setting a timeout of 0 ms works fine if I'm asking client and server to talk over the same network interface on the same PC. That kind of makes sense.

 

Update: I'm now having serious problems with backlogged messages. I get messages out of the TCP Read buffer in bursts, and it seems the backlog grows constantly while running my app. This is breaking the heartbeat I'm supposed to send over the link, so each application thinks the other has gone dead after a second or two. Anybody know what might cause the TCP connection to lag so badly?

Edited by Stobber


Update: I'm now having serious problems with backlogged messages. I get messages out of the TCP Read buffer in bursts, and it seems the backlog grows constantly while running my app. This is breaking the heartbeat I'm supposed to send over the link, so each application thinks the other has gone dead after a second or two. Anybody know what might cause the TCP connection to lag so badly?

 

I have not messed about with this kind of thing in years, but perhaps it's Nagle's algorithm at play? You can disable it using Win32 calls, see here.
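For reference, the Win32 call in question sets the standard TCP_NODELAY socket option, which disables Nagle's algorithm so small writes (like a 250 ms heartbeat) go on the wire immediately instead of being coalesced. A minimal Python sketch of the same option, purely for illustration:

```python
import socket

# Create a TCP socket and turn off Nagle's algorithm, so small
# payloads are transmitted immediately rather than batched while
# the stack waits for outstanding ACKs.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
```

The trade-off is more, smaller packets on the wire, which is usually the right choice for low-latency control or heartbeat traffic.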

Edited by Neil Pate


Update: I'm now having serious problems with backlogged messages. I get messages out of the TCP Read buffer in bursts, and it seems the backlog grows constantly while running my app. This is breaking the heartbeat I'm supposed to send over the link, so each application thinks the other has gone dead after a second or two. Anybody know what might cause the TCP connection to lag so badly?

 

As a guess, I would say you probably have a short timeout and the library uses STANDARD mode when reading.

Do you get this behaviour when the read timeout is large, say, 25 seconds?

 

Most TCPIP libraries use a header that has a length value. This makes them application specific, but really easy to write. They rely on reading x bytes and then using that as a length input to read the rest of the message. That's fine and works in most cases as long as the first n bytes of a read cycle are guaranteed to be the length.

 

What happens if you have a very short timeout, though (like your 0 ms)? Now you are not guaranteed to read all x of the length bytes. You might get only one or two bytes of a 4-byte value and then time out. In STANDARD mode the bytes are returned even on a timeout, but they are gone from the input stream (destructive reads). So the next stage doesn't read N bytes as expected, and because you are now out of sync, the next time around you start halfway through the length bytes. This is why Transport.lvlib uses BUFFERED rather than STANDARD: buffered reads aren't destructive after a timeout, so you can circle back and re-read until you have all of the requested bytes.
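The BUFFERED idea can be sketched in Python: keep every received byte in a buffer of your own and only consume a frame once the length header *and* the full payload are present, so a short or partial read can never desynchronize the stream. The FrameBuffer class and the 4-byte big-endian header are assumptions for illustration; STM's actual header layout may differ.

```python
import struct

class FrameBuffer:
    """Accumulate raw TCP bytes and pop complete length-prefixed
    messages (4-byte big-endian length header). Incomplete data stays
    buffered, mimicking a non-destructive (BUFFERED) read."""
    def __init__(self):
        self.buf = b""

    def feed(self, data):
        self.buf += data

    def pop(self):
        if len(self.buf) < 4:
            return None                      # header not complete yet
        (length,) = struct.unpack(">I", self.buf[:4])
        if len(self.buf) < 4 + length:
            return None                      # payload not complete yet
        msg = self.buf[4:4 + length]
        self.buf = self.buf[4 + length:]     # consume exactly one frame
        return msg

fb = FrameBuffer()
frame = struct.pack(">I", 5) + b"hello"
fb.feed(frame[:3])          # a timed-out partial read: only 3 bytes arrived
early = fb.pop()            # None -- nothing consumed, stream stays in sync
fb.feed(frame[3:])          # the rest arrives on the next read
msg = fb.pop()              # b"hello"
```

With a destructive read, those first 3 bytes would already be gone and the next read would start mid-header, which is exactly the desync described above.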

Edited by ShaunR


As a guess, I would say you probably have a short timeout and the library uses STANDARD mode when reading.

Do you get this behaviour when the read timeout is large, say, 25 seconds?

Actually, I'm using a patched version of STM where I fixed that issue by changing all TCP reads to BUFFERED mode. :) So that's not a problem.

 

I do successfully read in a simple test VI with a large timeout, but debugging in my application, which requires a heartbeat in each direction every 250 ms, makes the whole thing lag and choke. I'm going to try to write a more complex test VI that does the heartbeating without any other app logic involved.

I have not messed about with this kind of thing in years, but perhaps it's Nagle's algorithm at play? You can disable it using Win32 calls, see here.

Let me look into that, too. Thanks!


Actually, I'm using a patched version of STM where I fixed that issue by changing all TCP reads to BUFFERED mode. :) So that's not a problem.

 

I do successfully read in a simple test VI with a large timeout, but debugging in my application, which requires a heartbeat in each direction every 250 ms, makes the whole thing lag and choke. I'm going to try to write a more complex test VI that does the heartbeating without any other app logic involved.

Let me look into that, too. Thanks!

250 ms is pushing it, as that is what the default Nagle delay is. That should not make messages back up, though; your heartbeat will just time out sometimes. If you have hacked the library to use BUFFERED instead of STANDARD, then it is probable that you are just not consuming the bytes, because a read no longer guarantees that they are removed. That will cause the Windows buffer to fill and eventually become unresponsive until you remove some bytes.

Edited by ShaunR


250 ms is pushing it, as that is what the default Nagle delay is. That should not make messages back up, though; your heartbeat will just time out sometimes.

That's definitely happening, and it's the first-tier cause of my headache. I need to get heartbeating working consistently again.

 

 

If you have hacked the library to use BUFFERED instead of STANDARD, then it is probable that you are just not consuming the bytes, because a read no longer guarantees that they are removed. That will cause the Windows buffer to fill and eventually become unresponsive until you remove some bytes.

Huh...that's good to know. Is there a way to check the buffer without popping from it? I could add a check-after-read while debugging to make sure I'm getting everything I think I'm getting.


That's definitely happening, and it's the first-tier cause of my headache. I need to get heartbeating working consistently again.

 

 

Huh...that's good to know. Is there a way to check the buffer without popping from it? I could add a check-after-read while debugging to make sure I'm getting everything I think I'm getting.

Well, Neil has given you the VI to turn the Nagle algorithm off, so that shouldn't be a problem.

 

There is no primitive among the standard TCP/IP VIs to find the number of bytes waiting. There is a Bytes At Port property for VISA, but people tend to prefer the simplicity of the TCP/IP VIs.


Another solution, although this may sound overly complicated, is to maintain a buffer of all the received bytes from a given connection yourself. Since I'm already bundling a timestamp with my TCP connection IDs, adding a string to that cluster isn't a big deal. Each time I read bytes from a connection, I append them to the existing buffer, then attempt to process them as a complete packet (or as multiple packets, in a loop). Any remaining bytes after that processing go back in the buffer.

 

There might be another approach to work around the Nagle algorithm. Try doing a read (even of 0 bytes) immediately following each write. I believe this will force the data to be sent, on the assumption that the read is waiting for a response from the just-sent packet. I'm not completely certain of this, but I think I tried it once and it worked.

 

For a heartbeat, if you're just checking if the remote system is running, it might be worth switching to UDP.
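A minimal sketch of that UDP idea in Python, using loopback and an invented 2-byte heartbeat payload. UDP is connectionless, so a lagging TCP stream can't delay the beat, and a lost datagram just means one missed beat rather than a backed-up queue.

```python
import socket

HEARTBEAT = b"HB"   # invented payload; anything small works

# Receiver: bind to an OS-chosen free port and wait briefly for a beat.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
rx.settimeout(0.5)
port = rx.getsockname()[1]

# Sender: fire one heartbeat datagram at the receiver.
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(HEARTBEAT, ("127.0.0.1", port))

data, addr = rx.recvfrom(16)    # receives b"HB" from the sender
tx.close()
rx.close()
```

In a real system the receiver would simply note the arrival time of each beat and declare the peer dead after a few missed intervals.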


It was Nagle.

 

Summary:

Connection issues between two machines because of extremely short timeouts on the TCP Open/Listen functions and the STM Metadata functions. Packet buffering issues because of the Nagle algorithm.

 

Thank you guys very very much for all the help!



