
Delayed TCP message



I've got an issue that I'm hoping someone has seen before. Any thoughts would be appreciated.

 

I have three applications, all written in LabVIEW 2012 SP1 running on Windows 7. The first is a server application; the other two are clients to it. The clients and server communicate over TCP, and both clients use a common communication library to talk to the server. All applications are on the same computer, so the network address is "localhost".

 

The first client, let us call it Client A, has no issues. Messages sent and received are 'immediate'. This is an application that has been around for a couple of years.

 

The second client, let us call it Client B, is new. Most messages sent and received are immediate. One message takes 3-5 seconds longer to get a response. I went back into the server application and timed how long it takes to perform the actions from receipt of message to response; this is <30 msec. Wireshark confirms the response to the message from Client B takes 3-5 seconds to occur, with no TCP transmit errors. I've disabled Nagle's algorithm on both ends, which shaved off about a second, but that still leaves 2-4 seconds of delay where there should be milliseconds.

 

Going back to my development PC, I set up a small test VI that just performs the messages (no overhead of other code). I see the same behavior over the network to the server application as with Client B.

 

The messages are mostly small (less than 100 bytes). Similar-sized messages/responses to the one that is problematic with Client B do not show the same delay issue.

 

I'm at the end of a project when we're looking at cycle time for production machinery. This message is sent 10 times during the test, so I've got 30-50 seconds of cycle time I'm trying to kill with just this item alone. I will find a work-around if I can't figure it out quickly, but I would rather solve the problem on this job as it will come back on the next.

 

This is a bit large to post. I've not been able to devote time to shrink it down to an example.


Here is an interaction I have had on more than a few occasions:

 

A: I have a random TCP problem

Me:  Sounds like a Nagle's Algorithm issue.

A:  I'll try that....Nope disabling Nagle did not help.

Me:  Sounds like a Nagle's Algorithm issue.

A:  I tried again, does not look like a problem with Nagle

Me:  That's odd.  Let me know what you find out....

<Time passes>

A:  Turns out I screwed up, it was actually a Nagle's Algorithm issue.

 

What I am saying is that this sounds like a textbook case of Nagle's algorithm.  Until I was really, really sure, I wouldn't look for something besides Nagle to explain the issue; I'd look for the reason it is not actually being disabled everywhere you think it should be.

 

One time I even started appending random garbage to the end of every message that was not of a given length.


In my experience, the Nagle algorithm isn't as problematic as everyone makes it out to be. Also, 2-4 seconds is longer than I'd expect from a Nagle-related delay. My first guess is you have a TCP read somewhere that's expecting slightly more data than it actually receives, so it waits the full timeout period. What TCP Read mode are you using? Say the client is using CRLF mode but the server doesn't append the end-of-line characters to the response: TCP Read will wait the full timeout period and still return a valid-looking response (assuming you don't check for that CRLF).



Here is a conversation I've had on more than a few occasions.

 

A: I have a random TCP problem

Me:  What are the symptoms?

A:  Message delays.

Me:  250 ms?

A:  No. [INSERT SECONDS HERE] seconds. Do you think it is the NAGLE algo?

Me:  No. It's more than 250 ms. A read is timing out.

<Time passes>

A:  Turns out I screwed up, I was retrying/ignoring after a read error.

 

:D

Edited by ShaunR

I really do appreciate the responses.

 

I've run afoul of Nagle before. This doesn't feel like that's the issue, but I tried disabling it on one end, the other end, and both ends (yeah, I'm not 100% on how Nagle gets implemented). Like I said, I did see an improvement, but not the whole solution.

 

There's no loop to the client; it's open connection, send message, wait for response, close connection. I'm sending binary data in some commands and responses, so I'm prepending the message length. Each side reads the length first. This is akin to the Data Client and Data Server examples, but I'm sending both length and data over one TCP write instead of two.
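That framing scheme can be sketched in Python as follows (the 4-byte big-endian length prefix is an assumption; the actual width and byte order may differ). Sending length and payload in one write also means Nagle can't hold a short length header back from its payload:

```python
import socket
import struct

def send_message(sock: socket.socket, payload: bytes) -> None:
    """Prepend a 4-byte big-endian length and send header+payload
    in a single write, mirroring the one-TCP-write approach above."""
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_message(sock: socket.socket) -> bytes:
    """Read the 4-byte length, then exactly that many payload bytes."""
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)

def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Loop until exactly n bytes have been read (recv may return less)."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf += chunk
    return buf
```

Because the receiver reads an exact byte count rather than waiting for a terminator, this framing doesn't suffer the missing-CRLF timeout failure, but it does time out if the length prefix overstates the payload size.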

 

I went back and looked at how Immediate works. I remember having issues with missing/incomplete messages when I tried using it in spots during initial server-side development, but how I was trying to use it escapes me.

 

Put together a quick summary (attached: Quick summary of client.vi and Quick summary of server.vi). Didn't put in the Nagle VIs, which I currently have on the server side only.

Edited by Tim_S


Increase your 100 ms timeouts (and your 10 ms listen) to 1 sec.

 

Also, try the Transport examples and see if you get the same problem.

Edited by ShaunR

While this is unlikely to be the problem, is there a difference in the network connections in how A and B are connected to the server? Is one directly on the same switch, and the other further away?



All of the applications are on the same PC, so no network switch.

 


Extended the timeouts on the server. Not sure what I was thinking by setting those so short.

 

Having some trouble getting the transport.lvlib installer to work. I'll have to poke at that to see what's up while on the long haul out of India.



 

If you tell me what the troubles are (support thread), I will probably be able to figure it out for you ;)

Edited by ShaunR

I think I have this fixed. Tried a compiled version of the transport library; this worked without issue. I made the timeout changes, which did not seem to have an impact on performance. I then recompiled everything (the server-side code is used in multiple applications on the system); the delay I was seeing with the one message/response went to expected amounts in tests. I've been waiting to test this with the whole system up and going; unfortunately, we've been battling drive issues that are stopping everything else. Can't definitively say it's fixed. Can't point to a smoking gun. :frusty:

 

I'm appreciating this forum and the people on it right now. :beer_mug:

