Tim_S Posted May 14, 2015
I've got an issue that I'm hoping someone has seen before. Any thoughts would be appreciated.
I have three applications, all written in LabVIEW 2012 SP1 running on Windows 7. The first is a server application. The other two are clients to the server. TCP is used to talk between the client and server applications. Both clients use a common communication library for the server application. All applications are on the same computer. The network address is "localhost".
The first client, let us call it Client A, has no issues. Messages sent and received are 'immediate'. This is an application that has been about for a couple of years.
The second client, let us call it Client B, is new. Most messages sent and received are immediate. One message takes 3-5 seconds longer to get a response. I went back into the server application and timed how long it takes to perform the actions from receipt of message to response; this is <30 msec. Wireshark confirms the response to the message from Client B to the server takes 3-5 seconds to occur, with no TCP transmit errors. I've disabled Nagling on both ends, which cut about a second, but that still leaves 2-4 seconds when there should be milliseconds.
Going back to my development PC, I set up a small test VI to just perform the messages (no overhead of other code). I see the same behavior over the network to the server application as with Client B.
The messages are mostly small (less than 100 bytes). Similar-sized messages/responses to the one that is problematic with Client B do not show the same delay issue.
I'm at the end of a project when we're looking at cycle time for production machinery. This message is sent 10 times during the test, so I've got 30-50 seconds of cycle time I'm trying to kill with just this item alone. I will find a work-around if I can't figure it out quickly, but I would rather solve the problem on this job as it will come back on the next.
This is a bit large to post. I've not been able to devote time to shrink it down to an example.
Darin Posted May 14, 2015
Here is an interaction I have had on more than a few occasions:
A: I have a random TCP problem.
Me: Sounds like a Nagle's Algorithm issue.
A: I'll try that... Nope, disabling Nagle did not help.
Me: Sounds like a Nagle's Algorithm issue.
A: I tried again, does not look like a problem with Nagle.
Me: That's odd. Let me know what you find out...
<Time passes>
A: Turns out I screwed up, it was actually a Nagle's Algorithm issue.
What I am saying is that this sounds like a textbook case of Nagle's algorithm. Until I was really, really sure, I would not look for something besides Nagle to explain the issue; I would look for the reason why it is not being disabled everywhere like you think it should be. One time I even started appending random garbage to the end of every message that was not of a given length.
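Checking that Nagle is really off "everywhere" can be sketched in a text language. The snippet below is a minimal Python illustration, not the LabVIEW code from this thread: it assumes the standard `TCP_NODELAY` socket option is what the Nagle VIs toggle underneath, and the helper name `disable_nagle` is made up for the example. The point it shows is that the option is per socket, so the client socket and the server's accepted connection each need it, and that reading the option back confirms it took effect.

```python
import socket

def disable_nagle(sock: socket.socket) -> bool:
    """Set TCP_NODELAY and report whether the option actually took effect."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Read the option back instead of trusting that the call worked.
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0

# Loopback demo (hypothetical setup): one listener, one client.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
listener.listen(1)
client = socket.create_connection(listener.getsockname())
accepted, _ = listener.accept()

# The option has to be set on BOTH ends of the connection.
client_ok = disable_nagle(client)
server_ok = disable_nagle(accepted)
print("client:", client_ok, "server:", server_ok)

for s in (client, accepted, listener):
    s.close()
```

If either read-back came out false, that end would still be coalescing small writes, which is the "not disabled everywhere" case described above.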
ned Posted May 14, 2015
In my experience, the Nagle algorithm isn't as problematic as everyone makes it out to be. Also, 2-4 seconds is longer than I'd expect due to Nagle-related delay.
My first guess is you have a TCP Read somewhere that's expecting slightly more data than it actually receives, so it waits the full timeout period. What TCP Read mode are you using? Let's say the client is using CRLF mode, but the server doesn't append the end-of-line character to the response: TCP Read will wait the full timeout period, and still return a valid response (assuming you don't check for that CRLF).
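The failure mode described here can be reproduced outside LabVIEW. The following is a rough Python sketch under stated assumptions: `read_line` is a hypothetical stand-in for a CRLF-mode read, the 0.2 s timeout is an arbitrary demo value, and a local socket pair plays both client and server. When the terminator is missing, the read returns the right bytes but only after the whole timeout has elapsed.

```python
import socket
import time

def read_line(sock, timeout_s):
    """Read until CRLF is seen or timeout_s elapses.

    Returns (data, timed_out). On timeout, whatever arrived is still returned,
    which is why a missing terminator can go unnoticed.
    """
    deadline = time.monotonic() + timeout_s
    buf = b""
    while b"\r\n" not in buf:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return buf, True
        sock.settimeout(remaining)
        try:
            chunk = sock.recv(4096)
        except socket.timeout:
            return buf, True
        if not chunk:          # peer closed the connection
            break
        buf += chunk
    return buf, False

reader, writer = socket.socketpair()

# Case 1: the responder forgets the terminator.
writer.sendall(b"OK")
start = time.monotonic()
data, timed_out = read_line(reader, 0.2)
elapsed = time.monotonic() - start
print(data, timed_out, round(elapsed, 2))

# Case 2: the responder terminates the line properly.
writer.sendall(b"OK\r\n")
data2, timed_out2 = read_line(reader, 0.2)
print(data2, timed_out2)

reader.close()
writer.close()
```

Checking the `timed_out` flag (the equivalent of the timeout error on the read) rather than only the returned data is what exposes the problem.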
chris754 Posted May 14, 2015
Yes, I have had similar issues before and resolved them using the "Immediate" mode of TCP Read. Of course, this might cause you to have to change the way you wait for and read messages.
ShaunR Posted May 14, 2015 (Edited)
Here is a conversation I've had on more than a few occasions:
A: I have a random TCP problem.
Me: What are the symptoms?
A: Message delays.
Me: 250 ms?
A: No. [INSERT SECONDS HERE] seconds. Do you think it is the Nagle algorithm?
Me: No. It's more than 250 ms. A read is timing out.
<Time passes>
A: Turns out I screwed up, I was retrying/ignoring after a read error.
Edited May 14, 2015 by ShaunR
Tim_S (Author) Posted May 15, 2015 (Edited)
I really do appreciate the responses.
I've run afoul of Nagle before. This doesn't feel like that's the issue, but I tried disabling it on one end, the other end, and both ends (yeah, I'm not 100% sure how Nagle gets implemented). Like I said, I did see an improvement, but not the whole solution.
There's no loop in the client; it's open connection, send message, wait for response, close connection. I'm sending binary data in some commands and responses, so I'm prepending the message length. Each side reads the length first. This is akin to the Data Client and Data Server examples, but I'm sending both length and data in one TCP Write instead of two.
I went back and looked at how Immediate works. I remember having issues with missing/incomplete messages when I tried using it in spots during initial server-side development, but how I was trying to use it escapes me.
I put together a quick summary. I didn't put in the Nagle VIs, which I currently have on the server side only.
[Image: Quick summary of client.vi]
[Image: Quick summary of server.vi]
Edited May 15, 2015 by Tim_S
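The length-prefixed scheme described in this post can be sketched in Python. This is an illustration rather than the actual client.vi/server.vi logic: the 4-byte big-endian length header is an assumption (the post does not state the header size or byte order), and `send_message`, `recv_exact`, and `recv_message` are hypothetical helper names. It shows the length and data going out in a single write, and the reader asking for exactly the advertised number of bytes, so it never waits for data that will not come.

```python
import socket
import struct

def send_message(sock, payload: bytes) -> None:
    """Send length header and payload in ONE write."""
    # Assumed header: 4-byte unsigned big-endian length.
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def recv_exact(sock, n: int) -> bytes:
    """Read exactly n bytes, looping because recv may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before the full message arrived")
        buf += chunk
    return bytes(buf)

def recv_message(sock) -> bytes:
    """Read the length header first, then exactly that many payload bytes."""
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)

# Demo over a local socket pair; the payload is binary and even contains CRLF.
left, right = socket.socketpair()
request = b"\x00\x01\r\nbinary payload"
send_message(left, request)
received = recv_message(right)
print(received == request)

left.close()
right.close()
```

If the advertised length were ever larger than what the sender actually writes, the second read would sit until its timeout, which matches the symptom of a single message type being slow by a fixed number of seconds.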
ShaunR Posted May 15, 2015 (Edited)
Increase your 100 ms timeouts (and your 10 ms listen) to 1 sec.
Also, try the Transport examples and see if you get the same problem.
Edited May 15, 2015 by ShaunR
ned Posted May 15, 2015
While this is unlikely to be the problem, is there a difference in the network connections in how A and B are connected to the server? Is one directly on the same switch, and the other further away?
Tim_S (Author) Posted May 16, 2015
All of the applications are on the same PC, so no network switch.
Extended the timeouts on the server. Not sure what I was thinking by setting those so short.
Having some trouble getting the transport.lvlib installer to work. I'll have to poke at that to see what's up while on the long haul out of India.
ShaunR Posted May 16, 2015 (Edited)
If you tell me what the troubles are (support thread), I will probably be able to figure it out for you.
Edited May 16, 2015 by ShaunR
Tim_S (Author) Posted May 18, 2015
It installed the second time I tried. Not sure what happened the first time.
Tim_S (Author) Posted May 21, 2015
I think I have this fixed.
Tried a compiled version of the transport library; this worked without issue. I made the timeout changes, which did not seem to have an impact on performance. I then recompiled everything (the server-side code is used in multiple applications on the system); the delay I was seeing with the one message/response went to expected amounts in tests.
I've been waiting to test this with the whole system up and going; unfortunately, we've been battling drive issues that are stopping everything else.
Can't definitively say it's fixed. Can't point to a smoking gun. I'm appreciating this forum and the people on it right now.