Tim_S Posted September 9, 2014

Some of my code is giving me behavior I'm not understanding. I've been talking with NI tech support, but I'm trying to better understand what's going on for a project down the road that is going to tax what TCP can transmit.

I have a PC directly connected to a cRIO-9075 with a cable (no switch involved). I've put together a little test application that, on the RT side, creates an array of 6,000 U32 integers, waits for a connection, and then starts transmitting the length of the array (in bytes) and the array itself over TCP every 100 msec. The length and data are a single TCP write. On the PC side, I have a TCP open, a read of four bytes to get the length of the data, then a read of the data itself. The second TCP read does not occur if there are any errors with the first TCP read. Both reads have a 100 msec timeout.

The error I'm getting is a sporadic timeout (error 56) at the second TCP read on the PC side. This causes the next read of the data length to pull from my data, so I get invalid data from there on out. The error occurs anywhere from seconds to hours after the start of transmission. As a sanity check, I did some math on how long it should take to transmit the data. Ignoring the overhead of TCP communication, it should take ~2 msec for the write to transmit.

A workaround seems to be an infinite timeout (value of -1) for the second TCP read. I'm rather leery of having an infinite (or very long) timeout on the second read. Tech support was able to get this working with 250 msec on the second read.

Test VIs uploaded... Test Stream Data.zip
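For readers not familiar with the length-prefixed pattern being described, here is a minimal sketch in Python rather than LabVIEW (all names are hypothetical; big-endian byte order is assumed, matching LabVIEW's default flattening). It shows the sender packing the length and payload into one buffer for a single write, and the reader doing the two-read sequence:

```python
import socket
import struct

def read_exact(sock: socket.socket, n: int) -> bytes:
    """Loop until exactly n bytes arrive; a TCP recv may return fewer."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

def read_message(sock: socket.socket) -> bytes:
    """Read a 4-byte big-endian length prefix, then that many payload bytes."""
    (length,) = struct.unpack(">I", read_exact(sock, 4))
    return read_exact(sock, length)

def make_message(values: list) -> bytes:
    """Pack the length (in bytes) and the U32 array into one buffer so a
    single send covers both, mirroring the single TCP Write on the RT side."""
    payload = struct.pack(f">{len(values)}I", *values)
    return struct.pack(">I", len(payload)) + payload
```

In this sketch `read_exact` plays the role of LabVIEW's buffered read mode: it either returns all the requested bytes or raises.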
Phillip Brooks Posted September 10, 2014

You might try turning off Nagling.

http://forums.ni.com/t5/LabVIEW/quot-TCP-Write-quot-Timeout-error-56-seems-to-do-not-work/m-p/2123100#M689051
http://digital.ni.com/public.nsf/allkb/7EFCA5D83B59DFDC86256D60007F5839
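The linked NI documents cover the LabVIEW-specific way to do this; for reference, here is what disabling Nagle's algorithm looks like at the plain socket level in Python (the address and port are hypothetical placeholders):

```python
import socket

# Disable Nagle's algorithm so small writes (like a 4-byte length prefix)
# are transmitted immediately instead of being coalesced with later data.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
# sock.connect(("192.168.1.100", 6340))  # hypothetical cRIO address/port
```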
Rolf Kalbermatter Posted September 10, 2014

[quoting Tim_S's original post]

You use buffered mode. This mode only returns:

1) once the timeout has expired, without retrieving any data
2) whenever all the requested bytes have arrived

It seems your communication somehow loses bytes somewhere. With buffered mode you return with a timeout and the data stays there; it then gets read and interpreted as another length code, sending you even further out of sync.
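The desync Rolf describes can be illustrated with a short simulation (Python, hypothetical 8-byte messages): after the payload read times out, the unconsumed payload bytes are still queued, so the loop's next 4-byte length read pulls payload data and interprets it as a length.

```python
import struct

def u32_at(buf: bytes, offset: int) -> int:
    """Read a big-endian U32 from the byte stream at the given offset."""
    return struct.unpack_from(">I", buf, offset)[0]

# Simulated receive queue: [len=8][8 payload bytes][len=8][8 payload bytes]
payload = bytes(range(8))
stream = struct.pack(">I", 8) + payload + struct.pack(">I", 8) + payload

offset = 0
length = u32_at(stream, offset)        # first length read: 8 -- correct
offset += 4
# Suppose the payload read now times out (error 56).  In buffered mode
# nothing is consumed on timeout, so the payload bytes remain queued.
# The loop's next "length" read therefore consumes payload bytes:
bogus_length = u32_at(stream, offset)  # 0x00010203 = 66051, not 8
```

From that point on every subsequent read is misaligned, which matches Tim's observation that the only recourse is to stop and restart from a known point.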
Tim_S Posted September 10, 2014

[quoting Phillip Brooks's suggestion to turn off Nagling]

I've just tried that with no improvement. I went back through the emails with tech support and see that they had tried that as well.

[quoting Rolf Kalbermatter's explanation of buffered mode]

Correct. Once I'm out of sync, the only recourse with this method is to stop everything and restart from a known point. What's confusing about the loss of bytes is that the length and data are a single write, so they should show up at "the same time". They will be multiple packets, but there should not be collisions or packet loss.
Rolf Kalbermatter Posted September 10, 2014

[quoting Tim_S's previous reply]

Well, the next step in debugging this would be to enable another mode on the read side and see what data actually arrives, then compare that data with what you think you sent on the cRIO side.
Tim_S Posted September 10, 2014

[quoting Rolf Kalbermatter's suggestion]

Switched to Standard mode to read whatever is available at timeout. Of what does get received, the data is correct. I've Wiresharked the connection. This is looking like something lower-level, as there are responses to the TCP packets of "Reassembly error, protocol TCP: New fragment overlaps old data (retransmission?)". I'm now getting errors (codes 56 and 66) on the RT side where I wasn't before.

A coworker dropped off a PLC I'm supposed to talk EtherNet/IP to. I was able to run EtherNet/IP for 5+ minutes without any errors using the same cable and PC.

For grins, I changed from a 6,000-element array to 50 (which should fit within one packet). The errors in the test routine went away. I'm still seeing reassembly errors in Wireshark, but those appear to be for the RT front panel (in development mode). Creeping the array size back up, the issue returns as soon as the message spans two packets.
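The one-packet/two-packet boundary Tim observes can be checked with quick arithmetic, assuming a standard 1500-byte Ethernet MTU and no TCP options (a sketch; actual segment sizes depend on the negotiated MSS):

```python
import math

MTU = 1500                    # standard Ethernet MTU, in bytes
TCP_IP_OVERHEAD = 40          # 20-byte IPv4 header + 20-byte TCP header
MSS = MTU - TCP_IP_OVERHEAD   # max TCP payload per segment: 1460 bytes

def segments_needed(n_u32: int) -> int:
    """TCP segments needed for a 4-byte length prefix plus n_u32 U32 values."""
    message_bytes = 4 + 4 * n_u32
    return math.ceil(message_bytes / MSS)

print(segments_needed(50))    # 204 bytes  -> 1 segment
print(segments_needed(6000))  # 24004 bytes -> 17 segments

# Sanity check on the ~2 msec transmit estimate from the first post:
# 24004 bytes at 100 Mbit/s, ignoring all protocol overhead.
print(24004 * 8 / 100e6 * 1000)  # ~1.9 msec
```

So the failing case is a 17-segment message, while the working 50-element case fits in a single segment, consistent with the reassembly errors seen in Wireshark.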
Testy Posted September 11, 2014

I've had this issue. What worked for me was to continuously read data and buffer it, only timing out if no new data arrived.
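Testy's attachment isn't shown here, but the approach described — accumulate whatever bytes arrive, extract complete length-prefixed messages, and treat only a total absence of new data as a fault — might look like this in a hypothetical Python sketch:

```python
import struct

class MessageBuffer:
    """Accumulate raw TCP bytes and yield complete length-prefixed messages.

    Partial data simply waits in the buffer until the rest arrives, so a
    delayed packet never throws the reader out of sync with the stream.
    """

    def __init__(self):
        self._buf = bytearray()

    def feed(self, data: bytes) -> list:
        """Append newly received bytes; return any complete messages."""
        self._buf.extend(data)
        messages = []
        while len(self._buf) >= 4:
            (length,) = struct.unpack_from(">I", self._buf, 0)
            if len(self._buf) < 4 + length:
                break  # message incomplete; wait for more bytes
            messages.append(bytes(self._buf[4:4 + length]))
            del self._buf[:4 + length]
        return messages
```

The surrounding read loop would call recv with a short timeout and feed whatever arrives into the buffer; only several consecutive empty reads past some inactivity limit would be escalated to an error, rather than any single slow packet.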
Tim_S Posted September 22, 2014

Update on this... The issue only appears on my development PC, in both the development environment and the run-time. I was able to set up executables to test non-development PC-to-PC and PC-to-cRIO for extended periods; these tests ran overnight without issue. I'm still working with technical support on this.
Rolf Kalbermatter Posted September 22, 2014

[quoting Tim_S's update]

If it seems limited to your PC, then the most likely suspect would be the network card and corresponding driver in that PC. It wouldn't be the first time that network card drivers do buggy things.

Maybe it has to do with jumbo frame handling; try to see if you can disable that in the driver configuration. As far as I know, cRIO doesn't support jumbo frames at all, so there shouldn't be any jumbo frames transmitted, but it could be that enabled jumbo frame handling in the driver tries to be too smart and reassembles multiple TCP packets into such frames before passing them to the Windows socket interface.
Tim_S Posted September 22, 2014

[quoting Rolf Kalbermatter on the network card and driver]

Amen.

[quoting Rolf Kalbermatter on jumbo frame handling]

I checked the driver on my development PC and verified that Jumbo Packet is disabled. I haven't found much on jumbo frames with cRIO, but what I have found indicates the feature is only available with Windows, not RT. I'm starting to think the standard corporate IT load and policies are causing some instability in long-term TCP communication. Unfortunately, I don't think I can prove that or change it.
Tim_S Posted October 8, 2014

Update: Reversing the roles (Windows PC as server, cRIO as client) ran overnight without issues. Both NI tech support and I were able to reproduce the behavior with the shipping examples "Simple TCP Client" and "Simple TCP Server". NI tested using a dual-processor cRIO, which did not show the behavior. The next test I've been asked to try is to monitor CPU usage on the cRIO; the hypothesis is that CPU usage momentarily spikes very high, causing the TCP communication to be delayed. We haven't yet thought of what could be causing the CPU usage to spike.
Tim_S Posted November 17, 2014

I do have a resolution, but not an explanation. Sending of commands/responses is still through TCP; however, the fast data transfer has been switched to a network stream. I was able to transfer 600,000 U32 values per second overnight without issue (better performance than I need). Network streams are built on TCP, so this has NI tech support and myself scratching our heads as to why one works well and the other does not.