cRIO complete lockup with no(?) memory leaks


Heiso


Warning: this is a very open ended question and I ramble because I'm at the point of tearing my hair out trying to fix this.

We have a remotely deployed cRIO system that requires nothing short of an Act of God to physically access. For the past couple of months we have been trying to troubleshoot a complete cRIO lockup that requires us to power cycle it via a web-based relay between the cRIO and its power supply. This has been happening every 2-5 days on average. The architecture itself is a rtexe that dynamically loads a number of PPLs specified in a config file via Start Asynchronous Call and I've been able to run a similar system on benchtop for weeks without issue.

Physical architecture is a cRIO-9042 with a GPS module, a couple of 9220s, and 2x FD-11601s daisy-chained, using DAQmx to acquire data from ~70 channels. Software-wise, this particular system launches six unique PPLs (one of them twice):

  1. performs the DAQmx acquisition and publishes its values via a User Event;
  2. registers for the DAQ data event, base64-encodes the data and does some other formatting on it to prepare it for network transmission, then sends it to a named queue read by the network transmission PPL;
  3. also registers for the DAQ data event, calculates the detection of transient events, and sends those results via the same named queue;
  4. performs the network transmission;
  5. reads weather station data via Modbus TCP and formats/sends it to the network transmission PPL via the named queue;
  6. & 7. the same FLIR camera PPL launched twice, each instance targeting a different networked camera; it parses the image and sends it to the network transmission PPL via the named queue.
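Since no code can be shared, here is a tiny hypothetical Python illustration of the encode step in PPL (2): pack a block of samples into bytes, then base64-encode them so they're text-safe for network transmission. The little-endian-double record format is an assumption for illustration only; the real module is a LabVIEW PPL, so this just shows the idea.

```python
import base64
import struct

def encode_samples(samples):
    """Pack a list of float samples as little-endian doubles, then
    base64-encode the raw bytes for text-safe transmission."""
    raw = struct.pack(f"<{len(samples)}d", *samples)
    return base64.b64encode(raw).decode("ascii")

def decode_samples(text):
    """Reverse of encode_samples: base64-decode, then unpack doubles."""
    raw = base64.b64decode(text)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))
```

Base64 inflates the payload by roughly a third, which is the usual trade-off for keeping the wire format printable.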

I've monitored memory usage with /proc/meminfo as well as monitoring the specific lvrt application via /proc/(lvrt_proc_id)/status. There seems to be no substantive change in memory over time (i.e., assumedly no leaks) and cRIO CPU usage varies between 10-20% per core, although average seems to be around 13% per core. I do see a very rapid increase in context switches, both voluntary and nonvoluntary, on the order of ~500 voluntary and ~5 nonvoluntary per second, but I have no frame of reference to determine if this is "high" or whether this is even a problem. My Linux knowledge is relatively limited so maybe there's another thing I should be monitoring of which I'm unaware.
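For anyone wanting to replicate this style of monitoring, here is a rough Python sketch of polling /proc as described above. The field names (MemAvailable, VmRSS, voluntary_ctxt_switches, nonvoluntary_ctxt_switches) are the standard Linux /proc ones; the polling loop, interval, and reporting format are assumptions, not the poster's actual tooling.

```python
import re
import time

def parse_proc_fields(text, keys):
    """Extract the first integer on each 'Key: value' line of /proc-style
    text, for the requested keys only."""
    out = {}
    for line in text.splitlines():
        for key in keys:
            if line.startswith(key + ":"):
                m = re.search(r"(\d+)", line)
                if m:
                    out[key] = int(m.group(1))
    return out

def sample(pid):
    """One snapshot of system memory plus per-process stats (Linux only)."""
    with open("/proc/meminfo") as f:
        mem = parse_proc_fields(f.read(), ["MemAvailable"])
    with open(f"/proc/{pid}/status") as f:
        proc = parse_proc_fields(
            f.read(),
            ["VmRSS", "voluntary_ctxt_switches", "nonvoluntary_ctxt_switches"])
    return {**mem, **proc}

def monitor(pid, interval_s=10.0):
    """Print MemAvailable/VmRSS and the voluntary context-switch *rate*;
    the absolute counters only mean something as a delta per interval."""
    prev = sample(pid)
    while True:
        time.sleep(interval_s)
        cur = sample(pid)
        rate = (cur["voluntary_ctxt_switches"]
                - prev["voluntary_ctxt_switches"]) / interval_s
        print(f"MemAvailable={cur['MemAvailable']} kB  "
              f"VmRSS={cur['VmRSS']} kB  vol_ctxt/s={rate:.0f}")
        prev = cur
```

Logging the deltas rather than the raw counters makes a slow upward creep (or a sudden burst just before a lockup) much easier to spot in hindsight.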

Does anybody have any anecdotes about situations where they've had cRIOs crash sporadically without seeing a memory leak and what they did to troubleshoot/debug it? The disappointing thing is the logs I pull from MAX don't even show that anything happened -- it's basically a complete system freeze (doesn't reply to pings from other computers either). I see what I believe to be the power cycle event in the logs once I force the relay on the supply to open/close, but even if I wait days between it going non-responsive and power cycling it I won't see anything representative of the lockup in the logs except the power cycle. We're going through the arduous process of trying to gain physical access to the cRIO so we can look at the status lights on the front and measure the supply voltage when it's frozen, but that'll probably be another couple of weeks before we get permission and I'd love to have a checklist of other potential things we can try in our limited window of time we have physical access.

Unfortunately I can't post any screenshots, let alone any code, due to the nature of the deployment which is why my question is so general and I'm just soliciting some stuff to throw at the wall hoping that it sticks. I've started logging max queue size for the data transmission queue to the cRIO HD, but it consistently remains low enough not to be a concern. We also don't see any delay in ingestion of the data by the database computer -- all of the received timestamps and data are updating completely as expected until the cRIO locks up.

Whether anyone replies to this or not, I feel better that I've ranted so thank you for the mental relief!


Disk usage is in good shape. We're only logging to disk when we have errors or when the running state/health of one of the modules changes so we're still sitting at over 70% available. I appreciate the thought though!

I was also looking at some previous threads about TCP ports being used up, but we only re-establish network connections when the link drops (on working systems this usually happens only once every week or two, depending on network quality at the site), and we log every time we call our open-connection function; there doesn't seem to be any correlation in this case from what I can tell.
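One cheap way to check the descriptor/port-leak theory from the Linux side is to count the open file descriptors (and the sockets among them) held by the lvrt process via /proc/<pid>/fd; a count that climbs steadily over days would point at connections not being closed. A hypothetical, Linux-only Python sketch:

```python
import os

def fd_counts(pid):
    """Return (total open fds, socket fds) for a process by walking
    /proc/<pid>/fd. Socket fds show up as symlinks like 'socket:[12345]'."""
    fd_dir = f"/proc/{pid}/fd"
    total = sockets = 0
    for fd in os.listdir(fd_dir):
        total += 1
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd closed between listdir() and readlink()
        if target.startswith("socket:"):
            sockets += 1
    return total, sockets
```

Sampled once a minute and logged alongside the memory stats, this gives a second leak axis that /proc/meminfo alone won't show.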


Disclaimer: this is some time ago so my memory is pretty foggy!

I had some very strange crashes on a bunch of cRIOs in a lab. Eventually, after much trouble, I diagnosed the problem as being related to setting the system time! I had a routine that tried to sync all the cRIO clocks, and every now and then the call to the System Configuration VIs would hard-crash a system. I never actually saw it happen with my own eyes, but I have enough evidence to be able to blame it on that. I cannot recall exactly how I solved the problem in the end; I suspect I slowed down the rate at which the clocks were set.

As I said, this was a long time ago though and the cRIOs were probably not running RT-linux.

You mentioned you have a GPS module, so maybe just maybe this is related?


I remember having one cRIO that crashed randomly; it turned out to be somehow related to reading an INI file. For some reason this INI was read from time to time in the app; when it was changed to be read only once at startup, the problem went away. When we got desperate we started adding syslog messages for all state machine states and logging them into a DB to see if it was related to some part of the code...


How realistic is the test setup that fails to reproduce the issue? Does it have the same potentially slow/unstable links and external devices? Does it run the exact same operations, etc.? Perhaps there's a DNS server that disappears from the network now and then, for example, or other connected equipment that fails (or that you can inject errors into) to expose potential issues? Is the test unit set up from an image of the troublesome unit?

More difficult obviously but can you alternatively downscale software/ turn off some of the tasks / remove parts of the code on the field unit to see if that removes the issue?

When the cRIO fails to respond, is it really dead locally as well, or is it still running and responding locally? We have had units unreachable through existing network links due to DNS and other issues, where we had to modify the code (e.g., remove any DNS lookups) to get them to respond properly...


Have you tried enabling and reading the debug output from the device?

We had some sbRIOs crash without much explanation due to a sudden spike in memory usage (inferred, not fully observed) caused by a calculation that was triggered. There was no obvious cause; the debug logs and memory monitoring could not really explain why - we just had to push down the memory footprint until it started working...

Have you tried formatting the cRIO in the field and setting it up again?

Would it be possible to replace the cRIO in the field with another one, then bring the field unit back and use it to try to recreate the issue? If the problem stops in the field with the new cRIO, it is most likely linked to that cRIO's hardware... or a system corruption on it. If it shows up with the new cRIO and not in the test setup, the test setup is obviously not realistic enough...

Edited by Mads
On 12/1/2021 at 2:25 PM, Neil Pate said:

Disclaimer: this is some time ago so my memory is pretty foggy!

I had some very strange crashes on a bunch of cRIOs in a lab. Eventually, after much trouble, I diagnosed the problem as being related to setting the system time! I had a routine that tried to sync all the cRIO clocks, and every now and then the call to the System Configuration VIs would hard-crash a system. I never actually saw it happen with my own eyes, but I have enough evidence to be able to blame it on that. I cannot recall exactly how I solved the problem in the end; I suspect I slowed down the rate at which the clocks were set.

As I said, this was a long time ago though and the cRIOs were probably not running RT-linux.

You mentioned you have a GPS module, so maybe just maybe this is related?

This looked incredibly promising at first for a couple of reasons.

  1. We do not have a GPS drop in the lab, so I had disabled the section with the FPGA call where we pull in the timestamp and then set the system clock in RT using the System Configuration Set Time VI -- but this section was still active in the field deployment.
  2. Because we're using DAQmx and TSN to synchronize measurements between the cRIO and FieldDAQs, a while ago we had noticed time drift in the system clock that seemed to correlate with TSN synchronization lock loss between the master cRIO and the FieldDAQ(s), and which was exacerbated if we used the Set Time VI. After NI graciously allowed us to chat with some members of the R&D team responsible for this, they suggested decoupling the system and hardware clocks and put out this KB. While this did seem to fix that problem, we ended up going a different route and timestamping our sample records with GPS manually on some of our older deployments.
  3. This deployment does use the suggested disabling of Linux OS time synchronization. My first test was to disable the Set Time vi, but the cRIO still crashed -- this time after about 28 hours. Yesterday morning I re-enabled the OS time synchronization but kept the Set Time vi disabled, so fingers crossed. I also just saw in the release notes that they supposedly fixed the time drift with the latest release of NI-Sync in late October so I'll upgrade and try again if the current run locks up again.
On 12/2/2021 at 1:20 AM, pawhan11 said:

I remember having one cRIO that crashed randomly; it turned out to be somehow related to reading an INI file. For some reason this INI was read from time to time in the app; when it was changed to be read only once at startup, the problem went away. When we got desperate we started adding syslog messages for all state machine states and logging them into a DB to see if it was related to some part of the code...

I'll definitely give this a shot the next time I rebuild the code. Each module reads from an INI file, but some of them read from the same INI. If a module fails for some reason, it relaunches itself and re-reads from that INI. Currently I don't have anything showing in the logs that a module failed/relaunched, but I'll dig through the code and make sure that I don't have some kind of race condition where multiple modules could fail and relaunch and try to access the config file simultaneously before the error messages could be logged by the error module.
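The "read once at startup" fix described above boils down to a cached, lock-protected config read, so that a relaunched module can never race another module on the file itself. A minimal sketch of the pattern in Python (names are illustrative; the real modules are LabVIEW PPLs):

```python
import configparser
import threading

# Shared cache: the INI is parsed at most once per process lifetime.
_config = None
_config_lock = threading.Lock()

def get_config(path):
    """Parse the INI file on first call only; later calls (including from a
    module that failed and relaunched itself) return the cached copy, so
    concurrent relaunches never touch the file simultaneously."""
    global _config
    with _config_lock:
        if _config is None:
            parser = configparser.ConfigParser()
            parser.read(path)
            _config = parser
        return _config
```

The lock makes the first read atomic with respect to other callers; every caller after that pays only the cost of a lock acquire and a cache hit.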

On 12/2/2021 at 5:21 AM, Mads said:

How realistic is the test setup that fails to reproduce the issue? Does it have the same potentially slow/unstable links and external devices? Does it run the exact same operations, etc.? Perhaps there's a DNS server that disappears from the network now and then, for example, or other connected equipment that fails (or that you can inject errors into) to expose potential issues? Is the test unit set up from an image of the troublesome unit?

Good questions. It's... pretty close... to what we have in the field. Both have the same cRIO hardware modules and they're both communicating to 2 FieldDAQs. In the lab I'm actually reading from all 8 channels per FieldDAQ, while in the deployment we're only using 6 from each, so we're actually doing more acquisition and calculations on the lab setup.

As for slow/unstable connections: yes, that is something we've struggled with for this deployment. Currently we send everything via a named queue into our communication PPL, which sends the data out over TCP back to the database ingestion computer. On a link failure we drop into a conditional FOR loop that attempts to re-establish comms, flushing the comms queue and dumping the data each time it fails, and repeating the connection attempt and flush until transmission successfully resumes. It seems to work pretty well, because our queue never reaches a size that would be concerning (i.e., I've seen it an order of magnitude larger on other systems before memory issues start to arise). I tried to include an image of this since it's not a sensitive module, but the forum isn't playing nice with my attempt to attach it to this post -- it keeps throwing an error.
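Since the image won't attach, the pattern can be approximated in Python (socket details, names, and the retry interval are assumptions for illustration; the real module is a LabVIEW PPL):

```python
import queue
import socket
import time

def drain(q):
    """Empty the queue without blocking; return how many items were dropped."""
    dropped = 0
    while True:
        try:
            q.get_nowait()
            dropped += 1
        except queue.Empty:
            return dropped

def transmit_loop(q, host, port, retry_s=5.0):
    """Send queued records over TCP. On any socket error, tear the
    connection down, dump the backlog so the queue cannot grow without
    bound, wait, and retry until transmission resumes."""
    sock = None
    while True:
        item = q.get()
        if item is None:        # sentinel: shut down cleanly
            return
        while True:
            try:
                if sock is None:
                    sock = socket.create_connection((host, port), timeout=10)
                sock.sendall(item)
                break
            except OSError:
                if sock is not None:
                    sock.close()
                    sock = None
                drain(q)        # dump the backlog rather than let it grow
                time.sleep(retry_s)
```

Dropping the backlog on failure trades data completeness for bounded memory, which matches the behavior described: the queue stays small even across long outages.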

It's not based on a replicated image, just a redeployment of the rtexe and then using scp to copy over the specific .ini and PPL files. Does anyone have a particular suggestion on what to use for duplicating the image and deploying it? I used RTAD about 7 years ago, but I was never clear as to whether that was the best approach or not.

On 12/2/2021 at 5:21 AM, Mads said:

More difficult obviously but can you alternatively downscale software/ turn off some of the tasks / remove parts of the code on the field unit to see if that removes the issue?

Since we're trying to access the device physically, field testing has been paused, so this would be a good opportunity to do this. I'll start removing software modules from the config and try to figure out whether there's one in particular that's causing issues.

On 12/2/2021 at 5:21 AM, Mads said:

When the cRIO fails to respond, is it really dead locally as well, or is it still running and responding locally? We have had units unreachable through existing network links due to DNS and other issues, where we had to modify the code (e.g., remove any DNS lookups) to get them to respond properly...

This is a question we hope to resolve by putting eyes on the status LED on the cRIO, but I'm 99.99% sure that it's locally dead as well. The reason I say this is we log connection failures to the hard drive and we see none of these events in the log. In fact, we see 0 events in the log from the time we lose connection to it until we power cycle it remotely, which is what makes me think the entire thing is completely frozen.

On 12/2/2021 at 5:21 AM, Mads said:

Have you tried enabling and reading the debug output from the device?

We had some sbRIOs crash without much explanation due to a sudden spike in memory usage (inferred, not fully observed) caused by a calculation that was triggered. There was no obvious cause; the debug logs and memory monitoring could not really explain why - we just had to push down the memory footprint until it started working...

Do you mean enabling the console out and plugging a monitor into it? No, but that's something I'll try whenever we can get out there.

Memory footprint seems fine with ~2.8GB available. It fluctuates slightly up and down, but over time there's no noticeable downward trend. When I start disabling modules this should reduce the memory footprint further, so we'll see how it goes.

On 12/2/2021 at 5:21 AM, Mads said:

Would it be possible to replace the cRIO in the field with another one, then bring the field unit back and use it to try to recreate the issue? If the problem stops in the field with the new cRIO, it is most likely linked to that cRIO's hardware... or a system corruption on it. If it shows up with the new cRIO and not in the test setup, the test setup is obviously not realistic enough...

I think I may end up doing this on principle. Given how hard it is to get access to the site it would be worth swapping units for the heck of it so we don't have to jump through the hoops in the future if we've exhausted all other debugging avenues.

 

 

Thank you all for the great suggestions and sharing your past experiences.


As an update: three days after disabling the Set Time vi and re-enabling the Linux OS time synchronization, it's still going strong. It looks promising, and if it makes it a week I'll let my guard down a little.

Assuming this fixes it, if I can't use the Set Time.vi to set the system clock, how can I synchronize multiple systems that are geographically isolated and have no common network timing source (e.g., NTP, PTP, etc.)? GPS is something that's readily available so I'd like to use that, but not if it's going to cause lockups days after I call the function a single time... Again, this is all predicated on disabling GPS fixing the issue.

46 minutes ago, Heiso said:

Assuming this fixes it, if I can't use the Set Time.vi to set the system clock, how can I synchronize multiple systems that are geographically isolated and have no common network timing source (e.g., NTP, PTP, etc.)? GPS is something that's readily available so I'd like to use that, but not if it's going to cause lockups days after I call the function a single time... Again, this is all predicated on disabling GPS fixing the issue.

I'm pretty sure remote time syncing ("remote" as in "light-years from civilization") is a core use-case.

Try asking at the Linux RT forums. Relevant NI engineers are quite active there: https://forums.ni.com/t5/NI-Linux-Real-Time-Discussions/bd-p/7111

5 hours ago, Heiso said:

As an update: three days after disabling the Set Time vi and re-enabling the Linux OS time synchronization, it's still going strong. It looks promising, and if it makes it a week I'll let my guard down a little.

Assuming this fixes it, if I can't use the Set Time.vi to set the system clock, how can I synchronize multiple systems that are geographically isolated and have no common network timing source (e.g., NTP, PTP, etc.)? GPS is something that's readily available so I'd like to use that, but not if it's going to cause lockups days after I call the function a single time... Again, this is all predicated on disabling GPS fixing the issue.

I think if you can reach an NTP server you'll be OK, provided sub-second accuracy is tolerable. If so, see my comment on this KB article. The whole thing is worth going through, and there are a few different ways to go about it; I think mine combines them all, including disabling NI-Sync. Note that this is only available on version 20.1 or higher, which has bitten me since I have many systems on 20.0.
 

https://forums.ni.com/t5/Example-Code/Installing-and-Configuring-NTP-on-NI-Linux-Real-Time-Devices/tac-p/4165930/highlight/true#M14787
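For reference, the kind of minimal ntpd configuration this approach ends up with looks roughly like the sketch below; the server names are placeholders, and the exact file contents in the linked example may differ:

```
# Minimal /etc/ntp.conf sketch (server names are placeholders)
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
driftfile /var/lib/ntp/ntp.drift
```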


Complete lockup again. Looks like we're going to have physical access to the cRIO late next week, but I'll start disabling some of the software modules in the meantime to see if I can narrow down the culprit to a specific PPL. My money's still on HW since it works benchtop, but we shall see...

