
Heiso

Members
  • Posts

    4
  • Joined

  • Last visited

LabVIEW Information

  • Version
    LabVIEW 2018
  • Since
    2011


  1. As an update, we're 3 days in after disabling the Set Time VI and re-enabling the Linux OS time synchronization, and it's going strong. It looks promising, and if it makes it a week I'll let my guard down a little. Assuming this fixes it, if I can't use the Set Time VI to set the system clock, how can I synchronize multiple systems that are geographically isolated and have no common network timing source (e.g., NTP, PTP)? GPS is readily available so I'd like to use that, but not if it's going to cause lockups days after I call the function a single time... Again, this is all predicated on disabling GPS fixing the issue.
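One way to keep GPS-based timestamps without ever setting the system clock is to track the offset between GPS time and a monotonic clock, and apply that offset when stamping samples. The following is only a minimal sketch of that idea in Python, not what is running on the cRIO: the class name `GpsDisciplinedClock` and the `update`/`now` methods are hypothetical, and it assumes something else (e.g., the FPGA read) delivers a GPS time in seconds.

```python
import time


class GpsDisciplinedClock:
    """Maps a monotonic clock onto GPS time without touching the OS clock.

    Hypothetical helper: the GPS timestamp source is assumed to exist elsewhere.
    """

    def __init__(self, monotonic=time.monotonic):
        # The monotonic source is injectable so the class can be tested.
        self._monotonic = monotonic
        self._offset = None  # GPS seconds minus monotonic seconds

    def update(self, gps_seconds):
        """Record a fresh GPS timestamp (seconds) against the monotonic clock."""
        self._offset = gps_seconds - self._monotonic()

    def now(self):
        """Return the current time on the GPS timescale, in seconds."""
        if self._offset is None:
            raise RuntimeError("no GPS timestamp received yet")
        return self._monotonic() + self._offset
```

Each isolated system would refresh its own offset whenever a new GPS timestamp arrives, so the systems agree to within GPS accuracy plus whatever the local oscillator drifts between updates.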
  2. This looked incredibly promising at first for a couple of reasons. We do not have a GPS drop in the lab, so I had disabled the section with the FPGA call where we pull in the time stamp and then set the system clock in RT using the System Configuration Set Time VI, but this section was still active in the field deployment. Because we're using DAQmx and TSN to synchronize measurements of the cRIO and FieldDAQs, a while ago we noticed time drift in the system clock that seemed to correlate with TSN synchronization lock loss between the master cRIO and the FieldDAQ(s), which would be exacerbated if we used the Set Time VI. NI graciously allowed us to chat in the past with some members of the R&D team responsible for this; they suggested disabling the coupling of the system and hardware clocks and put out this KB. While this did seem to fix that problem, we ended up going a different route and timestamping our sample records with GPS manually on some of our older deployments. This deployment does use the suggested disabling of Linux OS time synchronization. My first test was to disable the Set Time VI, but the cRIO still crashed -- this time after about 28 hours. Yesterday morning I re-enabled the OS time synchronization but kept the Set Time VI disabled, so fingers crossed. I also just saw in the release notes that they supposedly fixed the time drift with the latest release of NI-Sync in late October, so I'll upgrade and try again if the current run locks up.
I'll definitely give this a shot the next time I rebuild the code. Each module reads from an INI file, but some of them read from the same INI. If a module fails for some reason, it relaunches itself and re-reads from that INI.
Currently I don't have anything in the logs showing that a module failed/relaunched, but I'll dig through the code and make sure that I don't have some kind of race condition where multiple modules could fail, relaunch, and try to access the config file simultaneously before the error messages could be logged by the error module.
Good questions. It's... pretty close... to what we have in the field. Both have the same cRIO hardware modules and they're both communicating with 2 FieldDAQs. In the lab I'm actually reading from all 8 channels per FieldDAQ, while in the deployment we're only using 6 from each, so we're actually doing more acquisition and calculations on the lab setup.
As for a slow/unstable connection, yes, that is something with which we've struggled for this deployment. Currently we're sending everything via named queue into our communication PPL, which sends the data out over TCP back to the database ingestion computer. We flush the queue into a conditional FOR loop that attempts to re-establish comms; if that fails, it flushes the comms queue, dumps the data, and repeats the connection attempt and flush until it successfully starts transmitting again. It seems to work pretty well because our queue never reaches a size that would be concerning (i.e., I've seen it an order of magnitude larger on other systems before memory issues start to arise). I tried to include an image of this since it's not a sensitive module, but the forum isn't playing nice with my attempt to attach it to this post -- it keeps throwing an error.
It's not based on a replicated image, just a redeployment of the rtexe and then using scp to copy over the specific .ini and PPL files. Does anyone have a particular suggestion on what to use for duplicating the image and deploying it? I used RTAD about 7 years ago, but I was never clear as to whether that was the best approach or not.
Since we're trying to access the device physically, field testing has been paused, so this would be a good opportunity to do this. I'll start removing software modules from the config and try to figure out if there's one in particular that's causing issues.
This is a question we hope to resolve by putting eyes on the status LED on the cRIO, but I'm 99.99% sure that it's locally dead as well. The reason I say this is that we log connection failures to the hard drive and we see none of these events in the log. In fact, we see 0 events in the log from the time we lose connection to it until we power cycle it remotely, which is what makes me think the entire thing is completely frozen.
Do you mean enabling the console out and hooking up a monitor to it? No, but that's something I'll try whenever we can get out there.
Memory footprint seems fine with ~2.8GB available. It fluctuates slightly up and down, but over time there's no noticeable downward trend. When I start disabling modules this should reduce the memory footprint further, so we'll see how it goes.
I think I may end up doing this on principle. Given how hard it is to get access to the site, it would be worth swapping units for the heck of it so we don't have to jump through the hoops again in the future if we've exhausted all other debugging avenues.
Thank you all for the great suggestions and for sharing your past experiences.
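Since the image of the comms loop would not attach, the reconnect-and-drop behavior described above can be approximated in text. This is a minimal Python sketch, not the actual LabVIEW code: `drain`, `reconnect_dropping_backlog`, and the `connect` callable are hypothetical names, and a failed connection is assumed to surface as an `OSError`.

```python
import queue


def drain(q):
    """Remove and return everything currently in the queue without blocking."""
    items = []
    while True:
        try:
            items.append(q.get_nowait())
        except queue.Empty:
            return items


def reconnect_dropping_backlog(q, connect, max_attempts):
    """Try to re-establish comms, discarding queued data after each failure.

    Returns (connection, dropped_count); connection is None if every
    attempt failed. `connect` is a caller-supplied function (an assumption
    of this sketch) that returns a connection or raises OSError.
    """
    dropped = 0
    for _ in range(max_attempts):
        try:
            return connect(), dropped
        except OSError:
            # Connection is still down: flush the backlog so the queue
            # cannot grow without bound while the link is unavailable.
            dropped += len(drain(q))
    return None, dropped
```

Returning the dropped count makes it easy to log how much data each outage cost, alongside the maximum queue size.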
  3. Disk usage is in good shape. We're only logging to disk when we have errors or when the running state/health of one of the modules changes, so we're still sitting at over 70% available. I appreciate the thought though! I was also looking at some previous threads about TCP ports being used up, but we're only re-establishing network connections when the connection drops (on working systems this usually only happens once every week or two, depending on network quality to the site). We log every time we call our open connection function, and from what I can tell there doesn't seem to be any correlation with the lockups in this case.
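To rule out port exhaustion more directly, the socket table that Linux exposes in /proc/net/tcp could be sampled and logged alongside the other health data. The sketch below is a minimal, hedged example: the function name `count_tcp_states` is hypothetical, it only covers IPv4 (IPv6 sockets are listed in /proc/net/tcp6), and it assumes the standard layout where the fourth column is the hex socket state.

```python
from collections import Counter

# Hex state codes used in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        state = fields[3].upper()
        counts[TCP_STATES.get(state, state)] += 1
    return counts


def read_tcp_states(path="/proc/net/tcp"):
    """Convenience wrapper that reads the live table on a Linux target."""
    with open(path) as f:
        return count_tcp_states(f.read())
```

A steadily growing TIME_WAIT or CLOSE_WAIT count between samples would point toward leaked or lingering connections; a flat count would support the conclusion that ports are not the problem.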
  4. Warning: this is a very open-ended question and I ramble because I'm at the point of tearing my hair out trying to fix this. We have a remotely deployed cRIO system that requires nothing short of an Act of God to physically access. For the past couple of months we have been trying to troubleshoot a complete cRIO lockup that requires us to power cycle it via a web-based relay between the cRIO and its power supply. This has been happening every 2-5 days on average. The architecture itself is an rtexe that dynamically loads a number of PPLs specified in a config file via Start Asynchronous Call, and I've been able to run a similar system on the benchtop for weeks without issue. Physical architecture is a cRIO-9042 with a GPS module, a couple of 9220s, and 2x FD-11601s daisy chained, using DAQmx to acquire data from ~70 channels. Software-wise, this particular system launches 6 unique PPLs:
(1) one that performs DAQmx acquisition and publishes its values via a User Event,
(2) one that registers for the DAQ data event, base64 encodes the data, and does some other formatting to prepare it for network transmission, then sends it to a named queue read by another PPL that performs the network transmission,
(3) one that also registers for the DAQ data event and calculates the detection of transient events, then sends those results via the same named queue read by the network transmission PPL,
(4) one that performs the network transmission,
(5) one that reads weather station data via Modbus TCP and formats/sends it to the network transmission PPL via the named queue,
(6&7) the same FLIR camera PPL launched twice, each instance targeting a different networked camera, which parses the image and sends it to the network transmission PPL via the named queue.
I've monitored memory usage with /proc/meminfo as well as monitoring the specific lvrt application via /proc/(lvrt_proc_id)/status.
There seems to be no substantive change in memory over time (i.e., presumably no leaks) and cRIO CPU usage varies between 10-20% per core, although the average seems to be around 13% per core. I do see a very rapid increase in context switches, both voluntary and nonvoluntary, on the order of ~500 voluntary and ~5 nonvoluntary per second, but I have no frame of reference to determine if this is "high" or whether this is even a problem. My Linux knowledge is relatively limited, so maybe there's another thing I should be monitoring of which I'm unaware.
Does anybody have any anecdotes about situations where they've had cRIOs crash sporadically without seeing a memory leak, and what they did to troubleshoot/debug it? The disappointing thing is that the logs I pull from MAX don't even show that anything happened -- it's basically a complete system freeze (it doesn't reply to pings from other computers either). I see what I believe to be the power cycle event in the logs once I force the relay on the supply to open/close, but even if I wait days between it going non-responsive and power cycling it, I won't see anything representative of the lockup in the logs except the power cycle.
We're going through the arduous process of trying to gain physical access to the cRIO so we can look at the status lights on the front and measure the supply voltage when it's frozen, but that'll probably be another couple of weeks before we get permission, and I'd love to have a checklist of other potential things we can try in the limited window of time we have physical access. Unfortunately I can't post any screenshots, let alone any code, due to the nature of the deployment, which is why my question is so general and I'm just soliciting some stuff to throw at the wall hoping that it sticks. I've started logging max queue size for the data transmission queue to the cRIO HD, but it consistently remains low enough not to be a concern.
We also don't see any delay in ingestion of the data by the database computer -- all of the received timestamps and data are updating completely as expected until the cRIO locks up. Whether anyone replies to this or not, I feel better that I've ranted so thank you for the mental relief!
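For anyone wanting to reproduce the memory and context-switch monitoring described above, here is a minimal Python sketch of one way to do it. It is an illustration, not the tooling actually used on this system: the function names are hypothetical, and it assumes the standard Linux /proc/<pid>/status fields `VmRSS`, `voluntary_ctxt_switches`, and `nonvoluntary_ctxt_switches`.

```python
WANTED = ("VmRSS", "voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")


def parse_status(text):
    """Extract selected integer fields from the text of /proc/<pid>/status."""
    out = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in WANTED:
            # VmRSS looks like "VmRSS:   12345 kB"; keep only the number.
            out[key] = int(value.split()[0])
    return out


def switch_rates(prev, curr, seconds):
    """Context switches per second between two parsed samples."""
    return {
        key: (curr[key] - prev[key]) / seconds
        for key in ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")
    }


def read_status(pid):
    """Convenience wrapper that samples a live process on a Linux target."""
    with open(f"/proc/{pid}/status") as f:
        return parse_status(f.read())
```

Sampling `read_status` for the lvrt process at a fixed interval and appending the results to a file on disk would leave a trail that survives a freeze, which could show whether anything changes in the minutes before a lockup.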