Petr Mazůrek Posted August 25, 2022

Hi everyone, this is a lengthy one, but I think it deserves the extra information.

INTRODUCTION

I'd like to first and foremost state the following: I understand that VMs are not officially supported and that idea exchange posts on how to set up your own Linux virtual machine are not an official source of information. That being said, we experimented with just that and discovered that we hit a certain performance cap that evidently exists on cRIOs and PXIs as well.

SYSTEM DESCRIPTION

Our applications are large and could be described as follows:

1) It is not object oriented but consists of individual processes with wrappers for commands. These processes are governed by what we call an IOC (input-output controller) and we use QMH for communication.
2) Several IOCs can run on a single machine and they publish data using network streams to 'Developer GUIs' - lower level access.
3) All variables are stored in a local datastore (basically a Queue with one variant in it; keys are variant attributes to accommodate fast search).
4) Meaningful data are then sent using an EPICS 3.14 server. Such a server is hosted in one of the IOCs, which we describe as a ROOT IOC. PV variables are then used by 'EPICS GUIs' - higher level control and supervision.
5) We used to develop this for PharLap on the RMC-8354, therefore such machines are called RMCs. (RMC-8354 on NI page)
6) As an example with our labels and terminology: RMC101 hosts two IOCs: ROOT IOC and Camera Manager IOC.

ISSUES

Our PharLap machines crashed frequently (once a week) and the RMC-8354 became a mature product with no viable rack-mount replacement. Therefore we started experimenting with VMs on either a physical or a cloud-based platform (vxRail). It soon became obvious that:

1) Linux can be easily managed and it is very stable.
2) Deployment is in fact easier than in the case of a PharLap machine.
3) Our software cannot run as many IOCs as one would expect.
We confirmed this on genuine NI hardware as well (PXI and cRIO with varying numbers of IOCs active).

Let me describe the behaviour: Everything scales nicely until we exceed a certain CPU load. We know that the CPU load data are reliable, so it made no sense that the load stopped increasing when we started adding more software to the machine. During several months of experiments and various hacks we isolated the problem to scheduling. We managed to capture a few traces, and it soon became obvious that PharLap and Linux are very different beasts.

The PharLap trace is very organized and PharLap manages to run 4-5 IOCs with no problems (CPU load is very high though, could be up to 95%).

A) PHARLAP TRACE

Linux, on the other hand, stops working when we add more than 2 IOCs, and even 2 demanding IOCs can underperform.

B) LINUX TRACE WITH FUNCTIONAL CODE

C) LINUX TRACE WITH DYSFUNCTIONAL CODE

IN-DETAIL DESCRIPTION

Any increase in CPU load on a Linux machine causes more threads to be spawned, and a very high number of context switches is detected. So observing CPU load does not help you, because it is thread switching that becomes the issue. Why this happens is probably down to how the compiler for Linux works and how it sets up scheduling in the NI Linux kernel. We tested multiple things, including changes in scheduling strategies, but these hacks are never stable and help only in specific arrangements of our IOCs. One such example is

    pidof lvrt
    2074
    chrt -p -a -f 1 2074

This forces all threads to use the FIFO scheduler with the lowest priority. Magically, this can improve performance - and then it crashes. CPU load rises from 50 to 70% (since it can actually do more work and is not just swapping threads around).

I investigated threadconfig.vi and started experimenting with the number of threads being spawned. My latest discovery is that limiting the number of available threads per core, in combination with an increase in the number of available cores, improves performance.
I added these into lvrt.conf

    ESys.Normal=10
    ESys.StdNParallel=-1
    ESys.Bgrnd=0
    ESys.High=0
    ESys.VHigh=10
    ESys.TCritical=6
    ESys.instrument.Bgrnd=0
    ESys.instrument.Normal=0
    ESys.instrument.High=0
    ESys.instrument.VHigh=0
    ESys.instrument.TCritical=0
    ESys.DAQ.Bgrnd=0
    ESys.DAQ.Normal=0
    ESys.DAQ.High=0
    ESys.DAQ.VHigh=0
    ESys.DAQ.TCritical=0
    ESys.other1.Bgrnd=0
    ESys.other1.Normal=0
    ESys.other1.High=0
    ESys.other1.VHigh=0
    ESys.other1.TCritical=0
    ESys.other2.Bgrnd=0
    ESys.other2.Normal=0
    ESys.other2.High=0
    ESys.other2.VHigh=0
    ESys.other2.TCritical=0

I believe that the default number is 20 (threadconfig.vi somehow reported 26?) and we don't use other execution priorities, so I limited standard threads to 10 per core. Also, if you completely omit time critical threads, it doesn't help in any way. Adding ESys.TCritical=6 improved performance to the best it has ever been. I don't know why; we are not using time critical priority anyway. I searched the code with a script just to make sure. One sure way to improve it is a higher CPU clock frequency - but that is not the latest trend, and especially not in the VMware world.

COMPARISON WITH WIN10 BUILD

What makes me believe that this is not our fault? I admit that we are doing something very complex after all. But the whole thing is not really using RT in any way, and a Windows 10 build of the same thing runs like a charm on much less powerful PCs.

SUMMARY

I am open to any suggestions on what to try next. Maybe you share the same experience? The scariest thing for us is the fact that NI does not offer a follow-up to the RMC that could perform as well as an 8-year-old PharLap machine. We have contacted the American NI team, but the scale of the project does not help in explaining what is wrong. Their general response is "we will get back to you". While I understand that, we need to somehow move forward.

Thank you for any tips,
Petr
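The post above attributes the slowdown to context switching rather than raw CPU load. One way to put numbers on that per thread is to read the counters the Linux kernel exposes under /proc/<pid>/task/<tid>/status. The following is only a minimal sketch, assuming Python 3 is available on the target (not stated in the thread); the function names are made up for illustration and are not part of any NI tooling.

```python
from pathlib import Path

# Counter names as they appear in /proc/<pid>/task/<tid>/status on Linux.
FIELDS = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")


def parse_ctxt_switches(status_text):
    """Extract the two context-switch counters from the text of a status file."""
    counters = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in FIELDS:
            counters[key] = int(value.strip())
    return counters


def thread_ctxt_switches(pid):
    """Snapshot: map each thread ID of `pid` to (thread name, counters)."""
    snapshot = {}
    for task in Path(f"/proc/{pid}/task").iterdir():
        try:
            name = (task / "comm").read_text().strip()
            counters = parse_ctxt_switches((task / "status").read_text())
        except OSError:
            continue  # the thread exited while we were reading it
        snapshot[int(task.name)] = (name, counters)
    return snapshot


def busiest_threads(before, after, top=10):
    """Rank threads by the number of context switches between two snapshots."""
    deltas = []
    for tid, (name, now) in after.items():
        if tid not in before:
            continue  # thread was created after the first snapshot
        then = before[tid][1]
        total = sum(now.get(k, 0) - then.get(k, 0) for k in FIELDS)
        deltas.append((total, tid, name))
    return sorted(deltas, reverse=True)[:top]
```

Taking two snapshots of the lvrt process (PID 2074 in the example above) a second apart with thread_ctxt_switches and passing them to busiest_threads gives switches per second for each thread, which can be compared before and after a scheduling or lvrt.conf change.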
hooovahh Posted August 25, 2022

That is cool but I don't know how to help. We did install Linux RT on a bare metal rack mount PC as a preliminary test. We never did any performance testing; we were just looking for potential new designs for systems. I'm sure if you get hold of the right person at NI they would be able to help, but I don't have any contacts at the Linux RT R&D department. However, if you contact support they will probably laugh at you, sorry.
Petr Mazůrek (Author) Posted August 26, 2022

Quote: "However, if you contact support they will probably laugh at you, sorry."

My thoughts exactly. The PreemptRT architect I reached was interested, but he was transferred to a different project because they consider the Linux RT transition for key products solved. I guess that scaling another application to a reasonable size could reveal the same behaviour. PharLap has been used in other places where Big Physics happens, but it is still too early to migrate to Linux for any large-scale project (even though you pay in gold for a single RMC). Maybe answers will start coming when key customers are affected.
ShaunR Posted August 26, 2022 (edited)

It's been a very long time since I did any real-time stuff, but if I'm reading your charts right... The Linux nisysapi-mDNS is a bottleneck? Is that due to your code looking up addresses or an artifact of the system in general? I also noticed the variable publisher in your first image. Are you using network variables?

Edited August 26, 2022 by ShaunR
Petr Mazůrek (Author) Posted August 29, 2022 (edited)

It's so great to have a fresh pair of eyes on this. mDNS struck me too. I suppose that the EPICS engine takes quite a load, but it opens virtual channels that don't need a search happening all the time. We've had other related problems before with UDP. The buffer was too big. We optimized it and now it works like a charm. Any tips on what to look for? Is this TCP?

Regarding network variables, we are not using any. The codebase is large though, and our partners from LLNL used variables in the past. Is it maybe spawned automatically by LVRT? It seems to me that some of these services are. I noticed some DAQmx stuff in there.

Thanks for the feedback.

Edited August 30, 2022 by Petr Mazůrek (typo)
Petr Mazůrek (Author) Posted August 30, 2022

I tried forcing FIFO scheduling on all threads while making sure that the policies are all set to other before that.

lvrt.conf

    thpolicy_tcrit=other
    thpri_tcrit=0
    thpolicy_vhigh=other
    thpri_vhigh=0
    thpolicy_high=other
    thpri_high=0

Command line

    chrt -p -a -f 1 <pidnumber>

I am at a better rate now (it would be acceptable) but it is not comparable to the Win10 build in any way. Also, it is stable - that was not the case in the past when testing FIFO scheduling. The last thing I tried was to kill the thread for mDNS, but it did not change anything.

Here is the trace with FIFO enabled (image)

I will keep testing it.
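Since the earlier FIFO experiments were unstable, it can help to confirm that the chrt command above really changed every thread, including ones spawned later. The sketch below reads the policy and priority of each thread through Python's os.sched_getscheduler and os.sched_getparam, which on Linux accept a thread ID. It assumes Python 3 on the target; the helper names are illustrative, and the policy numbers are the Linux kernel values hard-coded here for readability.

```python
import os
from pathlib import Path

# Linux scheduling policy numbers (kernel ABI values).
POLICY_NAMES = {
    0: "SCHED_OTHER",
    1: "SCHED_FIFO",
    2: "SCHED_RR",
    3: "SCHED_BATCH",
    5: "SCHED_IDLE",
    6: "SCHED_DEADLINE",
}
# Flag that the kernel may OR into the reported policy value.
SCHED_RESET_ON_FORK = 0x40000000


def policy_name(code):
    """Translate a raw policy value into a readable name."""
    return POLICY_NAMES.get(code & ~SCHED_RESET_ON_FORK, f"UNKNOWN({code})")


def thread_policies(pid):
    """Map each thread ID of `pid` to (policy name, real-time priority)."""
    policies = {}
    for task in Path(f"/proc/{pid}/task").iterdir():
        tid = int(task.name)
        try:
            policy = policy_name(os.sched_getscheduler(tid))
            priority = os.sched_getparam(tid).sched_priority
        except OSError:
            continue  # the thread exited between listing and querying
        policies[tid] = (policy, priority)
    return policies


def threads_not_fifo(pid):
    """Return the thread IDs that are not running under SCHED_FIFO."""
    return sorted(
        tid for tid, (policy, _) in thread_policies(pid).items()
        if policy != "SCHED_FIFO"
    )
```

An empty list from threads_not_fifo(<pidnumber>) right after the chrt call, and again after the application has been running for a while, would indicate that no thread has fallen back to the default policy. Running chrt -a -p <pidnumber> without a policy option prints the same information from the shell.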
Petr Mazůrek (Author) Posted August 31, 2022

So we made some progress with diagnosing this thing. We tried to condense it in the attached report. In short, we managed to isolate this to access to notifier references. This is clearly implemented differently on PharLap and Linux. We tried to swap notifiers for queues and it improves reads, but writing is still slower. I can post the code after some revision, to make sure that you don't need extra packages if you want to try it yourselves.

Attachment: LinuxRMC CPU Load_New_revised.pptx
Neil Pate Posted August 31, 2022 (edited)

I guess there is much more code and it makes sense when looking at the bigger picture, but with this snippet I don't really get this. Why bother with making a single element queue, the reference of which you store in the feedback node? Why not just store the variant itself (using the feedback node)?

Edited August 31, 2022 by Neil Pate
Petr Mazůrek (Author) Posted September 1, 2022

The DataStore is shared for the machine, which runs multiple IOCs. So the queue can be accessed by the name _DATASTORE from other pieces of code.
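For readers who don't know the pattern under discussion, here is a rough Python analogy of a named single-element queue used as a datastore. It is not the LabVIEW code from this thread: the class, the obtain function and the dict standing in for the variant attributes are all invented for illustration. It shows one common reason for the pattern, namely that taking the single element out of the queue blocks every other accessor until it is put back, so a read-modify-write is serialised. Whether the original code depends on that property is not stated in the thread.

```python
import queue
import threading


class DataStore:
    """One dict held in a single-element queue; dequeuing it acts as a lock."""

    def __init__(self):
        self._slot = queue.Queue(maxsize=1)
        self._slot.put({})  # the dict stands in for the variant and its attributes

    def update(self, key, func, default=None):
        """Read-modify-write one key; other callers block until we finish."""
        data = self._slot.get()  # dequeue: the store is now checked out
        try:
            data[key] = func(data.get(key, default))
            return data[key]
        finally:
            self._slot.put(data)  # enqueue: the store is available again

    def write(self, key, value):
        self.update(key, lambda _old: value)

    def read(self, key, default=None):
        data = self._slot.get()
        try:
            return data.get(key, default)
        finally:
            self._slot.put(data)


_STORES = {}
_STORES_LOCK = threading.Lock()


def obtain(name):
    """Look up a store by name, creating it on first use (like a named queue)."""
    with _STORES_LOCK:
        if name not in _STORES:
            _STORES[name] = DataStore()
        return _STORES[name]
```

Any piece of code that calls obtain("_DATASTORE") gets the same store. The cost of the pattern is also visible here: every read checks the whole store out and back in, so readers contend with each other as well as with writers.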
Neil Pate Posted September 1, 2022 (edited)

36 minutes ago, Petr Mazůrek said: "The DataStore is shared for the machine which runs multiple IOCs. So Queue can be accessed by name _DATASTORE from other pieces of code."

Sure, but I still don't see the point, as you could just keep the variant in the feedback node (and this VI is accessible by all processes). What does adding it to a single element queue do?

Edited September 1, 2022 by Neil Pate
ShaunR Posted September 1, 2022

On 8/31/2022 at 12:49 PM, Petr Mazůrek said: "We tried to swap notifiers for queues and it improves readings but writing is still slower"

All of the synchronisation primitives use occurrences under the hood. Occurrences are the most efficient but hardest to use since they don't carry any data.

On a side note (and people will hate/flame/badger me for this): if you have a write one and read many architecture, then the most efficient for task switching is a global variable (but memory suffers).
Rolf Kalbermatter Posted September 4, 2022

On 9/1/2022 at 11:42 PM, ShaunR said: "All of the synchronisation primitives use occurrences under the hood. Occurrences are the most efficient but hardest to use since they don't carry any data. On a side note (and people will hate/flame/badger me for this). If you have a write one and read many architecture then the most efficient for task switching is a global variable (but memory suffers)."

No flame from me for this. Under your constraint (only ever write from one place and never anywhere else) it is a valid use case. However, beware of doing that for huge data. This will cost not just memory but also performance, as the ENTIRE global is copied every time, even if you immediately follow the read with an Index Array that picks only one element from the huge array.
Petr Mazůrek (Author) Posted September 5, 2022

On 9/1/2022 at 11:42 PM, ShaunR said: "All of the synchronisation primitives use occurrences under the hood. Occurrences are the most efficient but hardest to use since they don't carry any data. On a side note (and people will hate/flame/badger me for this). If you have a write one and read many architecture then the most efficient for task switching is a global variable (but memory suffers)."

Are there any more details available on how occurrences are used in the case of notifiers and queues?
Petr Mazůrek (Author) Posted September 5, 2022

23 hours ago, Rolf Kalbermatter said: "No flame from me for this. Under your constraint (only ever write from one place and never anywhere else) it is a valid use case. However beware of doing that for huge data. This will not just incur memory overhead but also performance, as the ENTIRE global is every time copied even if you do right after an index array to read only one element from the huge array."

Scaling of data is important for us, so we would like to build upon existing structures within the code.
Petr Mazůrek (Author) Posted September 5, 2022 (edited)

On 9/1/2022 at 10:36 AM, Neil Pate said: "Sure, but I still don't see the point as you could just keep the variant in the feedback node (and this VI is accessible by all processes). What does adding it to a single element queue do?"

I tried to remember what led to this design. I think it was the idea of having the whole thing wrapped in something better defined than just a feedback node (it used to be an FGV). I suppose it doesn't achieve much.

Edited September 5, 2022 by Petr Mazůrek
Petr Mazůrek (Author) Posted November 2, 2022

Minor update: We tested the whole thing on a virtualized Windows server. We can see a slowdown of the datastore loop, but no other loops are affected. Therefore, the behaviour on LinuxRT really is different and it causes the problems to appear sooner.