
NI Linux RT Scheduler vs PharLap or Win10


Petr Mazůrek


Hi everyone,

This is a lengthy one, but I think it deserves the extra detail.

 

INTRODUCTION

I'd like to first and foremost state the following:

I understand that VMs are not officially supported and that Idea Exchange posts on how to set up your own Linux virtual machine are not an official source of information.

That being said, we experimented with exactly that and discovered that we hit a certain performance cap, one that evidently exists on cRIOs and PXIs as well.

 

SYSTEM DESCRIPTION

Our applications are large and can be described as follows:

1) It is not object-oriented but consists of individual processes with wrappers for commands. These processes are governed by what we call an IOC (input-output controller), and we use QMH for communication.

2) Several IOCs can run on a single machine, and they publish data via network streams to 'Developer GUIs' for lower-level access.

3) All variables are stored in a local datastore (basically a queue holding one variant; keys are variant attributes to accommodate fast search).

4) Meaningful data are then served by an EPICS 3.14 server. This server is hosted in one of the IOCs, which we call the ROOT IOC. The PV variables are then used by 'EPICS GUIs' for higher-level control and supervision.

5) We used to develop this for PharLap on the RMC-8354, so such machines are called RMCs (see the RMC-8354 page on ni.com).

6) As an example of our labels and terminology: RMC101 hosts two IOCs, the ROOT IOC and the Camera Manager IOC.

 

ISSUES

Our PharLap machines crashed frequently (about once a week), and the RMC-8354 became a mature product with no viable rack-mount replacement. We therefore started experimenting with VMs on either physical or cloud-based platforms (vxRail). It soon became obvious that:

1) Linux can be easily managed and it is very stable.

2) Deployment is in fact easier than with a PharLap machine.

3) Our software cannot run as many IOCs as one would expect. We confirmed this on genuine NI hardware as well (PXI and cRIO with varying numbers of IOCs active).

 

Let me describe the behaviour:

Everything scales nicely until we exceed a certain CPU load. We know our CPU load data are reliable, so it made no sense that the load stopped increasing as we added more software onto the machine. Over several months of experiments and various hacks we isolated the problem to scheduling. We managed to make a few traces, and it soon became obvious that PharLap and Linux are very different beasts.

The PharLap trace is very organized, and the machine manages to run 4-5 IOCs with no problems (CPU load is very high though, up to 95%).

A) PHARLAP TRACE

[image: PharLap trace]

Linux, on the other hand, stops working when we add more than 2 IOCs, and even 2 demanding IOCs can underperform.

B) LINUX TRACE WITH FUNCTIONAL CODE

[image: Trace_functional.png]

C) LINUX TRACE WITH DYSFUNCTIONAL CODE

[image: trace_not_functional.png]

IN-DETAIL DESCRIPTION

Any increase in CPU load on a Linux machine causes more threads to be spawned, and we detect a very high number of context switches. Observing CPU load alone therefore does not help you, because it is thread switching that becomes the issue. Why this happens probably comes down to how the compiler for Linux works and how it sets up scheduling in the NI Linux kernel.
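The context-switch explosion can be observed directly from the kernel's per-process counters in /proc, with no tracing tools needed. A minimal sketch; the fallback to the current shell's PID is only so it runs anywhere, and on a real target you would point it at lvrt:

```shell
#!/bin/sh
# Print the context-switch counters the kernel keeps per process.
# On an NI Linux RT target you would use: PID=$(pidof lvrt)
PID=${1:-$$}
grep -E '^(voluntary|nonvoluntary)_ctxt_switches' "/proc/$PID/status"
```

Sampling these two numbers a few seconds apart gives the switch rate; a nonvoluntary count growing much faster than the voluntary one points at threads being preempted rather than blocking.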

 

We tested multiple things, including changes in scheduling strategies, but these hacks are never stable and help only in specific arrangements of our IOCs. One such example is:

pidof lvrt
2074

chrt -p -a -f 1 2074

This forces all of lvrt's threads onto the FIFO scheduler at the lowest real-time priority. Magically, this can improve performance, and then it crashes. CPU load rises from 50 to 70% (since the CPU can actually do more work instead of just swapping threads around).
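For context on what `-f 1` means: `chrt -m` (util-linux) prints the priority range each policy accepts, and `chrt -p` shows what a task currently runs under. SCHED_FIFO priorities normally span 1-99, so 1 is the lowest RT priority, yet still ahead of every ordinary SCHED_OTHER thread:

```shell
#!/bin/sh
# Show the valid priority range per scheduling policy, then query
# the current policy/priority of a process (the shell stands in for lvrt).
chrt -m
chrt -p $$
```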

I investigated threadconfig.vi and started experimenting with the number of threads being spawned. My latest discovery is that limiting the number of available threads per core, in combination with increasing the number of available cores, improves performance. I added these lines to lvrt.conf:

ESys.Normal=10
ESys.StdNParallel=-1
ESys.Bgrnd=0
ESys.High=0
ESys.VHigh=10
ESys.TCritical=6
ESys.instrument.Bgrnd=0
ESys.instrument.Normal=0
ESys.instrument.High=0
ESys.instrument.VHigh=0
ESys.instrument.TCritical=0
ESys.DAQ.Bgrnd=0
ESys.DAQ.Normal=0
ESys.DAQ.High=0
ESys.DAQ.VHigh=0
ESys.DAQ.TCritical=0
ESys.other1.Bgrnd=0
ESys.other1.Normal=0
ESys.other1.High=0
ESys.other1.VHigh=0
ESys.other1.TCritical=0
ESys.other2.Bgrnd=0
ESys.other2.Normal=0
ESys.other2.High=0
ESys.other2.VHigh=0
ESys.other2.TCritical=0

I believe the default number is 20 (threadconfig.vi somehow reported 26?), and since we don't use the other execution priorities I limited standard threads to 10 per core. Also, completely omitting time-critical threads doesn't help in any way. Adding

ESys.TCritical=6

improved performance to the best it has ever been. I don't know why; we are not using the time-critical priority anywhere, and I searched the code with a script just to make sure.
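A quick way to check whether the ESys.* limits above actually changed anything is to count lvrt's OS threads before and after editing lvrt.conf. A sketch; it falls back to the current shell when lvrt isn't running:

```shell
#!/bin/sh
# Each thread of a process appears as a directory under /proc/<pid>/task.
PID=$(pidof lvrt 2>/dev/null || echo $$)
echo "PID $PID has $(ls "/proc/$PID/task" | wc -l) thread(s)"
```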

 

One sure way to improve things is raising the clock frequency, but that is not where hardware is heading, especially not in the VMware world.

COMPARISON WITH WIN10 BUILD

What makes me believe this is not our fault? I admit that we are doing something very complex, after all. But the whole thing is not really using RT features in any way, and the Windows 10 build of the same application runs like a charm on much less powerful PCs.

SUMMARY

I am open to any suggestions on what to try next. Maybe you share the same experience?

The scariest thing for us is that NI does not offer a follow-up to the RMC that could perform as well as an 8-year-old PharLap machine.

We have contacted the American NI team, but the scale of the project does not make it easy to explain what is wrong. Their general response is: we will get back to you. While I understand that, we need to move forward somehow.

 

Thank you for any tips,

 

Petr


That is cool, but I don't know how to help. We did install Linux RT on a bare-metal rack-mount PC as a preliminary test. We never did any performance testing; we were just looking for potential new designs for systems. I'm sure if you get hold of the right person at NI they would be able to help, but I don't have any contacts in the Linux RT R&D department. However, if you contact support they will probably laugh at you, sorry.

Quote

However, if you contact support they will probably laugh at you, sorry.

My thoughts exactly :D The PreemptRT architect I reached was interested, but he was transferred to a different project because they consider the Linux RT transition solved for key products.

 

I guess that scaling another application to a reasonable size could reveal the same behaviour. PharLap has been used in other places where Big Physics happens, but it is still too early to migrate to Linux for any large-scale project (even though you pay in gold for a single RMC). Maybe answers will start coming once key customers are affected.


It's been a very long time since I did any real-time stuff, but if I'm reading your charts right...

The Linux nisysapi-mDNS is a bottleneck? Is that due to your code looking up addresses or an artifact of the system in general? I also noticed the variable publisher in your first image. Are you using network variables?

Edited by ShaunR

It's so great to have a fresh pair of eyes on this.

 

mDNS struck me too. I suppose the EPICS engine takes quite a load, but it opens virtual channels that shouldn't need discovery happening all the time. We've had other, related problems with UDP before: the buffer was too big; we optimized it and now it works like a charm. Any tips on what to look for? Is this TCP?
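If this behaves like standard mDNS, it would be UDP multicast on port 5353 rather than TCP; whether anything on the box is bound there can be checked straight from the kernel's socket tables (ports are in hex in those files; 5353 = 0x14E9):

```shell
#!/bin/sh
# Look for sockets bound to the mDNS port (5353 = 0x14E9) in the
# kernel's UDP socket tables. No match means nothing is listening.
grep -i ':14E9 ' /proc/net/udp /proc/net/udp6 2>/dev/null \
    || echo "nothing bound to UDP 5353"
```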

Regarding network variables, we are not using any. The codebase is large though, and our partners from LLNL used variables in the past. Is it maybe spawned automatically by LVRT? It seems to me that some of these services are. I noticed some DAQmx stuff in there.

 

Thanks for the feedback.

Edited by Petr Mazůrek
typo

I tried forcing FIFO scheduling on all threads while first making sure that the policies were all set to 'other'.

lvrt.conf

thpolicy_tcrit=other
thpri_tcrit=0
thpolicy_vhigh=other
thpri_vhigh=0
thpolicy_high=other
thpri_high=0

Command line

chrt -p -a -f 1 <pidnumber>

I am at a better rate now (it would be acceptable), but it is not comparable to the Win10 build in any way. On the plus side, it is stable, which was not the case in my earlier FIFO scheduling tests.
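To double-check that the chrt call really reached every thread, ps can list the scheduling class per thread (CLS column: TS = SCHED_OTHER, FF = SCHED_FIFO, RR = SCHED_RR). A sketch, again using the current shell as a stand-in:

```shell
#!/bin/sh
# List per-thread scheduling class and real-time priority of a process.
# On the target: PID=$(pidof lvrt); after the chrt call, CLS should read FF.
PID=${1:-$$}
ps -L -o tid,class,rtprio,comm -p "$PID"
```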

 

The last thing I tried was killing the mDNS thread, but it did not change anything.

 

Here is the trace with FIFO enabled

[image: trace with FIFO enabled]

I will keep testing it.


So we made some progress with diagnostics of this thing. We tried to condense it into the attached report.

 

In short, we managed to isolate this to access to notifier references, which is clearly implemented differently on PharLap and Linux. We tried swapping notifiers for queues; that improves reads, but writes are still slower.

 

I can post the code after some revision, to make sure you don't need extra packages if you want to try it yourselves.

LinuxRMC CPU Load_New_revised.pptx


I guess there is much more code and it makes sense in the bigger picture, but from this snippet alone I don't really get it. Why bother making a single-element queue whose reference you store in the feedback node? Why not just store the variant itself in the feedback node?

[image: DataStore code snippet]

Edited by Neil Pate
36 minutes ago, Petr Mazůrek said:

The DataStore is shared across the machine, which runs multiple IOCs, so the queue can be obtained by the name _DATASTORE from other pieces of code.

Sure, but I still don't see the point, as you could just keep the variant in the feedback node (and this VI is accessible by all processes). What does adding it to a single-element queue do?

Edited by Neil Pate
On 8/31/2022 at 12:49 PM, Petr Mazůrek said:

We tried to swap notifiers for queues and it improves readings but writing is still slower

All of the synchronisation primitives use occurrences under the hood. Occurrences are the most efficient but hardest to use since they don't carry any data.

On a side note (and people will hate/flame/badger me for this). If you have a write one and read many architecture then the most efficient for task switching is a global variable (but memory suffers).

On 9/1/2022 at 11:42 PM, ShaunR said:

All of the synchronisation primitives use occurrences under the hood. Occurrences are the most efficient but hardest to use since they don't carry any data.

On a side note (and people will hate/flame/badger me for this). If you have a write one and read many architecture then the most efficient for task switching is a global variable (but memory suffers).

No flame from me for this. Under your constraint (only ever write from one place and never anywhere else) it is a valid use case. However, beware of doing that for huge data. It will not just incur memory overhead but also a performance cost, as the ENTIRE global is copied every time, even if you immediately follow the read with an Index Array to get only one element from the huge array.

On 9/1/2022 at 11:42 PM, ShaunR said:

All of the synchronisation primitives use occurrences under the hood. Occurrences are the most efficient but hardest to use since they don't carry any data.

On a side note (and people will hate/flame/badger me for this). If you have a write one and read many architecture then the most efficient for task switching is a global variable (but memory suffers).

Are there any more details available on how occurrences are used in the case of notifiers and queues?

23 hours ago, Rolf Kalbermatter said:

No flame from me for this. Under your constraint (only ever write from one place and never anywhere else) it is a valid use case. However, beware of doing that for huge data. It will not just incur memory overhead but also a performance cost, as the ENTIRE global is copied every time, even if you immediately follow the read with an Index Array to get only one element from the huge array.

Scaling of data is important for us so we would like to build upon existing structures within the code. 

On 9/1/2022 at 10:36 AM, Neil Pate said:

Sure, but I still don't see the point, as you could just keep the variant in the feedback node (and this VI is accessible by all processes). What does adding it to a single-element queue do?

I tried to remember what led to this design; I think it was the idea of having the whole thing wrapped in something better defined than just a feedback node (it used to be an FGV). I suppose it doesn't achieve much.

Edited by Petr Mazůrek