Performance testing


Daklu

Recommended Posts

For the last couple of weeks I've been chasing a performance issue with an I2C data collection component I wrote. Specifically, I've been using Process Explorer to monitor (and try to reduce) the cpu load. The attached WaveformStressTest vi is a simplified model of the display component of my application that I've been using to try to figure out the issues. The questions below cover some of the weird things I've seen while exploring the limits of the waveform chart.

*Noob Questions Alert*

Sure waveform charts have been around forever, but I actually haven't used them for anything meaningful until this project. So while this is probably a "well, duh" moment for the rest of you, for me these are a "huh?" :)

Q1

The waveform 'Synchronous Display' property help says...

In multithreaded systems, you can use the Synchronous Display property to set whether to defer updates for controls and indicators.

I've always taken that to mean setting that property true synchronizes the display updates with the data arrival. Now that I'm looking at it again, it's ambiguous. Does setting it true synchronize display updates with the data arrival (refreshing every time new data arrives), or synchronize display updates with the regular screen refreshes (buffering the data until the OS screen refresh occurs)?

Q2

Why does a strip chart's cpu load increase so much once the data actually starts scrolling? Enable the disabled case for Test Config 1, make sure all four charts are empty and set to strip chart mode, then run the vi. I get ~25% load while the data makes its way across the chart. Once the data starts scrolling the load jumps to 50%. I don't think it's due to the strip chart's circular buffer background processing, since minimizing the window drops the load to essentially 0. Is it because once the data starts scrolling the entire chart surface has to be repainted as opposed to just the section that has new data? Is repainting *that* inefficient? This is one place where I'd expect the synchronous display setting to help out, but changing it doesn't seem to have any effect on load or display.

Q3

Why does a scope chart use so much less cpu time than a strip chart before the strip chart data starts scrolling? I'd expect them to be about the same, since in both cases they only have to repaint the part that has new data. Yet the strip charts use ~25% of the cpu while the scope charts use ~13% of the cpu.

Q4

Try this... Start the test vi. Set each chart to strip mode and then back to scope mode, so they are all in scope mode but reach the end of the display at different times. Minimize the front panel for a second or two, then restore it. Things look normal. Now minimize the panel for 10 seconds or so and then restore it. All the data that should be displayed has been erased and the four charts are in sync. Is LabVIEW being a good citizen and releasing that memory if we don't need it within 10 seconds?

Q5 (This one has me stumped)

Enable the disabled cases for Test Config 2. In the display loop I set up a simple case structure so each chart only displays every 20th data point. The charts should show a smooth line with a constant slope. Instead I get long delays between screen refreshes and a jagged line, indicating the chart isn't processing all the data it is receiving. Probing inside the cases with the chart controls shows they are executing correctly, yet for some reason the charts are not processing the data they receive. Note the processor load in this test is relatively low at ~10%.

Test Config 3 is the one that most resembles where I am now in passing parsed I2C data to my display loop. Each of the four sensors sends out a data point ~every 3 ms. Once I realized I couldn't send each point to the chart as I acquired it and let the Synchronous Display property manage refresh resources, I used the case structure as a simple way to decimate the data.
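
For readers following along without the VI, the decimation boils down to a modulo counter in the display loop. A rough Python sketch of the intent (data_q and update_chart are stand-ins for the queue and the chart write):

```python
import queue

DECIMATION = 20  # display every 20th data point

def display_loop(data_q: queue.Queue, update_chart) -> None:
    """Dequeue every point, but only draw one in twenty."""
    count = 0
    while True:
        point = data_q.get()                      # blocking dequeue
        if count % DECIMATION == DECIMATION - 1:  # every 20th iteration
            update_chart(point)                   # stand-in for the waveform chart write
        count += 1
```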

Oddly, in the test vi the charts appear to miss data. In my app they don't stutter or miss data, but they do use an excessive amount of cpu time even though they are only getting about 15 pts/sec. I have also seen situations where the chart control appears to buffer data points before displaying them. When that happens I can stimulate the signals on my desk and there's a 1-2 second delay before the change shows up on the chart.

So... any insights into waveform charts y'all would like to share?


WaveformStressTest.vi

Link to comment

Haven't had time to go into all the above.

But have noticed that your bottom loop (dequeue) is free running (0ms between iterations). I get a very stuttery update (just ran it straight out of the box). Changing the time-out (deleting the local and hard-wiring it) to, say, 100ms makes everything happy, and the dequeue loop updates at 3ms. I imagine there's a shed-load of time-outs going on which is causing you to miss updates (not sure of the mechanism at the moment).
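
In text-language terms the difference between a near-zero and a generous timeout is roughly this (a Python sketch, not the actual VI; data_q and update_chart are placeholders):

```python
import queue

def display_loop(data_q: queue.Queue, update_chart, timeout_s: float = 0.1) -> None:
    """Dequeue with a timeout. A near-zero timeout makes the loop free-run
    (mostly empty iterations), while a generous one lets it sleep until
    data actually arrives."""
    while True:
        try:
            point = data_q.get(timeout=timeout_s)
        except queue.Empty:
            continue          # timed out: nothing to display this iteration
        update_chart(point)
```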

Link to comment

But have noticed that your bottom loop (dequeue) is free running (0ms between iterations).

It shouldn't be free running straight out of the box. The default update delay is set to 3ms. Am I missing something obvious?

The original testing was done on my work laptop. Downloading and running the code at home on my desktop has very different results. Running it straight out of the box used <3% of my cpu. My home computer is 3 years old so it's no powerhouse, but it does have a good video card. I wonder if that could be the difference? I'm also running XP at home and Win7 at work, so that could contribute too. I dunno... seems weird.

Link to comment

It shouldn't be free running straight out of the box. The default update delay is set to 3ms. Am I missing something obvious?

The original testing was done on my work laptop. Downloading and running the code at home on my desktop has very different results. Running it straight out of the box used <3% of my cpu. My home computer is 3 years old so it's no powerhouse, but it does have a good video card. I wonder if that could be the difference? I'm also running XP at home and Win7 at work, so that could contribute too. I dunno... seems weird.

I think so.

Yes the bottom loop is free-running. Put an indicator on the front panel wired to the difference between 2 "GetTickCount" VIs. You will see that it is pretty much "0".

You are in fact putting all the data onto 1 un-named queue. So, every 3 ms you are adding 4 lots of data, but you don't know which order the bottom de-queues execute in or, indeed, which order the data was originally placed in (but that's another issue).

Your time-out is exactly the same (3 ms), so if there are 4 pieces of data on the queue then everything is fine. However, this is unlikely since the producer and consumer are not synchronised. If a timeout occurs, you still increment your counter, so, due to jitter, sometimes you are reading data and sometimes you are just timing out... but you still only display the data (and add it to the history) in case 19. And although you will read it next time around, you don't know on which graph it will be. You have a race condition between data arriving and timing out. Does that make sense?
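
In rough text-language terms, the race looks something like this (a Python sketch of the failure mode, not the VI itself; the point is that the counter advances on timeouts as well as on real data):

```python
import queue

DECIMATION = 20

def buggy_display_loop(data_q: queue.Queue, update_chart, timeout_s: float = 0.003) -> None:
    """The counter advances even on a timeout, so the one-in-twenty
    'display' iteration sometimes lands on an empty dequeue and the point
    that should have been drawn is skipped."""
    count = 0
    while True:
        try:
            point = data_q.get(timeout=timeout_s)  # may time out ...
        except queue.Empty:
            point = None
        if count % DECIMATION == DECIMATION - 1 and point is not None:
            update_chart(point)
        count += 1                                 # ... but count still increments
```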

Edited by ShaunR
Link to comment

I think so... You have a race condition between data arriving and timing out. Does that make sense?

Absolutely. Good catch! That could also be why I get very different results on my laptop and desktop.

On my home computer, probing the loop counters showed the lower loop running about twice as fast as the upper loop. The easiest solution is to remove the dequeue timeout completely, though adding 2 ms to Update Delay before wiring it to the timeouts solved the problem too. (I'll have to add more code to more closely model what my app is *really* doing... it only has a single queue.)

When I was testing this I was making big changes to the Update Delay and didn't want to have to stop and restart to fix the timeout value. Why did I have the timeout in the first place when it isn't needed? I don't remember exactly, but I think when I was first putting it together I didn't have the Release Queue prims outside the top loop, so the lower loop would block and the vi would hang. Looks like I wasn't diligent enough checking my code as it evolved.
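
The "remove the timeout" fix, plus the shutdown concern that prompted the timeout in the first place, might look like this in Python terms (a sketch only; a sentinel value stands in for releasing the queue, which in LabVIEW unblocks a waiting Dequeue Element with an error):

```python
import queue

SHUTDOWN = object()  # sentinel: plays the role of releasing the queue reference

def display_loop(data_q: queue.Queue, update_chart) -> None:
    """Blocking dequeue: no timeout, so no spinning and no race with the counter."""
    count = 0
    while True:
        point = data_q.get()      # blocks until the producer sends something
        if point is SHUTDOWN:
            break                 # producer is done; exit instead of hanging
        if count % 20 == 19:
            update_chart(point)
        count += 1
```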

You are in fact putting all the data onto 1 un-named queue.

I'm pretty sure this is incorrect, unless I have a fundamental misunderstanding of how the obtain queue prim works. I should be getting four different unnamed queues, not four instances of the same unnamed queue. Probing the queue wires gives me four unique queue refnums and the code runs correctly when I change the top loop as shown here. Regardless, your point about the race condition was spot on.

[Attached image: the modified top loop]

Link to comment

Here is a better model of my app. In particular, the case structures in the display loop were the quick and dirty way I decimated the data. (Friday afternoon I replaced that with time-based decimation, but I think this is sufficient for the model.)
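
The time-based decimation mentioned above amounts to forwarding at most one point per display interval, regardless of how fast data arrives. A minimal Python sketch (the interval value is illustrative; data_q and update_chart are placeholders):

```python
import time

DISPLAY_INTERVAL_S = 0.066  # roughly 15 chart updates per second (illustrative)

def display_loop(data_q, update_chart) -> None:
    """Forward at most one point per display interval to the chart."""
    last_update = 0.0
    while True:
        point = data_q.get()
        now = time.monotonic()
        if now - last_update >= DISPLAY_INTERVAL_S:
            update_chart(point)
            last_update = now
```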

Some of the differences between my app and this model are:

-Data is collected from a Beagle I2C monitor.

-The data collection module runs in a dynamically launched actor state machine.

-App has more threads... data collection, user input, chart update, mediator, etc.

-Uses my object-based messaging library to pass messages between the loops, so there is a (tiny?) bit more overhead with the boxing/unboxing.

-Each of the four I2C ICs outputs data at a slightly different rate. The data collection module sends the data out as soon as it gets it from the Beagle. Overall I receive ~1250 data packets/sec.

I added a multiplier to simulate changes to the source signals. In the app, sometimes source signal changes don't show up on the chart until several seconds have passed, even though the receive queue is empty. I haven't been able to recreate that condition in this model. I also added code to monitor the queue size. Sometimes, inexplicably, the data queue in the app will continuously grow. The only way to fix it is to close the project. (I posted a bug about that recently.) That hasn't occurred in my model either.

WaveformStressTest2.vi

Link to comment

I'm pretty sure this is incorrect, unless I have a fundamental misunderstanding of how the obtain queue prim works. I should be getting four different unnamed queues, not four instances of the same unnamed queue. Probing the queue wires gives me four unique queue refnums and the code runs correctly when I change the top loop as shown here. Regardless, your point about the race condition was spot on.

Probably. I don't use un-named queues since I have a single VI that wraps the queue primitives and ensures there are no resource leaks (and I don't like all the wires going everywhere).

Are you still getting differences in the graphs? When I run your example, I see no difference in CPU usage between scrolling and not (~5-10%), although I haven't looked at all the "Questions".

Link to comment

I don't use un-named queues...

Really? I don't use named queues since that opens a hole in my encapsulation.

Are you still getting differences in the graphs?

Yep. With the last vi I posted the load is <2% prescrolling and ~8% while scrolling if I enable decimation. When I disable decimation the prescrolling load is ~5% and scrolling load is ~30%.

My app does use decimation though, so the question I'm facing now is: why does repainting the model use so much less cpu time than repainting the app? They do essentially the same thing, except the app's functional components are wrapped in more layers of abstraction. That question is going to have to wait until I get back to work on Monday.

Link to comment

Really? I don't use named queues since that opens a hole in my encapsulation.

Indeed. But our topologies (and philosophies) are entirely different.

Yep. With the last vi I posted the load is <2% prescrolling and ~8% while scrolling if I enable decimation. When I disable decimation the prescrolling load is ~5% and scrolling load is ~30%.

My app does use decimation though, so the question I'm facing now is: why does repainting the model use so much less cpu time than repainting the app? They do essentially the same thing, except the app's functional components are wrapped in more layers of abstraction. That question is going to have to wait until I get back to work on Monday.

Ahhh. I see what you're getting at now. Yes, they will be different because you have to redraw more data, more of the control, and more often.

It's well known that redraws are expensive (put a control over your graph or make them all twice as large and see what happens). And it is not just LabVIEW. The major difference is in what and how much has to be updated. If you (for example) turn off the scaling (visibility) and keep the data-window the same size, you will see that the CPU use drops a little when it is scrolling. Presumably this is because it no longer has to redraw the entire control; only the data window (just guessing). You are much better off updating a graphical display in "chunks" so you need to refresh less often. After all, it's only to give the user something to look at whilst drinking coffee.
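
Updating in chunks amounts to buffering points and writing them to the chart as a batch, so the control repaints once per chunk instead of once per point. A rough Python sketch (chunk size is illustrative; data_q and update_chart are placeholders):

```python
import queue

CHUNK_SIZE = 50  # points per redraw (illustrative)

def chunked_display_loop(data_q: queue.Queue, update_chart) -> None:
    """Buffer incoming points and redraw once per chunk."""
    buffer = []
    while True:
        buffer.append(data_q.get())
        if len(buffer) >= CHUNK_SIZE:
            update_chart(buffer)  # one redraw for the whole chunk
            buffer = []
```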

Humans can only process so much graphical data (and only so much data can be represented due to pixel size), so there is no need to show every data point of a 10,000-point graph. It's meaningless; it just looks like noise! But I bet you want the user to be able to zoom in and out, eh?

Edited by ShaunR
Link to comment

It's well known that redraws are expensive

I knew they were expensive--I didn't know they were *that* expensive. I wouldn't normally consider 15 pts/sec on four charts an excessive amount of data. Running the decimated case on my work computer loaded the cpu ~20% when scrolling the data.

If you (for example) turn off the scaling (visibility) and keep the data-window the same size, you will see that the CPU use drops a little when it is scrolling.

Holy missing processor load Batman! Turning off y-axis autoscaling on each of the four charts reduced the load from ~20% to less than 5%. (Too bad it's already turned off on my app...)

Link to comment

I knew they were expensive--I didn't know they were *that* expensive. I wouldn't normally consider 15 pts/sec on four charts an excessive amount of data. Running the decimated case on my work computer loaded the cpu ~20% when scrolling the data.

Holy missing processor load Batman! Turning off y-axis autoscaling on each of the four charts reduced the load from ~20% to less than 5%. (Too bad it's already turned off on my app...)

I only see a huge CPU jump when not decimating, so it's more like 333 points (or 333 redraws/sec). If you think 20% is bad, wait until you've tried the vision stuff.

Link to comment
