
Loop Parallelization



Just playing around with loop parallelization and the high-resolution tick counter, and I got some interesting results. It looks like loop iterations can execute out of order (parallel loops are not really parallel, of course, since the CPU must serialize them at some level).

 

Not sure if this is all that meaningful but I found it interesting.  Here are some code snippets to play with.

 

post-2411-0-44545900-1376440636.png

 

post-2411-0-21880500-1376440670.png
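
For anyone who can't load the snippets, here's a rough Go analog of the experiment as I read it (the goroutine-per-chunk structure is my assumption for illustration, not what the VI actually compiles to): record a high-resolution timestamp in each iteration of a parallelized loop, then print the timestamps in iteration order.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func main() {
	const n = 64
	stamps := make([]int64, n) // one timestamp slot per iteration

	var wg sync.WaitGroup
	workers := runtime.NumCPU()
	chunk := n / workers

	// split the iterations into one contiguous chunk per worker
	for w := 0; w < workers; w++ {
		lo, hi := w*chunk, (w+1)*chunk
		if w == workers-1 {
			hi = n
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				stamps[i] = time.Now().UnixNano() // hi-res "tick count"
			}
		}(lo, hi)
	}
	wg.Wait()

	// Printing in iteration order shows the jagged, non-monotonic pattern
	// described above: later indices can carry earlier timestamps because
	// another worker stamped its chunk at the same time.
	for i, t := range stamps {
		fmt.Printf("iter %2d: %d\n", i, t)
	}
}
```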

(parallel loops are not really parallel, of course, since the CPU must serialize them at some level)

If you have multiple processors, as most machines do these days, they really are parallel.

John,

 

I can't get your snippet to work. It doesn't transform into code when I drop the .png onto the block diagram.

By the way, where do you find the high-resolution tick counter? (I use LV2013.)

 

Regards, Bjarne

Bjarne: I've heard, but have not confirmed, that some browsers have stopped transferring the metadata when you drag an image, for some reason. Security, I think, with people doing things like, oh, embedding runnable code in the metadata... like LV is doing. ;-) Try saving the PNG to disk and then adding it to your diagram.

If you have multiple processors, as most machines do these days, they really are parallel.

Yes, with multiple CPUs, you get some parallelism. The interesting thing is how you sometimes get a jagged line, where elements that come later in the array were created earlier in time.

I expected more of a staircase effect where the first 4 elements (I have 4 CPUs) would have the same timestamp and the next 4 would have a later timestamp, and so on. It appears that instead the compiler is breaking the loop up into sections of some predetermined length and then giving those sections to different CPUs. The length of the sections also appears to decrease with each iteration of processing, as the jagged line becomes less pronounced as you approach the end. It also appears that the processing on each CPU does not start at the same time but is staggered, as if the sections were determined and the first CPU started working on its section while the second section was still being handed off to the second CPU.

Adding the 0ms delay did allow the CPUs' processing to line up, or so it appears.

This is, of course, all conjecture from just observing the output.  I just found this very curious.

 

Bjarne: I've heard, but have not confirmed, that some browsers have stopped transferring the metadata when you drag an image, for some reason. Security, I think, with people doing things like, oh, embedding runnable code in the metadata... like LV is doing. ;-) Try saving the PNG to disk and then adding it to your diagram.

I have tried dragging the original PNG onto a LV diagram and it generated code just fine. But saving it to disk from LAVA and doing the same did not work. So the stripping might have happened on my end (I'm using the Chrome browser).

Here are the VIs, FWIW.

loop parallelization.vi

loop parallelization with delay.vi

I expected more of a staircase effect where the first 4 elements (I have 4 CPUs) would have the same timestamp and the next 4 would have a later timestamp, and so on.

Why would you expect that? Are you doing your testing on a real-time operating system? If not, there's a huge amount of jitter. This code will NOT time how long it takes to do a Subtract (even if we assume that the Tick Count primitive had infinite precision, which it doesn't):

post-5877-0-11223200-1376506107.png

It will only tell you how long it takes to do the Subtract IF the whole Flat Sequence executes without the thread being swapped out. It could be that the first Tick Count executes, then the thread swaps out, then the Subtract executes, then the thread swaps out, then the second Tick Count executes. There could be hours of time between those swaps (OK, probably not in practice, but in theory, yes) unless you're on a deterministic OS and that code is running in the highest-priority thread.
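
Here's a minimal text-code sketch of that point (Go, illustrative only, not a rigorous benchmark): repeat the two-timestamps-around-a-subtract measurement many times and look at the spread.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	var min, max time.Duration = time.Hour, 0
	x := 0
	for i := 0; i < 100000; i++ {
		t0 := time.Now()
		x -= i // the "Subtract" under test
		d := time.Since(t0)
		if d < min {
			min = d
		}
		if d > max {
			max = d
		}
	}
	// max is routinely orders of magnitude larger than min, because some
	// samples include the thread being swapped out between the two reads:
	// we measured "subtract + whatever the OS did", not the subtract.
	fmt.Println(x, "min:", min, "max:", max)
}
```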

 

Even on a real-time OS, you'll run into additional problems, because parallel code can't run in the single highest-priority thread, by definition. So you would still have skew across your CPUs as they get serviced in various ways. But it is easier to close down all the other services of the system (removing any scans of network ports, for example).

 

Timing benchmarks for low-level components are not something you can do on your own without lots of training and, at the least, a real-time system. I don't have that training... I leave it to a team of experts here at NI when I need benchmark questions answered.


I was not really trying to benchmark anything. I actually came across this when trying to determine whether turning on loop parallelization was worthwhile in a separate project. I have a broadcast method that sends a message to multiple listeners. It does this in a for loop (looping through the various transports and putting the message on each one).

I added some code to time when the message was received at each destination.  I then compared the times.  After that, I turned on parallelization in the for loop and ran it again.  From what I could see in this limited test, there was no noticeable difference.
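
Here's a hypothetical sketch of that broadcast pattern (the listener queues and names are invented for illustration; this is not the actual transport code): each send in the loop is so cheap that parallelizing the loop changes the received timestamps by little more than scheduling noise.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const listeners = 4
	var wg sync.WaitGroup
	received := make([]time.Time, listeners)

	queues := make([]chan string, listeners)
	for i := range queues {
		queues[i] = make(chan string, 1)
		wg.Add(1)
		go func(i int) { // listener: stamp the arrival time
			defer wg.Done()
			<-queues[i]
			received[i] = time.Now()
		}(i)
	}

	for _, q := range queues { // the broadcast "for loop"
		q <- "message"
	}
	wg.Wait()

	for i, t := range received {
		fmt.Printf("listener %d received at %d\n", i, t.UnixNano())
	}
}
```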

That is what led me to the VI posted here and the observation I made. I know that the OS gets in the way and messes things up; that is quite clear if you run my VI multiple times and see the varying results. I was mostly just curious to understand what was going on behind the scenes that could explain the patterns that emerged. The randomness can be explained by the OS, but the patterns seem to indicate some underlying behavior. I read a bit about 'chunking' in the LV help but am not sure if that is what I am seeing in these patterns.

 

Anyway, just some curiosity, that is all. At least I have learned from this thread that posting VI snippets is a bad idea...

.... It appears that instead the compiler is breaking the loop up into sections of some predetermined length and then giving those sections to different CPUs. The length of the sections also appears to decrease with each iteration of processing, as the jagged line becomes less pronounced as you approach the end. ...

What you're talking about is called "chunk size".

Notice that when you enable loop parallelism, you get a radio button that defaults to "Automatically Partition Iterations". You can, however, change it to "Specify partitioning with chunk size (C) terminal".

If you specify the partitioning, you get to specify how many iterations are in each chunk. The compiler will then break the iterations up into chunks. Each "thread" (sometimes that means a processor, sometimes not) will execute its chunk, grab a new chunk from the pile waiting to be processed, and repeat. Obviously there is some overhead to this, which means you may not want to set the chunk size to 1: that gives you maximally parallel execution, but you get hit much harder by the overhead. You also don't want to set the chunk size to N (the total number of iterations), because then one thread will get assigned all the work and the others will just sit there waiting!

By default, LabVIEW uses large chunks at first, then smaller chunks toward the end. The idea is to minimize overhead at the beginning of the process and to minimize idle processors at the end. Honestly, 90% of the time it is probably best to leave it like this.
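
Here's a sketch of that "large chunks first, smaller chunks later" idea, in the spirit of guided scheduling (the shrinking formula is an assumption for illustration; it is not LabVIEW's actual partitioning algorithm):

```go
package main

import "fmt"

func main() {
	n := 100     // total iterations
	workers := 4 // P: number of processors pulling chunks

	// hand out a fraction of the remaining work each time, so chunks
	// start large (low overhead) and shrink toward 1 (little idle time)
	next := 0
	for next < n {
		remaining := n - next
		chunk := remaining / (2 * workers)
		if chunk < 1 {
			chunk = 1
		}
		fmt.Printf("chunk: iterations %d..%d (size %d)\n",
			next, next+chunk-1, chunk)
		next += chunk
	}
}
```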

Here's an example to illustrate why chunk size matters. Let's say you have 2 processors (meaning P=2) and you have 10 iterations to perform. Let's say that the default chunking algorithm gives each processor 3 iterations on the first go, and then the remaining 4 iterations are split into two chunks.

First we'll assume that each iteration takes exactly 100ms to execute. That means each processor will first get 300ms of work to do, then go back and pick up 200ms more work, so the total time to execute is 500ms. Easy.

Now let's say that each iteration takes exactly 100ms to execute, except for one randomly determined iteration, which takes 5000ms.

Making the assumption that the long iteration lands in one of the first set of chunks (the ones that contain 3 iterations each): when the processors execute their chunks, processor 1 will have 300ms of work and processor 2 will have 5200ms of work to do. While processor 2 is chugging through its work, processor 1 will finish its first chunk and ask for another; it'll get 200ms more work to do. After that 200ms, processor 2 is still chugging away (we're only 500ms from when we started), so processor 1 is assigned the last chunk, does its 200ms of work, and then sits idle. Finally, after 5200ms, processor 2 finishes and our loop stops execution.

So what if we specify a chunk size of 1? Best-case scenario: processor 1 or 2 gets the long iteration in the first chunk it picks up, so that processor works on it while the other processor chugs away on all the other chunks. That means one processor does 5000ms of work while the other does 900ms of work; they execute at the same time, and the loop takes a total of 5000ms. Now, if we get unlucky and the long iteration lands in one of the last two chunks, each processor will do 400ms of work before one of them hits the long iteration, so our total time of execution is 5400ms.

Obviously there are a few other ways this situation could play out, depending on which iteration gets the long instruction and which chunk it falls into. Just some food for thought.
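
To make the arithmetic easy to replay, here's a small simulation of the scenario above (same assumptions: 2 processors, ten 100ms iterations, one 5000ms outlier, with chunks handed in order to whichever processor frees up first):

```go
package main

import "fmt"

// totalTime plays out a schedule: chunks are handed out in order, each
// to whichever of two processors is free first, and the wall-clock time
// is the later of the two finish times.
func totalTime(iterMS []int, chunks [][]int) int {
	free := []int{0, 0} // time at which each processor becomes free
	for _, chunk := range chunks {
		p := 0
		if free[1] < free[0] {
			p = 1
		}
		for _, it := range chunk {
			free[p] += iterMS[it]
		}
	}
	if free[0] > free[1] {
		return free[0]
	}
	return free[1]
}

func main() {
	// ten iterations of 100ms, with one 5000ms outlier (index 9)
	iters := []int{100, 100, 100, 100, 100, 100, 100, 100, 100, 5000}

	// default-style partitioning from the example: chunks of 3, 3, 2, 2,
	// with the long iteration landing in the second chunk
	def := [][]int{{0, 1, 2}, {3, 4, 9}, {5, 6}, {7, 8}}
	fmt.Println("chunks 3/3/2/2, long in 2nd chunk:", totalTime(iters, def), "ms") // 5200

	// chunk size 1, long iteration picked up first (best case)
	best := [][]int{{9}, {0}, {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}}
	fmt.Println("chunk size 1, long first:", totalTime(iters, best), "ms") // 5000

	// chunk size 1, long iteration left for last (unlucky case)
	worst := [][]int{{0}, {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}}
	fmt.Println("chunk size 1, long last:", totalTime(iters, worst), "ms") // 5400
}
```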


Thanks for the explanation. I suspected something like that, but now it is clearer to me. It's always good to understand what is going on behind the scenes.

I read somewhere that the compiler will try to parallelize a for loop even if you don't turn on loop parallelization. If that is true, the main reason I can see to turn it on is to have the IDE detect whether the loop is not parallelizable based on its design (like iteration-to-iteration dependencies).

 

One place this can bite people is passing something through a loop using a shift register. In the past, I am sure we have all made the mistake of passing something through a for loop using simple tunnels instead of a shift register, and having the data on the wire get lost because in one situation the for loop executed zero iterations. Once you get bitten by this, you learn to always use a shift register to prevent that data (perhaps a reference, or something else that does not get modified in the loop) from being lost.

But this makes the loop un-parallelizable (is that even a word?), because the shift register implies that there can be an iteration-to-iteration dependency. The better design is to branch the wire before the for loop and send the new branch on to later functions, letting the other branch that enters the for loop die there. This of course causes a data copy, and we have been taught to avoid those when possible. But in this case I think it is the best solution if you wish to take full advantage of loop parallelization.
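
A loose Go analog of that point (LabVIEW dataflow doesn't map one-to-one onto text code, so treat this purely as illustration): a loop-carried value serializes the iterations, while a value branched before the loop and only read inside it leaves the iterations independent.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	data := []int{3, 1, 4, 1, 5}

	// "Shift register" style: each iteration depends on the previous
	// one, so the iterations cannot run in parallel.
	sum := 0
	for _, v := range data {
		sum += v
	}

	// "Branch the wire" style: the shared value is read-only inside the
	// loop, so every iteration is independent and can run concurrently.
	ref := "shared config" // branched before the loop, never modified
	out := make([]string, len(data))
	var wg sync.WaitGroup
	for i, v := range data {
		wg.Add(1)
		go func(i, v int) {
			defer wg.Done()
			out[i] = fmt.Sprintf("%s: %d", ref, v)
		}(i, v)
	}
	wg.Wait()

	fmt.Println(sum, out)
}
```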

 

Did that all make sense?  Or am I off in the weeds on this?

