Posts posted by ShaunR
-
Hm... the Uninit is going to cause a race condition problem. If we just deallocate the pointers, everything crashes if there's a read still in progress. Either we add a mutex guard for delete -- which blows the whole point of this exercise -- or delete is something that can only execute when Write has written some "I'm done" sentinel value and the Reads have all acknowledged it.
And we have a second problem of guarding against two writes on the same refnum. They can't be allowed to proceed in parallel, but we have no guards against that.
Any proposals for how to prevent two writes from happening at the same time on the same refnum? (Or two reads on the same refnum?)
I don't understand what you are getting at here. Refnums?
The disruptor pattern can cope with multiple writers (ours currently can't). They use a two-stage process where a writer "claims" a slot before writing to it. What's the problem with having multiple readers? (Isn't that the point!)
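For anyone following along, here is a rough C sketch (not LabVIEW, and not the actual Disruptor code) of that two-stage "claim then publish" write. All the names are illustrative; the real Disruptor adds cache-line padding and wait strategies on top of this idea.

```c
#include <stdatomic.h>

/* Illustrative sketch of the Disruptor's two-stage multi-writer protocol:
 * each writer atomically claims a unique slot number with fetch-and-add,
 * writes its data into that slot, then waits to publish in claim order so
 * readers always observe a contiguous published sequence. */
static _Atomic long long g_claim   = 0;  /* next slot to hand out          */
static _Atomic long long g_publish = 0;  /* first slot NOT yet visible     */

static long long writer_claim(void)
{
    return atomic_fetch_add(&g_claim, 1);    /* unique slot per writer */
}

static void writer_publish(long long slot)
{
    /* wait until every earlier claim has been published, then publish ours */
    while (atomic_load(&g_publish) != slot)
        ;                                    /* spin (or yield) */
    atomic_store(&g_publish, slot + 1);
}
```

This is the part our current single-writer prototype skips entirely: with only one writer, "claim" and "publish" collapse into the one cursor increment.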
We effectively have a reference counter (the number of readers), and we have a method of skipping in the writer (the bool field of the pointer indexes, which the readers could also use so they don't go ahead and read data; the manager only manipulates that field), so we would only deallocate the pointer when the last reader is unregistered. The deinit would effectively iterate through, unregistering all the readers (it becomes an "unregister all readers, then kill the pointer" rather than just a pointer killer). I think we just need a NOP for the readers (they must not read until the registration manager has given them an index slot and they know the current cursor position). Don't know. Just thinking on my feet at the moment, but it "seems" doable, and I think it might give us the same sort of feature as the queues when the handle is destroyed (error out).
Is there a maximum amount that the indexes can be negative, so he can use even more negative numbers? Or some mathematical game like that?
Not really. I have visions of fixing the i64 cursor wrap-around problem mathematically by using the fact that it goes negative but still increases (gets less negative; minus + minus = plus, sort of thing). As I haven't fixed it yet, you could use a negative number, as you will hit the same problem at the same point, and everything starts at zero.
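The "mathematical game" that makes wrap-around a non-problem is modular arithmetic: if the cursors are treated as unsigned counters that are simply allowed to wrap, the *difference* between two cursors stays correct across the wrap point, even though a direct "writer > reader" comparison breaks there. A minimal C sketch of the idea (names are mine, not anything in the prototype):

```c
#include <stdint.h>

/* Wrap-safe cursor arithmetic: with unsigned 64-bit cursors, subtraction is
 * performed modulo 2^64, so (writer - reader) yields the true number of
 * unread elements even after the writer's cursor has wrapped past zero.
 * This assumes the reader never runs ahead of the writer. */
static uint64_t cursor_diff(uint64_t writer, uint64_t reader)
{
    return writer - reader;   /* modulo 2^64: valid across wrap-around */
}

/* A reader may proceed only when at least one unread element exists. */
static int reader_has_data(uint64_t writer, uint64_t reader)
{
    return cursor_diff(writer, reader) > 0;
}
```

The same trick works with i64 as long as every comparison is done on differences rather than on the raw cursor values.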
But you would need to acquire the mutex in every read and every write just in case a resize tried to happen.
The only time you know only one operation is proceeding is when the buffer is empty (all current readers have read all enqueued items) and the write operation is called (assuming we figure out a way to prevent multiple parallel writers on the same refnum, which is solvable if we can find some way to expose the atomic compare-and-swap instruction, which might require changes for the LV 2014 runtime engine). That's the only bottleneck operation in this system. In that moment, the write operation could resize the read buffer and then top-swap a new handle for the read indexes into the block of pointers in a single atomic instruction. So if the Write VI has a mode for Resize (and likely for Uninit, as I mentioned in my previous post) then you could resize.
I don't see any other way to handle the resize of the number of readers. Anyone else attempting to swap out the pointer blocks faces the race condition of someone trying to increment a block and then having their increment overwritten by a parallel operation.
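A rough C11 sketch of that "build aside, then publish atomically" resize (illustrative only; the real problem of when it is safe to free the old block -- the quiescence question raised above -- is deliberately left out):

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* The writer (the single bottleneck operation) builds a larger cursor block
 * off to the side, then publishes it with one atomic pointer exchange.
 * Readers must load the current block pointer atomically before indexing
 * into it. Freeing the OLD block still needs a scheme ensuring no reader
 * is mid-access (e.g. only free once all readers acknowledge the swap). */
typedef struct {
    size_t    slots;
    long long cursor[];          /* one cursor per reader slot */
} cursor_block;

static _Atomic(cursor_block *) g_block;

static void init_block(size_t slots)
{
    cursor_block *b = malloc(sizeof *b + slots * sizeof(long long));
    b->slots = slots;
    for (size_t i = 0; i < slots; i++)
        b->cursor[i] = -1;                    /* negative = unused slot */
    atomic_store(&g_block, b);
}

/* Returns the old block so the caller can free it once it is quiescent. */
static cursor_block *resize_block(size_t new_slots)
{
    cursor_block *old  = atomic_load(&g_block);
    cursor_block *next = malloc(sizeof *next + new_slots * sizeof(long long));
    next->slots = new_slots;
    memcpy(next->cursor, old->cursor, old->slots * sizeof(long long));
    for (size_t i = old->slots; i < new_slots; i++)
        next->cursor[i] = -1;                 /* new slots start inactive */
    return atomic_exchange(&g_block, next);   /* the single-instruction swap */
}
```

Growing is easy this way because old indexes are copied verbatim; shrinking/reorganising is the part that would invalidate indexes readers already hold.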
Well. I think we need to look at the Disruptor code again to see how they handle it (they probably have the same problem). If we can't think of a way to reorganise, we can at least dynamically increase, as that doesn't overwrite existing indexes; it just adds new ones and disables old ones (haven't seen a realloc, but have seen functions to resize handles... somewhere). So we could create, say, 5 at a time and increase in blocks. We are back to the MJE kind of approach which I first outlined with the flags, then. (I did say IF we were smart.)
I think this is where we need to revisit the disruptor pattern details. They talk about consumer barriers, claiming slots and getting the "highest" index rather than just seeing if it is greater (as we currently do). I think we have the circular buffer nailed. It's the management and interfaces we are beginning to brainstorm now and they must have solved these already.
-
With Version 3 my dual-core machine gives the following result when the Buffer size is smaller than the number of iterations:
There is still a long wait at an interval determined by the buffer size, but it doesn't usually occur from the start of the run. Both the iteration time and its variability seem to reduce once the long waits start.
In the writer there is a disable structure. If you disable the currently enabled (and enable the other) it will use the previous index reading method. Does this affect the result you see?
-
So I have been trying to get the OpenG zip library working on a cRIO (VxWorks) and I think I have finally got it to work, but not before finding a nasty little bug. If any of the file path controls use an uppercase 'C' for the drive letter (like "C:\ni-rt\system") the zip functions will fail to find the files in question (the open/create returns an error code of 7). Now maybe I am missing something, but this took me about an hour to diagnose, so I thought I would share in case anyone else runs into this problem. I have no idea if the bug is a problem on any other systems (Windows or Pharlap) but I suspect it is not. Hopefully it is an easy fix in the code. Thanks for getting zip functionality to work on the cRIO!
Stephen
VxWorks paths are case sensitive (Windows paths aren't, by default).
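So one defensive workaround, until it's fixed in the library, is to normalise the drive letter before the path ever reaches the zip functions. A tiny C sketch of the idea (the function name is mine, not part of the OpenG API, and I'm assuming lowercase is the form VxWorks accepts, per the report above):

```c
#include <ctype.h>

/* Lower-case the drive letter of a "X:..." style path in place, so the
 * case-sensitive VxWorks file system sees the expected form. Does nothing
 * for paths that don't start with a drive-letter-plus-colon pair. */
static void normalise_drive_letter(char *path)
{
    if (path && path[0] != '\0' && path[1] == ':')
        path[0] = (char)tolower((unsigned char)path[0]);
}
```

In LabVIEW terms this would just be a string manipulation on the path before calling Open/Create.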
-
Hi Shaun,
This should put me in a good position to fix the TCP bit in the middle at a later stage.
Yes. That's the tricky bit, though, since you cannot serialise LabVIEW objects, so it's not just a case of "(un)Flatten To String" to get them across a network.
This thread on "Lapdog over the network" will highlight most of the issues (and solutions).
-
I have a half-implemented version that does something similar. It works by pre-allocating an array of reader cursors representing the positions of each reader. The array size represents the max number of registered readers. All the cursors start as negative, indicating nothing is registered for that index. The writer init returns with no registered readers. Reader initialization/registration involves finding an available cursor (negative) and setting it to a valid starting value (either zero or some offset, depending on where in the lifecycle the writer is). This has forced the init/release mechanisms to use a lock to ensure atomic operations on the cursor array, avoiding race conditions during the read-write sequence, but looping on the cursors during enqueue/dequeue operations can happen without a lock, as in Shaun's examples. Releasing a reader involves setting its cursor negative, allowing that cursor index to be reused if necessary.
You can't use a negative number with this because the indexes can be negative (strange, I know, but it is to do with wrap-around).
I'm thinking that we could have a boolean with the R and Cnt (a block is converted to a cluster for processing; see CB GetIndexes3.vi) which tells the index iterator (in the write) whether that index should be considered in the "lowest" check. This would have very little impact on performance (read a few more bytes in the block at a time and two AND operations).
Then you would need a "registration manager". That would be responsible for finding out how many are currently registered, updating the registered count, and choosing a slot whose flag is False to give to the reader that wants to register (not dissimilar to what you are currently doing). The additional difference would be that it is also capable of resizing the array if there are not enough index slots. It could even shrink the array and reorganise, if we wanted to be really smart, so that the index iterator doesn't process any inactive slots, doing away with the choosing of an inactive slot altogether. This only ever needs to happen when something registers/unregisters, so even if this is locking, it will not have much of an impact during general running (now we are starting to get to the Disruptor pattern).
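Here's what that flag-aware "lowest" check could look like, written out in C for clarity (field names are illustrative; in the prototype this is the index iterator inside the write VI):

```c
#include <limits.h>

/* Each slot carries the proposed boolean alongside its cursor; the writer
 * scans the block and only considers slots whose flag is set, so inactive
 * or unregistered readers can never stall the writer. */
typedef struct {
    int       active;     /* the boolean next to R and Cnt */
    long long cursor;
} reader_slot;

static long long lowest_active_cursor(const reader_slot *s, int n)
{
    long long lowest = LLONG_MAX;   /* sentinel: "no active readers" */
    for (int i = 0; i < n; i++)
        if (s[i].active && s[i].cursor < lowest)
            lowest = s[i].cursor;
    return lowest;
}
```

The per-slot cost is exactly the "few more bytes plus an AND" mentioned above, so the hot path barely changes.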
-
Still need to propose a solution for handling abort. The DLL nodes have cleanup callbacks that you can register, but that's not going to help us here. If you're going to do it in pure G, I think you have to launch some sort of async VI that monitors for the first VI finishing its execution and then cleans up the references if and only if that async VI was not already terminated by the uninit VI. Either that or the whole system has to move to a C DLL, which means recompiling the code for each target. Undesirable. Anyone have a better proposal for not leaking memory when the user hits the abort button?
Yeah. Don't want to go to a DLL. Been there, done that, torn holes in all my T-shirts. Is there no way we can utilise the callbacks? (I thought that was what the InstanceDataPointer was for.)
I do. When you do Init, wire in an integer for the number of readers. Have the output be a single writer refnum and N reader refnums. If you make the modification to get rid of the First Call node and the feedback nodes and carry that state in the reader refnums, then you can't have an increase in the number of readers.
That wouldn't work well for my use cases since I don't know how many I need up front (could be 10, could be 30 and may change during the program lifetime as modules are loaded and unloaded->plugins).
The rest (about first call etc.) is valid. It really needs to be in the next layer up, but as that doesn't exist yet, it's in the current read and writes. What we (or more specifically, I) don't want is for them to be reliant on shift registers in the user's program to maintain the state info. How that is handled will depend on whether the next layer is class based or classic LabVIEW based, which is why I haven't broached it yet. At that layer, I don't envisage a user-selectable "Init". More that the init is invoked on a read or write depending on what gets there first (self-initialising). I ultimately want to get to a point where you just slap a write VI down and slap read VIs wherever you need them (even in different diagrams) without having to connect wires (see my queue wrapper VI for how this works, although I'm not sure how it will in this case, ATM, since the queue wrapper uses names for multiple instances).
-
Well, dynamic registration, unless you forbid a once-registered reader to unregister, makes everything quite a bit more complex. Since then you can get holes in the index array that would block the writer at some point, or you have to add an intermediate refnum/index translator that translates the static refnum index a reader gets when registering into a correct index into the potentially changing index array. I'm not sure this is worth the hassle, as it may well destroy any time benefits you have achieved with the other ingenious code.
Indeed. I know the issues and have some ideas (the easiest of which is to have a boolean in the index array cluster alongside the R and Cnt). Do you have a suggestion/method/example?
-
ShaunR: I'd like you to try a test to see how it impacts performance. I did some digging and interviews of compiler team. In the Read and the Write VIs, there are While Loops. Please add a Wait Ms primitive in each of those While Loops and wire it with a zero constant. Add it in a Flat Sequence Structure AFTER all the contents of the loop have executed, and do it unconditionally. See if that has any impact on throughput.
Done. MJE mentioned this previously and, until now, I have resisted since I was more concerned with absolute speed performance. But it has nasty side effects on thread-constrained systems (dual cores etc.) unless they can yield. You lose the performance of using subroutine priority on constrained CPUs, and spread obviously increases as we give LabVIEW the chance to interfere. But nowhere near as much as with the queue or event primitives. So you see an increase in mean and STD, but the median and quartile (which I think are better indicators, as they are robust to spurii) remain pretty much unchanged.
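For those not following the LabVIEW detail: the Wait Ms (0) is functionally a yield point in the polling loops. The C analogue would be something like this (illustrative; `sched_yield` plays the role of the zero-millisecond wait):

```c
#include <sched.h>
#include <stdatomic.h>

/* Polling loop with a yield on every failed check: on a thread-constrained
 * machine (dual core, etc.) a pure spin can starve the other party, so each
 * miss hands the rest of the timeslice back to the scheduler -- the same
 * effect as dropping a "Wait Ms" wired with 0 into the While Loop. */
static long long poll_until_greater(_Atomic long long *writer_cursor,
                                    long long my_cursor)
{
    long long w;
    while ((w = atomic_load(writer_cursor)) <= my_cursor)
        sched_yield();        /* the 0 ms wait: give another thread a turn */
    return w;
}
```

The cost is exactly the spread increase described above: the scheduler gets a chance to interfere on every iteration, but nobody can be starved.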
I've also now modified the index reader in the write to be able to cope with an arbitrary number of multiple readers. We just need to think how we store and manage registration (a reader jumps on board when the writer is halfway through the buffer, for example).
Also added the event version of the test harnesses.
Version 3 is here. After this version I will be switching to a normal versioning number system (x.x.x style) as I think we are near the end of prototyping.
-
Take a look at Dispatcher in the CR.
It may provide most of your TCP/IP architecture, as it is publisher/subscriber and can cope with many clients/servers.
-
Ok... the hang when restarting your code is coming from the First Call? and the Feedback Nodes. I'm adding allocation to your cluster to store those values. It needs a "First Call?" boolean and N "last read index" values, where N is the number of reader processes.
The first call (in the write) is because when everything starts up, all indexes are zero. The readers only continue when the write index is greater, so they sit there until the cursor index is 1. The writer has to check that the lowest reader index isn't the same as its current write index. At start-up, when everything is zero, it is the same; therefore, if you don't have the First Call, it will hang on start. Once the writer and reader indexes get out of sync, everything is fine (until the I64 wraps around, of course; that needs to be addressed). If you have a situation where you reset all the indexes and the cursor back to zero AND it is not the first call, it will hang, as neither the readers nor the writer can proceed.
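Written as predicates, the mutual stall is easy to see (a C sketch of the logic described above; names are illustrative):

```c
/* Start-up rule of the prototype: a reader proceeds only when the write
 * cursor has moved past its own index, and the writer proceeds only when
 * it would not land on the slowest reader's index. With every index at
 * zero, BOTH predicates are false -- the deadlock that the "First Call?"
 * special case exists to break. */
static int reader_may_read(long long write_cursor, long long read_cursor)
{
    return write_cursor > read_cursor;
}

static int writer_may_write(long long write_cursor, long long lowest_reader,
                            int first_call)
{
    return first_call || write_cursor != lowest_reader;
}
```

Carrying `first_call` in the refnum cluster instead of a First Call? node is what makes restart (and reset-to-zero) behave.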
-
I was going to suggest using a User Event, which I'd always thought of as a one-to-many framework. Just tried it, and this "works" with a time of 500 ms, and with proper parallel execution, but profiling shows that the data is copied in one of the event structure readers. It looks as though multiple registrations set up separate event "queues", and Generate User Event then sends the data independently to each one. I had never realised it behaved like this, though thinking about it now, I guess it's not surprising.
Indeed. Events are a more natural topological fit (as I think I mentioned many posts ago). Initially, I was afraid that the OS would interfere more with events than queues (not knowing exactly how they work under the hood, but I did know they each had their own queue). For completeness, I'll add an event test harness so we can explore all the different options.
-
The processes are parallel -- the computation done on each value is done in parallel with the computation done on the other value. The only reason you can make the argument that they aren't parallel is because you have contrived that one operates on the odd values and the other operates on the even values. If you have a situation like that, just stick the odd values in one queue and the even values in another queue at the point of Enqueue. Now, I realize you can make the system "both loops execute on 2/3rds of the items, with some overlap", or any other fractional division... as you move that closer to 100%, the balance moves toward the copy being the overhead. My argument has been -- and remains -- that for any real system that is really acting on every value, the single queue wins.
If you have two processes that have dissimilar execution, then just arrange the first one first in a cascade, like this (which does run at 500 ms):
AQ1_CascadeQueue.vi
I disagree. The processes are most certainly not parallel, even though your computations are (as is seen from the time).
In your second example you are attempting to "pipeline" and are now using 2 queues (I say attempting because it doesn't quite work in this instance).
You are
a) only processing 100 values instead of all 200 (the true case in the bottom loop never gets executed);
b) lucky there is nothing in the other cases, because pipelining is "hybrid" serial (they have something to say about that in the video);
c) lucky that the shortest is last (try swapping the 5 with the -1 so -1 is on top, with a) fixed -> back to 600 ms);
d) no different from my test harness (effectively) if you place the second enqueue straight after the bottom loop's dequeue instead of after the case statement (which fixes c).
-
What's special on July 4?
Will Smith beat up some aliens, apparently.
-
-
I think two issues are being conflated here. Everything AQ posted with respect to objects is of course correct. However regardless of the nature of an object's structure the data needs to be brought into some scope for it to be operated on. It's that scope that's causing the problem.
Using a global variable will demand copying the whole object structure before you can operate to get your local copy of the single buffer element.
Using a DVR or a FGV demands an implicit lock, either via the IPE required to operate on the DVR or the nature of a non-reentrant VI for the FGV. So while a class does not have any built in reference or exclusion mechanics, pulling that data into some useful scope such that it can be operated on does.
The same issue is what has prevented me from posting an example of how to pull a single element from a buffer without demanding a copy of the entire buffer or running through some sort of mutual exclusion lock. Short of using the memory manager functions as Shaun has already demonstrated I don't see how to do it. I know there are flaws with the memory manager method, I just don't see an alternative without inducing copies or locks.
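To make the scope problem concrete: the memory-manager route works because it copies only the element you ask for, not the container. A C analogue (illustrative; `memcpy` stands in for LabVIEW's exported `MoveBlock`, and all the synchronisation the real buffer needs is deliberately omitted):

```c
#include <string.h>

/* Copy ONE element out of a shared buffer, given a pointer to the buffer.
 * The cost is sizeof(element), independent of how large the buffer is --
 * unlike reading a global/DVR/FGV copy of the whole array to reach one
 * element. This is the core of the MoveBlock approach Shaun demonstrated. */
static void copy_element_out(const double *buffer, size_t index, double *out)
{
    memcpy(out, &buffer[index], sizeof *out);   /* one element, not the array */
}
```

The flaw MJE alludes to is that nothing here stops the writer from stomping the element mid-copy; that safety is exactly what the cursor protocol in the rest of the thread provides.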
Well. You have managed to concisely put into a paragraph or two what I have been unsuccessfully trying to get across in about 3 pages of posts. (+1)
-
That's very nice, but don't hold your breath for it! I'm currently overloaded with more important things.
-
But without some header file that documents at least the parameter list, this is a painful process to figure out.
I have faith in you !
-
I'm still not sure this buys us anything in terms of your MoveBlock() (or SwapBlock()) attempt since the variant obviously needs to be constructed somewhere.
Indeed. However, a variant is constructed (a constant on the diagram), but from what AQ was saying about descriptors, a simple copy is not adequate, since the "void" variant has the wrong ones. I'm now wondering about LVVariantGetContents and LVVariantSetContents, which may be the correct way to modify a variant's data and cause the descriptor to be updated appropriately.
-
Yes. Doh!
That's why you do a top-swap instead of a move block. There's one allocated inside the variant in the first place, and you swap the [pointers/top-level data structure, which may or may not contain pointers depending upon the data type under discussion] so that the buffer still contains one.
Hmmmm. Not sure what this "top swap" actually is. Is that just swapping handles/pointers so basically the pointer in the buffer then points to the empty variant for a read? That would be fine for a write, but for a read the variant needs to stay in the buffer.
Can you demonstrate with my original example?
-
Here you go:
Not quite. The execution time is 500+100 (they aren't parallel processes), unlike the buffer, which is 500 (the greater of the two).
Try again please
-
"a disruptor implementation that copies an element out of a buffer N times for each of N processes running in parallel"
can ever beat
"a single queue that dequeues an element and hands it to N processes running in parallel without making a copy"
for any data type beyond basic numerics, and for basic numerics I doubt that the gains are particularly measurable. The data copy is just that expensive.
OK. Words aren't working.
What is your "single queue that dequeues an element and hands it to N processes running in parallel", implementation of this?: AQ1.vi
-
+1. Plenty for me to play with here. Ta very muchly. (Did you mean 4 or 8 bytes for the pointer size rather than 8 or 16?)
LvVariant is a C++ class. There is a reference count in the variant object, but I wouldn't expect it to come into play here... can't say for certain. When a To Variant primitive executes, LV does "newVariant = new Variant()". Then newVariant.mTD is assigned the data type object -- this is an object with its own fairly complex structure. Then the data on the wire is top-swapped into the variant's newVariant.mData field (To Variant's input is a stomper, so LV has already guaranteed that the data is independent of any other parallel branch and is thus free to be stomped on). There's also an mAttrs field that holds the attribute table, which is left as null unless/until an attribute is added to the variant. LvVariant does not have any virtual methods, so there is no vtable on the C++ object. But the actual data size of the LvVariant class is (1 + sz + sz + 4 + 4 + sz) bytes, where sz is 8 or 16 depending on how many bytes a pointer has on your system. The data is in the first of those pointers. The reference count is the first of the 4-byte fields. There's also a transaction count -- the second 4-byte number -- which is unused within LabVIEW.exe and is exported apparently for use with flex data; you can basically ignore that.
You'll never be able to change out the type descriptor -- that's the final sz in the size. Those are refcounted C++ objects, and you have to go through the constructors to instantiate them. So you could, in theory, change out the data within a variant, provided you started with a variant that had the type descriptor you needed already in it. You would need to either top-swap (swap, not move) the data into the variant or copy the data in, depending upon whether the source data is stompable. A variant claims right of destruction over the data pointer it contains.
Don't know if any of that helps you.
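Putting AQ's description into a (purely mental-model) C sketch, with the caveat he raised unanswered -- whether sz is really 8/16 or 4/8 -- and with all field names guessed from the post rather than taken from any real header:

```c
#include <stddef.h>

/* Hypothetical shape of LvVariant as described above -- NOT a usable header;
 * the real C++ class's padding and field order may differ:
 *
 *   class LvVariant {             // no virtual methods: no vtable
 *       char    flags;            // the leading 1 byte (purpose unknown)
 *       void   *mData;            // top-swapped data (1st pointer)
 *       void   *mAttrs;           // attribute table, null until used
 *       int32   refCount;         // first 4-byte field
 *       int32   transactionCount; // unused within LabVIEW.exe
 *       TD     *mTD;              // refcounted type descriptor (final sz)
 *   };
 */
static size_t lvvariant_data_size(size_t sz)   /* sz = "pointer size" */
{
    return 1 + sz + sz + 4 + 4 + sz;           /* AQ's stated formula */
}
```

Whatever sz turns out to be, the practical takeaways stand: the data pointer is swappable in principle, the type descriptor is not, and the variant owns (and will destroy) whatever mData points at.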
If the variant claims right of destruction, what happens to the copies in the buffer? Is this the reason why it crashes on deallocation?
-
The what? There are no locks around the private data of a class. It unbundles just the same as a cluster. I've got no idea what you're referring to here.
*blink* These are not equatable in any mental construct I have in my head. I cannot think of any architecture where a storage mechanism can be substituted for a data type or vice versa.
I am really and truly lost reading these posts.
I look at it very simplistically. The private data cluster is just a variable that is locked so you can only access it via public methods. An FGV is also a variable that is locked so you can only access it with the typedef'd methods. In my world, there is no difference between a class and an FGV as a storage mechanism.
ShaunR is looking for ways to achieve this in LabVIEW by cheating values onto the wires that LV thinks are separate data instances but are not actually.
I've got no idea what LabVIEW thinks (that's why your input is invaluable). I just want it to do what I need. If LabVIEW requires me to make a copy of an element so it is a data copy rather than shared, then I'm OK with that (LVCopyVariant, LVVariantGetContents, LVVariantSetContents et al. docs would be useful). But I'm not OK with a copy of an entire array for one element, which causes buffer-size-dependent performance. The only reason I have gone down this path is that a global copies an entire array so I can get to one element. If you have another way without using the LVMM functions, then I'm all ears. I just don't see any, though (and I've tried quite a few).
Throw me a bone here eh?
-
Indeed. But it seems to lock the entire array while it does it. When I experimented, it was one of the slowest methods.
Ahh, I think you got that wrong. Ever used the In Place Element structure? There you have an in-place replace of an array element without having to read and write the entire array.
Ahhhh. Those were the days [misty, wavy lines]. When GPFs were what happened to C programmers and memory management was remembering which meetings to avoid.
As to SwapBlock(), I can't remember the details, but I believe it is more or less an in-place swapping of the memory buffer contents, without any locking of any sort. As such it does more than MoveBlock(), which only copies from one buffer to the other, but not fundamentally more. That API comes from the LabVIEW stone age, when concurrent access was not an issue since all the multithreading in LabVIEW was handled through its own cooperative multithreading execution system, so there was in fact no chance that two competing LabVIEW modules could attempt to work on the same memory location without the LabVIEW programmer knowing it, if he cared. With preemptive multitasking you never have this guarantee, as your SwapBlock() call could be preempted anywhere in its execution.
One thing about SwapBlock() could be interesting: it operates on the buffers as 4-byte integers if both buffers are 4-byte aligned and the length is a multiple of 4 bytes.
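Going by that description, a functional sketch of what SwapBlock() presumably does would look like this in C (my reconstruction only -- the real exported function's behaviour and signature are undocumented, as noted below):

```c
#include <stdint.h>
#include <stddef.h>

/* Exchange the contents of two buffers in place, walking in 4-byte words
 * when both pointers are 4-byte aligned and the length is a multiple of 4,
 * falling back to bytes otherwise. Like the function it mimics, there is
 * NO locking: calling this on shared memory under preemptive threads is a
 * data race, exactly the hazard Rolf describes above. */
static void swap_block(void *a, void *b, size_t len)
{
    if (((uintptr_t)a % 4 == 0) && ((uintptr_t)b % 4 == 0) && (len % 4 == 0)) {
        uint32_t *pa = a, *pb = b;
        for (size_t i = 0; i < len / 4; i++) {
            uint32_t t = pa[i]; pa[i] = pb[i]; pb[i] = t;
        }
    } else {
        uint8_t *pa = a, *pb = b;
        for (size_t i = 0; i < len; i++) {
            uint8_t t = pa[i]; pa[i] = pb[i]; pb[i] = t;
        }
    }
}
```

Note the contrast with MoveBlock(src, dst, size), which copies one way and leaves the source untouched; swap is what preserves the "buffer still contains a valid value" invariant the top-swap discussion relies on.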
Where is this documented about SwapBlock (or anything for that matter)? I couldn't even find what arguments to use. I also found a load of variant functions (e.g. LVVariantCopy) but have no idea how to call them.
-
Now you have switched to byte arrays, you don't need a terminating string. But we still need to get rid of the flatten and unflatten, which are too slow.
Queues Vs Ring Buffers
in OpenG General Discussions
Posted
Hmmm. OK. The Elapsed Time.vi is set to subroutine. Does setting it to "Normal" help? (I don't have access to a dual-core machine ATM.)