
What can kill a queue?



I have run into a very strange problem. I am getting sporadic occurrences of an error with one of my queues. Here is the error:

Error 1122 occurred at Dequeue Element in Process GUI Events.vi:34->Engine 422.vi

Possible reason(s):

LabVIEW: Refnum became invalid while node waited for it.

The weird thing is, as far as I know, this can ONLY happen if the queue is destroyed in some parallel process while this VI is waiting for an element to be enqueued. But I have searched all the VIs, and the only one where the queue is destroyed is the cleanup VI that comes after this VI and is connected to it by the error wire. So there is no way that the cleanup VI could execute before the VI that is waiting.

I have a sneaking suspicion that there are some latent bugs in the queue feature. I have a large number of reentrant VIs running and I create a lot of unnamed queues that I pass inside a cluster to subVIs. So there are many, many instances of this queue (all unique, supposedly) that exist within each tree of reentrant VIs. I thought LabVIEW used a GUID to name unnamed queues so they could never step on each other, but maybe because I have so many, the 'name' is getting reused?

Any other ideas? I am at a total loss.

thanks,

-John

Link to comment

QUOTE (jlokanis @ Aug 29 2008, 04:04 PM)

Hi John,

I'm not sure if the following may be what is hitting you but you did ask for "other ideas"

When a VI is no longer running, all of the resources that were allocated by that VI are destroyed. That includes queues. So if the queue was created in a VI that goes idle, the queues it created are destroyed. The workaround is to make sure the VIs that created the queue don't go idle until after the queue is destroyed.

Ben

Link to comment

Thanks for the reply. That is definitely a way to cause a queue to be deallocated. In my case, however, I don't think that is possible. The structure of my code has a main VI that calls a subVI to create the queue and then passes the queue ref to another subVI that listens to the queue. When the listener quits, it passes its error cluster to the subVI that destroys the queue. Since all of these VIs are part of the main VI, I don't see how it is possible that the queue reference would be automatically removed from memory. The VI that gets the error is running as a subVI of the same VI that called the VI that created the queue.

The interesting thing is everything seems to work well for a long time and then it all goes to heck. As you can see from the error, the 'main.vi' has been spawned from a template 422 times and the reentrant subVI that got the error is one of 34 in memory right now, all listening to their own 'version' of this queue.

I think the LV engine gets 'confused' and screws this up. I can see many examples of this happening in various parts of my code, where queues either become invalid while waiting or are invalid when passed to a subVI, even though a release was never called and their creator VI is still in memory and supposedly still 'reserved for run'...

Perhaps there is some issue with all these VIs being reentrant? I only use the shared clones mode, but none of them has an uninitialized shift register...

Link to comment

I will say that Ben is probably right.

This has happened to me countless times, and every single time it was a lifetime issue.

Scenario Example:

Create Queue in a VI

Put Queue refnum in an LV2 global

Launch (asynchronously) other code that needs the Queue (the other code calls the LV2 global)

The VI that created the Queue stops --> the Queue refnum becomes invalid because LV garbage-collects it.

--> You get an error in your asynchronous code
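
For readers who think in text languages, the lifetime scenario above can be imitated roughly in Python. This is only an analogy under stated assumptions: `CreatorVI`, `OwnedQueue`, and `RefnumInvalid` are hypothetical names invented for this sketch, and LabVIEW's real cleanup is internal to the runtime rather than anything like this code.

```python
import queue
import threading


class RefnumInvalid(Exception):
    """Stands in for LabVIEW error 1122 in this analogy (hypothetical)."""


class OwnedQueue:
    """A queue that can be destroyed underneath a waiting reader."""

    def __init__(self):
        self._q = queue.Queue()
        self._destroyed = threading.Event()

    def destroy(self):
        self._destroyed.set()

    def dequeue(self, poll_s=0.01):
        # Wait for an element, but fail if the queue is destroyed meanwhile.
        while not self._destroyed.is_set():
            try:
                return self._q.get(timeout=poll_s)
            except queue.Empty:
                continue
        raise RefnumInvalid("Refnum became invalid while node waited for it.")


class CreatorVI:
    """Hypothetical stand-in for the VI that owns what it creates."""

    def __init__(self):
        self._owned = []

    def create_queue(self):
        q = OwnedQueue()
        self._owned.append(q)
        return q

    def go_idle(self):
        # When the creator stops running, everything it allocated is destroyed.
        for q in self._owned:
            q.destroy()


def run_scenario():
    creator = CreatorVI()
    q = creator.create_queue()
    result = {}

    def listener():
        # The asynchronously launched code that waits on the queue.
        try:
            q.dequeue()
            result["outcome"] = "got element"
        except RefnumInvalid as exc:
            result["outcome"] = str(exc)

    t = threading.Thread(target=listener)
    t.start()
    creator.go_idle()  # the creator finishes while the listener still waits
    t.join()
    return result["outcome"]
```

In this sketch nobody ever calls a release on the listener's side; the listener fails purely because the owner stopped, which is the pattern to look for in the real application.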

PJM

Link to comment

QUOTE (jlokanis @ Aug 29 2008, 01:56 PM)

John:

We also use plenty of queues and a smattering of reentrant VIs. We get error 1122 all the time, because we kill the queues on purpose to stop our processes, but it never happens unexpectedly.

Are you using the "destroy" input for the Release Queue function? You should not need to destroy the queues, just close all the references you open. (There are good times to set destroy=True, but don't just set it because you feel like it.)

If not, then I would try to set a breakpoint immediately after the enqueue or dequeue function which is throwing that error and then poke around to see which of your parallel VIs is still running.

Good luck.

Link to comment

We don't use a true GUID. We use a fixed count for the first several bits and a random value for the last few. In order to get any recycling of the unnamed queue IDs you would not only have to generate roughly 30 million queues, you would also need to get particularly (un)lucky on the other bits. That seems unlikely.
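
To make the arithmetic concrete, here is a minimal Python sketch of an ID built from a running count in the high bits and a random value in the low bits. The bit widths are assumptions made only for illustration: a 32-bit value with 25 counter bits is chosen because 2^25 is about 33.5 million, which is in the neighborhood of the "roughly 30 million" figure above. The actual layout inside LabVIEW is not stated in this thread, and `RefnumAllocator` is a hypothetical name.

```python
import random

COUNTER_BITS = 25  # assumption: 2**25 = 33,554,432 allocations before wrapping
RANDOM_BITS = 7    # assumption: the remaining bits of a 32-bit value


class RefnumAllocator:
    """Hypothetical allocator: counter in the high bits, random low bits."""

    def __init__(self, seed=None):
        self._count = 0
        self._rng = random.Random(seed)

    def allocate(self):
        counter = self._count % (1 << COUNTER_BITS)  # wraps after 2**25 IDs
        self._count += 1
        noise = self._rng.getrandbits(RANDOM_BITS)
        return (counter << RANDOM_BITS) | noise


def counter_part(refnum):
    """Extract the counter field from an ID produced by RefnumAllocator."""
    return refnum >> RANDOM_BITS
```

Under these assumed widths, two IDs can only be equal after the counter has wrapped and the random bits also happen to match (1 chance in 128 here), and the clash only matters if the older queue is still alive at that moment.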

Link to comment

QUOTE (jdunham @ Aug 29 2008, 04:07 PM)

I am only creating the queue in one place and destroying it in another. I do not 'obtain' an existing queue anywhere because I am using unnamed queues. I just pass the queue reference to the VIs that need it.

I do use force destroy, however. Maybe I should stop doing that, even though in this case it should not matter.

QUOTE (Aristos Queue @ Aug 29 2008, 04:18 PM)

We don't use a true GUID. We use a fixed count for the first several bits and a random value for the last few. In order to get any recycling of the unnamed queue IDs you would not only have to generate roughly 30 million queues, you would also need to get particularly (un)lucky on the other bits. That seems unlikely.

What about memory corruption? I notice that when this problem occurs, the whole app also starts to slow down AND memory usage starts to increase.

BTW: This problem only happens in the EXE deployed to a target machine and only after running for several days. So, I really have no way to debug it with breakpoints or anything. At least I log the errors to the event handler...

Link to comment
QUOTE (jlokanis @ Aug 29 2008, 06:25 PM)
What about memory corruption? I notice that when this problem occurs, the whole app also starts to slow down AND memory usage starts to increase.
That was the other thing I was going to say ... not only would you have to allocate millions of queues, you'd have to have them all continuously in play in order for the refnums to ever hit up against each other. Now if somewhere you're calling Obtain Queue and you're not calling Release Queue, you might be running your machine out of memory, and perhaps something strange is going on there (though I still can't imagine what would just cause the refnum to get deallocated).
Link to comment

QUOTE (Aristos Queue @ Aug 29 2008, 08:25 PM)

That was the other thing I was going to say ... not only would you have to allocate millions of queues, you'd have to have them all continuously in play in order for the refnums to ever hit up against each other. Now if somewhere you're calling Obtain Queue and you're not calling Release Queue, you might be running your machine out of memory, and perhaps something strange is going on there (though I still can't imagine what would just cause the refnum to get deallocated).

Well, I do not allocate millions of queues, but I do allocate thousands over the course of running the app.

My concern is this: I spawn many instances of a VIT to run a set of tests on a product. Each of these instances is composed completely of reentrant VIs (so they will not block each other). These reentrant VIs are all of the 'shared clone' type. They create the unnamed queues and pass them around to move data between parallel portions of the program. The only code in the entire app that can kill these queues is in the cleanup VI that is forced by dataflow to be the last thing executed by this spawned (from the VIT) vi.

Now, the launcher that spawns these VITs sets the spawned VI to Autoclose reference. So, the launcher is not responsible for dealing with this reference. When the spawned VI finishes execution, it will leave memory, as will all of its queues, notifiers, etc.

So what is confusing to me is this: if each spawned VIT creates its own queues (in subVIs) and then listens to the queues (in other subVIs), and the only code that can destroy those queue refs is also in a subVI of that VIT that is forced to execute last by dataflow, how could I ever get the error "Refnum became invalid while node waited for it"? Even if the VIT was stopped by an external VI, this error would never happen, and the code that logs the error to the event log would also not execute. So something is stepping on my queue refs. If it was memory corruption, then what could be causing it? When I see this, my app is using about 100MB. The machine has 4GB of RAM and no other apps are running.

I suspect that the 'shared clone' reentrant mode and queue refs have some latent bug.

Link to comment

QUOTE (jlokanis @ Aug 31 2008, 05:37 PM)

I suspect that the 'shared clone' reentrant mode and queue refs have some latent bug.
Without someone actually inspecting the code, there are no further recommendations that I can make, but those are two independent subsystems, so I would be very doubtful of a bug caused by their interaction. I don't rule out the possibility of a bug in something, but not that.

Can you post your code on ni.com for an AE to investigate? That's going to be the best way to get NI to push further on this.

PS: Even if your app is thousands of VIs, if you're able to share it with the AEs, they'll try to replicate the bug. There's an assumption that customers have to get their architectures down small before a bug will get investigated. But if you're convinced there's a bug in LV and only a huge application replicates it, then, please, submit the whole app if you can.

Link to comment

QUOTE (jlokanis @ Aug 31 2008, 06:37 PM)

...Now, the launcher that spawns these VITs sets the spawned VI to Autoclose reference. So, the launcher is not responsible for dealing with this reference. When the spawned VI finishes execution, it will leave memory, as will all of its queues, notifiers, etc.

...

Yes, I have shared apps with hundreds of VIs with support to demo bugs that only happen in large apps.

Is there a possibility the launcher is going idle?

Ben

Link to comment

QUOTE (Aristos Queue @ Aug 31 2008, 10:52 PM)

Can you post your code on ni.com for an AE to investigate? That's going to be the best way to get NI to push further on this.

I could zip up the code and send it to them, but they would not be able to run it. The code relies on a continuous flow of data between an in-house SQL server and the application. I suppose they could inspect it, but that would be about it.

Also, since I only see this error from the EXE version of the code and only after hundreds of units are tested (many days of testing), I suspect it would be nearly impossible. My only hope now is to have an NI engineer visit our site and see the code in action, or to find the bug myself. It would be great if I could find something I screwed up that is causing this, but I can't think of a single thing that could. I wish I knew of another condition that could cause the refs to become invalid...

Regarding the launcher, no, that part of the app continues to run even after the error. It displays the status of each of the spawned VITs and allows you to view their FP via another VI with a sub panel. I am able to interact with it even after the error is reported. Just not the VIT that reported the error.

Link to comment

QUOTE (jlokanis @ Sep 2 2008, 08:37 AM)

Also, since I only see this error from the EXE version of the code and only after 100's of units are tested (many days of testing), I suspect it would be nearly impossible.

Regarding the launcher, no, that part of the app continues to run even after the error. It displays the status of each of the spawned VITs and allows you to view their FP via another VI with a sub panel. I am able to interact with it even after the error is reported. Just not the VIT that reported the error.

Have you tried debugging the built executable? Enable Debugging when building the exe and try it out. It works great.

N.

Link to comment
  • 3 weeks later...

I have narrowed this down to what I think is a nasty latent bug in LabVIEW (I am using 8.5).

I have attached an image to show what is happening. What it boils down to is that LabVIEW is corrupting its own memory when running large parallel apps with a lot of shared clone VIs.

Here is an image of my code:

[Attached image: post-2411-1221770180.jpg]

Looking at the image, at step 1, an unnamed queue is created. At step 2 we wait for messages to appear. The wire I marked 3 shows that the cleanup code after the upper loop is data dependent on output from the lower loop. This means that the Release Queue marked 4 can never execute before the lower loop completes.

I get the following errors from this VI:

Error 1122 occurred at Dequeue Element in BIG-IP_TESTS_TM Get Scheduler Response.vi:21->BIG-IP_TESTS_TEST Call Scheduler.vi:9

Possible reason(s):

LabVIEW: Refnum became invalid while node waited for it.

And then:

Error 1 occurred at Release Queue in BIG-IP_TESTS_TM Get Scheduler Response.vi:21->BIG-IP_TESTS_TEST Call Scheduler.vi:9

Possible reason(s):

LabVIEW: An input parameter is invalid. For example if the input is a path, the path might contain a character not allowed by the OS such as ? or @.

This code is run thousands of times per day and we only see this happening 2 or 3 times per day. Since it is impossible for these errors to occur due to a coding issue, there must be a bug in LabVIEW causing it to destroy a queue reference outside this process. My guesses are: shared clones can sometimes step on each other *or* memory is being corrupted.

Note: This VI is set to 'Shared Clone' and there can be up to 40 instances of it running in different threads at once.

Link to comment

QUOTE (PJM_labview @ Sep 18 2008, 02:23 PM)

Just to make sure, you don't have a release queue in the true frame of the upper loop right?

PJM

There is only one release queue in the entire VI and that is marked with a 4 in the picture.

BTW: looking at the error log, I see that many refnums were corrupted throughout my program at the exact same time. I think there is a bug in the code that manages refnums. I have contacted NI for support on this, so we will see what they say.

Maybe it is time to move up to 8.6? Maybe I should not use shared clones?

Link to comment

QUOTE (jlokanis @ Sep 18 2008, 11:46 PM)

... I think there is a bug in the code that manages refnums...

Have you tried to use named Queues? Just naming the queues after the caller VI plus some additional numbering should be enough.

If memory is corrupted in shared clones, maybe named queues are better protected.
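
As a rough Python illustration of the naming idea (caller name plus a running number), something like the following could work. The registry, `make_queue_name`, `obtain_queue`, and `release_queue` are hypothetical helpers written for this sketch; they only mimic the idea of looking a queue up by a unique name and are not LabVIEW's API.

```python
import itertools
import queue
import threading

_registry = {}                  # name -> queue, like a table of named queues
_lock = threading.Lock()
_sequence = itertools.count(1)  # the "additional numbering"


def make_queue_name(caller_name):
    """Build a unique name from the caller's name plus a running number."""
    with _lock:
        return f"{caller_name}/{next(_sequence)}"


def obtain_queue(name):
    """Return the queue registered under `name`, creating it on first use."""
    with _lock:
        if name not in _registry:
            _registry[name] = queue.Queue()
        return _registry[name]


def release_queue(name):
    """Drop the registry entry; a later obtain creates a fresh queue."""
    with _lock:
        _registry.pop(name, None)
```

Each caller would generate its name once, for example `make_queue_name("Process GUI Events.vi:34")`, and pass that name (or the queue itself) to the code that needs it.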

/J

Link to comment

QUOTE (JFM @ Sep 18 2008, 10:28 PM)

Have you tried to use named Queues? Just naming the queues after the caller VI plus some additional numbering should be enough.

If memory is corrupted in shared clones, maybe named queues are better protected.

/J

This is a good idea; you could even use the clone instance name as the queue name since it is already unique.

PJM

Link to comment

QUOTE (PJM_labview @ Sep 19 2008, 09:31 AM)

This is a good idea; you could even use the clone instance name as the queue name since it is already unique.

PJM

I took this one step further and changed all my unnamed queues (except this one) to be named using a GUID (from a .NET call).

So far, this has not helped (other refnums are going 'invalid' at the same time as this one). I think there is a bug in the code that manages references where clones leaving memory can somehow kill the references in other clones of the same VI.

I have NI support digging into this now. I hope they can offer me a work-around.

Link to comment

Your VI makes a number of assumptions. It assumes that the second while loop can process the data in the time it takes the first while loop to acquire it. If it cannot, the number of elements in the queue will grow very quickly. Are you sure your second while loop can process the data as fast as it is acquired? Are you sure your CPU is fast enough to acquire and process the data simultaneously? Also, what thread are you running this VI in?

Naming the queue is a good idea. Also try declaring the size of the queue and checking whether the queue is full before you enqueue another element. My guess is that LabVIEW is not dequeuing the elements fast enough. I have had similar problems and was able to track the problem down by monitoring the queue status while the VI was running. Use the Get Queue Status function to monitor the queue.

If you don't declare the size of the queue, LabVIEW has to constantly allocate more memory for the newly acquired data.
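
The bounded-queue suggestion can be sketched in Python with the standard `queue` module. This is an analogy only: `try_enqueue` and `queue_status` are hypothetical helper names, and the size limit of 2 used in the usage line is arbitrary.

```python
import queue


def try_enqueue(q, item):
    """Enqueue without blocking; return False if the bounded queue is full."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False


def queue_status(q):
    """Snapshot of the backlog, for logging while the program runs."""
    return {"elements": q.qsize(), "max_size": q.maxsize, "full": q.full()}
```

A producer would create the queue with a fixed size, for example `q = queue.Queue(maxsize=2)`, and log `queue_status(q)` periodically. A backlog that keeps growing until `try_enqueue` starts returning False suggests the consumer is not keeping up.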

bluesky

QUOTE (jlokanis @ Sep 19 2008, 10:18 AM)

Link to comment

QUOTE (bluesky @ Sep 23 2008, 09:19 AM)

Thanks for the observations. In the case of this VI, the upper loop, or 'producer', is trying to read all the data from a TCP connection (in this case a remote serial port broadcasting at 19200 bps). The lower loop (consumer) is processing the data, looking for a certain string. I want to ensure that no data is missed, so the top loop has to run as fast as possible to avoid a buffer overrun. It reads data in large chunks but uses immediate mode with a timeout of 10 ms. So every 10 ms it reads all the data in the buffer and passes it to the lower loop. I have been using this code for a very long time (5+ years) without having a memory overrun issue.
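
For readers without the block diagram, the producer/consumer structure described above can be approximated in Python as follows. This is a simplified sketch under assumptions: the TCP reads are replaced by a list of string chunks, a `None` sentinel marks the end of the data, and `find_in_stream` is a hypothetical name. It is not a translation of the actual VI.

```python
import queue
import threading


def producer(chunks, q):
    """Upper loop: pass every chunk read from the source to the queue."""
    for chunk in chunks:  # stand-in for the 10 ms TCP reads
        q.put(chunk)
    q.put(None)           # sentinel: no more data


def consumer(q, target, result):
    """Lower loop: search the stream for `target`, even across chunks."""
    buffer = ""
    found = False
    while True:
        chunk = q.get()
        if chunk is None:
            break
        buffer += chunk
        if target in buffer:
            found = True
        # Keep only a tail long enough to catch a match split across chunks.
        buffer = buffer[-(len(target) - 1):] if len(target) > 1 else ""
    result["found"] = found


def find_in_stream(chunks, target):
    q = queue.Queue()
    result = {}
    c = threading.Thread(target=consumer, args=(q, target, result))
    p = threading.Thread(target=producer, args=(chunks, q))
    c.start()
    p.start()
    p.join()
    c.join()  # both loops finish before the queue goes out of scope
    return result["found"]
```

The two `join` calls play the role of the data dependency on the diagram: the queue is only given up after both loops have finished.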

The problem I am having now is that something is killing the queue reference outside of this VI. Since the queue is unnamed, it is supposed to be technically impossible for the reference to go invalid while the VI is running and Release Queue has not been called.

Also, this is not the only reference that goes invalid. I have error logging throughout my code that shows multiple references going invalid at the same time in unrelated VIs (all of these use 'private' unnamed queues). What appears to be happening is that when some reentrant or VIT-spawned code leaves memory, it accidentally kills other queues in other instances of the same VIs. Or something is corrupting LabVIEW's memory.

NI App support is trying to reproduce the issue. I will let you all know what they tell me in the end.

-John

Link to comment

John,

Are you running this as an application or in the development environment? Also, has the hardware you are running it on changed? Did you upgrade the OS? Have you tried running a memory check on the machine you are using? Have you tried checking the CPU usage on the machine, i.e., are you running out of CPU cycles? Also try using Wireshark on the TCP/IP port to check for strange traffic.

It just seems strange that LabVIEW throws this error only occasionally. I use a number of queues and parallel processing, some of them with high-speed data, with very few problems, especially on LabVIEW 8.5.1.

Have you tried mass compiling your entire project? I have found on several occasions that LabVIEW seems to have trouble keeping track of changes to queues. The problems disappear when I mass compile.

Mark

QUOTE (jlokanis @ Sep 23 2008, 10:57 AM)

Link to comment
  • 3 weeks later...

Hi!

We too have many problems with queues and other LV functions in a big application. They freeze during our long-duration tests. Sometimes it takes hours and sometimes it takes only 50 minutes.

We have at least found out that it has something to do with timed loops (or timed structures). The LV functions that cause the application to freeze are all inside a timed loop.

When we replaced it with a while loop, we no longer had any problems. So my question is: is the VI shown above somewhere inside a timed structure?

Martin

Link to comment
