
Custom subarrays



Hi,

I need to do windowing on big arrays (4-D, with millions of elements), which is very expensive.

Just to illustrate what I mean by windowing.

[Images: windowing.png, windows.png — sliding-window illustration]


From what I've read here and on the internet, LabVIEW stores an array as a structure with the dimension sizes followed by the data itself.

In the 2-D Array Handle example in the "External Code (DLL) Execution" VI, we can find this definition:
typedef struct {
    int32 dimSizes[2];  /* number of elements in each dimension */
    double elt[1];      /* first element of the data, which follows inline */
    } TD1;

When we do simple operations on 1D or 2D arrays, such as reverse or transpose, LabVIEW doesn't copy any data; instead, it creates a structure with the information needed to read or reconstruct the array.
Rolf Kalbermatter said that this information is stored in the wire Type description.

We can read this information with the ArrayMemInfo VI.
It gives the pointer to the first element, the size and stride of each dimension, and the element size. The stride is the number of bytes to skip to get to the next element in a given dimension.

[Images: mem_info_bd.png, mem_info.png — ArrayMemInfo example]

Numpy has a similar approach, and this is what it reports for sizes and strides.

[Image: strides_python.png — numpy sizes and strides]

We can modify them with the function "numpy.lib.stride_tricks.as_strided" and use it to do really fast windowing.
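Here is a minimal sketch of that trick (the 4 x 4 float64 array and 3 x 3 window size are just illustrative):

import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(16, dtype=np.float64).reshape(4, 4)
print(a.shape)    # (4, 4)
print(a.strides)  # (32, 8): 32 bytes to the next row, 8 to the next column

# All 3x3 sliding windows as a (2, 2, 3, 3) view; reusing the row and
# column strides for the window axes means no data is copied at all.
s0, s1 = a.strides
windows = as_strided(a, shape=(2, 2, 3, 3), strides=(s0, s1, s0, s1))
print(windows[1, 1])  # the window whose top-left corner is a[1, 1]

Note that as_strided does no bounds checking, so the shape and strides must be computed carefully.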

The information given by ArrayMemInfo seems incorrect, but assuming it's just a reading error, modifying it would allow me to do efficient windowing and much more, like transposing or permuting N-D arrays.

I have no idea if this is possible, and even if I could change this information, I don't know whether LabVIEW would interpret it correctly.
What do you think?

6 hours ago, Gribo said:

As far as I know, there is no guarantee that the array is allocated in a single block.

Actually, arrays (of scalars) are normally allocated as one single block.

And while LabVIEW internally does indeed use subarrays, there is also a function that will convert subarrays to normal arrays whenever a function doesn't like subarrays. Basically, functions need to tell LabVIEW whether they can deal with subarrays, and unless they explicitly say that they can for an array parameter, LabVIEW will simply convert it to a full array before passing it to the function. And the Call Library Node is a function that explicitly does not accept subarray parameters. Theoretically it may be possible, but the subarray data structure is more complex than the one you show in your post. The subarray interface is not documented for external tools in LabVIEW and is never passed to any external function, interface, or data client.

It is not trivial to work with, and if LabVIEW allowed that at the Call Library Node interface, EVERY piece of code would need to be prepared for a possible subarray entry, or there would have to be some involved mechanism for letting a DLL tell LabVIEW that it can accept subarrays for parameters x, z and s, but not for a, b, and c. Totally unmanageable!!! 🤮

So no, a Call Library Node will always receive a full array. If necessary, LabVIEW will create one!
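A rough numpy analogy for that conversion step (the analogy is illustrative, not LabVIEW's actual mechanism):

import numpy as np

a = np.arange(10.0)
v = a[::-1]                      # a strided view, loosely like a subarray
c = np.ascontiguousarray(v)      # a full contiguous copy, like what a CLN gets
print(v.flags['C_CONTIGUOUS'])   # False
print(c.flags['C_CONTIGUOUS'])   # True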

Edited by Rolf Kalbermatter
23 hours ago, HugoChrist said:

Thanks Rolf,

I would be okay with LabVIEW converting the array, because I think it would still be faster than the two for loops I currently use. Do you have any information or ideas on how I could change the sizes and strides?

It's really unclear to me what you're trying to do. So far, there is simply no way to create subarrays in external code. And there is no LabVIEW node that allows you to do that either. LabVIEW nodes decide for themselves whether they can accept subarrays and whether they want to create them, but there is simply no user control over that.

Also, subarrays support a few more options than what ArrayMemInfo returns. Aside from the stride, the structure also contains various flags, such as whether the array is considered reversed (its internal pointer points to the end of the array data), transposed (rows and columns are swapped, meaning that the sizes and strides are swapped), etc.
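numpy's views expose comparable behavior, which makes those flags easy to picture (an illustration, not a statement about LabVIEW's internal layout):

import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)
print(a.strides)        # (24, 8): row-major 2x3 float64
print(a.T.strides)      # (8, 24): transposing swaps sizes and strides
print(a[::-1].strides)  # (-24, 8): reversing flips a stride's sign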

Theoretically, Array Subset should be able to allocate subarrays, and quite likely does so, but once you display them in a front panel control, that front panel control will always make a real copy for its internal buffer, since it can't rely on subarrays. Subarrays are like pointers or references, and you do not want your front panel data element to change its values at any time except when dataflow dictates that you pass new data to the terminal.

And the other problem is that once you start to autoindex subarrays, things get extremely hairy very quickly. You would need subarrays containing subarrays containing subarrays to represent your data structure, which, aside from being very difficult to make generic, also quickly consumes even more memory than your 8 * 8 * 3 * 3 element array would require. Even if you extend your data to huge outer dimensions, a subarray takes pretty much as much memory to store as your 3 * 3 window, so you win very little. Basically, LabVIEW nodes can generate subarrays, but auto-indexing tunnels on loops could only do so with a LOT of effort to figure out the right transformations, with very little benefit in most situations.


What is the cost you are worried about? Is it speed, memory, or both?

If it is speed: is the example representative? In that case, I would calculate the window function within the inner loop and just output the result, not generate an array of all the subwindows and *then* apply the actual window calculation.
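A minimal sketch of what I mean, with a plain sum standing in for the actual window calculation:

import numpy as np

def windowed_result(x, kh, kw):
    # Compute the window function inside the loops and keep only the
    # result, instead of materializing an array of all subwindows first.
    # The per-window operation (a plain sum here) is just a placeholder.
    h, w = x.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i:i + kh, j:j + kw].sum()
    return out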

The loop(s) can also be configured for parallelism to speed things up significantly (assuming you have more than one core at least)...

That means more speed and less memory. If none of that reduces the cost enough, you will probably have to do the processing in external (non-G) code.

Edited by Mads

The cost I am worried about is speed.
I realize I did not give you enough information.

I am implementing convolution layers of neural networks.
These layers multiply each window by a filter and sum the result.
Here is a naive implementation of a convolution.

[Image: naive_conv.png — naive convolution implementation]

This is really slow. One way to make it faster is to move the multiplication and summation out of the loop and do them later.
Instead, you flatten each window into a 1D array and get a big matrix with "batch_size*new_height*new_width" rows and "channels" columns.
Then you flatten the filters into a matrix with "channels" rows and "filters" columns, and you do a matrix multiplication.
BLAS (Basic Linear Algebra Subprograms) is so well optimized that having so much redundant information is worth it.

The function that flattens the windows is called im2col (or im2row in my case) and makes convolutions around 4 times faster.
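A minimal numpy sketch of that pipeline (the (batch, height, width, channels) layout, stride-1 "valid" convolution, and function name are my assumptions for illustration):

import numpy as np
from numpy.lib.stride_tricks import as_strided

def conv2d_im2row(x, k):
    # x: (batch, height, width, channels), k: (kh, kw, channels, filters).
    b, h, w, c = x.shape
    kh, kw, _, f = k.shape
    nh, nw = h - kh + 1, w - kw + 1
    sb, sh, sw, sc = x.strides
    # View every kh x kw window without copying anything yet.
    win = as_strided(x, shape=(b, nh, nw, kh, kw, c),
                     strides=(sb, sh, sw, sh, sw, sc))
    # im2row: this reshape is where the one real copy happens...
    rows = win.reshape(b * nh * nw, kh * kw * c)
    # ...after which BLAS does all the work in a single matrix product.
    return (rows @ k.reshape(kh * kw * c, f)).reshape(b, nh, nw, f)

x = np.random.rand(2, 8, 8, 3)    # e.g. 8 x 8 images with 3 channels
k = np.random.rand(3, 3, 3, 16)   # 3 x 3 windows, 16 filters
print(conv2d_im2row(x, k).shape)  # (2, 6, 6, 16)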

Now there is still room for improvement: you can compute a new shape and new strides to build that window view. By changing the shape and strides, you only change how the data is viewed, since the array stays in one contiguous block of memory.

[Image: shape_and_strides.png]

The address of the element at index (i, j) is pointer + i*strides[0] + j*strides[1].
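That formula can be checked directly (a small illustrative check):

import ctypes
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)  # strides are (32, 8)
i, j = 2, 1
addr = a.ctypes.data + i * a.strides[0] + j * a.strides[1]
# Read the element straight from the computed address.
val = ctypes.cast(addr, ctypes.POINTER(ctypes.c_double)).contents.value
print(val, a[i, j])  # 9.0 9.0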

So this is how it is done in Numpy, PyTorch, and other libraries. Arrays (or Tensors) also have a shape and strides, like in LabVIEW, and there is a function to change them to create a "view" of the same array without any data being copied.
This is why my original question was about creating custom subarrays.

You can find more details here.
To illustrate the gain in time, here is a plot from the linked article.

[Image: plt.png — timing plot from the linked article]

Obviously there is still a copy of the data when you call BLAS, but I thought that, given the new shape and strides, LabVIEW would be faster than my im2row function.
From Rolf's answers, I gather that there is no way to do that, so I am writing external code instead. I did configure one loop for parallelism, but it is not significantly faster. I also tried the Convolution VI in LabVIEW, but it only works with 2D arrays and filters, and my code with BLAS is actually faster.

Thank you for your help.

10 hours ago, GregSands said:

Two suggestions if you haven't tried them already:

  1. Multicore Analysis and Sparse Matrix Toolkit
  2. GPU Toolkit

 

We looked into both and will definitely use multiple cores or the GPU in the future, but the computation is not what slows us down right now.

9 hours ago, ShaunR said:

You could implement it using the LabVIEW memory manager, but it's not for the faint of heart.

I once played with it for a circular buffer, which isn't a million miles away from what you need.

Oooh. Look. Words similar :)


I could create my own array object (maybe using DSNewAlignedHandle), but I'm not familiar enough with the LabVIEW memory manager. I think this would require me to write external code for every array operation, essentially building a complete array library (like numpy). I don't have time to do this now, but it would be great, because any time I need some sort of subarray of an array with more than 2 dimensions, I'm facing the same problem.

For now, I wrote a simple DLL to handle the windowing, and it is about 8x faster than the naive implementation.

Thank you for the suggestion, we will look into it in the future.

Edited by HugoChrist
53 minutes ago, HugoChrist said:

I think this would require me to write external code for every array operation, essentially building a complete array library (like numpy).

Nope. All the nodes are calls into the LabVIEW executable, which houses the memory manager, and you have most of what you need in the circular buffer example I linked to.

53 minutes ago, HugoChrist said:

For now, I wrote a simple DLL to handle the windowing, and it is about 8x faster than the naive implementation.

Job done. I'll go to lunch. :D

Edited by ShaunR
