Fastest Search Array of String

hooovahh · July 24, 2012

So I embarked on a little journey this morning. Many times when developing code I have the need to search an array of string for some pattern. Usually I am looking for some set of characters in a string and then I need to get the whole string, and possibly the index that the string was found in the array.

There are many different ways to do what I just mentioned, some more obvious then others, and I wanted to know which was the methods that should not be used for larger arrays, and which method appears to be the best, on a Windows machine with time to process being the only thing to judge on.

So attached you'll find a VI (saved in 2011) which generates an array of strings. Each string will be between 5 and 15 characters long and the array size is controllable from the front panel. For approximately half the elements in the array, a string will be appended to the front of the generated string. These are the values we will search for.

We then search the array looking for the indexes that contain the string we are looking for. In this example the string is appended to the front, but I would like to make this generic and I am assuming the string we are looking for could be any where in the string not just the beginning. I use 6 different methods for finding the indexes and calculate the time it took to find them.

I noticed for arrays of 1000 elements or less it doesn't matter too much. Each of the 6 methods take 0 or 1 milliseconds to complete. For 10,000, or 100,000 the winner is Method 2, which uses the Search/Split String, combined with the OpenG Conditional Auto-Indexing Tunnel.

I know this is such a simple task, and for most cases it doesn't matter, but I was wondering what others have found. Is there a method that is generally preferred more? My thoughts would be that Method 1 would have been the fastest, since we only have to iterate over the items once, instead of an OpenG method, but I do realize the Build Array is an operation which can be bad for memory.

Search Array Of Strings.vi

ShaunR · July 24, 2012

So I embarked on a little journey this morning. Many times when developing code I have the need to search an array of string for some pattern. Usually I am looking for some set of characters in a string and then I need to get the whole string, and possibly the index that the string was found in the array.

<snip>

I know this is such a simple task, and for most cases it doesn't matter, but I was wondering what others have found. Is there a method that is generally preferred more? My thoughts would be that Method 1 would have been the fastest, since we only have to iterate over the items once, instead of an OpenG method, but I do realize the Build Array is an operation which can be bad for memory.

Well. For me anything that doesn't use 3rd party products is preferred

But generally when we are talking about this sort of thing we are also talking about convergence on a solution for the final optimisation step. As was indicated with the "Fast Trim"; using LabVIEW parallelism will always converge faster since the worst case is a solution is in the middle (if split over 2 for loops) rather than at the end. This however comes with the caveat for this case in how do you tell one loop that the other has found it without introducing complexity that mitigates the benefit.

I haven't tried your example (I don't have the JKI toolkit) but in theory enabling parallelism of the for loop should also yield benefits although I have yet to see a real world scenario where it actually does.

Edited July 24, 2012 by ShaunR

lordexod · July 24, 2012

You can set: File -> VI Properties -> Execution -> Reentrant execution -> Preallocate clone for each instance.

Mellroth · July 24, 2012

You can set: File -> VI Properties -> Execution -> Reentrant execution -> Preallocate clone for each instance.

Why would you do this in this specific case?

I think Shaun´s parallel FOR loop has much more potential, since a VI call always has a small time associated with it regardless of the reentrancy setting.

/J

Wouter · July 24, 2012

I played with this just now. Let me post my remarks. The tests where done with 1 million elements

First off al the first method can be made much more efficient, by just preallocation the data. If the input has 10000 then you should also preallocate a array of 10000 elements for both the string array as the index array. Then after the loop you just split the array and hence remove the elements you didn't use.

The first method, (500 ms), is then most times faster then the second method, (550 ms).

However of course you can also apply this to the second method. Which makes the second faster by more then 50% (550 ms to 250 ms)

(yes the ss is of a subvi, but I did that for other testing purposes after I benchmarked it with the above results, I took the ss of the code when I was finished)

I then thought I could optimize it further but my attempt failed. I tried combining the two OpenG VI's into one. You namely do not have to keep track of the indices in the first for loop since they are also recored in the boolean array. The index at which the value is true, is the index of the element that we want to keep. After that you know the exact size of the array you are going to create and you can prealocate that. Then just fill it. However the execution then came back on its original level (+/- 250 ms to 550 ms)

Edited July 24, 2012 by Wouter

hooovahh · July 24, 2012

I played with this just now. Let me post my remarks. The tests where done with 1 million elements

....

Wow thanks alot Wouter. I'm not sure why I didn't think of pre-allocation. I think this would make a nice little reuse nuggest for the community or OpenG some day. I also did notice that with a smaller sample this method 2 with pre-allocation doesn't always win but it may be the best solution overall.

lordexod · July 25, 2012

Maybe something like this:

test.vi

search.vi

Edited July 25, 2012 by lordexod

hooovahh · July 25, 2012

Maybe something like this:

I'm not quite sure why but your VIs on my system took 34 seconds to search through, using 10,000 items in the array, and it took over 900MB of private memory.

To compare my original post took a minimum of 169ms and 1.4GB of private memory for all 6 methods. At the start of all of this I said the only real criteria was time for execution so the memory usage is a little unfair to add as a requirement, I simply wanted to give a comparison.

Mellroth · July 25, 2012

Wow thanks alot Wouter. I'm not sure why I didn't think of pre-allocation. I think this would make a nice little reuse nuggest for the community or OpenG some day. I also did notice that with a smaller sample this method 2 with pre-allocation doesn't always win but it may be the best solution overall.

Our internal solution is quite similar to wouters solution. The big differences are that we use "ArraySubset" instead of "DeleteSubset" and that we reuse the incoming search array to store the result.

I have not bechmarked it against the other solutions, so I don't know if it is faster or slower.

/J

lordexod · July 25, 2012

I forgot, I have a better computer.

Edited July 25, 2012 by lordexod

ShaunR · July 25, 2012

A bit of lateral thinking?

Mellroth · July 25, 2012

I'm not quite sure why but your VIs on my system took 34 seconds to search through, using 10,000 items in the array, and it took over 900MB of private memory.

To compare my original post took a minimum of 169ms and 1.4GB of private memory for all 6 methods. At the start of all of this I said the only real criteria was time for execution so the memory usage is a little unfair to add as a requirement, I simply wanted to give a comparison.

I'm getting times very close to this as well, and I believe that the reason is that the size of each string is also set to the size of the array... i.e. basically 10000 elements at approximately 10000 bytes each. On my system I run out of memory, or at least start swapping all the time.

Setting the size of the generated random string to more moderate 20 bytes, gives much better results... but still not close to the original code.

/J

Jordan Kuehn · July 25, 2012

Maybe something like this:

test.vi

search.vi

On the initial run I got 14.8s, but when running again without closing I got 278ms for 10,000 elements.

1-4ms on the original post for 10,000 elements.

hooovahh · July 25, 2012

Our internal solution is quite similar to wouters solution. The big differences are that we use "ArraySubset" instead of "DeleteSubset" and that we reuse the incoming search array to store the result.

I have not bechmarked it against the other solutions, so I don't know if it is faster or slower.

/J

I'm starting to get inconsistent results, but from what I've seen it looks to be one of the better solutions but not the best. The best I've seen would either be Method 3, or Method 3, modified as wouter suggests.

A bit of lateral thinking?

In my tests this did not perform as well as the others. It would usually be 4th or 5th place in any array size. Not terrible but not the method I would prefer.

Darin · July 25, 2012

In general I am not a fan of the "plow through an array to select a subset of items" then "plow through the subset of items" methodology. When possible I prefer to do whatever it is I am going to do when I find the element. Therefore, the code I would optimize and package for reuse would simply take an array and a start index and return the index of the first element which matches, and perhaps the element itself in this case. Sort of a special version of Search 1D array. This use case of collecting all of the items would simply use that code in a loop, with appropriate pre-allocation code.

My other general point (don't have LV11 to test any of this code), is that benchmarking string operations is a dicey proposition and I often find very high variances. Be careful about drawing too many conclusions. I would check proper scaling with array size (should scale as N not N^2 or worse), and avoid the third-party code when practical. Hard to beat the good-old swap/cut method to move the desired items to the front of the input array, then use Reshape Array or Array Subset to lop of the undesired tail.

Wouter · July 26, 2012

A bit of lateral thinking?

I get a memory is full error for a size of 1 million

On the initial run I got 14.8s, but when running again without closing I got 278ms for 10,000 elements.

1-4ms on the original post for 10,000 elements.

Thats because proberbly some intern machine code results are cached.

ShaunR · July 26, 2012

I get a memory is full error for a size of 1 million

Thats because I cut and pasted your seed, method so it's not surprising (i.e you were trying to fit 1,0000,0000 x 1,000,0000 chars in memory)

In my tests this did not perform as well as the others. It would usually be 4th or 5th place in any array size. Not terrible but not the method I would prefer.

There's no pleasing some people

Edited July 26, 2012 by ShaunR

asbo · July 26, 2012

Thats because proberbly some intern machine code results are cached.

It's more likely related to a (lack of) memory allocations.

Jordan Kuehn · July 26, 2012

It's more likely related to a (lack of) memory allocations.

System info:

asbo · July 26, 2012

System info:

What's your point? It still takes time to execute a memory allocation, regardless of how much you have. It is very normal for a memory-intensive VI to execute magnitudes faster on a second run.

Wouter · July 28, 2012

What's your point? It still takes time to execute a memory allocation, regardless of how much you have. It is very normal for a memory-intensive VI to execute magnitudes faster on a second run.

Yes you are totally correct. In the second run the memory is already allocated and hence does not need to be done anymore.

Aristos Queue · July 29, 2012

I did some testing... everyone was using Match Pattern... I found a code snippet that was faster than that in every scenario I tested:

I wasn't sure if it was only prefixes that were being sought ... all the test harnesses only created prefix test cases. If you want "anywhere in the string", then this doesn't help, obviously.

Wouter · July 29, 2012

I did some testing... everyone was using Match Pattern... I found a code snippet that was faster than that in every scenario I tested:

I wasn't sure if it was only prefixes that were being sought ... all the test harnesses only created prefix test cases. If you want "anywhere in the string", then this doesn't help, obviously.

I'm not suprised by it. I actually also build a version which used the string subset but then i thought of the case that the string "test" may not be on the front in each case, then I just choose to create a more general solution.

hooovahh · July 30, 2012

I did some testing... everyone was using Match Pattern... I found a code snippet that was faster than that in every scenario I tested:

I wasn't sure if it was only prefixes that were being sought ... all the test harnesses only created prefix test cases. If you want "anywhere in the string", then this doesn't help, obviously.

I realize some of my posts are a little long winded so I don't blame you for not picking up on what I was trying to say. In my first post I said this:

In this example the string is appended to the front, but I would like to make this generic and I am assuming the string we are looking for could be any where in the string not just the beginning.

Which was my way of saying that the search should be throughout the string. If I knew it was at the beginning I would use your technique with the string subset.

Also I may be missing something, but why does concatenating a "^" symbol cause the Match Pattern method to work faster?

Wouter · July 30, 2012

Also I may be missing something, but why does concatenating a "^" symbol cause the Match Pattern method to work faster?

http://en.wikipedia....lar_Expressions

^ Matches the starting position within the string. In line-based tools, it matches the starting position of any line.

Edited July 30, 2012 by Wouter

Sign In

Fastest Search Array of String

Recommended Posts

hooovahh

ShaunR

lordexod

Mellroth

Wouter

hooovahh

lordexod

hooovahh

Mellroth

lordexod

ShaunR

Mellroth

Jordan Kuehn

hooovahh

Darin

Wouter

ShaunR

asbo

Jordan Kuehn

asbo

Wouter

Aristos Queue

Wouter

hooovahh

Wouter

Join the conversation

Browse

Activity

Important Information