hooovahh Posted July 24, 2012 Report Posted July 24, 2012 So I embarked on a little journey this morning. Many times when developing code I have the need to search an array of string for some pattern. Usually I am looking for some set of characters in a string and then I need to get the whole string, and possibly the index that the string was found in the array. There are many different ways to do what I just mentioned, some more obvious then others, and I wanted to know which was the methods that should not be used for larger arrays, and which method appears to be the best, on a Windows machine with time to process being the only thing to judge on. So attached you'll find a VI (saved in 2011) which generates an array of strings. Each string will be between 5 and 15 characters long and the array size is controllable from the front panel. For approximately half the elements in the array, a string will be appended to the front of the generated string. These are the values we will search for. We then search the array looking for the indexes that contain the string we are looking for. In this example the string is appended to the front, but I would like to make this generic and I am assuming the string we are looking for could be any where in the string not just the beginning. I use 6 different methods for finding the indexes and calculate the time it took to find them. I noticed for arrays of 1000 elements or less it doesn't matter too much. Each of the 6 methods take 0 or 1 milliseconds to complete. For 10,000, or 100,000 the winner is Method 2, which uses the Search/Split String, combined with the OpenG Conditional Auto-Indexing Tunnel. I know this is such a simple task, and for most cases it doesn't matter, but I was wondering what others have found. Is there a method that is generally preferred more? My thoughts would be that Method 1 would have been the fastest, since we only have to iterate over the items once, instead of an OpenG method, but I do realize the Build Array is an operation which can be bad for memory. Search Array Of Strings.vi 1 Quote
ShaunR Posted July 24, 2012 Report Posted July 24, 2012 (edited) So I embarked on a little journey this morning. Many times when developing code I have the need to search an array of string for some pattern. Usually I am looking for some set of characters in a string and then I need to get the whole string, and possibly the index that the string was found in the array. <snip> I know this is such a simple task, and for most cases it doesn't matter, but I was wondering what others have found. Is there a method that is generally preferred more? My thoughts would be that Method 1 would have been the fastest, since we only have to iterate over the items once, instead of an OpenG method, but I do realize the Build Array is an operation which can be bad for memory. Well. For me anything that doesn't use 3rd party products is preferred But generally when we are talking about this sort of thing we are also talking about convergence on a solution for the final optimisation step. As was indicated with the "Fast Trim"; using LabVIEW parallelism will always converge faster since the worst case is a solution is in the middle (if split over 2 for loops) rather than at the end. This however comes with the caveat for this case in how do you tell one loop that the other has found it without introducing complexity that mitigates the benefit. I haven't tried your example (I don't have the JKI toolkit) but in theory enabling parallelism of the for loop should also yield benefits although I have yet to see a real world scenario where it actually does. Edited July 24, 2012 by ShaunR Quote
lordexod Posted July 24, 2012 Report Posted July 24, 2012 You can set: File -> VI Properties -> Execution -> Reentrant execution -> Preallocate clone for each instance. Quote
Mellroth Posted July 24, 2012 Report Posted July 24, 2012 You can set: File -> VI Properties -> Execution -> Reentrant execution -> Preallocate clone for each instance. Why would you do this in this specific case? I think Shaun´s parallel FOR loop has much more potential, since a VI call always has a small time associated with it regardless of the reentrancy setting. /J Quote
Wouter Posted July 24, 2012 Report Posted July 24, 2012 (edited) I played with this just now. Let me post my remarks. The tests where done with 1 million elements First off al the first method can be made much more efficient, by just preallocation the data. If the input has 10000 then you should also preallocate a array of 10000 elements for both the string array as the index array. Then after the loop you just split the array and hence remove the elements you didn't use. The first method, (500 ms), is then most times faster then the second method, (550 ms). However of course you can also apply this to the second method. Which makes the second faster by more then 50% (550 ms to 250 ms) (yes the ss is of a subvi, but I did that for other testing purposes after I benchmarked it with the above results, I took the ss of the code when I was finished) I then thought I could optimize it further but my attempt failed. I tried combining the two OpenG VI's into one. You namely do not have to keep track of the indices in the first for loop since they are also recored in the boolean array. The index at which the value is true, is the index of the element that we want to keep. After that you know the exact size of the array you are going to create and you can prealocate that. Then just fill it. However the execution then came back on its original level (+/- 250 ms to 550 ms) Edited July 24, 2012 by Wouter Quote
hooovahh Posted July 24, 2012 Author Report Posted July 24, 2012 I played with this just now. Let me post my remarks. The tests where done with 1 million elements .... Wow thanks alot Wouter. I'm not sure why I didn't think of pre-allocation. I think this would make a nice little reuse nuggest for the community or OpenG some day. I also did notice that with a smaller sample this method 2 with pre-allocation doesn't always win but it may be the best solution overall. Quote
lordexod Posted July 25, 2012 Report Posted July 25, 2012 (edited) Maybe something like this: test.vi search.vi Edited July 25, 2012 by lordexod Quote
hooovahh Posted July 25, 2012 Author Report Posted July 25, 2012 Maybe something like this: I'm not quite sure why but your VIs on my system took 34 seconds to search through, using 10,000 items in the array, and it took over 900MB of private memory. To compare my original post took a minimum of 169ms and 1.4GB of private memory for all 6 methods. At the start of all of this I said the only real criteria was time for execution so the memory usage is a little unfair to add as a requirement, I simply wanted to give a comparison. Quote
Mellroth Posted July 25, 2012 Report Posted July 25, 2012 Wow thanks alot Wouter. I'm not sure why I didn't think of pre-allocation. I think this would make a nice little reuse nuggest for the community or OpenG some day. I also did notice that with a smaller sample this method 2 with pre-allocation doesn't always win but it may be the best solution overall. Our internal solution is quite similar to wouters solution. The big differences are that we use "ArraySubset" instead of "DeleteSubset" and that we reuse the incoming search array to store the result. I have not bechmarked it against the other solutions, so I don't know if it is faster or slower. /J 1 Quote
lordexod Posted July 25, 2012 Report Posted July 25, 2012 (edited) I forgot, I have a better computer. Edited July 25, 2012 by lordexod 1 Quote
Mellroth Posted July 25, 2012 Report Posted July 25, 2012 I'm not quite sure why but your VIs on my system took 34 seconds to search through, using 10,000 items in the array, and it took over 900MB of private memory. To compare my original post took a minimum of 169ms and 1.4GB of private memory for all 6 methods. At the start of all of this I said the only real criteria was time for execution so the memory usage is a little unfair to add as a requirement, I simply wanted to give a comparison. I'm getting times very close to this as well, and I believe that the reason is that the size of each string is also set to the size of the array... i.e. basically 10000 elements at approximately 10000 bytes each. On my system I run out of memory, or at least start swapping all the time. Setting the size of the generated random string to more moderate 20 bytes, gives much better results... but still not close to the original code. /J Quote
Jordan Kuehn Posted July 25, 2012 Report Posted July 25, 2012 Maybe something like this: test.vi search.vi On the initial run I got 14.8s, but when running again without closing I got 278ms for 10,000 elements. 1-4ms on the original post for 10,000 elements. Quote
hooovahh Posted July 25, 2012 Author Report Posted July 25, 2012 Our internal solution is quite similar to wouters solution. The big differences are that we use "ArraySubset" instead of "DeleteSubset" and that we reuse the incoming search array to store the result. I have not bechmarked it against the other solutions, so I don't know if it is faster or slower. /J I'm starting to get inconsistent results, but from what I've seen it looks to be one of the better solutions but not the best. The best I've seen would either be Method 3, or Method 3, modified as wouter suggests. A bit of lateral thinking? In my tests this did not perform as well as the others. It would usually be 4th or 5th place in any array size. Not terrible but not the method I would prefer. Quote
Darin Posted July 25, 2012 Report Posted July 25, 2012 In general I am not a fan of the "plow through an array to select a subset of items" then "plow through the subset of items" methodology. When possible I prefer to do whatever it is I am going to do when I find the element. Therefore, the code I would optimize and package for reuse would simply take an array and a start index and return the index of the first element which matches, and perhaps the element itself in this case. Sort of a special version of Search 1D array. This use case of collecting all of the items would simply use that code in a loop, with appropriate pre-allocation code. My other general point (don't have LV11 to test any of this code), is that benchmarking string operations is a dicey proposition and I often find very high variances. Be careful about drawing too many conclusions. I would check proper scaling with array size (should scale as N not N^2 or worse), and avoid the third-party code when practical. Hard to beat the good-old swap/cut method to move the desired items to the front of the input array, then use Reshape Array or Array Subset to lop of the undesired tail. Quote
Wouter Posted July 26, 2012 Report Posted July 26, 2012 A bit of lateral thinking? I get a memory is full error for a size of 1 million On the initial run I got 14.8s, but when running again without closing I got 278ms for 10,000 elements. 1-4ms on the original post for 10,000 elements. Thats because proberbly some intern machine code results are cached. Quote
ShaunR Posted July 26, 2012 Report Posted July 26, 2012 (edited) I get a memory is full error for a size of 1 million Thats because I cut and pasted your seed, method so it's not surprising (i.e you were trying to fit 1,0000,0000 x 1,000,0000 chars in memory) In my tests this did not perform as well as the others. It would usually be 4th or 5th place in any array size. Not terrible but not the method I would prefer. There's no pleasing some people Edited July 26, 2012 by ShaunR Quote
asbo Posted July 26, 2012 Report Posted July 26, 2012 Thats because proberbly some intern machine code results are cached. It's more likely related to a (lack of) memory allocations. Quote
Jordan Kuehn Posted July 26, 2012 Report Posted July 26, 2012 It's more likely related to a (lack of) memory allocations. System info: Quote
asbo Posted July 26, 2012 Report Posted July 26, 2012 System info: What's your point? It still takes time to execute a memory allocation, regardless of how much you have. It is very normal for a memory-intensive VI to execute magnitudes faster on a second run. Quote
Wouter Posted July 28, 2012 Report Posted July 28, 2012 What's your point? It still takes time to execute a memory allocation, regardless of how much you have. It is very normal for a memory-intensive VI to execute magnitudes faster on a second run. Yes you are totally correct. In the second run the memory is already allocated and hence does not need to be done anymore. Quote
Aristos Queue Posted July 29, 2012 Report Posted July 29, 2012 I did some testing... everyone was using Match Pattern... I found a code snippet that was faster than that in every scenario I tested: I wasn't sure if it was only prefixes that were being sought ... all the test harnesses only created prefix test cases. If you want "anywhere in the string", then this doesn't help, obviously. Quote
Wouter Posted July 29, 2012 Report Posted July 29, 2012 I did some testing... everyone was using Match Pattern... I found a code snippet that was faster than that in every scenario I tested: I wasn't sure if it was only prefixes that were being sought ... all the test harnesses only created prefix test cases. If you want "anywhere in the string", then this doesn't help, obviously. I'm not suprised by it. I actually also build a version which used the string subset but then i thought of the case that the string "test" may not be on the front in each case, then I just choose to create a more general solution. Quote
hooovahh Posted July 30, 2012 Author Report Posted July 30, 2012 I did some testing... everyone was using Match Pattern... I found a code snippet that was faster than that in every scenario I tested: I wasn't sure if it was only prefixes that were being sought ... all the test harnesses only created prefix test cases. If you want "anywhere in the string", then this doesn't help, obviously. I realize some of my posts are a little long winded so I don't blame you for not picking up on what I was trying to say. In my first post I said this: In this example the string is appended to the front, but I would like to make this generic and I am assuming the string we are looking for could be any where in the string not just the beginning. Which was my way of saying that the search should be throughout the string. If I knew it was at the beginning I would use your technique with the string subset. Also I may be missing something, but why does concatenating a "^" symbol cause the Match Pattern method to work faster? Quote
Wouter Posted July 30, 2012 Report Posted July 30, 2012 (edited) Also I may be missing something, but why does concatenating a "^" symbol cause the Match Pattern method to work faster? http://en.wikipedia....lar_Expressions ^ Matches the starting position within the string. In line-based tools, it matches the starting position of any line. Edited July 30, 2012 by Wouter Quote
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.