
Regex Challenge: How to exclude variable-width parts of a match?



How can I exclude a variable-width, inner part from a regex match?

 

For instance, given the three following inputs:

  • The quick brown fox jumped over the lazy dog
  • The quick brown fox jumped over the sleepy dog
  • The quick brown fox jumped over the hotdog
I want to match the following:
  • The quick brown fox jumped over the dog

I have investigated negative lookarounds, but since these are zero-width assertions, I can't easily figure out how to include additional regex directives on *both sides* of the lookaround.

 

Is this problem relegated to the use of Search and Replace String configured for regex matching, or can this be achieved with a simple regex match?

 

Here's a screenshot of one of my naïve attempts using RegexPal; a successful attempt would also show "dog" highlighted as part of the match, but excluding the dog modifier. At least it doesn't match 'frog' :-)

 

post-17237-0-39273600-1363286217.png

Link to comment

OK; so, the consensus is that there does not exist a single regex to solve this?  :( (I was really hoping to learn of a solution that could be achieved purely by the regex engine.)

 

The best name I can come up with for what I'm trying to do is submatch extraction, where the whole match depends on matching directives before and after the submatch, yet excludes the submatch like this:

 

post-17237-0-63598200-1363293747.png

Link to comment

Yeah, S&R will be way faster. But you asked...

 

Don't forget that the * and + operators are greedy: they will match as long a string as possible. So a simple "The quick brown fox jumped over the (.*)dog" will do the trick. Substitute + for * if you want to require at least one character, or use the {m,n} syntax if you have other length restrictions.
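For anyone following along outside LabVIEW, here is a minimal sketch of that greedy capture group in Python. It assumes Python's `re` module as a stand-in for the LabVIEW regex node, and the helper name `capture_modifier` is made up for illustration.

```python
import re

# Greedy (.*) grabs as much as it can, then backtracks until "dog" can match.
PATTERN = re.compile(r"The quick brown fox jumped over the (.*)dog")

def capture_modifier(text):
    """Return what the capture group matched, or None when there is no match."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

if __name__ == "__main__":
    for s in (
        "The quick brown fox jumped over the lazy dog",
        "The quick brown fox jumped over the sleepy dog",
        "The quick brown fox jumped over the hotdog",
        "The quick brown fox jumped over the frog",
    ):
        print(repr(capture_modifier(s)))
```

Note that the captured text keeps its trailing space ("lazy ", "sleepy "), and that this only isolates the modifier; it does not return the sentence with the modifier removed.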

 

post-11742-0-21795800-1363300741.png

 

Obligatory:

regular_expressions.png

Link to comment
...a simple "The quick brown fox jumped over the (.*)dog" will do the trick.

 

Well, simply finding a submatch is the easy part ;-)

 

What I'm really interested in doing is returning the original input string as the match, minus the submatch. Figure this out, and then we can fly around on vines saving days :-)

Link to comment

Why does S&R not fulfill your desire? 

 

The regex engine deals in offsets and lengths in its internal state machine.  Implementing dropped characters would really require a fundamental change to this representation.  I am not sure what it would do to its speed (lack thereof) but my guess is that it will not speed up.

 

There seems to be a reason why regular expressions are low-level tools used in higher-level languages.

 

In Perl you can sprinkle some other code inside, but it really makes things hard to read.  Most people do not realize that you can comment inside regexes, and even fewer ever want to deal with a regex which requires commenting.

Link to comment

[blockquote class=ipsBlockquote data-author=JackDunaway data-cid=102091 data-time=1363301998]<p>

What I'm really interested in doing is returning the original input string as the match, <em class='bbc'>minus</em> the submatch. Figure this out, and then we can fly around on vines saving days :-)</p>

<br />

Oh, I misread. Yes, the look ahead/behind are what you want.<br />

<br />

<span style='font-family: courier new', courier, monospace'>(?<=The quick brown fox jumped over the ).*(?=dog)</span><br />

<br />

Using the S&R in regex mode will get you the string you want, but the match primitive won't since the whole point of the match is not to create new strings but to only return substrings. You could still use the match though if you concatenate the before/after substrings.

Ok, I give up. My phone completely screwed that up:

 

(?<=The quick brown fox jumped over the ).*(?=dog)

 

If that doesn't work, I give up.

Link to comment

(?<=The quick brown fox jumped over the ).*(?=dog)

 

If that doesn't work, I give up.

 

That got me excited... but it's not quite right. The whole match is still just the submatch, because the lookarounds are zero-width.
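A small Python sketch of why the lookaround version falls short. It assumes Python's `re` module rather than LabVIEW, and `whole_match` is a hypothetical helper name. The lookbehind and lookahead must be satisfied for the match to succeed, but they consume no characters, so the whole match is only the part in between.

```python
import re

# Lookbehind and lookahead constrain the match without being part of it.
LOOKAROUND = re.compile(r"(?<=The quick brown fox jumped over the ).*(?=dog)")

def whole_match(text):
    """Return the whole match (group 0), or None when there is no match."""
    m = LOOKAROUND.search(text)
    return m.group(0) if m else None

if __name__ == "__main__":
    print(repr(whole_match("The quick brown fox jumped over the lazy dog")))
    print(repr(whole_match("The quick brown fox jumped over the hotdog")))
```

The whole match here is the modifier itself, which is the opposite of the desired result.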

 

Virtual +50 bounty for the regex that makes the green light come on in the following test harness:

 

post-17237-0-00062600-1363307882_thumb.p

Link to comment

By giving up I meant giving up on fixing that post by the way.

 

Anyways, you're not going to be able to do it with the match primitive alone. It won't make a new string for you, which is what you need if you want to get the actual "The quick brown fox jumped over the dog" out of the match. The S&R would do it, but then you won't get the match. If you add some extra logic though, you can do it.

 

I hope I got your examples right, my LabVIEW stopped accepting snippets for some reason, so I rolled this one from scratch.

 

post-11742-0-16612500-1363311445_thumb.p

 

Other case is empty by the way, in case the snippets I produce are as defective as my ability to read them. Basically you need to construct the final string yourself.
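The "construct the final string yourself" idea can be sketched in Python in two ways. This is not a transcription of the LabVIEW snippet above (its contents are not reproduced here); it assumes Python's `re` module, and both function names are invented for illustration.

```python
import re

# Search-and-replace style: keep the text on both sides, drop what lies between.
SANDWICH = re.compile(r"(The quick brown fox jumped over the ).*(dog)")

# Match style: locate the modifier, then stitch the before/after substrings.
MODIFIER = re.compile(r"(?<=The quick brown fox jumped over the ).*(?=dog)")

def drop_modifier_sub(text):
    """Return text with the modifier removed; unmatched input comes back unchanged."""
    return SANDWICH.sub(r"\1\2", text)

def drop_modifier_concat(text):
    """Return before + after of the modifier match, or None when there is no match."""
    m = MODIFIER.search(text)
    if m is None:
        return None
    return text[:m.start()] + text[m.end():]

if __name__ == "__main__":
    s = "The quick brown fox jumped over the sleepy dog"
    print(drop_modifier_sub(s))
    print(drop_modifier_concat(s))
```

Either way, a new string gets built outside the match itself, which is the point being made: the match alone only hands back contiguous substrings.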

Link to comment
Anyways, you're not going to be able to do it with the match primitive alone. It won't make a new string for you...

 

Well, it's kinda not a new string... it's just a noncontiguous substring. (Can a substring be defined as 'noncontiguous'? Perhaps not, and what I desire is impossible.)

 

 

 

I hope I got your examples right, my LabVIEW stopped accepting snippets for some reason, so I rolled this one from scratch.

 

regex2.png

 

Wow, thanks for going out of your way to recreate! Sorry snippets broke for you  :(

 

That is basically the way I'm solving the problem right now; with extra syntax. The prime motivation for finding a *purely* regex solution is to generalize this problem -- consider wanting to remove adjectives from both nouns:

 

The fox jumped over the dog

 

This general solution more closely matches my problem domain. (This thread presents the simplest form of the problem, since I can't even figure that out; or if the desired solution is even possible!)
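For the generalized case, one possible sketch in Python: capture the parts to keep, leave the variable-width parts uncaptured, and join the captures. It assumes Python's `re` module, assumes the kept parts are the fixed phrases "The ", "fox jumped over the " and "dog", and uses a hypothetical helper name `strip_modifiers`. It still needs one step outside the regex engine to join the groups.

```python
import re

# Lazy .*? spans each variable-width part; only the kept parts are captured.
PATTERN = re.compile(r"(The ).*?(fox jumped over the ).*?(dog)")

def strip_modifiers(text):
    """Join the captured parts of a full match, or return None when there is no match."""
    m = PATTERN.fullmatch(text)
    return "".join(m.groups()) if m else None

if __name__ == "__main__":
    print(strip_modifiers("The quick brown fox jumped over the lazy dog"))
    print(strip_modifiers("The quick brown fox jumped over the hotdog"))
```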

Link to comment

Hah, no worries. I love these types of problems. Pure logic.

 

Non-contiguous strings exist, just not in LabVIEW :)

 

I figured you had a good reason to use a regex because as it stood a regex was not the best way to do it: scanning would be faster.

 

post-11742-0-24501000-1363314708.png

 

You have an interesting problem though in that you want to replace and match at the same time, and apparently globally.

Link to comment

You are expecting that "Whole Match" really means Whole Match except those bits I don't want?

 

LV Help

 

whole match contains all the characters that match the expression entered in regular expression. Any substring matches the function finds appear in the submatch outputs.

 

 

To match individual components you have to create capture groups. The ?: syntax means that you exclude that component from the list of capture groups (this isn't a LV peculiarity, it's how Perl-compatible regex engines work).

 

So whilst  (The quick brown fox jumped over the\s*)([a-zA-Z0-9]*)(\s*dog) will give you three terminal outputs with

1. The quick brown fox jumped over the

2. lazy

3. dog

 

(The quick brown fox jumped over the\s*)(?:[a-zA-Z0-9]*)(\s*dog) will give you only two terminals with

1. The quick brown fox jumped over the

2. dog

 

Similarly, The quick brown fox jumped over the\s*([a-zA-Z0-9]*)\s*dog will give you only one terminal with

1. lazy

 

In all cases the Whole Match will give you the whole string if all capture groups match and bugger all if it doesn't.
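The same three patterns can be tried in Python as a rough illustration. This assumes Python's `re` module in place of the LabVIEW node and writes the optional whitespace as `\s*`; the variable names are made up. `groups()` plays the role of the submatch terminals and `group(0)` the Whole Match.

```python
import re

TEXT = "The quick brown fox jumped over the lazy dog"

# Three capture groups: prefix, modifier, suffix.
three = re.search(r"(The quick brown fox jumped over the\s*)([a-zA-Z0-9]*)(\s*dog)", TEXT)

# (?:...) still has to match, but it is left out of the list of groups.
two = re.search(r"(The quick brown fox jumped over the\s*)(?:[a-zA-Z0-9]*)(\s*dog)", TEXT)

# Only the modifier is captured.
one = re.search(r"The quick brown fox jumped over the\s*([a-zA-Z0-9]*)\s*dog", TEXT)

if __name__ == "__main__":
    print(three.groups())
    print(two.groups())
    print(one.groups())
    # The whole match is identical in all three cases.
    print(three.group(0) == two.group(0) == one.group(0) == TEXT)
```

In this sketch the whitespace lands inside the neighbouring groups (the first group ends with a space and the last one starts with one), so the captured pieces are not trimmed.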

Edited by ShaunR
Link to comment

So I've tried to load every snippet in this thread to LV2012 and all I get is a picture on the BD. I tested LV by grabbing a couple from the dark side and they work fine.

 

I created a few of my own as well and could load them back in.

 

Could there be a filter on the LAVAG server that is stripping out the meta data?

 

I'll share my SuperSecret toggle snippet here and see if I can load it back into LV.

 

post-949-0-53158700-1363345046.png

 

 

EDIT:

 

I think something is happening on the server. My image file on disk was ~15k in size and when downloaded is only 5k.

 

Hmmm.... :shifty: 

Edited by Phillip Brooks
Link to comment
Yup.

I checked my snippet after Hoover mentioned this (on another thread), and it was fine.

 

I've now uploaded the same snippet to http://postimage.org/image/5p8stekp7/ and it works fine when downloaded. Definitely something going on with lavag.org. Maybe it's now being optimised/compressed?

Link to comment
