PCRE -- tutorial part 2

written by: admin

Date Written: 7/31/08 Last Updated: 10/31/19

subpatterns

Subpatterns are the parts of a pattern enclosed in parentheses. They are counted from left to right, starting with $1.

$0 is the complete pattern match.

Example 1:

$text="my name is the great houdini!";
$text=preg_replace('/(the)/','$1 $1',$text);
// this produces my name is the the great houdini!

$1 = (the)

Example 2:

$text="my name is the great houdini!";
$text=preg_replace('/(the|great)\s(houdini)/','$1',$text);
// this produces my name is the great!

(the|great)\s(houdini) = $0
(the|great) = $1
(houdini) = $2

$0 = great houdini
$1 = great
$2 = houdini

Example 3: nested subpatterns

$text="my name is the great houdini!";
$text=preg_replace('/(ho(u(d))in)i/', '<b>$3</b>',$text);

$0 = houdini
$1 = houdin
$2 = ud
$3 = d

Captures are numbered in the order of their opening parentheses: the first "(" in the pattern is $1, the second is $2, and so on, even when the subpatterns are nested.
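To check the numbering yourself, here is a small sketch that reads the groups from Example 3 with preg_match() (covered later in this tutorial). The variable name $groups is just my choice for illustration.

```php
<?php
// Read the numbered groups of a nested pattern with preg_match().
$text = "my name is the great houdini!";
preg_match('/(ho(u(d))in)i/', $text, $groups);

// $groups[0] is the whole match; 1, 2 and 3 follow the order of each "(".
foreach ($groups as $number => $value) {
    echo '$' . $number . ' = ' . $value . "\n";
}
```

This should print the same four values listed under Example 3.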

s modifier

Pattern Modifiers

By default the dot (.) in a PCRE pattern matches any character except a newline. A pattern that relies on the dot, such as .*?, therefore cannot match across a line break (\n or \r\n), so it only finds matches that fit on a single line. Take a look at the following example:

$string=".hovermenu ul{
font: bold 13px arial;
padding: 0px;
margin: 0px;
height:20%;
}";

$string contains 6 separate lines. If you try to match everything between { and } with a pattern like \{.*?\}, you won't find a match, because { and } are on different lines and the dot stops at the first newline.

To make PCRE look for matches throughout the string as a whole, as opposed to a bunch of separate lines, add the s modifier to the pattern like this:

$string = preg_replace('/\{.*?\} ?/s', '', $string);

Notice the s located just before the ending quote of the pattern. It tells PCRE to let the dot match newline characters too, so all of the lines in the string are treated as a whole. Now the pattern finds the match.
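Here is a minimal sketch of the difference the s modifier makes, assuming the pattern is written with a dot as \{.*?\}. The shortened $css string and the variable names are only for illustration.

```php
<?php
$css = ".hovermenu ul{\nfont: bold 13px arial;\npadding: 0px;\n}";

// Without s, the dot stops at the first newline, so nothing matches.
$withoutS = preg_match('/\{.*?\}/', $css);

// With s, the dot also matches newlines, so the whole block matches.
$withS = preg_match('/\{.*?\}/s', $css);

// Removing the block leaves only the selector.
$clean = preg_replace('/\{.*?\} ?/s', '', $css);

var_dump($withoutS, $withS);
echo $clean;
```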

preg_match()

Using PCRE to assign a match to a variable.

This is achieved with the preg_match() function, which operates slightly differently than you might think. It follows the form preg_match('/pattern/', $string, $array) and returns 1 if the pattern is found and 0 if it is not. When a match is found, $array[0] holds the text that matched the whole pattern, and $array[1], $array[2], and so on hold the subpatterns.

Take a look at the following example:

<?php
$text="rrrarv sdf df s dfd fdf sd dfsd d sdf";
preg_match('/r.*fsd/',$text,$text1);
echo"original<br><br>$text<br><br>";
echo"Modified<br><br>$text1[0]";
?>

output:

original

rrrarv sdf df s dfd fdf sd dfsd d sdf

Modified

rrrarv sdf df s dfd fdf sd dfsd

Just to review: the pattern looked for the first occurrence of 'r', the last occurrence of 'fsd', and whatever was between the two (.* is greedy, so it takes as much as it can). When it found a match, it took the two terms and everything between them and assigned the result to $text1[0].

After finding the first match, preg_match() stops looking. To find every match, use preg_match_all() instead of preg_match(); the array then becomes a multidimensional array.
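To see the two array shapes side by side, here is a short sketch that runs the same pattern through both functions. The pattern s(d\w*) and the variable names are made up for this illustration.

```php
<?php
$text = "rrrarv sdf df s dfd fdf sd dfsd d sdf";

// preg_match() stops at the first match: $first is a flat array.
preg_match('/s(d\w*)/', $text, $first);

// preg_match_all() collects every match: $all[0] holds the full matches,
// $all[1] holds what the first subpattern captured in each of them.
$count = preg_match_all('/s(d\w*)/', $text, $all);

print_r($first);
print_r($all);
```

preg_match_all() also returns the number of full matches it found, which is stored in $count here.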

non-capturing subpattern

When ?: is placed directly after an opening parenthesis, the group still takes part in the match, but its content is not captured. It is useful when you need parentheses to group something, such as a list of alternative words, but do not want that group to show up in the results.

Take a look at the following two examples:

<pre><?php
$body="this is me. this is sentence two. this white sentence. this white me.";
$original=$body;
preg_match_all('/(is|white) (sentence|me)/',$body, $body);
echo "$original<br><br>";
print_r($body);
?></pre>

will produce:

this is me. this is sentence two. this white sentence. this white me.

Array
(
    [0] => Array
        (
            [0] => is me
            [1] => is sentence
            [2] => white sentence
            [3] => white me
        )

    [1] => Array
        (
            [0] => is
            [1] => is
            [2] => white
            [3] => white
        )

    [2] => Array
        (
            [0] => me
            [1] => sentence
            [2] => sentence
            [3] => me
        )

)

However, with ?: added to the first subpattern:

<pre><?php
$body="this is me. this is sentence two. this white sentence. this white me.";
$original=$body;
preg_match_all('/(?:is|white) (sentence|me)/',$body, $body);
echo "$original<br><br>";
print_r($body);
?></pre>

will produce

this is me. this is sentence two. this white sentence. this white me.

Array
(
    [0] => Array
        (
            [0] => is me
            [1] => is sentence
            [2] => white sentence
            [3] => white me
        )

    [1] => Array
        (
            [0] => me
            [1] => sentence
            [2] => sentence
            [3] => me
        )

)

When ?: is used inside a subpattern, as in the example above, the pattern (?:is|white) (sentence|me) still matches the same text, but does not capture is or white from the first subpattern.

In the first array, the entire pattern is captured as a whole, just as you saw with $0 in the examples listed earlier:

[0] => Array
(
[0] => is me
[1] => is sentence
[2] => white sentence
[3] => white me
)

In the second array, the ?: part (?:is|white) was matched but not captured, so it gets no array of its own. The next subpattern, (sentence|me), becomes the first capture and fills the second array:

[1] => Array
(
[0] => me
[1] => sentence
[2] => sentence
[3] => me
)

This is useful if you want to retrieve the content between two different terms without also capturing the terms themselves. I have an example listed on the PCRE examples page.
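As a rough sketch of that idea, the following pulls out the text between two kinds of tags while keeping the tag names out of the captures. The sample string and the variable names are invented for this illustration.

```php
<?php
$html = "<b>first</b> and <i>second</i>";

// (?:b|i) groups the alternatives without taking a capture slot,
// so the text between the tags is subpattern 1.
preg_match_all('/<(?:b|i)>(.*?)<\/(?:b|i)>/', $html, $found);

print_r($found[1]);
```

Because both tag groups are non-capturing, $found only has the keys 0 and 1.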

lookahead (?= or (?!

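This is really more of an example than an explanation. (?= is a positive lookahead and (?! is a negative lookahead: both check what follows the current position without adding it to the match. The following will match two terms and everything between them, so long as a certain pattern does not appear after the first term. Note that the lookahead here scans the rest of the string, not just the text between the two terms.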

<?php
$string ="hi, how goes it this is [quote]Bill[quote]Bill meyer[/quote] meyer[/quote] meyer";
$stringo = preg_replace('/\[quote\](?!.*?\[quote\]).*?\[\/quote\]/is', "XXXXXX", $string);
print $stringo;
echo"<br>$string";
?>
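To make the two kinds of lookahead easier to compare, here is a small sketch with a simpler string, followed by the [quote] pattern from above. The price string and the variable names are assumptions for illustration only.

```php
<?php
$text = "price: 100 USD, weight: 100 kg";

// (?= USD): match 100 only when " USD" follows; " USD" itself is not consumed.
$positive = preg_replace('/100(?= USD)/', '200', $text);

// (?! USD): match 100 only when " USD" does NOT follow.
$negative = preg_replace('/100(?! USD)/', '200', $text);

// The [quote] pattern from the text: only the inner [quote]...[/quote] pair
// has no further [quote] after its opening tag, so only it is replaced.
$string = "hi, how goes it this is [quote]Bill[quote]Bill meyer[/quote] meyer[/quote] meyer";
$quoted = preg_replace('/\[quote\](?!.*?\[quote\]).*?\[\/quote\]/is', "XXXXXX", $string);

echo "$positive\n$negative\n$quoted\n";
```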

Commenting your PCRE

Occasionally, when you look through different regular expressions, you will come across octothorpes (#) in places where they don't look like they are doing anything. They are often used to insert comments into complicated expressions for added explanation. Remember, modifiers are located just before the ending quote, as in the following example:

$string=preg_replace('/\{.*?\} ?/s', '', $string);

In this case the modifier s is used. For comments, the modifier x is used. Here is an example of how it would be used:

$string=preg_replace('/# This pattern will match everything between two curly brackets on the same line.
\{.*?\}[ ]?/x', '', $string);

$string=preg_replace('/# This pattern will match everything between two curly brackets no matter how many lines are used.
\{.*?\}[ ]?/xs', '', $string);

Multiple comments

$string=preg_replace('/# comment one.
\{.*?\}[ ]?
# comment two: [ ] stands for a literal space, since x ignores a plain one
/x', '', $string);

Rules to remember:

Whitespace in your pattern will now be ignored. To match a literal space, write it as [ ] or as a backslash followed by a space.
You need to use the x modifier.
Begin comments with # and end them with a newline. This means hit "enter" before resuming the pattern you are writing.
Do not use # to leave comments in the middle of a character class, where # is an ordinary character. For example, [tr#] matches t, r, or # and does not start a comment. If you follow the rules listed here, you can leave comments virtually anywhere in a pattern that you are trying to write.
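Putting the rules together, here is a minimal sketch of a commented pattern spread over several lines. The $css string and the variable names are assumptions for illustration. Also keep the delimiter (here /) out of your comments, because PHP treats the first unescaped delimiter as the end of the pattern.

```php
<?php
$css = ".a{\ncolor: red;\n} .b{\ncolor: blue;\n}";

$pattern = '/
    \{      # opening curly bracket
    .*?     # anything, as little as possible (s lets the dot cross newlines)
    \}      # closing curly bracket
    [ ]?    # optional literal space; a bare space would be ignored under x
/xs';

$clean = preg_replace($pattern, '', $css);
echo $clean;
```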

ereg() deprecated

Deprecated: The ereg() functions were deprecated in PHP 5.3 and removed in PHP 7.0. Use preg_match() (described above) instead, with the i modifier in place of eregi().

if (eregi($pattern,$string)) echo"found<br><br>";
else {echo "not found";}

Note: This is a little more processor heavy than preg_replace(), so use only when needed.

Note: The ereg() functions are part of the POSIX regex library and have been discontinued in favor of PCRE. PCRE is a little more difficult to understand, but it is more versatile. The PCRE engine is also more intelligent and therefore often less processor heavy than the POSIX functions, although complex patterns can still be expensive to analyze.
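For a simple found / not found test like the eregi() snippet above, a sketch of the PCRE equivalent could look like this. The pattern and string values are hypothetical.

```php
<?php
// Hypothetical inputs for illustration.
$pattern = '/houdini/i';   // PCRE pattern: delimiters plus the i modifier
$string  = 'My name is the great Houdini!';

// preg_match() returns 1 on a match, 0 on no match, false on error.
$result = preg_match($pattern, $string) === 1 ? 'found' : 'not found';
echo $result;
```

Unlike the POSIX functions, the PCRE pattern carries its own delimiters, and case-insensitivity is requested with the i modifier rather than a separate function.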

e modifier deprecated

Inserting PHP into your expression

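Deprecated: The e modifier was deprecated in PHP 5.5 and removed in PHP 7.0. Use preg_replace_callback() instead, described under Using functions with PCRE below.

The e modifier only works with preg_replace(). It makes PHP evaluate the replacement string as PHP code, which allows you to insert PHP into your PCRE.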

<?php
// Single quotes let the string contain double quotes. The <p> wrapper puts the leading text between > and <.
$text='<p>My favorite comic character is X-23 found in X-Force comics. found here--><a href="https://www.marvel.com/characters/x-23">X-23</a>.</p>';
$text=preg_replace('/>.*?</e',"str_replace('-','–','$0')",$text);
echo"$text<br><br>";
?>

changes:

<p>My favorite comic character is X-23 found in X-Force comics. found here--><a href="https://www.marvel.com/characters/x-23">X-23</a>.</p>

to:

<p>My favorite comic character is X–23 found in X–Force comics. found here––><a href="https://www.marvel.com/characters/x-23">X–23</a>.</p>

In case you do not see it, the above replaces hyphens with dashes only in text that sits between a > and the next <, so hyphens found within angle brackets (tags, scripts, etc.) are left alone. For example, in <a href="yot-yot">yo-yo</a> the url "yot-yot" is ignored, but yo-yo will be matched.

Note: The e modifier also escapes quotes in the matched text with backslashes before the code is evaluated.
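Since the e modifier no longer exists in current PHP, here is a rough sketch of the same hyphen replacement written with preg_replace_callback(). The shortened sample text is hypothetical, and I am assuming the replacement character is an en dash.

```php
<?php
// Hypothetical sample text; the <p> wrapper puts the leading text between > and <.
$text = '<p>X-23 is in <a href="https://www.marvel.com/characters/x-23">X-Force</a>.</p>';

$text = preg_replace_callback(
    '/>.*?</',
    function ($matches) {
        // $matches[0] plays the role of $0 in the e-modifier version.
        return str_replace('-', '–', $matches[0]);
    },
    $text
);
echo $text;
```

The hyphen inside the href survives because it sits inside a tag, where the pattern never matches.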

Using functions with PCRE

The following code will replace <br> with a space, but only if it is located between <style and /style>:

<?php
$summary="<style>th<br>th</style>";
$summary1 = preg_replace_callback('#\<style(.*?)\/style\>#s',
        function ($matches) {
            return str_replace('<br>',' ',$matches[0]);
        },
        $summary
    );
echo "$summary1";
// produces <style>th th</style>
?>

preg_replace_callback() calls the function once for every match and passes the match in the array $matches, which is defined in line 4. $matches[0] is the entire match, including the subpattern (.*?). If we were to use $matches[1], we would get only the content captured by (.*?), that is, the text between <style and /style>.

In line 5 we use str_replace() to build the replacement for the matched content. The function is the second argument of preg_replace_callback().

Line 7, $summary, is the original content that the PCRE and the function are applied to.
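To show the difference between $matches[0] and $matches[1], here is a sketch of a variation that changes only the captured content and rebuilds the tags around it. The pattern is slightly tightened and the extra <br>after text is my own addition for illustration.

```php
<?php
$summary = "<style>th<br>th</style><br>after";

$result = preg_replace_callback(
    '#<style>(.*?)</style>#s',
    function ($matches) {
        // $matches[0] is "<style>th<br>th</style>", $matches[1] is "th<br>th".
        return '<style>' . str_replace('<br>', ' ', $matches[1]) . '</style>';
    },
    $summary
);
echo $result;
```

The <br> outside the style block is left alone because it is not part of any match.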

Note: I am not a big fan of user defined functions like the one listed here, as I find them harder to understand. However, after a fair amount of searching, I was unable to find another PHP function to replace ereg() or preg_replace() with the e modifier.

TAGS: pcre, php