April 25, 2007

.htaccess and mod_rewrite (continued)

Posted at April 25, 2007 08:36 PM

Note: To make this easier for you, here are direct links for Part 1 and Part 2 of this brief .htaccess tutorial.

Let's jump right in and move on to some of the single character pre-defined modifiers we can use with mod_rewrite to make it do its magic.

^ (Caret) - The caret is one of those pre-defined modifiers that actually has two meanings and purposes, depending upon where it's being used.

Most of the time you can read it as Starts With, as in the example below.

Example: RewriteCond %{HTTP_HOST} ^domain\.com [NC]

As already discussed, in English the above says: the HTTP_HOST variable starts with "domain.com".

The other use for a ^ caret is when it's inside an Expression Grouping, which we covered in the last section. The way to remember an Expression Grouping is that it's always enclosed in square brackets, as in [0-9].

When a caret appears inside an expression grouping the ^ means Not. It's easier to explain with an example.

Example: RewriteRule ^store/[^0-9] /store/index.php?prodid=$1

In the above, the first caret (before "store") means Starts With. The second caret (preceding the zero) means Not. So in English it will test as true, or match, if the request is to www.domain.com/store/a or any other URL where the character right after store/ is Not a digit between 0 and 9. Most think this would mean any letter, but it really means any single character that isn't a digit. So something like www.domain.com/store/~ and www.domain.com/store/& would also match.

The caret is about the only one where you have to remember a dual meaning like this. 99% of the time you'll be using it to mean Starts With, but there are those rare occasions (I'll cover some later) where it can be useful to mean Not inside an expression grouping.
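If you'd like to try the two caret meanings without touching a live server, here is a minimal sketch in Python. It assumes Python's re module treats these simple patterns the same way the regex engine behind mod_rewrite does; the helper name rule_matches and the sample paths are made up for illustration.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Return True if the pattern is found in the path.

    re.search is used because the pattern may match anywhere in the
    string unless the pattern itself is anchored with ^ or $.
    """
    return re.search(pattern, path) is not None

# In .htaccess the RewriteRule pattern sees the path without its leading slash.
pattern = r"^store/[^0-9]"

print(rule_matches(pattern, "store/a"))     # a letter follows store/
print(rule_matches(pattern, "store/~"))     # any non-digit counts, not just letters
print(rule_matches(pattern, "store/7"))     # a digit in that position fails
print(rule_matches(pattern, "my/store/a"))  # the leading ^ wants store/ at the start
```

The first two calls report a match and the last two do not: the outer caret pins the pattern to the start of the path, while the caret inside the brackets flips the meaning of the 0-9 range.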

! (Exclamation Point) - The exclamation point is one of those that is really cool. It means Not in ordinary mod_rewrite patterns. You'll sometimes see these instances referred to as a Negative Match, meaning you're trying to test true for everything but one very specific instance.

Many times it's much easier to use a negative match because you know what you don't want to happen, whereas it could be difficult or impossible to imagine every possible positive match.

Example: RewriteCond %{HTTP_HOST} !^www\.domain\.com [NC]

The above simply says: the HTTP_HOST variable does not start with "www.domain.com". Clean and simple. In its normal use the ! will be at the very beginning of your pattern. It only means Not in that position; in the middle of a pattern it is just a literal exclamation point, and the [^a-z] Not expression is what you'd use there.

One other place you'll see an exclamation point is in front of the special file tests on the Right side of a RewriteCond, such as !-f or !-d. There it still means Not: the condition is true when the file (-f) or directory (-d) does not exist. I'm not going to go into it much here. Don't confuse this with No Rewrite, which is written as a single dash (-) on the Right side of a RewriteRule.
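The negative match in the HTTP_HOST example can be modelled in a few lines of Python. This is only a sketch: cond_true is a hypothetical helper, it handles nothing beyond a leading ! and the [NC] flag (mapped to re.IGNORECASE), and it assumes Python's re module behaves like Apache's engine for a pattern this simple.

```python
import re

def cond_true(cond_pattern: str, test_string: str) -> bool:
    """Evaluate a RewriteCond-style pattern, honoring a leading '!'.

    Matching is case-insensitive to mimic the [NC] flag.
    """
    negate = cond_pattern.startswith("!")
    regex = cond_pattern[1:] if negate else cond_pattern
    found = re.search(regex, test_string, re.IGNORECASE) is not None
    return not found if negate else found

cond = r"!^www\.domain\.com"

print(cond_true(cond, "www.domain.com"))  # host starts with www.domain.com, so the condition fails
print(cond_true(cond, "domain.com"))      # anything else makes the condition true
print(cond_true(cond, "WWW.Domain.com"))  # case is ignored, so this fails as well
```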

| (Pipe) - The Pipe character equates to the word or and can be used to match alternate text or expressions that are grouped inside a set of parentheses. (see below for parens grouping.)

Example: RewriteCond %{REQUEST_URI} ^/(index|home|default)\.php

The above would test true for www.domain.com/index.php, www.domain.com/home.php and www.domain.com/default.php.
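Here is a small sketch of the same alternation in Python, assuming its re module agrees with Apache on a pattern like this one. The list of sample URIs and the name matched are mine, added only to show a couple of near misses.

```python
import re

# Same pattern as the RewriteCond above; REQUEST_URI keeps its leading slash.
pattern = re.compile(r"^/(index|home|default)\.php")

uris = ["/index.php", "/home.php", "/default.php", "/about.php", "/indexhome.php"]

# Keep only the URIs that the pattern accepts.
matched = [uri for uri in uris if pattern.search(uri)]
print(matched)
```

Only the first three URIs survive. The last one fails because the pipe picks exactly one of the alternatives, and whichever one it picks has to be followed directly by .php.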

( ) (Parentheses) - Parentheses play something of a dual role. Mainly they are used to create a variable that gets stored and can later be recalled. However, they are also a complement to the Square Brackets [] mentioned before, since parens can be used to group characters for use with the ?, + and * special quantifiers we'll discuss in just a moment.

\ (Backward Slash) - The backslash character is one you'll need to use whenever you need to escape another character that has special meaning in Regex so that it is treated literally rather than as a special character. The main time you will use this is when you have a conditional statement (the Left side of a RewriteRule or in the conditional of a RewriteCond) that contains either a dot (.), a question mark (?) or a space character.

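Example: RewriteCond %{THE_REQUEST} ^GET\ /index\.php\?

(I'm using THE_REQUEST here because a RewriteRule pattern only sees the URL path, never the query string, so an escaped ? in a RewriteRule pattern would have nothing to match.)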

. (Dot or Period) - This is one of those you have to watch out for a bit. In Regex the dot does have a special meaning (a match of any single character excluding the end of a line), however dots are also used in URLs. The best basic rule I can tell you is to make sure you Escape any dots in your URLs when you're constructing RewriteConds and RewriteRule patterns. If you don't, they're going to be seen as something other than what you intend and your rules will not perform as expected.

For the record, the way to escape dots or any other special character is to slap a backslash character ( \ ) in front of it. See the syntax I used above for an example of escaping.

Now we get into the really fun (and sometimes confusing) stuff: pre-defined regex characters that let you Wildcard your conditionals.
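To see what an unescaped dot lets through, here is a quick sketch in Python, assuming its re module handles the dot the way Apache's regex engine does. The host names in the list are invented test values.

```python
import re

unescaped = r"^www.domain.com"   # each dot matches any single character
escaped = r"^www\.domain\.com"   # each dot matches only a literal dot

hosts = ["www.domain.com", "www-domain.com", "wwwxdomainycom"]

unescaped_hits = [h for h in hosts if re.search(unescaped, h)]
escaped_hits = [h for h in hosts if re.search(escaped, h)]

print(unescaped_hits)  # all three hosts get through
print(escaped_hits)    # only the real host name remains
```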

? (Question Mark) - In Regex-speak the question mark matches 0 or 1 of the character (or the set of characters enclosed in brackets or parentheses) that immediately precedes it.

Regex-speak is confusing, isn't it? It's no wonder people don't get it. But this is really easier than it appears. Some examples will help.

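Example 1: RewriteRule ^store/a?$ /store/index.php?prodid=$1

The above example would match www.domain.com/store/ and would also match www.domain.com/store/a. However, it wouldn't match anything else. All you're telling the server is that the lowercase "a" character is completely optional. (The $ at the end of the pattern is the ending anchor, covered under the Dollar Sign further down; without it, anything that merely starts with store/ would match.)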

Example 2: RewriteCond %{HTTP_HOST} ^(www\.)?domain\.com [NC]

In the above example you're telling the server you want a match no matter if the HTTP_HOST variable in the request is www.domain.com or simply domain.com

Got it? The question mark is simply a way for you to give possible matches an extra out.
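Both question mark examples can be tried out with a short Python sketch, assuming Python's re module behaves like Apache's engine here. The matches helper is hypothetical, and the first pattern is written with a trailing $ (an ending anchor) so that nothing is allowed to follow the optional "a".

```python
import re

def matches(pattern: str, text: str, ignore_case: bool = False) -> bool:
    """Return True if the pattern is found in the text; ignore_case mimics [NC]."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(pattern, text, flags) is not None

optional_a = r"^store/a?$"                # the "a" may appear once or not at all
optional_www = r"^(www\.)?domain\.com"    # the whole "www." group is optional

for path in ["store/", "store/a", "store/aa", "store/b"]:
    print(path, matches(optional_a, path))

for host in ["www.domain.com", "domain.com", "Domain.COM", "shop.domain.com"]:
    print(host, matches(optional_www, host, ignore_case=True))
```

The paths store/aa and store/b are rejected, and so is shop.domain.com, because the only optional piece in that pattern is the literal "www." group.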

+ (Plus Sign) - This is the one you'll be using often with some of the expression groupings we covered earlier. The Plus syntax means it will match 1 or more of the character preceding it, or alternatively 1 or more of a group of characters enclosed in square brackets or parentheses. This one will require a few examples because we'll want to see how it works in conjunction with the [] Square Bracket and () Parens groupings. This is where Regex starts to stretch its legs and show you how powerful it can be.

Example 1: RewriteRule ^store/a+$ /store/index.php?prodid=$1

The above would match www.domain.com/store/a and www.domain.com/store/aaa and www.domain.com/store/aaaaaaaa. As long as the base URL is there and it's followed by one or more a's it'll match. That's all that will match though. (The trailing $ in these examples is the ending anchor, explained under the Dollar Sign further down; it keeps the pattern from matching anything longer.)

Example 2: RewriteRule ^store/ape+$ /store/index.php?prodid=$1

The above is a trick question. What do you think it would match as being true?

The answer is that it will match www.domain.com/store/ape and www.domain.com/store/apeeeeee but it will not match www.domain.com/store/apeape.

Remember, these special wildcard quantifiers only apply to the character that immediately precedes them. If you want them to work on more than one character, you need to make sure the characters are Grouped together with brackets or parens.

Example 3: RewriteRule ^store/(ape)+$ /store/index.php?prodid=$1

This is the one that groups the three letters together. It would match www.domain.com/store/ape and www.domain.com/store/apeape. It would not match something like www.domain.com/store/apebigape.

Example 4: RewriteRule ^store/([a-zA-Z]+)$ /store/index.php?prodid=$1

Take a close look at that syntax and specifically the use of grouping. This is one that comes in really handy, so you'll probably be using it a fair amount.

I'm using the Parens to create a group so that everything the + sign matches is captured as one variable, and with the square brackets I'm telling the server to look for anything that is a letter, no matter whether it's lower- or uppercase.

So in essence this section of the rewrite rule will test True if only letters appear in the appropriate place in the URL, but it would test False if any numbers or any other character (like a forward slash) appear there. It will match for www.domain.com/store/yonkers and www.domain.com/store/AbCdEfG but will return False for www.domain.com/store/yonkers1961 or even www.domain.com/store/anything/ (because of the trailing slash).
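The difference between a plus on a single character and a plus on a group is easy to check for yourself. The sketch below uses Python's re module as a stand-in for Apache's engine (an assumption, though a safe one for patterns this simple). The helper name matches is mine, and each pattern carries a trailing $ so that the "will not match" cases really are rejected; the last line shows what happens if you leave the $ off.

```python
import re

def matches(pattern: str, path: str) -> bool:
    """Return True if the pattern is found in the path."""
    return re.search(pattern, path) is not None

single_char = r"^store/ape+$"         # + applies to the final "e" only
grouped = r"^store/(ape)+$"           # + applies to the whole group "ape"
letters = r"^store/([a-zA-Z]+)$"      # + applies to the bracketed letter class

print(matches(single_char, "store/apeeeeee"))    # repeated e's are fine
print(matches(single_char, "store/apeape"))      # rejected: only the e may repeat
print(matches(grouped, "store/apeape"))          # the group as a whole repeats
print(matches(grouped, "store/apebigape"))       # rejected: "big" is not "ape"
print(matches(letters, "store/AbCdEfG"))         # letters only, any case
print(matches(letters, "store/yonkers1961"))     # rejected: digits
print(matches(letters, "store/anything/"))       # rejected: trailing slash

# Without the ending anchor the pattern only has to match the start of the path.
print(matches(r"^store/ape+", "store/apeape"))
```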

* (Asterisk) - The asterisk is a true Wildcard, as most people are familiar with it being. Up to a point anyway. Technically it matches 0 or more of the character preceding it, or in the case of grouped characters, 0 or more of the set. Examples are just easier, so you can see the usage.

Example 1: RewriteRule ^store/a* /store/index.php?prodid=$1

The above would match www.domain.com/store/ (because there are zero a's) and www.domain.com/store/a and www.domain.com/store/aaaaa.

Example 2: RewriteRule ^store/(ape)* /store/index.php?prodid=$1

The above would match www.domain.com/store/ and www.domain.com/store/ape and www.domain.com/store/apeapeape.

Example 3: RewriteRule ^store/(.*) /store/index.php?prodid=$1

Okay, I'm really not trying to confuse you. This one (.*) is one you'll see a lot of. Do you remember what the Period or Dot means? It means any single character. Any letter, any number or any other character.

So when we group it in parens with *, which matches 0 or more of any single character, we end up with a true Wildcard. The above would match www.domain.com/store/abc and www.domain.com/store/abc123 and even www.domain.com/store/abc/123/big/old/ape.

Obviously, you'll want to be a bit careful with this one, but it comes in really, really handy in a lot of situations.
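To get a feel for how much (.*) swallows, here is a sketch in Python, again assuming its re module stands in faithfully for Apache's engine on these patterns. The capture helper is hypothetical; it returns whatever the first set of parens picked up.

```python
import re
from typing import Optional

def capture(pattern: str, path: str) -> Optional[str]:
    """Return the text captured by the first group, or None if there is no match."""
    m = re.search(pattern, path)
    return m.group(1) if m else None

wildcard = r"^store/(.*)"

print(repr(capture(wildcard, "store/abc")))
print(repr(capture(wildcard, "store/abc/123/big/old/ape")))
print(repr(capture(wildcard, "store/")))      # zero characters still counts as a match
print(repr(capture(wildcard, "shop/abc")))    # wrong prefix, so no match at all

# The plain a* from Example 1 is also happy with zero a's.
zero_as_ok = re.search(r"^store/a*", "store/") is not None
print(zero_as_ok)
```

Notice that the capture runs straight through the slashes. That is why this one deserves some care: everything after store/ ends up in the variable.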

Let's see, what others do we need to cover. There are so many to choose from!

$ (Dollar Sign) - The dollar sign is another of those that has a dual purpose. The best way to remember which is which is to think Left Side and Right Side. I assume that you've noticed there is a Left and Right side to RewriteRules.

In RewriteRules the $ on the left side defines the End of your pattern, or more correctly in Regex-speak, an Ending Anchor. So if you had something like:

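RewriteRule ^store/index\.htm$ /store/index.php?prodid=$1

It would match www.domain.com/store/index.htm but not match www.domain.com/store/index.html or www.domain.com/store/index.htm/anything. Keep in mind that a RewriteRule pattern only looks at the path, so a request for www.domain.com/store/index.htm?var=something would still match; query strings are tested with a RewriteCond on %{QUERY_STRING} instead.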

Think of the $ in this instance as being a way to gain greater control over what will and will not match.
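A small sketch in Python shows how much control the ending anchor buys you. It assumes Python's re module matches these patterns the way Apache's engine does; the dot is escaped in both patterns, and the list of paths is made up for the test.

```python
import re

anchored = r"^store/index\.htm$"    # nothing may follow "index.htm"
unanchored = r"^store/index\.htm"   # only the start of the path is pinned down

paths = ["store/index.htm", "store/index.html", "store/index.htm/extra"]

anchored_hits = [p for p in paths if re.search(anchored, p)]
unanchored_hits = [p for p in paths if re.search(unanchored, p)]

print(anchored_hits)     # just the exact path
print(unanchored_hits)   # every path that begins with store/index.htm
```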

Now on the right side of a RewriteRule the $ has a completely different meaning. If you are familiar with coding in PHP it'll make complete sense because it's just a way to retrieve a previously declared variable. A quick example would be

Example: RewriteRule ^store/(.*) /store/index.php?prodid=$1

In the above the $1 at the end is going to be replaced by whatever the server read in the (.*) variable previously. These always start with the number one and count up from there, though you don't have to reference them in the right hand half if you don't need that particular variable.

Example 2: RewriteRule ^store/([a-zA-Z]+)/([0-9]+) /store/index.php?categoryid=$1&prodid=$2

In the above example our Category Id is going to be letters, both capital and lowercase, then our Product ID is a number that can be 1 or more digits long. So the Category ID turns into variable $1 and the Product ID turns into variable $2.

Or if you didn't need the Category ID to pull up a product page in your database, but the category would still be there in the URL, you could simplify things by doing it as

Example: RewriteRule ^store/([a-zA-Z]+)/([0-9]+) /store/index.php?prodid=$2

Got it? Just because a variable is there in the original URL doesn't mean you always have to use it in your dynamic URL.
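Here is a rough model of how the numbered variables get dropped into the right hand side, written in Python. A few assumptions to flag: Python spells the backreferences \1 and \2 where mod_rewrite uses $1 and $2, the rewrite function is a made-up helper rather than anything Apache provides, and the digits group is written as ([0-9]+), with square brackets and the plus inside the parens, so that it matches digits and captures all of them.

```python
import re
from typing import Optional

pattern = r"^store/([a-zA-Z]+)/([0-9]+)"

# Python writes \1 and \2 where mod_rewrite writes $1 and $2.
both_ids = r"/store/index.php?categoryid=\1&prodid=\2"
product_only = r"/store/index.php?prodid=\2"

def rewrite(path: str, target: str) -> Optional[str]:
    """Return the rewritten URL, or None when the pattern does not match."""
    m = re.search(pattern, path)
    if m is None:
        return None
    # expand() fills the captured groups into the target template.
    return m.expand(target)

print(rewrite("store/widgets/42", both_ids))
print(rewrite("store/widgets/42", product_only))
print(rewrite("store/42/widgets", both_ids))   # category must be letters, so no match
```

The second call shows that skipping $1 is harmless: the category is still captured, it just isn't used.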

I'm glad you got it, because now I have to confuse you a bit on the Left/Right thing. In a RewriteCond, and only in a RewriteCond, you won't normally see a $ on the left side of the equation, since that side is filled with one of the pre-defined server variables. But on the Right side in RewriteConds the $ is used to signify the ending anchor, just like it is on the Left side of RewriteRules.

Hopefully that little quirk doesn't cause anybody any pain. You won't be using a lot of $'s in RewriteCond's anyway. So remembering the Left/Right rule above is your best bet.

% (Percent Sign) - The Percent sign is another of those fun ones, though not quite as bad. You'll always use a % on the left side of your RewriteCond statements since it's there to tell the server that you want to access some server variable. For example %{HTTP_HOST} or %{HTTP_REFERER} or %{REQUEST_URI} or %{THE_REQUEST}. That one is simply a given. It's part of the required syntax.

The other way you can use a % is if you have set/captured a variable in a RewriteCond that you need to use in your RewriteRule. Basically, where $1 and the like recall a variable captured in the RewriteRule pattern, %1 recalls a variable captured in the last RewriteCond that matched.

Example:

Options +FollowSymLinks
RewriteEngine on
RewriteCond %{REQUEST_URI} ^/index\.php
RewriteCond %{QUERY_STRING} ^productid=(.*)
# The trailing ? keeps the old query string from being tacked onto the new URL
RewriteRule ^(.*)$ http://www.domain.com/store/%1/? [R=301,L]

Say hunh?

Okay, the above is a completely made up situation, but one many run across. In this case we used to have a shopping cart installed where it used dynamic URLs and everything was keyed off of the default index.php file. When someone went to a product page in the old cart there was always a "productid" in the query string part of the dynamic URL.

We've recently changed the shopping cart we use. The new one is already constructed to have so-called Search Engine Friendly URLs, showing the URLs in folder/subfolder mode instead of showing dynamic URLs with query strings.

Since we know that our new cart only needs the Product ID to display the correct page we can key to that. So in our second RewriteCond we capture it as a variable and use %1 to drop it into our resulting URL, issuing a 301 Moved Permanently status code while we're at it so that the spiders can find the path to our new shopping cart pages.
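If it helps to see the moving parts, the two conditions and the %1 variable can be modelled in Python. This is only a sketch of the matching logic: redirect_target is a hypothetical function, the sample URLs are invented, and it models nothing beyond how %1 gets filled in. It does not send a redirect or reproduce anything else Apache does with the request.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

def redirect_target(url: str) -> Optional[str]:
    """Return the new-style URL for an old cart URL, or None if the rules don't apply."""
    parts = urlsplit(url)

    # RewriteCond %{REQUEST_URI} ^/index\.php
    if re.search(r"^/index\.php", parts.path) is None:
        return None

    # RewriteCond %{QUERY_STRING} ^productid=(.*)
    cond = re.search(r"^productid=(.*)", parts.query)
    if cond is None:
        return None

    # %1 is whatever the parens in the matching RewriteCond captured.
    return "http://www.domain.com/store/" + cond.group(1) + "/"

print(redirect_target("http://www.domain.com/index.php?productid=1234"))
print(redirect_target("http://www.domain.com/index.php?page=2"))
print(redirect_target("http://www.domain.com/about.php?productid=9"))
```

Both conditions have to pass before the rule fires, which is why the second and third URLs come back as None.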

That's enough for now. I'll let you chew on this much for a bit before going on. There are lots of ways you can mix, match and combine the various Regex syntax we've covered so far. Next we'll start using them in some real world examples.

Don't worry about trying to memorize it all. It's impossible to memorize it all. Trust me. I've been dabbling with mod_rewrite and Regular Expressions for years and I still have little cheater files I refer to many if not most times.

There are some basic things you'll want to become familiar with since they get used a lot. We'll drum those into your brain in the real world examples to follow, so don't worry about it. I'll take the time with each of those to explain what's happening, referencing the syntax that makes it happen.


Comments

Thanks so much, you helped me make sense of some undesired effects caused by some script I picked up online.

Posted by Lil Wiki at July 14, 2008 11:46 PM
