
V.4 Pattern matching

 

Pattern matching is important across a wide variety of Web programming tasks but most notably when looking for exceptions in user-entered data and when trying to parse information out of non-cooperating Web sites.

Tcl's pattern matching facilities test whether a given string matches a specified pattern. Patterns are described using a syntax known as regular expressions. For example, the pattern expression consisting of a single period matches any character. The pattern a..a matches any four-character string whose first and last characters are both a.

The regexp command takes a pattern, a string, and an optional match variable. It tests whether the string matches the pattern, returns 1 if there is a match and zero otherwise, and sets the match variable to the part of the string that matched the pattern:

% set something candelabra
candelabra

% regexp a..a $something match
1

% set match
abra
Patterns can also contain subpatterns (delimited by parentheses) and denote repetition. A star denotes zero or more occurrences of the preceding pattern item, so a(.*)a matches any string of at least two characters that begins and ends with the character a. Whatever matched the subpattern between the a's is put into the first subpattern variable, if one is supplied after the match variable:
% set something candelabra
candelabra

% regexp a(.*)a $something match
1

% set match
andelabra
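The session above supplies only the match variable. A minimal sketch of the same call with a subpattern variable added (the name inner is chosen here purely for illustration):

```tcl
set something candelabra

# match receives the whole matching substring;
# inner (a hypothetical name) receives what the parenthesized subpattern matched
regexp {a(.*)a} $something match inner

puts $match   ;# andelabra
puts $inner   ;# ndelabr
```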
Note that Tcl regexp by default behaves in a greedy fashion. There are three alternative substrings of "candelabra" that match the regexp a(.*)a: "andelabra", "andela", and "abra". Tcl chose the longest substring. This is very painful when trying to pull HTML pages apart:
% set simple_case "Normal folks might say <i>et cetera</i>"
Normal folks might say <i>et cetera</i>
% regexp {<i>(.+)</i>} $simple_case match italicized_phrase
1

% set italicized_phrase
et cetera

% set some_html "Pedants say <i>sui generis</i> and <i>ipso facto</i>"
Pedants say <i>sui generis</i> and <i>ipso facto</i>
% regexp {<i>(.+)</i>} $some_html match italicized_phrase
1

% set italicized_phrase
sui generis</i> and <i>ipso facto
What you want is a non-greedy regexp, a standard feature of Perl and an option in Tcl 8.1 and later versions (see http://www.scriptics.com/services/support/howto/regexp81.html).
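As a sketch of what that looks like, assuming Tcl 8.1 or later for the non-greedy quantifier and Tcl 8.3 or later for the -all and -inline flags, the pedants example can be rewritten with .+? in place of .+:

```tcl
set some_html "Pedants say <i>sui generis</i> and <i>ipso facto</i>"

# .+? is the non-greedy form of .+ and stops at the first </i>
regexp {<i>(.+?)</i>} $some_html match italicized_phrase
puts $italicized_phrase   ;# sui generis

# -all -inline returns every match as a flat list:
# whole match, subpattern, whole match, subpattern, ...
set phrases {}
foreach {whole phrase} [regexp -all -inline {<i>(.+?)</i>} $some_html] {
    lappend phrases $phrase
}
puts $phrases   ;# {sui generis} {ipso facto}
```

On interpreters without non-greedy support, a negated character class such as {<i>([^<]+)</i>} is a common substitute, at the cost of failing on phrases that contain nested tags.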

Lisp systems in the 1970s included elegant ways of returning all possibilities when there were multiple matches for an expression. Java libraries, Perl, and Tcl demonstrate the progress of the field of computer science by ignoring these superior systems of decades past.

 

Matching Cookies From the Browser

A common problem in Web development is pulling information out of cookies that come from the client. The cookie spec at http://home.netscape.com/newsref/std/cookie_spec.html mandates that multiple cookies be separated by semicolons. So you look for "the cookie name that you've been using" followed by an equals sign and then slurp up anything that follows that isn't a semicolon. Here is how the ArsDigita Community System looks for the value of the last_visit cookie:

regexp {last_visit=([^;]+)} $cookie match last_visit
Note the square brackets inside the regexp. The Tcl interpreter isn't trying to call a procedure because the entire regexp has been grouped with braces rather than double quotes. Square brackets denote a range of acceptable characters:
  • [A-Z] would match any uppercase letter
  • [ABC] would match any of the first three letters in the alphabet (uppercase only)
  • [^ABC] would match any character other than the first three uppercase letters in the alphabet, i.e., the ^ reverses the sense of the brackets
The plus sign after the [^;] says "one or more characters that meet the preceding spec", i.e., "one or more characters that aren't semicolons". It is distinguished from * in that there must be at least one character for a match.

If successful, the regexp command above will set the match variable with the complete matching string, starting from "last_visit=". Our code doesn't make any use of this variable but only looks at the subvar last_visit that would also have been set.

Pages that use this cookie expect an integer, and this code failed in one case where a user edited his cookies file and corrupted it so that his browser was sending several thousand bytes of garbage after the "last_visit=". A better approach might have been to limit the match to digits:

regexp {last_visit=([0-9]+)} $cookie match last_visit
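One way to package this is a small helper that checks the return value of regexp before trusting the variable. The sketch below is not the ArsDigita code: get_last_visit is a hypothetical name, and the pattern is tightened beyond the one above (an assumption of this example) so that the cookie name must start the header or follow a semicolon and the digits must run to the next semicolon or the end. The (?: ) non-capturing groups need Tcl 8.1 or later.

```tcl
# Hypothetical helper: return the last_visit value from a Cookie header,
# or the empty string if the cookie is missing or not purely numeric.
proc get_last_visit {cookie} {
    if {[regexp {(?:^|; *)last_visit=([0-9]+)(?:;|$)} $cookie match last_visit]} {
        return $last_visit
    }
    return ""
}

puts [get_last_visit "user_id=42; last_visit=958076419; theme=dark"]   ;# 958076419
puts "<[get_last_visit {last_visit=12garbage}]>"                       ;# <>
```

Returning an empty string on failure lets the calling page fall back to a default instead of passing garbage to code that expects an integer.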

 

Matching Into Multiple Variables

More generally regexp allows multiple pattern variables. The pattern variables after the first are set to the substrings that matched the subpatterns. Here is an example of matching a credit card expiration date entered by a user:

% set date_typed_by_user "06/02"
06/02

% regexp {([0-9][0-9])/([0-9][0-9])} $date_typed_by_user match month year
1

% set month
06

% set year
02
%
Each pair of parentheses corresponds to a subpattern variable.
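The pattern above matches anywhere inside the string, so input such as "x06/021" would also succeed. A possible validation sketch, assuming a two-digit MM/YY format and using a hypothetical procedure name, anchors the pattern with ^ and $ and then range-checks the month:

```tcl
# Hypothetical helper: 1 if the string is exactly MM/YY with month 01..12, else 0
proc valid_expiration_p {date} {
    if {![regexp {^([0-9][0-9])/([0-9][0-9])$} $date match month year]} {
        return 0
    }
    # scan with %d reads "08" as decimal 8; expr could treat a leading 0 as octal
    scan $month %d month_number
    return [expr {$month_number >= 1 && $month_number <= 12}]
}

puts [valid_expiration_p "06/02"]     ;# 1
puts [valid_expiration_p "13/02"]     ;# 0
puts [valid_expiration_p "x06/021"]   ;# 0
```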

 

Full Syntax


The most general form of regexp includes optional flags as well as multiple match variables:

regexp [flags] pattern data matched_result var1 var2 ...
The various flags are
  • -nocase
    uppercase characters in the data are treated as lowercase, giving case-insensitive matching. Older Tcl releases did not fold the pattern as well, so keep your pattern all lowercase if the code must run there.
  • -indices
    the match variables are set to pairs of indices (first and last character) delimiting the matched substrings, rather than to the strings themselves. The return value of regexp is still 1 or 0.
  • If your pattern begins with a -, put a -- flag at the end of your flags
Regular expression syntax is:
  • .
    matches any character.
  • *
    matches zero or more instances of the previous pattern item.
  • +
    matches one or more instances of the previous pattern item.
  • ?
    matches zero or one instances of the previous pattern item.
  • |
    disjunction, e.g., (a|b) matches an a or a b
  • ( )
    groups a sub-pattern.
  • [ ]
    delimits a set of characters. ASCII ranges are specified using hyphens, e.g., [A-Za-z] matches any alphabetic character. (A single range [A-z] is not equivalent: it also matches the six punctuation characters that sit between Z and a in ASCII.) If the first character in the set is ^, this complements the set, e.g., [^A-Za-z] matches any non-alphabetic character.
  • ^
    Matches only when the pattern appears at the beginning of the string. The ^ must appear at the beginning of the pattern expression.
  • $
    Matches only when the pattern appears at the end of the string. The $ must appear last in the pattern expression.
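
A few of the flags and pattern items listed above, exercised in one short script. This is only an illustrative sketch; the sample strings and the variable name where are made up for the example.

```tcl
# -nocase: ignore case while matching
puts [regexp -nocase {tcl} "I like TCL"]        ;# 1

# -indices: the match variable receives a {first last} index pair
regexp -indices {a..a} candelabra where
puts $where                                     ;# 6 9

# --: marks the end of the flags, needed when the pattern starts with a hyphen
puts [regexp -- {-all} "use the -all flag"]     ;# 1

# ^ and $ anchor the pattern to the whole string; | offers alternatives
puts [regexp {^(GET|POST)$} "GET"]              ;# 1
puts [regexp {^(GET|POST)$} "GETS"]             ;# 0
```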

More: http://www.tcl.tk/man/tcl8.4/TclCmd/regexp.htm

 

Matching with substitution

It's common in Web programming to create strings by substitution. Tcl's regsub command performs substitution based on a pattern:

regsub [flags] pattern data replacements var
matches the pattern against the data. If the match succeeds, the variable named var is set to data, with various parts modified, as specified by replacements. If the match fails, var is simply set to data. The value returned by regsub is the number of replacements performed.

The flag -all specifies that every occurrence of the pattern should be replaced. Otherwise only the first occurrence is replaced. Other flags include -nocase and --, as with regexp.
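
A small sketch of the difference, with a made-up input string, showing both the result variable and the count that regsub returns:

```tcl
set text "spam, spam, eggs and spam"

# Without -all only the first occurrence is replaced; regsub returns 1
set n_first [regsub {spam} $text {ham} first_only]
puts "$n_first: $first_only"     ;# 1: ham, spam, eggs and spam

# With -all every occurrence is replaced; regsub returns the count
set n_all [regsub -all {spam} $text {ham} all_replaced]
puts "$n_all: $all_replaced"     ;# 3: ham, ham, eggs and ham

# No match: regsub returns 0 and the variable receives an unchanged copy
set n_none [regsub {toast} $text {ham} untouched]
puts "$n_none: $untouched"       ;# 0: spam, spam, eggs and spam
```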

Here's an example from the banner ideas module of the ArsDigita Community System (see http://photo.net/doc/bannerideas.html). The goal is that each banner idea contain a linked thumbnail image. To facilitate cutting and pasting of the image HTML, we don't require that the publisher include uniform subtags within the IMG. However, we use regsub to clean up:

# turn "<img align=right hspace=5" into "<img align=left border=0 hspace=8"
regsub -nocase {align=[^ ]+} $picture_html "" without_align
regsub -nocase {hspace=[^ ]+} $without_align "" without_hspace
regsub -nocase {<img} $without_hspace {<img align=left border=0 hspace=8} final_photo_html

In the example above, the replacements argument was always a literal string (the empty string "" in the first two calls). Other replacement directives include:

  • & inserts the string that matched the pattern
  • The backslashed numbers \1 through \9 insert the strings that matched the corresponding sub-patterns in the pattern.
Here's another Web example, which parses HTML and replaces the comments (delimited in HTML by <!-- and -->) with the comment text enclosed in parentheses. Because the sub-pattern is [^-]*, a comment whose text itself contains a hyphen is left untouched.
% proc extract_comment_text {html} {
regsub -all {<!--([^-]*)-->} $html {(\1)} with_exposed_comments
return $with_exposed_comments
}

% extract_comment_text {<!--insert the price below-->
We give the same low price to everyone: $219.99
<!--make sure to query out discount if this is one of our big customers-->}
(insert the price below)
We give the same low price to everyone: $219.99
(make sure to query out discount if this is one of our big customers)
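
The & directive is not used in the examples so far. A brief sketch with made-up input strings, also showing two backslashed sub-pattern references in one replacement:

```tcl
# & stands for the whole match: wrap every number in <b> tags
regsub -all {[0-9]+} "Order 66 ships in 3 days" {<b>&</b>} bolded
puts $bolded    ;# Order <b>66</b> ships in <b>3</b> days

# \1 and \2 refer to the first and second sub-patterns: turn MM/YY into YY-MM
regsub {([0-9][0-9])/([0-9][0-9])} "06/02" {\2-\1} swapped
puts $swapped   ;# 02-06
```

Grouping the replacement with braces keeps the Tcl interpreter from processing the backslashes before regsub sees them.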

More: http://www.tcl.tk/man/tcl8.4/TclCmd/regsub.htm


String match

Tcl provides an alternative matching mechanism that is simpler for users to understand than regular expressions. The Tcl command string match uses "GLOB-style" matching. Here is the syntax:

string match pattern data
It returns 1 if there is a match and 0 otherwise. The pattern elements permitted here are ?, which matches any single character; *, which matches any sequence of zero or more characters; [], which delimits a set of characters or a range; and \x, which matches the character x literally. This differs from regexp in that the pattern must match the entire string supplied:
% regexp "foo" "foobar"
1

% string match "foo" "foobar"
0

% # here's what we need to do to make the string match
% # work like the regexp
% string match "*foo*" foobar
1
Here's an example of the character range system in use:
string match {*[0-9]*} $text

returns 1 if text contains at least one digit and 0 otherwise.
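
A few more glob patterns, as a sketch with made-up file names and words. The -nocase option to string match is assumed to be available, which holds for Tcl 8.4.

```tcl
puts [string match {?at} "cat"]                   ;# 1
puts [string match {?at} "chat"]                  ;# 0, ? stands for exactly one character
puts [string match {*.jpg} "photo.jpg"]           ;# 1
puts [string match {[a-c]*} "banana"]             ;# 1, first character is in the range a-c
puts [string match -nocase {*.JPG} "photo.jpg"]   ;# 1
```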

More: http://www.tcl.tk/man/tcl8.4/TclCmd/string.htm

 

 


Exercises

1. 

  • Write a procedure that takes a string and makes sure that it contains an "@" sign
  • Extend the procedure to make sure that only letters and numbers are allowed before the "@" sign
  • Extend the procedure to check that a valid domain comes after the "@" sign (hint: look at 2.). A valid domain contains at least one "." and only letters after the last ".", so malte.cognovis.de is a valid domain and cognovis.d1e is not.
  • Extend the procedure to return "Welcome foo, member of bar.com" if the string is "foo@bar.com"
  • Extend the procedure to return "Welcome OpenACS member foo" if the string is like "foo@openacs.org" meaning, the e-mail ends with openacs.org
  • Check against the valid domain again. This time make use of the ad_locales table installed in your local copy of OpenACS. To make this work you will have to use the OpenACS Shell.
    • Get a list of all countries from the table ad_locales. Choose the language column for this. The command to extract this is "db_list".
    • If your list contains the language "ca" more than once, make sure to limit it to one "ca" only. Make sure this works for others as well.
    • As ".com", ".org", and ".net" are also valid domain endings, append them to the list.
    • Make sure that the domain ends with one of the entries in the list you created. So automotive.ca works but automotive.eu does not (and yes, I know that .eu is now a valid domain :-)).




2.

  1. Search at amazon.com for your favorite book. Copy the URL until you see the "/ref..." part, e.g. http://www.amazon.com/4-Hour-Workweek-Escape-Live-Anywhere/dp/0307353133
  2. In the OpenACS shell use "ad_httpget" to retrieve the URL you copied. Look at the api doc for the syntax.
  3. Use regexp to find the price of the book in the html source returned to you by ad_httpget
  4. Return the price of the book.

 


---

based on Tcl for Web Nerds