test-doc - Pattern matching

V.4 Pattern matching

Pattern matching is important across a wide variety of Web programming tasks but most notably when looking for exceptions in user-entered data and when trying to parse information out of non-cooperating Web sites.

Tcl's pattern matching facilities test whether a given string matches a specified pattern. Patterns are described using a syntax known as regular expressions. For example, the pattern expression consisting of a single period matches any character. The pattern a..a matches any four-character string whose first and last characters are both a.

The regexp command takes a pattern, a string, and an optional match variable. It tests whether the string matches the pattern, returns 1 if there is a match and zero otherwise, and sets the match variable to the part of the string that matched the pattern:

% set something candelabra
candelabra

% regexp a..a $something match
1

% set match
abra

Patterns can also contain subpatterns (delimited by parentheses) and denote repetition. A star denotes zero or more occurrences of a pattern, so a(.*)a matches any string of at least two characters that begins and ends with the character a. Whatever has matched the subpattern between the a's will get put into the first subvariable:

% set something candelabra
candelabra

% regexp a(.*)a $something match
1

% set match
andelabra
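
The transcript above only inspects match. As a small sketch (the variable name inner is our own choice, not part of the original example), naming a second variable captures what the parenthesized subpattern matched:

```tcl
set something candelabra

# match receives the whole matched substring; inner receives only the
# part matched by the parenthesized subpattern (.*)
set found [regexp {a(.*)a} $something match inner]

puts $found   ;# 1
puts $match   ;# andelabra
puts $inner   ;# ndelabr
```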

Note that Tcl regexp by default behaves in a greedy fashion. There are three alternative substrings of "candelabra" that match the regexp a(.*)a: "andelabra", "andela", and "abra". Tcl chose the longest substring. This is very painful when trying to pull HTML pages apart:

% set simple_case "Normal folks might say <i>et cetera</i>"
Normal folks might say <i>et cetera</i>
% regexp {<i>(.+)</i>} $simple_case match italicized_phrase
1

% set italicized_phrase
et cetera

% set some_html "Pedants say <i>sui generis</i> and <i>ipso facto</i>"
Pedants say <i>sui generis</i> and <i>ipso facto</i>
% regexp {<i>(.+)</i>} $some_html match italicized_phrase
1

% set italicized_phrase
sui generis</i> and <i>ipso facto

What you want is a non-greedy regexp, a standard feature of Perl and an option in Tcl 8.1 and later versions (see http://www.scriptics.com/services/support/howto/regexp81.html).
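
As a rough sketch of what that looks like, assuming a reasonably recent Tcl (non-greedy quantifiers need 8.1 or later, and the -all and -inline flags are missing from the oldest 8.x releases), appending ? to a quantifier makes it non-greedy. The variable names all_matches and phrases are ours:

```tcl
set some_html "Pedants say <i>sui generis</i> and <i>ipso facto</i>"

# .+? is the non-greedy form of .+ : it stops at the first </i> it can
regexp {<i>(.+?)</i>} $some_html match italicized_phrase
puts $italicized_phrase   ;# sui generis

# -all -inline returns every match as a list: for each match, the full
# matched text followed by the text of its subpattern
set all_matches [regexp -all -inline {<i>(.+?)</i>} $some_html]

# keep only the subpattern text from each pair
set phrases [list]
foreach {whole phrase} $all_matches {
    lappend phrases $phrase
}
puts $phrases   ;# {sui generis} {ipso facto}
```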

Lisp systems in the 1970s included elegant ways of returning all possibilities when there were multiple matches for an expression. Java libraries, Perl, and Tcl demonstrate the progress of the field of computer science by ignoring these superior systems of decades past.

Matching Cookies From the Browser

A common problem in Web development is pulling information out of cookies that come from the client. The cookie spec at http://home.netscape.com/newsref/std/cookie_spec.html mandates that multiple cookies be separated by semicolons. So you look for "the cookie name that you've been using" followed by an equals sign and then slurp up anything that follows that isn't a semicolon. Here is how the ArsDigita Community System looks for the value of the last_visit cookie:

regexp {last_visit=([^;]+)} $cookie match last_visit

Note the square brackets inside the regexp. The Tcl interpreter isn't trying to call a procedure because the entire regexp has been grouped with braces rather than double quotes. Square brackets denote a range of acceptable characters:

[A-Z] would match any uppercase letter
[ABC] would match any of the first three letters of the alphabet (uppercase only)
[^ABC] would match any character other than the first three uppercase letters of the alphabet, i.e., the ^ reverses the sense of the brackets

The plus sign after the [^;] says "one or more characters that meet the preceding spec", i.e., "one or more characters that aren't a semicolon". It is distinguished from * in that there must be at least one character for a match.

If successful, the regexp command above will set the match variable to the complete matching string, starting from "last_visit=". Our code doesn't make any use of this variable but only looks at the subvar last_visit that would also have been set.

Pages that use this cookie expect an integer, and this code failed in one case where a user edited his cookies file and corrupted it so that his browser was sending several thousand bytes of garbage after the "last_visit=". A better approach might have been to limit the match to digits:

regexp {last_visit=([0-9]+)} $cookie match last_visit
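
The digits-only pattern can be wrapped in a small helper. This is only a sketch: the procedure name last_visit_from_cookie is hypothetical, and the extra anchoring around the cookie name and value is our own addition rather than what the ArsDigita Community System does:

```tcl
# Hypothetical helper: return the digits of the last_visit cookie,
# or an empty string if the cookie is missing or not purely numeric.
proc last_visit_from_cookie {cookie} {
    # (^|; *) requires a whole cookie name, not the tail of another
    # name such as my_last_visit; (;|$) rejects trailing garbage
    if {[regexp {(^|; *)last_visit=([0-9]+)(;|$)} $cookie match lead last_visit]} {
        return $last_visit
    }
    return ""
}

puts [last_visit_from_cookie "session=abc; last_visit=1071622046; theme=dark"]
# prints: 1071622046
```

A caller can then treat the empty string as "no usable cookie" instead of passing garbage to code that expects an integer.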

Matching Into Multiple Variables

More generally regexp allows multiple pattern variables. The pattern variables after the first are set to the substrings that matched the subpatterns. Here is an example of matching a credit card expiration date entered by a user:

% set date_typed_by_user "06/02"
06/02

% regexp {([0-9][0-9])/([0-9][0-9])} $date_typed_by_user match month year
1

% set month
06

% set year
02
%

Each pair of parentheses corresponds to a subpattern variable.
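
For validating user input, the same pattern is usually anchored so that stray characters before or after the date are rejected. A minimal sketch, assuming a hypothetical procedure name parse_expiration and a month range check that the original example does not perform:

```tcl
# Hypothetical helper: return a two-element list {month year},
# or an empty string if the input is not of the form MM/YY.
proc parse_expiration {text} {
    # ^ and $ force the pattern to cover the entire string
    if {![regexp {^([0-9][0-9])/([0-9][0-9])$} $text match month year]} {
        return ""
    }
    # scan with %d reads the digits as decimal, so a leading zero
    # (as in 08) is not mistaken for an octal number
    scan $month %d month_number
    if {$month_number < 1 || $month_number > 12} {
        return ""
    }
    return [list $month $year]
}

puts [parse_expiration "06/02"]   ;# 06 02
```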

Full Syntax

The most general form of regexp includes optional flags as well as multiple match variables:

regexp [flags] pattern data matched_result var1 var2 ...

The various flags are

-nocase
uppercase characters in the data are treated as lowercase, giving case-insensitive matching (on Tcl versions before 8.1, make sure that your pattern is all lowercase!)
-indices
the match variables are set to pairs of indices (first and last character) delimiting the matched substrings, rather than to the strings themselves.
--
marks the end of the flags. If your pattern begins with a -, put a -- flag at the end of your flags.
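
A short sketch of the flags in use (the variable names are ours; indices count from 0):

```tcl
# -nocase: the data may be in any case
set nocase_result [regexp -nocase {candelabra} "A Gilded CANDELABRA"]
puts $nocase_result                     ;# 1

# -indices: match holds the first and last index of the matched
# substring instead of the substring itself
regexp -indices {a..a} candelabra match
puts $match                             ;# 6 9
puts [string range candelabra 6 9]      ;# abra

# --: without it, the pattern -v would be mistaken for a flag
set dash_result [regexp -- {-v} "ls -v"]
puts $dash_result                       ;# 1
```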

Regular expression syntax is:

.
matches any character.
*
matches zero or more instances of the previous pattern item.
+
matches one or more instances of the previous pattern item.
?
matches zero or one instances of the previous pattern item.
|
disjunction, e.g., (a|b) matches an a or a b
( )
groups a sub-pattern.
[ ]
delimits a set of characters. ASCII ranges are specified using hyphens, e.g., [A-Za-z] matches any uppercase or lowercase letter. (Beware of [A-z]: in ASCII it also covers the six punctuation characters that sit between Z and a.) If the first character in the set is ^, this complements the set, e.g., [^A-Za-z] matches any non-alphabetic character.
^
Matches only when the pattern appears at the beginning of the string. The ^ must appear at the beginning of the pattern expression.
$
Matches only when the pattern appears at the end of the string. The $ must appear last in the pattern expression.
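
A few of these items combined, as an illustrative sketch (the sample strings and variable names are made up for the example):

```tcl
# ^ and $ anchor the pattern to the whole string
set all_digits [regexp {^[0-9]+$} "12345"]   ;# 1
set not_digits [regexp {^[0-9]+$} "123a"]    ;# 0

# | and ? : match cat or dog, optionally followed by s
regexp {(cat|dog)s?} "hot dogs" match animal
puts "$match $animal"                        ;# dogs dog

# [A-z] is wider than it looks: in ASCII the characters [ \ ] ^ _ `
# sit between Z and a, so an underscore matches too
set underscore_wide  [regexp {[A-z]} "_"]    ;# 1
set underscore_tight [regexp {[A-Za-z]} "_"] ;# 0
```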

More: http://www.tcl.tk/man/tcl8.4/TclCmd/regexp.htm

Matching with substitution

It's common in Web programming to create strings by substitution. Tcl's regsub command performs substitution based on a pattern:

regsub [flags] pattern data replacements var

matches the pattern against the data. If the match succeeds, the variable named var is set to data, with various parts modified, as specified by replacements. If the match fails, var is simply set to data. The value returned by regsub is the number of replacements performed.

The flag -all specifies that every occurrence of the pattern should be replaced. Otherwise only the first occurrence is replaced. Other flags include -nocase and --, as with regexp.
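
A toy illustration of the return value and of -all (the data and variable names are invented for the example):

```tcl
# without -all only the first a is replaced; regsub returns the count
set count_first [regsub {a} banana o first_only]
puts "$count_first $first_only"   ;# 1 bonana

set count_all [regsub -all {a} banana o every_one]
puts "$count_all $every_one"      ;# 3 bonono

# no match: the variable still receives the unchanged data
set count_none [regsub {z} banana o untouched]
puts "$count_none $untouched"     ;# 0 banana
```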

Here's an example from the banner ideas module of the ArsDigita Community System (see http://photo.net/doc/bannerideas.html). The goal is that each banner idea contain a linked thumbnail image. To facilitate cutting and pasting of the image html, we don't require that the publisher include uniform subtags within the IMG. However, we use regsub to clean up:

# turn "<img align=right hspace=5" into "<img align=left border=0 hspace=8"
regsub -nocase {align=[^ ]+} $picture_html "" without_align
regsub -nocase {hspace=[^ ]+} $without_align "" without_hspace
regsub -nocase {<img} $without_hspace {<img align=left border=0 hspace=8} final_photo_html

In the example above, each replacement was a literal string (the empty string "" in the first two calls). Other replacement directives include:

& inserts the string that matched the pattern
The backslashed numbers \1 through \9 insert the strings that matched the corresponding sub-patterns in the pattern.
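
Both directives in a brief sketch (sample data and variable names are ours). The replacement is written in braces so that Tcl does not process the backslashes before regsub sees them:

```tcl
# & stands for the whole matched text
regsub -all {[0-9]+} "room 12, floor 3" {<&>} bracketed
puts $bracketed   ;# room <12>, floor <3>

# \1 and \2 stand for the first and second parenthesized sub-patterns
regsub {([a-z]+)@([a-z.]+)} "foo@bar.com" {\2!\1} swapped
puts $swapped     ;# bar.com!foo
```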

Here's another web example, which parses HTML and replaces the comments (delineated in HTML by <!-- and -->) with the comment text, enclosed in parentheses.

% proc extract_comment_text {html} {
    regsub -all {<!--([^-]*)-->} $html {(\1)} with_exposed_comments
    return $with_exposed_comments
}

% extract_comment_text {<!--insert the price below-->
We give the same low price to everyone: $219.99
<!--make sure to query out discount if this is one of our big customers-->}
(insert the price below)
We give the same low price to everyone: $219.99
(make sure to query out discount if this is one of our big customers)
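
One limitation of the pattern above is that [^-]* cannot cross a hyphen, so a comment whose text contains one is left untouched. A possible variation, assuming Tcl 8.1 or later for non-greedy matching (the procedure name extract_comment_text_nongreedy is hypothetical):

```tcl
proc extract_comment_text_nongreedy {html} {
    # .*? matches as little as possible, so each comment ends at the
    # first --> that follows it, even when the text contains hyphens
    regsub -all {<!--(.*?)-->} $html {(\1)} with_exposed_comments
    return $with_exposed_comments
}

set html {<!--well-known price--> $219.99 <!--ask sales-->}
puts [extract_comment_text_nongreedy $html]
# prints: (well-known price) $219.99 (ask sales)
```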

More: http://www.tcl.tk/man/tcl8.4/TclCmd/regsub.htm

String match

Tcl provides an alternative matching mechanism that is simpler for users to understand than regular expressions. The Tcl command string match uses "GLOB-style" matching. Here is the syntax:

string match pattern data

It returns 1 if there is a match and 0 otherwise. The only pattern elements permitted here are ?, which matches any single character; *, which matches any sequence; and [], which delimits a set of characters or a range. This differs from regexp in that the pattern must match the entire string supplied:

% regexp "foo" "foobar"
1

% string match "foo" "foobar"
0

% # here's what we need to do to make the string match 
% # work like the regexp
% string match "*foo*" foobar
1

Here's an example of the character range system in use:

string match {*[0-9]*} $text

returns 1 if text contains at least one digit and 0 otherwise.
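
A few more glob patterns, as a sketch with made-up sample strings:

```tcl
set text "Order 66"
set has_digit   [string match {*[0-9]*} $text]       ;# 1
set is_html     [string match {*.html} index.html]   ;# 1

# ? stands for exactly one character
set three_chars [string match {a?c} abc]             ;# 1
set too_long    [string match {a?c} abbc]            ;# 0

# -nocase makes the comparison case-insensitive
set any_case    [string match -nocase {*.HTML} index.html]   ;# 1

puts "$has_digit $is_html $three_chars $too_long $any_case"
# prints: 1 1 1 0 1
```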

More: http://www.tcl.tk/man/tcl8.4/TclCmd/string.htm

Exercises

1. Write a procedure which takes a string and makes sure that it contains an "@" sign.
2. Extend the procedure to make sure that only letters and numbers are allowed before the "@" sign.
3. Extend the procedure to check that a valid domain comes after the "@" sign (hint: look at 2). A valid domain contains at least one "." and only letters after the last ".", so malte.cognovis.de is a valid domain and cognovis.d1e is not.
4. Extend the procedure to return "Welcome foo, member of bar.com" if the string is "foo@bar.com".
5. Extend the procedure to return "Welcome OpenACS member foo" if the string is like "foo@openacs.org", meaning the e-mail ends with openacs.org.
6. Check against the valid domain again. This time make use of the ad_locales table installed in your local copy of OpenACS. To make this work you will have to use the OpenACS Shell.
   a. Get a list of all countries from the table ad_locales. Choose the language column for this. The command to extract this is "db_list".
   b. If your list contains the language "ca" more than once, make sure to limit it to one "ca" only. Make sure this works for others as well.
   c. As ".com", ".org" and ".net" are also valid domain endings, append them to the list.
   d. Make sure that the domain ends in one of the languages in the list you created. So automotive.ca works but automotive.eu does not (and yes, I know that .eu is now a valid domain :-)).

Answer

1. Search at amazon.com for your favorite book. Copy the URL until you see the "/ref..." part, e.g. http://www.amazon.com/4-Hour-Workweek-Escape-Live-Anywhere/dp/0307353133
2. In the OpenACS shell use "ad_httpget" to retrieve the URL you copied. Look at the api doc for the syntax.
3. Use regexp to find the price of the book in the html source returned to you by ad_httpget.
4. Return the price of the book.

Answer

---

based on Tcl for Web Nerds
