diff options
Diffstat (limited to 'newlib/libc/posix/regex.3')
-rw-r--r-- | newlib/libc/posix/regex.3 | 701 |
1 files changed, 701 insertions, 0 deletions
diff --git a/newlib/libc/posix/regex.3 b/newlib/libc/posix/regex.3 new file mode 100644 index 000000000..d87164177 --- /dev/null +++ b/newlib/libc/posix/regex.3 @@ -0,0 +1,701 @@ +.\" Copyright (c) 1992, 1993, 1994 Henry Spencer. +.\" Copyright (c) 1992, 1993, 1994 +.\" The Regents of the University of California. All rights reserved. +.\" +.\" This code is derived from software contributed to Berkeley by +.\" Henry Spencer. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" 3. All advertising materials mentioning features or use of this software +.\" must display the following acknowledgement: +.\" This product includes software developed by the University of +.\" California, Berkeley and its contributors. +.\" 4. Neither the name of the University nor the names of its contributors +.\" may be used to endorse or promote products derived from this software +.\" without specific prior written permission. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" @(#)regex.3 8.4 (Berkeley) 3/20/94 +.\" $FreeBSD: src/lib/libc/regex/regex.3,v 1.9 2001/10/01 16:08:58 ru Exp $ +.\" +.Dd March 20, 1994 +.Dt REGEX 3 +.Os +.Sh NAME +.Nm regcomp , +.Nm regexec , +.Nm regerror , +.Nm regfree +.Nd regular-expression library +.Sh LIBRARY +.Lb libc +.Sh SYNOPSIS +.In sys/types.h +.In regex.h +.Ft int +.Fn regcomp "regex_t *preg" "const char *pattern" "int cflags" +.Ft int +.Fo regexec +.Fa "const regex_t *preg" "const char *string" +.Fa "size_t nmatch" "regmatch_t pmatch[]" "int eflags" +.Fc +.Ft size_t +.Fo regerror +.Fa "int errcode" "const regex_t *preg" +.Fa "char *errbuf" "size_t errbuf_size" +.Fc +.Ft void +.Fn regfree "regex_t *preg" +.Sh DESCRIPTION +These routines implement +.St -p1003.2 +regular expressions +.Pq Do RE Dc Ns s ; +see +.Xr re_format 7 . +.Fn Regcomp +compiles an RE written as a string into an internal form, +.Fn regexec +matches that internal form against a string and reports results, +.Fn regerror +transforms error codes from either into human-readable messages, +and +.Fn regfree +frees any dynamically-allocated storage used by the internal form +of an RE. +.Pp +The header +.Aq Pa regex.h +declares two structure types, +.Ft regex_t +and +.Ft regmatch_t , +the former for compiled internal forms and the latter for match reporting. +It also declares the four functions, +a type +.Ft regoff_t , +and a number of constants with names starting with +.Dq Dv REG_ . +.Pp +.Fn Regcomp +compiles the regular expression contained in the +.Fa pattern +string, +subject to the flags in +.Fa cflags , +and places the results in the +.Ft regex_t +structure pointed to by +.Fa preg . +.Fa Cflags +is the bitwise OR of zero or more of the following flags: +.Bl -tag -width REG_EXTENDED +.It Dv REG_EXTENDED +Compile modern +.Pq Dq extended +REs, +rather than the obsolete +.Pq Dq basic +REs that +are the default. +.It Dv REG_BASIC +This is a synonym for 0, +provided as a counterpart to +.Dv REG_EXTENDED +to improve readability. +.It Dv REG_NOSPEC +Compile with recognition of all special characters turned off. +All characters are thus considered ordinary, +so the +.Dq RE +is a literal string. +This is an extension, +compatible with but not specified by +.St -p1003.2 , +and should be used with +caution in software intended to be portable to other systems. +.Dv REG_EXTENDED +and +.Dv REG_NOSPEC +may not be used +in the same call to +.Fn regcomp . +.It Dv REG_ICASE +Compile for matching that ignores upper/lower case distinctions. +See +.Xr re_format 7 . +.It Dv REG_NOSUB +Compile for matching that need only report success or failure, +not what was matched. +.It Dv REG_NEWLINE +Compile for newline-sensitive matching. +By default, newline is a completely ordinary character with no special +meaning in either REs or strings. +With this flag, +.Ql [^ +bracket expressions and +.Ql .\& +never match newline, +a +.Ql ^\& +anchor matches the null string after any newline in the string +in addition to its normal function, +and the +.Ql $\& +anchor matches the null string before any newline in the +string in addition to its normal function. +.It Dv REG_PEND +The regular expression ends, +not at the first NUL, +but just before the character pointed to by the +.Va re_endp +member of the structure pointed to by +.Fa preg . +The +.Va re_endp +member is of type +.Ft "const char *" . +This flag permits inclusion of NULs in the RE; +they are considered ordinary characters. +This is an extension, +compatible with but not specified by +.St -p1003.2 , +and should be used with +caution in software intended to be portable to other systems. +.El +.Pp +When successful, +.Fn regcomp +returns 0 and fills in the structure pointed to by +.Fa preg . +One member of that structure +(other than +.Va re_endp ) +is publicized: +.Va re_nsub , +of type +.Ft size_t , +contains the number of parenthesized subexpressions within the RE +(except that the value of this member is undefined if the +.Dv REG_NOSUB +flag was used). +If +.Fn regcomp +fails, it returns a non-zero error code; +see +.Sx DIAGNOSTICS . +.Pp +.Fn Regexec +matches the compiled RE pointed to by +.Fa preg +against the +.Fa string , +subject to the flags in +.Fa eflags , +and reports results using +.Fa nmatch , +.Fa pmatch , +and the returned value. +The RE must have been compiled by a previous invocation of +.Fn regcomp . +The compiled form is not altered during execution of +.Fn regexec , +so a single compiled RE can be used simultaneously by multiple threads. +.Pp +By default, +the NUL-terminated string pointed to by +.Fa string +is considered to be the text of an entire line, minus any terminating +newline. +The +.Fa eflags +argument is the bitwise OR of zero or more of the following flags: +.Bl -tag -width REG_STARTEND +.It Dv REG_NOTBOL +The first character of +the string +is not the beginning of a line, so the +.Ql ^\& +anchor should not match before it. +This does not affect the behavior of newlines under +.Dv REG_NEWLINE . +.It Dv REG_NOTEOL +The NUL terminating +the string +does not end a line, so the +.Ql $\& +anchor should not match before it. +This does not affect the behavior of newlines under +.Dv REG_NEWLINE . +.It Dv REG_STARTEND +The string is considered to start at +.Fa string ++ +.Fa pmatch Ns [0]. Ns Va rm_so +and to have a terminating NUL located at +.Fa string ++ +.Fa pmatch Ns [0]. Ns Va rm_eo +(there need not actually be a NUL at that location), +regardless of the value of +.Fa nmatch . +See below for the definition of +.Fa pmatch +and +.Fa nmatch . +This is an extension, +compatible with but not specified by +.St -p1003.2 , +and should be used with +caution in software intended to be portable to other systems. +Note that a non-zero +.Va rm_so +does not imply +.Dv REG_NOTBOL ; +.Dv REG_STARTEND +affects only the location of the string, +not how it is matched. +.El +.Pp +See +.Xr re_format 7 +for a discussion of what is matched in situations where an RE or a +portion thereof could match any of several substrings of +.Fa string . +.Pp +Normally, +.Fn regexec +returns 0 for success and the non-zero code +.Dv REG_NOMATCH +for failure. +Other non-zero error codes may be returned in exceptional situations; +see +.Sx DIAGNOSTICS . +.Pp +If +.Dv REG_NOSUB +was specified in the compilation of the RE, +or if +.Fa nmatch +is 0, +.Fn regexec +ignores the +.Fa pmatch +argument (but see below for the case where +.Dv REG_STARTEND +is specified). +Otherwise, +.Fa pmatch +points to an array of +.Fa nmatch +structures of type +.Ft regmatch_t . +Such a structure has at least the members +.Va rm_so +and +.Va rm_eo , +both of type +.Ft regoff_t +(a signed arithmetic type at least as large as an +.Ft off_t +and a +.Ft ssize_t ) , +containing respectively the offset of the first character of a substring +and the offset of the first character after the end of the substring. +Offsets are measured from the beginning of the +.Fa string +argument given to +.Fn regexec . +An empty substring is denoted by equal offsets, +both indicating the character following the empty substring. +.Pp +The 0th member of the +.Fa pmatch +array is filled in to indicate what substring of +.Fa string +was matched by the entire RE. +Remaining members report what substring was matched by parenthesized +subexpressions within the RE; +member +.Va i +reports subexpression +.Va i , +with subexpressions counted (starting at 1) by the order of their opening +parentheses in the RE, left to right. +Unused entries in the array (corresponding either to subexpressions that +did not participate in the match at all, or to subexpressions that do not +exist in the RE (that is, +.Va i +> +.Fa preg Ns -> Ns Va re_nsub ) ) +have both +.Va rm_so +and +.Va rm_eo +set to -1. +If a subexpression participated in the match several times, +the reported substring is the last one it matched. +(Note, as an example in particular, that when the RE +.Ql "(b*)+" +matches +.Ql bbb , +the parenthesized subexpression matches each of the three +.So Li b Sc Ns s +and then +an infinite number of empty strings following the last +.Ql b , +so the reported substring is one of the empties.) +.Pp +If +.Dv REG_STARTEND +is specified, +.Fa pmatch +must point to at least one +.Ft regmatch_t +(even if +.Fa nmatch +is 0 or +.Dv REG_NOSUB +was specified), +to hold the input offsets for +.Dv REG_STARTEND . +Use for output is still entirely controlled by +.Fa nmatch ; +if +.Fa nmatch +is 0 or +.Dv REG_NOSUB +was specified, +the value of +.Fa pmatch Ns [0] +will not be changed by a successful +.Fn regexec . +.Pp +.Fn Regerror +maps a non-zero +.Fa errcode +from either +.Fn regcomp +or +.Fn regexec +to a human-readable, printable message. +If +.Fa preg +is +.No non\- Ns Dv NULL , +the error code should have arisen from use of +the +.Ft regex_t +pointed to by +.Fa preg , +and if the error code came from +.Fn regcomp , +it should have been the result from the most recent +.Fn regcomp +using that +.Ft regex_t . +.No ( Fn Regerror +may be able to supply a more detailed message using information +from the +.Ft regex_t . ) +.Fn Regerror +places the NUL-terminated message into the buffer pointed to by +.Fa errbuf , +limiting the length (including the NUL) to at most +.Fa errbuf_size +bytes. +If the whole message won't fit, +as much of it as will fit before the terminating NUL is supplied. +In any case, +the returned value is the size of buffer needed to hold the whole +message (including terminating NUL). +If +.Fa errbuf_size +is 0, +.Fa errbuf +is ignored but the return value is still correct. +.Pp +If the +.Fa errcode +given to +.Fn regerror +is first ORed with +.Dv REG_ITOA , +the +.Dq message +that results is the printable name of the error code, +e.g.\& +.Dq Dv REG_NOMATCH , +rather than an explanation thereof. +If +.Fa errcode +is +.Dv REG_ATOI , +then +.Fa preg +shall be +.No non\- Ns Dv NULL +and the +.Va re_endp +member of the structure it points to +must point to the printable name of an error code; +in this case, the result in +.Fa errbuf +is the decimal digits of +the numeric value of the error code +(0 if the name is not recognized). +.Dv REG_ITOA +and +.Dv REG_ATOI +are intended primarily as debugging facilities; +they are extensions, +compatible with but not specified by +.St -p1003.2 , +and should be used with +caution in software intended to be portable to other systems. +Be warned also that they are considered experimental and changes are possible. +.Pp +.Fn Regfree +frees any dynamically-allocated storage associated with the compiled RE +pointed to by +.Fa preg . +The remaining +.Ft regex_t +is no longer a valid compiled RE +and the effect of supplying it to +.Fn regexec +or +.Fn regerror +is undefined. +.Pp +None of these functions references global variables except for tables +of constants; +all are safe for use from multiple threads if the arguments are safe. +.Sh IMPLEMENTATION CHOICES +There are a number of decisions that +.St -p1003.2 +leaves up to the implementor, +either by explicitly saying +.Dq undefined +or by virtue of them being +forbidden by the RE grammar. +This implementation treats them as follows. +.Pp +See +.Xr re_format 7 +for a discussion of the definition of case-independent matching. +.Pp +There is no particular limit on the length of REs, +except insofar as memory is limited. +Memory usage is approximately linear in RE size, and largely insensitive +to RE complexity, except for bounded repetitions. +See +.Sx BUGS +for one short RE using them +that will run almost any system out of memory. +.Pp +A backslashed character other than one specifically given a magic meaning +by +.St -p1003.2 +(such magic meanings occur only in obsolete +.Bq Dq basic +REs) +is taken as an ordinary character. +.Pp +Any unmatched +.Ql [\& +is a +.Dv REG_EBRACK +error. +.Pp +Equivalence classes cannot begin or end bracket-expression ranges. +The endpoint of one range cannot begin another. +.Pp +.Dv RE_DUP_MAX , +the limit on repetition counts in bounded repetitions, is 255. +.Pp +A repetition operator +.Ql ( ?\& , +.Ql *\& , +.Ql +\& , +or bounds) +cannot follow another +repetition operator. +A repetition operator cannot begin an expression or subexpression +or follow +.Ql ^\& +or +.Ql |\& . +.Pp +.Ql |\& +cannot appear first or last in a (sub)expression or after another +.Ql |\& , +i.e. an operand of +.Ql |\& +cannot be an empty subexpression. +An empty parenthesized subexpression, +.Ql "()" , +is legal and matches an +empty (sub)string. +An empty string is not a legal RE. +.Pp +A +.Ql {\& +followed by a digit is considered the beginning of bounds for a +bounded repetition, which must then follow the syntax for bounds. +A +.Ql {\& +.Em not +followed by a digit is considered an ordinary character. +.Pp +.Ql ^\& +and +.Ql $\& +beginning and ending subexpressions in obsolete +.Pq Dq basic +REs are anchors, not ordinary characters. +.Sh SEE ALSO +.Xr grep 1 , +.Xr re_format 7 +.Pp +.St -p1003.2 , +sections 2.8 (Regular Expression Notation) +and +B.5 (C Binding for Regular Expression Matching). +.Sh DIAGNOSTICS +Non-zero error codes from +.Fn regcomp +and +.Fn regexec +include the following: +.Pp +.Bl -tag -width REG_ECOLLATE -compact +.It Dv REG_NOMATCH +.Fn regexec +failed to match +.It Dv REG_BADPAT +invalid regular expression +.It Dv REG_ECOLLATE +invalid collating element +.It Dv REG_ECTYPE +invalid character class +.It Dv REG_EESCAPE +.Ql \e +applied to unescapable character +.It Dv REG_ESUBREG +invalid backreference number +.It Dv REG_EBRACK +brackets +.Ql "[ ]" +not balanced +.It Dv REG_EPAREN +parentheses +.Ql "( )" +not balanced +.It Dv REG_EBRACE +braces +.Ql "{ }" +not balanced +.It Dv REG_BADBR +invalid repetition count(s) in +.Ql "{ }" +.It Dv REG_ERANGE +invalid character range in +.Ql "[ ]" +.It Dv REG_ESPACE +ran out of memory +.It Dv REG_BADRPT +.Ql ?\& , +.Ql *\& , +or +.Ql +\& +operand invalid +.It Dv REG_EMPTY +empty (sub)expression +.It Dv REG_ASSERT +can't happen - you found a bug +.It Dv REG_INVARG +invalid argument, e.g. negative-length string +.El +.Sh HISTORY +Originally written by +.An Henry Spencer . +Altered for inclusion in the +.Bx 4.4 +distribution. +.Sh BUGS +This is an alpha release with known defects. +Please report problems. +.Pp +The back-reference code is subtle and doubts linger about its correctness +in complex cases. +.Pp +.Fn Regexec +performance is poor. +This will improve with later releases. +.Fa Nmatch +exceeding 0 is expensive; +.Fa nmatch +exceeding 1 is worse. +.Fn Regexec +is largely insensitive to RE complexity +.Em except +that back +references are massively expensive. +RE length does matter; in particular, there is a strong speed bonus +for keeping RE length under about 30 characters, +with most special characters counting roughly double. +.Pp +.Fn Regcomp +implements bounded repetitions by macro expansion, +which is costly in time and space if counts are large +or bounded repetitions are nested. +An RE like, say, +.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}" +will (eventually) run almost any existing machine out of swap space. +.Pp +There are suspected problems with response to obscure error conditions. +Notably, +certain kinds of internal overflow, +produced only by truly enormous REs or by multiply nested bounded repetitions, +are probably not handled well. +.Pp +Due to a mistake in +.St -p1003.2 , +things like +.Ql "a)b" +are legal REs because +.Ql )\& +is +a special character only in the presence of a previous unmatched +.Ql (\& . +This can't be fixed until the spec is fixed. +.Pp +The standard's definition of back references is vague. +For example, does +.Ql "a\e(\e(b\e)*\e2\e)*d" +match +.Ql "abbbd" ? +Until the standard is clarified, +behavior in such cases should not be relied on. +.Pp +The implementation of word-boundary matching is a bit of a kludge, +and bugs may lurk in combinations of word-boundary matching and anchoring. |