Regular Expression

Regular expressions are a common way to search, match, and split strings. This library provides them through the RegularExpression class. The implementation is based on libonig.

libonig

The regular expression syntax is almost compatible with Perl, so most documentation on Perl regular expressions also applies here.
For convenience, the standard syntax reference is provided below.

1. Syntax elements

  \       escape (enable or disable meta character meaning)
  |       alternation
  (...)   group
  [...]   character class  


2. Characters

  \t           horizontal tab (0x09)
  \v           vertical tab   (0x0B)
  \n           newline        (0x0A)
  \r           return         (0x0D)
  \b           back space     (0x08)
  \f           form feed      (0x0C)
  \a           bell           (0x07)
  \e           escape         (0x1B)
  \nnn         octal char            (encoded byte value)
  \xHH         hexadecimal char      (encoded byte value)
  \x{7HHHHHHH} wide hexadecimal char (character code point value)
  \cx          control char          (character code point value)
  \C-x         control char          (character code point value)
  \M-x         meta  (x|0x80)        (character code point value)
  \M-\C-x      meta control char     (character code point value)

 (* \b means backspace only inside a character class [...]; elsewhere it is a word boundary)
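
As a hedged illustration of the escapes listed above, the sketch below uses Python's built-in re module as a stand-in, because this page does not show the RegularExpression class API. The escapes used here (\t, \n, \xHH, \nnn, and \b inside a character class) are assumed to behave the same way in both engines. All variable names are illustrative only.

```python
import re

# \t and \n in a pattern stand for the tab and newline characters.
tab_newline = re.fullmatch(r"a\tb\n", "a\tb\n") is not None

# \xHH (hexadecimal) and \nnn (octal) both name a character by its value.
# 0x41 and octal 101 are 'A'; 0x42 and octal 102 are 'B'.
hex_form = re.fullmatch(r"\x41\x42", "AB") is not None
octal_form = re.fullmatch(r"\101\102", "AB") is not None

# Inside [...] \b is a backspace (0x08); outside it is a word boundary.
backspace_hits = re.findall(r"[\b]", "x\x08y")
boundary_hits = re.findall(r"\bcat\b", "cat concat cat.")
```

The last two lines show why the note on \b matters: the same two characters select a control character in one position and a zero-width anchor in the other.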


3. Character types

  .        any character (except newline)

  \w       word character

             General_Category -- (Letter|Mark|Number|Connector_Punctuation)

  \W       non word char

  \s       whitespace char

             0009, 000A, 000B, 000C, 000D, 0085(NEL), 
             General_Category -- Line_Separator
                              -- Paragraph_Separator
                              -- Space_Separator

  \S       non whitespace char

  \d       decimal digit char

           General_Category -- Decimal_Number

  \D       non decimal digit char

  \h       hexadecimal digit char   [0-9a-fA-F]

  \H       non hexadecimal digit char
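
A minimal sketch of the character types, again using Python's re module as a stand-in for the RegularExpression class (an assumption, since the class API is not shown here). Python's re has no \h, so the sketch writes the equivalent class [0-9a-fA-F] given in the table above. The sample strings are made up for illustration.

```python
import re

text = "id=42 hex=ff\tend\n"

# \w+ collects runs of word characters (letters, digits, underscore).
words = re.findall(r"\w+", text)

# \d+ collects runs of decimal digits.
digits = re.findall(r"\d+", text)

# \s matches each whitespace character: space, tab and newline here.
spaces = re.findall(r"\s", text)

# '.' matches any character except newline.
dots = re.findall(r".", "a\nb")

# \h is not available in Python's re; the table gives its expansion.
hex_runs = re.findall(r"[0-9a-fA-F]+", "0x1F zz")
```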


4. Quantifier

  greedy

    ?       1 or 0 times
    *       0 or more times
    +       1 or more times
    {n,m}   at least n but not more than m times
    {n,}    at least n times
    {,n}    at least 0 but not more than n times ({0,n})
    {n}     n times

  reluctant

    ??      1 or 0 times
    *?      0 or more times
    +?      1 or more times
    {n,m}?  at least n but not more than m times  
    {n,}?   at least n times
    {,n}?   at least 0 but not more than n times (== {0,n}?)

  possessive (greedy and does not backtrack after repeated)

    ?+      1 or 0 times
    *+      0 or more times
    ++      1 or more times

    Note: in libonig, {n,m}+, {n,}+, {n}+ are possessive operators in ONIG_SYNTAX_JAVA only.

    ex. /a*+/ === /(?>a*)/
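
The difference between greedy and reluctant quantifiers can be sketched as follows. Python's re module is used as a stand-in for the RegularExpression class (an assumption), and the input strings are invented for the example. Possessive quantifiers are left out because Python's re only gained them in version 3.11.

```python
import re

html = "<b>one</b><b>two</b>"

# Greedy: .* takes as much as it can, so one match spans both elements.
greedy = re.findall(r"<b>.*</b>", html)

# Reluctant: .*? takes as little as it can, so each element matches alone.
reluctant = re.findall(r"<b>.*?</b>", html)

# {n,m} is greedy and prefers m repetitions; {n,m}? prefers n.
greedy_range = re.findall(r"\d{2,3}", "1 22 333 4444")
reluctant_range = re.findall(r"\d{2,3}?", "4444")

# ? makes the preceding item optional (1 or 0 times).
optional = re.findall(r"colou?r", "color colour colouur")
```

Both forms accept the same set of strings. They differ only in which match is tried first, which matters when a pattern is followed by more text that could also match.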


5. Anchors

  ^       beginning of the line
  $       end of the line
  \b      word boundary
  \B      not word boundary
  \A      beginning of string
  \Z      end of string, or before newline at the end
  \z      end of string
  \G      matching start position (*)

 (* Ruby Regexp: \G is the previous end-of-match position.
    This Ruby specification does not apply to this library.)
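
A hedged sketch of the anchors, using Python's re module as a stand-in for the RegularExpression class (an assumption). In Python, ^ and $ match at line boundaries only when re.MULTILINE is set, so the sketch sets that flag to get the line-based meaning listed above. \Z, \z and \G are left out because Python's re treats them differently or does not provide them.

```python
import re

text = "first line\nsecond line\n"

# ^ and $ anchor at the beginning and end of each line.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
line_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)

# \A anchors at the beginning of the whole string only.
string_start = re.findall(r"\A\w+", text, flags=re.MULTILINE)

# \b requires a word boundary; \B requires that there is none.
whole_words = re.findall(r"\bline\b", "line inline line")
inside_word = re.findall(r"\Bline\b", "line inline line")
```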


6. Character class

  ^...    negative class (lowest precedence operator)
  x-y     range from x to y
  [...]   set (character class in character class)
  ..&&..  intersection (low precedence, next above ^)
          
    ex. [a-w&&[^c-g]z] ==> ([a-w] AND ([^c-g] OR z)) ==> [abh-w]

  To use '[', '-', or ']' as a literal character in a character class,
  escape it with '\'.
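
The intersection example above can be checked by hand with ordinary set operations. The sketch below does so in Python, which is used only as a stand-in (an assumption). Python's re module does not support && or nested classes, so the intersection itself is computed with sets and only the simplified class [abh-w] is given to the regex engine. The universe is limited to the letters a-z for this check.

```python
import re
import string

letters = string.ascii_lowercase  # universe limited to a-z for this check

a_to_w = {c for c in letters if "a" <= c <= "w"}            # [a-w]
not_c_to_g = {c for c in letters if not ("c" <= c <= "g")}  # [^c-g]
inner = not_c_to_g | {"z"}                                  # [^c-g]z
intersection = "".join(sorted(a_to_w & inner))              # [a-w&&[^c-g]z]

# Every surviving character is matched by the simplified class [abh-w].
same_as_simplified = re.fullmatch(r"[abh-w]+", intersection) is not None

# A leading ^ negates the class; x-y is a range.
non_digits = re.findall(r"[^0-9]+", "ab12cd")

# '[', '-' and ']' escaped with '\' are ordinary members of the class.
brackets = re.findall(r"[\[\-\]]", "a[b-c]")
```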


  POSIX bracket ([:xxxxx:], negate [:^xxxxx:])

    alnum    Letter | Mark | Decimal_Number
    alpha    Letter | Mark
    ascii    0000 - 007F
    blank    Space_Separator | 0009
    cntrl    Control | Format | Unassigned | Private_Use | Surrogate
    digit    Decimal_Number
    graph    [[:^space:]] && ^Control && ^Unassigned && ^Surrogate
    lower    Lowercase_Letter
    print    [[:graph:]] | [[:space:]]
    punct    Connector_Punctuation | Dash_Punctuation | Close_Punctuation |
             Final_Punctuation | Initial_Punctuation | Other_Punctuation |
             Open_Punctuation
    space    Space_Separator | Line_Separator | Paragraph_Separator |
             0009 | 000A | 000B | 000C | 000D | 0085
    upper    Uppercase_Letter
    xdigit   0030 - 0039 | 0041 - 0046 | 0061 - 0066
             (0-9, a-f, A-F)


7. Extended groups

  (?#...)            comment

  (?imx-imx)         option on/off
                         i: ignore case
                         m: multi-line (dot(.) matches newline)
                         x: extended form
  (?imx-imx:subexp)  option on/off for subexp

  (?:subexp)         not captured group
  (subexp)           captured group

  (?=subexp)         look-ahead
  (?!subexp)         negative look-ahead
  (?<=subexp)        look-behind
  (?<!subexp)        negative look-behind

                     The subexp of a look-behind must have a fixed character
                     length. Alternatives of different lengths are allowed
                     at the top level only.
                     ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.

                     In a negative look-behind, a captured group is not
                     allowed, but a shy group (?:) is allowed.

  (?>subexp)         atomic group
                     no backtracking into subexp once it has matched.

  (?<name>subexp)    define named group
                     (All characters of the name must be word characters,
                     and the first character must not be a digit or upper case.)

                     A named group is assigned a number as well as a name,
                     like a captured group.

                     Assigning the same name to two or more subexps is allowed.
                     In this case, a subexp call cannot be performed, although
                     a back reference is possible.
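
A few of the extended groups can be sketched as follows, with Python's re module standing in for the RegularExpression class (an assumption). Only constructs that Python spells the same way are used. The look-behinds are kept to a single fixed length because Python does not accept alternatives of different lengths, and the m option is avoided because it has a different meaning in Python. The sample strings are illustrative.

```python
import re

prices = "cost: $40, tax: $5, qty: 3"

# (?=subexp): look-ahead. Words that are directly followed by a colon.
labels = re.findall(r"\w+(?=:)", prices)

# (?<=subexp): look-behind. Digits directly preceded by a dollar sign.
amounts = re.findall(r"(?<=\$)\d+", prices)

# (?<!subexp): negative look-behind, one character long.
# Digits preceded by neither '$' nor another digit.
plain_numbers = re.findall(r"(?<![$\d])\d+", prices)

# (?:subexp) groups without capturing, so only (c) gets a number.
captured = re.search(r"(?:ab)+(c)", "ababc").groups()

# (?i:subexp) turns on case-insensitive matching for the subexp only.
greetings = re.findall(r"(?i:hello) world", "Hello world, HELLO World")

# (?#...) is a comment and matches nothing.
with_comment = re.fullmatch(r"\d+(?#digits)px", "12px") is not None
```

Note that the second greeting is rejected: the option covers only "hello", so " World" with a capital W is still compared case-sensitively.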


8. Back reference

  \n          back reference by group number (n >= 1)
  \k<name>    back reference by group name

  When a back reference uses a name that is defined more than once,
  the subexp with the largest number is referred to first.
  (If it has not matched, a group with a smaller number is referred to.)

  Back reference by group number is forbidden if a named group is defined
  in the pattern.
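
Back references can be sketched as follows, with Python's re module standing in for the RegularExpression class (an assumption). Python spells a named group (?P<name>...) and its back reference (?P=name), whereas the syntax documented here is (?<name>...) and \k<name>. The idea is the same, but the patterns below are not valid as written for this library.

```python
import re

# \1 refers back to whatever group 1 matched: find doubled words.
doubled = re.findall(r"\b(\w+) \1\b", "this is is a test test")

# Named form: the closing quote must be the same character as the
# opening quote captured by the group named q.
quoted = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi' now").group(0)

# Mismatched quotes do not satisfy the back reference.
mismatch = re.search(r"(?P<q>['\"])\w+(?P=q)", "say 'hi\" now")
```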

Resources on the Internet

There are many documents about regular expressions, and we recommend the following sites:

Please note that we accept all bug reports, including bugs in libonig. Please do not report bugs in our library to them.


This document is automatically generated using doxygen 1.5.4 at Fri Jun 27 18:22:54 2008.