Title: | An R Interface to the Onigmo Regular Expression Library |
---|---|
Description: | Provides an alternative to R's built-in functionality for handling regular expressions, based on the Onigmo library. Offers first-class compiled regex objects, partial matching and function-based substitutions, amongst other features. |
Authors: | Jon Clayden, based on Onigmo by K. Kosako and K. Takata |
Maintainer: | Jon Clayden <[email protected]> |
License: | BSD_3_clause + file LICENCE |
Version: | 1.7.4 |
Built: | 2025-01-30 06:07:55 UTC |
Source: | https://github.com/jonclayden/ore |
Evaluate R expressions and substitute their values into one or more strings.
es(text, round = NULL, signif = NULL, envir = parent.frame())
text |
A vector of strings to substitute into. |
round |
|
signif |
|
envir |
The environment to evaluate expressions in. |
Each part of the string surrounded by "#{}" is extracted, evaluated as R code in the specified environment, and then its value is substituted back into the string. The literal string "#{}" can be obtained by escaping the hash character, viz. "\\#{}". The block may contain multiple R expressions, separated by semicolons, but may not contain additional braces. Its value will be coerced to character mode, and if the result has multiple elements then the source string will be duplicated.
The final strings, with expression values substituted into them.
es("pi is #{pi}")
es("pi is \\#{pi}")
es("The square-root of pi is approximately #{sqrt(pi)}", signif=4)
es("1/(1+x) for x=3 is #{x <- 3; 1/(1+x)}")
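The duplication rule described above can be illustrated with a vector-valued expression. This is a minimal sketch, assuming the ore package is installed; the variable `n` is hypothetical and is looked up in the calling environment through the default `envir` argument.

```r
library(ore)

# Hypothetical vector: per the description above, the source string should
# be duplicated once for each element of the evaluated result
n <- 1:3
out <- es("Item #{n}")
print(out)

# Escaping the hash character leaves the block unevaluated
print(es("Literal: \\#{n}"))
```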
This dataset contains translations into many languages of the esoteric sentence "I can eat glass and it doesn't hurt me", UTF-8 encoded. Since this dataset uses characters from a range of scripts, it provides a useful test set for text handling and character encodings.
glass
A named character vector, whose elements are translations of the sentence, and are named for the appropriate language in each case.
The translations were gathered by Frank da Cruz and written by a large group of contributors. Notes, commentary and a full list of credits are online at https://kermitproject.org/utf8.html.
These functions extract entire matches, or just subgroup matches, from objects of class "orematch". They can also be applied to lists of these objects, as returned by ore_search when more than one string is searched. For other objects they return NA.
matches(object, ...)

## S3 method for class 'orematches'
matches(object, simplify = TRUE, ...)

## S3 method for class 'orematch'
matches(object, ...)

## Default S3 method:
matches(object, ...)

groups(object, ...)

## S3 method for class 'orematches'
groups(object, simplify = TRUE, ...)

## S3 method for class 'orematch'
groups(object, ...)

## S3 method for class 'orearg'
groups(object, ...)

## Default S3 method:
groups(object, ...)
object |
An R object. Methods are provided for generic lists and
|
... |
Further arguments to methods. |
simplify |
For the list methods, should nonmatching elements be removed from the result? |
A vector, matrix, array, or list of the same, containing full matches or subgroups. If simplify is TRUE, the result may have a dropped attribute, giving the indices of nonmatching elements.
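As a hedged illustration of the two extractors, the sketch below assumes the ore package is installed and reuses the pattern from the ore_search examples; the variable names are hypothetical.

```r
library(ore)

# Search for all pairs of consecutive word characters
m <- ore_search("(\\w)(\\w)", "This is a test", all = TRUE)

# Full matched substrings, one per match
full <- matches(m)
print(full)

# Captured subgroups for each match
sub <- groups(m)
print(sub)

# Objects of other classes fall through to the default method
other <- matches("not a match object")
```

Using matches and groups rather than indexing can read more clearly when the match object is passed between functions.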
Create, test for, and print objects of class "ore", which represent Oniguruma regular expressions. These are unit-length character vectors with additional attributes, including a pointer to the compiled version.
ore(..., options = "", encoding = getOption("ore.encoding"), syntax = c("ruby", "fixed"))

is_ore(x)

## S3 method for class 'ore'
print(x, ...)
... |
One or more strings or dictionary labels, constituting a valid
regular expression after being concatenated together. Elements drawn from
the dictionary will be surrounded by parentheses, turning them into
groups. Note that backslashes should be doubled, to avoid them being
interpreted as character escapes by R. The |
options |
A string composed of characters indicating variations on the
usual interpretation of the regex. These may currently include |
encoding |
A string specifying the encoding that matching will take
place in. The default is given by the |
syntax |
The regular expression syntax being used. The default is
|
x |
An R object. |
The ore function returns the final pattern, with class "ore" and the following attributes:
A low-level pointer to the compiled version of the regular expression.
Options, copied from the argument of the same name.
The specified or detected encoding.
The specified syntax type.
The number of groups in the pattern.
Group names, if applicable.
The is_ore function returns a logical vector indicating whether its argument represents an "ore" object.
For full details of supported syntax, please see https://raw.githubusercontent.com/k-takata/Onigmo/master/doc/RE. The regex page is also useful as a quick reference, since PCRE (used by base R) and Oniguruma (used by ore) have similar features. See ore_dict for details of the pattern dictionary.
# This matches a positive or negative integer
ore("-?\\d+")

# This matches words of exactly four characters
ore("\\b\\w{4}\\b")
This function allows the user to get or set entries in the pattern dictionary, a library of regular expressions whose elements can be referred to by name in ore, and therefore easily reused.
ore_dict(..., enclos = parent.frame())
... |
One or more strings or dictionary keys. Unnamed, literal strings will be returned unmodified, named strings will be added to the dictionary, and unquoted names will be resolved using the dictionary. |
enclos |
Enclosure for resolving names not present in the dictionary.
Passed to |
If no arguments are provided, the whole dictionary is returned. Otherwise the return value is a (possibly named) character vector of resolved strings.
ore, which passes its arguments through this function.
# Literal strings are returned as-is
ore_dict("protocol")

# Named arguments are added to the dictionary
ore_dict(protocol="\\w+://")

# ... and can be retrieved by name
ore_dict(protocol)
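Because ore passes its arguments through ore_dict, dictionary entries can be combined into a larger pattern by name. The following is a minimal sketch, assuming the ore package is installed; the entry names `scheme` and `host` and the example URL are hypothetical.

```r
library(ore)

# Register two hypothetical dictionary entries by name
ore_dict(scheme = "[a-z]+://", host = "[\\w.-]+")

# Unquoted names are resolved through the dictionary; each entry becomes
# a group in the concatenated pattern
url_regex <- ore(scheme, host)

m <- ore_search(url_regex, "See https://example.org for details")
print(matches(m))
print(groups(m))
```

Note that the dictionary is shared state, so entries added this way stay available to later calls in the same session.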
Escape characters that would usually be interpreted specially in a regular expression, returning a modified version of the argument. This can be useful when incorporating a general-purpose string into a larger regex.
ore_escape(text)
text |
A character vector. |
A modified version of the argument, with special characters escaped by prefixing them with a backslash.
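To show why escaping matters when embedding arbitrary text in a pattern, the sketch below compares an unescaped and an escaped string. It assumes the ore package is installed; the variable `literal` is hypothetical.

```r
library(ore)

# A hypothetical user-supplied string containing the metacharacter "."
literal <- "a.b"

# Unescaped, the dot matches any character, so both candidates match
unescaped <- ore_ismatch(literal, c("a.b", "axb"))

# Escaped, the dot is taken literally, so only the first candidate matches
escaped <- ore_ismatch(ore_escape(literal), c("a.b", "axb"))

print(unescaped)
print(escaped)
```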
Identify a file path to be used as a text source for a subsequent call to ore_search.
ore_file(path, encoding = getOption("ore.encoding"), binary = FALSE)
path |
A character string giving the file path. |
encoding |
A character string giving the encoding of the file. This
should match the encoding of the regular expression used in a call to
|
binary |
A logical value: if |
A string of class "orefile", with the encoding and binary arguments stored as attributes.
ore_search for actually searching through the file.
These functions test whether the elements of a character vector match an Oniguruma regular expression. The actual match can be retrieved using ore_lastmatch.
ore_ismatch(regex, text, keepNA = getOption("ore.keepNA", FALSE), ...)

X %~% Y

X %~~% Y

X %~|% Y
regex |
A single character string or object of class |
text |
A character vector of strings to search. |
keepNA |
If |
... |
Further arguments to |
X |
A character vector or |
Y |
A character vector. See Details. |
The %~% infix shorthand corresponds to ore_ismatch(..., all=FALSE), while %~~% corresponds to ore_ismatch(..., all=TRUE). Either way, the first argument can be an "ore" object, in which case the second is the text to search, or a character vector, in which case the second argument is assumed to contain the regex. The %~|% shorthand returns just those elements of the text vector which match the regular expression.
A logical vector, indicating whether elements of text match regex, or not.
# Test for the presence of a vowel
ore_ismatch("[aeiou]", c("sky","lake"))    # => c(FALSE,TRUE)

# The same thing, in shorter form
c("sky","lake") %~% "[aeiou]"

# Same again: the first argument must be an "ore" object this way around
ore("[aeiou]") %~% c("sky","lake")
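The filtering shorthand described in the details is not covered by the examples above, so here is a small sketch of it next to the logical test. It assumes the ore package is installed and that %~|% accepts the text on the left and the regex on the right, as %~% does; the vector `words` is hypothetical.

```r
library(ore)

words <- c("sky", "lake", "tree")

# Logical test: does each element contain a vowel?
has_vowel <- words %~% "[aeiou]"

# Filter: keep only the elements that match
with_vowel <- words %~|% "[aeiou]"

print(has_vowel)
print(with_vowel)
```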
This function can be used to obtain the "orematch" object, or list, corresponding to the last call to ore_search. This can be useful after performing a search implicitly, for example with %~%.
ore_lastmatch(simplify = TRUE)
simplify |
If |
An "orematch" object or list. See ore_search for details.
Search a character vector, or the content of a file or connection, for one or more matches to an Oniguruma-compatible regular expression. Printing and indexing methods are available for the results. ore_match is an alias for ore_search.
ore_search(regex, text, all = FALSE, start = 1L, simplify = TRUE, incremental = !all)

is_orematch(x)

## S3 method for class 'orematch'
x[j, k, ...]

## S3 method for class 'orematches'
x[i, j, k, ...]

## S3 method for class 'orematch'
print(x, lines = getOption("ore.lines", 0L), context = getOption("ore.context", 30L), width = getOption("width", 80L), ...)

## S3 method for class 'orematches'
print(x, lines = getOption("ore.lines", 0L), simplify = TRUE, ...)
regex |
A single character string or object of class |
text |
A vector of strings to match against, or a connection, or the
result of a call to |
all |
If |
start |
An optional vector of offsets (in characters) at which to start
searching. Will be recycled to the length of |
simplify |
If |
incremental |
If |
x |
An R object. |
j |
For indexing, the match number. |
k |
For indexing, the group number. |
... |
For |
i |
For indexing into an |
lines |
The maximum number of lines to print. The default is zero,
meaning no limit. For |
context |
The number of characters of context to include either side of each match. |
width |
The number of characters in each line of printed output. |
For ore_search, an "orematch" object, or a list of the same, each with elements
A copy of the text element for the current match, if it was a character vector; otherwise a single string with the content retrieved from the file or connection. If the source was a binary file (from ore_file(..., binary=TRUE)) then this element will be NULL.
The number of matches found.
The offsets (in characters) of each match.
The offsets (in bytes) of each match.
The lengths (in characters) of each match.
The lengths (in bytes) of each match.
The matched substrings.
Equivalent metadata for each parenthesised subgroup in regex, in a series of matrices. If named groups are present in the regex then dimnames will be set appropriately.
For is_orematch, a logical vector indicating whether the specified object has class "orematch". For extraction with one index, a vector of matched substrings. For extraction with two indices, a vector or matrix of substrings corresponding to captured groups.
Only named *or* unnamed groups will currently be captured, not both. If there are named groups in the pattern, then unnamed groups will be ignored.
By default the print method uses the crayon package (if it is available) to determine whether or not the R terminal supports colour. Alternatively, colour printing may be forced or disabled by setting the "ore.colour" (or "ore.color") option to a logical value.
ore for creating regex objects; matches and groups for an alternative to indexing for extracting matching substrings.
# Pick out pairs of consecutive word characters
match <- ore_search("(\\w)(\\w)", "This is a test", all=TRUE)

# Find the second matched substring ("is", from "This")
match[2]

# Find the content of the second group in the second match ("s")
match[2,2]
This function breaks up the strings provided at regions matching a regular expression, removing those regions from the result. It is analogous to the strsplit function in base R.
ore_split(regex, text, start = 1L, simplify = TRUE)
regex |
A single character string or object of class |
text |
A vector of strings to match against. |
start |
An optional vector of offsets (in characters) at which to start
searching. Will be recycled to the length of |
simplify |
If |
A character vector or list of substrings.
ore_split("-?\\d+", "I have 2 dogs, 3 cats and 4 hamsters")
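A delimiter-style split may be a more familiar use than the numeric pattern above. This is a minimal sketch, assuming the ore package is installed; the input string is hypothetical.

```r
library(ore)

# Split a single string on commas followed by optional whitespace;
# the matched delimiters are removed from the result
parts <- ore_split(",\\s*", "alpha, beta,gamma")
print(parts)
```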
These functions substitute new text into strings in regions that match a regular expression. The substitutions may be simple text, may include references to matched subgroups, or may be created by an R function.
ore_subst(regex, replacement, text, ..., all = FALSE, start = 1L)

ore_repl(regex, replacement, text, ..., all = FALSE, start = 1L, simplify = TRUE)
regex |
A single character string or object of class |
replacement |
A character vector, or a function to be applied to the matches. |
text |
A vector of strings to match against. |
... |
Further arguments to |
all |
If |
start |
An optional vector of offsets (in characters) at which to start
searching. Will be recycled to the length of |
simplify |
For |
These functions differ in how they are vectorised. ore_subst vectorises over matches, and returns a vector of the same length as the text argument. If multiple replacements are given then they are applied to matches in turn. ore_repl vectorises over replacements, replicating the elements of text as needed, and (in general) returns a list the same length as text, whose elements are character vectors each of the same length as replacement (or its return value, if a function). Each string combines the first replacement for each match, the second, and so on.
If replacement is a character vector, its component strings may include back-references to captured substrings. "\\0" corresponds to the whole matching substring, "\\1" is the first captured group, and so on. Named groups may be referenced as "\\k<name>".
If replacement is a function, then it will be passed as its first argument an object of class "orearg". This is a character vector containing as its elements the matched substrings, and with an attribute containing the matches for parenthesised subgroups, if there are any. A groups method is available for this class, so the groups attribute can be easily obtained that way. The substitution function will be called once per element of text by ore_subst, and once per match by ore_repl.
Versions of text with the substitutions made.
# Simple text substitution (produces "no dogs")
ore_subst("\\d+", "no", "2 dogs")

# Back-referenced substitution (produces "22 dogs")
ore_subst("(\\d+)", "\\1\\1", "2 dogs")

# Function-based substitution (produces "4 dogs")
ore_subst("\\d+", function(i) as.numeric(i)^2, "2 dogs")
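The examples above each replace a single match. As a hedged sketch of how ore_subst behaves with all=TRUE, the code below applies a back-reference, a named back-reference and a function to every match in one string. It assumes the ore package is installed; the input string and the group name `n` are hypothetical.

```r
library(ore)

text <- "2 dogs and 3 cats"

# Replace every match, wrapping the first captured group in angle brackets
bracketed <- ore_subst("(\\d+)", "<\\1>", text, all = TRUE)

# A named group referenced with the \k<name> syntax described above
named <- ore_subst("(?<n>\\d+)", "[\\k<n>]", text, all = TRUE)

# Function-based replacement: ore_subst calls the function once for this
# string, passing all matched substrings, and applies the results in turn
doubled <- ore_subst("\\d+", function(m) as.numeric(m) * 2, text, all = TRUE)

print(bracketed)
print(named)
print(doubled)
```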
This function maps one character vector to another, based on sequential matching to a series of regular expressions. The return value corresponding to each element in the source text is chosen based on the first matching regex: once matched, later options are ignored.
ore_switch(text, ..., options = "", encoding = getOption("ore.encoding"))
text |
A vector of strings to match against. |
... |
One or more string arguments specifying a possible return value.
These are generally named with a regex, and the string is only used for a
given |
options |
A string composed of characters indicating variations on the
usual interpretation of the regex. These may currently include |
encoding |
A string specifying the encoding that matching will take
place in. The default is given by the |
A character vector of the same length as text, containing the multiplexed strings. If none of the regexes matched, the corresponding element will be NA.
ore_subst for details of back-reference syntax.
# Extract digits where present; otherwise return zero
ore_switch(c("2 dogs","no dogs"), "\\d+"="\\0", "0")
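The example above ends with an unnamed default, so every element receives a value. The sketch below instead uses only named regexes, to illustrate the first-match rule and the NA result for elements that match nothing. It assumes the ore package is installed; the vector `animals` is hypothetical.

```r
library(ore)

animals <- c("2 dogs", "no dogs", "some cats")

# Regexes are tried in order; an element matching none of them becomes NA
counts <- ore_switch(animals, "\\d+" = "\\0", "^no\\b" = "0")
print(counts)
```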