Regular Expressions and Their Usage in R

Regular Expression

A regular expression is a special text string that describes a search pattern. Formally, the pattern describes a set of strings: every string that the expression matches.

We use four basic operations for creating regular expressions:

  • Concatenation
  • Logical OR
  • Repetition
  • Grouping

Concatenation

The most basic regular expression is formed by concatenating characters. Concatenating the two patterns "ab" and "cd" gives "abcd", which matches the string "abcd".

Logical OR

Logical OR, written |, lets us choose one of several alternatives. The regular expression "xy|ab" describes exactly two strings, "xy" and "ab". This makes it possible to search a collection of documents for several different strings at once.
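As a small sketch of alternation in R, the snippet below uses grepl() (covered later in this article) on a made-up vector. R's matching functions look for the pattern anywhere in a string, so "xy|ab" also matches longer strings that contain "xy" or "ab". The anchors ^ and $ and the vector name candidates are additions for illustration only.

```r
# Hypothetical input vector, chosen for illustration
candidates <- c("xy", "ab", "xyz", "cab", "cd")

# Unanchored: TRUE for any string that contains "xy" or "ab"
unanchored <- grepl("xy|ab", candidates)
unanchored

# Anchored and grouped: TRUE only when the whole string is "xy" or "ab"
anchored <- grepl("^(xy|ab)$", candidates)
anchored
```

Only the anchored version restricts the match to exactly the two strings "xy" and "ab".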

Repetition

Repetition lets us define a pattern that matches a varying number of occurrences. It is expressed with a family of regex operators, known as quantifiers, that repeat the preceding regular expression a specified number of times.

Grouping

A grouping sequence is a parenthesized expression that is treated as a unit. To specify the set of strings X, XYX, XYXYX and so forth, we write "(XY)*X"; the parentheses indicate that the characters "XY" must be repeated together.
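To see why the parentheses matter, here is a minimal sketch that checks a few made-up strings against the grouped pattern and against the same pattern without the group. The vector strings and the anchors ^ and $ (which force the whole string to match) are assumptions added for illustration.

```r
# Hypothetical test strings
strings <- c("X", "XYX", "XYXYX", "XY", "XXY")

# With grouping: zero or more repetitions of "XY", followed by "X"
grouped <- grepl("^(XY)*X$", strings)
grouped

# Without grouping, * applies only to "Y": "X", zero or more "Y", then "X"
ungrouped <- grepl("^XY*X$", strings)
ungrouped
```

The grouped pattern accepts X, XYX and XYXYX, while the ungrouped one accepts only XYX from this vector.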

To learn more about regular expressions in R, run:

help(regex)

This opens the help documentation on regular expressions.

Regular expressions in R are built from several kinds of elements:

  • Metacharacters
  • Quantifiers
  • Sequences
  • Character classes
  • POSIX character classes

Metacharacters

The simplest regular expressions are those that match a single character. The pattern "1" matches the digit 1, and the pattern "=" matches the equals sign. Some characters, however, have a reserved meaning; these are known as metacharacters. To match one of them literally in R, escape it with a double backslash, for example "\\$". In Extended Regular Expressions (ERE) the metacharacters are:

                                          .   \   |   (   )   [   {   ^   $   *   +   ?

sub()

sub() replaces the first occurrence of a pattern in a string with another string.

The syntax of the sub() function is:

sub(pattern, replacement, x)

The parameters of sub() are:

pattern: the pattern to search for, interpreted as a regular expression.

replacement: the character string that replaces the matched text.

x: the character vector in which to search for the pattern.

We create a string object:

money = "$money"

Because "$" is a metacharacter, we use "\\$" to match a literal "$" in money and replace it with "" (the empty string):

sub(pattern = "\\$", replacement = "", x = money)

We remove the "." from a string:

sub("\\.", "", "Peace.Love")

We replace "+" with the empty string:

sub("\\+", "", "Peace+Love")
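Escaping with a double backslash is one way to match a metacharacter literally. Base R also allows two alternatives: placing the character inside a bracket expression, or passing fixed = TRUE so that the pattern is treated as plain text. The following is a short sketch of all three; the result variable names are introduced here only for illustration.

```r
money <- "$money"

# Three ways to remove a literal "$"
escaped <- sub("\\$", "", money)              # escaped metacharacter
bracket <- sub("[$]", "", money)              # inside a character class
literal <- sub("$", "", money, fixed = TRUE)  # pattern taken literally

# Unescaped, "$" is the end-of-string anchor, so the dollar sign stays
unescaped <- sub("$", "", money)

c(escaped, bracket, literal, unescaped)
```

The first three calls return "money"; the last one returns "$money" unchanged, which is why the escape is needed.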

Sequences

Sequences are escape codes that match a whole category of characters.

The commonly used sequences are:

Sequence     Description
\\d          match a digit character
\\D          match a non-digit character
\\s          match a whitespace character
\\S          match a non-whitespace character
\\w          match a word character
\\W          match a non-word character
\\b          match a word boundary
\\B          match a non-(word boundary)
\\h          match a horizontal space
\\H          match a non-horizontal space
\\v          match a vertical space
\\V          match a non-vertical space

The last four sequences (\\h, \\H, \\v, \\V) come from Perl-compatible regular expressions; in R they are used with perl = TRUE.

sub() replaces only the first occurrence of the matching pattern. Here we replace the first digit with "_":

sub("\\d", "_", "My first name is San and birth year is 1982")

In the result, 1982 becomes _982.

gsub()

gsub() replaces all occurrences of a pattern. It has the same syntax as sub().
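Using the same sentence as in the sub() example, the sketch below contrasts the two functions on digits. The variable names sentence, first_only and all_digits are introduced here only for illustration.

```r
sentence <- "My first name is San and birth year is 1982"

# sub() stops after the first digit
first_only <- sub("\\d", "_", sentence)
first_only

# gsub() replaces every digit
all_digits <- gsub("\\d", "_", sentence)
all_digits
```

With sub() the year becomes _982, while gsub() turns it into ____.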

We replace the first non-digit character with "_":

sub("\\D", "_", "My first name is San and birth year is 1982")

We replace every non-digit character with "_":

gsub("\\D", "_", "My first name is San and birth year is 1982")

We replace the first whitespace character with "_":

sub("\\s", "_", "My first name is San and birth year is 1982")

We replace every whitespace character with "_":

gsub("\\s", "_", "My first name is San and birth year is 1982")

We replace the first non-whitespace character with "_":

sub("\\S", "_", "My first name is San and birth year is 1982")

We replace every non-whitespace character with "_":

gsub("\\S", "_", "My first name is San and birth year is 1982")

The sequence "\\b" matches a word boundary. A boundary is an empty position rather than a character, so the replacement inserts "_" instead of replacing a character:

sub("\\b", "_", "My first name is San and birth year is 1982 and birth year is 1982")

gsub("\\b", "_", "My first name is San and birth year is 1982")

We use "\\w" to replace word characters with "_":

sub("\\w", "_", "My first name is San and birth year is 1982")

gsub("\\w", "_", "My first name is San and birth year is 1982")

We use "\\W" to replace non-word characters with "_":

sub("\\W", "_", "First World War in 1915")

gsub("\\W", "_", "First world war in 1915")
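To make the effect of these sequences easier to see, here is a sketch on a shorter, made-up string. The variable s and the result names are hypothetical.

```r
s <- "War in 1915"

word_masked    <- gsub("\\w", "_", s)  # letters and digits become "_"
nonword_masked <- gsub("\\W", "_", s)  # spaces (non-word characters) become "_"
no_whitespace  <- gsub("\\s", "", s)   # whitespace removed
digits_only    <- gsub("\\D", "", s)   # everything except digits removed

c(word_masked, nonword_masked, no_whitespace, digits_only)
```

The four results are "___ __ ____", "War_in_1915", "Warin1915" and "1915".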

Character classes

A character class, or character set, is a list of characters enclosed in square brackets []. It matches any single character from the list.

Some character classes are:

Class            Description
[aeiou]          match any one lowercase vowel
[AEIOU]          match any one uppercase vowel
[0123456789]     match any digit
[0-9]            match any digit (same as the previous class)
[a-z]            match any lowercase ASCII letter
[A-Z]            match any uppercase ASCII letter
[a-zA-Z0-9]      match any ASCII letter or digit
[^aeiou]         match anything other than a lowercase vowel
[^0-9]           match anything other than a digit
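A brief sketch of how these classes behave with gsub(), using a made-up string. The variable s and the result names are assumptions for illustration.

```r
# Hypothetical example string
s <- "Peace and Love 2024"

no_vowels    <- gsub("[aeiou]", "", s)    # drop lowercase vowels
letters_only <- gsub("[^a-zA-Z]", "", s)  # drop everything that is not an ASCII letter
masked       <- gsub("[A-Z0-9]", "#", s)  # uppercase letters and digits become "#"

c(no_vowels, letters_only, masked)
```

The leading ^ inside the brackets negates the class, which is why the second call keeps only the letters.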

grep()

grep() finds the elements of a vector that match a pattern.

The grep() function takes the regex as its first argument and the input vector as its second argument. If you pass value = FALSE, or do not set value at all, grep() returns a vector with the indices of the elements of the input vector that the regular expression matches, at least partially. If you pass value = TRUE, grep() returns a vector with the matching elements themselves.

We create a character vector:

transport = c("car", "bike", "plane", "boat")

We find the strings that contain "e", "i" or both. With value = TRUE, the matching strings are returned:

grep(pattern = "[ei]", transport, value = TRUE)

We create a second vector:

numerics = c("123", "17-April", "I-II-III", "R 3.0.1")

We match the strings that contain "0", "1" or both. The result is the positions of the matching strings in numerics:

grep(pattern = "[01]", numerics)

This returns the strings that contain any digit from "0" to "9":

grep(pattern = "[0-9]", numerics, value = TRUE)

This returns the positions of the elements that contain at least one character other than a digit:

grep(pattern = "[^0-9]", numerics, value = FALSE)
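The relationship between the two return modes of grep() can be sketched as follows: the indices returned by default can be used to subset the input, which gives the same result as value = TRUE. The anchored pattern in the last call is an addition for illustration and shows how to select elements that consist of digits only.

```r
numerics <- c("123", "17-April", "I-II-III", "R 3.0.1")

idx <- grep("[0-9]", numerics)   # positions of the matching elements
idx

by_index <- numerics[idx]                           # subset by position
by_value <- grep("[0-9]", numerics, value = TRUE)   # ask for the values directly

# Anchors plus a quantifier: the whole string must be digits
digits_only <- grep("^[0-9]+$", numerics, value = TRUE)
digits_only
```

Here idx is 1 2 4, and only "123" consists entirely of digits.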

POSIX character classes

Class            Description
[[:lower:]]      Lowercase letters
[[:upper:]]      Uppercase letters
[[:alpha:]]      Alphabetic characters ([[:lower:]] and [[:upper:]])
[[:digit:]]      Digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
[[:alnum:]]      Alphanumeric characters ([[:alpha:]] and [[:digit:]])
[[:blank:]]      Blank characters: space and tab
[[:cntrl:]]      Control characters
[[:punct:]]      Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[[:space:]]      Space characters: tab, newline, vertical tab, form feed, carriage return, and space
[[:xdigit:]]     Hexadecimal digits: 0-9 A B C D E F a b c d e f
[[:print:]]      Printable characters ([[:alnum:]], [[:punct:]] and space)
[[:graph:]]      Graphical characters ([[:alnum:]] and [[:punct:]])

We create an object la_vie:

la_vie = "La vie en #FFC0CB (rose);\nCest la vie! \ttres jolie"

cat() prints the string, rendering "\n" as a line break and "\t" as a tab:

cat(la_vie)

We remove blank characters (spaces and tabs):

gsub(pattern = "[[:blank:]]", replacement = "", la_vie)

We remove punctuation characters:

gsub(pattern = "[[:punct:]]", replacement = "", la_vie)

We remove hexadecimal digits. Note that this also removes the letters a to f, in either case, wherever they occur, not only in the color code:

gsub(pattern = "[[:xdigit:]]", replacement = "", la_vie)

We remove printable characters, which leaves only the newline and the tab:

gsub(pattern = "[[:print:]]", replacement = "", la_vie)

We remove non-printable characters:

gsub(pattern = "[^[:print:]]", replacement = "", la_vie)
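POSIX classes can also be combined inside one bracket expression or followed by a quantifier. The sketch below is an illustration on the same string and is not part of the original examples; the result names are hypothetical.

```r
la_vie <- "La vie en #FFC0CB (rose);\nCest la vie! \ttres jolie"

# Two classes in one bracket expression: drop punctuation and all whitespace
compact <- gsub("[[:punct:][:space:]]", "", la_vie)
compact

# A class with a quantifier: collapse each run of whitespace into one space
single_spaced <- gsub("[[:space:]]+", " ", la_vie)
single_spaced
```

The second call is a common way to normalize text that mixes spaces, tabs and line breaks.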

Quantifiers

Quantifiers are used when we want to match a certain number of characters that meet given criteria. They specify how many instances of a character, group or character class must be present in the input for a match to be found.

The quantifiers are:

Quantifier     Description
?              The preceding item is optional and will be matched at most once
*              The preceding item will be matched zero or more times
+              The preceding item will be matched one or more times
{n}            The preceding item is matched exactly n times
{n,}           The preceding item is matched n or more times
{n,m}          The preceding item is matched at least n times, but not more than m times

We create a character vector of names:

people = c("rori", "emilia", "mmatteo", "mehmet", "filipe", "anna", "tyler",
           "rasmus", "jacob", "youna", "flora", "adi")

The pattern matches a single "m". Because grep() searches for the pattern anywhere in the string, this returns every name that contains at least one "m":

grep(pattern = "m{1}", people, value = TRUE)

The pattern matches an optional "m" followed by "t", so it returns names that contain "mt" as well as names that contain just "t":

grep(pattern = "m?t", people, value = TRUE)

The pattern matches zero or more "m" followed by "t", so again every name that contains "t" is returned:

grep(pattern = "m*t", people, value = TRUE)

The pattern matches "m" one or more times:

grep(pattern = "m+", people, value = TRUE)

The pattern matches one or more "m", then any single character, then "t":

grep(pattern = "m+.t", people, value = TRUE)

The pattern matches "t" exactly twice in a row:

grep(pattern = "t{2}", people, value = TRUE)
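Because grep() accepts a match anywhere in the string, quantifiers are often combined with other constructs to count precisely. The sketch below uses the same names; the alternation and the anchors ^ and $ are additions for illustration, and the result names are hypothetical.

```r
people <- c("rori", "emilia", "mmatteo", "mehmet", "filipe", "anna", "tyler",
            "rasmus", "jacob", "youna", "flora", "adi")

# Two consecutive "m"
double_m <- grep("m{2}", people, value = TRUE)
double_m

# Two or more "n" in a row, or two or more "t" in a row
doubled <- grep("n{2,}|t{2,}", people, value = TRUE)
doubled

# Anchored {n,m}: names made of three or four lowercase letters in total
short_names <- grep("^[a-z]{3,4}$", people, value = TRUE)
short_names
```

With the anchors in place, {3,4} constrains the length of the whole name rather than of a substring.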

We create a character vector:

text = c("one word", "a sentence", "you and me ",
         "three two one")

regexpr()

regexpr() returns an integer vector with the same length as the input vector. Each element of the result gives the character position at which the first regex match was found in the corresponding input string.

It tells us:

  • which elements of the character vector contain the regex pattern
  • the position of the substring matched by the regular expression pattern

regexpr("one", text)

The number 1 indicates that the pattern "one" starts at position 1 of the first element of text. The number -1 indicates that no match was found. The number 11 indicates that the pattern "one" starts at position 11 of the fourth element.

The attribute "match.length" gives the length of the match in each element; again, -1 means that there was no match in that element. The attribute "useBytes" reports whether matching was done byte-by-byte (TRUE) rather than character-by-character (FALSE).
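The pieces of this result can be pulled apart programmatically. The following is a minimal sketch using base R only; regmatches() is not covered in the text above and is shown as an optional extra, and the variable names m, starts, lengths_found and matched are hypothetical.

```r
text <- c("one word", "a sentence", "you and me ", "three two one")

m <- regexpr("one", text)

# Start position of the first match in each element (-1 means no match)
starts <- as.integer(m)
starts

# Length of each match, stored as an attribute of the result
lengths_found <- as.integer(attr(m, "match.length"))
lengths_found

# regmatches() extracts the matched substrings, skipping elements without a match
matched <- regmatches(text, m)
matched
```

Here starts is 1 -1 -1 11 and lengths_found is 3 -1 -1 3, matching the description above.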

gregexpr()

gregexpr() does the same job as regexpr(), but it finds every match in each string rather than only the first. Its output is a list with one component per input element.

gregexpr("one", text)
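The difference from regexpr() only shows up when a string contains the pattern more than once. The sketch below appends a made-up fifth element, "one by one", for that purpose; text2 and the other variable names are assumptions for illustration.

```r
text <- c("one word", "a sentence", "you and me ", "three two one")

# Hypothetical extra element with two occurrences of "one"
text2 <- c(text, "one by one")

g <- gregexpr("one", text2)

# Start positions of all matches in the fifth element
fifth_starts <- as.integer(g[[5]])
fifth_starts

# Number of matches per element (a start position of -1 means "no match")
match_counts <- as.integer(sapply(g, function(pos) sum(pos > 0)))
match_counts
```

For "one by one" the start positions are 1 and 8, and the counts per element are 1 0 0 1 2.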

We create a character vector:

str <- c("Regular", "expression", "examples of R language")

We find the elements that contain the pattern "ex":

x <- grep("ex", str, value = TRUE)

grepl()

grepl() performs a similar task to grep(), but returns the result of the pattern matching as a logical vector (TRUE/FALSE).

Here it shows whether each string contains the pattern "ex": TRUE means the pattern "ex" is present in the string, and FALSE means it is not.

x <- grepl("ex", str)
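A logical result is convenient for subsetting and counting. The following is a short sketch of typical uses; the variable has_ex and the other result names are introduced here only for illustration.

```r
str <- c("Regular", "expression", "examples of R language")

has_ex <- grepl("ex", str)
has_ex

with_ex    <- str[has_ex]    # elements that contain "ex"
without_ex <- str[!has_ex]   # elements that do not
n_matches  <- sum(has_ex)    # TRUE counts as 1, so this is the number of matches

with_ex
without_ex
n_matches
```

Negating the logical vector with ! selects the non-matching elements, which grep() can only do through its invert argument.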
