Regular Expression And its Usage in R

Regular Expression

A regular expression is a special text string that describes a search pattern. The pattern, in turn, describes a set of strings: every string that the pattern matches belongs to that set.

We use four basic operations for creating regular expressions:

  • Concatenation
  • Logical OR
  • Repetition
  • Grouping

Concatenation

The most basic regular expression is formed by concatenating characters. For example, concatenating the strings “ab” and “cd” gives the pattern “abcd”, which matches those four characters in that order.

Logical OR

The OR operator, written |, lets a pattern choose between several alternatives. The regular expression “xy|ab” matches exactly two strings: “xy” and “ab”. This makes it possible to search a collection of documents for any one of several strings at once.

Repetition

Repetition lets a pattern match a variable number of occurrences. It is expressed with a family of regex operators known as quantifiers, which repeat the preceding regular expression a specified number of times.

Grouping

A grouping sequence is a parenthesized expression that is treated as a unit. To specify the set of strings X, XYX, XYXYX and so forth, we write “(XY)*X”, which indicates that “XY” must be repeated as a whole.
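The four operations can be tried out together in a short sketch. The sample strings below are illustrative and not part of the original article, and the anchors ^ and $ (start and end of string) are added here so that each pattern has to match the whole string rather than just a part of it.

```r
# Sample strings to test against (illustrative values)
candidates <- c("abcd", "xy", "ab", "X", "XYX", "XYXYX", "XYY")

# Concatenation: the characters a, b, c, d in that order
concat_hits <- grepl("abcd", candidates)

# Logical OR: only the whole strings "xy" or "ab" match
or_hits <- grepl("^(xy|ab)$", candidates)

# Grouping plus repetition: zero or more "XY" followed by a final "X"
group_hits <- grepl("^(XY)*X$", candidates)

print(candidates[concat_hits])
print(candidates[or_hits])
print(candidates[group_hits])
```

Without the anchors, “xy|ab” would also report “abcd” as a match, because “ab” occurs inside it.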

To read R's documentation on regular expressions, run:

help(regex)

This opens the help page for regular expressions.

Regular expressions in R are built from the following elements:

  • Metacharacters
  • Quantifiers
  • Sequences
  • Character classes
  • POSIX character classes

Metacharacters

The simplest regular expressions match a single character. The pattern “1” matches the digit 1, and the pattern “=” matches the equals sign. Some characters, however, have a reserved meaning; these are known as metacharacters. In Extended Regular Expressions (ERE) the metacharacters are:

                                          .   \   |   (   )   [   {   ^   $   *   +   ?

To match a metacharacter literally in R, escape it with a double backslash, for example “\\$” or “\\.”.

sub()

sub() replaces the first occurrence of a pattern in a string with another string.

The syntax of the sub() function is:

sub(pattern, replacement, x)

The parameters of sub() are:

pattern – The pattern to search for, interpreted as a regular expression.

replacement – The character string that replaces the matched pattern.

x – The character vector to search.

We create a string object:

money = "$money"

We use “\\$” to find the literal “$” in money and replace it with “” (the empty string).

sub(pattern = "\\$", replacement = "", x = money)   # "money"

We remove the “.” from a string.

sub("\\.", "", "Peace.Love")   # "PeaceLove"

We replace “+” with the empty string.

sub("\\+", "", "Peace+Love")   # "PeaceLove"
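To see why the escape matters, the following sketch compares an unescaped dot, an escaped dot, and the fixed = TRUE argument of sub(), which turns off regex interpretation altogether. The variable names are chosen here for illustration only.

```r
x <- "Peace.Love"

# Unescaped: "." is a metacharacter that matches any character,
# so the first character of the string is removed instead of the dot
unescaped <- sub(".", "", x)

# Escaped with a double backslash: only the literal dot matches
escaped <- sub("\\.", "", x)

# fixed = TRUE treats the pattern as plain text, so no escaping is needed
as_fixed <- sub(".", "", x, fixed = TRUE)

print(c(unescaped, escaped, as_fixed))
```

The first call returns "eace.Love", while the other two return "PeaceLove".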

Sequences

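Sequences are shorthand escapes that each match a whole category of characters. They are written below as they appear inside an R string, with the backslash doubled.

The commonly used sequences are:

Sequence     Description
\\d          match a digit character
\\D          match a non-digit character
\\s          match a whitespace character
\\S          match a non-whitespace character
\\w          match a word character
\\W          match a non-word character
\\b          match a word boundary
\\B          match a non-(word boundary)
\\h          match a horizontal space
\\H          match a non-horizontal space
\\v          match a vertical space
\\V          match a non-vertical space

The last four (\\h, \\H, \\v and \\V) come from Perl-compatible regular expressions, so it is safest to pass perl = TRUE when using them.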

sub() replaces only the first occurrence of a matching pattern. Here we replace the first digit with “_”.

sub("\\d", "_", "My first name is San and birth year is 1982")

So in the given string, 1982 becomes _982.

gsub()

gsub() replaces all occurrences of a pattern. It has the same syntax as sub().

Here we replace every digit with “_”.

gsub("\\d", "_", "My first name is San and birth year is 1982")
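Both functions are vectorized over x, so the difference between them is easy to see on a small vector. The sample values below are hypothetical and only serve as a sketch.

```r
# Hypothetical sample data
phones <- c("555-123-4567", "no digits here", "ext. 42")

# sub(): only the first digit of each element is replaced
first_only <- sub("\\d", "#", phones)

# gsub(): every digit of each element is replaced
all_digits <- gsub("\\d", "#", phones)

print(first_only)
print(all_digits)
```

Elements without a match, such as "no digits here", are returned unchanged by both functions.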

Here we replace the first non-digit character with “_”.

sub("\\D", "_", "My first name is San and birth year is 1982")

Here we replace every non-digit character with “_”, so only 1982 is left intact.

gsub("\\D", "_", "My first name is San and birth year is 1982")

Here we replace the first whitespace character with “_”.

sub("\\s", "_", "My first name is San and birth year is 1982")

Here we replace every whitespace character with “_”.

gsub("\\s", "_", "My first name is San and birth year is 1982")

Here we replace the first non-whitespace character with “_”.

sub("\\S", "_", "My first name is San and birth year is 1982")

Here we replace every non-whitespace character with “_”, so only the spaces are left intact.

gsub("\\S", "_", "My first name is San and birth year is 1982")

“\\b” matches a word boundary, the zero-width position between a word character and a non-word character. Here the first boundary is marked with “_”.

sub("\\b", "_", "My first name is San and birth year is 1982 and birth year is 1982")

gsub() marks every boundary. Because a word boundary has zero width, the exact output of this call may depend on the regex engine in use (for example, on whether perl = TRUE is set).

gsub("\\b", "_", "My first name is San and birth year is 1982")

We use “\\w” to replace word characters (letters, digits and the underscore) with “_”: only the first one with sub(), then all of them with gsub().

sub("\\w", "_", "My first name is San and birth year is 1982")

gsub("\\w", "_", "My first name is San and birth year is 1982")

We use “\\W” to replace non-word characters (here, the spaces) with “_”.

sub("\\W", "_", "First World War in 1915")

gsub("\\W", "_", "First world war in 1915")
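Sequences become more useful when several are combined in one pattern. The sketch below uses “\\b” on both sides of a word to match it only when it stands alone, and four “\\d” in a row to mask a four-digit number. The sample strings and variable names are assumptions made for illustration; grepl(), described later in this article, returns TRUE or FALSE for each element.

```r
# Hypothetical sample data
notes <- c("First World War in 1915", "a warm day", "postwar years 1946")

# "war" must have a word boundary on both sides, so "warm" and "postwar" do not count
whole_word <- grepl("\\bwar\\b", notes, ignore.case = TRUE)

# Four digit sequences in a row match a four-digit number
masked <- gsub("\\d\\d\\d\\d", "YYYY", notes)

print(whole_word)
print(masked)
```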

Character classes

A character class, or character set, is a list of characters enclosed in square brackets []. It matches any one of the listed characters.

Some character classes are:

Class               Description
[aeiou]             match any one lower case vowel
[AEIOU]             match any one upper case vowel
[0123456789]        match any digit
[0-9]               match any digit (same as the previous class)
[a-z]               match any lower case ASCII letter
[A-Z]               match any upper case ASCII letter
[a-zA-Z0-9]         match any ASCII letter or digit
[^aeiou]            match anything other than a lower case vowel
[^0-9]              match anything other than a digit
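A few of these classes can be tried with gsub(), as in the following sketch. The input values and variable names are chosen here for illustration.

```r
# Illustrative input values
codes <- c("R 3.0.1", "17-April", "I-II-III")

# [aeiou]: remove every lower case vowel
no_vowels <- gsub("[aeiou]", "", "regular expression")

# [^0-9]: remove everything that is not a digit
digits_only <- gsub("[^0-9]", "", codes)

# [a-zA-Z]: remove every ASCII letter
no_letters <- gsub("[a-zA-Z]", "", codes)

print(no_vowels)
print(digits_only)
print(no_letters)
```

The caret only negates a class when it is the first character inside the brackets.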

grep()

grep() finds the elements of a vector that match a pattern.

grep() takes the regular expression as its first argument and the input vector as its second. With value = FALSE (the default), grep() returns the indices of the elements of the input vector that contain a match for the regular expression. With value = TRUE, grep() returns the matching elements themselves. In both cases a partial match is enough: the pattern only has to match somewhere within the element.

We create a character vector:

transport = c("car", "bike", "plane", "boat")

We find the strings that contain “e”, “i” or both. With value = TRUE, grep() returns the matching strings themselves.

grep(pattern = "[ei]", transport, value = TRUE)   # "bike" "plane"

We create another character vector:

numerics = c("123", "17-April", "I-II-III", "R 3.0.1")

We match the strings that contain “0”, “1” or both. Without value = TRUE, grep() returns the positions of the matching strings in numerics.

grep(pattern = "[01]", numerics)   # 1 2 4

This returns the strings that contain at least one digit from “0” to “9”.

grep(pattern = "[0-9]", numerics, value = TRUE)

This returns the positions of the elements that contain at least one character other than a digit.

grep(pattern = "[^0-9]", numerics, value = FALSE)   # 2 3 4
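The two return modes of grep() are closely related: the indices returned with value = FALSE can be used to subset the input vector, which gives the same result as value = TRUE. The sketch below also uses the invert argument of grep() to select the elements that do not match; the variable names are chosen for illustration.

```r
numerics <- c("123", "17-April", "I-II-III", "R 3.0.1")

# Indices of the elements that contain a digit
idx <- grep("[0-9]", numerics)

# Subsetting by index gives the same result as value = TRUE
by_index <- numerics[idx]
by_value <- grep("[0-9]", numerics, value = TRUE)

# invert = TRUE returns the elements that do NOT match
no_digit <- grep("[0-9]", numerics, value = TRUE, invert = TRUE)

print(idx)
print(by_index)
print(no_digit)
```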

POSIX character classes

Class            Description
[[:lower:]]      Lower-case letters
[[:upper:]]      Upper-case letters
[[:alpha:]]      Alphabetic characters ([[:lower:]] and [[:upper:]])
[[:digit:]]      Digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
[[:alnum:]]      Alphanumeric characters ([[:alpha:]] and [[:digit:]])
[[:blank:]]      Blank characters: space and tab
[[:cntrl:]]      Control characters
[[:punct:]]      Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[[:space:]]      Space characters: tab, newline, vertical tab, form feed, carriage return, and space
[[:xdigit:]]     Hexadecimal digits: 0-9 A B C D E F a b c d e f
[[:print:]]      Printable characters ([[:alnum:]], [[:punct:]] and space)
[[:graph:]]      Graphical characters ([[:alnum:]] and [[:punct:]])

We create an object la_vie:

la_vie = "La vie en #FFC0CB (rose);\nCest la vie! \ttres jolie"

cat() prints the string, rendering “\n” as a line break and “\t” as a tab.

cat(la_vie)

We replace blank characters (spaces and tabs) with the empty string.

gsub(pattern = "[[:blank:]]", replacement = "", la_vie)

We replace punctuation characters with the empty string.

gsub(pattern = "[[:punct:]]", replacement = "", la_vie)

We remove hexadecimal digits. Note that this also removes the letters a to f wherever they occur, for example the “a” in “La”.

gsub(pattern = "[[:xdigit:]]", replacement = "", la_vie)

We remove printable characters, which leaves only the newline and the tab.

gsub(pattern = "[[:print:]]", replacement = "", la_vie)

We remove non-printable characters, here the newline and the tab.

gsub(pattern = "[^[:print:]]", replacement = "", la_vie)
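POSIX classes can also be combined or negated inside a single pair of outer brackets. The following sketch applies this to la_vie; it borrows the + quantifier (one or more repetitions) from the Quantifiers section, and the variable names are chosen for illustration.

```r
la_vie <- "La vie en #FFC0CB (rose);\nCest la vie! \ttres jolie"

# Collapse every run of whitespace (space, newline, tab) into a single space
collapsed <- gsub("[[:space:]]+", " ", la_vie)

# Two classes in one negated set: keep only letters and whitespace
letters_only <- gsub("[^[:alpha:][:space:]]", "", la_vie)

# Keep only the upper-case letters
upper_only <- gsub("[^[:upper:]]", "", la_vie)

print(collapsed)
print(upper_only)
```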

Quantifiers

Quantifiers are used when we want to match a certain number of characters that meet given criteria. A quantifier specifies how many instances of a character, group or character class must be present in the input for a match to be found.

The quantifiers are:

Quantifier       Description
?                The preceding item is optional and will be matched at most once
*                The preceding item will be matched zero or more times
+                The preceding item will be matched one or more times
{n}              The preceding item is matched exactly n times
{n,}             The preceding item is matched n or more times
{n,m}            The preceding item is matched at least n times, but not more than m times

We create a character vector of names.

people = c("rori", "emilia", "mmatteo", "mehmet", "filipe", "anna", "tyler",
           "rasmus", "jacob", "youna", "flora", "adi")

This pattern matches a single ‘m’. Because the pattern is not anchored, grep() returns every name that contains at least one ‘m’.

grep(pattern = "m{1}", people, value = TRUE)

This pattern matches a ‘t’ optionally preceded by an ‘m’, so it returns every name that contains a ‘t’.

grep(pattern = "m?t", people, value = TRUE)

This pattern matches zero or more ‘m’ followed by a ‘t’, so it also returns every name that contains a ‘t’.

grep(pattern = "m*t", people, value = TRUE)

This pattern matches ‘m’ one or more times.

grep(pattern = "m+", people, value = TRUE)

This pattern matches one or more ‘m’, followed by any single character, followed by ‘t’.

grep(pattern = "m+.t", people, value = TRUE)

This pattern matches two consecutive ‘t’.

grep(pattern = "t{2}", people, value = TRUE)
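The {n,} and {n,m} forms from the table are not covered by the examples above, so here is a minimal sketch of them on the same people vector. It also shows one way to require exactly one ‘m’ in the whole name, which needs the anchors ^ and $ and a negated character class; the patterns and variable names are chosen for illustration.

```r
people <- c("rori", "emilia", "mmatteo", "mehmet", "filipe", "anna", "tyler",
            "rasmus", "jacob", "youna", "flora", "adi")

# {n,}: two or more consecutive 'm'
double_m <- grep("m{2,}", people, value = TRUE)

# {n,m} with anchors: names made of three or four letters in total
short_names <- grep("^[[:alpha:]]{3,4}$", people, value = TRUE)

# {n,} on a character class: two or more consecutive vowels
vowel_runs <- grep("[aeiou]{2,}", people, value = TRUE)

# Exactly one 'm' in the whole name: non-m characters, one m, non-m characters
exactly_one_m <- grep("^[^m]*m[^m]*$", people, value = TRUE)

print(double_m)
print(short_names)
print(vowel_runs)
print(exactly_one_m)
```

Compare the last result with the unanchored “m{1}”, which also returns names such as “mmatteo” that contain more than one ‘m’.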

We create a character vector:

text = c("one word", "a sentence", "you and me ",
         "three two one")

regexpr()

regexpr() returns an integer vector of the same length as the input vector. Each element gives the character position at which the first regex match was found in the corresponding input string.

It tells us two things:

  • which elements of the character vector contain the regex pattern
  • the position of the substring matched by the regular expression in each of those elements

regexpr("one", text)

The result is 1 -1 -1 11. The number 1 indicates that the pattern “one” starts at position 1 of the first element. The number -1 indicates that no match was found in the second and third elements. The number 11 indicates that “one” starts at position 11 of the fourth element.

The attribute “match.length” gives the length of the match in each element; -1 again means there was no match in that element. The attribute “useBytes” indicates whether matching was done byte-by-byte (TRUE) rather than character-by-character (FALSE).
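The positions and lengths returned by regexpr() are usually a means to an end, namely pulling out the matched text. A minimal sketch, using the base R function regmatches() and variable names chosen here for illustration:

```r
text <- c("one word", "a sentence", "you and me ", "three two one")

m <- regexpr("one", text)

# Start positions and match lengths as plain integer vectors
start <- as.integer(m)
len <- attr(m, "match.length")

# regmatches() extracts the matched substrings, skipping elements without a match
found <- regmatches(text, m)

# The same extraction done by hand from the positions and lengths
by_hand <- substring(text, start, start + len - 1)[start > 0]

print(start)
print(len)
print(found)
```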

gregexpr()

gregexpr() works like regexpr(), but it finds every match in each element rather than only the first. Its output is a list with one component per input element, each in the same format as the result of regexpr().

gregexpr("one", text)
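The difference from regexpr() only shows when an element contains the pattern more than once. The sketch below uses a hypothetical vector, phrase, that is not part of the original article, and counts and extracts all matches per element.

```r
# Hypothetical sample data: the first element contains "one" twice
phrase <- c("one by one", "none", "two")

all_m <- gregexpr("one", phrase)

# Start positions of every match, one list component per element
starts <- lapply(all_m, as.integer)

# Number of matches per element (-1 marks "no match" and is not counted)
counts <- sapply(all_m, function(m) sum(m > 0))

# All matched substrings, again as a list
pieces <- regmatches(phrase, all_m)

print(starts)
print(counts)
```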

We create a character vector:

str <- c("Regular", "expression", "examples of R language")

We find the elements that contain the pattern “ex”.

x <- grep("ex", str, value = TRUE)

grepl()

grepl() performs a similar task to grep(), but returns a logical vector (TRUE/FALSE) with one value per element.

It shows whether each string contains the pattern “ex”:

TRUE means the pattern “ex” is present in the string.

FALSE means the pattern “ex” is not present in the string.

x <- grepl("ex", str)
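Because grepl() returns one logical value per element, its result can be used directly for subsetting and counting. A short sketch follows; the vector is renamed str_vec here only to avoid masking the base function str(), and the other names are likewise illustrative.

```r
str_vec <- c("Regular", "expression", "examples of R language")

# One TRUE/FALSE per element
has_ex <- grepl("ex", str_vec)

# Logical subsetting keeps the matching elements
matching <- str_vec[has_ex]

# TRUE counts as 1, so sum() gives the number of matching elements
n_matching <- sum(has_ex)

# ignore.case = TRUE makes the match case-insensitive
has_regular <- grepl("regular", str_vec, ignore.case = TRUE)

print(has_ex)
print(matching)
print(n_matching)
```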
