Base R from A to Z: Operators (3)

R
Arithmetic, Logical and Relational operators
Published

March 4, 2023

Introduction

In the third part of the series about base R, we will talk about operators. As we have mentioned in the previous part, operators are constructs similar to functions, but with special syntax or semantics. While you can call them as functions, and there are instances where you want to do that, you normally call them in a special way called infix notation.

Infix, prefix and postfix

Most of us are used to the classical infix notation, such as 3 + 5. However, in some older programming languages, there is also a prefix notation + 3 5 and postfix notation 3 5 +. These can be efficiently parsed using a stack, and do not require operator precedence.

Technically, R contains traces of prefix notation in the form "+"(3,5) (or in fact, any function call), but unlike in typical prefix or postfix languages, the parentheses are required. Additionally, ! is unary, and - and + have unary forms, so they can be considered prefix operators, such as -8 or !a.

For more information, see infix, prefix and postfix on Wikipedia.
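As a small sketch of the function-call form mentioned above (the variable names are only illustrative), the following shows that an operator can be called like any other function and passed around as one:

```r
# The infix form and the function-call form are the same call.
infix  <- 3 + 5
prefix <- `+`(3, 5)

# Because operators are functions, they can be passed to other functions.
total    <- Reduce(`+`, 1:4)        # ((1 + 2) + 3) + 4
negated  <- sapply(1:3, `-`)        # unary minus applied to each element
products <- mapply(`*`, 1:3, 4:6)   # element-wise products

# The parser builds the same call object from both spellings.
same_call <- identical(quote(3 + 5), quote(`+`(3, 5)))
```

Backticks are needed around the operator name whenever it is used as an ordinary identifier.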

R offers a multitude of operators and a way to define new operators through the %any% notation. Operators can be divided into these groups:

  • Arithmetic operators
    +, -, *, /, ^, %%, %/%
  • Logical operators
    !, &, &&, |, ||
  • Relational operators
    <, >, <=, >=, ==, !=
  • Subsetting operators
    [, [[, @, $
  • Assignment operators
    <-, <<-, =, [<-, [[<-, @<-, $<-
  • Matrix operators
    %*%, %o%, %x%
  • Special operator
    %anything%
  • Matching operator
    %in%
  • Sequence operator
    :
  • Operators in formula: ~, :, %in%
  • Namespace access: ::, :::
  • Parentheses and braces
    (, {

We will talk about Arithmetic, Logical, Relational, Subsetting, Assignment and Matrix operators, as well as the special operator %anything%. The matching operator %in% is just a shorthand for the match() function, so we will leave it for later. The sequence operator : is a primitive for performance (and parsing) reasons, but we will talk about it together with other sequence-generating functions. The :: and ::: are quite complex, so we will talk about them when we go deeper into environments, namespaces and package access, and the ( and { are language features rather than operators or functions, so we leave them for later. Finally, the ? operator is not part of base, but lives in utils.

Operator precedence

From basic mathematics, we are used to the idea that multiplication precedes addition, so 3 + 5 * 2 is interpreted as 3 + (5 * 2) without having to use parentheses. These rules are called operator precedence. Here I am reproducing the list from the ?Syntax help command.

For the listed operators, precedence goes from highest (evaluated first) to lowest. Operators with the same precedence are evaluated from left to right, with the exception of ^ and the assignment operators <-, <<- and =, which group from right to left.

  • ::, :::
  • $, @
  • [, [[
  • ^
  • -, + in their unary form (-3, +5)
  • :
  • %any%, |> (base R pipe)
  • *, /
  • +, - in their binary form (3+5)
  • <, >, <=, >=, ==, !=
  • !
  • &, &&
  • |, ||
  • ~
  • ->, ->>
  • <-, <<-
  • =
  • ?

Now for some interesting consequences. With ^ having such high precedence, -a^2 is interpreted as -(a^2) instead of (-a)^2, but then -1:3 is interpreted as (-1):3 and not -(1:3).

The pipe |> has high precedence, which means you can’t do a + 2 |> ..., because it is interpreted as a + (2 |> ...). This bites me all the time, when I just want to do some simple division or addition inside pipes.

Finally, = has lower precedence than <-. This shouldn’t be an issue, since you should consistently use either = or <- as your assignment operator.

The takeaway is that you should not rely on common sense or your knowledge of operator precedence rules, but use parentheses to make operations as explicit as possible.
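The cases discussed in this section can be checked directly in a session. This is only a sketch with illustrative variable names, and the pipe examples assume R 4.1 or newer:

```r
a <- 7

neg_pow   <- -2^2               # parsed as -(2^2)
pow_neg   <- (-2)^2             # parentheses force the other reading
neg_seq   <- -1:3               # parsed as (-1):3
seq_neg   <- -(1:3)             # negate the whole sequence
right_pow <- 2^3^2              # ^ groups right to left: 2^(3^2)
mod_first <- 2 * 3 %% 2         # %% binds tighter than *: 2 * (3 %% 2)
piped     <- a + 2 |> sqrt()    # a + sqrt(2)
grouped   <- (a + 2) |> sqrt()  # sqrt(a + 2)
```

In each pair, the parenthesised variant is the one that leaves no room for misreading.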

Arithmetic operators

Arithmetic operators are addition +, subtraction -, multiplication *, division /, exponentiation ^, modulo %% and integer division %/%.

Addition, subtraction, multiplication, division, and exponentiation are well known. Modulo and integer division are less common, but occasionally useful in programming. Modulo %% is the remainder after integer division, so that 5 %% 2 is 1. The %/% is integer division, so 5 %/% 2 is 2. These are used when we want to know how many times something fits into our number, when we need to do something with a certain frequency, or when we want to know whether a number is even. But arguably, some of these uses are displaced by sequence functions, and if we want to fit k elements into categories of size m, ceiling(k/m) gives us exactly the number of categories. So personally, I haven’t used modulo or integer division very often.
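A short sketch of the uses mentioned above. The helper is_even() and the variable names are hypothetical and only serve the illustration:

```r
x <- 17
quotient  <- x %/% 5   # how many times 5 fits into 17
remainder <- x %% 5    # what is left over

# The two operators are tied together by this identity.
identity_holds <- x == quotient * 5 + remainder

# Typical uses: parity checks and splitting a total into units.
is_even <- function(n) n %% 2 == 0   # hypothetical helper
secs    <- 3725
minutes <- secs %/% 60
rest    <- secs %% 60

# With a negative operand, %/% rounds down and %% takes the sign of the divisor.
neg_quotient  <- -5 %/% 2
neg_remainder <- -5 %% 2

# Number of groups of size 3 needed to hold 7 elements.
n_groups <- ceiling(7 / 3)
```

The behaviour with negative operands differs between programming languages, so it is worth checking before porting code that relies on it.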

The + and - operators can be both unary and binary, meaning they accept one or two parameters. The meaning and behaviour, such as operator precedence, are different for the unary and binary version.

+ a # unary `+`
- a # unary `-`

a + b # binary `+`
a - b # binary `-`

This is related to how numbers are parsed by the interpreter (see literals). Details are not important, just remember that the unary and binary operators are different, and that the unary - makes numbers negative. Fun fact: as a consequence of the parsing rules, this is valid R code:

+ - + - 5 + - + - + + - - + - 1
[1] 4

As with most fun facts, you should never write code like this.

Recycling

A lot of basic operations in R do something called recycling. For instance, when you add together two vectors with a different number of elements, the shorter vector is being extended by reusing (recycling) its elements.

For instance, consider multiplying a vector of length 3 with a vector of length 1: c(1, 2, 3) * 3. This is identical to c(1,2,3) * c(3, 3, 3) because the shorter vector is recycled. This works well when the length of the longer vector is a multiple of the length of the shorter vector (i.e., length(longer) %% length(shorter) == 0), for instance:

c(1, 2, 3, 4, 5, 6) + c(1, 2)
[1] 2 4 4 6 6 8

You can see that the shorter vector was recycled as c(1,2, 1,2, 1,2), because the length of the longer vector is a multiple of the shorter one. If this is not the case, the shorter vector is still recycled, but a warning is thrown.

c(1,2,3) * c(1,2)
Warning in c(1, 2, 3) * c(1, 2): longer object length is not a multiple of
shorter object length
[1] 1 4 3

Recycling allows you to do some fancy tricks, for example if you want to pick every second element of a vector:

c(1,2,3,4,5,6)[c(FALSE, TRUE)]
[1] 2 4 6

This works because the selection vector is being recycled to the full length using the predefined pattern.
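To make the implicit step visible, here is a small sketch that spells recycling out with rep_len(). The variable names are illustrative only:

```r
long  <- c(1, 2, 3, 4, 5, 6)
short <- c(1, 2)

# rep_len() spells out what recycling does implicitly.
recycled <- rep_len(short, length(long))
same_sum <- identical(long + short, long + recycled)

# A logical index is recycled too: keep every third element.
every_third <- (1:9)[c(TRUE, FALSE, FALSE)]

# A zero-length vector is not recycled; the result has length zero.
empty <- 1:3 + numeric(0)
```

The last case is easy to overlook: an empty operand silently produces an empty result instead of a warning.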

The strange case of **

In some languages such as Python, ** is a power operator. This is also supported in R:

2 ^ 3
[1] 8
2 ** 3
[1] 8

Strangely, ** is barely documented, ** is not a primitive, and when you type a bare ** in R, you will get a curious error:

**
Error: <text>:1:1: unexpected '^'
1: **
    ^

The only mention is a small note in ?Arithmetic. Apparently, ** existed in S (the predecessor of R) but was deprecated. For backward compatibility, the R parser kept this functionality and translates ** into ^. Since it is mentioned only in that note and is not part of the official language specification, do not use **.
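The translation can be observed by looking at the parsed expression rather than at its value. A minimal sketch, with illustrative variable names:

```r
# quote() returns the parsed, unevaluated call.
star_call  <- quote(2 ** 3)
caret_call <- quote(2 ^ 3)

# Both spellings produce the very same call object.
same_call <- identical(star_call, caret_call)

# The function being called is `^`; no `**` survives parsing.
called_function <- as.character(star_call[[1]])
```

This also explains the error message above: by the time R reports the problem, the parser already sees ^ instead of **.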

Logical operators and functions

Since we are already talking about the logical operators, I thought it might be more useful to look more closely at the logical type and functions that operate with it.

Logical operators are negation !, which is a unary operator, and the binary logical and (& and &&) and logical or (| and ||). Aside from these operators, there is also a function xor(), helpers isTRUE(), isFALSE(), useful primitives all() and any(), and functions for the creation, testing, and conversion to the logical type: logical(), is.logical(), and as.logical().

Logical operators

Logical operators are negation !, logical and &, &&, and logical or |, ||.

The negation ! operator just makes a FALSE value TRUE and the other way around. The and and or operators are more interesting in the way they work on vectors.

The single-symbol and and or operators & and | work like + or -: they are applied element-wise and vectors are recycled as required.

c(TRUE, TRUE, FALSE) & c(TRUE, FALSE, FALSE)
[1]  TRUE FALSE FALSE
c(TRUE, TRUE, FALSE) | c(TRUE, FALSE, FALSE)
[1]  TRUE  TRUE FALSE

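The double-symbol and and or are not vectorized: they work with a single TRUE or FALSE value on each side and return a single TRUE or FALSE value. This makes && and || useful for control flow with if(condition){...}. In older versions of R, only the first element of a longer vector was used and the remaining elements were silently ignored. R 4.2.0 turned this into a warning, and since R 4.3.0, passing a vector longer than one is an error.

c(TRUE, TRUE) && c(TRUE, FALSE) # R < 4.3.0: only the first value is used; an error since R 4.3.0
[1] TRUE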

Short-circuiting

In addition, && and || short-circuit: the second expression is evaluated only if it is required. For instance, in the expression:

a || b

If a evaluates to TRUE, the whole expression evaluates to TRUE regardless of b. This means that if a is TRUE, b doesn’t have to be evaluated and in fact isn’t evaluated. So if b was a function call with some side effects, the side effects (reading a file, printing, increasing a counter) never happen. For instance, consider an expression that prints foo and returns TRUE:

{print("foo"); TRUE}

If we use two of these expressions with ||, only a single foo will be printed:

{print("foo"); TRUE} || {print("foo"); TRUE}
[1] "foo"
[1] TRUE

Contrast this with & and |, which do not short-circuit:

{print("foo"); TRUE} | {print("foo"); TRUE}
[1] "foo"
[1] "foo"
[1] TRUE

In some languages, this is used as a control structure, since this behaviour can be transformed into:

if a then a, else b

But I haven’t seen this being used in R. Since R code is typically light on side effects, I don’t think there is much value in trying to be cheeky in this way.
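Where short-circuiting does earn its keep is in guard conditions, where a later test would fail or be meaningless unless an earlier one passed. The following is a sketch of that pattern; is_positive_number() is a hypothetical helper, not a base R function:

```r
# Hypothetical helper: is x a single, non-missing, positive number?
# Each test runs only if all the tests to its left were TRUE.
is_positive_number <- function(x) {
  is.numeric(x) && length(x) == 1L && !is.na(x) && x > 0
}

res_number  <- is_positive_number(3)
res_null    <- is_positive_number(NULL)
res_string  <- is_positive_number("a")
res_missing <- is_positive_number(NA_real_)
res_vector  <- is_positive_number(c(1, 2))

# The right-hand side is not evaluated at all when the left side decides.
untouched <- FALSE && stop("never evaluated")
```

The order of the tests matters: the length check comes before x > 0, so the final comparison only ever sees a single value.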

NA behaviour

NA, or not available, is a peculiar value (or values, since there are NA_character_, NA_integer_, etc.) that symbolizes a missing value. Most programming languages do not have this (usually, they only have NULL), but R was designed as a domain-specific language for data analysis, where unknown values are a common problem.

Typically, any numerical calculation involving NA results in NA.

3 + NA
[1] NA
5 * NA
[1] NA

Logical operations are slightly different and result in NA only if the outcome depends on the unknown value. You can interpret this dependence in a similar manner as short-circuiting, but it works even for the non-short-circuiting operators & and |.

For example, in x | NA, if x is TRUE, the whole expression is also TRUE. But if x is FALSE, the whole expression depends on the unknown value NA, which means that the expression evaluates to NA.

TRUE | NA
[1] TRUE
FALSE | NA
[1] NA

Similarly, in x & NA, if x is FALSE, the whole expression is FALSE. But if x is TRUE, the result of the expression depends on the NA and the expression evaluates to NA.

FALSE & NA
[1] FALSE
TRUE & NA
[1] NA

In the same manner, the !NA depends on the NA so it evaluates to NA and so on.
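The full set of combinations can be tabulated with outer(). This is just a sketch; the table and label names are illustrative:

```r
vals   <- c(TRUE, FALSE, NA)
labels <- c("TRUE", "FALSE", "NA")

# Rows are the left operand, columns the right operand.
and_table <- outer(vals, vals, "&")
or_table  <- outer(vals, vals, "|")
dimnames(and_table) <- list(labels, labels)
dimnames(or_table)  <- list(labels, labels)

print(and_table)
print(or_table)

# Negation keeps the unknown value unknown.
not_na <- !NA
```

Reading the tables row by row shows the rule from the text: NA appears only in cells where the known operand does not already decide the result.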

Logical functions

Now, let’s move to the functions. The isTRUE() and isFALSE() are useful shorthands for control flow. They detect whether the condition is exactly TRUE or FALSE respectively, which is a good way to avoid pitfalls with NA or NULL, which might arise from some operations and should be processed appropriately.

The logical operation xor() is an elementwise exclusive or. It is not an operator, and not a primitive, likely because it is not used very often. You can see that xor() is just a shorthand for (x | y) & !(x & y):
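A brief sketch of the pitfalls that isTRUE() and isFALSE() guard against. The variable names are illustrative:

```r
# Only a single, non-missing TRUE counts as TRUE.
from_true   <- isTRUE(TRUE)
from_na     <- isTRUE(NA)
from_null   <- isTRUE(NULL)
from_vector <- isTRUE(c(TRUE, TRUE))

# isFALSE() is equally strict: NA is neither TRUE nor FALSE.
na_is_false <- isFALSE(NA)

# A bare if() fails on NA, while the wrapped version falls through to else.
flag    <- NA
bare    <- tryCatch(if (flag) "yes" else "no", error = function(e) "error")
wrapped <- if (isTRUE(flag)) "yes" else "no"
```

Whether treating NA as "not TRUE" is the right call depends on the situation; sometimes an explicit error for missing input is the better behaviour.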

xor
function (x, y) 
{
    (x | y) & !(x & y)
}
<bytecode: 0x55a87771d0f8>
<environment: namespace:base>

This is very useful if only one of x or y is allowed, such as when you are using parameters as mutually exclusive flags. Consider a function that does either foo or bar, but can’t do both at the same time:

myfunc = function(x, doFoo=FALSE, doBar=FALSE){
    if(!xor(doFoo, doBar))
        stop("Must select either doFoo or doBar")

    if(doFoo)
        return(foo(x))
    if(doBar)
        return(bar(x))
    }

Without xor, the function might not behave as expected, not returning anything if either both conditions are FALSE, or performing only one operation if both conditions are TRUE.

You can turn xor() easily into an infix operator %xor%.

`%xor%` = xor
TRUE %xor% FALSE
[1] TRUE

Before we start talking about the logical type in general, let’s quickly mention all() and any(). These take a logical vector and tell you whether all or any of its values are TRUE, respectively. For efficiency reasons, they are primitives, because they are used quite often for control flow.

vec = c(TRUE, TRUE, FALSE)
all(vec) # will be FALSE because not all values are TRUE
[1] FALSE
any(vec) # TRUE as at least one value is TRUE
[1] TRUE

You can define a vectorized variant of xor which tells you that there is exactly one TRUE rather easily:

one = function(x){
    sum(x, na.rm=TRUE) == 1
    }   
one(c(TRUE, TRUE, FALSE)) # will be FALSE
[1] FALSE
one(c(FALSE, TRUE, FALSE)) # will be TRUE
[1] TRUE

Conversion rules

This brings us to an important step, conversion. You can apply logical operators not only on the logical TRUE and FALSE values, but as with many similar operations in R, the values are automatically converted to the correct type.

Conversion is performed internally using the as.logical primitive. Valid conversions are from numeric (that is, both integer and double), complex, and character. For the numeric and complex, 0 is converted into FALSE and non-zero value is converted into TRUE:

as.logical(0)
[1] FALSE
as.logical(5i+3)
[1] TRUE

For the character type, the conversion is a bit more complex. The strings “T”, “TRUE”, “True” and “true” are converted into TRUE, and similarly “F”, “FALSE”, “False” and “false” are converted into FALSE. Everything else, including “TrUe” and similar messy capitalizations, is converted into NA.

as.logical(c("T", "TRUE", "True", "true"))
[1] TRUE TRUE TRUE TRUE
as.logical(c("F", "FALSE", "False", "false"))
[1] FALSE FALSE FALSE FALSE
as.logical(c("1", "0", "truE", "foo"))
[1] NA NA NA NA

The logical(n) is a shorthand for vector("logical", n) and simply creates a logical vector of length n with values initialized to FALSE.

logical(5)
[1] FALSE FALSE FALSE FALSE FALSE

When converting from logical to numeric, TRUE and FALSE convert to 1 and 0 respectively. This allows for the convenient use of sum or mean to count the number or the proportion of matches. When converting from logical to character, TRUE and FALSE are simply converted to the "TRUE" and "FALSE" strings.
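A minimal sketch of the counting idiom, using made-up data:

```r
x <- c(3, 8, 1, 9, 4)

# The comparison yields a logical vector; sum() and mean() coerce it to 0/1.
n_above     <- sum(x > 3)    # how many values exceed 3
share_above <- mean(x > 3)   # proportion of values that exceed 3

# The same coercion happens in plain arithmetic.
two <- TRUE + TRUE

# Logical to character gives the upper-case spelling.
as_text <- as.character(c(TRUE, FALSE))
```

If the data may contain NA, both sum() and mean() accept na.rm = TRUE to drop the missing comparisons first.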

Raw vectors

The logical operators !, & and | have another meaning for raw vectors. raw is a basic type, alongside integer, double, list, and so on. It represents raw bytes (from 0 to 255), here printed in a hexadecimal format:

as.raw(c(0, 10, 255))
[1] 00 0a ff

For these raw vectors, the logical operators have a slightly different, bitwise meaning. See bitwNot, bitwAnd, and bitwOr for the integer counterparts of !, &, and | respectively. We will talk in depth about the raw type and operations on it later in this series.
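As a small illustration of the bitwise meaning, with arbitrarily chosen values:

```r
x <- as.raw(12)   # 0c, bits 00001100
y <- as.raw(10)   # 0a, bits 00001010

both    <- x & y   # bits set in both operands
either  <- x | y   # bits set in at least one operand
flipped <- !x      # every bit inverted

# The integer counterpart of & from the bitw* family.
int_and <- bitwAnd(12L, 10L)
```

Printing both, either and flipped shows the results in hexadecimal, in the same format as the as.raw() output above.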

Relational operators

Relational operators are smaller than <, larger than >, smaller or equal <=, larger or equal >=, equal ==, and not equal !=. For numeric (integer and double) and logical (which is converted to numeric) values, the comparisons are what you might expect, just standard numerical comparisons.

5L > 2L # integers; the `L` suffix marks an integer literal
[1] TRUE
5.3 > 2.8
[1] TRUE
TRUE > FALSE
[1] TRUE

For raw, the numeric order is used, so the comparisons also behave as expected. For complex, only the == and != comparisons are implemented.

as.raw(5) > as.raw(2)
[1] TRUE
5+1i > 2+1i
Error in 5 + (0+1i) > 2 + (0+1i): invalid comparison with complex values

String comparisons

For character strings, the comparisons are quite a bit more complicated. If you consider only the standard English alphabet, the solution might feel obvious. But every language has slightly different rules about the order of characters in its alphabet. Quoting from the R help page:

Beware of making any assumptions about the collation order: e.g. in Estonian ‘Z’ comes between ‘S’ and ‘T’, and collation is not necessarily character-by-character - in Danish ‘aa’ sorts as a single letter, after ‘z’. In Welsh ‘ng’ may or may not be a single sorting unit: if it is it follows ‘g’.

On top of this, the order is system and locale dependent. This makes sorting extremely unpredictable when comparing strings across languages.

For more information, see Microsoft page or long Unicode explanation.

I am happy that smarter people solved it and I don’t need to know the details.

Floating point comparisons

There is a caveat when comparing doubles. You might remember that computers work in binary, so every number is represented by a combination of bits that can be 0 or 1. You might start sensing that something is off here. How can a fractional number such as 0.1 be represented by just a finite combination of 0s and 1s? Mostly, it can’t: computers represent perfectly only whole numbers (within a limited range) and fractions whose denominator is a power of two, such as 0.5 or 0.25. Anything else is stored imperfectly and you get rounding errors. R tries to hide this imperfection by rounding when printing, but the abstraction leaks when you compare for equality.

(1 - 0.9)
[1] 0.1
(1 - 0.9) == 0.1
[1] FALSE

The > and < operators still work as intended, but when you are comparing floating point numbers for equality, use all.equal instead and select an appropriate tolerance (by default sqrt(.Machine$double.eps) is used):

all.equal( (1 - 0.9), 0.1)
[1] TRUE

Unlike with string comparison, this is a common issue that can significantly influence the results of your code, especially if you are writing any kind of numerical algorithm. You need to be aware of these issues. See shorter explanation and longer explanation and try to understand them.
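One practical detail: when the values differ, all.equal() returns a description of the difference rather than FALSE, so inside a condition it is usually wrapped in isTRUE(). The sketch below shows this; near() is a hypothetical helper written for this example, not a base R function:

```r
x <- 0.1 + 0.2

exact_equal  <- x == 0.3                     # exact comparison of doubles
stored_value <- sprintf("%.17f", x)          # more digits than print() shows
approx_equal <- isTRUE(all.equal(x, 0.3))    # comparison with a tolerance

# all.equal() describes the difference instead of returning FALSE.
report   <- all.equal(1, 1.1)
in_cond  <- isTRUE(report)

# Hypothetical helper with an explicit absolute tolerance.
near <- function(a, b, tol = sqrt(.Machine$double.eps)) abs(a - b) < tol
near_equal <- near(x, 0.3)
```

Note that all.equal() compares relative differences by default, while this near() sketch uses an absolute one; which is appropriate depends on the scale of the numbers involved.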

Group generic methods and Ops

Some functions are grouped into so-called group generics. A method defined for the whole group is dispatched when any of the grouped functions is called. While it might look strange at first, this allows for some neat tricks to efficiently define common mathematical operations for a large group of functions.

For instance, consider Ops.POSIXt which defines arithmetic, logical and relational operators for the S3 class POSIXt.

Ops.POSIXt
function (e1, e2) 
{
    if (nargs() == 1L) 
        stop(gettextf("unary '%s' not defined for \"POSIXt\" objects", 
            .Generic), domain = NA)
    boolean <- switch(.Generic, `<` = , `>` = , `==` = , `!=` = , 
        `<=` = , `>=` = TRUE, FALSE)
    if (!boolean) 
        stop(gettextf("'%s' not defined for \"POSIXt\" objects", 
            .Generic), domain = NA)
    if (inherits(e1, "POSIXlt") || is.character(e1)) 
        e1 <- as.POSIXct(e1)
    if (inherits(e2, "POSIXlt") || is.character(e2)) 
        e2 <- as.POSIXct(e2)
    check_tzones(e1, e2)
    NextMethod(.Generic)
}
<bytecode: 0x55a877002078>
<environment: namespace:base>

First, the function checks if only a single argument was passed, because unary operators are not defined.

Then it uses a switch statement, which evaluates to TRUE if one of <, >, ==, !=, <=, and >= matches the .Generic variable, an automatic variable that holds the name of the generic from the Ops group that was dispatched. So if the function call was !POSIXt, .Generic is the negation operator !. The switch statement thus works as a check whether a particular operator makes sense for the class, throwing an error if the operation is not defined.

This doesn’t mean that operators not specified in the switch are undefined. In fact, both +.POSIXt and -.POSIXt exist.
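The effect of these methods can be seen on a concrete time stamp. A sketch with an arbitrary date and illustrative variable names:

```r
t0 <- as.POSIXct("2023-03-04 12:00:00", tz = "UTC")

# Handled by +.POSIXt: adding seconds gives another time stamp.
later <- t0 + 3600

# Handled by Ops.POSIXt: relational operators are allowed.
is_later <- later > t0

# Handled by -.POSIXt: the difference of two time stamps is a difftime.
elapsed <- as.numeric(later - t0, units = "secs")

# Rejected by Ops.POSIXt: multiplication is not in the switch.
product <- tryCatch(t0 * 2, error = function(e) conditionMessage(e))
```

Here product ends up holding the error message raised by the stop() call shown in the function body above.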

POSIXt, POSIXlt and POSIXct

R contains a number of different classes for dates.

POSIXct is a signed number of seconds since the beginning of 1970 (UTC time zone), the Unix time.

POSIXlt is a named list of vectors with sec, min, hour, mday and so on.

POSIXct is a really simple and convenient way to keep time, but not very human-readable (unless you are a Unix guru). POSIXlt is a nice human-readable format, but not very convenient when you want to store precise times in a data.frame.

POSIXt is a virtual class; both POSIXct and POSIXlt inherit from it. This means that when you define a method for POSIXt, it will automatically work for POSIXct and POSIXlt.
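The two representations can be inspected directly. A small sketch using one day after the Unix epoch, with illustrative variable names:

```r
ct <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
lt <- as.POSIXlt(ct)

# POSIXct is a number of seconds under the hood.
seconds <- as.numeric(ct)

# POSIXlt exposes calendar components; year counts from 1900.
day_of_month <- lt$mday
hour         <- lt$hour
year         <- lt$year + 1900

# Both classes list POSIXt as their parent.
ct_class <- class(ct)
lt_class <- class(lt)
```

The shared POSIXt entry in both class vectors is what lets a single Ops.POSIXt method serve both representations.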

Since POSIXt is a virtual class, and math with a number of seconds (POSIXct) is simpler than math on a complex data structure (POSIXlt, see the box), what follows is a conversion into POSIXct and a call to NextMethod(), which simply calls the appropriate method for the .Generic on the modified parameters. There is nothing simple about NextMethod(), and we will talk about it later, when we discuss the S3 system in depth.

Double dispatch

Methods in the Ops group are special because most of them are binary and they need to do something called double dispatch. For instance, if you are adding objects of class foo and bar, you need to dispatch the correct method for this particular addition, taking into account the classes of both objects, not just foo or bar independently. To keep addition commutative, the dispatched method should be identical even if the order of addition is changed (bar + foo).

S3 has a rather primitive double-dispatch system. The rules are as follows:

  1. If neither object has a method, use the internal method.
  2. If only one object has a method, use that method.
  3. If both objects have a method, and the methods:
    1. are identical, use that method.
    2. are not identical, throw a warning and use internal methods.

To demonstrate this, we define a helper function and methods for the + generic for classes foo and bar. These simply return NULL and print a message, which will tell us which method was dispatched.

# helper to make object less wordy
obj = function(class){
    structure(1, class=class)
    }
`+.foo` = function(...){message("foo")}
`+.bar` = function(...){message("bar")}

The first rule is just a normal addition:

1 + 1
[1] 2

According to the second rule, the +.foo method is dispatched:

1 + obj("foo")
foo
NULL
obj("foo") + 1
foo
NULL
obj("foo") + obj("baz") # no method +.baz
foo
NULL

According to the third rule, since methods are not identical, we will get a warning.

obj("foo") + obj("bar")
Warning: Incompatible methods ("+.foo", "+.bar") for "+"
[1] 2
attr(,"class")
[1] "foo"
obj("bar") + obj("foo")
Warning: Incompatible methods ("+.bar", "+.foo") for "+"
[1] 2
attr(,"class")
[1] "bar"

Finally, if we redefine the +.foo and the +.bar to be identical, addition works again:

`+.foo` = function(...){message("foobar")}
`+.bar` = function(...){message("foobar")}
obj("foo") + obj("bar")
foobar
NULL

Notice that as long as the other object doesn’t have an S3 method defined, we can make the dispatched S3 method quite complex and build in support for different types. But this mechanism is fragile: once someone defines a method for one of the supported types, our method would be ignored and the dispatch would default to the internal method (with a warning). For an explicit double dispatch system that can be further extended, we would need to go for S4 classes.

Example: sparse_vector

As an example, we will create an S3 class sparse_vector. There is an S4 class in the Matrix package, which is a more reasonable choice since S4 methods can be extended, but we do this only for demonstration purposes.

First, we will define our class. A sparse vector is defined by the vector of values x, the vector of indices i and the length of the vector. Normally, we would make a bare-bones new_sparse_vector constructor, and sparse_vector would be an interface that also checks the validity of input parameters, but we will keep it simple.

We also create methods for the as.vector and print generics, so that we can easily unroll our sparse vector into a normal one and display it.

sparse_vector = function(x, i, length){
    structure(list(
        x = x,
        i = i,
        length = length
        ), class = "sparse_vector")
    }

as.vector.sparse_vector = function(x, ...){
    y = vector(mode=mode(x$x), length=x$length)
    y[x$i] = x$x
    y
    }

print.sparse_vector = function(x, ...){
    y = rep(".", x$length)
    y[x$i] = as.character(x$x)
    print(y, quote=FALSE)
    }

a = sparse_vector(c(3,5,8), c(2,4,6), 10)
a
 [1] . 3 . 5 . 8 . . . .
as.vector(a)
 [1] 0 3 0 5 0 8 0 0 0 0
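For completeness, the constructor and validator split mentioned earlier could look roughly like the sketch below. The name validate_sparse_vector and the specific checks are assumptions made for this illustration; the rest of the article keeps using the simple sparse_vector() defined above.

```r
# Bare constructor: builds the object without any checks.
new_sparse_vector <- function(x, i, length) {
  structure(list(x = x, i = i, length = length), class = "sparse_vector")
}

# Hypothetical validator: these checks are an assumption about what "valid" means.
validate_sparse_vector <- function(v) {
  if (length(v$x) != length(v$i))
    stop("x and i must have the same length")
  if (anyDuplicated(v$i) > 0)
    stop("indices must be unique")
  if (any(v$i < 1 | v$i > v$length))
    stop("indices must lie between 1 and length")
  v
}

# A valid object passes through unchanged.
ok <- validate_sparse_vector(new_sparse_vector(c(3, 5, 8), c(2, 4, 6), 10))

# An index outside the vector is rejected; the message is captured here.
bad <- tryCatch(
  validate_sparse_vector(new_sparse_vector(c(3, 5), c(2, 12), 10)),
  error = function(e) conditionMessage(e)
)
```

Keeping the checks out of the bare constructor means internal code that already knows its input is valid can skip them.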

We want to support all operations in the Ops group generic.

Ops.sparse_vector = function(e1, e2){
    if(inherits(e1, "sparse_vector"))
        e1 = as.vector(e1)
    # If Ops is unary, e2 is missing
    if(!missing(e2) && inherits(e2, "sparse_vector"))
        e2 = as.vector(e2)
    NextMethod(.Generic)
    }

a + 2
 [1]  2  5  2  7  2 10  2  2  2  2
2 + a
 [1]  2  5  2  7  2 10  2  2  2  2
a * 2
 [1]  0  6  0 10  0 16  0  0  0  0
a == 0
 [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
- a
 [1]  0 -3  0 -5  0 -8  0  0  0  0

And essentially for free, we suddenly support all operations. But this isn’t very efficient if we are summing two sparse vectors, for instance. So for some operations, we define their own functions.

`+.sparse_vector` = function(e1, e2){
    message("+.sparse_vector dispatched")
    if(nargs() == 1L)
        return(e1)

    if(inherits(e1, "sparse_vector") && inherits(e2, "sparse_vector")){
        i = union(e1$i, e2$i) |> sort()
        j = intersect(e1$i, e2$i) |> sort()
        l = max(e1$length, e2$length) # ignoring recycling for now
        y = sparse_vector(rep(0, length(i)), i, l)
        y$x[i %in% e1$i] = e1$x
        y$x[i %in% e2$i] = e2$x
        y$x[i %in% j] = e1$x[e1$i %in% j] + e2$x[e2$i %in% j]
        return(y)
        }
    if(inherits(e1, "sparse_vector")){
        e2[e1$i] = e2[e1$i] + e1$x
        return(e2)
        }
    if(inherits(e2, "sparse_vector")){
        e1[e2$i] = e1[e2$i] + e2$x
        return(e1)
        }
    stop("This should be unreachable state")
    }

b = sparse_vector(c(1, 2, 3), c(1, 4, 6), 10)
+ a
+.sparse_vector dispatched
 [1] . 3 . 5 . 8 . . . .
+ b
+.sparse_vector dispatched
 [1] 1 . . 2 . 3 . . . .
a + b
+.sparse_vector dispatched
 [1] 1  3  .  7  .  11 .  .  .  . 
a + rep(1, 10)
+.sparse_vector dispatched
 [1] 1 4 1 6 1 9 1 1 1 1
b + rep(1, 10)
+.sparse_vector dispatched
 [1] 2 1 1 3 1 4 1 1 1 1

The code isn’t perfect: we ignore a lot of recycling, which means operations with matrices will not work correctly, and so on. But as an example of Ops and operator overloading, this should be sufficient. One obvious improvement would be redefining the operator [ to make subsetting of the vector much easier, and thus the code cleaner. We will look at this in the next article.

Summary

In this part, we have learned about operators, and more closely explored arithmetic, logical and relational operators. On top of this, we have learned more about group generic functions and the double dispatch system, and managed to implement a simple sparse_vector class with working arithmetic, logical and relational operators.