+ a # unary `+`
- a # unary `-`
+ b # binary `+`
a - b # binary `-` a
Base R from A to Z: Operators (3)
Introduction
In the third part of the series about base R, we will talk about operators. As we have mentioned in the previous part, operators are construct similar to functions, but with special syntax or semantics. While you can call them as a functions, and there are instances where you want to do that, normally you call them in a special way called infix notation.
R offers a multitude of operators and a way to define new operators through the %any%
notation. Operators can be divided into these groups:
- Arithmetic operators
+
,-
,*
,/
,^
,%%
,%/%
- Logical operators
!
,&
,&&
,|
,||
- Relational operatos
<
,>
,<=
,>=
,==
,!=
- Subsetting operators
[
,[[
,@
,$
- Assignment operators
<-
,<<-
,=
,[<-
,[[<-
,@<-
,$<-
- Matrix operators
%*%
,%o%
,%x%
- Special operator
%anything%
- Matching operator
%in%
- Sequence operator
:
- Operators in formula:
~
,:
,%in%
- Namespace access:
::
,:::
- Parentheses and braces
(
,{
We will talk about Arithmetic, Logical, Relational, Subsetting, Assigment and Matrix operators, as well as the special operator %anything%
. Matching operator is just a shorthand for the match()
function, so we will leave it for later. The sequence operator :
is a primitive for performance (and parsing) reasons, but we will talk about it together with other sequence-generating functions. The ::
and :::
are quite complex, so we will talk about them when we go deeper into environments, namespaces and package access, and the (
and {
are language features rather than operators or functions, so we skip them for later. Finally, the ?
operator is not part of the base, but is in utils.
Operator precedence
From basic mathematics, we are used to the idea that multiplication precedes addition, so 3 + 5 * 2
is interpreted as 3 + (5 * 2)
without having to use parentheses. These rules are called operator precedence. Here I am reproducing the list from the ?Syntax
help command.
For listed operators, precedence goes from highest (evaluated first) to lowest. Operators with the same precedence are evaluated in the order they are encountered.
::
,:::
$
,@
[
,[[
^
-
,+
in their unary form (-3
,+5
):
%any%
,|>
(base R pipe)*
,/
+
,-
in their binary form (3+5
)<
,>
,<=
,==
,!=
!
&
,&&
|
,||
~
->
,->>
<-
,<<-
=
?
Now for some interesting consequences. With ^
having such high precedence, -a^2
is interpeted as -(a^2)
instead of (-a)^2
, but then -1:3
is interpeted as (-1):3
and not -(1:3)
.
The pipe |>
has high precedence, which mean you can’t do a + 2 |> ...
, because it is interpeted as a + (2 |> ...)
. This bites me all the time, when I just want to do some simple division or addition inside pipes.
Finally =
has lower precedence than <-
. This shouldn’t be an issue since you should use either =
or <-
as your assignment operator.
The takeway is that you should not rely on common sense your knowledge of operator precedence rules, but use parentheses to make operations as much explicit as possible.
Arithmetic operators
Arithmetic operators are addition +
, subtraction -
, multiplication *
, division /
, exponentiation ^
, modulo %%
and integer division %/%
.
The addition, subtraction, multiplication, division, and exponentiation are well known. Module and the integer division are less common, but occasionaly useful in programming. Modulo %%
is the remainder after integer division, so that 5 %% 2 = 1
. The %/%
is the integer division 5 %/% 2 = 2
. These are used when we want to know how many times something fit in our number, if we need to do something with a certain frequency, or if you want to know if a number is even. But arguably, some of these are displaced by sequence functions, and if we want to fit k
elements into categories of size m
, ceiling(k/m)
gives us exactly that. So personally, I haven’t used modulo or integer division very often.
The +
and -
operators can be both unary and binary, meaning they accept one or two parameters. The meaning and behaviour, such as operator precedence, are different for the unary and binary version.
This is related to how numbers are parsed by the interpreter (see literals. Details are not important, just remember that the unary and binary operators are different, and that the unary -
makes numbers negative. Fun fact, as a consequence of the parsing rules, this is valid R code:
+ - + - 5 + - + - + + - - + - 1
[1] 4
As with most fun facts, you should not ever write a code like this.
The strange case of **
In some languages such as Python, **
is a power operator. This is also supported in R:
2 ^ 3
[1] 8
2 ** 3
[1] 8
Strangely, **
is not documented, **
is not a primitive, and when you type bare **
in R, you will get a curious error:
**
Error: <text>:1:1: unexpected '^'
1: **
^
There is a small note in the ?Arithmetic
. Apparently, **
existed in S (the precedesors of R) but was deprecated. For backward compatibility, the R parser kept this functionality and translates **
into ^
. Since it is undocumented and not part of the official language specification, do not use **
.
Logical operators and functions
Since we are already talking about the logical operators, I thought it might be more useful to look more closely at the logical type and functions that operate with it.
Logical operators are negation !
, which is a unary operator, and binary logical and &
and &&
, logical or |
and ||
. Aside of these operators, there is also a function xor()
, helpers isTRUE()
, isFALSE()
, useful primitives all()
and any()
, and functions for the creation, testing, and conversion to the logical type: logical()
, is.logical()
, and as.logical()
.
Logical operators
Logical operators are negation !
, logical and &
, &&
, and logical or |
, ||
.
The negation !
operator just makes a FALSE
value TRUE
and the other way around. The and and or operators are more interesting in the way they work on vectors.
The single symbol and and or operators &
and |
work like +
or -
, they are applied element-wise and vectors are recycled as required.
c(TRUE, TRUE, FALSE) & c(TRUE, FALSE, FALSE)
[1] TRUE FALSE FALSE
c(TRUE, TRUE, FALSE) | c(TRUE, FALSE, FALSE)
[1] TRUE TRUE FALSE
The double-symbol and and or are not vectorized, they uses only the first element of a vector, while other elements are ignored. Likewise, the output of &&
and ||
is a single TRUE
or FALSE
value. This makes &&
and ||
useful when doing control flow operations with if(condition){...}
.
c(TRUE, TRUE) && c(TRUE, FALSE) # only the first value is used
[1] TRUE
Short-circuting
In addition, the &&
and ||
short-circuits, the second expression is evaluated only if it is required. For instance, in the expression:
a || b
If a
evaluates to TRUE
, the whole expression evaluates to TRUE
regardless of b
. This means that if a = TRUE
, b
doesn’t have to be evaluated and in fact isn’t evaluated. So if the b
was a function call with some side effects, the side effects (reading file, printing, increasing a counter) are never evaluated. For instance, consider an expression that prints foo
and returns TRUE
:
{print("foo"); TRUE}
If we use two of these expression with ||
, only single foo
will be printed:
print("foo"); TRUE} || {print("foo"); TRUE} {
[1] "foo"
[1] TRUE
Contrast this with the &
or |
which do not short-curcuit:
print("foo"); TRUE} | {print("foo"); TRUE} {
[1] "foo"
[1] "foo"
[1] TRUE
In some languages, this is used as a control structure, since this behaviour can be transformed into:
if a then a, else b
But I haven’t seen this being used in R. Since the R code is typically light on side-effects, I don’t think there is a much value in trying to be cheeky in this way.
NA behaviour
NA
or not available is a peculiar value (or values, since there is NA_character_
, NA_integer_
atp.) that symbolize a missing value. Most programing languages do not have this (usually, they only have NULL
), but R was designed as a domain-specific language for data-analysis and unknown values are a common problem.
Typically, any numerical calculation involving NA
typically results in NA
.
3 + NA
[1] NA
5 * NA
[1] NA
Logical operations are slightly different and result in NA
only if the expression depends on it. You can intepret this dependence in a similar manner as short-circuting, but this works even for the non-short-circuiting operators |
and ||
.
For example, in the x | NA
, if x
is TRUE
, the whole expression is also TRUE
. But if x
is FALSE
, the whole expression depends on the unknown value NA
, which means that the expression evaluates to NA
.
TRUE | NA
[1] TRUE
FALSE | NA
[1] NA
Similarly, in the x & NA
, if x
is FALSE
, the whole expression is FALSE
. But if x
is TRUE
, the result of the expression depends on the NA
and the expression evaluates to NA
.
FALSE & NA
[1] FALSE
TRUE & NA
[1] NA
In the same manner, the !NA
depends on the NA
so it evaluates to NA
and so on.
Logical functions
Now, lets move to the functions. The isTRUE()
and isFALSE()
are also useful shorthands for control flow. They detect if the condition is exactly TRUE
or FALSE
respectively, a good way to avoid pitalls with NA
, NULL
, which might arise from some operations and should be processed appropriatelly.
The logical operations xor()
is elementwise exclusive or. It is not an operator, and not a primitive, likely because it is not used very often. You can see that xor()
is just a shorthand for (x | y) & !(x & y)
xor
function (x, y)
{
(x | y) & !(x & y)
}
<bytecode: 0x55a87771d0f8>
<environment: namespace:base>
Very useful if only x or y are allowed, such as if you are using parameters as a mutually exclusive flags. Consider a function that does either foo
or bar
, but can’t do both at the same time:
myfunc = function(x, doFoo=FALSE, doBar=FALSE){
if(!xor(doFoo, doBar))
stop("Must select either doFoo or doBar")
if(doFoo)
return(foo(x))
if(doBar)
return(bar(x))
}
Without xor
, the function might not behave as expected, not returning anything if either both conditions are FALSE
, or performing only one operation if both conditions are TRUE
.
You can turn xor()
easily into an infix operator %xor%
.
`%xor%` = xor
TRUE %xor% FALSE
[1] TRUE
Before we start talking about the logical
type in general, lets quickly mention the all()
and any()
. These are vectorized and will tell you if all or any value of the vector are TRUE
respectively. For efficiency reason, they are primitives, because they are used quite often for the control flow.
= c(TRUE, TRUE, FALSE)
vec all(vec) # will be FALSE because not all values are TRUE
[1] FALSE
any(vec) # TRUE as at least one value is TRUE
[1] TRUE
You can define a vectorized variant of xor
which tells you that there is exactly one TRUE
rather easily:
= function(x){
one sum(x, na.rm=TRUE) == 1
} one(c(TRUE, TRUE, FALSE)) # will be FALSE
[1] FALSE
one(c(FALSE, TRUE, FALSE)) # will be TRUE
[1] TRUE
Conversion rules
This brings us to an important step, conversion. You can apply logical operators not only on the logical TRUE
and FALSE
values, but as with many similar operations in R, the values are automatically converted to the correct type.
Conversion is performed internally using the as.logical
primitive. Valid conversions are from numeric
(that is, both integer
and double
), complex
, and character
. For the numeric
and complex
, 0
is converted into FALSE
and non-zero value is converted into TRUE
:
as.logical(0)
[1] FALSE
as.logical(5i+3)
[1] TRUE
For the character type, the conversion is a bit more complex. The strings “T”, “TRUE”, “True” and “true” are converted into TRUE
, and similarly “F”, “FALSE”, “False” and “false” are converted into FALSE
. Everything else, including “TrUe” and similar messy capitalizations, is converted into NA
.
as.logical(c("T", "TRUE", "True", "true"))
[1] TRUE TRUE TRUE TRUE
as.logical(c("F", "FALSE", "False", "false"))
[1] FALSE FALSE FALSE FALSE
as.logical(c("1", "0", "truE", "foo"))
[1] NA NA NA NA
The logical(n)
is a shorthand for vector("logical", n)
and simply create a logical vector of length n
with values initialized to FALSE
.
logical(5)
[1] FALSE FALSE FALSE FALSE FALSE
When converting from logical to numeric, the TRUE
and FALSE
converts to 1
and 0
respectively. This allows for the convenient use of sum
or mean
to count the number or the percentage of matches. When converting from logical to character, the TRUE
and FALSE
gets converted simply to "TRUE"
and "FALSE"
strings.
Raw vectors
The logical operators !
, &
and |
have another meaning for raw vectors. raw
is a basic type, alongside integer
, double
, list
, and so on. It represents raw bytes (from 0 to 255), here printed in a hexadecimal format:
as.raw(c(0, 10, 255))
[1] 00 0a ff
For these raw vectors, the logical operators have a slightly different bitwise meaning. See bitwNot
, botwAnd
, and bitwOr
for !
, &
, and |
respectively. We will talk in-depth about raw type and operations on them later in this series.
Relational operators
Relational operatos are smaller than >
, larger than >
, smaller or equal <=
, _larger or equal >=
, equal ==
, and not equal !=
. For numeric
(integer
and double
) and logical
(which is converted to numeric), the comparisons are what you might expect, just a standard numerical comparisons.
> 2L # integers, the `long` type 5L
[1] TRUE
5.3 > 2.8
[1] TRUE
TRUE > FALSE
[1] TRUE
For raw
, the numeric order is used so the comparisons also behaves as expected, and for complex
, only the ==
and !=
comparisons are implemented.
as.raw(5) > as.raw(2)
[1] TRUE
5+1i > 2+1i
Error in 5 + (0+1i) > 2 + (0+1i): invalid comparison with complex values
String comparisons
For character
strings, the comparisons are quite a bit more complicated. If you consider only standard english alphabet, the problem might feel obvious. But every language has slightly different rules about order of character in their alphabet. Quoting from the R help page:
Beware of making any assumptions about the collation order: e.g. in Estonian ‘Z’ comes between ‘S’ and ‘T’, and collation is not necessarily character-by-character - in Danish ‘aa’ sorts as a single letter, after ‘z’. In Welsh ‘ng’ may or may not be a single sorting unit: if it is it follows ‘g’.
On top of this, the order is system and locale dependent. This makes sorting extremely unpredictable when comparing strings accross languages.
For more information, see Microsoft page or long unicode explanation.
I am happy that smarter people solved it and I don’t need to know the details.
Floating point comparisons
There is a caveat when comparing doubles
. You might remember that computers work in a binary. This means that every number is represented by combination of bits that can be 0
or 1
. You might start sensing that something is off in here. How can I represent a rational number by just combination of 0
and 1
? This is because computers are able to represent perfectly only whole numbers, integer
. Anything else is represented imperfectly and you get rounding errors. R is trying to hide this imperfection by rounding when printing, but the abstraction will leak if you try to compare for equality.
1 - 0.9) (
[1] 0.1
1 - 0.9) == 0.1 (
[1] FALSE
The >
and <
still works as intended, but when you are comparing floating point numbers, use all.equal
instead and select an appropriate precision (by default sqrt(.Machine$double.eps)
is used):
all.equal( (1 - 0.9), 0.1)
[1] TRUE
Unlike with string comparison, this is common issue that will significantly influence the performance of your code, especially if you are writing any kind of numerical algorithm. You need to be aware of these issues. See shorter explanation and longer explanation and try to understand them.
Group generic methods and Ops
Some methods are grouped into groups called group generics functions. These methods are dispatched when any of the grouped functions are called. While it might look strange at first look, this allows for some neat tricks to efficiently define common mathematical operations for a large group of functions.
For instance, consider Ops.POSIXt
which defines arithmetic, logical and relational operators for the S3 class POSIXt
.
Ops.POSIXt
function (e1, e2)
{
if (nargs() == 1L)
stop(gettextf("unary '%s' not defined for \"POSIXt\" objects",
.Generic), domain = NA)
boolean <- switch(.Generic, `<` = , `>` = , `==` = , `!=` = ,
`<=` = , `>=` = TRUE, FALSE)
if (!boolean)
stop(gettextf("'%s' not defined for \"POSIXt\" objects",
.Generic), domain = NA)
if (inherits(e1, "POSIXlt") || is.character(e1))
e1 <- as.POSIXct(e1)
if (inherits(e2, "POSIXlt") || is.character(e2))
e2 <- as.POSIXct(e2)
check_tzones(e1, e2)
NextMethod(.Generic)
}
<bytecode: 0x55a877002078>
<environment: namespace:base>
First, the function checks if only a single argument was passed, because unary operators are not defined.
Then it uses switch
statement, which evaluates to TRUE
if one of <
, >
, ==
, !=
, <=
, and >=
matches the .Generic
variable, which is an automatic variable initialized to the dispatched generics from the Ops
group. So if the function call was !POSIXt
, the .Generic
is the negation operator !
. The switch
statement thus works as a check whether a particular operator makes sense for the class, throwing error if the operation was not defined.
This doesn’t mean that operators not specified in the switch are undefined. In fact, both +.POSIXt
and -.POSIXt
exists.
Since POSIXt
is a virtual class, and math with a number of seconds POSIXct
is simpler than math on a complex data structure POSIXlt
(see the box), what follows is a conversion into POSIXct
, and the call to NextMethod()
, which simply calls the appropriate method for the .Generic
on the modified parameters. There is nothing simple on NextMethod()
and we will talk about it later, when we are talking in-depth about the S3 system.
Double dispatch
Methods in the Ops
group are special because most of them are binary and they need to do something called double dispatch. For instance, if you are adding objects of class foo
and bar
, you need to dispatch the correct method for this particular addition, taking in account classes of both objects, not just foo
or bar
independently. To keep addition comutative, method needs should be identical even if the order of addition is changed (bar + foo
).
S3 has rather primitive double-dispatch system, the rules are as follows:
- If neither objects has a method, use internal method.
- If only one object has a method, use that method.
- If both objects have a method, and the methods:
- are identical, use that method.
- are not identical, throw a warning and use internal methods.
To demonstrate this, we define a helper function and methods for the +
generic for classes foo
and bar
. These simply return NULL
and print a message, which will tell us which method was dispatched.
# helper to make object less wordy
= function(class){
obj structure(1, class=class)
}`+.foo` = function(...){message("foo")}
`+.bar` = function(...){message("bar")}
The first rule is just a normal addition:
1 + 1
[1] 2
According to the second rule, the +.foo
method is dispatched:
1 + obj("foo")
foo
NULL
obj("foo") + 1
foo
NULL
obj("foo") + obj("baz") # no method +.baz
foo
NULL
According to the third rule, since methods are not identical, we will get a warning.
obj("foo") + obj("bar")
Warning: Incompatible methods ("+.foo", "+.bar") for "+"
[1] 2
attr(,"class")
[1] "foo"
obj("bar") + obj("foo")
Warning: Incompatible methods ("+.bar", "+.foo") for "+"
[1] 2
attr(,"class")
[1] "bar"
Finally, if we redefine the +.foo
and the +.bar
to be identical, addition works again:
`+.foo` = function(...){message("foobar")}
`+.bar` = function(...){message("foobar")}
obj("foo") + obj("bar")
foobar
NULL
Notice that if one of the object doesn’t have defined an S3 method, we can make the S3 method that is dispatched quite complex and built-in support for different types. But this mechanism will be fragile and once someone defines method for one of the supported types, our method would be ignored and the dispatch would default to an internal method (with warning). For an explicit double dispatch system that can be further extended, we would need to go for S4 classes.
Example: sparse_vector
As an example, we will create an S3 class sparse_vector
. There is an S4 class in the Matrix
package, which is a more reasonable choice since S4 methods can be extended, but we do this only for demonstration purposes.
First, we will define our class, sparse vector is defined by the vector of values x
, vector of indices i
and the length of the vector. Normally, we would make a barebone new_sparse_vector
constructor, and the sparse_vector
one would be an interface that would also check validity of input parameters, but we will keep it simple.
We also create a method for the as.vector
generics, so that we can easily unroll our sparse vector into a normal one.
= function(x, i, length){
sparse_vector structure(list(
x = x,
i = i,
length = length
class = "sparse_vector")
),
}
= function(x, ...){
as.vector.sparse_vector = vector(mode=mode(x$x), length=x$length)
y $i] = x$x
y[x
y
}
= function(x, ...){
print.sparse_vector = rep(".", x$length)
y $i] = as.character(x$x)
y[xprint(y, quote=FALSE)
}
= sparse_vector(c(3,5,8), c(2,4,6), 10)
a a
[1] . 3 . 5 . 8 . . . .
as.vector(a)
[1] 0 3 0 5 0 8 0 0 0 0
We want to support all operations in the group generics Ops
.
= function(e1, e2){
Ops.sparse_vector if(inherits(e1, "sparse_vector"))
= as.vector(e1)
e1 # If Ops is unary, e2 is missing
if(!missing(e2) && inherits(e2, "sparse_vector"))
= as.vector(e2)
e2 NextMethod(.Generic)
}
+ 2 a
[1] 2 5 2 7 2 10 2 2 2 2
2 + a
[1] 2 5 2 7 2 10 2 2 2 2
* 2 a
[1] 0 6 0 10 0 16 0 0 0 0
== 0 a
[1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
- a
[1] 0 -3 0 -5 0 -8 0 0 0 0
And essentially for free, we suddenly support all operations. But this isn’t very efficient if we are summing two sparse vectors, for instance. So for some operations, we define their own functions.
`+.sparse_vector` = function(e1, e2){
message("+.sparse_vector dispatched")
if(nargs() == 1L)
return(e1)
if(inherits(e1, "sparse_vector") && inherits(e2, "sparse_vector")){
= union(e1$i, e2$i) |> sort()
i = intersect(e1$i, e2$i) |> sort()
j = max(e1$length, e2$length) # ignoring recycling for now
l = sparse_vector(rep(0, length(i)), i, l)
y $x[i %in% e1$i] = e1$x
y$x[i %in% e2$i] = e2$x
y$x[i %in% j] = e1$x[e1$i %in% j] + e2$x[e2$i %in% j]
yreturn(y)
}if(inherits(e1, "sparse_vector")){
$i] = e2[e1$i] + e1$x
e2[e1return(e2)
}if(inherits(e2, "sparse_vector")){
$i] = e1[e2$i] + e2$x
e1[e2return(e1)
}stop("This should be unreachable state")
}
= sparse_vector(c(1, 2, 3), c(1, 4, 6), 10)
b + a
+.sparse_vector dispatched
[1] . 3 . 5 . 8 . . . .
+ b
+.sparse_vector dispatched
[1] 1 . . 2 . 3 . . . .
+ b a
+.sparse_vector dispatched
[1] 1 3 . 7 . 11 . . . .
+ rep(1, 10) a
+.sparse_vector dispatched
[1] 1 4 1 6 1 9 1 1 1 1
+ rep(1, 10) b
+.sparse_vector dispatched
[1] 2 1 1 3 1 4 1 1 1 1
The code isn’t perfect, we ignore a lot of recycling, which means operations with matrices will not work correctly, and so on. But as an example of the Ops
and operator overloading, this should be sufficient. One obvious improvement would be redefining the operator [
to make subsetting of the vector much easier, and thus the code cleaner. We will look at this in the next article.
Summary
In this part, we have learned about operators, and more closely explored arithmetic, logical and relational operators. On top of this, we have learned more about group generic functions, double dispatch system, and managed to implement a simple sparse_vector
class with working arithmetic, logical and relational operators.