bake, stew, and freeze, for caching and reproducibility

Author

Aaron A. King and Edward L. Ionides


Sub-computations and their interdependencies

A scientific computation can be broken down into sub-computations, each of which depends on previous calculations. We refer to the fact that calculation B depends on the results of a calculation A by saying that B is downstream of A or A is upstream of B.

As we develop a scientific computation, we often need to try various approaches, tweak algorithmic settings, and so on. When we do so, we need of course to re-run the computation to see the new result. This can be costly to the extent that the computation requires resources (time, money, and patience for example). When a computation is expensive, therefore, it can be useful to store its results, so that one does not have to recompute them unnecessarily. For example, suppose we are modifying a sub-computation B that depends on the results of the upstream calculation A. Any modification we make to B will require us to re-run B, but it would be wasteful to also re-run A, since we already know the result of A.

The pomp package provides two tools to help in the caching of intermediate results:
- bake stores a single R object, the result of a specified computation.
- stew can store multiple R objects resulting from a specified computation.
In addition to these, pomp provides freeze, which controls the pseudorandom number generator.

These functions also play a role in supporting reproducibility. In particular, they allow one to store not only the results of a calculation, but also the code that generated the results. We demonstrate them here.

The manual page gives a full description:

Caching with bake

In the following snippet, the results of a very simple calculation are stored in the file result1.rds. In this example, x and y represent the results of an upstream calculation and z is the result of the calculation we which to cache.

x <- 3
y <- runif(2)
bake(file="result1.rds",{
  x+y
}) -> z
x; y; z
[1] 3
[1] 0.74591811 0.08906348
[1] 3.745918 3.089063
attr(,"system.time")
   user  system elapsed 
      0       0       0 

When we run the above snippet for the first time, z is computed according to the recipe given, and is then stored, in R’s binary .rds format, in the file result1.rds. [Note also that bake appears to have retained some information about the amount of time used in the computation; see below for more on this.]

If we run the code again,

bake(file="result1.rds",{
  x+y
}) -> z
z
[1] 3.745918 3.089063
attr(,"system.time")
   user  system elapsed 
      0       0       0 

there is no re-computation of z. Instead, bake notices that result1.rds exists and therefore opens and reads the file, returning the stored result.

What happens if an upstream quantity changes? For example:

x <- 5
y <- runif(2)
bake(file="result1.rds",{
  x+y
}) -> z
x; y; z
[1] 5
[1] 0.63290167 0.09881188
[1] 3.745918 3.089063
attr(,"system.time")
   user  system elapsed 
      0       0       0 

Note that we no longer have x+y==z. Since z depends on x and y, we must re-compute z. To do so, we simply delete the file result1.rds:

file.remove("result1.rds")
[1] TRUE
bake(file="result1.rds",{
  x+y
}) -> z
x; y; z
[1] 5
[1] 0.63290167 0.09881188
[1] 5.632902 5.098812
attr(,"system.time")
   user  system elapsed 
      0       0       0 

Thus, one has to manage the dependencies between bake calls oneself.

Caching with stew

The stew function works just like bake, but it can store multiple R objects, each of which has a name. To do so, it uses R’s .rda file format. For example, consider the following snippet of code.

x <- 5
y <- runif(2)
stew(file="result2.rda",{
  z <- x+y
  w <- rexp(1)
  z+w
})
x; y; z; w
[1] 5
[1] 0.0406416 0.4623526
[1] 5.040642 5.462353
[1] 0.7277024

The ls command allows us to see the names of all R objects that exist in our workspace:

ls()
[1] "w" "x" "y" "z"

Now, if we re-run the stew call:

stew(file="result2.rda",{
  z <- x+y
  w <- rexp(1)
  z+w
})
x; y; z; w
[1] 5
[1] 0.0406416 0.4623526
[1] 5.040642 5.462353
[1] 0.7277024

This just loads the values of z and w from the file into the workspace, as we can see from the following.

rm(x,y,z,w)
stew(file="result2.rda",{
  z <- x+y
  w <- rexp(1)
  z+w
})
ls()
[1] "w" "z"

In the above, the rm call removes the four variables named. We see that the stew call has retrieved z and w, but not x and y.

Notice also that the result of the last line of code inside the stew call, since it is not stored in any named location, is not cached.

Controlling the random-number generator

Both stew and bake allow one to control the pseudorandom number generator (RNG) by fixing its seed. For example, the last snippet above includes a call to rexp, which simulates a draw from an exponential random variable. Ordinarily, each time rexp is called, it returns a different value.

Consider the following snippet.

stew(file="result3.rda",seed=99,{
  w <- rexp(1)
})
w
[1] 0.1694237

Of course, if we call stew again, we will simply reload the result we just computed:

stew(file="result3.rda",seed=99,{
  w <- rexp(1)
})
w
[1] 0.1694237

However, if we now delete the result file and re-run,

file.remove("result3.rda")
[1] TRUE
stew(file="result3.rda",seed=99,{
  w <- rexp(1)
})
w
[1] 0.1694237

we get the same result. The positive integer we pass to the seed argument sets the state of R’s built-in RNG so that subsequent calls to rexp (or any other random-deviate simulator) will produce the same values. To prevent this from affecting calculations outside the stew call, stew restores the RNG to the state it was in just prior to the stew call.

The bake function has exactly the same feature.

Finally, if one wishes to control the RNG in this way, without doing any caching, one can make use of the freeze function provided by pomp. For example, consider the following.

rexp(1)
[1] 1.565154
freeze(rexp(1),seed=34996)
[1] 1.518076
rexp(1)
[1] 0.8314933
freeze(rexp(1),seed=34996)
[1] 1.518076
rexp(1)
[1] 0.06788161

Exercise

Verify the claim that freeze does not affect the RNG outside of its call.


Serialized file formats in R

R has two binary formats for storing general R objects. The .rds format holds a single R object. One reads and writes such files using readRDS and saveRDS, respectively. The .rda format can hold multiple R objects. The load, attach, and save commands allow one to work with such files. See the R help pages for these functions for more information.


Attributes stored by bake and stew

bake stores information about the amount of time required for a computation in the file, as an attribute of the stored object. In addition, if the RNG has been fixed (by means of the seed argument), then the value of seed is stored as another attribute.


Produced in R version 4.4.0 with pomp version 5.9.