bake, stew, and freeze, for caching and reproducibility
Sub-computations and their interdependencies
A scientific computation can be broken down into sub-computations, each of which depends on previous calculations. We refer to the fact that calculation B depends on the results of a calculation A by saying that B is downstream of A or A is upstream of B.
As we develop a scientific computation, we often need to try various approaches, tweak algorithmic settings, and so on. When we do so, we need of course to re-run the computation to see the new result. This can be costly to the extent that the computation requires resources (time, money, and patience for example). When a computation is expensive, therefore, it can be useful to store its results, so that one does not have to recompute them unnecessarily. For example, suppose we are modifying a sub-computation B that depends on the results of the upstream calculation A. Any modification we make to B will require us to re-run B, but it would be wasteful to also re-run A, since we already know the result of A.
The pomp package provides two tools to help in the caching of intermediate results:
- bake stores a single R object, the result of a specified computation.
- stew can store multiple R objects resulting from a specified computation.
In addition to these, pomp provides freeze, which controls the pseudorandom number generator.
These functions also play a role in supporting reproducibility. In particular, they allow one to store not only the results of a calculation, but also the code that generated the results. We demonstrate them here.
The manual page for these functions gives a full description.
Caching with bake
In the following snippet, the results of a very simple calculation are stored in the file result1.rds. In this example, x and y represent the results of an upstream calculation and z is the result of the calculation we wish to cache.
x <- 3
y <- runif(2)
bake(file="result1.rds",{
  x+y
}) -> z
x; y; z
[1] 3
[1] 0.74591811 0.08906348
[1] 3.745918 3.089063
attr(,"system.time")
user system elapsed
0 0 0
When we run the above snippet for the first time, z is computed according to the recipe given, and is then stored, in R’s binary .rds format, in the file result1.rds. [Note also that bake appears to have retained some information about the amount of time used in the computation; see below for more on this.]
If we run the code again,
bake(file="result1.rds",{
  x+y
}) -> z
z
[1] 3.745918 3.089063
attr(,"system.time")
user system elapsed
0 0 0
there is no re-computation of z. Instead, bake notices that result1.rds exists and therefore opens and reads the file, returning the stored result.
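The compute-once, read-thereafter behavior described above can be sketched in base R. This is a minimal illustration, not pomp's implementation: the helper cache_rds, the recipe function, and the counter n_runs are hypothetical names introduced only for this example.

```r
# Hypothetical helper (not part of pomp): compute once, then reuse the file.
cache_rds <- function(file, compute) {
  if (file.exists(file)) {
    readRDS(file)            # cached result found: skip the computation
  } else {
    result <- compute()      # first run: evaluate the recipe
    saveRDS(result, file)    # store it for next time
    result
  }
}

path <- tempfile(fileext = ".rds")
x <- 3
y <- c(0.25, 0.5)

n_runs <- 0                  # counts how often the recipe is evaluated
recipe <- function() {
  n_runs <<- n_runs + 1
  x + y
}

z1 <- cache_rds(path, recipe)  # computes and writes the file
z2 <- cache_rds(path, recipe)  # reads the file; the recipe is not run again
```

After both calls, n_runs is still 1, which is the sense in which the second call does no re-computation.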
What happens if an upstream quantity changes? For example:
x <- 5
y <- runif(2)
bake(file="result1.rds",{
  x+y
}) -> z
x; y; z
[1] 5
[1] 0.63290167 0.09881188
[1] 3.745918 3.089063
attr(,"system.time")
user system elapsed
0 0 0
Note that we no longer have x+y==z. Since z depends on x and y, we must re-compute z. To do so, we simply delete the file result1.rds:
file.remove("result1.rds")
[1] TRUE
bake(file="result1.rds",{
  x+y
}) -> z
x; y; z
[1] 5
[1] 0.63290167 0.09881188
[1] 5.632902 5.098812
attr(,"system.time")
user system elapsed
0 0 0
Thus, one has to manage the dependencies between bake calls oneself.
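One way to manage such a dependency by hand, instead of deleting the file, is to store the upstream inputs next to the result and recompute whenever they differ. The following base-R sketch assumes a hypothetical helper, cache_with_inputs, that is not part of pomp; it only illustrates the idea.

```r
# Hypothetical helper (not part of pomp): recompute when the recorded
# inputs no longer match the current ones.
cache_with_inputs <- function(file, inputs, compute) {
  if (file.exists(file)) {
    stored <- readRDS(file)
    if (identical(stored$inputs, inputs)) {
      return(stored$result)              # inputs unchanged: reuse the result
    }
  }
  result <- do.call(compute, inputs)     # inputs new or changed: recompute
  saveRDS(list(inputs = inputs, result = result), file)
  result
}

path <- tempfile(fileext = ".rds")
add <- function(x, y) x + y
y <- c(0.25, 0.5)

z1 <- cache_with_inputs(path, list(x = 3, y = y), add)  # computed and stored
z2 <- cache_with_inputs(path, list(x = 5, y = y), add)  # x changed: recomputed
```

Storing the inputs verbatim is only sensible when they are small; for large upstream objects, a hash of the inputs would serve the same purpose at lower cost.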
Caching with stew
The stew function works just like bake, but it can store multiple R objects, each of which has a name. To do so, it uses R’s .rda file format. For example, consider the following snippet of code.
x <- 5
y <- runif(2)
stew(file="result2.rda",{
  z <- x+y
  w <- rexp(1)
  z+w
})
x; y; z; w
[1] 5
[1] 0.0406416 0.4623526
[1] 5.040642 5.462353
[1] 0.7277024
The ls command allows us to see the names of all R objects that exist in our workspace:
ls()
[1] "w" "x" "y" "z"
Now, if we re-run the stew call:
stew(file="result2.rda",{
  z <- x+y
  w <- rexp(1)
  z+w
})
x; y; z; w
[1] 5
[1] 0.0406416 0.4623526
[1] 5.040642 5.462353
[1] 0.7277024
This just loads the values of z and w from the file into the workspace, as we can see from the following.
rm(x,y,z,w)
stew(file="result2.rda",{
  z <- x+y
  w <- rexp(1)
  z+w
})
ls()
[1] "w" "z"
In the above, the rm call removes the four variables named. We see that the stew call has retrieved z and w, but not x and y.
Notice also that the value of the last line of code inside the stew call (z+w) is not cached, since it is not assigned to any name.
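The same behavior can be imitated in base R, which may help make clear why only named objects survive. This is a sketch under stated assumptions, not pomp's implementation: cache_rda is a hypothetical helper, and the recipe is passed as a quoted expression for simplicity.

```r
# Hypothetical helper (not part of pomp): cache every named object that
# the recipe creates, then restore those objects by name.
cache_rda <- function(file, expr, envir = parent.frame()) {
  if (!file.exists(file)) {
    work <- new.env(parent = envir)   # scratch space that can still see x and y
    eval(expr, envir = work)          # run the recipe
    save(list = ls(work), file = file, envir = work)  # named objects only
  }
  invisible(load(file, envir = envir))  # returns the names it restored
}

path <- tempfile(fileext = ".rda")
x <- 5
y <- c(0.25, 0.5)

cache_rda(path, quote({
  z <- x + y
  w <- 2
  z + w        # unnamed value: never written to the file
}))

rm(x, y, z, w)
# The file exists, so this recipe is never evaluated; z and w are reloaded.
restored <- cache_rda(path, quote(stop("not evaluated")))
```

Because save() writes objects under their names, the value of the final unnamed expression has nowhere to go, and the upstream objects x and y, which were never created inside the scratch environment, are not stored either.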
Controlling the random-number generator
Both stew and bake allow one to control the pseudorandom number generator (RNG) by fixing its seed. For example, the last snippet above includes a call to rexp, which simulates a draw from an exponential random variable. Ordinarily, each time rexp is called, it returns a different value.
Consider the following snippet.
stew(file="result3.rda",seed=99,{
  w <- rexp(1)
})
w
[1] 0.1694237
Of course, if we call stew again, we will simply reload the result we just computed:
stew(file="result3.rda",seed=99,{
  w <- rexp(1)
})
w
[1] 0.1694237
However, if we now delete the result file and re-run,
file.remove("result3.rda")
[1] TRUE
stew(file="result3.rda",seed=99,{
  w <- rexp(1)
})
w
[1] 0.1694237
we get the same result. The positive integer we pass to the seed argument sets the state of R’s built-in RNG so that subsequent calls to rexp (or any other random-deviate simulator) will produce the same values. To prevent this from affecting calculations outside the stew call, stew restores the RNG to the state it was in just prior to the stew call.
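The save, set, and restore pattern just described can be written in a few lines of base R. The helper below, with_fixed_seed, is a hypothetical name used for illustration; it is a sketch of the general technique, not the code pomp uses.

```r
# Hypothetical helper (not part of pomp): evaluate an expression under a
# fixed seed, then put the RNG back exactly as it was.
with_fixed_seed <- function(expr, seed) {
  if (!exists(".Random.seed", envir = globalenv())) runif(1)  # initialise RNG
  saved <- get(".Random.seed", envir = globalenv())
  on.exit(assign(".Random.seed", saved, envir = globalenv()))  # restore state
  set.seed(seed)
  force(expr)   # the lazily evaluated argument runs here, after set.seed()
}

# Reference stream: two consecutive draws with no interruption.
set.seed(1)
a <- rexp(1)
b <- rexp(1)

# Same stream, interrupted by seeded evaluations.
set.seed(1)
a2 <- rexp(1)
f1 <- with_fixed_seed(rexp(1), seed = 99)
b2 <- rexp(1)   # continues the outer stream as if nothing had happened
f2 <- with_fixed_seed(rexp(1), seed = 99)
```

Here a2 and b2 reproduce a and b, while f1 and f2 are equal to one another: the seeded draws are repeatable, and the surrounding stream is undisturbed.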
The bake function has exactly the same feature.
Finally, if one wishes to control the RNG in this way, without doing any caching, one can make use of the freeze function provided by pomp. For example, consider the following.
rexp(1)
[1] 1.565154
freeze(rexp(1),seed=34996)
[1] 1.518076
rexp(1)
[1] 0.8314933
freeze(rexp(1),seed=34996)
[1] 1.518076
rexp(1)
[1] 0.06788161
Exercise
Verify the claim that freeze does not affect the RNG outside of its call.
Serialized file formats in R
R has two binary formats for storing general R objects. The .rds format holds a single R object. One reads and writes such files using readRDS and saveRDS, respectively. The .rda format can hold multiple R objects. The load, attach, and save commands allow one to work with such files. See the R help pages for these functions for more information.
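The practical difference between the two formats is whether object names are stored. The following base-R example uses temporary files and illustrative object names to show both round trips.

```r
dir <- tempfile("formats")
dir.create(dir)
rds_file <- file.path(dir, "one.rds")
rda_file <- file.path(dir, "many.rda")

z <- c(5.25, 5.5)
w <- 2

saveRDS(z, rds_file)          # .rds: one object, no name stored
z_copy <- readRDS(rds_file)   # the caller chooses the name on reading

save(z, w, file = rda_file)   # .rda: several objects, stored with their names
scratch <- new.env()
loaded <- load(rda_file, envir = scratch)  # returns the names it restored
```

Loading into a separate environment, as here, avoids silently overwriting objects of the same name in the workspace.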
Attributes stored by bake and stew
bake records the amount of time required for the computation in the file, as an attribute (system.time, visible in the output above) of the stored object. In addition, if the RNG has been fixed (by means of the seed argument), then the value of seed is stored as another attribute.
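Attributes attached to an object travel with it through the .rds format, which is what makes this kind of bookkeeping possible. The base-R sketch below mimics the idea without calling pomp; the attribute name "seed" is an assumption made for illustration, whereas "system.time" matches the name shown in the output earlier.

```r
path <- tempfile(fileext = ".rds")

timing <- system.time(result <- sum(1:10))  # time a (trivial) computation
attr(result, "system.time") <- timing       # attribute name as in the output above
attr(result, "seed") <- 99L                 # assumed attribute name
saveRDS(result, path)

stored <- readRDS(path)
stored_attrs <- names(attributes(stored))   # which attributes came back
plain <- as.numeric(stored)                 # drop the attributes, keep the value
```

Inspecting attributes(stored) shows the metadata, and as.numeric (or as.vector) strips it when only the bare value is wanted.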
Produced in R version 4.4.0 with pomp version 5.9.