Tuning IF2
Produced with R version 4.4.0 and pomp version 5.9.
Number of particles
- The initial parameter swarm, \(\{ \Theta^0_j, j=1,\dots,J\}\), usually consists of \(J\) identical replications of some starting parameter vector.
- \(J\) is set to be sufficient for particle filtering. Because the addition of random perturbations acts to combat particle depletion, it is typically possible to take \(J\) substantially smaller than the value needed to obtain precise likelihood estimates via pfilter. By the last iteration (\(m=M\)), the effective sample size should not be close to 1; a check is sketched below.
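Here is a minimal sketch of such a check in pomp, assuming a pomp object measSIR, a starting parameter vector theta, and rw.sd settings that are all hypothetical placeholders:

```r
library(pomp)
## A minimal sketch, assuming a pomp object 'measSIR' and a starting
## parameter vector 'theta' (both hypothetical):
mf <- mif2(
  measSIR, params = theta,
  Np = 1000, Nmif = 50,
  cooling.fraction.50 = 0.5,
  rw.sd = rw_sd(Beta = 0.02, rho = 0.02, eta = ivp(0.1))
)
## The final mif2 iteration is itself a particle filter, so its
## effective sample sizes can be inspected; none should be near 1:
min(eff_sample_size(mf))
```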
Perturbations
- Perturbations are usually chosen to be multivariate normal, with \(\sigma_m\) being a scale factor for iteration \(m\): \[h_n(\theta|\varphi;\sigma) \sim N[\varphi, \sigma^2_m V_n].\]
- \(V_n\) is usually taken to be diagonal, \[ V_n = \left( \begin{array}{ccccc} v_{1,n}^2 & 0 & 0 & \cdots & 0 \\ 0 & v_{2,n}^2 & 0 & \cdots & 0 \\ 0 & 0 & v_{3,n}^2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & v_{p,n}^2 \end{array}\right).\]
- If \(\theta_i\) is a parameter that affects the dynamics or observations throughout the time series, it is called a regular parameter, and it is often appropriate to specify \[v_{i,n} = v_i.\]
- If \(\theta_j\) is a parameter that affects only the initial conditions of the dynamic model, it is called an initial value parameter (IVP) and it is appropriate to specify \[v_{j,n} = \left\{\begin{array}{ll} v_j & \mbox{if $n=0$} \\0 & \mbox{if $n>0$} \end{array}\right.\]
- If \(\theta_k\) is a break-point parameter that models how the system changes at time \(t_q\), then \(\theta_k\) is like an IVP at time \(t_q\) and it is appropriate to specify \[v_{k,n} = \left\{\begin{array}{ll} v_k & \mbox{if $n=q$} \\ 0 & \mbox{if $n\neq q$} \end{array}\right.\] All three cases map onto pomp's rw_sd() helper, as sketched below.
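The following sketch shows how the three perturbation types are expressed via rw_sd(); the parameter names (Beta and rho regular, eta an IVP, k a break-point parameter) and the break time 1957 are hypothetical:

```r
library(pomp)
## A minimal sketch of the three perturbation types; all names and
## values are hypothetical:
pert <- rw_sd(
  Beta = 0.02,                          # regular parameter: v_{i,n} = v_i for all n
  rho  = 0.02,                          # regular parameter
  eta  = ivp(0.1),                      # IVP: perturbed only at n = 0
  k    = ifelse(time == 1957, 0.1, 0)   # break-point: perturbed only at t_q = 1957
)
## 'pert' would then be supplied to mif2() via its rw.sd argument.
```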
Cooling schedule
- \(\sigma_{1:M}\) is called a cooling schedule, following a thermodynamic analogy popularized by simulated annealing. As \(\sigma_m\) becomes small, the system cools toward a “freezing point”. If the algorithm is working successfully, the freezing point should be close to the lowest-energy state of the system, i.e., the MLE. Typical choices of the cooling schedule are geometric, \(\sigma_m = \alpha^m\), and hyperbolic, \(\sigma_m \propto 1/(1+\alpha\,m)\). In mif2, the cooling schedule is parameterized by \(\sigma_{50}\), the cooling fraction after 50 IF2 iterations; a sketch of the geometric schedule follows.
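To make the parameterization concrete, here is a short sketch of the geometric schedule implied by a given cooling fraction, using the relation \(\sigma_m = \alpha^m\) above with \(\alpha = \sigma_{50}^{1/50}\):

```r
## Geometric cooling: sigma_m = alpha^m, with alpha chosen so that
## sigma_50 equals the specified cooling fraction after 50 iterations,
## i.e., alpha = cooling_fraction_50^(1/50).
cooling_fraction_50 <- 0.5
sigma <- function (m) cooling_fraction_50^(m / 50)
sigma(c(1, 50, 100))   # sigma(50) == cooling_fraction_50
```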
Parameter transformations
- It is generally helpful to transform the parameters so that (on the estimation scale) they are real-valued, unconstrained, and have uncertainty on the order of 1 unit. For example, one typically takes a logarithmic transformation of positive parameters and a logistic transformation of \([0,1]\) valued parameters.
- On such a scale, it is surprisingly often effective to take \(v_i \approx 0.02\) for regular parameters (RPs) and \(v_j \approx 0.1\) for initial value parameters (IVPs); a transformation sketch follows.
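In pomp, such estimation-scale transformations are declared with parameter_trans(); the parameter names below are hypothetical:

```r
library(pomp)
## A minimal sketch: positive parameters on a log scale, a
## [0,1]-valued parameter on a logit (logistic) scale.
part <- parameter_trans(
  log   = c("Beta", "mu_IR"),   # positive parameters (hypothetical names)
  logit = "rho"                 # [0,1]-valued reporting probability
)
## Supplied when constructing the pomp object, e.g.
## pomp(..., partrans = part, paramnames = c("Beta", "mu_IR", "rho")).
```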
Maximizing the likelihood in stages
- Early on in an investigation, one might take \(M=100\) and \(\sigma_M=0.1\). This allows for a relatively broad search of the parameter space.
- As the investigation proceeds, and one finds oneself in the heights of the likelihood surface, one can refine the search. In doing so, it helps to examine diagnostic plots.
- In particular, one typically needs to reduce the magnitude of the perturbations (rw.sd) and perhaps adjust the cooling schedule (cooling.fraction.50) to eke out the last few units of log likelihood; one such refinement stage is sketched below.
- Profile likelihood computations are not only valuable as a way of obtaining confidence intervals. It is often the case that by profiling, one simplifies the task of finding the MLE. Of course, the price that is paid is that a profile calculation requires multiple parallel IF2 computations.
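A hedged sketch of such a refinement stage, assuming mf is a mif2d_pomp object returned by an earlier, broader mif2 search (all names and values hypothetical):

```r
library(pomp)
## Continue the iterated filtering from where the broad search left
## off, with smaller perturbations and a lower cooling fraction:
mf2 <- continue(
  mf, Nmif = 100,
  rw.sd = rw_sd(Beta = 0.005, rho = 0.005, eta = ivp(0.02)),
  cooling.fraction.50 = 0.3
)
plot(mf2)   # diagnostic plots: log likelihood and parameter traces
```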
General remarks
- It is remarkable that useful general advice exists for the choice of algorithmic parameters that should in principle be model- and data-specific. Here is one possible explanation: the precision of interest is often the second significant figure and there are often on the order of 100 observations (10 monthly observations would be too few to fit a mechanistic model; 1000 would be unusual for an epidemiological system).
Exercises
Assessing and improving algorithmic parameters
Develop your own heuristics to try to improve the performance of mif2 in the Consett measles example. Specifically, for a global optimization procedure carried out using random starting values in the specified box, let \(\hat\Theta_{\mathrm{max}}\) be a random Monte Carlo estimate of the resulting MLE, and let \(\hat\theta\) be the true (unknown) MLE. We can define the maximization error in the log likelihood to be \[e = \ell(\hat\theta) - E[\ell(\hat\Theta_{\mathrm{max}})].\] We cannot directly evaluate \(e\), since there is also Monte Carlo error in our evaluation of \(\ell(\theta)\), but we can compute it up to a known precision. Plan some code to estimate \(e\) for a search procedure using a computational effort of \(JM=2\times 10^7\), comparable to that used for each mif2 computation in the global search. Discuss the strengths and weaknesses of this quantification of optimization success. See if you can choose \(J\) and \(M\) subject to this constraint, together with choices of rw.sd and the cooling rate, cooling.fraction.50, to arrive at a quantifiably better procedure. Computationally, you may not be readily able to run your full procedure, but you could run a quicker version of it.
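One possible starting point (not a full solution): estimate \(E[\ell(\hat\Theta_{\mathrm{max}})]\) by replicating the search, and evaluate each resulting \(\ell\) with replicated particle filters so the Monte Carlo error is quantifiable. The helper run_search() and all tuning values below are hypothetical placeholders, and the effort is scaled down from \(JM=2\times 10^7\) as the exercise permits:

```r
library(pomp)
## Hypothetical helper: one global mif2 search with J particles and
## M iterations, returning the final parameter estimate.
## run_search <- function (J, M) { ... }

## Log likelihood at theta, with Monte Carlo standard error, from
## replicated particle filters (assumes the 'measSIR' pomp object):
ell_hat <- function (theta, Np = 5000, reps = 10) {
  ll <- replicate(reps, logLik(pfilter(measSIR, params = theta, Np = Np)))
  logmeanexp(ll, se = TRUE)
}

## Replicate the (scaled-down) search: here JM = 2e5 per search.
ests <- lapply(seq_len(20), function (i) run_search(J = 2000, M = 100))
lls  <- vapply(ests, function (th) ell_hat(th)[1], numeric(1))
## With ell(theta-hat) unknown, the best log likelihood found across
## replications stands in for it, giving an estimate of e:
e_hat <- max(lls) - mean(lls)
```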