------------------------------------- lines 5-369 of file: xrst/theory.xrst ------------------------------------- {xrst_begin theory} {xrst_spell anders argmin kasper kristensen skaug } :math:`\newcommand{\dtheta}[1]{ \frac{\R{d}}{\R{d} \theta_{ #1}} }` Laplace Approximation for Mixed Effects Models ############################################## Reference ********* TMB: Automatic Differentiation and Laplace Approximation, Kasper Kristensen, Anders Nielsen, Casper W. Berg, Hans Skaug, Bradley M. Bell, Journal of Statistical Software 70, 1-21 April 2016. Total Likelihood **************** The reference above defines :math:`f( \theta, u)` to be the negative log-likelihood of the :math:`z`, :math:`y`, :math:`u` and :math:`\theta`; i.e., .. math:: - \log [ \; \B{p} ( y | \theta, u ) \B{p} ( u | \theta ) \; \B{p} ( z | \theta )\B{p} ( \theta ) \; ] Random Likelihood, f(theta, u) ****************************** We use :math:`f( \theta , u )` for the part of the likelihood that depends on the random effects :math:`u`; .. math:: f( \theta, u ) = - \log [ \B{p} ( y | \theta, u ) \B{p} ( u | \theta ) ] Assumption ========== The function :math:`f(\theta, u)` is assumed to be smooth. Furthermore, there are no constraints on the value of :math:`u`. Random Effects Objective ======================== Give a value for the fixed effects :math:`\theta`, the random effects objective is the random likelihood as just a function of the random effects; i.e., :math:`f( \theta , \cdot )`. Fixed Likelihood, g(theta) ************************** We use :math:`g( \theta )` for the part of the likelihood that only depends on the fixed effects :math:`\theta`; .. math:: g( \theta ) = - \log [ \B{p} ( z | \theta ) \B{p} ( \theta ) ] The function :math:`g( \theta )` may not be smooth, to be specific, it can have absolute values in it (corresponding to the Laplace densities). Furthermore, there may be constraints on the value of :math:`\theta`. Optimal Random Effects, u^(theta) ********************************* Given the fixed effects :math:`\theta`, we use :math:`\hat{u} ( \theta )` to denote the random effects that maximize the random likelihood; i.e., .. math:: \hat{u} ( \theta ) = \R{argmin} \; f( \theta, u ) \; \R{w.r.t.} \; u Note that this definition agrees with the other definition for :ref:`u^(theta)` . Objective ********* Laplace Approximation, h(theta, u) ================================== Using the notation above, the Laplace approximation as a function of both the fixed and random effects is .. math:: h( \theta, u ) = + \frac{1}{2} \log \det f_{u,u} ( \theta, u ) + f( \theta, u ) - \frac{n}{2} \log ( 2 \pi ) where :math:`n` is the number of random effects. Laplace Objective, r(theta) =========================== We refer to .. math:: r( \theta ) = h[ \theta , \hat{u} ( \theta ) ] \approx - \log \left[ \int_{-\infty}^{+\infty} \B{p} ( y | \theta, u ) \B{p} ( u | \theta ) \; \B{d} u \right] as the Laplace objective. This corresponds to equation (4) in the :ref:`theory@Reference` . Fixed Effects Objective, L(theta) ================================= The fixed effects objective, as a function of just the fixed effects, is .. math:: L ( \theta ) = r( \theta ) + g( \theta ) Derivative of Optimal Random Effects ************************************ Because :math:`f(\theta, u)` is smooth, and :math:`\hat{u} ( \theta )` is optimal w.r.t :math:`u`, we obtain .. math:: f_u [ \theta , \hat{u} ( \theta ) ] = 0 From this equation, and the implicit function theorem, it follows that .. math:: \hat{u}_\theta ( \theta ) = - f_{u,u} \left[ \theta , \hat{u} ( \theta ) \right]^{-1} f_{u,\theta} \left[ \theta , \hat{u} ( \theta ) \right] Derivative of Random Constraints ******************************** The derivative of the :ref:`random constraint function` is given by .. math:: \partial_\theta [ A \; \hat{u} ( \theta ) ] = A \; \hat{u}_\theta ( \theta ) Derivative of Laplace Objective ******************************* The derivative of the random part of the objective is given by .. math:: r_\theta ( \theta ) = h_\theta [ \theta , \hat{u} ( \theta ) ] + h_u [ \theta , \hat{u} ( \theta ) ] \hat{u}_\theta ( \theta ) Thus the derivative of :math:`r ( \theta )` can be computed using the derivative of :math:`\hat{u} ( \theta )` and the partials of :math:`h( \theta , u )`. Let :math:`\partial_k` denote the partial with respect to the *k*-th component of the combined vector :math:`( \theta , u )`. .. math:: \partial_k [ h( \theta , u ) ] = \partial_k [ f( \theta , u ) ] + \frac{1}{2} \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} f_{u,u} ( \theta , u )_{i,j}^{-1} \partial_k [ f_{u,u} ( \theta , u)_{i,j} ] where :math:`n` is the number of random effects. Note that :math:`f_{u,u} ( \theta , u )` is often sparse and only non-zero components need be included in the summation. This is discussed in more detail near equation (8) in the :ref:`theory@Reference` . We also note that if :math:`k` corresponds to a component of :math:`u` then .. math:: \partial_k ( f[ \theta , \hat{u} ( \theta ) ] ) = 0 Approximate Optimal Random Effects ********************************** First Order, U(beta, theta, u) ============================== We define the function .. math:: U ( \beta , \theta , u ) = u - f_{u,u} ( \theta , u )^{-1} f_u ( \beta , u ) It follows that .. math:: U \left[ \theta , \theta , \hat{u} ( \theta ) \right] = \hat{u} ( \theta ) .. math:: U_{\beta} [ \theta , \theta , \hat{u} ( \theta ) ] = \hat{u}_\theta ( \theta ) Second Order, W(beta, theta, u) =============================== We define the function .. math:: W ( \beta , \theta , u ) = U( \beta , \theta , u ) - f_{u,u} ( \theta , u )^{-1} f_u [ \beta , U( \beta , \theta , u) ] It follows that .. math:: W \left[ \theta , \theta , \hat{u} ( \theta ) \right] = \hat{u} ( \theta ) .. math:: W_{\beta} [ \theta , \theta , \hat{u} ( \theta ) ] = \hat{u}_\theta ( \theta ) and for random effects indices :math:`i`, .. math:: W^i_{\beta \beta} [ \theta , \theta , \hat{u} ( \theta ) ] = \hat{u}^i_{\theta , \theta} ( \theta ) Approximate Laplace Objective, H(beta, theta, u) ************************************************ Given these facts we define .. math:: H( \beta , \theta , u) = + \frac{1}{2} \log \det f_{u,u} [ \beta, W( \beta , \theta , u) ] + f[ \beta, U( \beta , \theta , u) ] - \frac{n}{2} \log ( 2 \pi ) It follow that .. math:: r_{\theta,\theta} ( \theta ) = H_{\beta,\beta} \left[ \theta , \theta , \hat{u} ( \theta ) \right] Approximate Random Constraint Function, B(beta, theta, u) ********************************************************* We also define the approximation :ref:`random constraint function` .. math:: B( \beta , \theta , u) = A \; W( \beta , \theta , u ) Hessian of Laplace Objective **************************** Note that the Hessian of the Laplace objective :math:`r_{\theta,\theta} ( \theta )` is required when :ref:`derived_ctor@quasi_fixed` is false. In this case, the representation .. math:: r_{\theta,\theta} ( \theta ) = H_{\beta,\beta} \left[ \theta , \theta , \hat{u} ( \theta ) \right] is used to compute this Hessian. Hessian of Random Constraints ***************************** In the case where :ref:`derived_ctor@quasi_fixed` is false we need to compute second derivatives of the random constraint function. We use :math:`A^i` ( :math:`B^i`) to denote one of the rows of the random constraint matrix ( approximate random constraint function ). The Hessian of the random constraints can be computed using the formula .. math:: \partial_\theta \partial_\theta [ A^i \; \hat{u} ( \theta ) ] = B^i_{\beta,\beta} \left[ \theta , \theta , \hat{u} ( \theta ) \right] Sparse Observed Information *************************** Suppose that :math:`H` is a sparse positive definite Hessian of a likelihood at the maximum likelihood estimate for its unknown parameters. The corresponding asymptotic covariance for posterior distribution of the parameters is normal with covariance :math:`H^{-1}`. A vector :math:`v` with this covariance can be simulated as .. math:: v = R w where :math:`R` is defined by :math:`H^{-1} = R R^\R{T}` and :math:`w` is a normal with mean zero and the identity covariance. Suppose we have a sparse factorization of the form .. math:: L D L^\R{T} = P H P^\R{T} where :math:`L` is lower triangular, :math:`D` is diagonal, and :math:`P` is a permutation matrix. It follows that .. math:: H = P^\R{T} L D L^\R{T} P .. math:: H^{-1} = P^\R{T} L^{-\R{T}} D^{-1} L^{-1} P .. math:: R = P^\R{T} L^{-\R{T}} D^{-1/2} .. math:: v = P^\R{T} L^{-\R{T}} D^{-1/2} w If :math:`w` is simulated as a normal random vector with mean zero and identity covariance, and :math:`v` is computed using this formula, the mean of :math:`v` is zero and its covariance is given by .. math:: \B{E}[ v v^\R{T} ] = H^{-1} {xrst_end theory}