R语言辅导 | STAT 5361: Statistical Computing Chapter 1 Prerequisites


We assume that students can use R/Rstudio comfortably. If you are new to R, an extra amount of hard work should be devoted to make it up. In addition to the official documentations from the R-project, the presentations at a local SIAM workshop in Spring 2018 given by Wenjie Wang, a PhD student in Statistics at the time, can be enlighting: – https://github.com/wenjie2wang/2018-01-19-siam – https://github.com/wenjie2wang/2018-04-06-siam

If you have used R, but never paid attention to R programming styles, a style clinic would be a necessary step. A good place to start Hadley Wickham’s tidyverse style guide at http://style.tidyverse.org/. From my experience, the two most commonly overlooked styles for beginners are spacing and indentation. Appropriate spacing and indentation would immediately make crowdly piled code much more eye-friendly. Such styles can be automatically enforced by R packages such as formatr or lintr. Two important styles that cannot be automatically corrected are naming and documentation. As in any programming languate, naming of R objects (variables, functions, files, etc.) shoud be informative, concise, and consistent with certain naming convention. Documentation needs to be sufficient, concise, and kept close to the code; tools like R package roxygen2 can be very helpful.

For intermediate R users who want a skill lift, The Advanced R Programming book (Wickham 2019) by Hadley Wickham is available at https://adv-r.hadley.nz/. The source that generated the book is kindly made available at GitHub: https://github.com/hadley/adv-r. It is a great learning experience to complile the book from the source, during which you may pick up many necessary skills. To make R work efficiently, a lot of skills are needed. Even an experienced user would find unified resources of efficient R programming (Gillespie and Lovelace 2016) useful: https://csgillespie.github.io/efficientR/.

The homework, exam, and project will be completed by R Markdown. Following the step by step the instructions in Yihui Xie’s online book on bookdown (Xie 2016).at https://bookdown.org/yihui/bookdown/, you will be amazed how quickly you can learn to produce cool-looking documents and even book manuscripts. If you are a keener, you may as well follow Yihui’s blogdown (Xie, Thomas, and Hill 2017), see the online book https://bookdown.org/yihui/blogdown/, to build your own website using R Markdown.

All your source code will be version controlled by git and archived on GitHub. RStudio has made using git quite straightforward. The online tutorial by Jenny Bryan, Happy Git and GitHub for the useR at https://happygitwithr.com/, is a very useful tool to get started

In mathematical statistics, we have learned that, under certain regularity conditions, the maximum likelihood estimator (MLE) is consistent, asymptotically normal, and most efficient. The asymptotic variance of the estimator if the inverse of the Fisher information matrix. Specifically, let \(X_1, \ldots, X_n\) be a random sample from a distribution with density \(f(x; \theta)\), where \(\theta\) is a parameter. How do we obtain the MLE?

Package MASS provides a function fitdistr() to obtain the MLE for univariate distributions with a random sample. We can learn two things from this function. First, an objective function representing the negative loglikelihood is formed, depending on the input of the density function, and fed to the optimizer function optim. Second, the variance estimator of the MLE is obtained by inverting the Hessian matrix of the objective function, which is an estimator of the Fisher information matrix. For commonly used distributions, starting values are not necessary. The function computes moment estimator and use them as starting values.

For distributions not in the provided list, the densfun needs to be a function returning a density evaluated at its first arguments, in which case, the start argument needs to be a named list to feed to the optimizer. For example, pretend that someone has a fancy distribution which is just a rename of the gamma distribution. Its density function is defined based on the gamma density dgamma.

The stats4 package provides MLE using S4 classes.

Burns (2012) summarizes many traps beginners may fall into. The first one is the floating number trap. Are you surprised by what you see in the following?

How does a computer represent a number? Computers use switches, which has only two states, on and off, to represent numbers. That is, it uses the binary number system. With a finite (though large) number switches on a computer, not all real numbers are representable. Binary numbers are made of bits, each of which represent a switch. A byte is 8 bits, which represents one character.

An integer on many computers has length 4-byte or 32-bit. It is represented by the coefficients \(x_i\) in \[ \sum_{i=1}^{32} x_i 2^{i-1} – 2^{31}, \] where \(x_i\) is either 0 or 1.

A floating point number is represented by \[ (-1)^{x_0}\left(1 + \sum_{i=1}^t x_i 2^{-i}\right) 2^{k} \] for \(x_i \in \{0, 1\}\), where \(k\) is an integer called the exponent, \(x_0\) is the sign bit. The fractional part \(\sum_{i=1}^t x_i 2^{-i}\) is the significand. By convention (for unique representation), the exponent is chosen so that the fist digit of the significand is 1, unless it would result in the exponent being out of range.

A standard double precision representation uses 8 bytes or 64 bits: a sign bit, an 11 bit exponent, and \(t = 52\) bits for the significand. With 11 bits there are \(2^{11} = 2048\) possible values for the exponent, which are usually shifted so that the allowed values range from \(-1022\) to 1023, again with one special sign value.

The IEEE standard for floating point arithmetic calls for basic arithmetic operations to be performed to higher precision, and then rounded to the nearest representable floating point number.

Arithmetic underflow occurs where the result of a calculation is a smaller number than what the computer can actually represent. Arithmetic overflow occurs when a calculation produces a result that is greater in magnitude than what the computer can represent. Here is a demonstration (Gray 2001).

More generally, a floating point number with base \(\beta\) can be represented as \[ (-1)^{x_0}\left(\sum_{i=1}^t x_i \beta^{-i}\right) \beta^{k}, \] where \(x_0 \in \{0, 1\}\)\(x_i \in \{0, 1, \ldots, \beta – 1\}\)\(i = 1, \ldots, t\), and \(E_{\min} \le k \le E_{\max}\) for some \(E_{\min}\) and \(E_{\max}\). For a binary system \(\beta = 2\) while for the decimal system \(\beta = 10\). To avoid multiple representations, a normalized representation is defined by forcing the first significant digit (\(x_1\)) to be non-zero. The numbers \(\{x: |x| < \beta^{E_{\min} – 1}\}\) (called subnormals) can not be normalized and, are sometimes excluded from the system. If an operation results in a subnormal number underflow is said to have occurred. Similarly if an operation results in a number with the exponent \(k > E_{\max}\), overflow is said to have occurred.

Let \(f(x)\) be the floating point representation of \(x\). The relative error of \(f(x)\) is \(\delta_x = [f(x) – x] / x\). The basic formula for error analysis is \[ f(x \circ y) = (1+ \epsilon_m) (x \circ y) \] where \(\circ\) is an operator (e.g., add, subtract, multiple, divide, etc.), \((x \circ y)\) is the exact result, and \(\epsilon_m\) is the smallest positive floating-point number \(z\) such that \(1 + z \ne 1\)Add demonstration from Section 1.3 of Gray (2001); see also [notes] (https://people.eecs.berkeley.edu/~demmel/cs267/lecture21/lecture21.html).

(Section 1.4 from Gray) Consider computing \(x + y\), where \(x\) and \(y\) have the same sign. Let \(\delta_x\) and \(\delta_y\) be the relative error in the floating point representation of \(x\) and \(y\), respectively (e.g., \(\delta_x = [f(x) – x]/x\), so \(f(x) = x(1 + \delta_x)\)). What the computer actually calculates is the sum of the floating point representations \(f (x) + f (y)\), which may not have an exact floating point representation, in which case it is rounded to the nearest representable number \(f(f(x) + f(y))\). Let \(\delta_s\) be the relative error of the \(f(f(x) + f(y))\). Then, \[\begin{align*} \phantom{ = } & |f(f(x) + f(y)) – (x+y)| \\ = & |f(f(x) + f(y)) – f(x) – f(y) + f(x) – x + f(y) – y| \\ \le & |\delta_s[f(x) + f(y)]| + |\delta_x x| + |\delta_y y|\\ \le & |x+y|(\epsilon_m + \epsilon^2_m) + |x+y| \epsilon_m \\ \approx & 2\epsilon_m |x+y|, \end{align*}\] where the higher order terms in \(\epsilon_m\) have been dropped, since they are usually negligible. Thus \(2\epsilon_m\) is an (approximate) bound on the relative error in a single addition of two numbers with the same sign.

Example 1.1 (Computing sample sum) (Section 1.5.1 of Srivistava (2009)) Let \(x_1, \ldots, x_n\) be the observations from a random of size \(n\). The simplest algorithm to compute the sample sum is to start from \(s_0 = 0\) and add one term each time \(s_j = s_{j-1} + x_j\). Each addition may result in a number that is not a floating number, with an error denoted by multiple \((1 + \epsilon)\) to that number.

For example, when \(n = 3\), suppose that \((x_1, x_2, x_3)\) are floating numbers. We have: \[\begin{align*} f(s_1) &= (0 + x_1)(1 + \epsilon_1)\\ f(s_2) &= (f(s_1) + x_2)(1 + \epsilon_2)\\ f(s_3) &= (f(s_2) + x_3)(1 + \epsilon_3) \end{align*}\] It can be seen that \[\begin{align*} f(s_3) = (x_1 + x_2 + x_3) + x_1 (\epsilon_1 + \epsilon_2 + \epsilon_3) + x_2 (\epsilon_2 + \epsilon_3) + x_3(\epsilon_3) + o(\epsilon_m). \end{align*}\] The cumulative error is approximately \[\begin{align*} &\phantom{ = } \| f(s_3) – (x_1 + x_2 + x_3)\| \\ &\sim \| x_1 (\epsilon_1 + \epsilon_2 + \epsilon_3) + x_2 (\epsilon_2 + \epsilon_3) + x_3(\epsilon_3) \|\\ &\le \|x_1\| (\|\epsilon_1\| + \|\epsilon_2\| + \|\epsilon_3\|) + \|x_2\| (\|\epsilon_2\| + \|\epsilon_3\|) + \|x_3\| \|\epsilon_3\| \\ &\le (3 \|x_1\| + 2\|x_2\| + \|x_3\|) \epsilon_m. \end{align*}\] This suggests that better accuracy can be achieved at the expense of sorting in increasing order before computing the sum.

Todo: design a decimal system with limited precision to demonstrate the error analysis

Subtracting one number from another number of similar magnitude can lead to large relative error. This is more easily illustrated using a decimal system (\(\beta = 10\)); see Section~1.4 of Srivastava (2009). For simplicity, consider the case with \(t = 4\). Let \(x = 122.9572\) and \(y = 123.1498\). Their floating point representations are (1230, 3) and (1231, 3). The resulting difference is \(-0.1\) while the actual answer is \(-0.1926\). The idea is the same on a binary system.

Consider an extreme case: \[\begin{align*} f(x) &= 1 / 2 + \sum_{i=2}^{t-1} x_i / 2^i + 1 / 2^t,\\ f(y) &= 1 / 2 + \sum_{i=2}^{t-1} x_i / 2^i + 0 / 2^t. \end{align*}\] The absolute errors of \(f(x)\) and \(f(y)\) are in \((0, 1/2^{t+1})\). The true difference \(x – y\) could be anywhere in \((0, 1 / 2^{t-1})\) The computed subtraction is \(f(x) – f(y) = 1 / 2^t\). So the relative error can be arbitrarily large.

The error from subtraction operation also explains why approximating \(e^{x}\) by partial sums does not work well for negative \(x\) with a large magnitude. Consider an implementation of the exponential function with the Taylor series \(\exp(x) = \sum_{i=0}^{\infty} x^i / i!\).

The accuracy is poor for \(x < 0\). The problem in accuracy occurs because the terms alternate in sign, and some of the terms are much greater than the final answer. It can be fixed by noting that \(\exp(-x) = 1 / \exp(x)\).

(Wiki) In numerical analysis, the condition number of a function with respect to an argument measures how much the output value of the function can change for a small change in the input argument. This is used to measure how sensitive a function is to changes or errors in the input, and how much error in the output results from an error in the input.

The condition number is an application of the derivative, and is formally defined as the value of the asymptotic worst-case relative change in output for a relative change in input. The “function” is the solution of a problem and the “arguments” are the data in the problem. In particular, \(C(x)\) is called the condition number of a function \(f\) at a given point \(x\) if it satisfies the condition \[ |\mbox{relative change in } f(x)| \sim C(x)|\mbox{relative change in } x| . \] The condition number is frequently applied to questions in linear algebra, in which case the derivative is straightforward but the error could be in many different directions, and is thus computed from the geometry of the matrix. More generally, condition numbers can be defined for non-linear functions in several variables.

A problem with a low condition number is said to be well-conditioned, while a problem with a high condition number is said to be ill-conditioned. The condition number is a property of the problem. Paired with the problem are any number of algorithms that can be used to solve the problem, that is, to calculate the solution. Some algorithms have a property called backward stability. In general, a backward stable algorithm can be expected to accurately solve well-conditioned problems. Numerical analysis textbooks give formulas for the condition numbers of problems and identify the backward stable algorithms. In the exponential function example, the first implementation is unstable for large negative \(x\) while the second is stable.

It is important in likelihood maximization to work on the log scale. Use argument log in dpq functions instead of taking logs. This can be important in probalistic networks and MC(MC) where \(P = P_1 · P_2 \cdots · P_n\) quickly underflows to zero.

It is also important to use log1p and expm1 alike when caculating \(\log(1 + x)\) and \(\exp(x)- 1\) when \(|x| \ll 1\). One example is the evaluation of distribution and density function of the generalized extreme value (GEV) distribution. The GEV distribution has distribution function \[\begin{equation*} F(x; \mu, \sigma, \xi) = \begin{cases} \exp[- \left[ 1 + \left(\frac{x – \mu}{\sigma}\right) \xi\right]^{-1 / \xi} & \xi \ne 0, \quad (1 + \xi (x – \mu) / \sigma) > 0,\\ \exp[e^{ -(x – \mu) / \sigma}] & \xi = 0, \end{cases} \end{equation*}\] for \(\mu \in R\)\(\sigma > 0\)\(\xi \in R\).

Two implementations of GEV distribution function when $\xi$ is close to zero.Two implementations of GEV distribution function when $\xi$ is close to zero.

Figure 1.1: Two implementations of GEV distribution function when \(\xi\) is close to zero.

Figure 1.1 shows a comparison of evaluating the GEV distribution function at \(x = 1\) for \(\mu = 0\)\(\sigma = 1\), and a sequence of \(\xi\) values near zero using two implementations.