5 Vector Calculus

We define a function as:

$$ \begin{aligned} f : \mathbb{R}^{D} & \rightarrow \mathbb{R} \\ \boldsymbol{x} & \mapsto f(\boldsymbol{x}) \end{aligned} $$

The first line states that $f$ is a mapping from $\mathbb{R}^D$ to $\mathbb{R}$, and the second line states that a particular vector $\boldsymbol{x}$ is mapped to the function value $f(\boldsymbol{x})$. The function $f$ assigns exactly one function value $f(\boldsymbol{x})$ to every input $\boldsymbol{x}$.


5.1 Differentiation of Univariate Functions


Definition 5.1 (Difference Quotient)

The difference quotient is defined as:

$$ \frac{\delta y}{\delta x} :=\frac{f(x+\delta x)-f(x)}{\delta x} $$

It computes the slope of the secant line through two points on the graph of $f$.

Definition 5.2 (Derivative)

More formally, for $h>0$ the derivative of $f$ at $x$ is defined as:

$$ \frac{\mathrm{d} f}{\mathrm{d} x} :=\lim _{h \rightarrow 0} \frac{f(x+h)-f(x)}{h} $$
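
As a quick numerical illustration (my own addition, not part of the original text), the sketch below approximates the derivative by the difference quotient with a small step $h$ and compares it with a known analytic derivative; the test function and step size are arbitrary choices.

```python
import numpy as np

def difference_quotient(f, x, h=1e-6):
    """Approximate f'(x) by the slope of a secant line with a small step h."""
    return (f(x + h) - f(x)) / h

f = np.sin              # example function (arbitrary choice)
x = 0.7
approx = difference_quotient(f, x)
exact = np.cos(x)       # known analytic derivative of sin
print(approx, exact)    # the two values agree to about 6 decimal places
```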

5.1.1 Taylor Series

Definition 5.3 (Taylor Polynomial)

The Taylor polynomial of degree $n$ of $f : \mathbb{R} \rightarrow \mathbb{R}$ at $x_0$ is defined as:

$$ T_{n}(x) :=\sum_{k=0}^{n} \frac{f^{(k)}\left(x_{0}\right)}{k !}\left(x-x_{0}\right)^{k} $$

where $f^{(k)}\left(x_{0}\right)$ is the $k$-th derivative of $f$ at $x_0$ (assumed to exist), and $\frac{f^{(k)}\left(x_{0}\right)}{k !}$ are the coefficients of the polynomial.
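
To make the definition concrete, here is a minimal sketch (my own addition) that evaluates a Taylor polynomial given the derivatives $f^{(k)}(x_0)$; the example function $\sin(x)$, expanded at $x_0 = 0$, is an arbitrary choice.

```python
import math

def taylor_polynomial(derivs_at_x0, x0, x):
    """Evaluate T_n(x) = sum_k f^(k)(x0) / k! * (x - x0)^k.

    derivs_at_x0[k] holds the k-th derivative of f at x0 (k = 0 is f(x0) itself).
    """
    return sum(dk / math.factorial(k) * (x - x0) ** k
               for k, dk in enumerate(derivs_at_x0))

# Example: f(x) = sin(x) expanded at x0 = 0; its derivatives at 0 cycle through 0, 1, 0, -1.
x0 = 0.0
derivs = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]    # up to order n = 7
print(taylor_polynomial(derivs, x0, 1.0))              # ~0.841468, close to sin(1) = 0.841471
```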

Definition 5.4 (Taylor Series)

For a smooth function $f \in \mathcal{C}^{\infty}$, $f : \mathbb{R} \rightarrow \mathbb{R}$, the Taylor series of $f$ at $x_0$ is defined as:

$$ T_{\infty}(x)=\sum_{k=0}^{\infty} \frac{f^{(k)}\left(x_{0}\right)}{k !}\left(x-x_{0}\right)^{k} $$

For $x_0=0$ we obtain the Maclaurin series. If $f(x)=T_{\infty}(x)$, then $f$ is called an analytic function.

5.1.2 Differentiation Rules

  • Product rule:

$$ (f(x) g(x))^{\prime}=f^{\prime}(x) g(x)+f(x) g^{\prime}(x) $$

  • Quotient rule:

$$ \left(\frac{f(x)}{g(x)}\right)^{\prime}=\frac{f^{\prime}(x) g(x)-f(x) g^{\prime}(x)}{(g(x))^{2}} $$

  • Sum rule:

$$ (f(x)+g(x))^{\prime}=f^{\prime}(x)+g^{\prime}(x) $$

  • Chain rule:

$$ (g(f(x)))^{\prime}=(g \circ f)^{\prime}(x)=g^{\prime}(f(x)) f^{\prime}(x) $$

5.2 Partial Differentiation and Gradients

Definition 5.5 (Partial Derivative and Gradient)

For a function $f : \mathbb{R}^{n} \rightarrow \mathbb{R}$, $\boldsymbol{x} \mapsto f(\boldsymbol{x})$, $\boldsymbol{x} \in \mathbb{R}^{n}$, we define the partial derivatives as:

$$ \begin{aligned} \frac{\partial f}{\partial x_{1}} &=\lim _{h \rightarrow 0} \frac{f\left(x_{1}+h, x_{2}, \ldots, x_{n}\right)-f(\boldsymbol{x})}{h} \\ & \vdots \\ \frac{\partial f}{\partial x_{n}} &=\lim _{h \rightarrow 0} \frac{f\left(x_{1}, \ldots, x_{n-1}, x_{n}+h\right)-f(\boldsymbol{x})}{h} \end{aligned} $$

and collect these partial derivatives in a row vector, called the gradient:

$$ \nabla_{x} f=\operatorname{grad} f=\frac{\mathrm{d} f}{\mathrm{d} \boldsymbol{x}}=\left[\frac{\partial f(\boldsymbol{x})}{\partial x_{1}} \quad \frac{\partial f(\boldsymbol{x})}{\partial x_{2}} \quad \cdots \quad \frac{\partial f(\boldsymbol{x})}{\partial x_{n}}\right] \in \mathbb{R}^{1 \times n} $$
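
A minimal sketch (my own illustration) that assembles the partial derivatives into a $1 \times n$ row vector using finite differences; the test function is an arbitrary choice.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f: R^n -> R at x as a 1 x n row vector."""
    grad = np.zeros((1, x.size))
    for i in range(x.size):
        x_plus = x.copy()
        x_plus[i] += h
        grad[0, i] = (f(x_plus) - f(x)) / h
    return grad

f = lambda x: x[0] ** 2 + x[0] * x[1]     # example: f(x1, x2) = x1^2 + x1 * x2
x = np.array([1.0, 2.0])
print(numerical_gradient(f, x))           # ~[[4., 1.]], matching [2 x1 + x2, x1]
```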

5.2.1 Basic Rules of Partial Differentiation

  • Product rule:

$$ \frac{\partial}{\partial \boldsymbol{x}}(f(\boldsymbol{x}) g(\boldsymbol{x}))=\frac{\partial f}{\partial \boldsymbol{x}} g(\boldsymbol{x})+f(\boldsymbol{x}) \frac{\partial g}{\partial \boldsymbol{x}} $$

  • Sum rule:

$$ \frac{\partial}{\partial \boldsymbol{x}}(f(\boldsymbol{x})+g(\boldsymbol{x}))=\frac{\partial f}{\partial \boldsymbol{x}}+\frac{\partial g}{\partial \boldsymbol{x}} $$

  • Chain rule:

$$ \frac{\partial}{\partial \boldsymbol{x}}(g \circ f)(\boldsymbol{x})=\frac{\partial}{\partial \boldsymbol{x}}(g(f(\boldsymbol{x})))=\frac{\partial g}{\partial f} \frac{\partial f}{\partial \boldsymbol{x}} $$

5.2.2 Chain Rule

Consider a function $f : \mathbb{R}^{2} \rightarrow \mathbb{R}$ of two variables $x_1, x_2$, where $x_1(t)$ and $x_2(t)$ are themselves functions of $t$. The gradient of $f$ with respect to $t$ is:

$$ \frac{\mathrm{d} f}{\mathrm{d} t}=\left[\begin{array}{cc}{\frac{\partial f}{\partial x_{1}}} & {\frac{\partial f}{\partial x_{2}}}\end{array}\right]\left[\begin{array}{c}{\frac{\partial x_{1}(t)}{\partial t}} \\ {\frac{\partial x_{2}(t)}{\partial t}}\end{array}\right]=\frac{\partial f}{\partial x_{1}} \frac{\partial x_{1}}{\partial t}+\frac{\partial f}{\partial x_{2}} \frac{\partial x_{2}}{\partial t} $$

Now consider $f : \mathbb{R}^{2} \rightarrow \mathbb{R}$, where $x_{1}(s, t)$ and $x_{2}(s, t)$ are functions of two variables $s$ and $t$. The partial derivatives are

$$ \begin{aligned}\frac{\partial f}{\partial s}&=\frac{\partial f}{\partial x_{1}} \frac{\partial x_{1}}{\partial s}+\frac{\partial f}{\partial x_{2}} \frac{\partial x_{2}}{\partial s} \\ \frac{\partial f}{\partial t}&=\frac{\partial f}{\partial x_{1}} \frac{\partial x_{1}}{\partial t}+\frac{\partial f}{\partial x_{2}} \frac{\partial x_{2}}{\partial t}\end{aligned} $$

and the gradient is obtained by matrix multiplication:

$$ \frac{\mathrm{d} f}{\mathrm{d}(s, t)}=\frac{\partial f}{\partial \boldsymbol{x}} \frac{\partial \boldsymbol{x}}{\partial(s, t)}=\underbrace{\left[\frac{\partial f}{\partial x_{1}} \frac{\partial f}{\partial x_{2}}\right]}_{=\frac{\partial f}{\partial \boldsymbol{x}}}\underbrace{\left[\begin{array}{cc}{\frac{\partial x_{1}}{\partial s}} & {\frac{\partial x_{1}}{\partial t}} \\ {\frac{\partial x_{2}}{\partial s}} & {\frac{\partial x_{2}}{\partial t}}\end{array}\right]}_{=\frac{\partial \boldsymbol{x}}{\partial(s, t)}} $$

Writing the chain rule as a matrix multiplication only makes sense if the gradient is defined as a row vector.

Gradient checking:

The partial derivatives can be used to check the correctness of gradients computed by a program. We choose a small value of $h$, e.g., $h=10^{-4}$, and use the formulas

$$ \begin{aligned} \frac{\partial f}{\partial x_{1}} &=\lim _{h \rightarrow 0} \frac{f\left(x_{1}+h, x_{2}, \ldots, x_{n}\right)-f(\boldsymbol{x})}{h} \\ \vdots & \\ \frac{\partial f}{\partial x_{n}} &=\lim _{h \rightarrow 0} \frac{f\left(x_{1}, \ldots, x_{n-1}, x_{n}+h\right)-f(\boldsymbol{x})}{h} \end{aligned} $$

to compute finite-difference estimates of the gradient, which we compare with the gradient produced by the program. If the difference between the two is sufficiently small, e.g., $\sqrt{\frac{\sum_{i}\left(d h_{i}-d f_{i}\right)^{2}}{\sum_{i}\left(d h_{i}+d f_{i}\right)^{2}}}<10^{-6}$, the gradient computation is considered correct.
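
The sketch below implements this check: a finite-difference estimate $dh$ is compared against an analytically computed gradient $df$ using the relative error above. Note that, for better accuracy, a central difference is used instead of the one-sided formula; the quadratic test function is an assumption for illustration.

```python
import numpy as np

def finite_difference_grad(f, x, h=1e-4):
    """Finite-difference estimate of the gradient of f: R^n -> R at x."""
    dh = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        dh[i] = (f(x + e) - f(x - e)) / (2 * h)   # central difference for accuracy
    return dh

def gradient_check(f, analytic_grad, x, tol=1e-6):
    """Compare an analytic gradient with the finite-difference estimate."""
    dh = finite_difference_grad(f, x)
    df = analytic_grad(x)
    rel_error = np.sqrt(np.sum((dh - df) ** 2) / np.sum((dh + df) ** 2))
    return rel_error < tol

# Example: f(x) = x^T x, whose gradient (written as a plain 1-D array) is 2 x.
f = lambda x: x @ x
grad = lambda x: 2 * x
print(gradient_check(f, grad, np.array([1.0, -2.0, 3.0])))   # True
```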

5.3 Gradients of Vector-Valued Functions

Next, we generalize the notion of the gradient to vector-valued functions $\boldsymbol{f}: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$, where $n\geq 1$ and $m>1$.

For a function $\boldsymbol{f} : \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$ and a vector $\boldsymbol{x}=\left[x_{1}, \dots, x_{n}\right]^{\top} \in \mathbb{R}^{n}$, the corresponding vector of function values is:

$$ \boldsymbol{f}(\boldsymbol{x})=\left[\begin{array}{c}{f_{1}(\boldsymbol{x})} \\ {\vdots} \\ {f_{m}(\boldsymbol{x})}\end{array}\right] \in \mathbb{R}^{m} $$

Therefore, the partial derivative of $\boldsymbol{f} : \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$ with respect to $x_{i} \in \mathbb{R}$, $i=1, \ldots, n$, is:

$$ \frac{\partial \boldsymbol{f}}{\partial x_{i}}=\left[\begin{array}{c}{\frac{\partial f_{1}}{\partial x_{i}}} \\ {\vdots} \\ {\frac{\partial f_{m}}{\partial x_{i}}}\end{array}\right]=\left[\begin{array}{c}{\lim _{h \rightarrow 0} \frac{f_{1}\left(x_{1}, \ldots, x_{i-1}, x_{i}+h, x_{i+1}, \ldots x_{n}\right)-f_{1}(\boldsymbol{x})}{h}} \\ {\vdots} \\ {\lim _{h \rightarrow 0} \frac{f_{m}\left(x_{1}, \ldots, x_{i-1}, x_{i}+h, x_{i+1}, \ldots x_{n}\right)-f_{m}(\boldsymbol{x})}{h}}\end{array}\right] \in \mathbb{R}^{m} $$

Since the gradient is a row vector, the gradient of $\boldsymbol{f} : \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$ with respect to $\boldsymbol{x}\in \mathbb{R}^{n}$ is:

$$ \frac{\mathrm{d} \boldsymbol{f}(\boldsymbol{x})}{\mathrm{d} \boldsymbol{x}}=\left[\frac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial x_{1}} \quad \cdots \quad \frac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial x_{n}}\right]=\left[\begin{array}{ccc}{\frac{\partial f_{1}(\boldsymbol{x})}{\partial x_{1}}} & \cdots & {\frac{\partial f_{1}(\boldsymbol{x})}{\partial x_{n}}} \\ {\vdots} & {} & {\vdots} \\ {\frac{\partial f_{m}(\boldsymbol{x})}{\partial x_{1}}} &\cdots & {\frac{\partial f_{m}(\boldsymbol{x})}{\partial x_{n}}}\end{array}\right]\in \mathbb{R}^{m \times n} $$

Definition 5.6 (Jacobian Matrix)

The collection of all first-order partial derivatives of a vector-valued function $\boldsymbol{f} : \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}$ is called the Jacobian matrix. The Jacobian $\boldsymbol{J}$ is an $m \times n$ matrix defined as:

$$ \boldsymbol{J}=\nabla_{\boldsymbol{x}} \boldsymbol{f}=\frac{\mathrm{d} \boldsymbol{f}(\boldsymbol{x})}{\mathrm{d} \boldsymbol{x}}=\left[\frac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial x_{1}} \ldots \frac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial x_{n}}\right]=\left[\begin{array}{ccc}{\frac{\partial f_{1}(\boldsymbol{x})}{\partial x_{1}}} & {\cdots} & {\frac{\partial f_{1}(\boldsymbol{x})}{\partial x_{n}}} \\ {\vdots} & {} & {\vdots} \\ {\frac{\partial f_{m}(\boldsymbol{x})}{\partial x_{1}}} & {\cdots} & {\frac{\partial f_{m}(\boldsymbol{x})}{\partial x_{n}}}\end{array}\right] $$

where $\boldsymbol{x}=\left[\begin{array}{c}{x_{1}} \\ {\vdots} \\ {x_{n}}\end{array}\right]$, and $J(i, j)=\frac{\partial f_{i}}{\partial x_{j}}$.
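
A numerical Jacobian can be assembled column by column from the partial-derivative definition; this is a small sketch of mine with an arbitrary test function.

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """Approximate the m x n Jacobian of f: R^n -> R^m at x by finite differences."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        x_plus = x.copy()
        x_plus[j] += h
        J[:, j] = (f(x_plus) - fx) / h     # j-th column: partial derivative w.r.t. x_j
    return J

# Example: f(x1, x2) = [x1 * x2, sin(x1)], so J = [[x2, x1], [cos(x1), 0]].
f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
print(numerical_jacobian(f, np.array([1.0, 2.0])))
```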

(Figure: the unit square spanned by $\boldsymbol{b}_1, \boldsymbol{b}_2$ is mapped onto the parallelogram spanned by $\boldsymbol{c}_1, \boldsymbol{c}_2$.)

As illustrated in the figure above, the determinant of the Jacobian can be used to compute the scaling factor between the areas of the two regions.

For $\boldsymbol{b}_{1}=[1,0]^{\top}$, $\boldsymbol{b}_{2}=[0,1]^{\top}$, $\boldsymbol{c}_{1}=[-2,1]^{\top}$, $\boldsymbol{c}_{2}=[1,1]^{\top}$, the absolute value of the determinant of the change-of-basis matrix is:

$$ \left|\operatorname{det}\left(\left[\begin{array}{cc}{-2} & {1} \\ {1} & {1}\end{array}\right]\right)\right|=|-3|=3 $$

From another point of view, consider the mapping:

$$ \begin{array}{l}{y_{1}=-2 x_{1}+x_{2}} \\ {y_{2}=x_{1}+x_{2}}\end{array} $$

$$ \frac{\partial y_{1}}{\partial x_{1}}=-2, \quad \frac{\partial y_{1}}{\partial x_{2}}=1, \quad \frac{\partial y_{2}}{\partial x_{1}}=1, \quad \frac{\partial y_{2}}{\partial x_{2}}=1 $$

so the Jacobian is again

$$ \boldsymbol{J}=\left[\begin{array}{cc}{\frac{\partial y_{1}}{\partial x_{1}}} & {\frac{\partial y_{1}}{\partial x_{2}}} \\ {\frac{\partial y_{2}}{\partial x_{1}}} & {\frac{\partial y_{2}}{\partial x_{2}}}\end{array}\right]=\left[\begin{array}{cc}{-2} & {1} \\ {1} & {1}\end{array}\right] $$
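
The scaling factor can be checked numerically: the sketch below (my own addition) maps the unit square spanned by $\boldsymbol{b}_1, \boldsymbol{b}_2$ through the linear map with matrix $\boldsymbol{J}$ and confirms that the image parallelogram has area $|\operatorname{det}(\boldsymbol{J})| = 3$.

```python
import numpy as np

# The linear mapping y = J x from the example above.
J = np.array([[-2.0, 1.0],
              [ 1.0, 1.0]])

b1, b2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])        # unit square (area 1)
c1, c2 = J @ b1, J @ b2                                    # images of the basis vectors

area_scaling = abs(np.linalg.det(J))                       # |det(J)|
parallelogram_area = abs(c1[0] * c2[1] - c1[1] * c2[0])    # area spanned by c1 and c2

print(c1, c2)                                              # [-2. 1.] and [1. 1.]
print(area_scaling, parallelogram_area)                    # both equal 3.0
```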

  • For a function $f : \mathbb{R} \rightarrow \mathbb{R}$, the gradient is a scalar;
  • For a function $f : \mathbb{R}^{D} \rightarrow \mathbb{R}$, the gradient is a $1 \times D$ row vector;
  • For a function $\boldsymbol{f} : \mathbb{R} \rightarrow \mathbb{R}^{E}$, the gradient is an $E \times 1$ column vector;
  • For a function $\boldsymbol{f} : \mathbb{R}^{D} \rightarrow \mathbb{R}^{E}$, the gradient is an $E \times D$ matrix.

Example 1: Compute the gradient $\mathrm{d} \boldsymbol{f} / \mathrm{d} \boldsymbol{x}$ of $\boldsymbol{f}(\boldsymbol{x})=\boldsymbol{A} \boldsymbol{x}$, where $\boldsymbol{f}(\boldsymbol{x}) \in \mathbb{R}^{M}$, $\boldsymbol{A} \in \mathbb{R}^{M \times N}$, $\boldsymbol{x} \in \mathbb{R}^{N}$.

We first determine the dimensions: since $\boldsymbol{f} : \mathbb{R}^{N} \rightarrow \mathbb{R}^{M}$, we have $\mathrm{d} \boldsymbol{f} / \mathrm{d} \boldsymbol{x} \in \mathbb{R}^{M \times N}$.

$$ f_{i}(\boldsymbol{x})=\sum_{j=1}^{N} A_{i j} x_{j} \Longrightarrow \frac{\partial f_{i}}{\partial x_{j}}=A_{i j} \\ \frac{\mathrm{d} \boldsymbol{f}}{\mathrm{d} \boldsymbol{x}}=\left[\begin{array}{ccc}{\frac{\partial f_{1}}{\partial x_{1}}} & {\cdots} & {\frac{\partial f_{1}}{\partial x_{N}}} \\ {\vdots} & {} & {\vdots} \\ {\frac{\partial f_{M}}{\partial x_{1}}} & {\cdots} & {\frac{\partial f_{M}}{\partial x_{N}}}\end{array}\right]=\left[\begin{array}{ccc}{A_{11}} & {\cdots} & {A_{1 N}} \\ {\vdots} & {} & {\vdots} \\ {A_{M 1}} & {\cdots} & {A_{M N}}\end{array}\right]=\boldsymbol{A} \in \mathbb{R}^{M \times N} $$
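
A quick finite-difference check (my own sketch; the `numerical_jacobian` helper mirrors the one shown earlier) confirms that the Jacobian of $\boldsymbol{f}(\boldsymbol{x}) = \boldsymbol{A}\boldsymbol{x}$ is $\boldsymbol{A}$ itself.

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        x_plus = x.copy()
        x_plus[j] += h
        J[:, j] = (f(x_plus) - fx) / h
    return J

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))          # M = 3, N = 4
x = rng.standard_normal(4)

J = numerical_jacobian(lambda v: A @ v, x)
print(np.allclose(J, A, atol=1e-5))      # True: d(Ax)/dx = A
```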

Example 2 (Chain Rule): Consider $h : \mathbb{R} \rightarrow \mathbb{R}$, $h(t)=(f \circ g)(t)$ with $f : \mathbb{R}^{2} \rightarrow \mathbb{R}$, $g : \mathbb{R} \rightarrow \mathbb{R}^{2}$, $f(\boldsymbol{x})=\exp \left(x_{1} x_{2}^{2}\right)$, $\boldsymbol{x}=\left[\begin{array}{l}{x_{1}} \\ {x_{2}}\end{array}\right]=g(t)=\left[\begin{array}{c}{t \cos t} \\ {t \sin t}\end{array}\right]$. Compute the gradient of $h$ with respect to $t$.

Since $f : \mathbb{R}^{2} \rightarrow \mathbb{R}$ and $g : \mathbb{R} \rightarrow \mathbb{R}^{2}$, we have

$$ \frac{\partial f}{\partial \boldsymbol{x}} \in \mathbb{R}^{1 \times 2}, \quad \frac{\partial g}{\partial t} \in \mathbb{R}^{2 \times 1} $$

Applying the chain rule yields the desired gradient:

$$ \begin{aligned} \frac{\mathrm{d} h}{\mathrm{d} t} &=\frac{\partial f}{\partial \boldsymbol{x}} \frac{\partial \boldsymbol{x}}{\partial t}=\left[\frac{\partial f}{\partial x_{1}} \quad \frac{\partial f}{\partial x_{2}}\right]\left[\begin{array}{c}{\frac{\partial x_{1}}{\partial t}} \\ {\frac{\partial x_{2}}{\partial t}}\end{array}\right] \\ &=\left[\exp \left(x_{1} x_{2}^{2}\right) x_{2}^{2} \quad 2 \exp \left(x_{1} x_{2}^{2}\right) x_{1} x_{2}\right]\left[\begin{array}{c}{\cos t-t \sin t} \\ {\sin t+t \cos t}\end{array}\right] \\ &=\exp \left(x_{1} x_{2}^{2}\right)\left(x_{2}^{2}(\cos t-t \sin t)+2 x_{1} x_{2}(\sin t+t \cos t)\right) \end{aligned} $$
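
The analytic result can be verified numerically; the sketch below (my own addition) evaluates the chain-rule expression and compares it with a central-difference estimate of $\mathrm{d}h/\mathrm{d}t$ at an arbitrarily chosen $t$.

```python
import numpy as np

def f(x):
    return np.exp(x[0] * x[1] ** 2)

def g(t):
    return np.array([t * np.cos(t), t * np.sin(t)])

def dh_dt_analytic(t):
    x1, x2 = g(t)
    df_dx = np.array([np.exp(x1 * x2 ** 2) * x2 ** 2,        # row vector df/dx
                      2 * np.exp(x1 * x2 ** 2) * x1 * x2])
    dx_dt = np.array([np.cos(t) - t * np.sin(t),             # column vector dx/dt
                      np.sin(t) + t * np.cos(t)])
    return df_dx @ dx_dt                                     # chain rule: (1 x 2)(2 x 1)

t = 0.5
h = 1e-6
numeric = (f(g(t + h)) - f(g(t - h))) / (2 * h)              # central difference in t
print(dh_dt_analytic(t), numeric)                            # the two values agree
```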

Example 3: Gradient of the Least-Squares Loss in a Linear Model

Consider the linear model

$$ \boldsymbol{y}=\mathbf{\Phi} \boldsymbol{\theta} $$

where $\boldsymbol{\theta} \in \mathbb{R}^{D}$ is a parameter vector, $\boldsymbol{\Phi} \in \mathbb{R}^{N \times D}$ are the input features, and $\boldsymbol{y} \in \mathbb{R}^{N}$ are the corresponding observations. We define the functions

$$ \begin{array}{l}{L(\boldsymbol{e}) :=\|\boldsymbol{e}\|^{2}} \\ {\boldsymbol{e}(\boldsymbol{\theta}) :=\boldsymbol{y}-\mathbf{\Phi} \boldsymbol{\theta}}\end{array} $$

$L$ is called the least-squares loss, and we want to compute $\frac{\partial L}{\partial \boldsymbol{\theta}}$.

We first determine the dimensions:

$$ \frac{\partial L}{\partial \boldsymbol{\theta}} \in \mathbb{R}^{1 \times D} $$

By the chain rule:

$$ \frac{\partial L}{\partial \boldsymbol{\theta}}=\frac{\partial L}{\partial \boldsymbol{e}} \frac{\partial \boldsymbol{e}}{\partial \boldsymbol{\theta}} $$

whose $d$-th element is

$$ \frac{\partial L}{\partial \boldsymbol{\theta}}[1, d]=\sum_{n=1}^{N} \frac{\partial L}{\partial \boldsymbol{e}}[n] \frac{\partial \boldsymbol{e}}{\partial \boldsymbol{\theta}}[n, d] $$

Since $\|\boldsymbol{e}\|^{2}=\boldsymbol{e}^{\top} \boldsymbol{e}$, we obtain

$$ \frac{\partial L}{\partial \boldsymbol{e}}=2 \boldsymbol{e}^{\top} \in \mathbb{R}^{1 \times N} $$

$$ \frac{\partial \boldsymbol{e}}{\partial \boldsymbol{\theta}}=-\boldsymbol{\Phi} \in \mathbb{R}^{N \times D} $$

Therefore

$$ \frac{\partial L}{\partial \boldsymbol{\theta}}=-2 \boldsymbol{e}^{\top} \boldsymbol{\Phi} =-\underbrace{2\left(\boldsymbol{y}^{\top}-\boldsymbol{\theta}^{\top} \mathbf{\Phi}^{\top}\right)}_{1 \times N} \underbrace{\boldsymbol{\Phi}}_{N \times D} \in \mathbb{R}^{1 \times D} $$
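
As a sanity check (my own sketch with randomly generated $\boldsymbol{\Phi}$, $\boldsymbol{y}$, $\boldsymbol{\theta}$), the analytic gradient $-2\boldsymbol{e}^\top\boldsymbol{\Phi}$ can be compared with a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 5, 3
Phi = rng.standard_normal((N, D))        # input features
y = rng.standard_normal(N)               # observations
theta = rng.standard_normal(D)           # parameters

def L(theta):
    e = y - Phi @ theta
    return e @ e                         # squared-error loss ||e||^2

# Analytic gradient dL/dtheta = -2 e^T Phi (a 1 x D row vector, here a plain array).
e = y - Phi @ theta
grad_analytic = -2 * e @ Phi

# Finite-difference check with central differences.
h = 1e-6
grad_numeric = np.array([(L(theta + h * np.eye(D)[d]) - L(theta - h * np.eye(D)[d])) / (2 * h)
                         for d in range(D)])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))   # True
```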

5.4 Gradients of Matrices

(Figure: visualization of computing the gradient of a matrix with respect to a vector.)

Example 1: Gradient of a Vector with Respect to a Matrix

Let $\boldsymbol{f}=\boldsymbol{A} \boldsymbol{x}$ with $\boldsymbol{f} \in \mathbb{R}^{M}$, $\boldsymbol{A} \in \mathbb{R}^{M \times N}$, $\boldsymbol{x} \in \mathbb{R}^{N}$. Compute $\mathrm{d} \boldsymbol{f} / \mathrm{d} \boldsymbol{A}$.

We first determine the dimensions:

$$ \frac{\mathrm{d} \boldsymbol{f}}{\mathrm{d} \boldsymbol{A}} \in \mathbb{R}^{M \times(M \times N)} $$

By the definition of the gradient, we have

$$ \frac{\mathrm{d} \boldsymbol{f}}{\mathrm{d} \boldsymbol{A}}=\left[\begin{array}{c}{\frac{\partial f_{1}}{\partial \boldsymbol{A}}} \\ {\vdots} \\ {\frac{\partial f_{M}}{\partial \boldsymbol{A}}}\end{array}\right], \quad \frac{\partial f_{i}}{\partial \boldsymbol{A}} \in \mathbb{R}^{1 \times(M \times N)} $$

From $f_{i}=\sum_{j=1}^{N} A_{i j} x_{j}$, $i=1, \ldots, M$, we obtain

$$ \frac{\partial f_{i}}{\partial A_{i q}}=x_{q} $$

Therefore, the gradient of $f_i$ with respect to a row of $\boldsymbol{A}$ is

$$ \begin{aligned} \frac{\partial f_{i}}{\partial A_{i, :}}&=\boldsymbol{x}^{\top} \in \mathbb{R}^{1 \times 1 \times N} \\ \frac{\partial f_{i}}{\partial A_{k \neq i}}&=\mathbf{0}^{\top} \in \mathbb{R}^{1 \times 1 \times N} \end{aligned} $$

and the gradient of $f_i$ with respect to the whole matrix $\boldsymbol{A}$ is

$$ \frac{\partial f_{i}}{\partial \boldsymbol{A}}=\left[\begin{array}{c}{\mathbf{0}^{\top}} \\ {\vdots} \\ {\mathbf{0}^{\top}} \\ {\boldsymbol{x}^{\top}} \\ {\mathbf{0}^{\top}} \\ {\vdots} \\ {\mathbf{0}^{\top}}\end{array}\right] \in \mathbb{R}^{1 \times(M \times N)} $$
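
The block structure ($\boldsymbol{x}^\top$ in the $i$-th row, zeros elsewhere) can be verified numerically. In the sketch below (my own illustration), the gradient is stored as an $M \times M \times N$ array so that `df_dA[i]` holds $\partial f_i / \partial \boldsymbol{A}$.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 3, 4
A = rng.standard_normal((M, N))
x = rng.standard_normal(N)

# Finite-difference gradient of f = A x with respect to A,
# stored as an (M, M, N) array so that df_dA[i] is the M x N block d f_i / d A.
h = 1e-6
df_dA = np.zeros((M, M, N))
for k in range(M):
    for q in range(N):
        A_plus = A.copy()
        A_plus[k, q] += h
        df_dA[:, k, q] = (A_plus @ x - A @ x) / h

# d f_i / d A has x^T in its i-th row and zeros in every other row (shown here for i = 0).
print(np.allclose(df_dA[0][0], x, atol=1e-5))      # row 0 equals x^T
print(np.allclose(df_dA[0][1:], 0.0, atol=1e-5))   # all other rows are zero
```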

Example 2: Gradient of a Matrix with Respect to a Matrix

Let $\boldsymbol{f}(\boldsymbol{R})=\boldsymbol{R}^{\top} \boldsymbol{R}=: \boldsymbol{K} \in \mathbb{R}^{N \times N}$ with $\boldsymbol{R} \in \mathbb{R}^{M \times N}$, so that $\boldsymbol{f} : \mathbb{R}^{M \times N} \rightarrow \mathbb{R}^{N \times N}$. Compute $\mathrm{d} \boldsymbol{K} / \mathrm{d} \boldsymbol{R}$.

We first determine the dimensions:

$$ \frac{\mathrm{d} \boldsymbol{K}}{\mathrm{d} \boldsymbol{R}} \in \mathbb{R}^{(N \times N) \times(M \times N)},\frac{\mathrm{d} K_{p q}}{\mathrm{d} \boldsymbol{R}} \in \mathbb{R}^{1 \times M \times N} $$

Denoting the $i$-th column of $\boldsymbol{R}$ by $\boldsymbol{r}_{i}$, we have

$$ K_{p q}=\boldsymbol{r}_{p}^{\top} \boldsymbol{r}_{q}=\sum_{m=1}^{M} R_{m p} R_{m q} $$

Therefore

$$ \begin{aligned} \frac{\partial K_{p q}}{\partial R_{i j}}&=\sum_{m=1}^{M} \frac{\partial}{\partial R_{i j}} R_{m p} R_{m q}=\partial_{p q i j} \\ \partial_{p q i j}&=\left\{\begin{array}{ll}{R_{i q}} & {\text { if } j=p, p \neq q} \\ {R_{i p}} & {\text { if } j=q, p \neq q} \\ {2 R_{i q}} & {\text { if } j=p, p=q} \\ {0} & {\text { otherwise }}\end{array}\right. \end{aligned} $$
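
The case distinction can be verified against finite differences; the sketch below (my own addition, with a randomly chosen $\boldsymbol{R}$) compares the analytic cases with numerical estimates of $\partial K_{pq} / \partial R_{ij}$ for all index combinations.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 4, 3
R = rng.standard_normal((M, N))

def dK_dR_analytic(R, p, q, i, j):
    """The case distinction for dK_pq / dR_ij derived above."""
    if j == p and p != q:
        return R[i, q]
    if j == q and p != q:
        return R[i, p]
    if j == p and p == q:
        return 2 * R[i, q]
    return 0.0

def dK_dR_numeric(R, p, q, i, j, h=1e-6):
    R_plus = R.copy()
    R_plus[i, j] += h
    K, K_plus = R.T @ R, R_plus.T @ R_plus
    return (K_plus[p, q] - K[p, q]) / h

ok = all(np.isclose(dK_dR_analytic(R, p, q, i, j), dK_dR_numeric(R, p, q, i, j), atol=1e-4)
         for p in range(N) for q in range(N) for i in range(M) for j in range(N))
print(ok)   # True: the analytic cases match the finite-difference estimates
```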

5.5 Useful Identities for Computing Gradients

  • $\frac{\partial}{\partial \boldsymbol{X}} \boldsymbol{f}(\boldsymbol{X})^{\top}=\left(\frac{\partial \boldsymbol{f}(\boldsymbol{X})}{\partial \boldsymbol{X}}\right)^{\top}$
  • $\frac{\partial}{\partial \boldsymbol{X}} \operatorname{tr}(\boldsymbol{f}(\boldsymbol{X}))=\operatorname{tr}\left(\frac{\partial \boldsymbol{f}(\boldsymbol{X})}{\partial \boldsymbol{X}}\right)$
  • $\frac{\partial}{\partial \boldsymbol{X}} \operatorname{det}(\boldsymbol{f}(\boldsymbol{X}))=\operatorname{det}(\boldsymbol{f}(\boldsymbol{X})) \operatorname{tr}\left(\boldsymbol{f}(\boldsymbol{X})^{-1} \frac{\partial \boldsymbol{f}(\boldsymbol{X})}{\partial \boldsymbol{X}}\right)$
  • $\frac{\partial}{\partial \boldsymbol{X}} \boldsymbol{f}(\boldsymbol{X})^{-1}=-\boldsymbol{f}(\boldsymbol{X})^{-1} \frac{\partial \boldsymbol{f}(\boldsymbol{X})}{\partial \boldsymbol{X}} \boldsymbol{f}(\boldsymbol{X})^{-1}$
  • $\frac{\partial \boldsymbol{a}^{\top} \boldsymbol{X}^{-1} \boldsymbol{b}}{\partial \boldsymbol{X}}=-\left(\boldsymbol{X}^{-1}\right)^{\top} \boldsymbol{a} \boldsymbol{b}^{\top}\left(\boldsymbol{X}^{-1}\right)^{\top}$
  • $\frac{\partial \boldsymbol{x}^{\top} \boldsymbol{a}}{\partial \boldsymbol{x}}=\boldsymbol{a}^{\top}$
  • $\frac{\partial \boldsymbol{a}^{\top} \boldsymbol{x}}{\partial \boldsymbol{x}}=\boldsymbol{a}^{\top}$
  • $\frac{\partial \boldsymbol{a}^{\top} \boldsymbol{X} \boldsymbol{b}}{\partial \boldsymbol{X}}=\boldsymbol{a} \boldsymbol{b}^{\top}$
  • $\frac{\partial \boldsymbol{x}^{\top} \boldsymbol{B} \boldsymbol{x}}{\partial \boldsymbol{x}}=\boldsymbol{x}^{\top}\left(\boldsymbol{B}+\boldsymbol{B}^{\top}\right)$
  • $\frac{\partial}{\partial \boldsymbol{s}}(\boldsymbol{x}-\boldsymbol{A} \boldsymbol{s})^{\top} W(\boldsymbol{x}-\boldsymbol{A} \boldsymbol{s})=-2(\boldsymbol{x}-\boldsymbol{A} \boldsymbol{s})^{\top} \boldsymbol{W} \boldsymbol{A} \quad \text { for symmetric } \boldsymbol{W}$
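
As a sanity check of two of the identities above (a small sketch of mine with randomly chosen $\boldsymbol{B}$, $\boldsymbol{a}$, $\boldsymbol{x}$), the quadratic-form gradient $\boldsymbol{x}^\top(\boldsymbol{B}+\boldsymbol{B}^\top)$ and the linear-form gradient $\boldsymbol{a}^\top$ can be compared with finite differences.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
B = rng.standard_normal((n, n))
a = rng.standard_normal(n)
x = rng.standard_normal(n)

def numerical_grad(f, x, h=1e-6):
    """Central-difference gradient of a scalar-valued function of a vector."""
    return np.array([(f(x + h * np.eye(x.size)[i]) - f(x - h * np.eye(x.size)[i])) / (2 * h)
                     for i in range(x.size)])

# d(x^T B x)/dx = x^T (B + B^T)
print(np.allclose(numerical_grad(lambda v: v @ B @ v, x), x @ (B + B.T), atol=1e-5))

# d(a^T x)/dx = a^T
print(np.allclose(numerical_grad(lambda v: a @ v, x), a, atol=1e-5))
```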

5.6 Backpropagation and Automatic Differentiation

5.6.1 Gradients in a Deep Network

Consider a neural network of the form

$$ \begin{array}{l}{\boldsymbol{f}_{0} :=\boldsymbol{x}} \\ {\boldsymbol{f}_{i} :=\sigma_{i}\left(\boldsymbol{A}_{i-1} \boldsymbol{f}_{i-1}+\boldsymbol{b}_{i-1}\right), \quad i=1, \ldots, K}\end{array} $$

and suppose the objective to be minimized is the squared loss

$$ L(\boldsymbol{\theta})=\left\|\boldsymbol{y}-\boldsymbol{f}_{K}(\boldsymbol{\theta}, \boldsymbol{x})\right\|^{2},\boldsymbol{\theta}=\left\{\boldsymbol{A}_{0}, \boldsymbol{b}_{0}, \ldots, \boldsymbol{A}_{K-1}, \boldsymbol{b}_{K-1}\right\} $$

The gradients with respect to the parameter sets are obtained by the chain rule as follows:

$$ \begin{aligned} \frac{\partial L}{\partial \boldsymbol{\theta}_{K-1}}&=\frac{\partial L}{\partial \boldsymbol{f}_{K}} \frac{\partial \boldsymbol{f}_{K}}{\partial \boldsymbol{\theta}_{K-1}}\\ \frac{\partial L}{\partial \boldsymbol{\theta}_{K-2}}&=\frac{\partial L}{\partial \boldsymbol{f}_{K}}\cdot\frac{\partial \boldsymbol{f}_{K}}{\partial \boldsymbol{f}_{K-1}} \frac{\partial \boldsymbol{f}_{K-1}}{\partial \boldsymbol{\theta}_{K-2}}\\ \frac{\partial L}{\partial \boldsymbol{\theta}_{K-3}}&=\frac{\partial L}{\partial \boldsymbol{f}_{K}} \frac{\partial f_{K}}{\partial \boldsymbol{f}_{K-1}}\cdot\frac{\partial \boldsymbol{f}_{K-1}}{\partial \boldsymbol{f}_{K-2}} \frac{\partial \boldsymbol{f}_{K-2}}{\partial \boldsymbol{\theta}_{K-3}}\\ \frac{\partial L}{\partial \boldsymbol{\theta}_{i}}&=\frac{\partial L}{\partial \boldsymbol{f}_{K}} \frac{\partial \boldsymbol{f}_{K}}{\partial \boldsymbol{f}_{K-1}}\cdots\frac{\partial \boldsymbol{f}_{i+2}}{\partial \boldsymbol{f}_{i+1}} \frac{\partial \boldsymbol{f}_{i+1}}{\partial \boldsymbol{\theta}_{i}} \end{aligned} $$
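
The sketch below walks through these chain-rule factors for a tiny two-layer network; the choice $K = 2$, the $\tanh$ activation, the identity output layer, and the random data are all my own assumptions for illustration. One entry of the resulting gradient is spot-checked against a finite difference.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(3)                     # input
y = rng.standard_normal(2)                     # target
A0, b0 = rng.standard_normal((4, 3)), rng.standard_normal(4)
A1, b1 = rng.standard_normal((2, 4)), rng.standard_normal(2)

def forward(A0, b0, A1, b1, x):
    f1 = np.tanh(A0 @ x + b0)                  # hidden layer, sigma_1 = tanh (my choice)
    f2 = A1 @ f1 + b1                          # linear output layer
    return f1, f2

def loss(A0, b0, A1, b1):
    _, f2 = forward(A0, b0, A1, b1, x)
    return np.sum((y - f2) ** 2)

# Backward pass: apply the chain-rule factors from the equations above.
f1, f2 = forward(A0, b0, A1, b1, x)
dL_df2 = -2 * (y - f2)                         # dL/df_K
dL_dA1 = np.outer(dL_df2, f1)                  # dL/dA_1
dL_db1 = dL_df2                                # dL/db_1
dL_df1 = A1.T @ dL_df2                         # propagate through df_K/df_{K-1} = A1
dL_dz1 = dL_df1 * (1 - f1 ** 2)                # derivative of tanh
dL_dA0 = np.outer(dL_dz1, x)                   # dL/dA_0
dL_db0 = dL_dz1                                # dL/db_0

# Spot-check one gradient entry against a finite difference.
h = 1e-6
A0_plus = A0.copy(); A0_plus[0, 0] += h
numeric = (loss(A0_plus, b0, A1, b1) - loss(A0, b0, A1, b1)) / h
print(dL_dA0[0, 0], numeric)                   # the two values agree closely
```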

(Figure: schematic illustration of backpropagation.)

5.6.2 Automatic Differentiation

Backpropagation is a special case of a more general technique from numerical analysis: automatic differentiation.

For a data flow that leads from an input $x$ through intermediate variables $a$ and $b$ to an output $y$ (i.e., $x \to a \to b \to y$), the chain rule gives

$$ \frac{\mathrm{d} y}{\mathrm{d} x}=\frac{\mathrm{d} y}{\mathrm{d} b} \frac{\mathrm{d} b}{\mathrm{d} a} \frac{\mathrm{d} a}{\mathrm{d} x} $$

Intuitively, the difference between forward mode and reverse mode lies in the order in which the products are computed:

$$ \begin{aligned} \frac{\mathrm{d} y}{\mathrm{d} x} &=\left(\frac{\mathrm{d} y}{\mathrm{d} b} \frac{\mathrm{d} b}{\mathrm{d} a}\right) \frac{\mathrm{d} a}{\mathrm{d} x} \\ \frac{\mathrm{d} y}{\mathrm{d} x} &=\frac{\mathrm{d} y}{\mathrm{d} b}\left(\frac{\mathrm{d} b}{\mathrm{d} a} \frac{\mathrm{d} a}{\mathrm{d} x}\right) \end{aligned} $$

In neural networks, the dimensionality of the input is usually much higher than the dimensionality of the labels, so reverse mode is computationally much cheaper than forward mode.

For the function $f(x)=\sqrt{x^{2}+\exp \left(x^{2}\right)}+\cos \left(x^{2}+\exp \left(x^{2}\right)\right)$, we compute the gradient using reverse mode. We introduce the intermediate variables

$$ \begin{aligned} a &=x^{2} \\ b &=\exp (a) \\ c &=a+b \\ d &=\sqrt{c} \\ e &=\cos (c) \\ f &=d+e \end{aligned} $$

(Figure: computation graph of the intermediate variables $a, b, c, d, e$ and the output $f$.)

$$ \frac{\partial f}{\partial c}=\frac{\partial f}{\partial d} \frac{\partial d}{\partial c}+\frac{\partial f}{\partial e} \frac{\partial e}{\partial c}=1 \cdot \frac{1}{2 \sqrt{c}}+1 \cdot(-\sin (c))\\ \frac{\partial f}{\partial b}=\frac{\partial f}{\partial c} \frac{\partial c}{\partial b}=\frac{\partial f}{\partial c} \cdot 1\\ \frac{\partial f}{\partial a}=\frac{\partial f}{\partial b} \frac{\partial b}{\partial a}+\frac{\partial f}{\partial c} \frac{\partial c}{\partial a}=\frac{\partial f}{\partial b} \exp (a)+\frac{\partial f}{\partial c} \cdot 1\\ \frac{\partial f}{\partial x}=\frac{\partial f}{\partial a} \frac{\partial a}{\partial x}=\frac{\partial f}{\partial a} \cdot 2 x $$
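
The sketch below (my own illustration) carries out exactly this reverse pass for the example function and compares the result with a central-difference estimate.

```python
import numpy as np

def f_and_grad(x):
    # Forward pass: compute the intermediate variables.
    a = x ** 2
    b = np.exp(a)
    c = a + b
    d = np.sqrt(c)
    e = np.cos(c)
    f = d + e
    # Reverse pass: accumulate the partial derivatives from the output backwards.
    df_dd, df_de = 1.0, 1.0
    df_dc = df_dd * 1.0 / (2.0 * np.sqrt(c)) + df_de * (-np.sin(c))
    df_db = df_dc * 1.0
    df_da = df_db * np.exp(a) + df_dc * 1.0
    df_dx = df_da * 2.0 * x
    return f, df_dx

x = 1.3
value, grad = f_and_grad(x)
h = 1e-6
numeric = (f_and_grad(x + h)[0] - f_and_grad(x - h)[0]) / (2 * h)
print(grad, numeric)   # the reverse-mode gradient matches the numerical estimate
```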

5.7 Higher-Order Derivatives

The Hessian matrix of a function $f : \mathbb{R}^{2} \rightarrow \mathbb{R}$ of two variables is:

$$ \boldsymbol{H}=\left[\begin{array}{cc}{\frac{\partial^{2} f}{\partial x^{2}}} & {\frac{\partial^{2} f}{\partial x \partial y}} \\ {\frac{\partial^{2} f}{\partial x \partial y}} & {\frac{\partial^{2} f}{\partial y^{2}}}\end{array}\right] $$

It can be written as $\nabla_{x, y}^{2} f(x, y)$. In general, for a function $f : \mathbb{R}^{n} \rightarrow \mathbb{R}$ of $n$ variables, the Hessian is an $n \times n$ matrix. The Hessian describes the curvature of the function around a point.
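
A numerical Hessian can be built from second-order central differences; the sketch below is my own illustration with an arbitrary twice-differentiable test function.

```python
import numpy as np

def numerical_hessian(f, x, h=1e-4):
    """Approximate the n x n Hessian of f: R^n -> R at x with central differences."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i] * h, np.eye(n)[j] * h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h ** 2)
    return H

# Example: f(x, y) = x^2 y + y^3, with Hessian [[2y, 2x], [2x, 6y]].
f = lambda v: v[0] ** 2 * v[1] + v[1] ** 3
print(numerical_hessian(f, np.array([1.0, 2.0])))   # ~[[4., 2.], [2., 12.]]
```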

5.8 Multivariate Taylor Series

Definition 5.7 (Multivariate Taylor Series)

Consider a function that is smooth at $\boldsymbol{x}_0$:

$$ \begin{aligned} f : \mathbb{R}^{D} & \rightarrow \mathbb{R} \\ \boldsymbol{x} &\mapsto f(\boldsymbol{x}), \boldsymbol{x} \in \mathbb{R}^{D} \end{aligned} $$

With $\boldsymbol{\delta} :=\boldsymbol{x}-\boldsymbol{x}_{0}$, the multivariate Taylor series of $f$ at $\boldsymbol{x}_0$ is defined as:

$$ f(\boldsymbol{x})=\sum_{k=0}^{\infty} \frac{D_{\boldsymbol{x}}^{k} f\left(\boldsymbol{x}_{0}\right)}{k !} \boldsymbol{\delta}^{k} $$

where $D_{\boldsymbol{x}}^{k} f\left(\boldsymbol{x}_{0}\right)$ is the $k$-th (total) derivative of $f$ with respect to $\boldsymbol{x}$, evaluated at $\boldsymbol{x}_0$.

Definition 5.8 (Taylor Polynomial)

The Taylor polynomial of degree $n$ of $f$ at $\boldsymbol{x}_0$ contains the first $n+1$ terms of the Taylor series and is defined as:

$$ T_{n}(\boldsymbol{x})=\sum_{k=0}^{n} \frac{D_{\boldsymbol{x}}^{k} f\left(\boldsymbol{x}_{0}\right)}{k !} \boldsymbol{\delta}^{k} $$

The notation above is somewhat sloppy: for a vector $\boldsymbol{\delta} \in \mathbb{R}^D$ with $D>1$ and $k>1$, the power $\boldsymbol{\delta}^k$ is not defined.

Instead, $\boldsymbol{\delta}^{k} \in \mathbb{R}^{D \times D \times \ldots \times D}$ is understood as a $k$-th order tensor, obtained as a $k$-fold outer product, denoted by $\otimes$:

$$ \begin{aligned} \boldsymbol{\delta}^{2} &:=\boldsymbol{\delta} \otimes \boldsymbol{\delta}=\boldsymbol{\delta} \boldsymbol{\delta}^{\top}, \quad \boldsymbol{\delta}^{2}[i, j]=\delta[i] \delta[j]\\ \boldsymbol{\delta}^{3} &:=\boldsymbol{\delta} \otimes \boldsymbol{\delta} \otimes \boldsymbol{\delta}, \quad \boldsymbol{\delta}^{3}[i, j, k]=\delta[i] \delta[j] \delta[k] \end{aligned} $$
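
To illustrate the first few terms, the sketch below (my own addition, with an arbitrary test function whose gradient and Hessian are supplied analytically) compares the second-order Taylor approximation $f(\boldsymbol{x}_0) + \nabla f(\boldsymbol{x}_0)\boldsymbol{\delta} + \frac{1}{2}\boldsymbol{\delta}^\top \boldsymbol{H} \boldsymbol{\delta}$ with the true function value near $\boldsymbol{x}_0$.

```python
import numpy as np

# Test function f(x1, x2) = exp(x1) * sin(x2) with analytic gradient and Hessian.
f = lambda v: np.exp(v[0]) * np.sin(v[1])
grad = lambda v: np.array([np.exp(v[0]) * np.sin(v[1]),
                           np.exp(v[0]) * np.cos(v[1])])
hess = lambda v: np.array([[np.exp(v[0]) * np.sin(v[1]),  np.exp(v[0]) * np.cos(v[1])],
                           [np.exp(v[0]) * np.cos(v[1]), -np.exp(v[0]) * np.sin(v[1])]])

x0 = np.array([0.2, 0.8])
delta = np.array([0.05, -0.03])          # a small displacement x - x0
x = x0 + delta

# Terms k = 0, 1, 2 of the multivariate Taylor series:
T2 = f(x0) + grad(x0) @ delta + 0.5 * delta @ hess(x0) @ delta

print(f(x), T2)                          # the quadratic approximation is very close to f(x)
```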

