Reinforcement learning has many classical theoretical results, such as the policy gradient theorem, the difference between the expected cumulative rewards of two policies, policy improvement, and so on. This blog introduces a unified tool for analyzing these problems.
We first introduce some basic notation. In general, reinforcement learning considers an MDP $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P}, \mathcal{R},\gamma,\rho_0)$ and a policy $\pi$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, $\mathcal{P}$ and $\mathcal{R}$ denote the transition and reward functions, $\gamma$ is the discount factor for long-horizon returns, and $\rho_0$ is the initial state distribution. Under $\mathcal{M}$ and $\pi$ we can define the discounted state distribution:
$$
\begin{equation}
d_{\mathcal{M}}^{\pi}(s) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t \mathcal{P}(s_t=s|\pi,\mathcal{M}).
\end{equation}
$$
We have the following lemma.
Lemma: For any functions $f,g$ on $\mathcal{S}$ satisfying
$$
\begin{equation}
\label{eq_bell}
f(s) = g(s) + \gamma\int \pi(a|s)\mathcal{P}(s'|s,a) f(s') dads',
\end{equation}
$$
then we have
$$
\begin{equation}
\label{eq_result}
\mathbb{E}_{s\sim\rho_0} \left[f(s)\right] = \frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} \left[g(s)\right].
\end{equation}
$$
For readability, we first look at concrete applications of this lemma, and defer its proof to Section 5.
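Before moving on to the applications, here is a minimal numerical sanity check of the lemma itself on a random tabular MDP. This is only a sketch using numpy; the setup (a 5-state, 3-action MDP) and all names such as `P`, `pi`, `rho0` are ad hoc assumptions for illustration, not anything from the references.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)    # transition kernel P(s'|s,a)
pi = rng.random((S, A));   pi /= pi.sum(-1, keepdims=True)  # policy pi(a|s)
rho0 = rng.random(S);      rho0 /= rho0.sum()               # initial distribution

# State-to-state kernel under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
P_pi = np.einsum('sa,sat->st', pi, P)

# Discounted state distribution: d = (1 - gamma) * rho0^T (I - gamma * P_pi)^{-1}
d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho0)

# Pick an arbitrary g and solve the fixed point f = g + gamma * P_pi f
g = rng.standard_normal(S)
f = np.linalg.solve(np.eye(S) - gamma * P_pi, g)

lhs = rho0 @ f                 # E_{s ~ rho0}[f(s)]
rhs = d @ g / (1 - gamma)      # (1/(1-gamma)) E_{s ~ d}[g(s)]
print(np.allclose(lhs, rhs))   # True
```

The later sketches in this post reuse the same kind of random tabular setup.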
At first glance the condition of this lemma looks a bit involved, but readers familiar with reinforcement learning will notice that it closely resembles the Bellman equation. For example, the value function $V_{\pi,\mathcal{M}}$ and the reward function $\mathcal{R}$ naturally satisfy this condition:
$$
\begin{equation}
\underbrace{V_{\pi,\mathcal{M}}(s)}_{f(s)} = \underbrace{\mathbb{E}_{a\sim\pi(\cdot|s)} [\mathcal{R}(s,a)]}_{g(s)} + \gamma\int \pi(a|s)\mathcal{P}(s'|s,a) \underbrace{V_{\pi,\mathcal{M}}(s')}_{f(s')} dads'.
\end{equation}
$$
Based on this lemma, we immediately obtain its first corollary:
$$
\begin{equation}
\begin{split}
J_{\mathcal{M}}(\pi) \triangleq & \mathbb{E}_{s\sim\rho_0} \left[V_{\pi,\mathcal{M}}(s)\right] = \mathbb{E}_{s\sim\rho_0} \left[f(s)\right] \\
= & \frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} \left[g(s)\right] = \frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} \mathbb{E}_{a\sim\pi(\cdot|s)} [\mathcal{R}(s,a)].
\end{split}
\end{equation}
$$
This conclusion is in fact rather trivial; it can be proved directly from the definitions of $V_{\pi,\mathcal{M}}$ and $d_{\mathcal{M}}^{\pi}$. Next we look at some more interesting corollaries of the lemma.
The policy gradient theorem [1] is one of the most important results in reinforcement learning. Concretely, it considers the gradient of the expected cumulative reward with respect to the policy, i.e.
$$
\begin{equation}
\nabla_{\pi} J_{\mathcal{M}}(\pi) \triangleq \nabla_{\pi} \mathbb{E}_{s\sim\rho_0} [V_{\pi,\mathcal{M}}(s)] = \mathbb{E}_{s\sim\rho_0} [\nabla_{\pi} V_{\pi,\mathcal{M}}(s)].
\end{equation}
$$
To apply the lemma, we naturally consider $f(s) = \nabla_{\pi} V_{\pi,\mathcal{M}}(s)$ and simplify
$$
\begin{equation}
\begin{split}
f(s) =& \nabla_{\pi} [V_{\pi,\mathcal{M}}(s)] = \nabla_{\pi} \left[\int \pi(a|s) Q_{\pi,\mathcal{M}}(s,a) da\right] \\
=& \underbrace{\int Q_{\pi,\mathcal{M}}(s,a) \nabla_{\pi} \pi(a|s) da}_{\mathrm{defined\ as\ } g(s)} + \int \pi(a|s) [\nabla_{\pi} Q_{\pi,\mathcal{M}}(s,a)] da \\
=& g(s) + \int \pi(a|s) \nabla_{\pi} \left[\mathcal{R}(s,a) + \gamma \int \mathcal{P}(s'|s,a) V_{\pi,\mathcal{M}}(s') ds'\right] da \\
=& g(s) + \int \pi(a|s) \left[ \gamma \int \mathcal{P}(s'|s,a) \nabla_{\pi} V_{\pi,\mathcal{M}}(s') ds'\right] da \\
=& g(s) + \gamma \int \pi(a|s) \mathcal{P}(s'|s,a) \underbrace{\nabla_{\pi} V_{\pi,\mathcal{M}}(s')}_{f(s')} da ds'.
\end{split}
\end{equation}
$$
Therefore we obtain
$$
\begin{equation}
\begin{split}
\nabla_{\pi} J_{\mathcal{M}}(\pi) = & \nabla_{\pi} \mathbb{E}_{s\sim\rho_0} [V_{\pi,\mathcal{M}}(s)] = \mathbb{E}_{s\sim\rho_0} [\nabla_{\pi} V_{\pi,\mathcal{M}}(s)] = \mathbb{E}_{s\sim\rho_0} [f(s)] \\
= & \frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} [g(s)] =\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} \int [\nabla_{\pi} \pi(a|s)] Q_{\pi,\mathcal{M}}(s,a) da \\
= & \frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}, a\sim\pi(\cdot|s)} [Q_{\pi,\mathcal{M}}(s,a) \nabla_{\pi} \log\pi(a|s)],
\end{split}
\end{equation}
$$
which proves the policy gradient theorem [1].
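As a sanity check, the following sketch compares the gradient given by the formula above (written with respect to the logits $\theta$ of a softmax policy, so that $\nabla_\theta \log\pi_\theta$ appears) against a finite-difference gradient of $J_{\mathcal{M}}$ on a random tabular MDP. The setup and all names are ad hoc assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 4, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
R = rng.random((S, A))
rho0 = np.ones(S) / S

def softmax_pi(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def q_d_J(theta):
    pi = softmax_pi(theta)
    P_pi = np.einsum('sa,sat->st', pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * R).sum(1))
    Q = R + gamma * P @ V                       # Q(s,a) = R(s,a) + gamma * sum_s' P V
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho0)
    return pi, Q, d, rho0 @ V                   # also returns J = E_{rho0}[V]

theta = rng.standard_normal((S, A))
pi, Q, d, J = q_d_J(theta)

# Policy gradient theorem: grad_theta J = 1/(1-gamma) E_{d, pi}[Q * grad_theta log pi].
# For a softmax policy, d log pi(a|s) / d theta[s, b] = 1{a=b} - pi(b|s).
grad = np.zeros((S, A))
for s in range(S):
    for b in range(A):
        grad[s, b] = d[s] / (1 - gamma) * sum(
            pi[s, a] * Q[s, a] * ((a == b) - pi[s, b]) for a in range(A))

# Finite-difference check of the same gradient
eps, grad_fd = 1e-5, np.zeros((S, A))
for s in range(S):
    for b in range(A):
        tp, tm = theta.copy(), theta.copy()
        tp[s, b] += eps; tm[s, b] -= eps
        grad_fd[s, b] = (q_d_J(tp)[3] - q_d_J(tm)[3]) / (2 * eps)

print(np.allclose(grad, grad_fd, atol=1e-5))    # True up to numerical error
```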
A frequently considered problem is the difference between the expected cumulative rewards of different policies under different MDPs. Given another MDP ${\color{red}{\mathcal{M'}}}=(\mathcal{S},\mathcal{A},\mathcal{P}',\mathcal{R}',\gamma,\rho_0)$ and a policy ${\color{red}{\pi'}}$, where we assume ${\color{red}{\mathcal{M'}}}$ has the same initial state distribution $\rho_0$ as $\mathcal{M}$, we consider $J_{{\color{red}{\mathcal{M}'}}} ({\color{red}{\pi'}}) - J_{\mathcal{M}}(\pi)$. For ease of reading, ${\color{red}{\mathcal{M}'}}$ and ${\color{red}{\pi'}}$ are shown in red.
$$
\begin{equation}
\begin{split}
& J_{{\color{red}{\mathcal{M}'}}}({\color{red}{\pi'}}) - J_{\mathcal{M}}(\pi) = \mathbb{E}_{s\sim \rho_0} [V_{{\color{red}{\pi'}},{\color{red}{\mathcal{M}'}}}(s) - V_{\pi,\mathcal{M}}(s)].
\end{split}
\end{equation}
$$
To apply the lemma, we naturally take $f(s) = V_{{\color{red}{\pi'}},{\color{red}{\mathcal{M'}}}}(s) - V_{\pi,\mathcal{M}}(s)$ and have
$$
\begin{equation}
\begin{split}
f(s) = & V_{{\color{red}{\pi'}},{\color{red}{\mathcal{M'}}}}(s) - V_{\pi,\mathcal{M}}(s)\\
= & V_{{\color{red}{\pi'}},{\color{red}{\mathcal{M'}}}}(s)
- \int \pi(a|s) \left[\mathcal{R}(s,a) + \gamma\int\mathcal{P}(s'|s,a)V_{\pi, \mathcal{M}}(s')ds'\right]da \\
= & \underbrace{V_{{\color{red}{\pi'}},{\color{red}{\mathcal{M'}}}}(s) - \int \pi(a|s) \left[\mathcal{R}(s,a) + \gamma\int\mathcal{P}(s'|s,a)V_{{\color{red}{\pi'}}, {\color{red}{\mathcal{M'}}}}(s')ds'\right] da}_{\mathrm{defined\ as\ } g(s)} \\
+ & \gamma \int \pi(a|s) \mathcal{P}(s'|s,a)\underbrace{\left[V_{{\color{red}{\pi'}}, {\color{red}{\mathcal{M'}}}}(s') - V_{\pi, \mathcal{M}}(s')\right]}_{f(s')} dads'.
\end{split}
\end{equation}
$$
Therefore we obtain
$$
\begin{equation}
\begin{split}
& J_{{\color{red}{\mathcal{M}'}}}({\color{red}{\pi'}}) - J_{\mathcal{M}}(\pi) = \mathbb{E}_{s\sim \rho_0} [V_{{\color{red}{\pi'}},{\color{red}{\mathcal{M'}}}}(s) - V_{\pi,\mathcal{M}}(s)] = \mathbb{E}_{s\sim\rho_0} [f(s)]
= \frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} [g(s)].
\end{split}
\end{equation}
$$
In particular, when ${\color{red}{\mathcal{M'}}}=\mathcal{M}$, i.e. when we consider the difference in expected cumulative rewards between two policies in the same MDP, $g(s)$ can be simplified as:
$$
\begin{equation}
\begin{split}
\label{eq_same_mdp}
g(s) = & V_{{\color{red}{\pi'}},\mathcal{M}}(s) - \int \pi(a|s) \left[\mathcal{R}(s,a) + \gamma\int\mathcal{P}(s'|s,a)V_{{\color{red}{\pi'}}, \mathcal{M}}(s')ds'\right] da \\
= & V_{{\color{red}{\pi'}},\mathcal{M}}(s) - \int \pi(a|s) Q_{{\color{red}{\pi'}}, \mathcal{M}}(s,a) da \\
\overset{(*)}{=} & \int ({\color{red}{\pi'}}(a|s) - \pi(a|s)) Q_{{\color{red}{\pi'}}, \mathcal{M}}(s,a) da \\
= & \int ({\color{red}{\pi'}}(a|s) - \pi(a|s)) Q_{{\color{red}{\pi'}}, \mathcal{M}}(s,a) da - \underbrace{\int ({\color{red}{\pi'}}(a|s) - \pi(a|s)) V_{{\color{red}{\pi'}}, \mathcal{M}}(s) da}_{=0} \\
= & \int ({\color{red}{\pi'}}(a|s) - \pi(a|s)) A_{{\color{red}{\pi'}}, \mathcal{M}}(s,a) da \\
= & \int ({\color{red}{\pi'}}(a|s) - \pi(a|s)) A_{{\color{red}{\pi'}}, \mathcal{M}}(s,a) da - \underbrace{\int {\color{red}{\pi'}}(a|s) A_{{\color{red}{\pi'}}, \mathcal{M}}(s,a) da}_{=0} \\
= & - \int \pi(a|s) A_{{\color{red}{\pi'}}, \mathcal{M}}(s,a) da .
\end{split}
\end{equation}
$$
Therefore we obtain
$$
\begin{equation}
\begin{split}
J_{\mathcal{M}}({\color{red}{\pi'}}) - J_{\mathcal{M}}(\pi) = \frac{-1}{1-\gamma} \int d_{\mathcal{M}}^{\pi}(s) \pi(a|s) A_{{\color{red}{\pi'}}, \mathcal{M}}(s,a) dsda.
\end{split}
\end{equation}
$$
This is exactly Lemma 6.1 in [2] and Equation (2) in TRPO [3].
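As a quick numerical check of this identity (again a numpy sketch on an ad hoc random tabular MDP; all names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
R = rng.random((S, A))
rho0 = np.ones(S) / S

def evaluate(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * R).sum(1))
    Q = R + gamma * P @ V
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho0)
    return V, Q, d, rho0 @ V

pi  = rng.random((S, A)); pi  /= pi.sum(1, keepdims=True)
pi2 = rng.random((S, A)); pi2 /= pi2.sum(1, keepdims=True)   # plays the role of pi'

V2, Q2, _, J2 = evaluate(pi2)
_,  _,  d1, J1 = evaluate(pi)
A2 = Q2 - V2[:, None]                                        # advantage of pi'

lhs = J2 - J1
rhs = -np.sum(d1[:, None] * pi * A2) / (1 - gamma)           # -1/(1-gamma) E_{d^pi, pi}[A_{pi'}]
print(np.allclose(lhs, rhs))                                 # True
```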
Meanwhile, based on Equation $(*)$ we can also prove policy improvement (Policy Improvement) for the greedy policy. We have
$$
\begin{equation}
\begin{split}
J_{\mathcal{M}}({\color{red}{\pi'}}) - J_{\mathcal{M}}(\pi) = \frac{1}{1-\gamma} \int d_{\mathcal{M}}^{\pi}(s) ({\color{red}{\pi'}}(a|s) - \pi(a|s)) Q_{{\color{red}{\pi'}}, \mathcal{M}}(s,a) ds da,
\end{split}
\end{equation}
$$
and swapping the roles of $\pi$ and ${\color{red}{\pi'}}$ gives
$$
\begin{equation}
\begin{split}
J_{\mathcal{M}}(\pi) - J_{\mathcal{M}}({\color{red}{\pi'}}) = \frac{1}{1-\gamma} \int d_{\mathcal{M}}^{{\color{red}{\pi'}}}(s) (\pi(a|s) - {\color{red}{\pi'}}(a|s)) Q_{\pi, \mathcal{M}}(s,a) ds da \leq 0,
\end{split}
\end{equation}
$$
where the last inequality holds because ${\color{red}{\pi'}}(s) = \arg\max_a Q_{\pi,\mathcal{M}}(s,a)$, so for every state $s$ we have $\int (\pi(a|s) - {\color{red}{\pi'}}(a|s)) Q_{\pi,\mathcal{M}}(s,a) da = \mathbb{E}_{a\sim\pi(\cdot|s)}[Q_{\pi,\mathcal{M}}(s,a)] - \max_a Q_{\pi,\mathcal{M}}(s,a) \leq 0$.
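The sketch below runs greedy policy improvement on an ad hoc random tabular MDP and checks that $J$ never decreases, as the argument above predicts. This is only an illustrative assumption-laden setup, not the method of any particular reference.

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
R = rng.random((S, A))
rho0 = np.ones(S) / S

def evaluate(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * R).sum(1))
    return R + gamma * P @ V, rho0 @ V           # Q_pi and J(pi)

pi = np.ones((S, A)) / A
returns = []
for _ in range(10):
    Q, J = evaluate(pi)
    returns.append(J)
    greedy = Q.argmax(axis=1)                    # pi'(s) = argmax_a Q_pi(s, a)
    pi = np.eye(A)[greedy]                       # deterministic greedy policy
print(np.all(np.diff(returns) >= -1e-10))        # policy improvement: J never decreases
```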
Beyond this, we can also derive some upper bounds on $J_{{\color{red}{\mathcal{M'}}}}({\color{red}{\pi'}}) - J_{\mathcal{M}}(\pi)$; to keep the exposition flowing, we defer them to Section 6.
Max-Entropy RL is an important family of problems in reinforcement learning, including algorithms such as Soft Q-Learning [4] and SAC. These methods encourage exploration while maximizing the reward, i.e. the objective is
$$
\begin{equation}
\begin{split}
J_{\mathrm{ME}}(\pi) \triangleq & \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t \left(\mathcal{R}(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t)) \right)\right] = J_{\mathcal{M}}(\pi) + \frac{\alpha}{1-\gamma} \mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} \left[\mathcal{H}(\pi(\cdot|s)) \right].
\end{split}
\end{equation}
$$
We define the soft Q function as
$$
\begin{equation}
\begin{split}
Q'_{\pi,\mathcal{M}}(s, a) & = \mathbb{E}\left[ \mathcal{R}(s, a) + \sum_{i=1}^{\infty} \gamma^i \left(\mathcal{R}(s_i, a_i) + \alpha \mathcal{H}(\pi(\cdot|s_i)) \right) \right].
\end{split}
\end{equation}
$$
Each policy update takes the form
$$
\begin{equation}
{\color{red}{\pi'}}(\cdot|s) \propto \exp (Q'_{\pi,\mathcal{M}}(s,\cdot) / \alpha).
\end{equation}
$$
We wish to show that this update achieves policy improvement, i.e. $J_{\mathrm{ME}}({\color{red}{\pi'}}) \ge J_{\mathrm{ME}}(\pi)$. First, to simplify notation, we define
$$
\begin{equation}
\begin{split}
F(\mu, Q, s) = \mathbb{E}_{a\sim\mu(\cdot|s)} \left[ Q(s, a)\right] + \alpha \mathcal{H}(\mu(\cdot|s)).
\end{split}
\end{equation}
$$
We can show that
$$
\begin{equation}
\begin{split}
Q'_{\pi,\mathcal{M}}(s, a) = &\mathcal{R}(s, a) + \alpha \gamma \mathbb{E}_{s_1\sim\mathcal{P}(\cdot|s,a)} \left[\mathcal{H}(\pi(\cdot|s_1)) \right] + \gamma \mathbb{E}_{s_1,a_1} \left[ Q'_{\pi,\mathcal{M}}(s_1, a_1)\right] \\
= & \mathcal{R}(s, a) + \gamma \mathbb{E}_{s_1} [F(\pi, Q'_{\pi,\mathcal{M}}, s_1)].
\end{split}
\end{equation}
$$
Meanwhile, we clearly have
$$
\begin{equation}
\begin{split}
J_{\mathrm{ME}}({\color{red}{\pi'}}) - J_{\mathrm{ME}}(\pi) = & \mathbb{E}_{s\sim\rho_0} \left[ F({\color{red}{\pi'}}, Q'_{{\color{red}{\pi'}},\mathcal{M}}, s) - F(\pi, Q'_{\pi,\mathcal{M}}, s)\right].
\end{split}
\end{equation}
$$
Therefore, to apply the lemma, we naturally define
$$
\begin{equation}
\begin{split}
f(s) \triangleq & F({\color{red}{\pi'}}, Q'_{{\color{red}{\pi'}},\mathcal{M}}, s) - F(\pi, Q'_{\pi,\mathcal{M}}, s) \\
=& \underbrace{F({\color{red}{\pi'}}, Q'_{\pi,\mathcal{M}}, s) - F(\pi, Q'_{\pi,\mathcal{M}}, s)}_{\mathrm{defined\ as\ } g(s)} + F({\color{red}{\pi'}}, Q'_{{\color{red}{\pi'}},\mathcal{M}}, s) - F({\color{red}{\pi'}}, Q'_{\pi,\mathcal{M}}, s) \\
=& g(s) + \mathbb{E}_{a\sim{\color{red}{\pi'}}(\cdot|s)} [ Q'_{{\color{red}{\pi'}},\mathcal{M}}(s, a) - Q'_{\pi,\mathcal{M}}(s, a) ] \\
=& g(s) + \gamma\mathbb{E}_{a\sim{\color{red}{\pi'}}(\cdot|s), s_1} [\underbrace{F({\color{red}{\pi'}}, Q'_{{\color{red}{\pi'}},\mathcal{M}}, s_1) - F(\pi, Q'_{\pi,\mathcal{M}}, s_1)}_{f(s_1)}].
\end{split}
\end{equation}
$$
Applying the lemma (with policy ${\color{red}{\pi'}}$), we obtain
$$
\begin{equation}
\begin{split}
J_{\mathrm{ME}}({\color{red}{\pi'}}) - J_{\mathrm{ME}}(\pi) = & \mathbb{E}_{s\sim\rho_0} \left[ F({\color{red}{\pi'}}, Q'_{{\color{red}{\pi'}},\mathcal{M}}, s) - F(\pi, Q'_{\pi,\mathcal{M}}, s)\right] = \mathbb{E}_{s\sim\rho_0} [f(s)] \\
= & \frac{1}{1-\gamma} \mathbb{E}_{s\sim d_{\mathcal{M}}^{{\color{red}{\pi'}}}}[g(s)] \\
= & \frac{1}{1-\gamma} \mathbb{E}_{s\sim d_{\mathcal{M}}^{{\color{red}{\pi'}}}}[F({\color{red}{\pi'}}, Q'_{\pi,\mathcal{M}}, s) - F(\pi, Q'_{\pi,\mathcal{M}}, s)].
\end{split}
\end{equation}
$$
Finally, we show that $F({\color{red}{\pi'}}, Q'_{\pi,\mathcal{M}}, s) \ge F(\pi, Q'_{\pi,\mathcal{M}}, s)$. Viewing $F$ as a function of $\mu$ under the constraint $\int \mu(a|s) da = 1$, the first-order optimality condition shows that the maximizer $\mu^*$ satisfies
$$
\begin{equation}
\begin{split}
Q(s, a) = \alpha \log \mu^*(a|s) + b \alpha,
\end{split}
\end{equation}
$$
where $b$ is a constant, so $\mu^*(a|s) = e^{\frac{Q(s, a)}{\alpha} - b}$. Since $\int \mu^*(a|s) da = 1$, we have
$$
\begin{equation}
\begin{split}
b =& \log\int e^{\frac{Q(s, a)}{\alpha}} da, \quad \mu^*(a|s) = \frac{ e^{\frac{Q(s, a)}{\alpha}}}{\int e^{\frac{Q(s, a')}{\alpha}} da'} \propto e^{\frac{Q(s, a)}{\alpha}}.
\end{split}
\end{equation}
$$
In other words, ${\color{red}{\pi'}}(\cdot|s) \propto \exp(Q'_{\pi,\mathcal{M}}(s,\cdot)/\alpha)$ is exactly the maximizer of $\mu \mapsto F(\mu, Q'_{\pi,\mathcal{M}}, s)$, hence
$$
\begin{equation}
F({\color{red}{\pi'}}, Q'_{\pi,\mathcal{M}}, s) \ge F(\pi, Q'_{\pi,\mathcal{M}}, s).
\end{equation}
$$
This proves $J_{\mathrm{ME}}({\color{red}{\pi'}}) \ge J_{\mathrm{ME}}(\pi)$, i.e. the policy improvement theorem for Max-Entropy RL (Theorem 4 in [4]).
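The sketch below runs tabular soft policy iteration with the update ${\color{red}{\pi'}}(\cdot|s) \propto \exp(Q'_{\pi,\mathcal{M}}(s,\cdot)/\alpha)$ and checks that $J_{\mathrm{ME}}$ never decreases. It is only an illustrative numpy sketch on an ad hoc random MDP, not an implementation of Soft Q-Learning or SAC.

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, gamma, alpha = 5, 3, 0.9, 0.5
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
R = rng.random((S, A))
rho0 = np.ones(S) / S

def soft_eval(pi):
    # Soft policy evaluation in closed form:
    # F(s) = E_{a~pi}[Q'(s,a)] + alpha*H(pi(.|s)) satisfies F = r_pi + alpha*H + gamma*P_pi F
    P_pi = np.einsum('sa,sat->st', pi, P)
    H = -(pi * np.log(pi)).sum(axis=1)
    F = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * R).sum(1) + alpha * H)
    Q = R + gamma * P @ F            # Q'(s,a) = R(s,a) + gamma * E_{s1}[F(pi, Q', s1)]
    return Q, rho0 @ F               # soft Q and J_ME(pi) = E_{rho0}[F(pi, Q', s)]

pi = np.ones((S, A)) / A
objectives = []
for _ in range(10):
    Q, J_me = soft_eval(pi)
    objectives.append(J_me)
    logits = Q / alpha
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)          # pi'(.|s) ∝ exp(Q'_pi(s,.) / alpha)
print(np.all(np.diff(objectives) >= -1e-10))     # soft policy improvement holds
```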
In this section we give the proof of the lemma (the core idea of the proof follows [5]).
First, we have
$$
\begin{equation}
\begin{split}
\mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} [f(s)] =& \mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} [g(s)] + \mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} \left[ \gamma\int \pi(a|s)\mathcal{P}(s'|s,a) f(s') da ds' \right] \\
=& \mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} [g(s)] + \int f(s') \left[\int \gamma d_{\mathcal{M}}^{\pi}(s) \pi(a|s)\mathcal{P}(s'|s,a) ds da \right] ds' \\
\overset{1}{=}& \mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} [g(s)] + \int f(s') \left[d_{\mathcal{M}}^{\pi}(s') - (1-\gamma)\rho_0(s') \right] ds' \\
=& \mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} [g(s)] + \int f(s) \left[d_{\mathcal{M}}^{\pi}(s) - (1-\gamma)\rho_0(s) \right] ds \\
\end{split}
\end{equation}
$$
Here the equality marked $1$ relies on the following identity (it can be verified by directly plugging in the definition of $d_{\mathcal{M}}^{\pi}$, or see Lemma 1 in [5]):
$$
\begin{equation}
\begin{split}
d_{\mathcal{M}}^{\pi}(s) - (1-\gamma)\rho_0(s)
= \gamma\int d_{\mathcal{M}}^{\pi}(s')\pi(a|s')\mathcal{P}(s|s', a) da ds'.
\end{split}
\end{equation}
$$
Therefore we obtain
$$
\begin{equation}
\begin{split}
\int f(s) (1-\gamma)\rho_0(s) ds = \mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} [g(s)].
\end{split}
\end{equation}
$$
That is, $\mathbb{E}_{s\sim\rho_0} \left[f(s)\right] = \frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} \left[g(s)\right]$, which completes the proof.
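As a tiny numerical check of the identity used in the step marked $1$ (again a numpy sketch on an ad hoc random tabular MDP):

```python
import numpy as np

rng = np.random.default_rng(5)
S, A, gamma = 6, 2, 0.95
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
pi = rng.random((S, A));   pi /= pi.sum(-1, keepdims=True)
rho0 = rng.random(S);      rho0 /= rho0.sum()

P_pi = np.einsum('sa,sat->st', pi, P)
d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho0)

# d(s) - (1 - gamma) rho0(s) = gamma * sum_{s'} d(s') * P_pi(s' -> s)
print(np.allclose(d - (1 - gamma) * rho0, gamma * P_pi.T @ d))   # True
```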
Finally, building on the derivation of $J_{\mathcal{M}'}(\pi') - J_{\mathcal{M}}(\pi)$ in Section 3, we further analyze some of its upper bounds.
When ${\color{red}{\mathcal{M'}}}=\mathcal{M}$, based on Equation $(*)$ in Section 3, we write $r^* = \max_{s,a} |\mathcal{R}(s,a)|$, let $D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) = \frac{1}{2}\int |{\color{red}{\pi'}}(a|s) - \pi(a|s)|da$ denote the total variation distance between the two policies at state $s$, and define $\hat{V}_{\mathcal{M},{\color{red}{\pi'}}} = \max_s V_{{\color{red}{\pi'}},\mathcal{M}}(s) - \min_s V_{{\color{red}{\pi'}},\mathcal{M}}(s)$, $\bar{V}_{\mathcal{M},{\color{red}{\pi'}}} = \frac{1}{2}(\max_s V_{{\color{red}{\pi'}},\mathcal{M}}(s) + \min_s V_{{\color{red}{\pi'}},\mathcal{M}}(s))$. Then:
$$
\begin{equation}
\begin{split}
g(s)
= & \int ({\color{red}{\pi'}}(a|s) - \pi(a|s)) Q_{{\color{red}{\pi'}}, \mathcal{M}}(s,a) da \\
= & \int ({\color{red}{\pi'}}(a|s) - \pi(a|s)) \mathcal{R}(s,a) da + \gamma \int ({\color{red}{\pi'}}(a|s) - \pi(a|s)) \mathcal{P}(s'|s,a) V_{{\color{red}{\pi'}}, \mathcal{M}}(s') dads' \\
\leq & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \gamma \int ({\color{red}{\pi'}}(a|s) - \pi(a|s)) \mathcal{P}(s'|s,a) V_{{\color{red}{\pi'}}, \mathcal{M}}(s') dads' \\
= & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \gamma \int ({\color{red}{\pi'}}(a|s) - \pi(a|s)) \mathcal{P}(s'|s,a) (V_{{\color{red}{\pi'}}, \mathcal{M}}(s') - \bar{V}_{\mathcal{M},{\color{red}{\pi'}}}) dads' \\
\leq & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \gamma \int |{\color{red}{\pi'}}(a|s) - \pi(a|s)| \mathcal{P}(s'|s,a) \frac{\hat{V}_{\mathcal{M},{\color{red}{\pi'}}}}{2} dads' \\
= & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \gamma \int |{\color{red}{\pi'}}(a|s) - \pi(a|s)| \frac{\hat{V}_{\mathcal{M},{\color{red}{\pi'}}}}{2} da \\
\overset{(*1)}{\leq} & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \gamma D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) \hat{V}_{\mathcal{M},{\color{red}{\pi'}}} \\
\leq & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \gamma D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) 2\max_s|V_{{\color{red}{\pi'}},\mathcal{M}}(s)| \\
\leq & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + 2\gamma D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) \frac{1}{1-\gamma}r^* \\
\overset{(*2)}{=} & \left(2 + \frac{2\gamma}{1-\gamma}\right) r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s).
\end{split}
\end{equation}
$$
Here Equation $(*1)$ corresponds to Theorem 2 in [5], and Equation $(*2)$ corresponds to Theorem 5 in [6].
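The following sketch checks the pointwise bound above, $g(s) \leq \left(2 + \frac{2\gamma}{1-\gamma}\right) r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s)$, on an ad hoc random tabular MDP (numpy only; all names are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(6)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
R = rng.random((S, A)) - 0.5                     # rewards in [-0.5, 0.5]

def q_of(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * R).sum(1))
    return R + gamma * P @ V                     # Q_pi(s, a)

pi  = rng.random((S, A)); pi  /= pi.sum(1, keepdims=True)
pi2 = rng.random((S, A)); pi2 /= pi2.sum(1, keepdims=True)   # plays the role of pi'

g = ((pi2 - pi) * q_of(pi2)).sum(axis=1)         # g(s) = int (pi' - pi) Q_{pi'} da
d_tv = 0.5 * np.abs(pi2 - pi).sum(axis=1)        # D_TD(pi, pi')(s)
r_star = np.abs(R).max()
bound = (2 + 2 * gamma / (1 - gamma)) * r_star * d_tv
print(np.all(g <= bound + 1e-12))                # True
```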
Another special case is when the two MDPs share only the reward function, i.e. $\mathcal{R}'=\mathcal{R}$ but $\mathcal{P}'\neq \mathcal{P}$. Writing $D_{\mathbf{TD}}(\mathcal{P},\mathcal{P}')(s,a) = \frac{1}{2}\int |\mathcal{P}'(s'|s,a) - \mathcal{P}(s'|s,a)| ds'$ for the total variation distance between the two transition kernels at $(s,a)$, we have
$$
\begin{equation}
\begin{split}
g(s) = & V_{{\color{red}{\pi'}},{\color{red}{\mathcal{M'}}}}(s) - \int \pi(a|s) \left[\mathcal{R}(s,a) + \gamma\int\mathcal{P}(s'|s,a)V_{{\color{red}{\pi'}}, {\color{red}{\mathcal{M'}}}}(s')ds'\right] da \\
= & \int \left[{\color{red}{\pi'}}(a|s) - \pi(a|s) \right] \mathcal{R}(s,a) da \\
+ & \gamma \int \left[ {\color{red}{\pi'}}(a|s) \mathcal{P}'(s'|s,a) - \pi(a|s) \mathcal{P}(s'|s,a) \right] V_{{\color{red}{\pi'}}, {\color{red}{\mathcal{M'}}}}(s') dads' \\
\leq & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \gamma \int \left[ {\color{red}{\pi'}}(a|s) \mathcal{P}'(s'|s,a) - \pi(a|s) \mathcal{P}(s'|s,a) \right] V_{{\color{red}{\pi'}}, {\color{red}{\mathcal{M'}}}}(s') dads' \\
= & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \gamma \int {\color{red}{\pi'}}(a|s) \left[ \mathcal{P}'(s'|s,a) - \mathcal{P}(s'|s,a) \right] V_{{\color{red}{\pi'}}, {\color{red}{\mathcal{M'}}}}(s') dads' \\
+ & \gamma \int \left[ {\color{red}{\pi'}}(a|s) - \pi(a|s)\right] \mathcal{P}(s'|s,a) V_{{\color{red}{\pi'}}, {\color{red}{\mathcal{M'}}}}(s') dads' \\
\leq & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \gamma \int {\color{red}{\pi'}}(a|s) | \mathcal{P}'(s'|s,a) - \mathcal{P}(s'|s,a) | \max_{s''}| V_{{\color{red}{\pi'}}, {\color{red}{\mathcal{M'}}}}(s'')| dads' \\
+ & \gamma \int | {\color{red}{\pi'}}(a|s) - \pi(a|s)| \mathcal{P}(s'|s,a) \max_{s''}| V_{{\color{red}{\pi'}}, {\color{red}{\mathcal{M'}}}}(s'')| dads' \\
\leq & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \frac{\gamma}{1-\gamma} r^* \int {\color{red}{\pi'}}(a|s) | \mathcal{P}'(s'|s,a) - \mathcal{P}(s'|s,a) | dads' \\
+ & \frac{\gamma}{1-\gamma} r^* \int | {\color{red}{\pi'}}(a|s) - \pi(a|s)| \mathcal{P}(s'|s,a) dads' \\
= & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \frac{2\gamma}{1-\gamma} r^* \mathbb{E}_{a\sim{\color{red}{\pi'}}(\cdot|s)}\left[D_{\mathbf{TD}}(\mathcal{P},\mathcal{P}')(s,a)\right]
+ \frac{2\gamma}{1-\gamma} r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) \\
\leq & \frac{2}{1-\gamma}r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \frac{2\gamma}{1-\gamma} r^* \mathbb{E}_{a\sim{\color{red}{\pi'}}(\cdot|s)}\left[D_{\mathbf{TD}}(\mathcal{P},\mathcal{P}')(s,a)\right].
\end{split}
\end{equation}
$$
This is exactly Lemma 3 in MBPO [7].
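A similar sketch checks the two-MDP bound above, $g(s) \leq \frac{2}{1-\gamma}r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \frac{2\gamma}{1-\gamma} r^* \mathbb{E}_{a\sim{\color{red}{\pi'}}}[D_{\mathbf{TD}}(\mathcal{P},\mathcal{P}')(s,a)]$, using a randomly perturbed transition kernel as $\mathcal{P}'$. Again, the whole setup is an ad hoc assumption for illustration, not MBPO's actual experiment.

```python
import numpy as np

rng = np.random.default_rng(7)
S, A, gamma = 5, 3, 0.9
P  = rng.random((S, A, S)); P  /= P.sum(-1, keepdims=True)             # true model P
P2 = P + 0.1 * rng.random((S, A, S)); P2 /= P2.sum(-1, keepdims=True)  # perturbed model P'
R = rng.random((S, A)) - 0.5
r_star = np.abs(R).max()

pi  = rng.random((S, A)); pi  /= pi.sum(1, keepdims=True)
pi2 = rng.random((S, A)); pi2 /= pi2.sum(1, keepdims=True)             # plays the role of pi'

# V_{pi', M'} under the perturbed model P'
P2_pi2 = np.einsum('sa,sat->st', pi2, P2)
V2 = np.linalg.solve(np.eye(S) - gamma * P2_pi2, (pi2 * R).sum(1))

# g(s) = V_{pi',M'}(s) - sum_a pi(a|s) [R(s,a) + gamma * sum_s' P(s'|s,a) V_{pi',M'}(s')]
g = V2 - (pi * (R + gamma * P @ V2)).sum(axis=1)

d_tv_pi = 0.5 * np.abs(pi2 - pi).sum(axis=1)                           # D_TD(pi, pi')(s)
d_tv_P = 0.5 * np.abs(P2 - P).sum(axis=-1)                             # D_TD(P, P')(s, a)
bound = (2 * r_star / (1 - gamma) * d_tv_pi
         + 2 * gamma * r_star / (1 - gamma) * (pi2 * d_tv_P).sum(axis=1))
print(np.all(g <= bound + 1e-12))                                      # True
```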
@article{ying2025arllemma,
title = "A Useful Lemma for Several RL Results",
author="Ying, Chengyang",
journal="yingchengyang.github.io",
year="2025",
url="https://yingchengyang.github.io/posts/2025-02-15-bellman-lemma/"
}
[1] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. "Policy gradient methods for reinforcement learning with function approximation." Advances in Neural Information Processing Systems, 12, 1999.
[2] Sham Kakade and John Langford. "Approximately optimal approximate reinforcement learning." In Proceedings of the Nineteenth International Conference on Machine Learning, pages 267-274, 2002.
[3] John Schulman, et al. "Trust region policy optimization." In International Conference on Machine Learning, 2015.
[4] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. "Reinforcement learning with deep energy-based policies." In International Conference on Machine Learning, pages 1352-1361. PMLR, 2017.
[5] Chengyang Ying, Xinning Zhou, Hang Su, Dong Yan, Ning Chen, and Jun Zhu. "Towards safe reinforcement learning via constraining conditional value-at-risk." arXiv preprint arXiv:2206.04436, 2022.
[6] Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Mingyan Liu, Duane Boning, and Cho-Jui Hsieh. "Robust deep reinforcement learning against adversarial perturbations on state observations." Advances in Neural Information Processing Systems, 33:21024-21037, 2020.
[7] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. "When to trust your model: Model-based policy optimization." Advances in Neural Information Processing Systems, 32, 2019.