Reinforcement learning has many classical theoretical results, such as the policy gradient theorem, the difference between the expected cumulative rewards of two policies, policy improvement, and so on. This blog introduces a unified tool for analyzing these problems.
We first introduce some basic notation. In general, reinforcement learning considers an MDP $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P}, \mathcal{R},\gamma,\rho_0)$ and a policy $\pi$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, $\mathcal{P}$ and $\mathcal{R}$ denote the transition and reward functions, $\gamma$ is the discount factor for long-horizon returns, and $\rho_0$ is the initial state distribution. Under $\mathcal{M}$ and $\pi$ we can define the discounted state distribution:
$$
\begin{equation}
d_{\mathcal{M}}^{\pi}(s) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t \mathcal{P}(s_t=s|\pi,\mathcal{M}).
\end{equation}
$$
We have the following lemma.
Lemma: For any functions $f,g$ on $\mathcal{S}$ satisfying
$$
\begin{equation}
\label{eq_bell}
f(s) = g(s) + \gamma\int \pi(a|s)\mathcal{P}(s'|s,a) f(s') dads',
\end{equation}
$$
then we have
$$
\begin{equation}
\label{eq_result}
\mathbb{E}_{s\sim\rho_0} \left[f(s)\right] = \frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} \left[g(s)\right].
\end{equation}
$$
For readability, we first look at concrete applications of this lemma, and defer its proof to Section 5.
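Before moving on to the applications, here is a minimal numerical sanity check of the lemma itself on a random tabular MDP. This is only a sketch using numpy; the setup (a 5-state, 3-action MDP) and all names such as `P`, `pi`, `rho0` are ad hoc assumptions for illustration, not anything from the references.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)    # transition kernel P(s'|s,a)
pi = rng.random((S, A));   pi /= pi.sum(-1, keepdims=True)  # policy pi(a|s)
rho0 = rng.random(S);      rho0 /= rho0.sum()               # initial distribution

# State-to-state kernel under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
P_pi = np.einsum('sa,sat->st', pi, P)

# Discounted state distribution: d = (1 - gamma) * rho0^T (I - gamma * P_pi)^{-1}
d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho0)

# Pick an arbitrary g and solve the fixed point f = g + gamma * P_pi f
g = rng.standard_normal(S)
f = np.linalg.solve(np.eye(S) - gamma * P_pi, g)

lhs = rho0 @ f                 # E_{s ~ rho0}[f(s)]
rhs = d @ g / (1 - gamma)      # (1/(1-gamma)) E_{s ~ d}[g(s)]
print(np.allclose(lhs, rhs))   # True
```

The later sketches in this post reuse the same kind of random tabular setup.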
At first glance the condition of this lemma looks a bit involved, but readers familiar with reinforcement learning will notice that it closely resembles the Bellman equation. For example, the value function $V_{\pi,\mathcal{M}}$ and the reward function $\mathcal{R}$ naturally satisfy this condition:
$$
\begin{equation}
\underbrace{V_{\pi,\mathcal{M}}(s)}_{f(s)} = \underbrace{\mathbb{E}_{a\sim\pi(\cdot|s)} [\mathcal{R}(s,a)]}_{g(s)} + \gamma\int \pi(a|s)\mathcal{P}(s'|s,a) \underbrace{V_{\pi,\mathcal{M}}(s')}_{f(s')} dads'.
\end{equation}
$$
Based on this lemma, we immediately obtain its first corollary:
$$
\begin{equation}
\begin{split}
J_{\mathcal{M}}(\pi) \triangleq & \mathbb{E}_{s\sim\rho_0} \left[V_{\pi,\mathcal{M}}(s)\right] = \mathbb{E}_{s\sim\rho_0} \left[f(s)\right] \\
= & \frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} \left[g(s)\right] = \frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} \mathbb{E}_{a\sim\pi(\cdot|s)} [\mathcal{R}(s,a)].
\end{split}
\end{equation}
$$
This conclusion is in fact rather trivial; it can be proved directly from the definitions of $V_{\pi,\mathcal{M}}$ and $d_{\mathcal{M}}^{\pi}$. Next we look at some more interesting corollaries of the lemma.
The policy gradient theorem [1] is one of the most important results in reinforcement learning. Concretely, it considers the gradient of the expected cumulative reward with respect to the policy, i.e.
$$
\begin{equation}
\nabla_{\pi} J_{\mathcal{M}}(\pi) \triangleq \nabla_{\pi} \mathbb{E}_{s\sim\rho_0} [V_{\pi,\mathcal{M}}(s)] = \mathbb{E}_{s\sim\rho_0} [\nabla_{\pi} V_{\pi,\mathcal{M}}(s)].
\end{equation}
$$
To apply the lemma, we naturally consider $f(s) = \nabla_{\pi} V_{\pi,\mathcal{M}}(s)$ and simplify
$$
\begin{equation}
\begin{split}
f(s) =& \nabla_{\pi} [V_{\pi,\mathcal{M}}(s)] = \nabla_{\pi} \left[\int \pi(a|s) Q_{\pi,\mathcal{M}}(s,a) da\right] \\
=& \underbrace{\int Q_{\pi,\mathcal{M}}(s,a) \nabla_{\pi} \pi(a|s) da}_{\mathrm{defined\ as\ } g(s)} + \int \pi(a|s) [\nabla_{\pi} Q_{\pi,\mathcal{M}}(s,a)] da \\
=& g(s) + \int \pi(a|s) \nabla_{\pi} \left[\mathcal{R}(s,a) + \gamma \int \mathcal{P}(s'|s,a) V_{\pi,\mathcal{M}}(s') ds'\right] da \\
=& g(s) + \int \pi(a|s) \left[ \gamma \int \mathcal{P}(s'|s,a) \nabla_{\pi} V_{\pi,\mathcal{M}}(s') ds'\right] da \\
=& g(s) + \gamma \int \pi(a|s) \mathcal{P}(s'|s,a) \underbrace{\nabla_{\pi} V_{\pi,\mathcal{M}}(s')}_{f(s')} da ds'.
\end{split}
\end{equation}
$$
Therefore we obtain
$$
\begin{equation}
\begin{split}
\nabla_{\pi} J_{\mathcal{M}}(\pi) = & \nabla_{\pi} \mathbb{E}_{s\sim\rho_0} [V_{\pi,\mathcal{M}}(s)] = \mathbb{E}_{s\sim\rho_0} [\nabla_{\pi} V_{\pi,\mathcal{M}}(s)] = \mathbb{E}_{s\sim\rho_0} [f(s)] \\
= & \frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} [g(s)] =\frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} \int [\nabla_{\pi} \pi(a|s)] Q_{\pi,\mathcal{M}}(s,a) da \\
= & \frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}, a\sim\pi(\cdot|s)} [Q_{\pi,\mathcal{M}}(s,a) \nabla_{\pi} \log\pi(a|s)],
\end{split}
\end{equation}
$$
which proves the policy gradient theorem [1].
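As a sanity check, the following sketch compares the gradient given by the formula above (written with respect to the logits $\theta$ of a softmax policy, so that $\nabla_\theta \log\pi_\theta$ appears) against a finite-difference gradient of $J_{\mathcal{M}}$ on a random tabular MDP. The setup and all names are ad hoc assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 4, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
R = rng.random((S, A))
rho0 = np.ones(S) / S

def softmax_pi(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def q_d_J(theta):
    pi = softmax_pi(theta)
    P_pi = np.einsum('sa,sat->st', pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * R).sum(1))
    Q = R + gamma * P @ V                       # Q(s,a) = R(s,a) + gamma * sum_s' P V
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho0)
    return pi, Q, d, rho0 @ V                   # also returns J = E_{rho0}[V]

theta = rng.standard_normal((S, A))
pi, Q, d, J = q_d_J(theta)

# Policy gradient theorem: grad_theta J = 1/(1-gamma) E_{d, pi}[Q * grad_theta log pi].
# For a softmax policy, d log pi(a|s) / d theta[s, b] = 1{a=b} - pi(b|s).
grad = np.zeros((S, A))
for s in range(S):
    for b in range(A):
        grad[s, b] = d[s] / (1 - gamma) * sum(
            pi[s, a] * Q[s, a] * ((a == b) - pi[s, b]) for a in range(A))

# Finite-difference check of the same gradient
eps, grad_fd = 1e-5, np.zeros((S, A))
for s in range(S):
    for b in range(A):
        tp, tm = theta.copy(), theta.copy()
        tp[s, b] += eps; tm[s, b] -= eps
        grad_fd[s, b] = (q_d_J(tp)[3] - q_d_J(tm)[3]) / (2 * eps)

print(np.allclose(grad, grad_fd, atol=1e-5))    # True up to numerical error
```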
A frequently considered problem is the difference between the expected cumulative rewards of different policies under different MDPs. Given another MDP ${\color{red}{\mathcal{M'}}}=(\mathcal{S},\mathcal{A},\mathcal{P}',\mathcal{R}',\gamma,\rho_0)$ and a policy ${\color{red}{\pi'}}$, where we assume ${\color{red}{\mathcal{M'}}}$ has the same initial state distribution $\rho_0$ as $\mathcal{M}$, we consider $J_{{\color{red}{\mathcal{M}'}}} ({\color{red}{\pi'}}) - J_{\mathcal{M}}(\pi)$. For ease of reading, ${\color{red}{\mathcal{M}'}}$ and ${\color{red}{\pi'}}$ are shown in red.
$$
\begin{equation}
\begin{split}
& J_{{\color{red}{\mathcal{M}'}}}({\color{red}{\pi'}}) - J_{\mathcal{M}}(\pi) = \mathbb{E}_{s\sim \rho_0} [V_{{\color{red}{\pi'}},{\color{red}{\mathcal{M}'}}}(s) - V_{\pi,\mathcal{M}}(s)].
\end{split}
\end{equation}
$$
To apply the lemma, we naturally take $f(s) = V_{{\color{red}{\pi'}},{\color{red}{\mathcal{M'}}}}(s) - V_{\pi,\mathcal{M}}(s)$ and have
$$
\begin{equation}
\begin{split}
f(s) = & V_{{\color{red}{\pi'}},{\color{red}{\mathcal{M'}}}}(s) - V_{\pi,\mathcal{M}}(s)\\
= & V_{{\color{red}{\pi'}},{\color{red}{\mathcal{M'}}}}(s)
- \int \pi(a|s) \left[\mathcal{R}(s,a) + \gamma\int\mathcal{P}(s'|s,a)V_{\pi, \mathcal{M}}(s')ds'\right]da \\
= & \underbrace{V_{{\color{red}{\pi'}},{\color{red}{\mathcal{M'}}}}(s) - \int \pi(a|s) \left[\mathcal{R}(s,a) + \gamma\int\mathcal{P}(s'|s,a)V_{{\color{red}{\pi'}}, {\color{red}{\mathcal{M'}}}}(s')ds'\right] da}_{\mathrm{defined\ as\ } g(s)} \\
+ & \gamma \int \pi(a|s) \mathcal{P}(s'|s,a)\underbrace{\left[V_{{\color{red}{\pi'}}, {\color{red}{\mathcal{M'}}}}(s') - V_{\pi, \mathcal{M}}(s')\right]}_{f(s')} dads'.
\end{split}
\end{equation}
$$
Therefore we obtain
$$
\begin{equation}
\begin{split}
& J_{{\color{red}{\mathcal{M}'}}}({\color{red}{\pi'}}) - J_{\mathcal{M}}(\pi) = \mathbb{E}_{s\sim \rho_0} [V_{{\color{red}{\pi'}},{\color{red}{\mathcal{M'}}}}(s) - V_{\pi,\mathcal{M}}(s)] = \mathbb{E}_{s\sim\rho_0} [f(s)]
= \frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} [g(s)].
\end{split}
\end{equation}
$$
In particular, when ${\color{red}{\mathcal{M'}}}=\mathcal{M}$, i.e. when we consider the difference in expected cumulative rewards between two policies in the same MDP, $g(s)$ can be simplified as:
$$
\begin{equation}
\begin{split}
\label{eq_same_mdp}
g(s) = & V_{{\color{red}{\pi'}},\mathcal{M}}(s) - \int \pi(a|s) \left[\mathcal{R}(s,a) + \gamma\int\mathcal{P}(s'|s,a)V_{{\color{red}{\pi'}}, \mathcal{M}}(s')ds'\right] da \\
= & V_{{\color{red}{\pi'}},\mathcal{M}}(s) - \int \pi(a|s) Q_{{\color{red}{\pi'}}, \mathcal{M}}(s,a) da \\
\overset{(*)}{=} & \int ({\color{red}{\pi'}}(a|s) - \pi(a|s)) Q_{{\color{red}{\pi'}}, \mathcal{M}}(s,a) da \\
= & \int ({\color{red}{\pi'}}(a|s) - \pi(a|s)) Q_{{\color{red}{\pi'}}, \mathcal{M}}(s,a) da - \underbrace{\int ({\color{red}{\pi'}}(a|s) - \pi(a|s)) V_{{\color{red}{\pi'}}, \mathcal{M}}(s) da}_{=0} \\
= & \int ({\color{red}{\pi'}}(a|s) - \pi(a|s)) A_{{\color{red}{\pi'}}, \mathcal{M}}(s,a) da \\
= & \int ({\color{red}{\pi'}}(a|s) - \pi(a|s)) A_{{\color{red}{\pi'}}, \mathcal{M}}(s,a) da - \underbrace{\int {\color{red}{\pi'}}(a|s) A_{{\color{red}{\pi'}}, \mathcal{M}}(s,a) da}_{=0} \\
= & - \int \pi(a|s) A_{{\color{red}{\pi'}}, \mathcal{M}}(s,a) da .
\end{split}
\end{equation}
$$
Therefore we obtain
$$
\begin{equation}
\begin{split}
J_{\mathcal{M}}({\color{red}{\pi'}}) - J_{\mathcal{M}}(\pi) = \frac{-1}{1-\gamma} \int d_{\mathcal{M}}^{\pi}(s) \pi(a|s) A_{{\color{red}{\pi'}}, \mathcal{M}}(s,a) dsda.
\end{split}
\end{equation}
$$
This is exactly Lemma 6.1 in [2] and Equation (2) in TRPO [3].
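As a quick numerical check of this identity (again a numpy sketch on an ad hoc random tabular MDP; all names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
R = rng.random((S, A))
rho0 = np.ones(S) / S

def evaluate(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * R).sum(1))
    Q = R + gamma * P @ V
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho0)
    return V, Q, d, rho0 @ V

pi  = rng.random((S, A)); pi  /= pi.sum(1, keepdims=True)
pi2 = rng.random((S, A)); pi2 /= pi2.sum(1, keepdims=True)   # plays the role of pi'

V2, Q2, _, J2 = evaluate(pi2)
_,  _,  d1, J1 = evaluate(pi)
A2 = Q2 - V2[:, None]                                        # advantage of pi'

lhs = J2 - J1
rhs = -np.sum(d1[:, None] * pi * A2) / (1 - gamma)           # -1/(1-gamma) E_{d^pi, pi}[A_{pi'}]
print(np.allclose(lhs, rhs))                                 # True
```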
Meanwhile, based on Equation $(*)$ we can also prove policy improvement (Policy Improvement) for the greedy policy. We have
$$
\begin{equation}
\begin{split}
J_{\mathcal{M}}({\color{red}{\pi'}}) - J_{\mathcal{M}}(\pi) = \frac{1}{1-\gamma} \int d_{\mathcal{M}}^{\pi}(s) ({\color{red}{\pi'}}(a|s) - \pi(a|s)) Q_{{\color{red}{\pi'}}, \mathcal{M}}(s,a) ds da,
\end{split}
\end{equation}
$$
and swapping the roles of $\pi$ and ${\color{red}{\pi'}}$ gives
$$
\begin{equation}
\begin{split}
J_{\mathcal{M}}(\pi) - J_{\mathcal{M}}({\color{red}{\pi'}}) = \frac{1}{1-\gamma} \int d_{\mathcal{M}}^{{\color{red}{\pi'}}}(s) (\pi(a|s) - {\color{red}{\pi'}}(a|s)) Q_{\pi, \mathcal{M}}(s,a) ds da \leq 0,
\end{split}
\end{equation}
$$
where the last inequality holds because ${\color{red}{\pi'}}(s) = \arg\max_a Q_{\pi,\mathcal{M}}(s,a)$, so for every state $s$ we have $\int (\pi(a|s) - {\color{red}{\pi'}}(a|s)) Q_{\pi,\mathcal{M}}(s,a) da = \mathbb{E}_{a\sim\pi(\cdot|s)}[Q_{\pi,\mathcal{M}}(s,a)] - \max_a Q_{\pi,\mathcal{M}}(s,a) \leq 0$.
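The sketch below runs greedy policy improvement on an ad hoc random tabular MDP and checks that $J$ never decreases, as the argument above predicts. This is only an illustrative assumption-laden setup, not the method of any particular reference.

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
R = rng.random((S, A))
rho0 = np.ones(S) / S

def evaluate(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * R).sum(1))
    return R + gamma * P @ V, rho0 @ V           # Q_pi and J(pi)

pi = np.ones((S, A)) / A
returns = []
for _ in range(10):
    Q, J = evaluate(pi)
    returns.append(J)
    greedy = Q.argmax(axis=1)                    # pi'(s) = argmax_a Q_pi(s, a)
    pi = np.eye(A)[greedy]                       # deterministic greedy policy
print(np.all(np.diff(returns) >= -1e-10))        # policy improvement: J never decreases
```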
Beyond this, we can also derive some upper bounds on $J_{{\color{red}{\mathcal{M'}}}}({\color{red}{\pi'}}) - J_{\mathcal{M}}(\pi)$; to keep the exposition flowing, we defer them to Section 6.
Max-Entropy RL is an important family of problems in reinforcement learning, including algorithms such as Soft Q-Learning [4] and SAC. These methods encourage exploration while maximizing the reward, i.e. the objective is
$$
\begin{equation}
\begin{split}
J_{\mathrm{ME}}(\pi) \triangleq & \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t \left(\mathcal{R}(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t)) \right)\right] = J_{\mathcal{M}}(\pi) + \frac{\alpha}{1-\gamma} \mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} \left[\mathcal{H}(\pi(\cdot|s)) \right].
\end{split}
\end{equation}
$$
We define the soft Q function as
$$
\begin{equation}
\begin{split}
Q'_{\pi,\mathcal{M}}(s, a) & = \mathbb{E}\left[ \mathcal{R}(s, a) + \sum_{i=1}^{\infty} \gamma^i \left(\mathcal{R}(s_i, a_i) + \alpha \mathcal{H}(\pi(\cdot|s_i)) \right) \right].
\end{split}
\end{equation}
$$
Each policy update takes the form
$$
\begin{equation}
{\color{red}{\pi'}}(\cdot|s) \propto \exp (Q'_{\pi,\mathcal{M}}(s,\cdot) / \alpha).
\end{equation}
$$
We wish to show that this update achieves policy improvement, i.e. $J_{\mathrm{ME}}({\color{red}{\pi'}}) \ge J_{\mathrm{ME}}(\pi)$. First, to simplify notation, we define
$$
\begin{equation}
\begin{split}
F(\mu, Q, s) = \mathbb{E}_{a\sim\mu(\cdot|s)} \left[ Q(s, a)\right] + \alpha \mathcal{H}(\mu(\cdot|s)).
\end{split}
\end{equation}
$$
We can show that
$$
\begin{equation}
\begin{split}
Q'_{\pi,\mathcal{M}}(s, a) = &\mathcal{R}(s, a) + \alpha \gamma \mathbb{E}_{s_1\sim\mathcal{P}(\cdot|s,a)} \left[\mathcal{H}(\pi(\cdot|s_1)) \right] + \gamma \mathbb{E}_{s_1,a_1} \left[ Q'_{\pi,\mathcal{M}}(s_1, a_1)\right] \\
= & \mathcal{R}(s, a) + \gamma \mathbb{E}_{s_1} [F(\pi, Q'_{\pi,\mathcal{M}}, s_1)].
\end{split}
\end{equation}
$$
Meanwhile, we clearly have
$$
\begin{equation}
\begin{split}
J_{\mathrm{ME}}({\color{red}{\pi'}}) - J_{\mathrm{ME}}(\pi) = & \mathbb{E}_{s\sim\rho_0} \left[ F({\color{red}{\pi'}}, Q'_{{\color{red}{\pi'}},\mathcal{M}}, s) - F(\pi, Q'_{\pi,\mathcal{M}}, s)\right].
\end{split}
\end{equation}
$$
Therefore, to apply the lemma, we naturally define
$$
\begin{equation}
\begin{split}
f(s) \triangleq & F({\color{red}{\pi'}}, Q'_{{\color{red}{\pi'}},\mathcal{M}}, s) - F(\pi, Q'_{\pi,\mathcal{M}}, s) \\
=& \underbrace{F({\color{red}{\pi'}}, Q'_{\pi,\mathcal{M}}, s) - F(\pi, Q'_{\pi,\mathcal{M}}, s)}_{\mathrm{defined\ as\ } g(s)} + F({\color{red}{\pi'}}, Q'_{{\color{red}{\pi'}},\mathcal{M}}, s) - F({\color{red}{\pi'}}, Q'_{\pi,\mathcal{M}}, s) \\
=& g(s) + \mathbb{E}_{a\sim{\color{red}{\pi'}}(\cdot|s)} [ Q'_{{\color{red}{\pi'}},\mathcal{M}}(s, a) - Q'_{\pi,\mathcal{M}}(s, a) ] \\
=& g(s) + \gamma\mathbb{E}_{a\sim{\color{red}{\pi'}}(\cdot|s), s_1} [\underbrace{F({\color{red}{\pi'}}, Q'_{{\color{red}{\pi'}},\mathcal{M}}, s_1) - F(\pi, Q'_{\pi,\mathcal{M}}, s_1)}_{f(s_1)}].
\end{split}
\end{equation}
$$
Applying the lemma (with policy ${\color{red}{\pi'}}$), we obtain
$$
\begin{equation}
\begin{split}
J_{\mathrm{ME}}({\color{red}{\pi'}}) - J_{\mathrm{ME}}(\pi) = & \mathbb{E}_{s\sim\rho_0} \left[ F({\color{red}{\pi'}}, Q'_{{\color{red}{\pi'}},\mathcal{M}}, s) - F(\pi, Q'_{\pi,\mathcal{M}}, s)\right] = \mathbb{E}_{s\sim\rho_0} [f(s)] \\
= & \frac{1}{1-\gamma} \mathbb{E}_{s\sim d_{\mathcal{M}}^{{\color{red}{\pi'}}}}[g(s)] \\
= & \frac{1}{1-\gamma} \mathbb{E}_{s\sim d_{\mathcal{M}}^{{\color{red}{\pi'}}}}[F({\color{red}{\pi'}}, Q'_{\pi,\mathcal{M}}, s) - F(\pi, Q'_{\pi,\mathcal{M}}, s)].
\end{split}
\end{equation}
$$
Finally, we show that $F({\color{red}{\pi'}}, Q'_{\pi,\mathcal{M}}, s) \ge F(\pi, Q'_{\pi,\mathcal{M}}, s)$. Viewing $F$ as a function of $\mu$ under the constraint $\int \mu(a|s) da = 1$, the first-order optimality condition shows that the maximizer $\mu^*$ satisfies
$$
\begin{equation}
\begin{split}
Q(s, a) = \alpha \log \mu^*(a|s) + b \alpha,
\end{split}
\end{equation}
$$
where $b$ is a constant, so $\mu^*(a|s) = e^{\frac{Q(s, a)}{\alpha} - b}$. Since $\int \mu^*(a|s) da = 1$, we have
$$
\begin{equation}
\begin{split}
b =& \log\int e^{\frac{Q(s, a)}{\alpha}} da, \quad \mu^*(a|s) = \frac{ e^{\frac{Q(s, a)}{\alpha}}}{\int e^{\frac{Q(s, a')}{\alpha}} da'} \propto e^{\frac{Q(s, a)}{\alpha}}.
\end{split}
\end{equation}
$$
In other words, ${\color{red}{\pi'}}(\cdot|s) \propto \exp(Q'_{\pi,\mathcal{M}}(s,\cdot)/\alpha)$ is exactly the maximizer of $\mu \mapsto F(\mu, Q'_{\pi,\mathcal{M}}, s)$, hence
$$
\begin{equation}
F({\color{red}{\pi'}}, Q'_{\pi,\mathcal{M}}, s) \ge F(\pi, Q'_{\pi,\mathcal{M}}, s).
\end{equation}
$$
This proves $J_{\mathrm{ME}}({\color{red}{\pi'}}) \ge J_{\mathrm{ME}}(\pi)$, i.e. the policy improvement theorem for Max-Entropy RL (Theorem 4 in [4]).
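The sketch below runs tabular soft policy iteration with the update ${\color{red}{\pi'}}(\cdot|s) \propto \exp(Q'_{\pi,\mathcal{M}}(s,\cdot)/\alpha)$ and checks that $J_{\mathrm{ME}}$ never decreases. It is only an illustrative numpy sketch on an ad hoc random MDP, not an implementation of Soft Q-Learning or SAC.

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, gamma, alpha = 5, 3, 0.9, 0.5
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
R = rng.random((S, A))
rho0 = np.ones(S) / S

def soft_eval(pi):
    # Soft policy evaluation in closed form:
    # F(s) = E_{a~pi}[Q'(s,a)] + alpha*H(pi(.|s)) satisfies F = r_pi + alpha*H + gamma*P_pi F
    P_pi = np.einsum('sa,sat->st', pi, P)
    H = -(pi * np.log(pi)).sum(axis=1)
    F = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * R).sum(1) + alpha * H)
    Q = R + gamma * P @ F            # Q'(s,a) = R(s,a) + gamma * E_{s1}[F(pi, Q', s1)]
    return Q, rho0 @ F               # soft Q and J_ME(pi) = E_{rho0}[F(pi, Q', s)]

pi = np.ones((S, A)) / A
objectives = []
for _ in range(10):
    Q, J_me = soft_eval(pi)
    objectives.append(J_me)
    logits = Q / alpha
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)          # pi'(.|s) ∝ exp(Q'_pi(s,.) / alpha)
print(np.all(np.diff(objectives) >= -1e-10))     # soft policy improvement holds
```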
In this section we give the proof of the lemma (the core idea of the proof follows [5]).
First, we have
$$
\begin{equation}
\begin{split}
\mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} [f(s)] =& \mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} [g(s)] + \mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} \left[ \gamma\int \pi(a|s)\mathcal{P}(s'|s,a) f(s') da ds' \right] \\
=& \mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} [g(s)] + \int f(s') \left[\int \gamma d_{\mathcal{M}}^{\pi}(s) \pi(a|s)\mathcal{P}(s'|s,a) ds da \right] ds' \\
\overset{1}{=}& \mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} [g(s)] + \int f(s') \left[d_{\mathcal{M}}^{\pi}(s') - (1-\gamma)\rho_0(s') \right] ds' \\
=& \mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} [g(s)] + \int f(s) \left[d_{\mathcal{M}}^{\pi}(s) - (1-\gamma)\rho_0(s) \right] ds \\
\end{split}
\end{equation}
$$
Here the equality marked $1$ relies on the following identity (it can be verified by directly plugging in the definition of $d_{\mathcal{M}}^{\pi}$, or see Lemma 1 in [5]):
$$
\begin{equation}
\begin{split}
d_{\mathcal{M}}^{\pi}(s) - (1-\gamma)\rho_0(s)
= \gamma\int d_{\mathcal{M}}^{\pi}(s')\pi(a|s')\mathcal{P}(s|s', a) da ds'.
\end{split}
\end{equation}
$$
Therefore we obtain
$$
\begin{equation}
\begin{split}
\int f(s) (1-\gamma)\rho_0(s) ds = \mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} [g(s)].
\end{split}
\end{equation}
$$
That is, $\mathbb{E}_{s\sim\rho_0} \left[f(s)\right] = \frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{\mathcal{M}}^{\pi}} \left[g(s)\right]$, which completes the proof.
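As a tiny numerical check of the identity used in the step marked $1$ (again a numpy sketch on an ad hoc random tabular MDP):

```python
import numpy as np

rng = np.random.default_rng(5)
S, A, gamma = 6, 2, 0.95
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
pi = rng.random((S, A));   pi /= pi.sum(-1, keepdims=True)
rho0 = rng.random(S);      rho0 /= rho0.sum()

P_pi = np.einsum('sa,sat->st', pi, P)
d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho0)

# d(s) - (1 - gamma) rho0(s) = gamma * sum_{s'} d(s') * P_pi(s' -> s)
print(np.allclose(d - (1 - gamma) * rho0, gamma * P_pi.T @ d))   # True
```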
Finally, building on the derivation of $J_{\mathcal{M}'}(\pi') - J_{\mathcal{M}}(\pi)$ in Section 3, we further analyze some of its upper bounds.
When ${\color{red}{\mathcal{M'}}}=\mathcal{M}$, based on Equation $(*)$ in Section 3, we write $r^* = \max_{s,a} |\mathcal{R}(s,a)|$, let $D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) = \frac{1}{2}\int |{\color{red}{\pi'}}(a|s) - \pi(a|s)|da$ denote the total variation distance between the two policies at state $s$, and define $\hat{V}_{\mathcal{M},{\color{red}{\pi'}}} = \max_s V_{{\color{red}{\pi'}},\mathcal{M}}(s) - \min_s V_{{\color{red}{\pi'}},\mathcal{M}}(s)$, $\bar{V}_{\mathcal{M},{\color{red}{\pi'}}} = \frac{1}{2}(\max_s V_{{\color{red}{\pi'}},\mathcal{M}}(s) + \min_s V_{{\color{red}{\pi'}},\mathcal{M}}(s))$. Then:
$$
\begin{equation}
\begin{split}
g(s)
= & \int ({\color{red}{\pi'}}(a|s) - \pi(a|s)) Q_{{\color{red}{\pi'}}, \mathcal{M}}(s,a) da \\
= & \int ({\color{red}{\pi'}}(a|s) - \pi(a|s)) \mathcal{R}(s,a) da + \gamma \int ({\color{red}{\pi'}}(a|s) - \pi(a|s)) \mathcal{P}(s'|s,a) V_{{\color{red}{\pi'}}, \mathcal{M}}(s') dads' \\
\leq & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \gamma \int ({\color{red}{\pi'}}(a|s) - \pi(a|s)) \mathcal{P}(s'|s,a) V_{{\color{red}{\pi'}}, \mathcal{M}}(s') dads' \\
= & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \gamma \int ({\color{red}{\pi'}}(a|s) - \pi(a|s)) \mathcal{P}(s'|s,a) (V_{{\color{red}{\pi'}}, \mathcal{M}}(s') - \bar{V}_{\mathcal{M},{\color{red}{\pi'}}}) dads' \\
\leq & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \gamma \int |{\color{red}{\pi'}}(a|s) - \pi(a|s)| \mathcal{P}(s'|s,a) \frac{\hat{V}_{\mathcal{M},{\color{red}{\pi'}}}}{2} dads' \\
= & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \gamma \int |{\color{red}{\pi'}}(a|s) - \pi(a|s)| \frac{\hat{V}_{\mathcal{M},{\color{red}{\pi'}}}}{2} da \\
\overset{(*1)}{\leq} & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \gamma D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) \hat{V}_{\mathcal{M},{\color{red}{\pi'}}} \\
\leq & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \gamma D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) 2\max_s|V_{{\color{red}{\pi'}},\mathcal{M}}(s)| \\
\leq & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + 2\gamma D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) \frac{1}{1-\gamma}r^* \\
\overset{(*2)}{=} & \left(2 + \frac{2\gamma}{1-\gamma}\right) r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s).
\end{split}
\end{equation}
$$
Here Equation $(*1)$ corresponds to Theorem 2 in [5], and Equation $(*2)$ corresponds to Theorem 5 in [6].
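The following sketch checks the pointwise bound above, $g(s) \leq \left(2 + \frac{2\gamma}{1-\gamma}\right) r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s)$, on an ad hoc random tabular MDP (numpy only; all names are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(6)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)
R = rng.random((S, A)) - 0.5                     # rewards in [-0.5, 0.5]

def q_of(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * R).sum(1))
    return R + gamma * P @ V                     # Q_pi(s, a)

pi  = rng.random((S, A)); pi  /= pi.sum(1, keepdims=True)
pi2 = rng.random((S, A)); pi2 /= pi2.sum(1, keepdims=True)   # plays the role of pi'

g = ((pi2 - pi) * q_of(pi2)).sum(axis=1)         # g(s) = int (pi' - pi) Q_{pi'} da
d_tv = 0.5 * np.abs(pi2 - pi).sum(axis=1)        # D_TD(pi, pi')(s)
r_star = np.abs(R).max()
bound = (2 + 2 * gamma / (1 - gamma)) * r_star * d_tv
print(np.all(g <= bound + 1e-12))                # True
```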
Another special case is when the two MDPs share only the reward function, i.e. $\mathcal{R}'=\mathcal{R}$ but $\mathcal{P}'\neq \mathcal{P}$. Writing $D_{\mathbf{TD}}(\mathcal{P},\mathcal{P}')(s,a) = \frac{1}{2}\int |\mathcal{P}'(s'|s,a) - \mathcal{P}(s'|s,a)| ds'$ for the total variation distance between the two transition kernels at $(s,a)$, we have
$$
\begin{equation}
\begin{split}
g(s) = & V_{{\color{red}{\pi'}},{\color{red}{\mathcal{M'}}}}(s) - \int \pi(a|s) \left[\mathcal{R}(s,a) + \gamma\int\mathcal{P}(s'|s,a)V_{{\color{red}{\pi'}}, {\color{red}{\mathcal{M'}}}}(s')ds'\right] da \\
= & \int \left[{\color{red}{\pi'}}(a|s) - \pi(a|s) \right] \mathcal{R}(s,a) da \\
+ & \gamma \int \left[ {\color{red}{\pi'}}(a|s) \mathcal{P}'(s'|s,a) - \pi(a|s) \mathcal{P}(s'|s,a) \right] V_{{\color{red}{\pi'}}, {\color{red}{\mathcal{M'}}}}(s') dads' \\
\leq & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \gamma \int \left[ {\color{red}{\pi'}}(a|s) \mathcal{P}'(s'|s,a) - \pi(a|s) \mathcal{P}(s'|s,a) \right] V_{{\color{red}{\pi'}}, {\color{red}{\mathcal{M'}}}}(s') dads' \\
= & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \gamma \int {\color{red}{\pi'}}(a|s) \left[ \mathcal{P}'(s'|s,a) - \mathcal{P}(s'|s,a) \right] V_{{\color{red}{\pi'}}, {\color{red}{\mathcal{M'}}}}(s') dads' \\
+ & \gamma \int \left[ {\color{red}{\pi'}}(a|s) - \pi(a|s)\right] \mathcal{P}(s'|s,a) V_{{\color{red}{\pi'}}, {\color{red}{\mathcal{M'}}}}(s') dads' \\
\leq & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \gamma \int {\color{red}{\pi'}}(a|s) | \mathcal{P}'(s'|s,a) - \mathcal{P}(s'|s,a) | \max_{s''}| V_{{\color{red}{\pi'}}, {\color{red}{\mathcal{M'}}}}(s'')| dads' \\
+ & \gamma \int | {\color{red}{\pi'}}(a|s) - \pi(a|s)| \mathcal{P}(s'|s,a) \max_{s''}| V_{{\color{red}{\pi'}}, {\color{red}{\mathcal{M'}}}}(s'')| dads' \\
\leq & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \frac{\gamma}{1-\gamma} r^* \int {\color{red}{\pi'}}(a|s) | \mathcal{P}'(s'|s,a) - \mathcal{P}(s'|s,a) | dads' \\
+ & \frac{\gamma}{1-\gamma} r^* \int | {\color{red}{\pi'}}(a|s) - \pi(a|s)| \mathcal{P}(s'|s,a) dads' \\
= & 2r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \frac{2\gamma}{1-\gamma} r^* \mathbb{E}_{a\sim{\color{red}{\pi'}}(\cdot|s)}\left[D_{\mathbf{TD}}(\mathcal{P},\mathcal{P}')(s,a)\right]
+ \frac{2\gamma}{1-\gamma} r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) \\
\leq & \frac{2}{1-\gamma}r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \frac{2\gamma}{1-\gamma} r^* \mathbb{E}_{a\sim{\color{red}{\pi'}}(\cdot|s)}\left[D_{\mathbf{TD}}(\mathcal{P},\mathcal{P}')(s,a)\right].
\end{split}
\end{equation}
$$
This is exactly Lemma 3 in MBPO [7].
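A similar sketch checks the two-MDP bound above, $g(s) \leq \frac{2}{1-\gamma}r^* D_{\mathbf{TD}}(\pi,{\color{red}{\pi'}})(s) + \frac{2\gamma}{1-\gamma} r^* \mathbb{E}_{a\sim{\color{red}{\pi'}}}[D_{\mathbf{TD}}(\mathcal{P},\mathcal{P}')(s,a)]$, using a randomly perturbed transition kernel as $\mathcal{P}'$. Again, the whole setup is an ad hoc assumption for illustration, not MBPO's actual experiment.

```python
import numpy as np

rng = np.random.default_rng(7)
S, A, gamma = 5, 3, 0.9
P  = rng.random((S, A, S)); P  /= P.sum(-1, keepdims=True)             # true model P
P2 = P + 0.1 * rng.random((S, A, S)); P2 /= P2.sum(-1, keepdims=True)  # perturbed model P'
R = rng.random((S, A)) - 0.5
r_star = np.abs(R).max()

pi  = rng.random((S, A)); pi  /= pi.sum(1, keepdims=True)
pi2 = rng.random((S, A)); pi2 /= pi2.sum(1, keepdims=True)             # plays the role of pi'

# V_{pi', M'} under the perturbed model P'
P2_pi2 = np.einsum('sa,sat->st', pi2, P2)
V2 = np.linalg.solve(np.eye(S) - gamma * P2_pi2, (pi2 * R).sum(1))

# g(s) = V_{pi',M'}(s) - sum_a pi(a|s) [R(s,a) + gamma * sum_s' P(s'|s,a) V_{pi',M'}(s')]
g = V2 - (pi * (R + gamma * P @ V2)).sum(axis=1)

d_tv_pi = 0.5 * np.abs(pi2 - pi).sum(axis=1)                           # D_TD(pi, pi')(s)
d_tv_P = 0.5 * np.abs(P2 - P).sum(axis=-1)                             # D_TD(P, P')(s, a)
bound = (2 * r_star / (1 - gamma) * d_tv_pi
         + 2 * gamma * r_star / (1 - gamma) * (pi2 * d_tv_P).sum(axis=1))
print(np.all(g <= bound + 1e-12))                                      # True
```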
@article{ying2025arllemma,
title = "A Useful Lemma for Several RL Results",
author="Ying, Chengyang",
journal="yingchengyang.github.io",
year="2025",
url="https://yingchengyang.github.io/posts/2025-02-15-bellman-lemma/"
}
[1] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. "Policy gradient methods for reinforcement learning with function approximation." Advances in Neural Information Processing Systems, 12, 1999.
[2] Sham Kakade and John Langford. "Approximately optimal approximate reinforcement learning." In Proceedings of the Nineteenth International Conference on Machine Learning, pages 267-274, 2002.
[3] John Schulman, et al. "Trust region policy optimization." In International Conference on Machine Learning, 2015.
[4] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. "Reinforcement learning with deep energy-based policies." In International Conference on Machine Learning, pages 1352-1361. PMLR, 2017.
[5] Chengyang Ying, Xinning Zhou, Hang Su, Dong Yan, Ning Chen, and Jun Zhu. "Towards safe reinforcement learning via constraining conditional value-at-risk." arXiv preprint arXiv:2206.04436, 2022.
[6] Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Mingyan Liu, Duane Boning, and Cho-Jui Hsieh. "Robust deep reinforcement learning against adversarial perturbations on state observations." Advances in Neural Information Processing Systems, 33:21024-21037, 2020.
[7] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. "When to trust your model: Model-based policy optimization." Advances in Neural Information Processing Systems, 32, 2019.