Definition of Reinforcement Learning
Policy function \(\pi(s,a)\): the probability that the agent takes action \(a\) in state \(s\).
A deterministic policy is one for which, in any given state \(s\), exactly one action \(a\) has probability \(\pi(s,a)=1\); we then write \(a=\pi(s)\).
Return \(G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots\): the discounted sum of rewards collected from time \(t\) onward.
Value function \(V_\pi(s)=\mathbb E_\pi[G_t|S_t=s]\): the expected return obtained when the agent is in state \(s\) at time \(t\) and acts according to policy \(\pi\).
Action-value function \(q_\pi(s,a)=\mathbb E_\pi[G_t|S_t=s,A_t=a]\): the expected return obtained when the agent is in state \(s\) at time \(t\), takes action \(a\), and thereafter acts according to policy \(\pi\).
The reinforcement learning problem can thus be cast as a policy learning problem: given a Markov decision process \(MDP=(S,A,P,R,\gamma)\), learn an optimal policy \(\pi^*\) such that \(V_{\pi^*}(s)=\max_\pi V_\pi(s)\) for all \(s\in S\). A concrete tabular representation of such an MDP is sketched below.
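To make these definitions concrete, here is a minimal sketch (not from the original notes) of a tabular \(MDP=(S,A,P,R,\gamma)\) together with a stochastic and a deterministic policy. The two-state, two-action MDP, the arrays `P`, `R`, `pi`, and the value of `gamma` are all hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical 2-state, 2-action tabular MDP = (S, A, P, R, gamma).
n_states, n_actions = 2, 2

# P[s, a, s'] = probability of landing in s' after taking action a in state s.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions out of state 0
    [[0.5, 0.5], [0.0, 1.0]],   # transitions out of state 1
])

# R[s, a, s'] = immediate reward R(s, a, s').
R = np.array([
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.5, 0.5], [0.0, 1.0]],
])

gamma = 0.9  # discount factor

# Stochastic policy: pi[s, a] = pi(s, a); each row sums to 1.
pi = np.array([
    [0.6, 0.4],
    [0.3, 0.7],
])

# A deterministic policy is the special case where each row is one-hot,
# i.e. a = pi(s) is the unique action taken with probability 1.
pi_det = np.eye(n_actions)[[1, 0]]   # pi(0) = action 1, pi(1) = action 0
```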
Bellman equations:
\[
\begin{array}{ll}
V_{\pi}(s)&=\mathbb E_\pi[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots|S_t=s]\\
&=\mathbb E_{a\sim\pi(s,\cdot)}[\mathbb E_\pi[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots|S_t=s,A_t=a]]\\
&=\sum\limits_{a\in A}\pi(s,a)q_\pi(s,a)
\end{array}
\]
\[
\begin{array}{ll}
q_\pi(s,a)&=\mathbb E_\pi[R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\cdots|S_t=s,A_t=a]\\
&=\mathbb E_{s'\sim P(\cdot|s,a)}[R(s,a,s')+\gamma\mathbb E_\pi[R_{t+2}+\gamma R_{t+3}+\cdots|S_{t+1}=s']]\\
&=\sum_{s'\in S}P(s'|s,a)[R(s,a,s')+\gamma V_\pi(s')]
\end{array}
\]
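Before combining the two identities, they can be checked numerically on the hypothetical MDP sketched above: the second identity turns \(V_\pi\) into \(q_\pi\), and averaging \(q_\pi\) under the policy recovers \(V_\pi\). The helper names `q_from_v` and `v_from_q` are mine, not from the original notes.

```python
def q_from_v(V, P, R, gamma):
    """q_pi(s,a) = sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V_pi(s')]."""
    return np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])

def v_from_q(q, pi):
    """V_pi(s) = sum_a pi(s,a) * q_pi(s,a)."""
    return np.einsum("sa,sa->s", pi, q)
```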
Substituting each identity into the other yields the Bellman equations:
\[
V_\pi(s)=\mathbb E_{a\sim\pi(s,\cdot)}\mathbb E_{s'\sim P(\cdot|s,a)}[R(s,a,s')+\gamma V_\pi(s')]
\]
\[
q_\pi(s,a)=\mathbb E_{s'\sim P(\cdot|s,a)}[R(s,a,s')+\gamma\mathbb E_{a'\sim\pi(s',\cdot)}[q_\pi(s',a')]]
\]
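Because the substituted equation expresses \(V_\pi\) as a fixed point of its own right-hand side, a policy can be evaluated by iterating that right-hand side until it stops changing. The sketch below does exactly this on the hypothetical arrays from the earlier snippets; it is one straightforward way to apply the Bellman expectation equation, not a prescribed algorithm from the notes, and the tolerance and iteration cap are arbitrary.

```python
def policy_evaluation(P, R, gamma, pi, tol=1e-8, max_iter=10_000):
    """Iterate V(s) <- sum_a pi(s,a) sum_{s'} P(s'|s,a) [R(s,a,s') + gamma V(s')]."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])   # Bellman backup for q_pi
        V_new = np.einsum("sa,sa->s", pi, q)                            # average over the policy
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new

V_pi = policy_evaluation(P, R, gamma, pi)
print(V_pi)  # expected discounted return of the hypothetical policy in each state
```

Since \(\gamma<1\), this iteration is a contraction in the max norm, so it converges to the unique \(V_\pi\).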