Deterministic policy gradient (DPG) is a variation of A2C, but it is off-policy. In A2C, the actor estimates a stochastic policy, either as a probability distribution over discrete actions or as the parameters of a normal distribution over continuous ones. DPG also belongs to the A2C family, but its policy is deterministic: it maps a state directly to an action. This makes it possible to apply the chain rule to maximize the Q-value.
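For contrast, here is a minimal sketch of an A2C-style continuous actor head (assuming PyTorch; the class name and layer sizes are illustrative), which outputs distribution parameters rather than a concrete action:

```python
import torch
import torch.nn as nn

class StochasticActor(nn.Module):
    """A2C-style actor for continuous actions: outputs the parameters
    (mean and std) of a normal distribution, which is then sampled."""
    def __init__(self, obs_size, act_size, hid=128):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(obs_size, hid), nn.ReLU())
        self.mu = nn.Linear(hid, act_size)                  # mean of the Gaussian
        self.log_std = nn.Parameter(torch.zeros(act_size))  # learned log std

    def forward(self, state):
        h = self.base(state)
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())
```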
DPG has two components. The first is the actor. In a continuous action domain, every action is a number (or a vector of numbers), so the actor network takes the state as input and outputs N values, one for each action dimension. This mapping is deterministic. The second is the critic, which estimates the Q-value, i.e., the discounted return of the action taken in some state, Q(s, a). Its inputs are the state and the action (a vector), and it outputs a single number, the Q-value.
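A minimal sketch of the two networks (assuming PyTorch; the hidden sizes and the Tanh output squashing are illustrative choices):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic actor: state -> N action values (one per action dimension)."""
    def __init__(self, obs_size, act_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, act_size), nn.Tanh(),  # squash into the action range
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Critic: (state, action) -> a single Q(s, a) value."""
    def __init__(self, obs_size, act_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size + act_size, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))
```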
Now, we can substitute the actor function into the critic and get $Q(s, \mu(s))$, which is the value we are interested in maximizing in the first place. It depends not only on the state but also on the parameters of the actor and critic networks. At every step of optimization, we want to change the actor's weights to improve the total reward. Expressed mathematically, the policy gradient is $\nabla_{a} Q(s, a)\big|_{a=\mu(s)} \, \nabla_{\theta_{\mu}} \mu(s)$.
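In code, this gradient is obtained simply by backpropagating through the critic into the actor. A sketch of the actor update (assuming the networks above, an optimizer `actor_opt` over the actor's parameters, and a batch of states `states_v`):

```python
actor_opt.zero_grad()
actions_v = actor(states_v)                       # a = mu(s), differentiable in theta_mu
actor_loss = -critic(states_v, actions_v).mean()  # maximizing Q == minimizing -Q
actor_loss.backward()                             # chain rule: dQ/da * d mu/d theta_mu
actor_opt.step()                                  # only the actor's weights are stepped here
```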
Note that despite both A2C and DDPG belonging to the A2C family, the critic is used in different ways. In A2C, the critic serves as a baseline for calculating the advantage, which improves stability. In DDPG, since the policy is deterministic, we can propagate the gradient of Q, obtained from the critic, all the way back to the actor's weights, so the whole system is end-to-end differentiable and can be trained with SGD. To update the critic network, the Bellman equation is used to construct the target for Q(s, a) and the MSE objective is minimized.
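A sketch of that critic update (assuming a replay-buffer batch of tensors `states_v`, `actions_v`, `rewards_v`, `dones_v`, `next_states_v`, a discount `gamma`, an optimizer `critic_opt`, and target copies `tgt_actor`/`tgt_critic` of the networks, which DDPG keeps to stabilize the Bellman targets):

```python
import torch
import torch.nn.functional as F

critic_opt.zero_grad()
q_v = critic(states_v, actions_v)                        # current estimate Q(s, a)
with torch.no_grad():                                    # targets are not differentiated
    next_act_v = tgt_actor(next_states_v)                # a' = mu(s') from the target actor
    q_next_v = tgt_critic(next_states_v, next_act_v)     # Q(s', a') from the target critic
    q_next_v[dones_v] = 0.0                              # no bootstrapping at episode end
    q_target_v = rewards_v.unsqueeze(-1) + gamma * q_next_v  # Bellman target
critic_loss = F.mse_loss(q_v, q_target_v)                # MSE objective
critic_loss.backward()
critic_opt.step()
```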
The idea of DDPG is quite intuitive: the critic is updated much as in A2C, and the actor is updated in a way that maximizes the critic's output. Additionally, the method is off-policy, so it can use a replay buffer to improve sample efficiency, just as DQN does.
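A minimal replay buffer sketch (assuming NumPy and PyTorch; the class name and uniform sampling are illustrative):

```python
import random
from collections import deque

import numpy as np
import torch

class ReplayBuffer:
    """Stores transitions gathered by any (older) policy and samples them
    uniformly, which is what enables off-policy reuse of experience."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, done, next_state):
        self.buffer.append((state, action, reward, done, next_state))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, dones, next_states = zip(*batch)
        as_f32 = lambda x: torch.as_tensor(np.asarray(x), dtype=torch.float32)
        return (as_f32(states), as_f32(actions), as_f32(rewards),
                torch.as_tensor(dones, dtype=torch.bool), as_f32(next_states))
```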
A result of training with DDPG is shown here.