Double DQN: overview and Python source code analysis

python source code 2023. 9. 20. 21:56


The theory is already eight years old, but ever since Google DeepMind published the paper it has been applied in a wide range of fields, and it still is.

- This is a very well-known graph. It illustrates overestimation: my own training scores showed the same pattern, and after a lot of head-scratching I started looking for ways to fix it.
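To make the overestimation concrete, here is a minimal sketch of where the bias comes from. It is not taken from the paper or from the linked code: the function name `max_bias_demo` and every parameter are hypothetical, and the setup (all true action values equal to 0, Gaussian estimation noise) is an assumption chosen only to make the effect visible.

```python
import random


def max_bias_demo(n_actions=10, noise_std=1.0, n_trials=20000, seed=0):
    """Compare single and double estimators when every true action value is 0."""
    rng = random.Random(seed)
    single_sum = 0.0
    double_sum = 0.0
    for _ in range(n_trials):
        # Two independent noisy estimates of the same all-zero action values.
        est_a = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        est_b = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        # Single estimator: select and evaluate with the same noisy values.
        single_sum += max(est_a)
        # Double estimator: select the action with A, evaluate it with B.
        best = max(range(n_actions), key=est_a.__getitem__)
        double_sum += est_b[best]
    return single_sum / n_trials, double_sum / n_trials


single, double = max_bias_demo()
print(f"single estimator: {single:.3f}, double estimator: {double:.3f}")
```

The true maximum is 0, yet the single estimator averages well above 0 because the max picks up positive noise. The double estimator stays close to 0, since the noise used to select the action is independent of the noise used to evaluate it.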

The algorithm is as shown above.

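1. Initialize QA and QB.

2. Choose an action based on QA and QB, then observe the reward and the next state.

3. Randomly choose either QA or QB to update.

4. If QA is being updated:

: define a* as the action that maximizes QA in the next state, $a^* = \arg\max_a Q^A(s', a)$,

QA update: $Q^A(s,a) \leftarrow Q^A(s,a) + \alpha(s,a)\,\bigl(r + \gamma\, Q^B(s', a^*) - Q^A(s,a)\bigr)$

The opposite case (updating QB) works the same way with the roles of QA and QB swapped.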
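The tabular steps above can be sketched roughly as follows. This is only an illustration, not code from any of the references: the function `double_q_update`, the dict-based tables, the constant learning rate `alpha`, the `done` handling, and the `update_a` switch (which forces the coin flip so the example is reproducible) are all assumptions.

```python
from collections import defaultdict
import random


def double_q_update(q_a, q_b, s, a, r, s_next, actions, done,
                    alpha=0.1, gamma=0.99, update_a=None):
    """Apply one tabular Double Q-learning update in place."""
    # Step 3: choose which table to update (coin flip unless forced).
    if update_a is None:
        update_a = random.random() < 0.5
    q_upd, q_eval = (q_a, q_b) if update_a else (q_b, q_a)

    if done:
        target = r  # assumption: no bootstrapping from a terminal state
    else:
        # Step 4: the table being updated picks a*, the other one evaluates it.
        a_star = max(actions, key=lambda act: q_upd[(s_next, act)])
        target = r + gamma * q_eval[(s_next, a_star)]

    q_upd[(s, a)] += alpha * (target - q_upd[(s, a)])


# Tiny hand-made example: two states, two actions, tables stored as dicts.
q_a, q_b = defaultdict(float), defaultdict(float)
q_a[(1, 0)], q_a[(1, 1)] = 1.0, 2.0   # QA prefers action 1 in state 1
q_b[(1, 0)], q_b[(1, 1)] = 5.0, 0.5   # QB disagrees, but it only evaluates

double_q_update(q_a, q_b, s=0, a=0, r=1.0, s_next=1,
                actions=[0, 1], done=False, update_a=True)
print(q_a[(0, 0)])
```

Here QA selects action 1, QB values that action at 0.5, so the target is 1 + 0.99 × 0.5 = 1.495 and QA(0, 0) moves from 0 to about 0.1495. QB's large value for action 0 is never used, which is the point of separating selection from evaluation.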

The updated version of the algorithm is as follows.

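1. As in DQN, store $(s_t, a_t, r_t, s_{t+1})$ in the replay memory.

2. Sample a batch from the memory.

3. $Q^*(s_t, a_t) \approx r_t + \gamma\, Q_{\theta'}\bigl(s_{t+1}, \arg\max_{a'} Q_{\theta}(s_{t+1}, a')\bigr)$

4. Update the network by minimizing the squared error between the predicted Q value and this target.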

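The two algorithms work in a similar way. The updated version, however, does not train two independent Q functions; the target network plays the role of the second estimator.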

The target network is updated softly, $\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$, where θ′ is the target network parameter, θ is the primary network parameter, and τ (rate of averaging) is usually set to 0.01.
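To get a feel for what τ does, here is a small NumPy sketch of this kind of averaging between primary and target parameters. The helper `soft_update` and the toy parameter lists are hypothetical; a real network would hold its parameters in framework tensors rather than plain arrays.

```python
import numpy as np


def soft_update(target_params, online_params, tau=0.01):
    """Return tau * online + (1 - tau) * target for each parameter array."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]


# Toy "networks": the primary parameters are all ones, the target starts at zero.
online = [np.ones((2, 2)), np.ones(2)]
target = [np.zeros((2, 2)), np.zeros(2)]

for _ in range(100):
    target = soft_update(target, online)

print(target[1])
```

With τ = 0.01 the gap between the two networks shrinks by a factor of 0.99 per update, so after 100 updates the target has covered 1 − 0.99^100 of the distance to the primary parameters. Setting τ = 1 would reduce to a hard copy.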

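There are many different source-code implementations available, and customizing them takes some analysis and a grasp of the theory, so I have organized my notes here.

Double DQN is much easier to use with a prior understanding of Q-Learning and DQN, so this post focuses mainly on the training code.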

Example 1 (reference link 1) analysis

# Calculate q values and targets
# This is where Double Q-Learning comes in!

# Q values of the next states from the estimator (online) network
q_values_next = q_estimator.predict(sess, next_states_batch)

# For each sample, the action with the highest Q value according to the estimator
best_actions = np.argmax(q_values_next, axis=1)

# From here on, Q values of the same next states are computed with the target model
q_values_next_target = target_estimator.predict(sess, next_states_batch)

# Combine the two sets of Q values into the training targets:
# current reward + discount * (target-model Q value of the action the estimator rated highest); zeroed for terminal transitions.
targets_batch = reward_batch + np.invert(done_batch).astype(np.float32) * \
    discount_factor * q_values_next_target[np.arange(batch_size), best_actions]

# Perform gradient descent update
states_batch = np.array(states_batch)

# Update the estimator model
loss = q_estimator.update(sess, states_batch, action_batch, targets_batch)

if done:
    break

state = next_state
total_t += 1

The most important part of the code above is this:

# current reward + discount * (target-model Q value of the action the estimator rated highest)

and

# the gradient update is applied to the estimator, not to the target model

Example 2 analysis

self.use_conv = use_conv


if self.use_conv:
    self.model = ConvDQN(env.observation_space.shape, env.action_space.n).to(self.device)
    self.target_model = ConvDQN(env.observation_space.shape, env.action_space.n).to(self.device)
else:
    self.model = DQN(env.observation_space.shape, env.action_space.n).to(self.device)
    self.target_model = DQN(env.observation_space.shape, env.action_space.n).to(self.device)

# hard copy model parameters to target model parameters
for target_param, param in zip(self.target_model.parameters(), self.model.parameters()):
    target_param.data.copy_(param)

self.optimizer = torch.optim.Adam(self.model.parameters())

def compute_loss(self, batch):
    states, actions, rewards, next_states, dones = batch
    states = torch.FloatTensor(states).to(self.device)
    actions = torch.LongTensor(actions).to(self.device)
    rewards = torch.FloatTensor(rewards).to(self.device)
    next_states = torch.FloatTensor(next_states).to(self.device)
    dones = torch.FloatTensor(dones).to(self.device)  # keep every tensor on the same device

    # reshape to column vectors (batch, 1) so the target below cannot broadcast to (batch, batch)
    actions = actions.view(-1, 1)
    rewards = rewards.view(-1, 1)
    dones = dones.view(-1, 1)

    # compute loss
    curr_Q = self.model.forward(states).gather(1, actions)
    # NOTE: the max here is taken over the target network's own Q values (the standard DQN target);
    # a Double DQN target would select the action with self.model and evaluate it with self.target_model.
    next_Q = self.target_model.forward(next_states)
    max_next_Q = torch.max(next_Q, 1)[0]
    max_next_Q = max_next_Q.view(max_next_Q.size(0), 1)
    expected_Q = rewards + (1 - dones) * self.gamma * max_next_Q

    loss = F.mse_loss(curr_Q, expected_Q.detach())

    return loss

def update(self, batch_size):
    batch = self.replay_buffer.sample(batch_size)
    loss = self.compute_loss(batch)

    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()

    # target network update
    for target_param, param in zip(self.target_model.parameters(), self.model.parameters()):
        target_param.data.copy_(self.tau * param + (1 - self.tau) * target_param)

1. DQN network: selects the best action to take for the next state (the action with the highest Q value).

2. Target network: calculates the target Q value of taking that action at the next state.
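Note that `compute_loss` in Example 2 takes `torch.max` over the target model's output, so the same network both selects and evaluates the action. The difference from the select-then-evaluate split described above can be sketched with NumPy on a hand-made batch. The function names, the toy Q values, and `gamma=0.9` are all assumptions made for illustration.

```python
import numpy as np


def dqn_targets(rewards, dones, q_next_target, gamma=0.9):
    """Standard DQN: the target network both selects and evaluates."""
    return rewards + (1.0 - dones) * gamma * q_next_target.max(axis=1)


def double_dqn_targets(rewards, dones, q_next_online, q_next_target, gamma=0.9):
    """Double DQN: the online network selects, the target network evaluates."""
    best_actions = q_next_online.argmax(axis=1)
    batch_idx = np.arange(len(rewards))
    return rewards + (1.0 - dones) * gamma * q_next_target[batch_idx, best_actions]


# Batch of 3 transitions, 2 actions each; the last transition is terminal.
rewards = np.array([1.0, 0.0, 2.0])
dones = np.array([0.0, 0.0, 1.0])
q_next_online = np.array([[0.5, 1.5], [2.0, 1.0], [0.3, 0.1]])
q_next_target = np.array([[3.0, 1.0], [0.5, 4.0], [9.0, 9.0]])

y_dqn = dqn_targets(rewards, dones, q_next_target)
y_ddqn = double_dqn_targets(rewards, dones, q_next_online, q_next_target)
print(y_dqn, y_ddqn)
```

For the same batch, the Double DQN target can never exceed the DQN target, because the target network's value at the online network's chosen action is at most the target network's own maximum. On terminal transitions both reduce to the reward.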

References:

1. https://github.com/dennybritz/reinforcement-learning/blob/master/DQN/Double%20DQN%20Solution.ipynb

2. Double Deep Q Networks: Tackling maximization bias in Deep Q-learning (towardsdatascience.com). https://towardsdatascience.com/double-deep-q-networks-905dd8325412

3. Thomas Simonini, Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed… (www.freecodecamp.org). https://www.freecodecamp.org/news/improvements-in-deep-q-learning-dueling-double-dqn-prioritized-experience-replay-and-fixed-58b130cc5682/
