Implementing the DQN Algorithm in Python with the gym Library to Train an Agent to Play Classic Atari Games
Published: 2023-12-16 08:59:13
A code example that uses the gym library and the DQN algorithm to train an agent on a classic Atari game is shown below:
import random

import gym
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam
# Create the DQN model
def create_model(state_size, action_size):
    model = Sequential()
    model.add(Flatten(input_shape=(state_size,)))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(action_size, activation='linear'))
    model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))
    return model
# Create the experience replay buffer
class ReplayBuffer:
    def __init__(self, buffer_size):
        self.buffer = []
        self.buffer_size = buffer_size

    def add(self, experience):
        # Drop the oldest entries once the buffer would exceed its capacity
        if len(self.buffer) + len(experience) >= self.buffer_size:
            self.buffer[0:(len(experience) + len(self.buffer)) - self.buffer_size] = []
        self.buffer.extend(experience)

    def sample(self, size):
        # Each stored experience is a 5-tuple: (state, action, reward, next_state, done)
        return np.reshape(np.array(random.sample(self.buffer, size), dtype=object), [size, 5])
# Create the DQN agent
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = ReplayBuffer(2000)
        self.model = create_model(state_size, action_size)

    def act(self, state, epsilon):
        # Epsilon-greedy action selection: explore with probability epsilon
        if np.random.rand() <= epsilon:
            return np.random.choice(self.action_size)
        act_values = self.model.predict(state, verbose=0)
        return np.argmax(act_values[0])

    def replay(self, batch_size):
        minibatch = self.memory.sample(batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                # Bellman target: reward plus the discounted max Q-value of the next state
                target = reward + 0.95 * np.amax(self.model.predict(next_state, verbose=0)[0])
            target_f = self.model.predict(state, verbose=0)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)

    def remember(self, state, action, reward, next_state, done):
        self.memory.add([(state, action, reward, next_state, done)])
# Main training loop
if __name__ == "__main__":
    # Assumes the legacy gym API (env.reset() returns the observation, env.step() returns 4 values)
    env = gym.make('Pong-v0')
    # Atari observations are image frames, so flatten them into a single feature vector
    state_size = int(np.prod(env.observation_space.shape))
    action_size = env.action_space.n
    agent = DQNAgent(state_size, action_size)
    batch_size = 32
    num_episodes = 1000
    for e in range(num_episodes):
        state = env.reset()
        state = np.reshape(state, [1, state_size])
        epsilon = 1.0 / ((e / 50) + 10)
        done = False
        time = 0
        while not done:
            env.render()
            action = agent.act(state, epsilon)
            next_state, reward, done, _ = env.step(action)
            next_state = np.reshape(next_state, [1, state_size])
            agent.remember(state, action, reward, next_state, done)
            state = next_state
            time += 1
        print("Episode: {}/{}, Score: {}, Epsilon: {:.2}".format(e, num_episodes, time, epsilon))
        if len(agent.memory.buffer) > batch_size:
            agent.replay(batch_size)
In the code above, we first define the DQN model, which consists of two hidden layers and an output layer. We then create an experience replay buffer that stores the agent's transitions, and a DQNAgent class that bundles action selection, experience replay, and training. In the main loop we use the gym library to create the Atari environment and train the agent. Note that a plain fully connected network like this is mainly suited to low-dimensional, vector-valued observations; for raw Atari frames a convolutional network is the usual choice, as sketched below.
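The following is a minimal sketch of a convolutional variant of create_model, roughly following the layer sizes used in the original DQN paper. It assumes the frames have already been preprocessed into stacks of four 84x84 grayscale images; that preprocessing is not shown here, and the function name create_conv_model is just an illustrative placeholder.

# A convolutional Q-network sketch for preprocessed Atari frames (assumed shape: 84x84x4)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense
from tensorflow.keras.optimizers import Adam

def create_conv_model(action_size, input_shape=(84, 84, 4)):
    model = Sequential()
    model.add(Conv2D(32, (8, 8), strides=4, activation='relu', input_shape=input_shape))
    model.add(Conv2D(64, (4, 4), strides=2, activation='relu'))
    model.add(Conv2D(64, (3, 3), strides=1, activation='relu'))
    model.add(Flatten())
    model.add(Dense(512, activation='relu'))
    model.add(Dense(action_size, activation='linear'))
    model.compile(loss='mse', optimizer=Adam(learning_rate=0.00025))
    return model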
During training we use an epsilon-greedy policy for action selection: with probability epsilon the agent picks a random action, so that it keeps exploring the environment. At each time step we feed the current observation into the DQN model, predict a Q-value for every action, and choose the action with the largest Q-value. We then execute that action and observe the reward and the next state. The transition is stored in the replay buffer, and a batch of transitions is sampled from the buffer for training. The training target for the Q-value is the immediate reward plus the discounted maximum Q-value of the next state, which drives the predicted Q-values to gradually converge toward the optimal values.
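As a concrete illustration of that target (a standalone numeric sketch; the next-state Q-values below are made-up numbers, and gamma = 0.95 matches the discount factor hard-coded in replay() above):

import numpy as np

gamma = 0.95                               # discount factor, same as in replay()
reward = 1.0                               # immediate reward from the environment
next_q_values = np.array([0.2, 0.8, 0.5])  # hypothetical Q(s', a') predictions
target = reward + gamma * np.max(next_q_values)
print(target)                              # 1.0 + 0.95 * 0.8 = 1.76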
Finally, we print each episode's score and epsilon value so that training progress can be monitored. Over a large number of training episodes the agent gradually learns a better policy and can achieve higher scores in the Atari game.
