Using the gym Library in Python to Implement the DQN Algorithm and Train an Agent to Play Classic Atari Games

Published: 2023-12-16 08:59:13

Below is a code example that uses the gym library to implement the DQN algorithm and train an agent to play a classic Atari game:

import random

import gym
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam

# Build the DQN model: a small fully connected network that maps a
# flattened observation to one Q-value per action
def create_model(state_size, action_size):
    model = Sequential()
    model.add(Flatten(input_shape=(state_size,)))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(action_size, activation='linear'))
    model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))
    return model

# Experience replay buffer: stores past transitions so the agent can
# train on random minibatches of them
class ReplayBuffer:
    def __init__(self, buffer_size):
        self.buffer = []
        self.buffer_size = buffer_size

    def add(self, experience):
        # Drop the oldest entries if adding would exceed the capacity
        if len(self.buffer) + len(experience) >= self.buffer_size:
            self.buffer[0:(len(experience) + len(self.buffer)) - self.buffer_size] = []
        self.buffer.extend(experience)

    def sample(self, size):
        # Uniformly sample a minibatch of (state, action, reward, next_state, done) tuples
        return random.sample(self.buffer, size)

# DQN agent: selects actions with an epsilon-greedy policy and learns
# from replayed experience
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = ReplayBuffer(2000)
        self.model = create_model(state_size, action_size)

    def act(self, state, epsilon):
        # With probability epsilon take a random action (exploration),
        # otherwise take the action with the highest predicted Q-value
        if np.random.rand() <= epsilon:
            return np.random.choice(self.action_size)
        act_values = self.model.predict(state, verbose=0)
        return np.argmax(act_values[0])

    def replay(self, batch_size):
        # Fit the model toward the Bellman target for each sampled transition
        minibatch = self.memory.sample(batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                target = reward + 0.95 * np.amax(self.model.predict(next_state, verbose=0)[0])
            target_f = self.model.predict(state, verbose=0)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)

    def remember(self, state, action, reward, next_state, done):
        # Store a single transition in the replay buffer
        self.memory.add([(state, action, reward, next_state, done)])

# Main training loop (assumes a gym release older than 0.26, where
# env.reset() returns only the observation and env.step() returns four values)
if __name__ == "__main__":
    env = gym.make('Pong-v0')
    # Atari observations are images; flatten them into a vector for the
    # dense network above (a practical Atari DQN would use a CNN instead)
    state_size = int(np.prod(env.observation_space.shape))
    action_size = env.action_space.n
    agent = DQNAgent(state_size, action_size)
    batch_size = 32
    num_episodes = 1000

    for e in range(num_episodes):
        state = env.reset()
        state = np.reshape(state, [1, state_size])
        epsilon = 1.0 / ((e / 50) + 10)
        done = False
        time = 0

        while not done:
            env.render()
            action = agent.act(state, epsilon)
            next_state, reward, done, _ = env.step(action)
            next_state = np.reshape(next_state, [1, state_size])
            agent.remember(state, action, reward, next_state, done)
            state = next_state
            time += 1

        print("Episode: {}/{}, Score: {}, Epsilon: {:.2f}".format(e, num_episodes, time, epsilon))
        if len(agent.memory.buffer) > batch_size:
            agent.replay(batch_size)

In the code above, we first define the DQN model, which consists of two hidden layers and a linear output layer. We then create an experience replay buffer for storing the agent's transitions, followed by the DQNAgent class, which implements action selection, experience storage, and the replay-based training step. In the main loop, we create the Atari environment with the gym library and train the agent in it.
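The ReplayBuffer above manages a plain Python list by hand. As a minimal alternative sketch (the class name DequeReplayBuffer is illustrative and not part of the code above), collections.deque with a maxlen achieves the same "drop the oldest transition when full" behavior automatically:

import random
from collections import deque

class DequeReplayBuffer:
    def __init__(self, buffer_size):
        # A deque with maxlen evicts the oldest entry once capacity is reached
        self.buffer = deque(maxlen=buffer_size)

    def add(self, experience):
        # experience is a (state, action, reward, next_state, done) tuple
        self.buffer.append(experience)

    def sample(self, size):
        # Uniform random minibatch of stored transitions
        return random.sample(self.buffer, size)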

During training, actions are chosen with an epsilon-greedy policy: with a small probability the agent picks a random action so that it keeps exploring the environment. At each time step the current observation is fed to the DQN model, which predicts a Q-value for every action, and the action with the highest Q-value is taken. The agent then executes the action, observes the reward and the next state, stores the transition in the replay buffer, and later samples a minibatch from the buffer for training. The training target for the chosen action is the immediate reward plus the discounted maximum Q-value of the next state, so that the predicted Q-values gradually converge toward the optimal ones.
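As an illustration of that target, the hypothetical helper below (build_targets is not part of the code above; the gamma value and batch arrays are assumed to match it) computes the Bellman targets for a whole minibatch in one pass, instead of one model.fit call per transition as in the replay() method:

import numpy as np

def build_targets(model, states, actions, rewards, next_states, dones, gamma=0.95):
    # Current Q-value estimates; only the taken action's entry is overwritten
    q_values = model.predict(states, verbose=0)
    # Bootstrap value: max over actions of Q(next_state, a)
    next_q = np.amax(model.predict(next_states, verbose=0), axis=1)
    # Bellman target: r + gamma * max_a Q(s', a), or just r on terminal steps
    targets = rewards + gamma * next_q * (1.0 - dones.astype(np.float32))
    q_values[np.arange(len(actions)), actions] = targets
    return q_values

The returned array can then be passed to a single model.fit(states, q_values) call, which is closer to how batched DQN implementations are usually written.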

Finally, we print the score and the epsilon value of each episode. Over many training episodes, the agent can gradually learn a good policy and achieve higher scores in the Atari game.