Hello,
I'm currently attempting to get a DQN agent to work for my setup, and a few things aren't entirely clear to me, so I have a couple of questions plus an error that I'm getting.
The questions:
1) Does the DQN agent automatically update the weights at the end of each episode, or do I have to manually call the update() method?
2) Does the agent automatically store the state, action and reward it's given so that it can use them for training afterwards, or do I have to store them manually in a memory module and then use that for training?
The error I'm getting is the following:
InvalidArgumentError (see above for traceback): assertion failed: [] [Condition x == y did not hold element-wise:] [x (agent.observe/strided_slice:0) = ] [407] [y (agent.observe/strided_slice_1:0) = ] [0]
[[node agent.observe/assert_equal_1/Assert/AssertGuard/Assert (defined at F:\ProgramFiles\Anaconda3\envs\Tensorforce\lib\site-packages\tensorforce\core\models\model.py:1094) ]]
[[{{node GroupCrossDeviceControlEdges_0/agent.observe/agent.core_observe/agent.core_experience/estimator.enqueue/assert_equal/Assert/AssertGuard/Assert/data_4}}]]
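For reference, the assertion in tensorforce/core/models/model.py that the traceback points to is: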
tf.debugging.assert_equal(
x=tf.shape(input=terminal, out_type=tf.int64)[0],
y=tf.dtypes.cast(x=self.buffer_index[parallel], dtype=tf.int64)
),
1) DQN, like every other agent, updates automatically; the update(...) function doesn't usually need to be called. You can specify how frequently the update should happen via the update_frequency argument, or implicitly via batch_size (if update_frequency is None, then update_frequency = batch_size). These numbers are timestep-based, so independent of episodes (since DQN is generally largely agnostic to episodes).
2) act(...) and observe(...) are called iteratively (or Runner is used, which takes care of it). No need to take care of anything here.
Regarding the error: do you maybe call observe(...) only when you encounter a terminal state? As @qZhang88 mentioned, it would be good to see the code and how you call act() and observe().
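For reference, the standard interaction pattern is roughly the following (a sketch only, using a Gym CartPole environment as a stand-in; the hyperparameter values are arbitrary):

from tensorforce import Agent, Environment, Runner

# stand-in environment and agent, just to illustrate the act/observe pattern
environment = Environment.create(environment='gym', level='CartPole-v1', max_episode_timesteps=500)
agent = Agent.create(agent='dqn', environment=environment, memory=10000, batch_size=32)

# manual loop: exactly one act() and one observe() per timestep
for _ in range(100):
    states = environment.reset()
    terminal = False
    while not terminal:
        actions = agent.act(states=states)
        states, terminal, reward = environment.execute(actions=actions)
        agent.observe(terminal=terminal, reward=reward)  # updates are triggered internally

# equivalently, the Runner takes care of the loop
runner = Runner(agent=agent, environment=environment)
runner.run(num_episodes=100)
runner.close()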
def run_And_Update_States(self):
    #this method is responsible for running a one-step iteration and updating the states
    #a one-step iteration can be every X simulation steps, depending on how often this method is called
    #note: the way this works is by taking one action per dqn_agent per timestep, which is necessary
    #as I'm running multiple agents within the same environment, then executing the action and then updating
    #the reward through observe() the next timestep. To do so, it's important to distinguish between the
    #first step and every other step. It isn't possible to return the reward immediately from the environment
    #for the current action before executing at least one simulation step, because we have to wait for the
    #other agents to take their actions as well
    reward = 0
    #update queues and variables
    self.update_TLS_Queues()
    if self.previous_State is None and self.current_State is None:
        #first call of this method --> first step
        print("***First step***")
        self.current_State = self.Get_State()
        self.current_Action = self.choose_action(self.current_State)
        self.action_changed = True
        self.action_counter += 1
    else:
        #not the first call, i.e. we've already taken at least 1 action --> we can update memory + accumulate reward for the previously taken action
        #update previous state and current state
        self.previous_State = self.current_State
        self.current_State = self.Get_State()
        self.previous_Action = self.current_Action  #the previously taken action is stored in its own variable, so we can correlate state, action, next state and reward
        #retrieve and save info about terminal state
        terminal = False
        if traci.simulation.getMinExpectedNumber() == 0:
            terminal = True
            print("***Terminal state reached, ending episode for " + self.TLS_ID)
        if self.ack_Count != 0:
            #acknowledgements since last timestep
            avg_Travel_Time = float(avg_Travel_Time) / float(self.ack_Count)
            print("Avg travel time for %s is %d" % (self.TLS_ID, avg_Travel_Time))
            reward = self.Evaluate_Reward(avg_Travel_Time, self.ack_Count)
        else:
            #no acknowledgements since last timestep
            reward = self.Evaluate_Reward(1, 0)
        self.Total_reward += reward
        print("Action_Counter = %d & Observe_Counter = %d" % (self.action_counter, self.observe_counter))
        #pass info about terminal state to agent, 0 reward + True on terminal state
        update_bool = self._model.DQN_Agent.observe(reward=reward, terminal=terminal)
        self.observe_counter += 1
        if update_bool:
            #print when an update occurs
            print("Model with TLS ID # " + self.TLS_ID + " was updated at timestep = %d" % self.step)
        #take action
        if not (traci.simulation.getMinExpectedNumber() == 0):
            self.current_Action = self.choose_action(self.current_State)  #action to take in this timestep
            self.action_counter += 1
        #the change in phase is set from inside the run() method so we can keep track of the number of steps spent in the yellow phase before switching
        if self.current_Action == self.previous_Action:
            self.action_changed = False
        else:
            self.action_changed = True
    self._steps += 1
    print(self.TLS_ID)
    print("**Previous action:")
    print(self.previous_Action)
    print("**Current action:")
    print(self.current_Action)
    print("**Action changed bool:")
    print(self.action_changed)
What does your choose_action(...) method look like, and how exactly do these multiple agents work? Also, it sounds like there may be a better way of doing this. Feel free to write me a private message and we can discuss it in more detail.
Have you looked at the Runner, and if so, was it not clear how to use it for parallel execution? Or have you tried to use the slightly more low-level interface via the parallel argument of agent.act/observe? I can certainly add more information, but it would also be very welcome if you would consider contributing a short guide... :-) Also, I'm happy to help if there are still questions...
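For illustration, the low-level interface looks roughly like this (a sketch only, with CartPole as a stand-in environment, running one episode per parallel slot; the values are arbitrary):

from tensorforce import Agent, Environment

# run N local environments in lock-step and tell the agent which slot each
# interaction belongs to via the parallel= argument
num_parallel = 4
environments = [
    Environment.create(environment='gym', level='CartPole-v1', max_episode_timesteps=500)
    for _ in range(num_parallel)
]
agent = Agent.create(
    agent='ppo', environment=environments[0], batch_size=10,
    parallel_interactions=num_parallel
)

states = [env.reset() for env in environments]
terminals = [False] * num_parallel
while not all(terminals):
    for i, env in enumerate(environments):
        if terminals[i]:
            continue
        actions = agent.act(states=states[i], parallel=i)
        states[i], terminals[i], reward = env.execute(actions=actions)
        agent.observe(terminal=terminals[i], reward=reward, parallel=i)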
You can set the parallel_interactions argument explicitly, but it should be automatically set internally. Note that currently you're running 16 environments, which run locally and hence will be executed iteratively, and the agent call will be batched, i.e. "in parallel". For computationally more expensive environments, it makes sense to use the remote argument (see here) to execute remotely and hence "fully in parallel".
@AlexKuhnle I was working on updating my code from tensorforce 0.5.0 to 0.5.3, but figured that as parallel environments now have been added I would try to update the code all the way to the latest github version.
I have a custom environment I want to run on multiple CPUs (locally), as my environments involve a bunch of fluid-mechanics simulations which are very computationally heavy. I borrowed the script Jeff linked above, tried to use remote="multiprocessing", and changed the environment to use my custom class, also adding remote="multiprocessing" to the Env.create() call. This seems to work OK, and the environment type is recognized as a MultiprocessingEnvironment.
However, when the code reaches the Runner call, I get an assertion error:
File "/home/fenics/local/tensorforce/tensorforce/execution/runner.py", line 99, in __init__
assert not isinstance(environment, Environment)
AssertionError
Is something going wrong with how I'm creating my environment, or is this something in the Runner class not taking into account that my environment now is of a different type?
Seems like my editing timed out, so I continue here.
I tried editing the script Jeff linked by only adding remote="multiprocessing" to Env.create() and Runner(), which seems to work, except that it slows down over time, and when it reaches the final episode nothing seems to happen and the run won't finish.
I suspect I might have misunderstood "multiprocessing" vs "socket-client", and that what I actually need to use is "socket-client". (I have used the code contributed by Jerab29 with TensorForce 0.5.0, which used the same naming convention with Client, Server, Socket etc., which is causing some suspicion.)
Changing Runner(environment='CartPole-v1') to Runner(environment=environment) causes the same AssertionError as for a custom env. The environment type is <class 'tensorforce.environments.multiprocessing_environment.MultiprocessingEnvironment'>, which seems right.
Alternatively, you can pass Environment objects to environments=, and in that case you don't need to specify the remote arguments. (Likewise, if you pass the agent spec dict, you don't need to set parallel_interactions, as it will be automatically set based on the runner arguments.)
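A rough sketch of that variant (the list size and environment spec are just placeholders, and I'm not certain whether the environments themselves then need any remote setup at creation):

from tensorforce import Environment, Runner

# create the environments explicitly and hand the Environment objects to environments=
environments = [
    Environment.create(environment='gym', level='CartPole-v1', max_episode_timesteps=500)
    for _ in range(4)
]

runner = Runner(
    agent=dict(agent='ppo', batch_size=10),  # agent spec dict; parallel_interactions inferred
    environments=environments                # already-created Environment objects, no remote= needed
)
runner.run(num_episodes=200)
runner.close()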
I was able to figure out what I was doing wrong with my custom Environment, and a similar reason probably explains the AssertionError above.
I was doing env = Environment.create() and then passing env to the Agent and the Runner. However, "multiprocessing" requires that the environment we pass to the Runner NOT be of type Environment, MultiprocessingEnvironment or similar.
When I pass the custom environment class directly to the runner (and the agent), i.e. Agent.create(environment='CustomClass') and Runner(environment='CustomClass/CartPole-v1'), tensorforce calls Environment.create() on its own.
I was calling Environment.create() on an instance which had already gone through Environment.create(), sort of a double stack of Environment.create.
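To summarize the difference as a sketch (the custom environment class and its import are hypothetical, the counts arbitrary):

from tensorforce import Runner

# hypothetical custom environment class standing in for the real one
from my_package.envs import CustomEnvironment

# What tripped the assertion: wrapping the environment myself and then handing the
# resulting Environment/MultiprocessingEnvironment instance to the Runner again:
#   env = Environment.create(environment=CustomEnvironment, remote='multiprocessing')
#   Runner(agent=..., environment=env, remote='multiprocessing')

# What works: pass the class (or spec) and let the Runner call Environment.create() internally
runner = Runner(
    agent=dict(agent='ppo', batch_size=10),
    environment=CustomEnvironment,
    num_parallel=4,
    remote='multiprocessing'
)
runner.run(num_episodes=100)
runner.close()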
I'm meeting a few other errors, but they are very much more likely to be issues with how I'm defining my environment and not necessarily caused by tensorforce, so I'll take a closer look myself on those before bothering you again.
1) DQN as every other agent updates automatically, the update(...) function doesn't usually need to be called. You can specify how frequently the update should happen via the update_frequency argument, or implicitly via batch_size (if update_frequency is None, then update_frequency = batch_size). These numbers are timestep-based, so independent of episodes (since DQN is generally largely agnostic to episodes).
update_frequency always has the same unit as batch_size, both specified as part of update (in TensorforceAgent). So in the case of PPO it can't be timestep-based. As you've probably read, update_frequency specifies how frequently an update is scheduled: update_frequency > batch_size doesn't make sense, otherwise some experience would just be ignored; update_frequency = batch_size is the default; but it makes sense to experiment with "increasing" the periodicity / "decreasing" the frequency, i.e. update_frequency < batch_size.
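As a concrete sketch of these arguments for PPO (values arbitrary, CartPole as a stand-in environment):

from tensorforce import Agent, Environment

environment = Environment.create(environment='gym', level='CartPole-v1', max_episode_timesteps=500)

# PPO: batch_size and update_frequency are both counted in episodes
agent = Agent.create(
    agent='ppo',
    environment=environment,
    batch_size=10,          # 10 complete episodes per update batch
    update_frequency=10     # default: equal to batch_size; > batch_size would leave experience unused
)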
You can use memory = dict(type='recent') instead of DQN's replay and a custom capacity.
Apart from a single action spec dict(type=..., shape=...), in general you can specify a nested action dict like dict(action1=dict(type=..., shape=...), action2=dict(type=..., shape=...), ...). Your environment (if you implement the Environment class) can just return this for actions(), and/or your agent can receive this as the actions argument.
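A small sketch of such a nested action specification in a custom environment (class name, state shape and action names are made up):

from tensorforce import Environment

class MultiActionEnvironment(Environment):
    """Hypothetical environment with two named actions per step."""

    def states(self):
        return dict(type='float', shape=(8,))

    def actions(self):
        # nested action spec: one discrete and one continuous action
        return dict(
            gear=dict(type='int', shape=(), num_values=3),
            throttle=dict(type='float', shape=(), min_value=0.0, max_value=1.0)
        )

    def reset(self):
        return [0.0] * 8

    def execute(self, actions):
        # actions arrives as dict(gear=..., throttle=...)
        next_state = [0.0] * 8
        terminal = False
        reward = 0.0
        return next_state, terminal, reward

It can then be used via environment = Environment.create(environment=MultiActionEnvironment, max_episode_timesteps=100).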
Hi @qZhang88, hope the following explanation clarifies your question: PPO, like many other standard policy gradient algorithms, uses complete rollouts (episodes) for reward estimation. In Tensorforce this means that batch_size defines the number of episodes (each consisting of many timesteps) per update batch. Moreover, the way the PPO update works according to the paper is that it actually performs multiple updates based on randomly subsampled timestep minibatches (the entire batch of n episodes is quite big). So the subsampling_fraction specifies what fraction of the full batch is subsampled for each minibatch, and optimization_steps specifies how often these mini-updates should happen.
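For example (a sketch only; the values are arbitrary and CartPole is just a stand-in environment):

from tensorforce import Agent, Environment

environment = Environment.create(environment='gym', level='CartPole-v1', max_episode_timesteps=500)

# PPO minibatch settings as described above
agent = Agent.create(
    agent='ppo',
    environment=environment,
    batch_size=10,               # 10 complete episodes per update batch
    subsampling_fraction=0.2,    # each minibatch subsamples 20% of the batch's timesteps
    optimization_steps=10        # number of minibatch update steps per batch
)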
I still have some questions here: let's say batch_size is 10, the max timestep is 1000, and subsampling_fraction is 0.2, so each update batch is still 10 episodes and the timesteps would be less than 200, right? And could the optimization steps be increased, to take full advantage of the whole episodes?
I wonder something: while constructing a custom environment, can we return different action values, for example [1,2,3] in one state and [2,3,4] in another, or do I have to handle the available actions in execute()?
Thanks in advance for the answer.
Hi, I have been tinkering with the DQN agent on the BreakoutDeterministic-v4 environment, but I am running into the problem that the agent receives low rewards and plateaus at an episode reward of around 2-6 after running 10k-40k episodes.
The network config I am currently using is:
keras_net_conf = [
{
"type": "keras",
"layer": "Conv2D",
"filters": 32,
"kernel_size": 8,
"strides": 4,
"activation": "relu",
"padding": "valid",
"kernel_initializer": 'VarianceScaling',
"use_bias": False,
},
{
"type": "keras",
"layer": "Conv2D",
"filters": 64,
"kernel_size": 4,
"strides": 2,
"activation": "relu",
"padding": "valid",
"kernel_initializer": 'VarianceScaling',
"use_bias": False,
},
{
"type": "keras",
"layer": "Conv2D",
"filters": 64,
"kernel_size": 3,
"strides": 1,
"activation": "relu",
"padding": "valid",
"kernel_initializer": 'VarianceScaling',
"use_bias": False,
},
{
"type": "flatten",
},
{
"type": "keras",
"layer": "Dense",
"units": 512,
"activation": "relu",
"use_bias": False,
"kernel_initializer": 'VarianceScaling',
}
]
With preprocessing and exploration set up as:
preproc = [
{
"type": "image",
"width": 50,
"height": 50,
"grayscale": True
},
{
"type": "sequence",
"length": 4,
"concatenate": True
}
]
st_exp = dict(type='decaying', unit='timesteps', decay='polynomial', decay_steps=1000000, initial_value=1.0,
final_value=EXPLORATION, power=1.0)
The actual agent creation is defined as:
agent = Agent.create(agent='dqn',
environment=env,
states=env.states(),
batch_size=32,
preprocessing=dict(
state=preproc,
reward=dict(type="clipping", upper=1.0)
),
learning_rate=LR,
memory=100000,
start_updating=50000,
discount=DISC,
exploration=st_exp,
network=keras_net_conf,
update_frequency=4,
target_sync_frequency=10000,
summarizer=summarizer,
huber_loss=1.0,
name='DQN_agent')
The learning rate is set to 1e-5 and the discount factor to 0.99. The other parameters such as memory size, max_ep_steps, start_update etc. are all taken from other implementations that do not use Tensorforce but have managed to achieve scores comparable to the original paper.
So I am wondering whether somebody has come across this issue and, if so, managed to get it to learn properly and achieve higher rewards.
Regards.