amirrezaheidari
@amirrezaheidari

Hello
I have developed an agent to control a heating system. However, every time I run my code, exactly the same code with the same parameters, I observe a different performance (sometimes very different). I thought this might be because of exploration, in which the agent takes random actions, so the performance differs in each run. So I tried decaying the exploration with the "st_exp" function that I found in this channel. However, it still performs differently in each run with the same parameters.
Most of the time I do not get satisfactory performance metrics, so I need to tune my reward function, but as long as I get very different performance under the same reward function I cannot tune it. For example, the same parameters were giving me quite good performance last night, but now the performance is terrible.

(1) Any suggestions on what the reason can be?

(2) Also, it is not clear to me what "st_exp" does exactly. What is the function?

(3) For "st_exp", can you please let me know which values you suggest for "decay_steps", "final_value" and "num_steps=Train_hours_number"? As the name implies, "Train_hours_number" is the number of my training hours.

(4) How can we see the default parameters of the agent? For example, the default exploration or the default network architecture.

st_exp = dict(
    type='decaying', unit='timesteps', decay='polynomial', decay_steps=100000,
    initial_value=1.0, final_value=0, power=1.0, num_steps=Train_hours_number
)

agent = Agent.create(
    agent='dqn',
    max_episode_timesteps=Train_hours_number,
    environment=environment,
    network=[
        dict(type='dense', size=50, activation='tanh'),
        dict(type='flatten'),
        dict(type='dense', size=50, activation='tanh'),
        dict(type='flatten', name="out"),
    ],
    exploration=st_exp,
    learning_rate=1e-3,
    batch_size=72,
    memory=Train_hours_number
)
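On question (2), here is a minimal sketch of what a polynomial decay schedule of this kind computes. It uses the standard polynomial decay formula (the one used by TensorFlow's PolynomialDecay schedule); that Tensorforce's 'polynomial' option follows exactly this formula is an assumption, and the function name `polynomial_decay` is made up for illustration.

```python
def polynomial_decay(step, initial_value=1.0, final_value=0.0,
                     decay_steps=100000, power=1.0):
    """Exploration rate after `step` timesteps (standard polynomial decay)."""
    step = min(step, decay_steps)          # the value stays at final_value afterwards
    fraction = 1.0 - step / decay_steps    # goes from 1 down to 0
    return (initial_value - final_value) * fraction ** power + final_value


# With power=1.0 the schedule is a straight line from initial_value to final_value.
for step in (0, 2184, 50000, 100000, 200000):
    print(step, round(polynomial_decay(step), 5))
```

With decay_steps=100000, a run of only a couple of thousand timesteps ends with the exploration rate still close to 1.0, so the agent would be acting almost entirely at random for the whole run.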
Alexander Kuhnle
@AlexKuhnle
Hey, for the exploration arguments, the docs may help, and similarly for the default arguments. Variability in results across different runs is quite normal in RL, at least for imperfect, unstable solutions. What you can do is average over multiple runs per configuration, if that's feasible. Other than that it's very hard to say what the problem is, how to solve it, or whether it even can be solved. One thing I would recommend, though: start with the PPO agent, not the DQN agent, since the former tends to be better and more stable.
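Averaging over multiple runs per configuration could look roughly like the sketch below. `run_once` is a hypothetical stand-in for one complete train-and-evaluate run; here it only returns a noisy number so that the example runs on its own.

```python
import random
import statistics


def run_once(seed):
    """Hypothetical placeholder for one full training + evaluation run."""
    rng = random.Random(seed)
    return 10.0 + rng.gauss(0.0, 2.0)   # pretend score with run-to-run noise


def summarize(scores):
    """Mean and sample standard deviation over the runs of one configuration."""
    return statistics.mean(scores), statistics.stdev(scores)


scores = [run_once(seed) for seed in range(5)]
mean_score, std_score = summarize(scores)
print(f"mean={mean_score:.2f} std={std_score:.2f} over {len(scores)} runs")
```

Comparing two reward functions by their mean and spread over several runs is more informative than comparing two single runs, which may differ by chance alone.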
amirrezaheidari
@amirrezaheidari

Hi @AlexKuhnle, thanks for the documents. I went through them and the concepts are now clearer to me. However, I still have two problems: (1) I get "very" different results on different runs, and (2) the agent's performance is very poor most of the time. I tried to find out what contributes most to this unstable and poor behavior. It seems to me that the rest of the code (definition of environment, states, reward, etc.) is fine and that the agent definition is the problem. For example, when I change the batch_size, the results change significantly. After reading about batch_size, update_frequency and other parameters, I tried the following:
(1) Changing the agent to ppo, double_dqn, tensorforce
(2) Comparing episodes of 1 week versus 1 day
(3) Comparing different weights on the reward function components
(4) Adding more states
(5) Increasing the learning rate
(6) Changing batch_size to 8, 12, 24 (without specifying update frequency)
(7) Including and excluding exploration
Finally, I designed my agent as follows, but the issues are not solved; performance is still poor and varies between runs. Do you have any suggestions for other modifications I can try with the agent?
My agent interacts with the environment for 13 weeks, and each episode is one day (24 timesteps). Then I test the trained model over 2.5 weeks. The problem is not that hard, just turning a heater on and off.

linear_decay = dict(
    type='decaying', unit='timesteps', decay="linear", num_steps=168,
    initial_value=0.99, final_value=0.01
)

agent = Agent.create(
    agent='dqn',
    learning_rate=1e-3,
    environment=environment,
    batch_size=24,
    update_frequency=8,
    exploration=linear_decay,
    network="auto",
    memory=Train_hours_number
)

Alexander Kuhnle
@AlexKuhnle
It's very hard to say what could be wrong and what could help. Few comments: (a) I would recommend using the PPO agent since it avoids some potentially-hard-to-choose hyperparameters like memory size or target network (note one detail when switching: the unit for batch-size and update-frequency is "episodes" for PPO, not "timesteps" as is the case for DQN, so the numbers will generally be lower, like batch-size 8 frequency 1 or so). (b) Looking at Tensorboard plots may help. (c) 13 weeks, i.e. 13*7*24 = 2184 timesteps is not an awful lot, I would push this number more towards 100k. (d) How do you decide that the environment "is fine" and the agent is not? It's often worth assessing how state/action representation and reward function affect performance (harder to do if performance is just indistinguishably bad, of course).
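The arithmetic behind point (c) can be written out as below. That the same 13-week period can simply be replayed several times to reach the suggested order of magnitude is an assumption for illustration; the constant names are made up.

```python
import math

WEEKS, DAYS_PER_WEEK, STEPS_PER_DAY = 13, 7, 24

episodes_per_pass = WEEKS * DAYS_PER_WEEK               # one episode per day
timesteps_per_pass = episodes_per_pass * STEPS_PER_DAY  # hourly timesteps

TARGET_TIMESTEPS = 100_000  # the rough target mentioned above
passes_needed = math.ceil(TARGET_TIMESTEPS / timesteps_per_pass)

print(episodes_per_pass, timesteps_per_pass, passes_needed)
```

Under that assumption, one pass gives 91 episodes and 2184 timesteps, so roughly 46 passes over the training period would be needed to get near 100k timesteps.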
Benno Geißelmann
@GANdalf2357
hi, currently I'm trying to let tensorforce train on a GPU cluster, unfortunately I'm getting this error and I could not figure out what the problem is, maybe someone of you had the same issue?
File "/home/geb1imb/.conda/envs/tx/lib/python3.7/site-packages/tensorforce/core/models/model.py", line 315, in initialize
self.initialize_api()
File "/home/geb1imb/.conda/envs/tx/lib/python3.7/site-packages/tensorforce/core/models/tensorforce.py", line 629, in initialize_api
super().initialize_api()
File "/home/geb1imb/.conda/envs/tx/lib/python3.7/site-packages/tensorforce/core/models/model.py", line 373, in initialize_api
parallel=self.parallel_spec.empty(batched=True)
File "/home/geb1imb/.conda/envs/tx/lib/python3.7/site-packages/tensorforce/core/module.py", line 128, in decorated
output_args = function_graphs[str(graph_params)](*graph_args)
File "/home/geb1imb/.conda/envs/tx/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
result = self._call(*args, **kwds)
File "/home/geb1imb/.conda/envs/tx/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 846, in _call
return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds) # pylint: disable=protected-access
File "/home/geb1imb/.conda/envs/tx/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
cancellation_manager=cancellation_manager)
File "/home/geb1imb/.conda/envs/tx/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/home/geb1imb/.conda/envs/tx/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 550, in call
ctx=ctx)
File "/home/geb1imb/.conda/envs/tx/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation agent/StatefulPartitionedCall/agent/Gather: Could not satisfy explicit device specification '' because the node {{colocation_node agent/StatefulPartitionedCall/agent/Gather}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:GPU:0].
Colocation Debug Info:
Colocation group had the following types and supported devices:
Identity: GPU CPU XLA_CPU XLA_GPU
ResourceScatterAdd: CPU XLA_CPU XLA_GPU
_Arg: GPU CPU XLA_CPU XLA_GPU
ResourceGather: GPU CPU XLA_CPU XLA_GPU
It seems like it is trying to execute an operation on a GPU which is maybe not supported? "agent/StatefulPartitionedCall/agent/Gather"
Alexander Kuhnle
@AlexKuhnle
Hi @GANdalf2357 , I think this issue was recently resolved, so updating to the latest Github master should help (or if you're using the pip version, changing to the Github version -- of course, sooner or later there will be a new pip version, maybe should do that soon).
Benno Geißelmann
@GANdalf2357
@AlexKuhnle thanks for the info!
HYDesmondLiu
@HYDesmondLiu
Hi Experts,
With the PPO agent, for example, if I want to set "reward_estimation" with horizon=5, how do I do so?
I have tried agent = Agent.create(agent='ppo', ..., reward_estimation=dict(horizon=5), ...)
Then I got an error: TypeError: __init__() got multiple values for keyword argument 'reward_estimation'
Alexander Kuhnle
@AlexKuhnle
Hi @HYDesmondLiu, setting a reward horizon for PPO is "not possible", since PPO as policy gradient algorithm is episode-based, not n-step. That's at least the reason why this option is not offered as configuration for the PPO agent. However, one could, in principle, configure a "PPO-like variant" using a shorter estimation horizon. How: by replicating the PPO config using the more general Tensorforce agent (the parent of all agents in Tensorforce), and then modifying the corresponding argument. I was planning to add these configs for the users who want to modify agent types beyond their "intended domain", if that would help.
HYDesmondLiu
@HYDesmondLiu
Hi @AlexKuhnle , Thanks a lot for the response, sorry I did not figure it out clearly. Your solution looks good. (seems like I cannot reply in the thread?)
Alexander Kuhnle
@AlexKuhnle
@HYDesmondLiu , I've added the example config for PPO based on the more general Tensorforce agent in benchmarks/configs/ppo_tensorforce.json. It is equivalent to the config ppo.json.
HYDesmondLiu
@HYDesmondLiu
@AlexKuhnle Thanks a lot will try it out.
May I know what is the easiest way to record reward vs. episode?
I cannot tee the progress sometimes that is output from TensorForce.
Alexander Kuhnle
@AlexKuhnle
Have you tried the TensorBoard summaries?
Benno Geißelmann
Hi @AlexKuhnle, I have now switched to master to overcome this issue, but now the reset() call fails. Do I have to change something in my state with the new master? I now get this error in reset(): tensorforce.exception.TensorforceError: Environment.reset: invalid type <class 'tuple'> != float for state.

till now I was on 0.6.2 which worked fine with my code
Benno Geißelmann
@GANdalf2357
my state is a tuple about like this (0,0,0,0,0.34243,0.5424,0.4211)
on a first look: if i remove the check which leads to this error the training seems to run fine again.
HYDesmondLiu
@HYDesmondLiu

Have you tried the TensorBoard summaries?

@AlexKuhnle Thanks very much, will try !

HYDesmondLiu
@HYDesmondLiu
Does the "update_frequency" mean how frequently the reward is updated? Like "temporal difference update"?
HYDesmondLiu
@HYDesmondLiu
For DQN agent setup, how should we set the 'memory'?
Alexander Kuhnle
@AlexKuhnle
@GANdalf2357 Yes, I think I've recently added some tests to catch "invalid" inputs, which otherwise might later lead to obscure errors. It's not covering this case properly, but if you try it with the latest commit in a few min, hopefully the problem should be gone.
@HYDesmondLiu update_frequency is not about the reward, but about the optimization step. So for each update you have a certain batch-size, say 10 episodes, or 64 timesteps, and unless specified differently, that will also be the update_frequency. However, you may want to update more frequently than that, so you may set update_frequency to something between 1 and batch_size (> batch_size doesn't really make sense).
Alexander Kuhnle
@AlexKuhnle
Regarding DQN agent and memory, it's typically a number >> batch_size, at least 100x or much more. Moreover, it's typical particularly here to set the update_frequency lower than batch_size, say 8 vs 64. But it's a parameter that can be tuned, at least to get the rough magnitude right.
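These rules of thumb (memory at least about 100x batch_size, update_frequency between 1 and batch_size) can be turned into a quick sanity check. The helper name `check_dqn_sizes` and the default ratio of 100 are illustrative choices, not part of Tensorforce.

```python
def check_dqn_sizes(memory, batch_size, update_frequency, min_memory_ratio=100):
    """Return a list of warnings; an empty list means the rules of thumb hold."""
    warnings = []
    if memory < min_memory_ratio * batch_size:
        warnings.append(
            f"memory={memory} is below {min_memory_ratio} x batch_size={batch_size}"
        )
    if not 1 <= update_frequency <= batch_size:
        warnings.append(
            f"update_frequency={update_frequency} is outside 1..batch_size={batch_size}"
        )
    return warnings


print(check_dqn_sizes(memory=6400, batch_size=64, update_frequency=8))
print(check_dqn_sizes(memory=300, batch_size=32, update_frequency=1))
```

The first call returns no warnings; the second flags the memory size, since 300 is far below 100 x 32.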
HYDesmondLiu
@HYDesmondLiu
@AlexKuhnle Thanks so much.
I ran into another problem while running DQN, where the action space should be discrete, so the actions should be int types and min_value, max_value, and num_values should be specified. However, if I set these three I get an error:
File "/home/hsinyu/hyliu_Python/lib/python3.6/site-packages/tensorforce/core/utils/tensor_spec.py", line 45, in __init__
name='TensorSpec', argument='min/max_value', condition='num_values specified'
tensorforce.exception.TensorforceError: Invalid TensorSpec argument min/max_value given num_values specified.
As I trace the source code, it seems like these three cannot all be set at the same time?
For example, one of my actions is $x_1 \in [58,80]$ and the other is $x_2 \in [0, 1500]$; should I set num_values for $x_1$ to $80-58+1 = 23$ and for $x_2$ to $1500-0=1500$?
What I do not understand is: since we have already set min_value and max_value, why not let the code calculate num_values itself?
Alexander Kuhnle
@AlexKuhnle
Hmm, yes, it could -- the idea is that min_/max_value are for float types as lower and upper bound, whereas num_values is for int types. The fact that it also specifies min_/max_value implicitly shouldn't matter, so you shouldn't need to specify all three, just the two or one, depending on the type. (The idea to still add it is, I think, they internally share some asserts via min/max-value, plus there could potentially be layer types which want to know min/max-bounds... but it really doesn't matter right now.)
So in short: if I understand correctly, you set all three values -- so the solution should be to only set num_values in case of int.
I will change the exception you pointed out so that it will always complain about the right thing being specified invalidly.
HYDesmondLiu
@HYDesmondLiu
@AlexKuhnle Thanks for prompt reply. But if I only set num_values how do agents know what the max_value and min_value are? I don't want to mess up my system.
Alexander Kuhnle
@AlexKuhnle
They will always be zero-based, so taking your example of num_values=23 produces values 0, ..., 22. Your environment can then just add the offset to actions, e.g. 58, or subtract the offset from states.
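The offset handling described here can be sketched as follows for the $x_1 \in [58, 80]$ example. The helper names are hypothetical; the `dict(type='int', num_values=...)` entry only mirrors the action specification discussed above. Note that if both bounds are inclusive, $x_2 \in [0, 1500]$ has 1501 possible values, not 1500.

```python
X1_MIN, X1_MAX = 58, 80
NUM_VALUES_X1 = X1_MAX - X1_MIN + 1   # 23 integer values: 58, 59, ..., 80

# Action specification with only num_values set (no min_value/max_value).
actions_spec = dict(type='int', num_values=NUM_VALUES_X1)


def to_env_value(action_index, offset=X1_MIN):
    """Map the agent's zero-based action (0..22) to the real value (58..80)."""
    return action_index + offset


def to_action_index(env_value, offset=X1_MIN):
    """Inverse mapping, e.g. when a real value is fed back as part of the state."""
    return env_value - offset


print(to_env_value(0), to_env_value(NUM_VALUES_X1 - 1))
```

Inside the environment, the conversion would typically happen at the start of execute(), before the action is applied to the system.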
HYDesmondLiu
@HYDesmondLiu
@AlexKuhnle Thanks a lot, it works~!
HYDesmondLiu
@HYDesmondLiu
Hi @AlexKuhnle, it's me again. I just tried setting only num_values without max_value and min_value, and it worked.
However, I now get this error and I am not sure how to debug it. Could you please give me some hints?
 Traceback (most recent call last):
File "reinforcement_learning_continuous.py", line 221, in <module>
save_best_agent=True
File "/home/hsinyu/hyliu_Python/lib/python3.6/site-packages/tensorforce/execution/runner.py", line 548, in run
self.handle_observe(parallel=n)
File "/home/hsinyu/hyliu_Python/lib/python3.6/site-packages/tensorforce/execution/runner.py", line 659, in handle_observe
terminal=self.terminals[parallel], reward=self.rewards[parallel], parallel=parallel
File "/home/hsinyu/hyliu_Python/lib/python3.6/site-packages/tensorforce/agents/agent.py", line 511, in observe
terminal=terminal_tensor, reward=reward_tensor, parallel=parallel_tensor
File "/home/hsinyu/hyliu_Python/lib/python3.6/site-packages/tensorforce/core/module.py", line 128, in decorated
output_args = function_graphs[str(graph_params)](*graph_args)
File "/home/hsinyu/hyliu_Python/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 784, in __call__
result = self._call(*args, **kwds)
File "/home/hsinyu/hyliu_Python/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 818, in _call
results = self._stateful_fn(*args, **kwds)
File "/home/hsinyu/hyliu_Python/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2972, in __call__
filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
File "/home/hsinyu/hyliu_Python/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1948, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/home/hsinyu/hyliu_Python/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 561, in call
ctx=ctx)
File "/home/hsinyu/hyliu_Python/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
**tensorflow.python.framework.errors_impl.InvalidArgumentError:  Invalid gradient: contains inf or nan. : Tensor had NaN values**
[[{{node agent/StatefulPartitionedCall/agent/cond_1/then/_311/agent/cond_1/StatefulPartitionedCall/agent/StatefulPartitionedCall_7/policy_optimizer/StatefulPartitionedCall/policy_optimizer/VerifyFinite/CheckNumerics}}]] [Op:__inference_observe_5452]

Function call stack:
observe
Alexander Kuhnle
@AlexKuhnle
Phew, that could have many reasons. While there are a few assertions to catch inf/nan inputs, it's hard to say what causes inf/nan gradients. I'm a bit confused by where this exception is thrown, since there is a gradient inf/nan check as well. Can you post the agent config? How quickly does this come up? I assume it trains okay for a while, before throwing this exception? Can you check whether the agent is always choosing the same action, or whether there's variation?
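One inexpensive thing to rule out when chasing inf/nan gradients is a non-finite state or reward coming out of the environment. A minimal sketch, assuming states are flat sequences of numbers; `all_finite` is a hypothetical helper, not a Tensorforce function.

```python
import math


def all_finite(values):
    """True if every entry is a finite number (no NaN, no +/-inf)."""
    return all(math.isfinite(float(v)) for v in values)


state = (0, 0, 0, 0, 0.34243, 0.5424, 0.4211)
reward = -1.5

print(all_finite(state), all_finite([reward]))
print(all_finite([1.0, float('nan')]))
```

Calling such a check on the outputs of environment.execute() before agent.observe() shows whether the bad values originate in the environment or only later, inside the update.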
HYDesmondLiu
@HYDesmondLiu
@AlexKuhnle Sorry for the late reply, it does not happen now.
HYDesmondLiu
@HYDesmondLiu
Hi @AlexKuhnle I think I am a bit confused. What are the algorithms in Tensorforce that are "model-based"?
Alexander Kuhnle
@AlexKuhnle
Hi, there are no model-based algorithms in Tensorforce, and probably won't be for the foreseeable future, the framework focuses on the typical model-free algorithm classes, in particular Q-learning and policy gradient.
R Puttkammer
@rpgit12
@AlexKuhnle - I'm using PPO for an AI Gym-like problem (network=auto, batch_size=100, state=float(77), action=int(4)) and am surprised to see that exploration defaults to 0. Isn't exploration>0 required during training? And shouldn't it be reset to 0 afterwards? Couldn't find specifics in docs or sample code. Thanks!
HYDesmondLiu
@HYDesmondLiu
@AlexKuhnle Thanks a lot. That makes sense.
Alexander Kuhnle
@AlexKuhnle
@rpgit12 Exploration is only typical for deterministic policies like DQN or DPG. Policy gradient algorithms, on the other hand, sample an action from the policy distribution, so they have exploration kind of built in. Moreover, I would say it depends to some degree on the randomness of your environment, whether the agent needs to be encouraged to explore or will be "forced to explore" due to a random environment.
(the latter point is secondary, though)
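The difference can be illustrated with a small sketch: a greedy value-based policy only deviates from its best action when exploration is added explicitly (epsilon-greedy here), while a stochastic policy varies its actions simply by sampling. Both functions are generic textbook illustrations, not Tensorforce internals.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Greedy w.r.t. Q-values, except for a random action with probability epsilon."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def sample_from_policy(logits, rng):
    """Sample an action from a softmax distribution over the logits."""
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]   # unnormalized softmax
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)
greedy_actions = {epsilon_greedy([0.1, 0.9, 0.3], 0.0, rng) for _ in range(100)}
sampled_actions = {sample_from_policy([0.1, 0.9, 0.3], rng) for _ in range(100)}
print(greedy_actions, sampled_actions)
```

With epsilon=0 the greedy policy picks the same action every time, whereas the sampled policy still tries different actions without any exploration setting.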
HYDesmondLiu
@HYDesmondLiu

@AlexKuhnle I got this error, TypeError: __init__() got multiple values for keyword argument 'optimizer', while setting up A2C like this:

elif args.agent == "a2c":
    agent = Agent.create(
        agent='a2c',
        environment=environment,
        max_episode_timesteps=5,
        batch_size=32,
        network=[
            dict(type='dense', size=128, activation='tanh'),
            dict(type='dense', size=64, activation='tanh')
        ],
        optimizer=dict(
            multi_step=10, subsampling_fraction=64, linesearch_iterations=5,
            doublecheck_update=True
        ),
        critic=[
            dict(type='dense', size=64, activation='tanh'),
            dict(type='dense', size=64, activation='tanh')
        ],
    )

However, the documentation describes the 'optimizer' argument as the place where the optimizer is supposed to be set.

Alexander Kuhnle
@AlexKuhnle
Currently, the A2C agent only has an argument learning_rate, whereas the rest of optimizer is implicitly specified as ADAM. You would need to move to the tensorforce agent to have all arguments available. The idea behind the various agent sub-classes is to provide a "standard" interface with only the typical arguments (and things like subsampling_fraction, linesearch are not typical for A2C), however, maybe I'll change this at some point.
HYDesmondLiu
@HYDesmondLiu
@AlexKuhnle Thanks, this is very helpful.
R Puttkammer
@rpgit12
@AlexKuhnle - thanks indeed!
HYDesmondLiu
@HYDesmondLiu

Hi @AlexKuhnle, regarding your earlier questions about the DQN gradient error, here are my responses:
1. Agent config:

agent = Agent.create(
    agent="dqn",
    environment=environment,
    memory=300,
    batch_size=32,
    network="auto",
    update_frequency=1,
    learning_rate=1e-5,
    discount=0.9
)

2.&3. Very quick, running on the first episode.
4.&5. There is only one action since it happened on the first episode.

amirrezaheidari
@amirrezaheidari

Hi
I need to train my agent for many iterations. My agent interacts with TRNSYS as environment and calls it at each action to calculate the next state. Therefore, it takes a long time to train it (8 hours). I need to reduce this time.

1- Do you think if I perform parallel computing I will get fast enough operation?

2- I tried the following way to test if I can do parallel computing but I get the following error:

###### #

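from tensorforce.execution import Runner

# The Runner must be assigned to a name before run() can be called on it.
runner = Runner(
    agent=r'C:\Python_TRNSYS_integration\Parallelization test\agent.json',
    environment=HotWaterEnvironment,
    num_parallel=4
)
runner.run(num_episodes=100, batch_agent_calls=True)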

###### #

TypeError: __init__() got an unexpected keyword argument 'internals'

###### #

3- If the above code works and is fast enough, I need to parallelize the following code, which is the same training as above, except that I store some values in each iteration. Can you show me how to parallelize the following code?

# Training
States_train_target = []
Actions_train_target = []
Rewards_train_target = []
Energy_train_target = []

Reward_total_train_allepisodes_target = []
Reward_energy_train_allepisodes_target = []
Reward_comfort_train_allepisodes_target = []
Reward_hygiene_train_allepisodes_target = []
Energy_train_allepisodes_target = []

for episode in range(int(Train_weeks)):
    print(episode)

    # Per-episode logs
    Reward_total_train_eachepisode_target = []
    Reward_energy_train_eachepisode_target = []
    Reward_comfort_train_eachepisode_target = []
    Reward_hygiene_train_eachepisode_target = []
    Energy_train_eachepisode_target = []

    states = environment.reset()
    environment.timestep = episode * episode_times

    terminal = False
    while not terminal:
        actions = agent.act(states=states)
        states, terminal, reward = environment.execute(actions=actions)
        states = tuple(states)
        agent.observe(terminal=terminal, reward=reward)

        Reward_total_train_eachepisode_target.append(reward)
        #Reward_energy_train_eachepisode_target.append(reward[1])
        #Reward_comfort_train_eachepisode_target.append(reward[2])
        #Reward_hygiene_train_eachepisode_target.append(reward[3])
        #Energy_train_eachepisode_target.append(energy)
        States_train_target.append(states)
        Actions_train_target.append(actions)
        Rewards_train_target.append(reward)
        #Energy_train_target.append(energy)

    Reward_total_train_allepisodes_target.append(Reward_total_train_eachepisode_target)
    #Reward_energy_train_allepisodes_target.append(Reward_energy_train_eachepisode_target)
    #Reward_comfort_train_allepisodes_target.append(Reward_comfort_train_eachepisode_target)
    #Reward_hygiene_train_allepisodes_target.append(Reward_hygiene_train_eachepisode_target)
    #Energy_train_allepisodes_target.append(Energy_train_eachepisode_target)
###### #

I do not have access to an NVIDIA GPU, so I guess the only way would be to parallelize the above code.

Thanks