Alexander Kuhnle
@AlexKuhnle
@chris405_gitlab , thanks, that looks correct, didn't consider that lambda attributes are treated differently from class functions. I'm surprised I never came across that... :-/
amirrezaheidari
@amirrezaheidari
Hello, I am applying the "tensorforce" and "dqn" agents to my problem, and tensorforce is performing better. May I ask what the algorithm behind this agent is?
amirrezaheidari
@amirrezaheidari
Also, I have another question about "episode", which is not clear to me. In your example of the room temperature controller, you reset the environment at the beginning of each episode, then interact with the environment for a certain number of timesteps. This is repeated for 200 episodes. But assume that in my problem, I have a dataset of 1000 rows. What I assume is that I should keep, for example, 80% of this data for training. Then, in each episode, I should cycle through all of the rows. So I am cycling 200 times over the same training data. Am I right?
Alexander Kuhnle
@AlexKuhnle
Hi @amirrezaheidari, learning via RL from a fixed dataset does not follow the "standard setup". There are two options: first, you can wrap the dataset in a "pseudo-environment" and learn via the usual agent-environment episodes setup, but the question is how does this environment react to model actions that don't follow the dataset? One option is to introduce "episodes" here: you just terminate the episode when the agent does something "invalid" (according to your dataset), and potentially give a negative reward as well, if desired. Second, you can use behavioral cloning and other off-policy learning techniques. Tensorforce provides a basic behavioral-cloning-like approach, as illustrated e.g. here. Feel free to write me a message if you have more questions, or here in the channel if they're not very specific to your problem.
And regarding the Tensorforce agent: it's basically the "parent" of all agents in Tensorforce, DQN and others are more specific configurations of this agent. So what algorithm it is depends on the arguments -- are you using the default arguments?
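The "pseudo-environment" option described above could look roughly like the following. This is only a minimal sketch under assumptions: the class name `DatasetPseudoEnvironment`, the row layout `(state, expected_action)`, the reward of `1.0` for a matching action and the `invalid_penalty` default are all hypothetical, and the class only mimics the `reset()`/`execute()` shape of an environment rather than subclassing anything from Tensorforce.

```python
class DatasetPseudoEnvironment:
    """Replays a fixed dataset of (state, expected_action) rows as episodes.

    Hypothetical helper: an episode ends early, with a penalty, as soon as
    the agent picks an action that the dataset does not contain at that row.
    """

    def __init__(self, rows, invalid_penalty=-1.0):
        self.rows = rows                      # list of (state, expected_action)
        self.invalid_penalty = invalid_penalty
        self.index = 0

    def reset(self):
        # Every episode starts again at the first row of the dataset.
        self.index = 0
        return self.rows[0][0]

    def execute(self, actions):
        state, expected_action = self.rows[self.index]
        if actions != expected_action:
            # "Invalid" action: terminate the episode and penalise.
            return state, True, self.invalid_penalty
        self.index += 1
        terminal = self.index >= len(self.rows)
        # On the last row there is no successor, so repeat the final state.
        next_state = self.rows[min(self.index, len(self.rows) - 1)][0]
        return next_state, terminal, 1.0
```

With this layout, cycling over the training rows for 200 episodes just means calling `reset()` 200 times; how far each episode gets depends on how long the agent stays consistent with the data.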
wushilan
@wushilan
Hello. I am trying to solve a high-dimensional action space problem. The action space has about 40 two-dimensional variables. I used Tensorforce's Dueling-DQN to solve this problem before, and achieved ideal results. I recently learned that DQN and related algorithms cannot be used for high-dimensional action space problems, but the Dueling-DQN algorithm in Tensorforce does solve it. Does Tensorforce optimize the Dueling-DQN algorithm for high-dimensional action spaces?
wushilan
@wushilan
This is my action spec ::<class 'dict'>: {'x_': {'type': 'int', 'shape': (2, 2, 4), 'num_values': 5}, 'y': {'type': 'int', 'shape': (10, 3, 4), 'num_values': 2}, 'z': {'type': 'int', 'shape': (10, 2, 4, 4), 'num_values': 2}}
Alexander Kuhnle
@AlexKuhnle
Hi @wushilan, great to hear that the DuelingDQN agent worked so surprisingly well! :-) Regarding high-dimensional action spaces, I think this "conflict" may be due to two versions of "high-dimensional": DQN doesn't scale well with the number of available actions (num_values), and if you look at your action space as a product space, (2,2,4) x 5 x (10,3,4) x 2 x (10,2,4,4) x 2, that's of course gigantic. However, Tensorforce splits this space into its factors, so it's (2,2,4) actions with 5 options, and (10,3,4) + (10,2,4,4) actions with two options, and so each individual action is actually quite "low-dimensional", and this factorization works well if the actions are not correlated in very complex ways -- which presumably they aren't. I'm not sure how common such a factorization is for other frameworks, but I would be surprised if this is very uncommon. Anyway, that's the only additional feature I can think of in Tensorforce which is beneficial in such a context (and this factorization may go particularly well with the "dueling" part, but not sure).
wushilan
@wushilan
Thank you for your answer. I think you mentioned two reasons. The second reason is that the actions are processed as x, y, and z respectively, which will reduce the total dimension. So does Tensorforce need to build three neural networks separately, or divide the output of a neural network into three parts and then select x, y, and z from these three parts? Can you elaborate on the first reason? I don't quite understand it yet. For example, for a (2,2,4) action with 5 options, is its dimension 5^(2X2X4)?
wushilan
@wushilan
My main question is, should it calculate 5^(2X2X4) Q-values or just calculate 2X2X4X5 Q-values when processing (2,2,4) actions with 5 options? In other words, is this action internally processed jointly or independently?
Alexander Kuhnle
@AlexKuhnle
Regarding the first reason, that's my point. Standard DQN assumes a single discrete action, and a space like yours could be fit into this framework by looking at the product space, so 5^(2x2x4). However, Tensorforce factorizes such spaces into 2x2x4x5, as you suggest. Note the difference, e.g. if you choose an action via argmax over 5 Q-values, for each 2x2x4 action independently, or if you take into account the effect of combinations. Of course, it makes sense to factorize, but I don't know how common it is in implementations, since standard DQN doesn't do that.
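To make the difference between the two counts concrete, here is the arithmetic for the `(2, 2, 4)` action with 5 options discussed above. The variable names are illustrative only.

```python
import math

shape = (2, 2, 4)      # shape of the action tensor from the action spec
num_values = 5         # options per individual action

num_actions = math.prod(shape)                    # 16 individual actions

# Product-space view: one Q-value per joint combination of all 16 actions.
joint_q_values = num_values ** num_actions        # 5^16

# Factorized view: 5 Q-values for each of the 16 actions, chosen independently.
factorized_q_values = num_actions * num_values    # 16 * 5

print(joint_q_values, factorized_q_values)        # 152587890625 80
```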
Alexander Kuhnle
@AlexKuhnle
Regarding the second reason: You're right, there is an additional hierarchical aspect in Tensorforce. First, each of the 2x2x4 etc actions get their own (independent!) linear layer mapping network output embedding to 2/5 Q-values. This is implemented as one big matrix multiplication yielding the flattened action tensor, embedding -> 2*2*4 * {2,5}. Second, each "action component", so x,y,z, are implemented as separate matrix multiplications. However, ultimately this just means: the network produces an output embedding, and for each action (with N alternatives) across all components we have an independent linear transformation embedding -> N Q-values.
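The "one big matrix multiplication" idea can be sketched with NumPy as below. This is not Tensorforce's actual code, only an illustration of the description above for a single action component; the embedding size of 64 and the random weights are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

embedding_size = 64            # assumed size of the network output embedding
action_shape = (2, 2, 4)       # one action component, e.g. 'x_' from the spec
num_values = 5                 # options per individual action
num_actions = int(np.prod(action_shape))

# Stand-in for the network output and for the linear layer's parameters.
embedding = rng.normal(size=(embedding_size,))
weights = rng.normal(size=(embedding_size, num_actions * num_values))
bias = np.zeros(num_actions * num_values)

# One matrix multiplication yields all Q-values in flattened form ...
q_flat = embedding @ weights + bias

# ... which are then viewed as "num_values Q-values per individual action".
q_values = q_flat.reshape(action_shape + (num_values,))

# Greedy choice: an independent argmax over the last axis for each action.
greedy_actions = q_values.argmax(axis=-1)
```

Because each column block of `weights` only feeds one action's Q-values, this is equivalent to giving every individual action its own small linear layer on top of the shared embedding.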
wushilan
@wushilan
Thank you for your detailed answers, and thank you for contributing such an outstanding work to Tensorforce.
@notreadyyet
Hello, folks. I've slightly changed the quick start code and saved a model, like so: agent.save(directory=g_sModel1, format='numpy'). That gave me agent.json and agent.npz files. Now, when I'd like to make some predictions, I load that model, like so:
and feed new data to it:
actions = agent.act(states=new_data).
Well, that agent.act returned an action the first time. The second time it threw the following exception: "tensorforce.exception.TensorforceError: Calling agent.act must be preceded by agent.observe.". Issue #683 on GitHub is about the same exception, but it doesn't explain the cause. I am not trying to continue training; I'd like to make predictions. Therefore calling agent.observe with a reward doesn't make sense to me. I must be missing something very basic here. Would anyone please point a lamer to his mistake? Thanks.
Alexander Kuhnle
@AlexKuhnle
Hi @notreadyyet , if you're calling only agent.act() for inference, you need to set the argument independent=True (and, if desired, also deterministic=True), to signal that this act() is not part of training and hence won't be followed by observe(). You're right, the exception could be more helpful here -- I will change that.
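A small sketch of the inference pattern described above: every `act()` call passes `independent=True` and `deterministic=True`, so no `observe()` is expected in between. The helper name `predict` is hypothetical, `agent` is assumed to be an already loaded Tensorforce agent, and the sketch assumes a policy without internal (recurrent) states.

```python
def predict(agent, observations):
    """Query a trained agent for actions without touching its training state.

    `agent` is assumed to be an already loaded agent; `observations` is any
    iterable of states in the format the agent was trained on.
    """
    actions = []
    for states in observations:
        # independent=True: this act() is not part of training, so it does
        # not need to be followed by observe().
        # deterministic=True: take the greedy action, no exploration/sampling.
        actions.append(
            agent.act(states=states, independent=True, deterministic=True)
        )
    return actions
```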
amirrezaheidari
@amirrezaheidari
Hi, I am developing a Reinforcement Learning control model using DQN agent. May I ask how the exploitation/exploration dilemma in DQN agent should be adjusted? Is there a parameter for that?
In general, I found Tensorforce a very flexible and straightforward library for Reinforcement Learning. However, the lack of examples makes it a bit challenging, especially for a beginner like me, to easily implement it for different problems. Would it be possible to provide an example of the proper implementation of DQN? For example, a modification of the current room temperature controller example that uses DQN would be a nice tutorial.
Thank you
Alexander Kuhnle
@AlexKuhnle
Hi @amirrezaheidari , thanks for the feedback. DQN is one of the available agent types, so it shouldn't require any implementation. Obviously the hyperparameters still need to be chosen and tuned, but that can't really be handled by the framework. Regarding the exploration/exploitation tradeoff: there is an argument exploration, which specifies the fraction of random actions (as typical for DQN exploration) and which can be chosen either as constant or e.g. decaying parameter. I agree that the framework needs more tutorials, and I would welcome anyone who is getting started on a suitable "tutorial problem" and going through the process of getting it to work to contribute a notebook or example script similar to the room temperature example.
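What "fraction of random actions", either constant or decaying, means in practice can be sketched in plain Python. This is a generic epsilon-greedy illustration, not Tensorforce's internal implementation; the function names and the linear schedule are assumptions chosen for the example.

```python
import random


def epsilon_greedy(q_values, exploration, rng=random):
    """Pick a random action with probability `exploration`, else the greedy one."""
    if rng.random() < exploration:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def linear_decay(step, num_steps, initial_value, final_value):
    """Exploration value after `step` timesteps of a linear decay schedule."""
    fraction = min(step / num_steps, 1.0)   # stays at final_value afterwards
    return initial_value + fraction * (final_value - initial_value)
```

A constant setting corresponds to always passing the same `exploration` value, while a decaying setting recomputes it from the current timestep, so the agent acts mostly randomly early on and mostly greedily later.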
amirrezaheidari
@amirrezaheidari

Hello
I have developed an agent to control a heating system. However, every time I run my code, exactly the same code with the same parameters, I observe a different performance (sometimes very different). I was thinking it may be because of exploration, in which the agent takes random actions, so the performance is different in each run. So I tried decaying the exploration with the "st_exp" function that I found in this channel. However, it still performs differently in each run with the same parameters.
Most of the time I do not get satisfactory performance metrics, so I need to tune my reward function, but as long as I get very different performance under the same reward function I cannot tune it. For example, the same parameters were giving me quite good performance last night, but now the performance is terrible.

(1) Any suggestions on what the reason can be?

(2) Also, it is not clear to me what "st_exp" does exactly. What is the function exactly?

(3) For "st_exp", can you please let me know what parameters you suggest to use for "decay_steps", "final_value" and "num_steps=Train_hours_number"? As the name implies, "Train_hours_number " is the number of my training hours

(4) How can we see what the default parameters of the agent are? For example, the default exploration or the default network architecture.

st_exp = dict(
    type='decaying', unit='timesteps', decay='polynomial', decay_steps=100000,
    initial_value=1.0, final_value=0, power=1.0, num_steps=Train_hours_number
)

agent = Agent.create(
    agent='dqn',
    max_episode_timesteps=Train_hours_number,
    environment=environment,
    network=[
        dict(type='dense', size=50, activation='tanh'),
        dict(type='flatten'),
        dict(type='dense', size=50, activation='tanh'),
        dict(type='flatten', name="out"),
    ],
    exploration=st_exp,
    learning_rate=1e-3,
    batch_size=72,
    memory=Train_hours_number
)
Alexander Kuhnle
@AlexKuhnle
Hey, for the exploration arguments, the docs may help, and similarly for the default arguments. Variability in results across different runs is quite normal in RL, at least for solutions that are not perfectly stable. What you can do is average over multiple runs per configuration, if that's feasible. Other than that it's very hard to say what the problem is, how to solve it, or whether it even can be solved. One thing I would recommend, though: start with the PPO agent, not the DQN agent, since the former tends to be better and more stable.
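Averaging over multiple runs per configuration can be as simple as the sketch below. Here `run_fn` is a hypothetical user-supplied function that trains and evaluates one agent for a given seed and returns a single score; nothing in this snippet comes from Tensorforce itself.

```python
import statistics


def summarize_runs(run_fn, seeds):
    """Run the same configuration once per seed and summarise the scores.

    `run_fn(seed)` is assumed to train and evaluate one agent and return a
    single number (e.g. mean test reward). At least two seeds are needed
    for the standard deviation.
    """
    scores = [run_fn(seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)
```

Comparing two reward functions then means comparing their means relative to the spread, rather than comparing two single runs.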
amirrezaheidari
@amirrezaheidari

Hi @AlexKuhnle , thanks for the documents. I went through them and now the concepts are clearer to me. However, I still have two problems: (1) I get "very" different results on different runs, and (2) the agent performance is very poor most of the time. I tried to see what is contributing most to this unstable and poor behavior. It seems to me that the rest of the code (definition of environment, states, reward, etc.) is fine and the agent definition is the problem. For example, when I change the batch_size, the results change significantly. After reading about batch_size, update_frequency and other parameters, I tried the following:
(1) Changing the agent to ppo, double_dqn, tensorforce
(2) Comparing episodes of 1 week versus 1 day
(3) Comparing different weights on the reward function components
(5) Increasing the learning rate
(6) Changing batch_size to 8, 12, 24 (without specifying update frequency)
(7) Including and excluding exploration
Finally, I have designed my agent as follows, but the issue is not solved: the performance is still poor and varying. Do you have any suggestions for other modifications I can try with the agent?
My agent interacts with the environment for 13 weeks, and each episode is one day (24 timesteps). Then I test the trained model over 2.5 weeks. The problem is not that hard, just turning a heater on and off.

linear_decay = dict(
    type='decaying', unit='timesteps', decay="linear", num_steps=168,
    initial_value=0.99, final_value=0.01
)

agent = Agent.create(
    agent='dqn',
    learning_rate=1e-3,
    environment=environment,
    batch_size=24,
    update_frequency=8,
    exploration=linear_decay,
    network="auto",
    memory=Train_hours_number
)

Alexander Kuhnle
@AlexKuhnle
It's very hard to say what could be wrong and what could help. A few comments: (a) I would recommend using the PPO agent since it avoids some potentially-hard-to-choose hyperparameters like memory size or target network (note one detail when switching: the unit for batch-size and update-frequency is "episodes" for PPO, not "timesteps" as is the case for DQN, so the numbers will generally be lower, like batch-size 8 frequency 1 or so). (b) Looking at Tensorboard plots may help. (c) 13 weeks, i.e. 13*7*24 = 2184 timesteps is not an awful lot, I would push this number more towards 100k. (d) How do you decide that the environment "is fine" and the agent is not? It's often worth assessing how state/action representation and reward function affect performance (harder to do if performance is just indistinguishably bad, of course).
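The numbers in points (a) and (c) can be worked out as follows. The variable names are illustrative, and the idea of reaching about 100k timesteps by replaying the same 13 weeks several times is an assumption about how one might get there, not something stated in the thread.

```python
import math

weeks, days_per_week, hours_per_day = 13, 7, 24

# One pass over the training period, with one timestep per hour.
timesteps_per_pass = weeks * days_per_week * hours_per_day    # 2184
episodes_per_pass = weeks * days_per_week                     # 91 one-day episodes

# Passes over the same data needed to reach roughly 100k timesteps.
target_timesteps = 100_000
passes_needed = math.ceil(target_timesteps / timesteps_per_pass)   # 46

# For PPO the units are episodes, so the suggested values are small.
# These keyword arguments could be passed to Agent.create(environment=..., **ppo_kwargs).
ppo_kwargs = dict(agent='ppo', batch_size=8, update_frequency=1)
```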
Benno Geißelmann
@GANdalf2357
hi, currently I'm trying to let tensorforce train on a GPU cluster. Unfortunately I'm getting this error and I could not figure out what the problem is. Maybe one of you had the same issue?
File "/home/geb1imb/.conda/envs/tx/lib/python3.7/site-packages/tensorforce/core/models/model.py", line 315, in initialize
self.initialize_api()
File "/home/geb1imb/.conda/envs/tx/lib/python3.7/site-packages/tensorforce/core/models/tensorforce.py", line 629, in initialize_api
super().initialize_api()
File "/home/geb1imb/.conda/envs/tx/lib/python3.7/site-packages/tensorforce/core/models/model.py", line 373, in initialize_api
parallel=self.parallel_spec.empty(batched=True)
File "/home/geb1imb/.conda/envs/tx/lib/python3.7/site-packages/tensorforce/core/module.py", line 128, in decorated
output_args = function_graphsstr(graph_params)
File "/home/geb1imb/.conda/envs/tx/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
result = self._call(*args, **kwds)
File "/home/geb1imb/.conda/envs/tx/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 846, in _call
return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds) # pylint: disable=protected-access
File "/home/geb1imb/.conda/envs/tx/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
cancellation_manager=cancellation_manager)
File "/home/geb1imb/.conda/envs/tx/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/home/geb1imb/.conda/envs/tx/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 550, in call
ctx=ctx)
File "/home/geb1imb/.conda/envs/tx/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation agent/StatefulPartitionedCall/agent/Gather: Could not satisfy explicit device specification '' because the node {{colocation_node agent/StatefulPartitionedCall/agent/Gather}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0, /job:localhost/replica:0/task:0/device:GPU:0].
Colocation Debug Info:
Colocation group had the following types and supported devices:
Identity: GPU CPU XLA_CPU XLA_GPU
_Arg: GPU CPU XLA_CPU XLA_GPU
ResourceGather: GPU CPU XLA_CPU XLA_GPU
seems like it is trying to execute an operation on a GPU which is maybe not supported? "agent/StatefulPartitionedCall/agent/Gather"
Alexander Kuhnle
@AlexKuhnle
Hi @GANdalf2357 , I think this issue was recently resolved, so updating to the latest Github master should help (or if you're using the pip version, changing to the Github version -- of course, sooner or later there will be a new pip version, maybe I should do that soon).
Benno Geißelmann
@GANdalf2357
@AlexKuhnle thanks for the info!
HYDesmondLiu
@HYDesmondLiu
Hi Experts,
With PPO agent, for example if I want to set "reward_estimation" with horizon=5.
How do I do so? I have tried agent = Agent.create( agent = 'ppo', ... reward_estimation=dict(horizon=5), ...)
Then I got an error: TypeError: __init__() got multiple values for keyword argument 'reward_estimation'
Alexander Kuhnle
@AlexKuhnle
Hi @HYDesmondLiu, setting a reward horizon for PPO is "not possible", since PPO as policy gradient algorithm is episode-based, not n-step. That's at least the reason why this option is not offered as configuration for the PPO agent. However, one could, in principle, configure a "PPO-like variant" using a shorter estimation horizon. How: by replicating the PPO config using the more general Tensorforce agent (the parent of all agents in Tensorforce), and then modifying the corresponding argument. I was planning to add these configs for the users who want to modify agent types beyond their "intended domain", if that would help.
HYDesmondLiu
@HYDesmondLiu
Hi @AlexKuhnle , Thanks a lot for the response, sorry I did not figure it out clearly. Your solution looks good. (seems like I cannot reply in the thread?)
Alexander Kuhnle
@AlexKuhnle
@HYDesmondLiu , I've added the example config for PPO based on the more general Tensorforce agent in benchmarks/configs/ppo_tensorforce.json. It is equivalent to the config ppo.json.
HYDesmondLiu
@HYDesmondLiu
@AlexKuhnle Thanks a lot will try it out.
May I know what the easiest way to record reward vs. episode is?
Sometimes I cannot tee the progress output from Tensorforce.
Alexander Kuhnle
@AlexKuhnle
Have you tried the TensorBoard summaries?
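Besides TensorBoard, reward vs. episode can also be recorded by hand in the act-observe loop. The following is a minimal sketch: the helper name `run_and_log` and the CSV layout are assumptions, while the loop itself follows the usual `act()` / `execute()` / `observe()` pattern of the Tensorforce agent and environment interfaces.

```python
import csv


def run_and_log(agent, environment, num_episodes, path):
    """Run training episodes and write one (episode, return) row per episode."""
    returns = []
    for _ in range(num_episodes):
        states = environment.reset()
        terminal = False
        episode_return = 0.0
        while not terminal:
            actions = agent.act(states=states)
            states, terminal, reward = environment.execute(actions=actions)
            agent.observe(terminal=terminal, reward=reward)
            episode_return += reward
        returns.append(episode_return)

    # Written once at the end; the file does not depend on console output.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["episode", "return"])
        writer.writerows(enumerate(returns))
    return returns
```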
Benno Geißelmann
@GANdalf2357

> Hi @GANdalf2357 , I think this issue was recently resolved, so updating to the latest Github master should help (or if you're using the pip version, changing to the Github version -- of course, sooner or later there will be a new pip version, maybe should do that soon).

Hi @AlexKuhnle I have now switched to master to overcome this issue, but now the reset() call fails. Do I have to change something in my state with the new master? I now get this error in reset(): tensorforce.exception.TensorforceError: Environment.reset: invalid type <class 'tuple'> != float for state.

till now I was on 0.6.2 which worked fine with my code
Benno Geißelmann
@GANdalf2357
my state is a tuple roughly like this: (0,0,0,0,0.34243,0.5424,0.4211)
on a first look: if i remove the check which leads to this error the training seems to run fine again.
HYDesmondLiu
@HYDesmondLiu

> Have you tried the TensorBoard summaries?

@AlexKuhnle Thanks very much, will try !

HYDesmondLiu
@HYDesmondLiu
Does the "update_frequency" mean how frequently the reward is updated? Like "temporal difference update"?
HYDesmondLiu
@HYDesmondLiu
For DQN agent setup, how should we set the 'memory'?
Alexander Kuhnle
@AlexKuhnle
@GANdalf2357 Yes, I think I've recently added some tests to catch "invalid" inputs, which otherwise might later lead to obscure errors. It's not covering this case properly, but if you try it with the latest commit in a few min, hopefully the problem should be gone.
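Independently of the fix mentioned above, one defensive option on the user side is to return the state as a NumPy float array instead of a plain tuple. This is only a sketch: the helper name `as_state` is hypothetical, and whether it satisfies the stricter input check is an assumption that is not confirmed in this thread.

```python
import numpy as np


def as_state(raw_state):
    """Convert a tuple/list state into a float32 NumPy array."""
    return np.asarray(raw_state, dtype=np.float32)


# Example with a state like the one quoted above.
state = as_state((0, 0, 0, 0, 0.34243, 0.5424, 0.4211))
```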
@HYDesmondLiu update_frequency is not about the reward, but about the optimization step. So for each update you have a certain batch-size, say 10 episodes, or 64 timesteps, and unless specified differently, that will also be the update_frequency. However, you may want to update more frequently than that, so you may set update_frequency to something between 1 and batch_size (> batch_size doesn't really make sense).
Alexander Kuhnle
@AlexKuhnle
Regarding DQN agent and memory, it's typically a number >> batch_size, at least 100x or much more. Moreover, it's typical particularly here to set the update_frequency lower than batch_size, say 8 vs 64. But it's a parameter that can be tuned, at least to get the rough magnitude right.
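The rules of thumb from the last two messages can be written down as a small sanity check. The function name and the exact threshold of 100x are taken loosely from the text ("at least 100x or much more") and are illustrative, not a Tensorforce API.

```python
def check_dqn_sizes(batch_size, update_frequency, memory):
    """Return a list of rule-of-thumb violations for DQN-style settings."""
    problems = []
    if not 1 <= update_frequency <= batch_size:
        problems.append("update_frequency should lie between 1 and batch_size")
    if memory < 100 * batch_size:
        problems.append("memory should be at least ~100x batch_size")
    return problems


# Example values in the spirit of the message above: 8 vs 64, memory >> batch.
batch_size = 64
update_frequency = 8
memory = 100 * batch_size      # 6400 timesteps as a lower bound
```

For instance, a configuration with `batch_size=24` and a memory of 2184 timesteps would be flagged by the second rule, since 2184 is well below 100 x 24.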
HYDesmondLiu
@HYDesmondLiu
@AlexKuhnle Thanks so much.
I met another problem while running DQN, where the action space should be discrete so the actions should be int types and min_value, max_value, and num_value should be specified. However if I set these three I got an error File "/home/hsinyu/hyliu_Python/lib/python3.6/site-packages/tensorforce/core/utils/tensor_spec.py", line 45, in __init__ name='TensorSpec', argument='min/max_value', condition='num_values specified' tensorforce.exception.TensorforceError: Invalid TensorSpec argument min/max_value given num_values specified. As I trace the source code it seems like these three cannot be all set at the same time?
For example, one of my actions is $x_1 \in [58,80]$ and the other is $x_2 \in [0, 1500]$. Should I set the num_values for $x_1$ as $80-58+1 = 23$ and $x_2$ as $1500-0=1500$?
What I do not understand is, since we have already set min_value and max_value, why not let the code calculate num_values itself?
Alexander Kuhnle
@AlexKuhnle
Hmm, yes, it could -- the idea is that min_/max_value are for float types as lower and upper bound, whereas num_values is for int types. The fact that it also specifies min_/max_value implicitly shouldn't matter, so you shouldn't need to specify all three, just the two or one, depending on the type. (The idea to still add it is, I think, they internally share some asserts via min/max-value, plus there could potentially be layer types which want to know min/max-bounds... but it really doesn't matter right now.)
So in short: if I understand correctly, you set all three values -- so the solution should be to only set num_values in case of int.
I will change the exception you pointed out so that it will always complain about the right thing being specified invalidly.
HYDesmondLiu
@HYDesmondLiu
@AlexKuhnle Thanks for prompt reply. But if I only set num_values how do agents know what the max_value and min_value are? I don't want to mess up my system.
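One possible way to handle the question above is to let the agent choose an index and to add the offset inside the environment. The sketch assumes that an `int` action with `num_values=N` takes values `0 .. N-1`; the helper names are hypothetical, and an inclusive range `[min, max]` contains `max - min + 1` values.

```python
def int_action_spec(min_value, max_value):
    """Action spec for an integer action covering [min_value, max_value]."""
    return dict(type='int', num_values=max_value - min_value + 1)


def to_physical(index, min_value):
    """Map the agent's index (0 .. num_values - 1) back to the real value."""
    return min_value + index


# The two actions from the question above.
actions_spec = dict(
    x1=int_action_spec(58, 80),      # 23 values: 58, 59, ..., 80
    x2=int_action_spec(0, 1500),     # 1501 values: 0, 1, ..., 1500
)
```

Inside the environment's `execute()`, `to_physical(actions['x1'], 58)` would then recover the value in `[58, 80]`, so the agent itself never needs to know the bounds.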