Alexander Kuhnle
@AlexKuhnle
(But the exception is not great, so will improve this, and maybe add something to the docs on int types.)
MarkTension
@MarkTension
Great, thanks, that solves it :)
MarkTension
@MarkTension

Hi! I'm running into this error

  File "/Users/schmark/anaconda/envs/tensorforce/lib/python3.6/site-packages/tensorforce/core/layers/dense.py", line 87, in initialize
    is_trainable=self.vars_trainable, is_saved=True
  File "/Users/schmark/anaconda/envs/tensorforce/lib/python3.6/site-packages/tensorforce/core/module.py", line 511, in variable
    name='variable', argument='spec', value=spec, hint='underspecified'
tensorforce.exception.TensorforceError: Invalid value for variable argument spec: TensorSpec(type=float, shape=(0, 32)) underspecified.

Since I added the network argument and my own custom layers:

agent = Agent.create(
    agent='ppo',
    environment=environment,
    network=[
        dict(type='conv2d', window=5, stride=3, size=8, activation='elu'),
        dict(type='flatten'),
        dict(type='dense', size=32),
        dict(type='flatten', name="out"),
    ], #etc (extra flatten is probably not necessary)

What does underspecified mean in this case, and what can be a cause?

Alexander Kuhnle
@AlexKuhnle
Don't see an obvious problem. What's the state space specification here?
MarkTension
@MarkTension
It’s a float array of shape [32, 32]
def states(self):
    return dict(type='float', shape=(self.params.egoSize, self.params.egoSize))
Alexander Kuhnle
@AlexKuhnle
I think it could be because conv2d expects rank-3 inputs, i.e. of shape (x, y, c). Could you try whether (32, 32, 1) works? (In which case it's again not a great exception message.)
MarkTension
@MarkTension
That's indeed the cause. Thanks again!
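
(For reference, a minimal sketch of the rank-3 state spec that resolved this, assuming the same self.params.egoSize attribute as in the snippet above:)

    # Sketch: conv2d expects rank-3 inputs (height, width, channels),
    # so add an explicit channel dimension: (32, 32) -> (32, 32, 1).
    def states(self):
        return dict(
            type='float',
            shape=(self.params.egoSize, self.params.egoSize, 1)
        )
    # The observations returned by the environment then need a matching
    # trailing axis, e.g. np.expand_dims(obs, axis=-1).
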
Drew Robinson
@l0phty

Hi folks. I'm new to Tensorforce and ML in general and am running into an error using Tensorforce that I'm not sure how to debug:

  File "/Users/marco0009/.virtualenvs/puzzle_solver-j2SM-PkM/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError:  indices[0] = [0, 3000] does not index into shape [1,3000,6]
     [[{{node agent/StatefulPartitionedCall/agent/TensorScatterUpdate_1}}]] [Op:__inference_act_1218]

3000 I recognize as my custom Environment's max_step_per_episode and the 6 I suspect is related to my env's actions:

    def actions(self):
        return {
            "make_move": dict(type="int", num_values=6),
        }

but I'm unsure as to what the cause of this exception actually is. Is there anywhere I should be looking in my Environment's configuration for issues?

Drew Robinson
@l0phty
I think I managed to find the problem. Instead of instantiating my custom env class and passing that to my agent directly, calling Environment.create and passing my custom class to it seems to have fixed it.
Alexander Kuhnle
@AlexKuhnle
Generally, that's the recommended way of creating an Env. I assume the problem was that, while your environment "claims" it terminates after max 3000 steps, that didn't actually happen. Is that possible? If you create it via Environment.create(), termination after max steps is guaranteed.
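
(A minimal sketch of that recommended pattern; the class name PuzzleEnv is hypothetical, standing in for the custom environment class:)

    from tensorforce import Environment

    # Pass the custom class (not an instance) to Environment.create, so the
    # wrapper can enforce the episode limit.
    environment = Environment.create(
        environment=PuzzleEnv,         # hypothetical custom Environment subclass
        max_episode_timesteps=3000,    # termination after max steps is then guaranteed
    )
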
MarkTension
@MarkTension

Hi! Quick question on loading a model. I saved it as a numpy file. When loading it as written in the docs:
agent = Agent.load(directory=checkpointFolder, filename="agent_episode600", format="numpy", environment=environment)
I get an error because it needs additional arguments that I'm not sure about.

`TypeError: __init__() missing 3 required positional arguments: 'update', 'objective', and 'reward_estimation'`

Are there any additional instructions that I need? I tried setting the first two to True, but I'm not sure about the reward_estimation argument

Drew Robinson
@l0phty

Semi-related: when going through Environment.create, it doesn't seem that max_episode_timesteps is set for a class where I inherit from Environment. Is this expected?

For example:

class SomeEnv(Environment):
    def __init__(self):
        super().__init__()

    def states(self):
        return dict(
            min_value=self.invalid_penalty * self.max_episode_timesteps(),
            ....
        )
    ....

environment = Environment.create(
    environment=SomeEnv,
    max_episode_timesteps=500,
)

In the above, the call to self.max_episode_timesteps() returns None

Alexander Kuhnle
@AlexKuhnle
@MarkTension This shouldn't be the case, ideally. Can you post the agent configuration you were using? I think something in there may be causing the problem...
Alexander Kuhnle
@AlexKuhnle
@l0phty, interesting, I hadn't considered such a dependency. Environment.create(...) wraps the environment into another wrapper environment, which takes care of the specified max_episode_timesteps, hence it is not available in the underlying environment object. I slightly changed this in the latest commit, could you try it again?
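
(A hedged workaround sketch for the meantime; it assumes that extra keyword arguments to Environment.create are forwarded to the class constructor, and it simplifies the states spec:)

    from tensorforce import Environment

    class SomeEnv(Environment):
        # Take the episode limit as an explicit constructor argument instead of
        # relying on self.max_episode_timesteps(), which lives in the wrapper.
        def __init__(self, episode_length=500):
            super().__init__()
            self.episode_length = episode_length

        def states(self):
            return dict(
                type='float', shape=(1,),
                min_value=-1.0 * self.episode_length  # use the explicit attribute here
            )

    environment = Environment.create(
        environment=SomeEnv, max_episode_timesteps=500,
        episode_length=500  # assumed to be forwarded to SomeEnv.__init__
    )
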
MarkTension
@MarkTension

this is the configuration

agent = Agent.create(
        agent='ppo',
        environment=environment,
        network=[
            dict(type='conv2d', window=5, stride=3, size=8, activation='elu'),
            dict(type='flatten'),
            dict(type='dense', size=16),
            dict(type='flatten', name="out"),
        ],
        batch_size=10,
        learning_rate=1e-3,
        summarizer=dict(
            directory='results/summaries',
            labels=['entropy', 'kl-divergence', 'loss', 'reward', 'update-norm']
        )
    )

and the json

{"agent": "ppo", "states": {"type": "float", "shape": [31, 31, 3]}, "actions": {"move": {"type": "int", "num_values": 4}, "draw": {"type": "int", "num_values": 2}}, "max_episode_timesteps": 500, "batch_size": 10, "network": [{"type": "conv2d", "window": 5, "stride": 3, "size": 8, "activation": "elu"}, {"type": "flatten"}, {"type": "dense", "size": 16}, {"type": "flatten", "name": "out"}], "use_beta_distribution": false, "memory": "minimum", "update_frequency": "batch_size", "learning_rate": 0.001, "subsampling_fraction": 0.33, "optimization_steps": null, "likelihood_ratio_clipping": 0.25, "discount": 0.99, "predict_terminal_values": false, "baseline": null, "baseline_optimizer": null, "state_preprocessing": "linear_normalization", "reward_preprocessing": null, "exploration": 0.0, "variable_noise": 0.0, "l2_regularization": 0.0, "entropy_regularization": 0.0, "parallel_interactions": 1, "config": null, "saver": null, "summarizer": {"directory": "results/summaries", "labels": ["entropy", "kl-divergence", "loss", "reward", "update-norm"]}, "recorder": null, "internals": {}, "initial_internals": {"policy": {}}}
Alexander Kuhnle
@AlexKuhnle
The json is the saved agent.json, I assume? Thanks, can't see anything obvious here, but that should allow me to reproduce the problem. In the meantime, if you pass the agent arguments again to Agent.load(..., network=..., batch_size=...), that should hopefully help.
MarkTension
@MarkTension
the json is indeed the agent.json file. Passing the arguments into load() fixed it. Thanks!
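
(For reference, a sketch combining the load call and the configuration quoted above, with the original agent arguments passed again:)

    agent = Agent.load(
        directory=checkpointFolder, filename="agent_episode600", format="numpy",
        environment=environment,
        # repeat the original Agent.create arguments so the agent spec can be rebuilt
        network=[
            dict(type='conv2d', window=5, stride=3, size=8, activation='elu'),
            dict(type='flatten'),
            dict(type='dense', size=16),
            dict(type='flatten', name="out"),
        ],
        batch_size=10,
        learning_rate=1e-3,
    )
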
Steven Tobias
@stobias123
One of my observation items is a sparse matrix, and it seems I can't pass it into tensorforce with "auto" network... any ideas?

showing this error

tensorforce.exception.TensorforceError: Invalid value for TensorSpec.to_tensor argument value: [[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]] > max_value.

with this shape

'enemy_radar': spaces.Box(low=0, high=100, shape=(100,100)),
Steven Tobias
@stobias123
btw - ^ this was dumb user error. I just needed to actually look at my matrix values... it was above my expected max
Alexander Kuhnle
@AlexKuhnle
@stobias123 good :-)
Pedro Chinen
@chinen93

I'm using PPO and I looked at the code; however, I need confirmation of what I understood about it.

The agent has a memory (of size at least max_timestep * batch_size) which holds all the episode trajectories (state, action, reward).

After update_frequency episodes the agent's weights are updated. By default update_frequency is the same as batch_size. Each update performs multi_steps steps, where each step is clipped by likelihood_ratio_clipping and the trajectories are sampled from memory according to subsampling_fraction. So for batch_size = 16, subsampling_fraction = 0.3 and multi_steps = 10: each update will be 10 steps, each using 16*0.3 = 4.8 episodes from memory. Is that right?

I'm asking this because I want to improve the training speed, and the only places I can think of are learning_rate, batch_size and multi_steps.

Alexander Kuhnle
@AlexKuhnle
Hi @chinen93, yes, this is correct. Each actual multi-step update will consist of timesteps equivalent to approximately 4.8 episodes, randomly sampled from the 16 episodes which were retrieved from memory for this update.
Alexander Kuhnle
@AlexKuhnle
Regarding training speed: an obvious place to start is to reduce the update_frequency. While this technically means that data will be slightly off-policy (e.g. update after every episode, but still use a batch_size of 16), in practice this is often no problem up to a point (also, PPO's objective accounts for importance sampling). Moreover, I think subsampling_fraction=1.0 may do the job in most cases, but it may be interesting to play around with it (you can now also specify fixed-size subsampled batch sizes by using integers (256) instead of floats (0.5), if that's desired). I would recommend starting with update_frequency=1 and subsampling_fraction=1.0, and then playing around with batch_size (starting with ~10) and multi_steps (starting with 5-10). And of course learning_rate is also important.
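
(A hedged starting-point sketch of those suggestions for a PPO agent; the values are just the ballpark figures mentioned above, not tuned settings, and the multi-step argument is left out since its exact name depends on the installed version:)

    agent = Agent.create(
        agent='ppo',
        environment=environment,
        batch_size=10,             # ~10 episodes per update batch, tune from here
        update_frequency=1,        # update after every episode (slightly off-policy)
        subsampling_fraction=1.0,  # or an integer for a fixed subsampled batch size
        learning_rate=1e-3,        # also worth tuning
        # plus the multi-step setting (5-10 steps), under whatever name your
        # installed version's PPO agent uses
    )
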
Pedro Chinen
@chinen93
I have another question about PPO: if update_frequency = 1 and I have 2 parallel environments, when one of these envs terminates its episode (for example, the pole dropped in cartpole), will the other env have its actions based on the newly updated network? Or will it wait until both envs finish and then update?
Alexander Kuhnle
@AlexKuhnle
The former, so it will update mid-episode for the other environment.
There is the sync_episodes option for the Runner, and if it were desired, this behavior could be changed such that, when using sync_episodes, episodes are terminated at the same time. But I would be surprised if this is very impactful.
Pedro Chinen
@chinen93

After each update_frequency, is the memory cleared?

For example, batch_size=8, update_frequency=1, subsampling=1.0. The first episode will be updated without problems, but will the second update use the trajectory from the first in its update? Or is the memory cleared after each update_frequency?

If it is not cleared, it means that after the seventh episode the first episode will still be used, even though it is from a policy far away from the current policy. Is this correct?

Alexander Kuhnle
@AlexKuhnle
Yes, that's why by default frequency=batch_size. frequency>batch doesn't make sense, but frequency<batch means you use data more frequently, at the cost of slightly off-policy data. The idea is of course that such "mild off-policyness" is not really problematic.
(So to be clear: it's not cleared, otherwise frequency and batch size would be equivalent)
nasrashvilg1
@nasrashvilg1
Hi, is there a way to have Tensorforce use FP16 precision rather than the default FP32 to run faster? Along the lines of this: https://www.tensorflow.org/guide/mixed_precision
Alexander Kuhnle
@AlexKuhnle
Hey @nasrashvilg1, there is a way of changing the TF dtype used within Tensorforce, as currently illustrated in the "precision unittest", but as the comment there says, the TF optimizers seem to expect float32 or float64 and not work with float16. Currently, Tensorforce just re-uses the TF versions for the typical first-order optimizers, but obviously one could re-implement them without this constraint. If you only need lower precision for deployment, I would expect that the SavedModel format makes it possible to convert/export to a float16 version of the model.
nasrashvilg1
@nasrashvilg1
@AlexKuhnle thanks for your response on this! Quick question: after the 3rd episode, my agent, which is based on a custom gym environment, starts executing the same actions (the sequence of actions is the same for every episode from the third or fourth episode onwards to the end of training). What could be the issue, if you or anyone else has run into this before?
Pedro Chinen
@chinen93
What is your agent configuration? In my experience, adding entropy regularization and exploration prevented my agent from overfitting into always taking the same actions.
nasrashvilg1
@nasrashvilg1
@chinen93

agent = Agent.create(
    agent='tensorforce',
    environment=environment,  # alternatively: states, actions, (max_episode_timesteps)
    memory=10000,
    update=dict(unit='timesteps', batch_size=64),
    optimizer=dict(type='adam', learning_rate=3e-4),
    policy=dict(network='auto'),
    objective='policy_gradient',
    reward_estimation=dict(horizon=20)
)

Can you please guide me on how to add entropy regularization and exploration? The agent starts consistently taking the same action after a few episodes - I think it might be prematurely converging when more learning still needs to be done.
Pedro Chinen
@chinen93

Take a look at this doc page about the "tensorforce" agent: https://tensorforce.readthedocs.io/en/latest/agents/tensorforce.html

You can add entropy_regularization, l2_regularization, exploration and other hyperparameters to try to improve your agent. However, it depends on your action space.

nasrashvilg1
@nasrashvilg1
@chinen93 my action space is spaces.Discrete(3)
Pedro Chinen
@chinen93
@nasrashvilg1 so try to create your agent as:

Agent.create(
    agent='tensorforce',
    environment=environment,  # alternatively: states, actions, (max_episode_timesteps)
    memory=10000,
    update=dict(unit='timesteps', batch_size=64),
    optimizer=dict(type='adam', learning_rate=3e-4),
    policy=dict(network='auto'),
    objective='policy_gradient',
    reward_estimation=dict(horizon=20),
    exploration=3e-4,
    entropy_regularization=1e-4,
    l2_regularization=1e-4
)
And change the last configurations as needed.
nasrashvilg1
@nasrashvilg1
@chinen93 ok thanks - will try that and run some experiments! :-)
By the last configurations, you mean exploration, entropy_regularization and l2_regularization?
Pedro Chinen
@chinen93
@AlexKuhnle How should I approach making my RL agent learn from an expert before trying things for itself? Making a simple supervised learning environment and just importing the model into the RL loop - is this the right way?
Alexander Kuhnle
@AlexKuhnle
You can either use the pretrain function (if you have data in the right format; see the example), or more manually use the experience and update functions. A supervised learning environment probably won't work as well.
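
(A rough sketch of the recorder-plus-pretrain route, assuming agent.pretrain takes a trace directory and an iteration count as in the repo's pretrain example; check that example for the exact arguments:)

    # 1) Record traces in the agent's recorder format while some "expert"
    #    (e.g. an already-trained agent) interacts with the environment.
    expert = Agent.create(
        agent='ppo', environment=environment, batch_size=10,
        recorder=dict(directory='expert-traces')
    )
    # ... run the expert for a number of episodes so trace files get written ...

    # 2) Pretrain a fresh agent from the recorded traces, then continue with RL.
    agent = Agent.create(agent='ppo', environment=environment, batch_size=10)
    agent.pretrain(directory='expert-traces', num_iterations=100)
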
Benno Geißelmann
@GANdalf2357
Hi, if I have a saved agent (saved as npz and json files), what is the right way to load/use this model for inference/prediction only?
Pedro Chinen
@chinen93
@GANdalf2357, you can look at https://github.com/tensorforce/tensorforce/blob/master/examples/save_load_agent.py for some examples of how to save/load a model. The Runner part can be replaced with an explicit while-loop if you need more control.
Pedro Chinen
@chinen93
@AlexKuhnle I still do not understand how pretraining works. PPO is on-policy, right? How can it learn from a sequence of state-action pairs that is not from the current policy?
Alexander Kuhnle
@AlexKuhnle
Hi @GANdalf2357, in addition to what @chinen93 said: inference-only can be done e.g. by using evaluation=True when using Runner, or act with independent=True and deterministic=True (see here).
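
(A hedged sketch of an inference-only loop along those lines, assuming the agent was saved to a hypothetical 'checkpoints' directory in the numpy format:)

    agent = Agent.load(directory='checkpoints', format='numpy', environment=environment)

    # Act deterministically and independently, so no exploration is applied and
    # no experience is collected or used for updates.
    states = environment.reset()
    terminal = False
    while not terminal:
        actions = agent.act(states=states, independent=True, deterministic=True)
        states, terminal, reward = environment.execute(actions=actions)
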
Alexander Kuhnle
@AlexKuhnle
@chinen93, the combination of experience and update is a bit like supervised learning, i.e. it trains the policy distribution to output the corresponding action per state according to the data. This more or less ignores the theory around policy gradients, on-policy learning etc., and just looks at the problem from a supervised angle; this can work (but can also lead nowhere). The current Tensorforce interface is not ideal, a bit too generic and hence may be used wrongly, but I also don't have too much experience with pretraining, behavioral cloning, etc.
Alexander Kuhnle
@AlexKuhnle
The pretrain function is a bit specific and assumes interaction traces including reward, so basically the recorded data of another agent, as in the pretrain example. Experience and update give more flexibility, but might not be obvious. What data do you have? Individual data points of "expert" state->action decisions? Demonstration traces of state-action pairs? Or full demo traces including reward? Or potentially even more "random" trajectory data, not "expert"/"demonstration"? Depending on which case applies, there are different possibilities.
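
(For the experience/update route, a rough sketch assuming the demonstration data is already split into episodes, each a dict of per-timestep states, actions, terminal flags and rewards; demo_episodes is a made-up variable name:)

    for episode in demo_episodes:
        # feed one recorded episode into the agent's memory ...
        agent.experience(
            states=episode['states'],
            actions=episode['actions'],
            terminal=episode['terminal'],
            reward=episode['reward'],
        )
        # ... and perform an update based on it
        agent.update()
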
Benno Geißelmann
@GANdalf2357
@chinen93 @AlexKuhnle thanks for your help! this is what I was looking for.