1) DQN, like every other agent, updates automatically, so the update(...) function doesn't usually need to be called. You can specify how frequently the update should happen via the update_frequency argument, or implicitly via batch_size (if update_frequency is None, then update_frequency = batch_size). These numbers are timestep-based, so independent of episodes (since DQN is largely agnostic to episodes).
update_frequency always has the same unit as batch_size, both specified as part of update (in the TensorforceAgent). So in the case of PPO it can't be timestep-based. As you've probably read, update_frequency specifies how frequently an update is scheduled: > batch_size doesn't make sense, otherwise some experience would just be ignored; = batch_size is the default; but it makes sense to experiment with scheduling updates more often, i.e. update_frequency < batch_size.
Use memory = dict(type='recent') instead of DQN's replay and custom capacity.
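A minimal sketch, assuming a Gym CartPole environment, of how these arguments typically look when creating agents; exact parameter sets and defaults may differ across Tensorforce versions:

# Sketch only: shows where batch_size, update_frequency and memory go.
from tensorforce import Agent, Environment

environment = Environment.create(environment='gym', level='CartPole-v1')

# DQN: batch_size and update_frequency are both timestep-based;
# update_frequency defaults to batch_size if left unspecified.
dqn_agent = Agent.create(
    agent='dqn',
    environment=environment,
    memory=10000,        # replay memory capacity
    batch_size=32,       # timesteps per update batch
    update_frequency=4,  # schedule an update every 4 timesteps
)

# PPO: batch_size and update_frequency are episode-based, and experience
# is kept in a 'recent'-style memory rather than DQN's replay memory.
ppo_agent = Agent.create(
    agent='ppo',
    environment=environment,
    batch_size=10,       # episodes per update batch
    # update_frequency defaults to batch_size; smaller values schedule
    # updates more often
)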
Each action is specified as dict(type=..., shape=...); in general, you can specify a nested action dict like dict(action1=dict(type=..., shape=...), action2=dict(type=..., shape=...), ...). Your environment (if you implement the Environment class) can just return this for actions(), and/or your agent can receive this as the actions argument.
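For illustration, a sketch of a custom Environment that returns such a nested actions() specification (the class and action names here are made up):

# Illustrative sketch; MultiActionEnv and its action names are placeholders.
from tensorforce import Environment

class MultiActionEnv(Environment):

    def states(self):
        return dict(type='float', shape=(8,))

    def actions(self):
        # Nested spec: one discrete and one continuous action
        return dict(
            move=dict(type='int', shape=(), num_values=4),
            throttle=dict(type='float', shape=(), min_value=0.0, max_value=1.0),
        )

    def reset(self):
        return [0.0] * 8

    def execute(self, actions):
        # actions arrives as a dict: actions['move'], actions['throttle']
        next_state = [0.0] * 8
        terminal = False
        reward = 0.0
        return next_state, terminal, reward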
Hi @qZhang88, hope the following explanation clarifies your question: PPO, like many other standard policy gradient algorithms, uses complete rollouts (episodes) for reward estimation. In Tensorforce this means that batch_size defines the number of episodes (each consisting of many timesteps) per update batch. Moreover, the way the PPO update works according to the paper is that it actually performs multiple updates based on randomly subsampled timestep-minibatches (the entire batch of n episodes is quite big). So the subsampling_fraction specifies what fraction of the full batch is subsampled for each minibatch, and optimization_steps specifies how many of these mini-updates are performed per batch.
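A sketch of a PPO configuration using these parameters, with the names as they appear in this discussion (later Tensorforce versions may rename them); the CartPole environment is only a placeholder:

from tensorforce import Agent, Environment

environment = Environment.create(environment='gym', level='CartPole-v1')

agent = Agent.create(
    agent='ppo',
    environment=environment,
    batch_size=10,              # 10 complete episodes per update batch
    subsampling_fraction=0.2,   # each minibatch subsamples 20% of the batch
    optimization_steps=10,      # number of minibatch updates per batch
)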
I still have some questions here. Let's say batch_size is 10, max timesteps is 1000, and subsampling_fraction is 0.2; so each update batch is still 10 episodes and the timesteps would be fewer than 200, right? And could optimization_steps be increased to take full advantage of the whole episodes?
I wonder something: while constructing a custom environment, can we return different action values? For example, [1,2,3] in one state and [2,3,4] in another, or do I have to handle the available actions in execute()?
Thanks in advance for the answer.
Hi, I have been tinkering with the DQN agent on the BreakoutDeterministic-v4 environment, but I am running into the problem of the agent receiving low rewards and plateauing at an episode reward of around 2-6 after running 10k-40k episodes.
The network config I am currently using is:
keras_net_conf = [
{
"type": "keras",
"layer": "Conv2D",
"filters": 32,
"kernel_size": 8,
"strides": 4,
"activation": "relu",
"padding": "valid",
"kernel_initializer": 'VarianceScaling',
"use_bias": False,
},
{
"type": "keras",
"layer": "Conv2D",
"filters": 64,
"kernel_size": 4,
"strides": 2,
"activation": "relu",
"padding": "valid",
"kernel_initializer": 'VarianceScaling',
"use_bias": False,
},
{
"type": "keras",
"layer": "Conv2D",
"filters": 64,
"kernel_size": 3,
"strides": 1,
"activation": "relu",
"padding": "valid",
"kernel_initializer": 'VarianceScaling',
"use_bias": False,
},
{
"type": "flatten",
},
{
"type": "keras",
"layer": "Dense",
"units": 512,
"activation": "relu",
"use_bias": False,
"kernel_initializer": 'VarianceScaling',
}
]
With preprocessing and exploration set up as:
preproc = [
{
"type": "image",
"width": 50,
"height": 50,
"grayscale": True
},
{
"type": "sequence",
"length": 4,
"concatenate": True
}
]
st_exp = dict(type='decaying', unit='timesteps', decay='polynomial', decay_steps=1000000, initial_value=1.0,
final_value=EXPLORATION, power=1.0)
The actual agent creation is defined as:
agent = Agent.create(agent='dqn',
environment=env,
states=env.states(),
batch_size=32,
preprocessing=dict(
state=preproc,
reward=dict(type="clipping", upper=1.0)
),
learning_rate=LR,
memory=100000,
start_updating=50000,
discount=DISC,
exploration=st_exp,
network=keras_net_conf,
update_frequency=4,
target_sync_frequency=10000,
summarizer=summarizer,
huber_loss=1.0,
name='DQN_agent')
The learning rate is set to 1e-5 and the discount factor to 0.99. The other parameters, such as memory size, max_ep_steps, start_update, etc., are all set from other implementations that do not use Tensorforce but have managed to achieve scores comparable to the original paper.
So I am wondering whether somebody has come across this issue and, if so, managed to get it to learn properly and reach higher rewards.
Regards.
Hi, I'm trying to understand how to use Tensorforce, but I think I am missing something. For example, when I try to run
runner = Runner(
agent="ppo",
environment="CartPole-v1",
num_parallel=2
)
runner.run(num_episodes=300)
it works fine, but if I try
runner = Runner(
agent="a2c",
environment="CartPole-v1",
num_parallel=2
)
runner.run(num_episodes=300)
it raises
tensorforce.exception.TensorforceError: Invalid value for agent argument update given parallel_interactions > 1: {'unit': 'timesteps', 'batch_size': 10}.
What am I missing here?
An explanation of what states() and actions() give would help. Are they just dicts that describe the type of the return value? What would it look like if the states or actions are not discrete, but continuous? A little more explanation in the comments of the example class would be helpful. The documentation for these methods has information on what I asked above, but again, a little language around the intent of these fields and examples exercising these options would be helpful.
Also, I don't see the attributes set in the __init__() definition of my Environment when I run dir() on my created object.
level is for standard OpenAI Gym envs. What I meant was that I am following the same structure as OpenAI Gym, but made my custom env with the same abstract class (which you can check here). So my doubt is: how do I make it work with Tensorforce? My env is mmx, and I tried environment = OpenAIGym(level=mmx).
Does your environment inherit from the gym.Env base class? Your environment class could be class MMX(gym.Env): ..., and if you then pass it to the Tensorforce Gym interface, it should be compatible: env = Environment.create(environment='gym', level=MMX, ...). Or have you tried this before? The level argument should certainly accept custom gym.Env subclasses, and in fact also instances.
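For illustration, a minimal sketch of that approach (the MMX internals are placeholders):

import gym
from gym import spaces
from tensorforce import Environment

class MMX(gym.Env):
    # Placeholder implementation of a custom gym.Env subclass

    def __init__(self):
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(4,))
        self.action_space = spaces.Discrete(3)

    def reset(self):
        return self.observation_space.sample()

    def step(self, action):
        observation = self.observation_space.sample()
        reward = 0.0
        done = False
        return observation, reward, done, {}

# Pass the class (or an instance) as level to the Tensorforce Gym interface
environment = Environment.create(environment='gym', level=MMX, max_episode_timesteps=100)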
You need to call environment.reset() before starting to execute. Apart from that, you shouldn't need to add attributes or the like when using Environment.create(...) (which, I'd say, is the preferred way of initializing an env). I will also add attribute forwarding for the wrapper; however, it will be read-only, which I think should be enough (environment logic should go into the env implementation itself).
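A minimal sketch of the interaction loop this implies, assuming a Gym CartPole environment and a PPO agent:

from tensorforce import Agent, Environment

environment = Environment.create(environment='gym', level='CartPole-v1')
agent = Agent.create(agent='ppo', environment=environment, batch_size=10)

for _ in range(10):
    states = environment.reset()  # reset before starting to execute
    terminal = False
    while not terminal:
        actions = agent.act(states=states)
        states, terminal, reward = environment.execute(actions=actions)
        agent.observe(terminal=terminal, reward=reward)

agent.close()
environment.close()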
See examples/ :-)