JanScheuermann
@JanScheuermann
Does anyone have an idea what the problem is? Any wrong assumptions that I made?
Alexander Kuhnle
@AlexKuhnle
Hi, a few comments:
Alexander Kuhnle
@AlexKuhnle
  • The environment looks good; however, I would recommend scaling the action as part of the environment and using the interval [-1, 1] as the action space. Just to be sure: I can't say off the top of my head whether learning in the Tensorforce implementation is completely robust to scaling (it may be).
  • I would definitely reduce the batch size etc. (16, 32, or so), at least while you're testing simplified settings, and remove things like exploration. Also, your learning rate is very small; better to try something like 1e-3 to 1e-4 (see the sketch after this list).
  • Fixing the random seed can be dangerous, maybe you're just looking at a particularly unlucky seed.
  • Maybe start off with a categorical action to see whether it works better. Also consider trying the PPO agent, as it generally works quite well.
  • If nothing helps and you don't get anywhere, maybe it's worth temporarily moving to a Gym environment to get something working. It's hard to say with deep RL, for instance, whether it's actually easy to learn a completely fixed environment, as you describe it. With a standard environment, at least you don't have to worry about that part.
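To make these suggestions concrete, here is a minimal sketch (the CartPole environment and the exact hyperparameter values are illustrative placeholders, not recommendations from this thread):

    from tensorforce import Agent, Environment

    # Illustrative sketch: small batch size, larger learning rate, no exploration,
    # and the PPO agent, roughly along the lines suggested above.
    environment = Environment.create(environment='gym', level='CartPole-v1')
    agent = Agent.create(
        agent='ppo',
        environment=environment,
        batch_size=16,
        learning_rate=1e-3,
        exploration=0.0,
    )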
JanScheuermann
@JanScheuermann
Hi Alexander, thank you very much for your comments. I'll try to incorporate them today and let you know what worked and what didn't. Best regards, Jan
JanScheuermann
@JanScheuermann
@AlexKuhnle when you talk about scaling the action space, do you really mean the action space or the state I put into agent.act()? If you really mean the action, can I simply scale it by setting min_value=-1 and max_value=1 in the action definition in the environment? Best regards, Jan
JanScheuermann
@JanScheuermann
And should the reward also be scaled or is it maximized regardless of scale?
Alexander Kuhnle
@AlexKuhnle
Yes, I mean the action, but normalized on the environment side. So: if your actions are within [0, 1000], you can instead use e.g. [-1, 1] as the action space and transform the agent output a via (a + 1) * 500 at the beginning of execute(). Generally not a bad idea if the scale of values is large (also in the state), just to be sure, since NNs work better around zero.
While this is just about keeping values in a domain where the NN works effectively, the reward is a different issue -- "shaping" the reward will actively change the optimization problem, for better or worse, and it depends a lot on what algorithm you're using etc.
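A minimal sketch of that rescaling inside a custom environment's execute(), assuming the native action range is [0, 1000] and a hypothetical _step_simulation() helper:

    def execute(self, actions):
        # The agent acts in [-1, 1]; map back to the environment's native range [0, 1000].
        raw_action = (actions + 1.0) * 500.0
        next_state, terminal, reward = self._step_simulation(raw_action)  # hypothetical helper
        return next_state, terminal, reward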
Alexander Kuhnle
@AlexKuhnle
There are features like reward normalization in TForce, but to the degree that you know how your environment's reward works and can adapt it yourself, there's no need to rely on this env-agnostic feature. I think it also makes sense to keep the rewards around 0, and in particular not necessarily positive-only. It's also worth thinking about what the "cumulative return" (in combination with the discount) will look like for the agent, particularly if you're using a policy gradient algorithm.
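For reference, the discounted return that the agent effectively optimizes at step t is

    G_t = \sum_{k=0}^{T-t} \gamma^k \, r_{t+k}

so the discount factor and the typical reward magnitude together determine its scale.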
CristianMonea
@CristianMonea
Tensorforce allows saving an agent by simply calling agent.save(). Is it possible to save a Runner in a similar way? I am interested in storing information such as timesteps and episode_rewards after each run for further processing/plotting.
Marius
@MariusHolm

Hi @AlexKuhnle ,
I've used the PPO agent of Tensorforce for my master's project.
Implementation-wise things have worked out nicely, and I'm now writing up my thesis.
In that regard I want to explain some of my code, and in doing so I want to explain the PPO agent.

In the agent definition two neural networks are defined:

  1. "network"
  2. "critic_network" (this has confused me a bit, as I start thinking of actor-critic which to my understanding PPO is separate of..)

What is the difference between these networks?

E.g. is one used to approximate the Advantage function and the other to approximate the Policy?

Alexander Kuhnle
@AlexKuhnle
@CristianMonea The run.py script offers options to save some information, otherwise it should be straightforward to add it using the callbacks. But right now this functionality is not provided as part of the Runner class. Good idea, though.
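Until then, a sketch of persisting that information manually after a run; the episode_rewards and episode_timesteps attribute names are assumptions based on the 0.6-era Runner and may differ between versions:

    import json

    from tensorforce.execution import Runner

    # Assumes an existing agent and environment (e.g. from Agent.create / Environment.create).
    runner = Runner(agent=agent, environment=environment)
    runner.run(num_episodes=500)

    # Persist per-episode statistics for later processing/plotting.
    with open('run_stats.json', 'w') as f:
        json.dump(dict(
            episode_rewards=runner.episode_rewards,
            episode_timesteps=runner.episode_timesteps,
        ), f)
    runner.close()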
@MariusHolm, not sure whether the "terminology" is 100% correct here, but the "critic_network" corresponds to a trained value function, which acts like the "baseline" in the context of policy gradient algorithms, or the "critic" in the context of actor-critic: it is trained to approximate the state value V(s) as a discounted cumulative sum, and the main policy uses the advantage, i.e. empirical return - critic(s), instead of just the empirical return in the PG "loss".
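In symbols, with V(s_t) the value estimated by the critic_network and G_t the empirical discounted return, the policy update uses the advantage

    A(s_t, a_t) \approx G_t - V(s_t)

rather than G_t alone.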
Alexander Kuhnle
@AlexKuhnle
Does that help? I'm not sure whether "critic" has a more specific meaning or is just what "baseline" is typically called in the actor-critic context.
CristianMonea
@CristianMonea
@AlexKuhnle OK. Thank you! Another suggestion is to add the possibility to automatically stop training once a reward threshold is exceeded (e.g., a mean reward of 200 over the last 100 episodes).
Marius
@MariusHolm
@AlexKuhnle Thank you. That definitely helps :)
杨子信
@yzx20160815_twitter
Why does updated = self.model_observe(parallel=parallel, **kwargs) always return False?
@AlexKuhnle
Alexander Kuhnle
@AlexKuhnle
Ideally, it shouldn't always return False, but True whenever an update was performed (which may happen very infrequently). Could you check whether it really never returns True? What is your agent config?
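A minimal sketch of an act/observe loop that counts how often observe() reports an update (assumes an existing agent and environment):

    # Count how often observe() actually triggers an update within one episode.
    states = environment.reset()
    terminal = False
    num_updates = 0
    while not terminal:
        actions = agent.act(states=states)
        states, terminal, reward = environment.execute(actions=actions)
        num_updates += int(agent.observe(terminal=terminal, reward=reward))
    print('updates this episode:', num_updates)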
杨子信
@yzx20160815_twitter
image.png
@AlexKuhnle
杨子信
@yzx20160815_twitter
Sometimes it returns True.
Qiao.Zhang
@qZhang88
@AlexKuhnle Does Tensorforce support distributed training? That is, several sampling machines and one parameter server: the PS does the gradient updates, and the other machines pull the most recent parameters and only run in exploration/sampling mode?
Alexander Kuhnle
@AlexKuhnle
@yzx20160815_twitter Does that mean the problem is solved and it does sometimes return True?
Alexander Kuhnle
@AlexKuhnle
@qZhang88 Yes and no. The parallelization mode which Tensorforce currently supports is based on one agent with multiple parallel input streams which interact with "remote" environments (via Python's multiprocessing or socket), instead of multiple remote worker agents and one central update agent.
Alexander Kuhnle
@AlexKuhnle
So the result is kind of the same, but the communication content is somewhat different. We've been using this approach in the context of computationally expensive simulations and it worked very well (see here, particularly diagrams on page 6).
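A sketch of that parallelization mode, assuming the 0.6-era Runner arguments num_parallel and remote:

    from tensorforce.execution import Runner

    # One agent, several environment copies running in separate processes.
    runner = Runner(
        agent='agent.json',              # hypothetical agent config file
        environment='environment.json',  # hypothetical environment config file
        num_parallel=4,
        remote='multiprocessing',
    )
    runner.run(num_episodes=1000)
    runner.close()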
杨子信
@yzx20160815_twitter
@AlexKuhnle yes, thanks
Qiao.Zhang
@qZhang88

@qZhang88 Yes and no. The parallelization mode which Tensorforce currently supports is based on one agent with multiple parallel input streams which interact with "remote" environments (via Python's multiprocessing or socket), instead of multiple remote worker agents and one central update agent.

What is the sync mechanism of Tensorforce for multiprocessing or socket?

Alexander Kuhnle
@AlexKuhnle
There is the option to batch calls to the agent, or to do them unbatched as they're requested. Depending on the speed of the env, one or the other may be better. Is that what you mean?
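For the batched variant, continuing the previous sketch and assuming the 0.6-era Runner.run argument batch_agent_calls:

    # Batch the act/observe calls across the parallel environments each step,
    # instead of serving each environment's request individually.
    runner.run(num_episodes=1000, batch_agent_calls=True)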
Steven Tobias
@stobias123
Anyone around who can help a newbie w/ using a custom OpenAI Gym environment?
I think my actions/state aren't being passed properly from the OpenAI Gym environment to the Tensorforce environment, and I'm not sure why.
Getting this error when running training:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot update variable with shape [2] using a Tensor with shape [4], shapes must be equal.
Steven Tobias
@stobias123
Found the problem: I was trying to load old saved checkpoints/TensorBoard data; just had to clear that.
Matt Pettis
@mpettis

Hey @mpettis, I added some documentation for multi-input networks here (and multi-state/action specification here). All very minimal, but a start... :-)

Hi @AlexKuhnle, looking closer at the multi-input documentation you updated here for my benefit... you state in the documentation that you use the special layers Register and Retrieve, but in the example it looks like you are only referencing Retrieve... is that correct? Am I missing something?

Matt Pettis
@mpettis
@AlexKuhnle Also, where can I find documentation on the 'states' and 'policy' arguments in that example? I'm looking at the main definition, and as best I can tell, they may be passed on in **kwargs to the Agent.create() method? Looking here: https://tensorforce.readthedocs.io/en/latest/agents/agent.html#tensorforce.agents.TensorforceAgent.create
Matt Pettis
@mpettis
For anyone... if I create my environment with Environment.create(), I get an EnvironmentWrapper object. In my execute() method of the Environment class I make, what is the best way to access the timestep attribute of the wrapper? I'm a bit rusty on my python, and I think I can tell that the wrapper holds a reference to my environment object as an attribute of the wrapper instance, and the timestep is another attribute of the wrapper instance. And I'm struggling with how to access that timestep value from my Environment class definition. I'm currently cheating and keeping track of my own timestep, but that's not optimal.
Matt Pettis
@mpettis
To be honest -- I don't think you can. Since an EnvironmentWrapper "has a" environment (in particular, the one I create) as a member, the environment can't inspect its peer members within the wrapper, which it would need to with the current architecture.
Alexander Kuhnle
@AlexKuhnle
Yes, you can't access the wrapper attributes from the internal environment. Not sure whether there is a better way. However, I wouldn't say it's cheating to keep track of it yourself: some environments need to do this, others don't, and the ones which do should explicitly do it in their implementation. The wrapper keeps track of it for other reasons (to obey "max_episode_timesteps" if set).
So I would say, what you're doing is what is recommended. It's a different question if you want to access the attribute externally, which is currently not well supported. If that would be good to support, I'm happy to think about how to make it happen, probably just something like environment.get_additional_info() or so.
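A minimal sketch of an environment that tracks its own timestep counter, as discussed; the states/actions specs and the episode length are placeholders:

    from tensorforce.environments import Environment

    class MyEnvironment(Environment):

        def states(self):
            return dict(type='float', shape=(8,))

        def actions(self):
            return dict(type='float', shape=(2,), min_value=-1.0, max_value=1.0)

        def reset(self):
            # Own counter, independent of the EnvironmentWrapper's internal one.
            self.timestep = 0
            return [0.0] * 8

        def execute(self, actions):
            self.timestep += 1
            next_state = [0.0] * 8
            terminal = self.timestep >= 100  # placeholder episode length
            reward = 0.0
            return next_state, terminal, reward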
yanghaoxie
@yanghaoxie
Hi, everyone.
yanghaoxie
@yanghaoxie

I have a problem setting network_spec.
I set network_spec as network_spec = [{"type": "dense", "size": 100, "activation": "relu"}].
However, I get the error message TensorforceError: Invalid value for Module.add_variable argument shape: 0,100.
After some experiments, I found out that if the states spec is a nested dict this error occurs; otherwise, it doesn't.
For example, the following states definition will cause this error,

    def states(self):
        states = {}
        states['foo'] = dict(type='int', shape=(3, ), num_values=6)
        states['bar'] = dict(type='int', shape=(3, ), num_values=6)
        return states

and the following definition will not cause the error,

    def states(self):
        return dict(type='float', shape=(8,))

Could you please help me?

Alexander Kuhnle
@AlexKuhnle
Hi @yanghaoxie, the exception message is not very informative; I'll need to check whether it can be improved (or whether it's just an artifact of the point at which the inconsistency in the specification causes actual problems). But here are two points you should look into:
The first state consists of two components, so a simple sequential network will not work (Tensorforce doesn't implicitly concatenate inputs or something like that). What you can do in this case is to use the "extended" multi-input network specification feature, which plugs together sequential "components" and retrieves state components via special "register" and "retrieve" layers.
Moreover, the first state consists of integers, which cannot be processed by a dense layer (again, Tensorforce doesn't do anything implicitly to take care of it). The simple way to address this problem is to use an embedding layer first (see here), to map each of the finite values to a corresponding embedding (equivalent to encoding as one-hot vectors and then applying a dense layer).
Hope that helps!
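Putting both points together, a sketch of a multi-input network spec for the two int state components; the layer sizes and the registered tensor names are arbitrary choices for illustration:

    network_spec = [
        [
            dict(type='retrieve', tensors=['foo']),
            dict(type='embedding', size=32),
            dict(type='register', tensor='foo-embedding'),
        ],
        [
            dict(type='retrieve', tensors=['bar']),
            dict(type='embedding', size=32),
            dict(type='register', tensor='bar-embedding'),
        ],
        [
            # Concatenate the two embedded components, then apply the dense layer.
            dict(type='retrieve', tensors=['foo-embedding', 'bar-embedding'], aggregation='concat'),
            dict(type='dense', size=100, activation='relu'),
        ],
    ]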
yanghaoxie
@yanghaoxie
@AlexKuhnle Thank you so much for your help :). I will investigate what you told me.
danthedolphin
@danthedolphin
Hi all, I'm trying to replicate the DQN paper by Mnih et al. (2015) on Atari games and am trying to extract the Q-values for each action, but I'm not sure how to get them. I've already got the agent and environment training and everything, but this is the last step I need. Is there a way to somehow get the Q-values for each action every time I call agent.act(states=states)?
Alexander Kuhnle
@AlexKuhnle
Hi, you can retrieve additional tensors via the query argument -- 'action-distribution-values' (or alternatively your action name as first part) should work. Have a look here for an example in the unittests.
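A sketch of such a call, assuming the query-enabled act() signature of that Tensorforce version (as noted above, the tensor name may need your action name as first part):

    # Returns the chosen actions plus the queried tensor values (the Q-values for DQN).
    actions, queried = agent.act(states=states, query='action-distribution-values')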
danthedolphin
@danthedolphin
@AlexKuhnle Thanks for the update! This module has helped me a lot
Alexander Kuhnle
@AlexKuhnle
No problem :-)