Matt Pettis
@mpettis
last question for now... there's no facility for mixed dtypes for state, right? In my example, I could see that I have two dimensions -- one is the current temperature, a real number, and the second is a count, which is an integer. I'm casting the integer to a real for now, and it should work, but I was curious if I was missing something.
Alexander Kuhnle
@AlexKuhnle
There is, actually. Mixed states can be specified via a "dict of dicts", so e.g. states=dict(state1=dict(type='int', shape=()), state2=dict(type='float', shape=())). However, in that case you can't use a network as a simple stack of layers. Two options: the auto network can take care of it, it will just internally create a reasonable simple network (with some modification options), or you specify a "multi-input" network yourself. You can specify networks as a "list of lists", where each of the inner lists is a layer stack, and the special register and retrieve layers are used to combine these stacks into a full network. I realise there is no good example currently. Will need to add one to the docs.
Something like [[dict(type='retrieve', tensors='state1'), ..., dict(type='register', tensor='state1-embedding')], [same for state2], [dict(type='retrieve', tensors=['state1-embedding', 'state2-embedding'], aggregation='concat'), ...]]
Hope that illustrates how the stacks are stitched together via register and retrieve
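To make that concrete, here is a rough, self-contained sketch of such a specification (all layer sizes, tensor names and num_values below are made up for illustration; only the "dict of dicts" state format, the "list of lists" network format and the register/retrieve layers are from the explanation above):

# Hypothetical multi-input setup: names, sizes and num_values are illustrative only.
states = dict(
    state1=dict(type='int', shape=(), num_values=10),
    state2=dict(type='float', shape=(3,))
)

network = [
    [   # stack 1: processes state1 only
        dict(type='retrieve', tensors=['state1']),
        dict(type='embedding', size=16),
        dict(type='register', tensor='state1-embedding')
    ],
    [   # stack 2: processes state2 only
        dict(type='retrieve', tensors=['state2']),
        dict(type='dense', size=16),
        dict(type='register', tensor='state2-embedding')
    ],
    [   # stack 3: combines both embeddings into the final representation
        dict(type='retrieve', tensors=['state1-embedding', 'state2-embedding'], aggregation='concat'),
        dict(type='dense', size=32)
    ]
]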
Alexander Kuhnle
@AlexKuhnle
Note that float is fine for generic numbers, however, if your int really represents a fixed finite set of choices, then turning it into a float is not a good idea, I'd say. A better way is to use an embedding layer to map each of the finite choices to a trainable embedding vector, similar to how words are treated in natural language processing.
(That's how auto treats int inputs, given that num_values specifies the number of embeddings required)
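For instance, a minimal sketch with made-up names and values: the int state just carries num_values in its spec, and the default 'auto' network routes it through an embedding internally:

from tensorforce import Agent

# Hypothetical specs: a real-valued temperature plus a count over 5 possible values.
states = dict(
    temperature=dict(type='float', shape=()),       # generic real number
    count=dict(type='int', shape=(), num_values=5)  # fixed finite set of choices
)
actions = dict(type='float', shape=(), min_value=-1.0, max_value=1.0)

agent = Agent.create(
    agent='ppo', states=states, actions=actions, max_episode_timesteps=100,
    network='auto',  # embeds the int input based on num_values
    batch_size=10
)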
I should spend some time over the weekend to add more info to the docs...
:-)
Matt Pettis
@mpettis
Excellent, thanks!
Alexander Kuhnle
@AlexKuhnle
(Also, I can only repeat: involvement and contributions on documentation/examples/tutorial side are really welcome :-D There are lots of opportunities for improvement.)
Matt Pettis
@mpettis
Absolutely, and can do. Mine will be small and incremental, but will give what I can. Still working on just making sure types and shapes of arguments are not breaking things, but I know that it is useful to have examples that show those details too...
Alexander Kuhnle
@AlexKuhnle
Regarding multiple inputs, if you look into the AutoNetwork code, it's basically only the constructor that builds what you could otherwise specify yourself as a list of lists of layers (as mentioned above). So it may be somewhat informative...
Alexander Kuhnle
@AlexKuhnle
Hey @mpettis, I added some documentation for multi-input networks here (and multi-state/action specification here). All very minimal, but a start... :-)
Matt Pettis
@mpettis
@AlexKuhnle Thanks for the documentation, I'll read!
JanScheuermann
@JanScheuermann
Hello everyone, does anyone know whether it is possible to train an agent with a continuous action space (for a regression problem) in Tensorforce, and does anyone have an example where this is done? Best regards and thanks in advance, Jan
JanScheuermann
@JanScheuermann
My problem is the following: I have a sales dataset of a bakery chain and want to predict the optimal order quantity for the next day with reinforcement learning. So in my understanding I have a continuous state space of 45 features that are either float or bool and one continuous action (the order quantity) as an output.
Consequently I created a custom environment (tailored to my bakery case) with states of shape (45,) and the action as a scalar, both of type float:
import numpy as np
from tensorforce import Environment

num_states = 45  # number of features in the sales dataset

class CustomEnvironment(Environment):  # for bakery data
    def __init__(self):
        super().__init__()

    def states(self):
        return dict(type='float', shape=(num_states,))

    def actions(self):
        return dict(type='float', min_value=0, max_value=1000)  # , shape=(1,))

    def max_episode_timesteps(self):
        return super().max_episode_timesteps()

    def close(self):
        super().close()

    def reset(self):
        state = np.random.random(size=(num_states,))  # to be adjusted to random record of dataset
        return state
I then created an actor-critic agent:
from tensorforce import Agent
from tensorforce.agents import ActorCritic

agent = Agent.create(
    agent=ActorCritic,  # DeepQNetwork,
    environment=environment,
    batch_size=1000,
    memory=10000,
    update_frequency=1000,
    learning_rate=0.00001,
    exploration=dict(
        type='decaying', dtype='float', unit='updates', decay='polynomial',
        initial_value=5.0, decay_steps=200, final_value=0.1, power=1.0
    ),
    seed=42
)
JanScheuermann
@JanScheuermann
because, according to some articles, e.g. this post https://www.reddit.com/r/MachineLearning/comments/bfny3m/d_deep_q_learning_for_continuous_action_space/, the actor-critic agent seems to be the go-to agent
JanScheuermann
@JanScheuermann
So in this case, I assumed that the actor-critic agent would give me the order quantity, i.e. the value that leads to the highest reward according to my reward function. However, it just seems to pick random values in the defined range, even if I train and test it on only a single record (always the same state and always the same 'target value' for the order quantity in the reward function).
Does anyone have an idea what the problem is? Any wrong assumptions that I made?
Alexander Kuhnle
@AlexKuhnle
Hi, a few comments:
Alexander Kuhnle
@AlexKuhnle
  • The environment looks good; however, I would recommend scaling the action as part of the environment and using the interval [-1, 1] as the action space. Just to be sure, I can't say off the top of my head whether learning in the Tensorforce implementation will be completely robust to scaling (it may be).
  • I would definitely reduce the batch size etc. (16, 32, or so), at least while you're testing simplified settings, and remove things like exploration. Also, your learning rate is very small; maybe better to try 1e-3 to 1e-4 (a rough sketch of such a simplified config follows after this list).
  • Fixing the random seed can be dangerous, maybe you're just looking at a particularly unlucky seed.
  • Maybe start off with a categorical action to see whether it works better. Also consider trying the PPO agent, as it generally works quite well.
  • If nothing helps and you don't get anywhere, maybe it's worth temporarily moving to a Gym environment to get something working. It's hard to say with deep RL, for instance, whether it's actually easy to learn a completely fixed environment, as you describe it. With a standard environment, at least you don't have to worry about that part.
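As a rough sketch of such a simplified configuration (hyperparameter values are illustrative only, not recommendations; environment is assumed to be the custom environment created above):

from tensorforce import Agent

# Simplified test setup: small batch, larger learning rate, no explicit
# exploration, no fixed seed (all values illustrative).
agent = Agent.create(
    agent='ppo',              # or 'ac' to stay with the actor-critic agent
    environment=environment,  # the custom environment from above
    batch_size=16,
    learning_rate=1e-3
)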
JanScheuermann
@JanScheuermann
Hi Alexander, thank you very much for your comments. I'll try to incorporate them today and let you know what worked and what didn't. Best regards, Jan
JanScheuermann
@JanScheuermann
@AlexKuhnle when you talk about scaling the action space, do you really mean the action space or the state I put into agent.act()? If you really mean the action, can I simply scale it by setting min_value=-1 and max_value=1 in the action definition in the environment? Best regards, Jan
JanScheuermann
@JanScheuermann
And should the reward also be scaled or is it maximized regardless of scale?
Alexander Kuhnle
@AlexKuhnle
Yes, I mean the action, but normalizing on the environment side, so: if your actions are within [0, 1000], you can use e.g. [-1, 1] as the action space and transform the agent output a back via (a + 1) * 500 at the beginning of execute(). Not a bad idea generally if the scale of values is large (also in the state), just to be sure, since NNs work better around zero.
While this is just about keeping values in a domain where the NN works effectively, the reward is a different issue -- "shaping" the reward will actively change the optimization problem, for better or worse, and it depends a lot on what algorithm you're using etc.
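A minimal sketch of that transformation inside the custom environment above (the _sample_demand helper and the reward line are made-up placeholders; the point is only the (a + 1) * 500 rescaling, assuming the action spec was changed to min_value=-1, max_value=1):

# Continuing the CustomEnvironment sketched earlier:

    def execute(self, actions):
        # Map the agent output in [-1, 1] back to an order quantity in [0, 1000].
        order_quantity = (actions + 1.0) * 500.0
        demand = self._sample_demand()          # hypothetical helper, not in the original code
        reward = -abs(order_quantity - demand)  # placeholder reward, roughly centred around zero
        next_state = np.random.random(size=(num_states,))
        terminal = False
        return next_state, terminal, reward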
Alexander Kuhnle
@AlexKuhnle
There are features like reward normalization in TForce, but to the degree that you know how the reward of your environment is working and you can adapt it, there's no need to rely on this env-agnostic feature. I think it makes sense to also keep the rewards around 0, in particular not necessarily positive-only. Also worth thinking about what the "cumulative return" (in combination with discount) will look like for the agent, in particular if, for instance, you're using a policy gradient algorithm.
CristianMonea
@CristianMonea
Tensorforce allows saving an agent by simply calling agent.save(). Is it possible to save a Runner in a similar way? I am interested in storing information such as timesteps and episode_rewards after each run for further processing/plotting.
Marius
@MariusHolm

Hi @AlexKuhnle ,
I've used the PPO agent of tensorforce for my masters project.
Implementation-wise things have worked out nicely, and I'm now writing up my thesis.
In that regard I want to explain some of my code, and in doing so I want to explain the PPO agent.

In the agent definition two neural networks are defined:

  1. "network"
  2. "critic_network" (this has confused me a bit, as I start thinking of actor-critic which to my understanding PPO is separate of..)

What is the difference between these networks?

E.g. is one used to approximate the Advantage function and the other to approximate the Policy?

Alexander Kuhnle
@AlexKuhnle
@CristianMonea The run.py script offers options to save some information, otherwise it should be straightforward to add it using the callbacks. But right now this functionality is not provided as part of the Runner class. Good idea, though.
@MariusHolm, not sure whether the "terminology" is 100% correct here, but the "critic_network" corresponds to a trained value function which acts similarly to the "baseline" in the context of policy gradient algorithms or the "critic" in the context of actor-critic, in the sense that it is trained to approximate the state value V(s) as the discounted cumulative sum, and the main policy uses the advantage, i.e. empirical return - critic(s), instead of just the empirical return in the PG "loss".
Alexander Kuhnle
@AlexKuhnle
Does that help? I'm not sure "critic" has a more specific meaning or is just typically used for "baseline" in the context of only actor-critic.
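For illustration, a minimal sketch of such an agent definition (the network sizes are made up, and the critic_network/critic_optimizer argument names follow the version discussed here; they may be named differently in other releases):

from tensorforce import Agent

# 'network' parametrizes the policy; 'critic_network' parametrizes the value
# function V(s) used as a baseline, so the policy update uses the advantage
# (empirical return - critic(s)) rather than the raw return.
agent = Agent.create(
    agent='ppo', environment=environment, batch_size=10,
    network=[dict(type='dense', size=64), dict(type='dense', size=64)],
    critic_network=[dict(type='dense', size=64), dict(type='dense', size=64)],
    critic_optimizer=dict(type='adam', learning_rate=1e-3)
)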
CristianMonea
@CristianMonea
@AlexKuhnle OK. Thank you! Another suggestion is to add the possibility to automatically stop training once a reward threshold is exceeded (e.g., a mean reward of 200 over the last 100 episodes).
Marius
@MariusHolm
@AlexKuhnle Thank you. That definitely helps :)
杨子信
@yzx20160815_twitter
Why does updated = self.model_observe(parallel=parallel, **kwargs) always return False?
@AlexKuhnle
Alexander Kuhnle
@AlexKuhnle
Ideally, it shouldn't always return False, but return True if an update was performed (which may happen very infrequently). Could you check whether it really never returns True? What is your agent config?
杨子信
@yzx20160815_twitter
image.png
@AlexKuhnle
杨子信
@yzx20160815_twitter
Sometimes it returns True.
Qiao.Zhang
@qZhang88
@AlexKuhnle Does Tensorforce support distributed training? That is, several sampling machines and one parameter server: the PS does the gradient updates while the other machines pull the most recent parameters and only run in exploration/sampling mode?
Alexander Kuhnle
@AlexKuhnle
@yzx20160815_twitter Does that mean the problem is solved and it does sometimes return True?
Alexander Kuhnle
@AlexKuhnle
@qZhang88 Yes and no. The parallelization mode which Tensorforce currently supports is based on one agent with multiple parallel input streams which interact with "remote" environments (via Python's multiprocessing or socket), instead of multiple remote worker agents and one central update agent.
Alexander Kuhnle
@AlexKuhnle
So the result is kind of the same, but the communication content is somewhat different. We've been using this approach in the context of computationally expensive simulations and it worked very well (see here, particularly diagrams on page 6).
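For reference, a rough sketch of that mode, assuming the Runner's num_parallel/remote arguments as described in the 0.6 parallelization docs (agent spec and environment name are placeholders):

from tensorforce import Runner

# One central agent, four environment copies in separate processes; the
# experience from all copies feeds the single agent (no separate worker agents).
runner = Runner(
    agent=dict(agent='ppo', batch_size=10),  # placeholder agent spec
    environment='CartPole-v1',               # placeholder environment
    num_parallel=4,
    remote='multiprocessing'
)
runner.run(num_episodes=1000)
runner.close()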
杨子信
@yzx20160815_twitter
@AlexKuhnle yes, thanks
Qiao.Zhang
@qZhang88

@qZhang88 Yes and no. The parallelization mode which Tensorforce currently supports is based on one agent with multiple parallel input streams which interact with "remote" environments (via Python's multiprocessing or socket), instead of multiple remote worker agents and one central update agent.

what is the sync mechanism of tensorforce for multiprocessing or socket?