Schade77
@Schade77
Hello, thanks for your great work! I would like to know if you have more examples of multithreading? Everything seems to work well without multithreading, but when I try to set num_parallel I get 'assert not isinstance(environment, Environment)'.
Alexander Kuhnle
@AlexKuhnle
Hi, could you post the relevant code and the last bit of the stack trace? Otherwise it's hard to say what exactly the problem is.
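For reference, the parallel-execution setup usually looks roughly like the sketch below -- just a guess at the issue: with num_parallel the environment generally needs to be passed as a spec/class rather than an already-created Environment instance, so that each worker can construct its own copy (argument names follow recent Tensorforce versions and may differ in yours):

from tensorforce import Runner

runner = Runner(
    agent=dict(agent='ppo', batch_size=10),
    environment='CartPole-v1',   # environment spec or class, not an Environment instance
    max_episode_timesteps=500,
    num_parallel=4,
    remote='multiprocessing'
)
runner.run(num_episodes=100)
runner.close()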
Matt Pettis
@mpettis
For agents, is there documentation (or an explanation) of what "memory" does and how it is used? I'm trying to make an example that learns not to turn the heater on too frequently in a thermostat setting (based on my previous example above), i.e. learns not to turn the heater on more than 3 times in 20 consecutive timesteps. It's not learning the way I think it should when the reward signal is just the distance of the temperature from the target temperature band, plus a large 10x - 100x negative reward if it turns the heater on more than 3 times in 20 timesteps. I assumed that I would not have to expose a state back to the agent that tracks the heater-on actions still available at the current step, but that the agent would learn such a policy on its own. I am using a tensorforce agent with a 'policy_gradient' objective, as in the 'getting started' section.
I am working on an example that I can link to that has more details if someone is interested in looking into this. But I think that my question about what's going on with memory can stand alone without the example.
Alexander Kuhnle
@AlexKuhnle
You've probably found the basic documentation, but otherwise here. Basically, memory is the mechanism that stores experience and determines how batches are sampled from it for an update. recent is a simple buffering mechanism which samples the latest timesteps; replay randomly samples from a usually bigger pool of timesteps, as known from DQN. But this is not what you're looking for.
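For illustration, the memory is just part of the agent spec, along the lines of the 'getting started' agent (a sketch only; exact argument names depend on the Tensorforce version):

from tensorforce import Agent

agent = Agent.create(
    agent='tensorforce', environment=environment,
    # 'recent': simple buffer of the latest timesteps; 'replay': random sampling from a larger pool
    memory=dict(type='replay', capacity=10000),   # or dict(type='recent', capacity=...)
    update=dict(unit='timesteps', batch_size=64),
    optimizer=dict(type='adam', learning_rate=1e-3),
    policy=dict(network='auto'),
    objective='policy_gradient',
    reward_estimation=dict(horizon=20)
)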
Matt Pettis
@mpettis
ok, thanks, I'll read that.
Alexander Kuhnle
@AlexKuhnle
It's actually not uncommon that people indeed expose a specially preprocessed state back to the agent -- and if it's possible and helps, why not. However, this is obviously unsatisfying, and the way I would say this should be solved is by using an RNN layer which is unrolled over the sequence of timesteps.
In Tensorforce there is, for instance, internal_lstm, and the internals arguments are related to that. They give the agent an internal state, and consequently the ability to remember what happened earlier in an episode (in theory -- and yes, many DRL models don't have this).
So that's what I would try. I would definitely be interested to hear how it goes. Unfortunately it's a rather involved feature, so it may not be super-easy to get it to train properly.
Alexander Kuhnle
@AlexKuhnle
If you use the auto network, for instance, you can check the argument internal_rnn, otherwise add an internal_lstm layer.
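Roughly like this (a sketch only; whether internal_rnn takes a bool or a horizon, and the exact internal_lstm arguments, depend on the version):

from tensorforce import Agent

# Option 1: auto network with an internal RNN on top of the final layer
agent = Agent.create(
    agent='ppo', environment=environment, batch_size=10,
    network=dict(type='auto', size=64, depth=2, internal_rnn=True)
)

# Option 2: custom layer stack ending in an internal LSTM cell
network = [
    dict(type='dense', size=64),
    dict(type='internal_lstm', size=64)
]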
Matt Pettis
@mpettis
That makes a lot of sense and matches the way I was thinking about it (though I was unsure whether some newer model in this catalog could avoid having to expose external state), and knowing that others do the same state exposure is helpful. It's good to know that internal_lstm exists, and I'll keep it in mind, but I think I need a lot more practice with the basics before I get to that. So I'm going to do as you suggested and just explicitly expose that state myself. I think I will be able to contribute this as an example too, but it really won't be much different from my other example, and you probably want examples that exercise other features of the framework...
... or I may try the auto network now...
last question for now... there's no facility for mixed dtypes for state, right? In my example, I could see that I have two dimensions -- one is the current temperature, a real number, and the second is a count, which is an integer. I'm casting the integer to a real for now, and it should work, but I was curious if I was missing something.
Alexander Kuhnle
@AlexKuhnle
There is, actually. Mixed states can be specified via a "dict of dicts", e.g. states=dict(state1=dict(type='int', shape=()), state2=dict(type='float', shape=())). However, in that case you can't use a network that is a simple stack of layers. Two options: either the auto network takes care of it and just internally creates a reasonably simple network (with some modification options), or you specify a "multi-input" network yourself. You can specify networks as a "list of lists", where each of the inner lists is a layer stack, and the special register and retrieve layers are used to combine these stacks into a full network. I realise there is no good example currently. Will need to add one to the docs.
Something like [[dict(type='retrieve', tensors='state1'), ..., dict(type='register', tensor='state1-embedding')], [same for state2], [dict(type='retrieve', tensors=['state1-embedding', 'state2-embedding'], aggregation='concat'), ...]]
Hope that illustrates how the stacks are stitched together via register and retrieve
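Fleshed out, the above could look roughly like this (layer sizes and the embedding for the int input are illustrative; the embedding assumes state1 also specifies num_values, see below):

network = [
    [   # stack 1: process state1 and register its embedding
        dict(type='retrieve', tensors='state1'),
        dict(type='embedding', size=16),
        dict(type='dense', size=32),
        dict(type='register', tensor='state1-embedding')
    ],
    [   # stack 2: same for state2
        dict(type='retrieve', tensors='state2'),
        dict(type='dense', size=32),
        dict(type='register', tensor='state2-embedding')
    ],
    [   # stack 3: combine both embeddings and continue with the shared part
        dict(type='retrieve', tensors=['state1-embedding', 'state2-embedding'], aggregation='concat'),
        dict(type='dense', size=64)
    ]
]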
Alexander Kuhnle
@AlexKuhnle
Note that float is fine for generic numbers, however, if your int really represents a fixed finite set of choices, then turning it into a float is not a good idea, I'd say. A better way is to use an embedding layer to map each of the finite choices to a trainable embedding vector, similar to how words are treated in natural language processing.
(That's how auto treats int inputs, given that num_values specifies the number of embeddings required)
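Concretely, that would mean specifying the int state with num_values, e.g. (hypothetical names from the thermostat example above):

states = dict(
    temperature=dict(type='float', shape=()),
    heater_count=dict(type='int', shape=(), num_values=21)   # 0..20 heater-on actions in the window
)
# 'auto' then maps the int input through a trainable embedding instead of treating it as a float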
I should spend some time over the weekend to add more info to the docs...
:-)
Matt Pettis
@mpettis
Excellent, thanks!
Alexander Kuhnle
@AlexKuhnle
(Also, I can only repeat: involvement and contributions on documentation/examples/tutorial side are really welcome :-D There are lots of opportunities for improvement.)
Matt Pettis
@mpettis
Absolutely, and can do. Mine will be small and incremental, but I'll give what I can. I'm still working on just making sure types and shapes of arguments are not breaking things, but I know that it is useful to have examples that show those details too...
Alexander Kuhnle
@AlexKuhnle
Regarding multiple inputs, if you look into the AutoNetwork code, the constructor basically just builds what you could otherwise specify yourself as a list of lists of layers (as mentioned above). So it may be somewhat informative...
Alexander Kuhnle
@AlexKuhnle
Hey @mpettis, I added some documentation for multi-input networks here (and multi-state/action specification here). All very minimal, but a start... :-)
Matt Pettis
@mpettis
@AlexKuhnle Thanks for the documentation, I'll read!
JanScheuermann
@JanScheuermann
Hello everyone, does anyone know whether it is possible to train an agent with a continuous action space (for a regression problem) in Tensorforce, and is there an example where this is done? Best regards and thanks in advance, Jan
JanScheuermann
@JanScheuermann
My problem is the following: I have a sales dataset of a bakery chain and want to predict the optimal order quantity for the next day with reinforcement learning. So in my understanding I have a continuous state space of 45 features that are either float or bool and one continuous action (the order quantity) as an output.
Consequently, I created a custom environment (tailored to my bakery case) with states of shape (45,) and the action as a scalar, both of type float:
import numpy as np
from tensorforce.environments import Environment

num_states = 45  # number of state features (float/bool), as described above

class CustomEnvironment(Environment):  # for bakery data
    def __init__(self):
        super().__init__()

    def states(self):
        return dict(type='float', shape=(num_states,))

    def actions(self):
        return dict(type='float', min_value=0, max_value=1000)  # , shape=(1,))

    def max_episode_timesteps(self):
        return super().max_episode_timesteps()

    def close(self):
        super().close()

    def reset(self):
        state = np.random.random(size=(num_states,))  # to be adjusted to a random record of the dataset
        return state
I then created an actor-critic agent:
from tensorforce import Agent
from tensorforce.agents import ActorCritic

agent = Agent.create(
    agent=ActorCritic,  # DeepQNetwork,
    environment=environment,
    batch_size=1000,
    memory=10000,
    update_frequency=1000,
    learning_rate=0.00001,
    exploration=dict(
        type='decaying', dtype='float', unit='updates', decay='polynomial',
        initial_value=5.0, decay_steps=200, final_value=0.1, power=1.0
    ),
    seed=42
)
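The training loop then follows the standard Tensorforce act/observe pattern (sketch; the episode count is a placeholder and execute() with my reward logic is omitted here):

# environment is the Environment.create(...) instance passed to Agent.create above
for episode in range(1000):
    states = environment.reset()
    terminal = False
    while not terminal:
        actions = agent.act(states=states)
        states, terminal, reward = environment.execute(actions=actions)
        agent.observe(terminal=terminal, reward=reward)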
JanScheuermann
@JanScheuermann
because, according to some articles, e.g. this post https://www.reddit.com/r/MachineLearning/comments/bfny3m/d_deep_q_learning_for_continuous_action_space/, the actor-critic agent seems to be the go-to agent for continuous action spaces
JanScheuermann
@JanScheuermann
So in this case, I assumed that the actor-critic agent would give me the order quantity, i.e. the value that leads to the highest reward according to my reward function. However, it just seems to pick random values in the defined range, even when I train and test it on only a single record (always the same state and always the same 'target value' for the order quantity in the reward function).
Does anyone have an idea what the problem is? Any wrong assumptions that I made?
Alexander Kuhnle
@AlexKuhnle
Hi, a few comments:
Alexander Kuhnle
@AlexKuhnle
  • Environment looks good, however, I would recommend scaling the action as part of the environment and use as action space the interval [-1, 1]. Just to be sure, I can't say off the top of my head whether learning in the Tensorforce implementation will be completely robust to scaling (it may be).
  • I would definitely reduce the batch size etc. (16, 32, or so), at least while you're testing simplified settings, and remove things like exploration. Also, your learning rate is very small; maybe better to try 1e-3 to 1e-4.
  • Fixing the random seed can be dangerous, maybe you're just looking at a particularly unlucky seed.
  • Maybe start off with a categorical action to see whether that works better. Also consider trying the PPO agent, as it generally works quite well (rough sketch after this list).
  • If nothing helps and you don't get anywhere, maybe it's worth temporarily moving to a Gym environment to get something working. It's hard to say with deep RL, for instance, whether it's actually easy to learn a completely fixed environment, as you describe it. With a standard environment, at least you don't have to worry about that part.
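A rough sketch of a simplified starting point along these lines (hyperparameters are only suggestions, not tested values):

from tensorforce import Agent

agent = Agent.create(
    agent='ppo',
    environment=environment,
    batch_size=16,        # much smaller batch while debugging
    learning_rate=1e-3,   # rather than 1e-5
    # no explicit exploration and no fixed seed while testing
)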
JanScheuermann
@JanScheuermann
Hi Alexander, thank you very much for your comments. I'll try to incorporate them today and let you know what worked and what didn't. Best regards, Jan
JanScheuermann
@JanScheuermann
@AlexKuhnle when you talk about scaling the action space, do you really mean the action space or the state I put into agent.act()? If you really mean the action, can I simply scale it by setting min_value=-1 and max_value=1 in the action definition in the environment? Best regards, Jan
JanScheuermann
@JanScheuermann
And should the reward also be scaled or is it maximized regardless of scale?
Alexander Kuhnle
@AlexKuhnle
Yes, I mean the action, but normalizing on the environment side. So: if your actions are within [0, 1000], you can turn them into e.g. [-1, 1] by transforming the agent output a via (a + 1) * 500 at the beginning of execute(). Not a bad idea generally if the scale of values is large (also in the state), just to be sure, since NNs work better around zero.
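In your environment from above, that would look roughly like this (compute_reward is a hypothetical helper standing in for your bakery reward logic, and the next state is a placeholder as in your reset()):

# inside CustomEnvironment:
def actions(self):
    # agent-side action space normalized to [-1, 1]
    return dict(type='float', min_value=-1.0, max_value=1.0)

def execute(self, actions):
    # rescale the agent output from [-1, 1] back to the original order-quantity range [0, 1000]
    order_quantity = (actions + 1.0) * 500.0
    reward = self.compute_reward(order_quantity)        # hypothetical helper with the bakery reward
    next_state = np.random.random(size=(num_states,))   # placeholder, as in reset()
    terminal = False
    return next_state, terminal, reward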
While this is just about keeping values in a domain where the NN works effectively, the reward is a different issue -- "shaping" the reward will actively change the optimization problem, for better or worse, and it depends a lot on what algorithm you're using etc.
Alexander Kuhnle
@AlexKuhnle
There are features like reward normalization in TForce, but to the degree that you know how the reward of your environment is working and you can adapt it, there's no need to rely on this env-agnostic feature. I think it makes sense to also keep the rewards around 0, in particular not necessarily positive-only. Also worth thinking about what the "cumulative return" (in combination with discount) will look like for the agent, in particular if, for instance, you're using a policy gradient algorithm.
CristianMonea
@CristianMonea
Tensorforce allows saving an agent by simply calling agent.save(). Is it possible to save a Runner in a similar way? I am interested in storing information such as timesteps and episode_rewards after each run for further processing/plotting.
Marius
@MariusHolm

Hi @AlexKuhnle,
I've used the PPO agent of Tensorforce for my master's project.
Implementation-wise things have worked out nicely, and I'm now writing up my thesis.
In that regard I want to explain some of my code, and in doing so I want to explain the PPO agent.

In the agent definition two neural networks are defined:

  1. "network"
  2. "critic_network" (this has confused me a bit, as I start thinking of actor-critic which to my understanding PPO is separate of..)

What is the difference between these networks?

E.g. is one used to approximate the Advantage function and the other to approximate the Policy?

Alexander Kuhnle
@AlexKuhnle
@CristianMonea The run.py script offers options to save some information; otherwise it should be straightforward to add it using the callbacks. But right now this functionality is not provided as part of the Runner class. Good idea, though.
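A minimal sketch of such a callback (assuming Runner.run's callback and callback_episode_frequency arguments and the runner.episode_rewards attribute -- please check against your Tensorforce version):

episode_rewards = list()

def record(runner, parallel):
    # called after each episode; returning False would stop the run early
    episode_rewards.append(runner.episode_rewards[-1])
    return True

runner.run(num_episodes=500, callback=record, callback_episode_frequency=1)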
@MariusHolm, not sure whether the "terminology" is 100% correct here, but the "critic_network" corresponds to a trained value function which acts similar to the "baseline" in the context of policy gradient algorithms, or the "critic" in the context of actor-critic: it is trained to approximate the state value V(s) as the discounted cumulative sum of rewards, and the main policy uses the advantage, i.e. empirical return - critic(s), instead of just the empirical return in the PG "loss".
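In formulas, roughly (a sketch of the above, with \phi denoting the critic parameters):

% critic: trained to approximate the discounted cumulative return (state value)
V_\phi(s_t) \;\approx\; \mathbb{E}\Big[\sum\nolimits_{k \ge 0} \gamma^{k} r_{t+k}\Big]
% advantage used by the main policy: empirical return minus critic estimate
A_t \;=\; \sum\nolimits_{k \ge 0} \gamma^{k} r_{t+k} \;-\; V_\phi(s_t)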
Alexander Kuhnle
@AlexKuhnle
Does that help? I'm not sure whether "critic" has a more specific meaning or whether it is just what "baseline" is typically called in the actor-critic context.
CristianMonea
@CristianMonea
@AlexKuhnle OK, thank you! Another suggestion is to add the possibility to automatically stop training once a reward threshold is exceeded (e.g., a mean reward of 200 over the last 100 episodes).
Marius
@MariusHolm
@AlexKuhnle Thank you. That definitely helps :)