@navidmdn
Another reason I think there might be a bug:
1. I train the model offline.
2. I collect observations into a 1M buffer with ConstantEpsilonGreedy(epsilon=0.05).
3. I run fit_online, but initialize the buffer with the one from step 2.
4. The average return suddenly drops from the offline checkpoint (around 2k) to something around 50!
Takuma Seno
@takuseno
https://arxiv.org/abs/2006.09359
It seems that the performance drop is difficult to prevent.
Takuma Seno
@takuseno
And, you should not use ConstantEpsilonGreedy with continuous control algorithms because it’s designed for discrete algorithms.
I guess you don’t need to use explorers since AWAC is a stochastic policy algorithm. But if you really want to use one, please use NormalNoise instead.
https://github.com/takuseno/d3rlpy/blob/6d65134498bf3d0ab3474107ec43c03d25050fe0/d3rlpy/online/explorers.py#L106
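For reference, the idea behind a NormalNoise-style explorer can be sketched in plain numpy. This is a minimal illustration, not d3rlpy's actual implementation; the std value and the [-1, 1] action bounds here are assumptions:

```python
import numpy as np

def normal_noise_explore(action, std=0.1, low=-1.0, high=1.0):
    """Perturb a continuous action with zero-mean Gaussian noise, then clip
    to the action bounds. Unlike epsilon-greedy, which assumes a discrete
    action set, this is the standard exploration scheme for continuous control.
    """
    noisy = action + np.random.normal(0.0, std, size=np.shape(action))
    return np.clip(noisy, low, high)

# Example: perturb a 2-dimensional continuous action.
a = np.array([0.5, -0.5])
print(normal_noise_explore(a))
```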
@navidmdn
That’s right, I should use NormalNoise there. The problem still stands, though.

https://arxiv.org/abs/2006.09359
It seems that the performance drop is difficult to prevent.

I agree with you. But what is the difference between these two cases:
1. collecting a buffer, saving it as an MDPDataset, and training offline
2. collecting a buffer and fitting online on it
both using the same explorer (or even no explorer!)

Because the first method does not suffer from the performance drop, and the training method is quite similar.

Julius Frost
@juliusfrost
@takuseno Hi, do you have any scripts to convert the reproduction results to normalized return like in the papers? Right now I’ve run CQL on the PyBullet environments and I want to compare with the paper results.
Takuma Seno
@takuseno

@navidmdn Could you try this?
Step 1: train AWAC with the static dataset.
Step 2: train AWAC with the environment and the buffer initialized with the static dataset.

I guess that the distribution shift in the dataset might have a significant impact. Also, I believe you don’t need to use explorers during online training since the policy is stochastic.

Takuma Seno
@takuseno
@juliusfrost Hello! This problem is a kind of headache. I believe we can infer the parameters we need to calculate this from this paper.
https://arxiv.org/abs/2004.07219
But, so far I don’t have the script for this. Possibly I’ll add it later.
Jose Antonio Martin H.
@jamartinh
Hello, is anyone having trouble with the min_max scaler in the new version?

In all algorithms I can use the "standard" scaler without problems, but with "min_max" I get an error:
...
/data/conda/envs/python39/lib/python3.9/site-packages/torch/distributions/distribution.py in __init__(self, batch_shape, event_shape, validate_args)
     51                 continue  # skip checking lazily-constructed args
     52             if not constraint.check(getattr(self, param)).all():
---> 53                 raise ValueError("The parameter {} has invalid values".format(param))
     54         super(Distribution, self).__init__()
     55

ValueError: The parameter loc has invalid values

Takuma Seno
@takuseno
Hello @jamartinh , thanks for reporting this. Could you share the minimum example to reproduce this?
Jose Antonio Martin H.
@jamartinh
Hi @takuseno, another thing: the conda-forge channel for Python 3.9 on Windows says:
• d3rlpy -> python[version='>=3.6,<3.7.0a0|>=3.7,<3.8.0a0|>=3.8,<3.9.0a0']
Takuma Seno
@takuseno
@jamartinh Yeah, this is a known issue. For now, I don’t build it for Windows with Python 3.9 due to a package conflict. It works on Linux and macOS.
Jose Antonio Martin H.
@jamartinh

@takuseno I have developed some extensions to the Q-functions in https://github.com/jamartinh/d3rlpy-addons/blob/master/d3rlpy_addons/models/torch/q_functions.py. The thing is that I am assuming every network in the ensemble will have its frozen twin. Do you see any trouble with the ensemble? The main point is that the outputs of the Q-function networks are now:

(self._fc(h) - self._fc0(h)) + self._q_value_offset
where fc0 is a frozen network
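The construction can be sketched in plain numpy. This is an illustrative sketch, not the code in the linked file; the single-linear-layer form and the shapes are assumptions, and only the names fc, fc0, and q_value_offset mirror the original:

```python
import numpy as np

class OffsetQHead:
    """Q-head that outputs (fc(h) - fc0(h)) + q_value_offset,
    where fc0 is a frozen copy of fc taken at construction time."""

    def __init__(self, feature_size, q_value_offset=0.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(size=(feature_size, 1))  # trainable weights
        self.b = np.zeros(1)
        # Frozen twin: a snapshot of the initial parameters, never updated.
        self.w0, self.b0 = self.w.copy(), self.b.copy()
        self.offset = q_value_offset

    def __call__(self, h):
        q = h @ self.w + self.b       # trainable head
        q0 = h @ self.w0 + self.b0    # frozen head
        return (q - q0) + self.offset

head = OffsetQHead(feature_size=4, q_value_offset=10.0)
h = np.ones((2, 4))
# Before any training step fc == fc0, so every output equals the offset (10.0).
print(head(h))
```

One consequence of this design: at initialization the ensemble members all output exactly the offset, and training only moves the trainable head away from its frozen twin.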

Takuma Seno
@takuseno
Yes, every Q-function in the ensemble has its frozen target. And, I think your code is fine.
Jose Antonio Martin H.
@jamartinh
Thanks @takuseno for the check !
Jose Antonio Martin H.
@jamartinh
@takuseno, have you considered creating an additional repo, "d3rlpy-contrib", so that people can contribute their working ideas and let the community test those contributions before they are added to the main library? This is common practice.
Takuma Seno
@takuseno
Hello, I’m working on creating the performance table of d3rlpy. Here is the work-in-progress result. If you want to check how well each algorithm should perform, please check table.csv.
https://github.com/takuseno/d3rlpy-benchmarks/blob/main/table.csv
Takuma Seno
@takuseno
@jamartinh Thanks for the proposal! For now, I’m not sure how many contrib features we would have. If we accumulate many experimental features, I’m inclined to add a contrib directory to d3rlpy.
pstansell
@pstansell
@takuseno, should it be possible for discounted_sum_of_advantage_scorer to return negative values? I ask because, from the equations in the documentation, it seems the returned values should always be positive, but I have all negative values in the csv file written by the scorer. I should also say that the returns are positive, not negative.
Takuma Seno
@takuseno
@pstansell Sorry for the late response. It’s fixed in this commit:
takuseno/d3rlpy@4ba44c4
Regarding the saved csv file, there will be no changes with this commit (the saved value is already in the non-negated scale). To be specific, the value of discounted_sum_of_advantage_scorer can be both negative and positive.
pstansell
@pstansell
@takuseno, thanks for your reply. I’ve been using v0.61 of your code recently because I’m having problems with the current version (I’ll return to it later to try to understand what’s going on, but for now I have deadlines to meet). I see you’ve removed the "(in negative scale)" from the documentation for discounted_sum_of_advantage_scorer between v0.61 and v0.91, but going by the definition, I don’t understand how sum[Q(s,a) - max_b(Q(s,b))] can ever be positive, as surely max_b(Q(s,b)) >= Q(s,a) always. (The paper you reference in your documentation is quite complicated and is not helping me.)
Takuma Seno
@takuseno
I think I should update the equation. This could be more helpful.
This means that if the value is negative, the policy action has a larger expected return than the dataset action.
In the discrete action case, this is smaller than or equal to zero. But in the continuous action case, it might be positive, since it’s hard to compute the exact max Q.
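To make the sign argument concrete, here is a minimal numpy sketch of the discrete-action quantity (not d3rlpy's actual scorer; the toy Q-values are made up). Each term Q(s_t, a_t) - max_b Q(s_t, b) is at most zero, so the discrete-case sum can never be positive; in the continuous case the max is only approximated by the policy's action, so the estimate can come out positive:

```python
import numpy as np

def discounted_advantage_sum(q_values, taken_actions, gamma=0.99):
    """sum_t gamma^t * (Q(s_t, a_t) - max_b Q(s_t, b)) for discrete actions.

    q_values: shape (T, n_actions); taken_actions: shape (T,).
    Every advantage term is <= 0, so the result is never positive.
    """
    T = len(taken_actions)
    q_taken = q_values[np.arange(T), taken_actions]
    advantages = q_taken - q_values.max(axis=1)
    return np.sum((gamma ** np.arange(T)) * advantages)

# Two timesteps, two actions; the dataset picks action 0 both times.
q = np.array([[1.0, 2.0],
              [0.5, 0.3]])
print(discounted_advantage_sum(q, np.array([0, 0])))  # -1.0
```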
pstansell
@pstansell
Something like that?
Takuma Seno
@takuseno
I feel you should take the expectation over the first term. But the first term is the Q-value for the dataset action and the second term is the Q-value for the policy action.
pstansell
@pstansell
Does the following make sense?
The difference is that my two Q's (actually one Q and one V) have different policy subscripts, i.e., a \beta in my first and a \pi in my second. Both your Q's have the same subscript, \theta.
pstansell
@pstansell
@takuseno, I have another quick note regarding what you call the "average TD error" here: https://d3rlpy.readthedocs.io/en/latest/references/generated/d3rlpy.metrics.scorer.td_error_scorer.html. Is it actually the average of the square of the TD error?
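The quantity in question can be written out in a few lines of numpy (an illustrative sketch, not d3rlpy's scorer code; the toy arrays are made up):

```python
import numpy as np

def mean_squared_td_error(rewards, q, q_next, terminals, gamma=0.99):
    """Average of the *squared* one-step TD error:
    mean_t (r_t + gamma * Q(s_{t+1}, a_{t+1}) * (1 - done_t) - Q(s_t, a_t))^2
    """
    td = rewards + gamma * q_next * (1.0 - terminals) - q
    return np.mean(td ** 2)

r = np.array([1.0, 0.0])
q = np.array([1.0, 0.5])
q_next = np.array([0.5, 0.0])
done = np.array([0.0, 1.0])
# TD errors are [0.495, -0.5]; the mean of their squares is 0.2475125.
print(mean_squared_td_error(r, q, q_next, done))
```

Note that squaring makes the metric non-negative even when individual TD errors are negative, which matters for how the logged values should be read.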
Jose Antonio Martin H.
@jamartinh
Hi, I have a question. I tried to follow the execution, but it is proving difficult. What is the consequence for DDPG, TD3, and SAC of setting gamma=0? Should these algos converge to the same thing? Does SAC still use entropy? Do DDPG and TD3 behave the same? I noticed in practice that they do not. Thanks for any help!
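One way to see the effect: the one-step TD target r + gamma * Q(s', a') collapses to just r when gamma = 0, so the critic of any of these algorithms regresses to the immediate reward. A tiny sketch (illustrative only, not d3rlpy code; the numbers are made up):

```python
def td_target(r, q_next, gamma, done=0.0):
    """Standard one-step TD target used by DDPG/TD3/SAC-style critics."""
    return r + gamma * q_next * (1.0 - done)

# With gamma = 0 the bootstrap term vanishes: the target is just the reward,
# so the critic fits E[r | s, a] regardless of the algorithm.
print(td_target(r=1.0, q_next=5.0, gamma=0.0))   # 1.0
print(td_target(r=1.0, q_next=5.0, gamma=0.99))  # 5.95
```

The algorithms still differ in their actor updates (for example, SAC's actor keeps its entropy bonus), so exact equivalence between them should not be expected even at gamma = 0.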
Jose Antonio Martin H.
@jamartinh
Another question: I want to push the Q-functions trick in https://github.com/jamartinh/d3rlpy-addons/blob/master/d3rlpy_addons/models/torch/q_functions.py to contrib. I am not sure why the sign of q_value_offset should be negated; are the Q-values of the actor negated?
Tomás Lara
@tmsl31
Hi, I have a question. Is it possible to use the d3rlpy library on non-episodic tasks? Reading the MDPDataset documentation, I'm not sure about it.
pstansell
@pstansell
and this way to create the replay buffer
I found that this did not work well
but this did work well
pstansell
@pstansell
There is a typo in the algorithm, sorry, the t_0 inside the while loop should be t_n.
pstansell
@pstansell
@jamartinh, here are the images I'm referring to (i.e., directly above this message).
Jose Antonio Martin H.
@jamartinh
Hi @takuseno, don't you think the class EncoderWithAction(metaclass=ABCMeta) should be an nn.Module? What is the reason behind this class not being a torch nn.Module?
Jose Antonio Martin H.
@jamartinh
Is it possible to use n_frames without image observations, i.e., use n_frames > 1 with the vector encoder? If not, what would need to be refactored to allow this?
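For context, the frame-stacking idea itself is straightforward for vector observations. Here is a minimal numpy sketch (illustrative only, not d3rlpy internals; padding the episode start by repeating the first observation is an assumption about the desired behavior):

```python
import numpy as np

def stack_frames(observations, n_frames):
    """Stack the last n_frames vector observations into one flat vector per
    step, padding with repeats of the first observation at the episode start.
    """
    T, dim = observations.shape
    pad = np.repeat(observations[:1], n_frames - 1, axis=0)
    padded = np.concatenate([pad, observations])
    return np.stack([padded[t:t + n_frames].reshape(-1) for t in range(T)])

obs = np.arange(6, dtype=float).reshape(3, 2)  # 3 steps, 2-dim observations
print(stack_frames(obs, n_frames=2).shape)  # (3, 4)
```

Each stacked row concatenates the current observation with the previous n_frames - 1 observations, so a vector encoder would just see an input of size n_frames * dim.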