Meteore
@iameteore314
@ankurankan Hi, I've been looking for resources - documentation, threads - on how to deploy a pgmpy BN/DBN on Google Cloud Platform in order to run predictions (at scale). Any recommendations? It would mean a lot, thanks 😊
*by predictions I mean inferences
Ankur Ankan
@ankurankan
@iameteore314 I don't know of anyone who has deployed pgmpy at scale, so I don't know whether it would bring any challenges. It should work like any other Python package. One point to keep in mind is that predictions will be quite slow compared to standard machine learning algorithms, but batch predictions might help (because of result caching).
Meteore
@iameteore314
@ankurankan thanks for your quick reply! I’ll try it out, and document this project to let you know what comes up. All the best, Meteore.
MathijsPost
@MathijsPost

Good morning, I have a quick question. I'm using structure learning to find structures with the K2Score and BicScore methods. When inspecting the scores afterwards, one model scores higher on K2 and the other scores higher on BIC, which makes sense. But how do I know which model fits the dataset better? Is there a common score for comparing the two models and finding the better one?

@ankurankan do you know how to approach this?

Ankur Ankan
@ankurankan
@MathijsPost I would suggest doing something like what we discussed in #1361. I don't know of any better methods for model comparison. I also recently worked on a paper for testing model structures against data, might be helpful: https://currentprotocols.onlinelibrary.wiley.com/doi/pdfdirect/10.1002/cpz1.45
MathijsPost
@MathijsPost
Thank you! @ankurankan
zjuwormer
@zjuwormer
@ankurankan Hi, I don't know whether pgmpy has methods for parameter learning with the EM algorithm when the data is not fully observed. If not, how can I implement it for a discrete Bayesian network based on pgmpy?
Ankur Ankan
@ankurankan
@zjuwormer No, pgmpy currently doesn't have an implementation of the EM algorithm. To implement it, two things need to be done: 1) add a way to specify latent nodes in BayesianModel (could be as simple as adding another property to the class); 2) implement the EM algorithm itself (should be able to extend the ParameterEstimator class).
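To make the second step concrete, here is a minimal EM sketch in plain numpy (not pgmpy API; the model, names, and numbers are all illustrative): a hidden binary variable picks one of two biased coins, and EM recovers both biases from the observed head counts alone.

```python
import numpy as np

# Toy latent-variable model: hidden coin choice Z, observed head counts.
rng = np.random.default_rng(0)
n_trials, n_flips = 500, 10
true_p = np.array([0.35, 0.8])            # biases of the two coins (ground truth)
z = rng.integers(0, 2, n_trials)          # hidden coin used in each trial
heads = rng.binomial(n_flips, true_p[z])  # observed: heads out of n_flips

p = np.array([0.5, 0.6])                  # initial guesses for the two biases
pi = 0.5                                  # initial guess for P(Z = 1)
for _ in range(100):
    # E-step: responsibility r[i] = P(Z=1 | heads[i]) under current parameters
    lik = lambda q: q ** heads * (1 - q) ** (n_flips - heads)
    r = pi * lik(p[1]) / (pi * lik(p[1]) + (1 - pi) * lik(p[0]))
    # M-step: re-estimate the parameters from expected counts
    pi = r.mean()
    p = np.array([
        ((1 - r) * heads).sum() / ((1 - r).sum() * n_flips),
        (r * heads).sum() / (r.sum() * n_flips),
    ])
```

The same E-step/M-step structure carries over to a discrete BN with latent nodes: the E-step computes expected sufficient statistics by inference over the latent variables, and the M-step re-fits the CPDs from those expected counts.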
MathijsPost
@MathijsPost
@ankurankan Hi, I have been looking into the K2 structure learning algorithm in pgmpy. Does it assume variable ordering as in the original variable? Or does it define the entire search space at the beginning and compute which edges increase the model score?
Ankur Ankan
@ankurankan
@MathijsPost I am not sure what you mean by "variable ordering as in the original variable". Could you please elaborate? The HillClimb estimator starts with an empty graph (with all the variables in the dataset) and then iteratively modifies it (adding, deleting, or reversing edges) based on whether the modification improves the score of the model.
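The loop described above can be sketched in plain Python (this is an illustrative toy, not pgmpy's implementation; the dataset and decomposable BIC score are made up, and edge reversal is omitted for brevity):

```python
import itertools
import math
from collections import defaultdict

import numpy as np

# Toy dataset: B depends on A, C is independent noise (all made up).
rng = np.random.default_rng(1)
N = 2000
A = (rng.random(N) < 0.5).astype(int)
B = np.where(A == 1, rng.random(N) < 0.8, rng.random(N) < 0.2).astype(int)
C = (rng.random(N) < 0.5).astype(int)
data = {"A": A, "B": B, "C": C}

def local_bic(child, parent_set):
    """BIC contribution of one binary node given its parent set."""
    cols = [data[p] for p in parent_set]
    groups = defaultdict(list)
    for i in range(N):
        groups[tuple(c[i] for c in cols)].append(data[child][i])
    ll = 0.0
    for xs in groups.values():
        n, n1 = len(xs), sum(xs)
        for cnt in (n - n1, n1):
            if cnt:
                ll += cnt * math.log(cnt / n)
    return ll - 0.5 * math.log(N) * 2 ** len(parent_set)  # dimension penalty

def ancestors(v, parents):
    seen, stack = set(), list(parents[v])
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents[p])
    return seen

# Greedy search: start from the empty graph, apply the single edge
# addition/removal that most improves the total score, stop when none helps.
parents = {v: set() for v in data}
score = sum(local_bic(v, parents[v]) for v in parents)
improved = True
while improved:
    improved = False
    best_score, best_parents = score, None
    for u, v in itertools.permutations(data, 2):
        cand = {w: set(ps) for w, ps in parents.items()}
        if u in parents[v]:
            cand[v].discard(u)                # try deleting u -> v
        elif v in ancestors(u, parents):
            continue                          # adding u -> v would make a cycle
        else:
            cand[v].add(u)                    # try adding u -> v
        s = sum(local_bic(w, cand[w]) for w in cand)
        if s > best_score + 1e-9:
            best_score, best_parents = s, cand
    if best_parents is not None:
        score, parents = best_score, best_parents
        improved = True
```

On this toy data the search keeps exactly one edge between A and B (either direction scores equally, since the two DAGs are score-equivalent) and rejects edges involving C, because the BIC penalty outweighs the tiny likelihood gain from a spurious edge.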
MathijsPost
@MathijsPost
@ankurankan Hi, sorry, I made a typo. What I meant is that the original K2 algorithm uses a variable ordering, and its outcome differs for different orderings. What I understood from the 1992 paper is that the variable order limits the parents of the variables, i.e., for a variable order [A, B, C, D, E], the only possible parents for C are A and B. I believe pgmpy does not use a variable ordering, right?
I mean the scoring function stays the same; I think not using the variable ordering lets you find a better structure in the end, since the search space is enlarged.
Ankur Ankan
@ankurankan
@MathijsPost That's correct, the implementation doesn't have any limits on the parents for any of the variables.
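The ordering constraint from the original K2 paper can be written down directly; a tiny illustration (variable names are just placeholders):

```python
# Under K2's ordering constraint, a node may only draw parents from
# variables that appear earlier in the ordering; pgmpy's hill climb
# imposes no such limit.
order = ["A", "B", "C", "D", "E"]
allowed_parents = {v: order[:i] for i, v in enumerate(order)}
# allowed_parents["C"] is ["A", "B"]; allowed_parents["A"] is []
```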
MathijsPost
@MathijsPost
@ankurankan Great, thank you!
Gaoxiang Zhou
@Gavin_Chou1994_twitter
@ankurankan Hi, I was wondering if there is an easier way of implementing a multi-slice DBN. For example, I have a model of just 3 elements, (A, B, C). For A and B, their states at time T+1 depend only on their own states at time T, i.e. (A0,A1), (A1,A2), ..., (B0,B1), (B1,B2), ...; for C, its state at time T+1 depends only on the state of (A, B, C) at time T. The inter-slice CPD is stationary for all time steps. If I wish to infer the distribution of C at time step 100 given the initial A0, B0, C0 distribution, do I have to create identical but separate TabularCPD objects for (A0,C1), (A1,C2), (A2,C3), ..., (A98,C99), (A99,C100)? Or is there a way to repeatedly apply a single TabularCPD object (At, Ct+1)? Does DBNInference support something like this?
Ankur Ankan
@ankurankan
@Gavin_Chou1994_twitter For defining the DBN you just need to specify the initial CPDs and the first inter-slice CPDs (as they are constant for the rest of the time slices). Then in the query method you can specify which time slice you want to query. In the example here: https://github.com/pgmpy/pgmpy/blob/dev/pgmpy/inference/dbn_inference.py#L470, you can change [('X', 0)] to [('X', 100)] to query the CPD of X in the 100th time slice.
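For intuition, the "apply one stationary inter-slice CPD repeatedly" idea can be sketched in plain numpy (the CPD numbers below are hypothetical, and this is not pgmpy API): the joint distribution over (A, B, C) is pushed through the same 8x8 transition matrix 100 times.

```python
import numpy as np

# Hypothetical stationary transition model for binary A, B, C:
pA = np.array([[0.9, 0.1],
               [0.2, 0.8]])                   # P(A_t+1 | A_t)
pB = np.array([[0.7, 0.3],
               [0.4, 0.6]])                   # P(B_t+1 | B_t)
pC1 = np.array([[[0.1, 0.3], [0.4, 0.6]],
                [[0.5, 0.7], [0.8, 0.9]]])    # P(C_t+1 = 1 | A_t, B_t, C_t)

# Build one 8x8 joint transition matrix T[next, current] from the factored CPDs.
states = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
T = np.zeros((8, 8))
for j, (a, b, c) in enumerate(states):
    for i, (a2, b2, c2) in enumerate(states):
        pc1 = pC1[a, b, c]
        T[i, j] = pA[a, a2] * pB[b, b2] * (pc1 if c2 == 1 else 1 - pc1)

# Start from a point distribution A0 = B0 = C0 = 0 and push it 100 slices ahead.
dist = np.zeros(8)
dist[0] = 1.0
dist = np.linalg.matrix_power(T, 100) @ dist
p_C100 = sum(dist[i] for i, s in enumerate(states) if s[2] == 1)  # P(C_100 = 1)
```

Because the inter-slice CPDs are stationary, only one transition matrix is ever needed, which mirrors why pgmpy's DBN only asks for the first two time slices.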
almiradivasanya
@almiradivasanya
How do I use pgmpy for predicting information diffusion probabilities?
Mojtaba Ebadi
@mojtabaebadi
Hi,
I need to provide virtual evidence to some nodes and do inference.
How can I achieve this in pgmpy?
This issue suggests creating virtual nodes, but I couldn't find how to do that:
pgmpy/pgmpy#1013
Ankur Ankan
@ankurankan
@mojtabaebadi It is not possible to specify virtual or soft evidence in pgmpy yet.
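Conceptually, virtual (soft) evidence just multiplies the current belief over a variable by a likelihood vector and renormalizes; a minimal sketch in plain numpy (illustrative numbers, not pgmpy API):

```python
import numpy as np

# Soft evidence on a single discrete variable X with two states:
prior = np.array([0.3, 0.7])        # current belief P(X)
likelihood = np.array([0.9, 0.4])   # virtual-evidence likelihood L(e | X)

posterior = prior * likelihood      # pointwise product with the likelihood
posterior /= posterior.sum()        # renormalize to a distribution
```

The virtual-node trick discussed in the linked issue achieves the same effect inside a BN: add a dummy child of X whose CPD encodes the likelihood vector, then condition on that child being observed.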
Lisakestens
@Lisakestens
Hi, I am trying to use HillClimb to learn a structure from input data. How can I modify the equivalent sample size when using BDeu as a scoring function? The HillClimb.estimate function takes the scoring method as a string (e.g. 'bdeuscore'), so I don't see how to alter the scoring function's inputs in this case. Thanks a lot for your help!
Ankur Ankan
@ankurankan
@Lisakestens Hi, to specify a different equivalent sample size you will need to create an instance of BDeuScore and pass it to HillClimbSearch. Here's an example:
from pgmpy.utils import get_example_model
from pgmpy.sampling import BayesianModelSampling
from pgmpy.estimators import BDeuScore, HillClimbSearch

alarm = get_example_model('alarm')
s = BayesianModelSampling(alarm)
data = s.forward_sample(int(1e4))

score = BDeuScore(data=data, equivalent_sample_size=100) # The size can be specified here
est = HillClimbSearch(data)
dag = est.estimate(scoring_method=score)
Hari M Koduvely
@harik68
Hi, I have a BN with a large number of nodes (on the order of 250). Some nodes have on the order of 100 parents. When I specify a CPD using pgmpy.factors.discrete.CPD.TabularCPD, I need to give the values as a 2D array. However, with ~100 parents the array would have 2^100 columns, which causes memory issues. Is it possible to use SciPy sparse matrices (csc_matrix or csr_matrix) in TabularCPD? If not, is there any other way to deal with BNs of this size? Thank you.
Ankur Ankan
@ankurankan
@harik68 Hi, I don't think there's any way to work with networks of that size yet. Currently, you can work with at most 32 parents, as numpy doesn't support arrays with more than 32 dimensions.
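A back-of-the-envelope illustration of why a tabular CPD with ~100 parents is infeasible (the helper below is made up for illustration, not pgmpy API):

```python
def cpt_entries(n_parents, card=2):
    """Number of entries in a tabular CPD: card rows, card**n_parents columns."""
    return card * card ** n_parents

small = cpt_entries(3)    # 3 binary parents: 2 x 8 = 16 entries, trivially small
huge = cpt_entries(100)   # 100 binary parents: 2 * 2**100 entries, far beyond any RAM
```

This exponential blow-up is why tabular representations stop scaling long before 100 parents, independent of numpy's 32-dimension limit.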
Hari M Koduvely
@harik68
@ankurankan I see. Thank you for the prompt reply.
SachSpace
@SachSpace

Hi all, I'm new to the library and to PGMs. I have two concerns, described below, related to the BIC score and graph comparison.
I have a dataset of 1000 samples created using the graph (I->S), (I->G) (consider this the ground truth G*). Using that dataset, I computed scores for 2 graphs:
G1 : print('bic ', BicScore(student_df).score(BayesianModel([['I','G'], ['I','S']])))
G2 : print('bic ', BicScore(student_df).score(BayesianModel([['I','G'], ['I','S'],['S','G']])))

I received scores as :
G1 : bic -1879.6944727803666
G2 : bic -1891.6599324181361

My questions are:

  1. Why is the score negative? I saw negative scores in some other threads, but could not find an answer to my question. I thought the score should be positive, since the idea is to maximize the score of a graph. I was trying to decide which graph is better, G1 or G2, in terms of score; I expected G1 to have the higher score since it is G*. With only one edge, the score was a higher value like -907.5766555294274, which is why the negative values raised this question.
  2. Is there a better way to measure how well a graph matches the dataset, in addition to the score? I'd appreciate any suggestions.
    Thanks in advance!
Ankur Ankan
@ankurankan
@SachSpace I think the scores are as expected. The values are log-likelihoods (with some regularization), so they can be negative. Also, since the BIC score is somewhat biased towards simpler structures, you are seeing the behaviour that simpler models get higher scores (though the amount of data can also affect this). If you are looking to test your model structure, I would suggest independence testing: basically, check whether the implied conditional independencies of the network hold in the data. This approach gives you a better idea of exactly which part of your network isn't consistent with your data, and you can also use that information to modify your network.
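The sign of the score follows directly from the definition; a one-liner makes it concrete (illustrative numbers, single fair-coin variable):

```python
import math

# A log-likelihood is a sum of log-probabilities, each <= 0, so BIC-style
# scores (log-likelihood minus a positive penalty) are normally negative.
n = 1000
loglik = n * math.log(0.5)            # e.g. 1000 observations of a fair coin
bic = loglik - 0.5 * math.log(n) * 1  # one free parameter in the model
```

Adding edges (parameters) can only raise the log-likelihood term slightly while growing the penalty, which is why the denser G2 scored below G1 here.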
SachSpace
@SachSpace
@ankurankan Thanks a lot ! Now it is clear. Also thanks for the suggestion.
SachSpace
@SachSpace
Hi, I tried to check the I-equivalence of 3 simple graphs using is_iequivalent, but I'm getting an unexpected result, as highlighted in the figure. I was expecting G2 not to be an I-equivalent graph of G because of its different skeleton (here I'm getting True). I'm aware that the same skeleton and the same v-structures are required for I-equivalence, but this result is a little confusing to me. I would highly appreciate any help understanding it. Maybe something is wrong with my code? Thanks! https://drive.google.com/file/d/16f_IUfrpvsCVt-1PjfX23SDeg3psO7AV/view?usp=sharing
Ankur Ankan
@ankurankan
@SachSpace This is indeed a bug in the is_iequivalent method. It wasn't taking the node names into account when comparing the structure. I have fixed it now and if you install the latest dev branch from github, your code should work fine. Thanks for reporting the issue :)
Toby Drane
@TobyDrane
Can I know exactly what is causing the "Found unknown state name. Trying to switch to using all state names as state numbers" when I use .predict(...)?
Ankur Ankan
@ankurankan
@TobyDrane That could happen when there is a state name difference (different name or a new state which wasn't in training data) between the trained model and the data you are trying to predict. If that's not the case, would it be possible to share a minimal reproducible code, so I can check what's going on?
Toby Drane
@TobyDrane
What is exactly a state name? Is that column names or row values?
Ankur Ankan
@ankurankan
It's the row values. Essentially the different values that variables can take.
Toby Drane
@TobyDrane
So say I have a binary field (0, 1): I train the model, but the training dataset never sees a 1, while the test set contains a 1; it will then throw this error?
Ankur Ankan
@ankurankan
Yes. Although in such cases, you can also pass a state_names argument to the fit method with all the states/values you expect a variable to have.
Toby Drane
@TobyDrane
Right, it makes sense now (conditional probability and all), and you beat me to my next question, which was what one can do to avoid such issues.
How does state_names work theoretically, however? When you fit with state names, does it fit only one such configuration of all the names?
Ankur Ankan
@ankurankan
I am not sure I'm understanding your question correctly. If you mean how the estimator deals with missing states: since the parameters depend only on the frequency of occurrence, it makes sure that all the state_names are present in the data, and marks the frequency as 0 for any missing states.
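The counting behaviour described above can be shown in a few lines (hypothetical column and state names, not pgmpy internals):

```python
from collections import Counter

# A column whose declared states include one ('c') that never occurs:
observed = ["a", "a", "b"]
state_names = ["a", "b", "c"]

counts = Counter(observed)
freqs = {s: counts.get(s, 0) for s in state_names}  # missing state gets 0
```

A plain maximum-likelihood estimate would then assign probability 0 to state 'c', which is why prior-based scores and smoothing are useful when some states are absent from the training data.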
Toby Drane
@TobyDrane
Morning, is there a good way to see which state name is throwing the error? I don't want to just spit out all the unique values for every single feature I have and pass that as the dict; but alas, I don't know which exact variable is throwing the error.
Ankur Ankan
@ankurankan
No, that's not possible in the current implementation.
Meteore
@iameteore314
Hi @ankurankan, I've been modeling BNs with old school softwares. They seem to only export BNs in deprecated formats like .dsc, .dsl, .net, that aren't supported by pgmpy. Are you aware of any converter tool to quickly convert these old formats into some supported by pgmpy (.bif, .uai, .xmlbif)? Thanks a lot for your attention, that would save me a lot of time!
Ankur Ankan
@ankurankan
Hi @iameteore314, the R package bnlearn supports dsc, net, and bif, so you can use it to convert to bif.
Meteore
@iameteore314
The export takes quite some time, but it works perfectly. Thanks a lot!
Yunjiang Wu
@JarvisIsFriday
@ankurankan Hello, I'm trying to learn the parameters of a DBN, so I followed your DBN fit function example, but I got a problem:
AttributeError: 'DynamicBayesianNetwork' object has no attribute 'fit'

Does this mean the fit function doesn't work yet, even though the DBN class has the source code for a fit function?
Thanks!

Ankur Ankan
@ankurankan
@JarvisIsFriday Sorry for the late reply. It's probably because you installed pgmpy using pip. The fit function was added recently and wasn't included in the previous release. If you install the dev branch from GitHub, it should work fine: https://pgmpy.org/started/install.html
javadbahman
@javadbahman
Hi @ankurankan, this is the network I need to simulate with discrete random variables. Is it possible with pgmpy?

Ankur Ankan
@ankurankan
@javadbahman Yes, I think it should be possible. Though you will need to tweak the model a bit as pgmpy considers the model to have the same variables and structure in each time slice. So, in your case you can just add the Z, E_C, N_f, and E_S variables in the 0th time slice as well. And since they are all endogenous variables, doing the modification should result in an equivalent model.