Where to chat about modeling with Bayesian Networks and integrating with agrum/pyAgrum (https://www.agrum.org)
Hi @Mikailo , sorry but I did not find time to check your last example. You may have found a bug but I am quite surprised since, once you set the value of the decision node, this ID mainly is a bayesian network. I will try to check as soon as possible.
(and moreover there are plenty of inference in IDs that are validated during the tests : maybe your ID will be a new test case :-) )
Hi again @Mikailo ,
with this code (in a single-cell-notebook) :
%load_ext autoreload
%autoreload 2
import os
from pylab import *
import matplotlib.backends._backend_agg
from IPython.display import display,HTML
import math
import pyAgrum as gum
import pyAgrum.lib.notebook as gnb
import matplotlib.pyplot as plt
import math
%matplotlib inline
def variation(model,ev,re=None):
nbr=model.variable(ev).domainSize()
l = []
for i in range(101):
ie=gum.ShaferShenoyLIMIDInference(model)
ie.addEvidence(ev, [1-i/100] + [i/100]*(nbr-1))
if re is not None:
ie.addEvidence("re", re)
ie.makeInference()
l.append(ie.MEU())
med = [x['mean'] for x in l]
mi = [x['mean'] - math.sqrt(x['variance']) for x in l]
ma = [x['mean'] + math.sqrt(x['variance']) for x in l]
fig = figure(figsize =(3, 2))
ax = fig.add_subplot(1, 1, 1)
ax.fill_between(range(101),mi,ma, alpha=0.2)
ax.plot(med, 'g')
if re is None:
ax.set_title(f"Evidence on {ev}")
else:
ax.set_title(f"Evidence on {ev} with {re=}")
return fig
diag=gum.loadID("res/Normal - 1.bifxml")
vars=['te1','mat1','con4','imp1','obj1']
gnb.flow.clear()
gnb.flow.add_html(gnb.getInfluenceDiagram(diag,size="5!"))
gnb.flow.new_line()
for v in vars:
gnb.flow.add_plot(variation(diag,v))
gnb.flow.new_line()
for v in vars:
gnb.flow.add_plot(variation(diag,v,re=0))
gnb.flow.new_line()
for v in vars:
gnb.flow.add_plot(variation(diag,v,re=1))
gnb.flow.display()
I produce this figure :
Hi Benjamin,
1) You mean "keep all the particles in a csv-like file "? No, there is no way to do so. You can generate a csv (forward sampling without evidence) from a BN with gum.generateCSV()
.
It could be a nice feature to add, though (but not an easy one : do we keep only the particles, or also the weight ? Gibbs (especially) would create very large files, etc.)
2) Gibbs is just slow for large BNs and in this notebook, we do not let it converge (always quickly stopped by timeout). But you're right, this notebook is not very gentle with Gibbs. The only Gibbs's advantage we can see here is that its behavior does not depend on the position of the evidence (unlike the others sampling algorithms).
In Approximate inference, Gibbs is better valued :-).
What is the upper limit for the number of variables and arcs, assuming binary variables, for pyAgrum? I have seen some talks showing off very complex networks with large number of variables 2,127 variables:
https://r13-cbts-sgl.engr.tamu.edu/wp-content/uploads/2021/08/CBTS-SGL-Webinar-Cool-Things-That-One-Can-Do-With-Graphical-Probabilistic-Models-Dr.-Marek-Drudzel.pdf
https://youtu.be/9_l9dpvezOc?t=417
I know of gobnilp which has a paper titled Learning Bayesian Networks with Thousands of Variables (https://people.idsia.ch/~zaffalon/papers/2015nips-jena.pdf)
I think there is also BayeSuite which advertises massive network learning. The paper also has a nice table of alternative Bayesian network software. Maybe their list is a nice place to start to make some comparisons ;) (https://www.sciencedirect.com/science/article/pii/S0925231220318609)
Hello @bdatko_gitlab , the answer is not unique depending on what you are looking for.
1- the short answer : the limit is your memory
2- less short answer, but still short : the number of nodes/arcs are not necessarily the most relevant parameters for understanding the size of the BN. The max number of parents for a node is certainly more relevant because the memory complexity of a Bayes net is mainly the size of the CPTs, dominated by the size of the biggest one : 2^(nbrMaxParent+1)
(for binary variables) ... For instance, a BN as a chain (which models a Markov Chain actually) can be veryyyy long :-)
3- The model can fit in the memory, but inference may not. The good parameter here is the treewidth of the graph (more or less the size of the biggest clique in the junction tree). In that case, you still have access to approximated inference (sampling, loopy belief propagation)
If you have enough memory to fit your model and your inference, the issue is now the time. As you may know, aGrUM/pyAgrum try hard to parallelize its algorithms in order to speed up long process ... Still time can be prohibitive in very large models.
4- Particurlarly true for learning : the process for learning a large model has an increasing quadratic (at least) complexity w.r.t the number of nodes. There is specific algorithms dedicated to learn such large models (based on fast approximation of Markov Blanket for each node for instance), but there are not implemented in aGrUM (but could be implemented quite easily). The size of the database has of course an impact for the time (and the memory needed).
Concretely, for pyAgrum, we test our algorithms with graph sizes up to 900 nodes/1250 arcs (some of them do not allow exact inference). We have seen bioinformatics teams which learn BNs of much more than 1000 nodes.
Thanks for the links, there is really a big job to do indeed in term of comparisons & benchmarks ... No resource for that for now, unfortunately :-(
Thanks for the links, there is really a big job to do indeed in term of comparisons & benchmarks ... No resource for that for now, unfortunately :-(
I understand. I think pyAgrum deserves a lot more attention compared to other libraries, I am not sure why pyAgrum doesn't have more popularity. =)
Has the team ever considered mirroring the repo to GitHub? I think it would make pyAgrum more discoverable for SEO