I'm currently in a book club for an interesting book that goes over Causal Inference & how it combines with Machine Learning: https://www.manning.com/books/causal-ai
Causal Inference is a weird little branch of math! It grew out of at least three distinct traditions realizing they were "blind sages touching different parts of the same elephant". The main streams were Statistics, Economics (specifically the Econometrics subdiscipline), and Computer Science (specifically the subdiscipline that was about the old meaning of "AI" - think stuff that overlaps with Applied Math, like Operations Research algos).
The Causal Inference code ecosystem is interesting, and has clearly followed a somewhat different "evolutionary track" than the standard Python data stack.
One of the canonical packages is pgmpy, which is pretty neat! The interface for some of its main tools didn't quite agree with my constitution, though. Here's an example from the official docs:
from pgmpy.factors.discrete.CPD import TabularCPD

cpd = TabularCPD(
    variable="grade",
    variable_card=3,
    values=[
        [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
        [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
        [0.8, 0.8, 0.8, 0.8, 0.8, 0.8],
    ],
    evidence=["diff", "intel"],
    evidence_card=[2, 3],
    state_names={
        "diff": ["easy", "hard"],
        "intel": ["low", "mid", "high"],
        "grade": ["A", "B", "C"],
    },
)
Now, if I'm just copying & pasting examples from a book, it's not such a big deal - but I knew that if I was ever going to actually use this for anything, I'd want a different interface. I can see several mistakes I'd easily make - the most important being how the state combinations in state_names map onto the probabilities in values. Each row holds the probabilities for one of the three Output Variable states, and each column corresponds to one combination of the Input Values - and the later variables change faster, which I bet you didn't guess! ie, column 0 is ("easy", "low"), and column 1 is ("easy", "mid").
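To make that column order concrete, here's a small sketch that only uses the standard library. It assumes the ordering described above (last evidence variable changing fastest), which happens to be the order itertools.product generates:

```python
from itertools import product

# Evidence state names copied from the docs example above
evidence_states = {
    "diff": ["easy", "hard"],
    "intel": ["low", "mid", "high"],
}

# product() varies its last argument fastest, so "intel" cycles
# through all of its states before "diff" moves on
column_labels = list(product(*evidence_states.values()))

for index, combo in enumerate(column_labels):
    print(index, combo)
```

So the sixth and final column (index 5) is ("hard", "high"), and the jump from "easy" to "hard" happens at column 3.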
I had an idea of what I was looking for, but the Coding Hours of the day were waning and I knew I wouldn't have much time before Book Club the next day - plus it seemed kinda tedious!
So I asked Claude:
Working through a book on Causal Inference and it uses this package called pgmpy. I hate hate hate its interface. For instance, this is part of how we set up a Monty Hall problem:
You're supposed to remember the order of the state names & the probabilities, instead of these things being bundled together. Another example from the official docs:
Write me a simple helper function that inputs a sane dictionary and outputs a TabularCPD object
It gave me some code I didn't...love.
from itertools import product as cartprod

from pgmpy.factors.discrete import TabularCPD


def cpd(
    variable: str,
    table: dict,
    parents: list[str] | None = None,
) -> TabularCPD:
    """
    Build a TabularCPD from a readable dictionary.

    Prior:
        cpd("Car Door", {"1st": 1/3, "2nd": 1/3, "3rd": 1/3})

    Conditional (parents listed in same order as the tuple keys):
        cpd("grade", {
            ("easy", "low"): {"A": 0.1, "B": 0.1, "C": 0.8},
            ("easy", "mid"): {"A": 0.1, "B": 0.1, "C": 0.8},
            ("easy", "high"): {"A": 0.1, "B": 0.1, "C": 0.8},
            ("hard", "low"): {"A": 0.1, "B": 0.1, "C": 0.8},
            ("hard", "mid"): {"A": 0.1, "B": 0.1, "C": 0.8},
            ("hard", "high"): {"A": 0.1, "B": 0.1, "C": 0.8},
        }, parents=["diff", "intel"])
    """
    first_key = next(iter(table))

    # --- Prior (no parents) ---
    if not isinstance(first_key, tuple):
        states = list(table.keys())
        values = [[p] for p in table.values()]
        return TabularCPD(
            variable=variable,
            variable_card=len(states),
            values=values,
            state_names={variable: states},
        )

    # --- Conditional ---
    if parents is None:
        raise ValueError(
            "Pass parents=['name1', 'name2', ...] for conditional CPDs"
        )

    # Discover parent states in order of first appearance
    n_parents = len(first_key)
    parent_states: list[list[str]] = [
        [] for _ in range(n_parents)
    ]
    for combo in table:
        for i, s in enumerate(combo):
            if s not in parent_states[i]:
                parent_states[i].append(s)

    child_states = list(next(iter(table.values())).keys())

    # Columns in pgmpy order: last parent changes fastest
    combos = list(cartprod(*parent_states))
    values = [
        [table[combo][cs] for combo in combos]
        for cs in child_states
    ]

    state_names = {variable: child_states}
    for name, states in zip(parents, parent_states):
        state_names[name] = states

    return TabularCPD(
        variable=variable,
        variable_card=len(child_states),
        values=values,
        evidence=parents,
        evidence_card=[len(s) for s in parent_states],
        state_names=state_names,
    )
One part of my LLM-assisted dev process is to try to only use code that I could see myself modifying. So I asked:
functional style, comprehensions wherever possible, use the toolz library for dict & list operations (along with itertools)
And I got back something I liked a lot better:
from itertools import product as cartprod

from pgmpy.factors.discrete import TabularCPD
from toolz import first, merge, valmap
from toolz.curried import get


def cpd(
    variable: str,
    table: dict,
    parents: list[str] | None = None,
) -> TabularCPD:
    """
    Build a TabularCPD from a readable dictionary.

    Prior:
        cpd("Car Door", {"1st": 1/3, "2nd": 1/3, "3rd": 1/3})

    Conditional:
        cpd("grade", {
            ("easy", "low"): {"A": 0.1, "B": 0.1, "C": 0.8},
            ("easy", "mid"): {"A": 0.1, "B": 0.1, "C": 0.8},
            ("hard", "low"): {"A": 0.1, "B": 0.1, "C": 0.8},
            ("hard", "mid"): {"A": 0.1, "B": 0.1, "C": 0.8},
        }, parents=["diff", "intel"])
    """
    if not isinstance(first(table), tuple):
        return TabularCPD(
            variable=variable,
            variable_card=len(table),
            values=[[p] for p in table.values()],
            state_names={variable: list(table.keys())},
        )

    if parents is None:
        raise ValueError(
            "Pass parents=['name1', 'name2', ...] for conditional CPDs"
        )

    n_parents = len(first(table))
    parent_states = [
        list(dict.fromkeys(combo[i] for combo in table))
        for i in range(n_parents)
    ]
    child_states = list(first(table.values()).keys())
    combos = list(cartprod(*parent_states))
    values = [
        [table[combo][cs] for combo in combos]
        for cs in child_states
    ]

    return TabularCPD(
        variable=variable,
        variable_card=len(child_states),
        values=values,
        evidence=parents,
        evidence_card=[len(s) for s in parent_states],
        state_names=merge(
            {variable: child_states},
            dict(zip(parents, parent_states)),
        ),
    )
along with this nice explanation!
Cleaned up with toolz:
- first(table) / first(table.values()) instead of the next(iter(...)) noise
- merge(dict1, dict2) to combine state_names instead of mutating a dict
- dict.fromkeys(...) for the order-preserving unique-states extraction; this replaces the imperative "if s not in seen" loop with a one-liner comprehension
I left cartprod and the list comprehensions for values / parent_states as-is since they're already declarative and toolz doesn't have anything that would make them cleaner. The get import is there if you want to use it downstream but I didn't force it in where it wouldn't help.
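The dict.fromkeys trick is the least obvious of those changes, so here's a standalone sketch comparing it with the loop it replaced. The function names unique_loop and unique_fromkeys are just labels for this illustration; the behavior relies on dicts preserving insertion order (guaranteed since Python 3.7):

```python
# Tuple keys in the same shape as the table passed to cpd()
combos = [
    ("easy", "low"), ("easy", "mid"), ("easy", "high"),
    ("hard", "low"), ("hard", "mid"), ("hard", "high"),
]


def unique_loop(items):
    """Imperative version: keep each item the first time it shows up."""
    seen = []
    for item in items:
        if item not in seen:
            seen.append(item)
    return seen


def unique_fromkeys(items):
    """Same result: dict keys are unique and keep insertion order."""
    return list(dict.fromkeys(items))


first_parent = [combo[0] for combo in combos]
second_parent = [combo[1] for combo in combos]

print(unique_fromkeys(first_parent))
print(unique_fromkeys(second_parent))
```

Both versions return the states in order of first appearance, which is what ends up determining the order of the state names handed to pgmpy.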
I then interactively checked that the object built by the helper is equal to the one built the "canonical" way from the docs, and it passed!
cpd("grade", {
    ("easy", "low"): {"A": 0.1, "B": 0.1, "C": 0.8},
    ("easy", "mid"): {"A": 0.1, "B": 0.1, "C": 0.8},
    ("easy", "high"): {"A": 0.1, "B": 0.1, "C": 0.8},
    ("hard", "low"): {"A": 0.1, "B": 0.1, "C": 0.8},
    ("hard", "mid"): {"A": 0.1, "B": 0.1, "C": 0.8},
    ("hard", "high"): {"A": 0.1, "B": 0.1, "C": 0.8},
}, parents=["diff", "intel"]) == TabularCPD(
    variable="grade",
    variable_card=3,
    values=[
        [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
        [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
        [0.8, 0.8, 0.8, 0.8, 0.8, 0.8],
    ],
    evidence=["diff", "intel"],
    evidence_card=[2, 3],
    state_names={
        "diff": ["easy", "hard"],
        "intel": ["low", "mid", "high"],
        "grade": ["A", "B", "C"],
    },
)
True
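One caveat about that check: every column in the docs example is identical, so it would pass even if the columns came out in the wrong order. A stricter version needs a table where each column is different. The sketch below uses made-up probabilities and skips pgmpy entirely; it just reruns the values comprehension from cpd() and compares it against a matrix written by hand in the layout described earlier:

```python
from itertools import product

# Made-up numbers, chosen only so that no two columns match
table = {
    ("easy", "low"): {"A": 0.3, "B": 0.4, "C": 0.3},
    ("easy", "mid"): {"A": 0.5, "B": 0.3, "C": 0.2},
    ("easy", "high"): {"A": 0.8, "B": 0.1, "C": 0.1},
    ("hard", "low"): {"A": 0.1, "B": 0.2, "C": 0.7},
    ("hard", "mid"): {"A": 0.2, "B": 0.4, "C": 0.4},
    ("hard", "high"): {"A": 0.6, "B": 0.3, "C": 0.1},
}
parent_states = [["easy", "hard"], ["low", "mid", "high"]]
child_states = ["A", "B", "C"]

# Same comprehension as in cpd(): one row per child state,
# one column per parent combination (last parent fastest)
combos = list(product(*parent_states))
values = [[table[combo][cs] for combo in combos] for cs in child_states]

# Hand-written matrix in the row/column layout from the docs example
expected = [
    [0.3, 0.5, 0.8, 0.1, 0.2, 0.6],  # A
    [0.4, 0.3, 0.1, 0.2, 0.4, 0.3],  # B
    [0.3, 0.2, 0.1, 0.7, 0.4, 0.1],  # C
]
print(values == expected)

# A deliberately wrong ordering (first parent fastest) no longer matches
wrong_combos = [(d, i) for i in parent_states[1] for d in parent_states[0]]
wrong = [[table[combo][cs] for combo in wrong_combos] for cs in child_states]
print(wrong == expected)
```

Passing the same distinct-column table through cpd() and a hand-built TabularCPD would be the natural next step for anyone with pgmpy installed.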
I then passed Claude the official docs to see if it spotted any non-obvious ways that it might not be equivalent:
here's the official docs. Double check that this works https://pgmpy.org/examples/How%20to%20define%20CPDs.html
It didn't come up with anything. That's enough confidence for a tool that I'll be using interactively, though nowhere near enough if this were supposed to go into production.
Anyway, already much handier! But I think I could be even lazier - that dictionary sure looks annoying to write by hand. So I asked Claude:
give me another little helper function that, given a dictionary with a list of the states and the hierarchy, will give me a blank dictionary (so entries/tuples with keys being None) that I can fill in by hand to pass to the new function. ie, I would pass it the state_names var from the Package's object
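For a sense of what "closer to production" might involve, here's a rough sketch of a pre-flight check on the input dictionary. check_table is a hypothetical helper, not part of pgmpy or of the code above, and it only looks at three things that cpd() does not check explicitly: missing parent combinations, inconsistent child states, and columns that don't sum to 1.

```python
from itertools import product
from math import isclose


def check_table(table: dict) -> list[str]:
    """Return a list of problems found in a conditional table (hypothetical helper)."""
    problems = []
    n_parents = len(next(iter(table)))
    # Parent states in order of first appearance, as in cpd()
    parent_states = [
        list(dict.fromkeys(combo[i] for combo in table))
        for i in range(n_parents)
    ]
    # Every combination of the observed parent states needs an entry
    for combo in product(*parent_states):
        if combo not in table:
            problems.append(f"missing combination: {combo}")
    # Every entry should cover the same child states and sum to 1
    child_states = set(next(iter(table.values())))
    for combo, dist in table.items():
        if set(dist) != child_states:
            problems.append(f"child states differ for {combo}")
        if not isclose(sum(dist.values()), 1.0):
            problems.append(f"probabilities for {combo} do not sum to 1")
    return problems


good = {
    ("easy", "low"): {"A": 0.1, "B": 0.1, "C": 0.8},
    ("hard", "low"): {"A": 0.1, "B": 0.1, "C": 0.8},
}
bad = {
    ("easy", "low"): {"A": 0.1, "B": 0.1, "C": 0.8},
    ("easy", "mid"): {"A": 0.1, "B": 0.1, "C": 0.7},  # sums to 0.9
    ("hard", "low"): {"A": 0.1, "B": 0.1, "C": 0.8},
    # ("hard", "mid") is missing
}
print(check_table(good))
print(check_table(bad))
```

Returning a list of messages rather than raising on the first problem is a design choice for interactive use: all the typos in a hand-written table show up at once.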
So your workflow becomes:
Copy-paste that output, replace the Nones with probabilities, and pass it to cpd(). The parents list is just list(parent_states.keys()).
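For a sense of what such a helper might look like, here's a minimal sketch (not necessarily what Claude produced). blank_table is a hypothetical name, and the sketch assumes the state_names dict lists the parents in the same order you'd pass them as evidence:

```python
from itertools import product
from pprint import pprint


def blank_table(variable: str, state_names: dict[str, list[str]]) -> dict:
    """Build a fill-in-the-blanks table for cpd() (hypothetical helper)."""
    child_states = state_names[variable]
    # Everything except the variable itself is treated as a parent
    parent_states = {
        name: states
        for name, states in state_names.items()
        if name != variable
    }
    # No parents: a flat prior with one blank per state
    if not parent_states:
        return {state: None for state in child_states}
    # Otherwise one entry per parent combination, last parent fastest
    return {
        combo: {state: None for state in child_states}
        for combo in product(*parent_states.values())
    }


template = blank_table(
    "grade",
    {
        "diff": ["easy", "hard"],
        "intel": ["low", "mid", "high"],
        "grade": ["A", "B", "C"],
    },
)
# sort_dicts=False keeps the insertion order when printing
pprint(template, sort_dicts=False)
```

Each combination gets its own fresh inner dict, so filling in one entry doesn't accidentally change the others.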
Much nicer!
(credit to the great Simon Willison for giving some cool examples of "blog posts about using an LLM to do something"! https://simonwillison.net/)