I'm currently in a book club for an interesting book that covers Causal Inference & how it combines with Machine Learning: https://www.manning.com/books/causal-ai

Causal Inference is a weird little branch of math! It's the result of at least three distinct traditions realizing they were all "blind sages touching different parts of the same elephant". The main streams were Statistics, Economics (specifically the Econometrics subdiscipline), and Computer Science (specifically the subdiscipline concerned with the old meaning of "AI" - think stuff that overlaps with Applied Math, like Operations Research algos).

The Causal Inference code ecosystem is interesting, and clearly has followed a somewhat different "evolutionary track" than the standard Python data stack.

One of the canonical packages is pgmpy, which is pretty neat! The interface for some of its main tools didn't quite agree with my constitution, though. Here's an example from the official docs:

from pgmpy.factors.discrete.CPD import TabularCPD

cpd = TabularCPD(
     variable="grade",
     variable_card=3,
     values=[
         [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
         [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
         [0.8, 0.8, 0.8, 0.8, 0.8, 0.8],
     ],
     evidence=["diff", "intel"],
     evidence_card=[2, 3],
     state_names={
         "diff": ["easy", "hard"],
         "intel": ["low", "mid", "high"],
         "grade": ["A", "B", "C"],
     },
)

Now, if I'm just copying & pasting examples from a book, it's not such a big deal - but I knew that if I was ever going to actually use this for anything, I'd want a different interface. I can see several mistakes I'd easily make, the most important being how the state combinations in state_names map onto the probabilities. Each row of values corresponds to one of the three Output Variable states, and each column is the distribution over those states given one combination of the Input Values - and the later variables change faster, which I bet you didn't guess! I.e., column 0 is ("easy", "low"), and column 1 is ("easy", "mid").
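To make that ordering concrete, here's a small sketch in plain Python (no pgmpy needed). It assumes, as described above, that the columns are enumerated the way itertools.product walks the evidence states in the order given by evidence, so the last variable changes fastest. column_labels is just a name made up for illustration.

```python
from itertools import product

# State names copied from the docs example above.
state_names = {
    "diff": ["easy", "hard"],
    "intel": ["low", "mid", "high"],
}
evidence = ["diff", "intel"]

# product() varies its last argument fastest, which matches the column
# order described above: ("easy", "low"), ("easy", "mid"), ...
column_labels = list(product(*(state_names[name] for name in evidence)))

for index, combo in enumerate(column_labels):
    print(index, combo)
```

Reading the printout next to the values matrix is a quick way to see which column belongs to which combination.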

I had an idea of what I was looking for, but the Coding Hours of the day were waning and I knew I wouldn't have much time before Book Club the next day - plus it seemed kinda tedious!

So I asked Claude:

Working through a book on Causal Inference and it uses this package called pgmpy. I hate hate hate its interface. For instance, this is part of how we set up a Monty Haul problem:

You're supposed to remember the order of the state names & the probabilities, instead of these things being bundled together. Another example from the official docs:

Write me a simple helper function that inputs a sane dictionary and outputs a TabularCPD object

It gave me some code I didn't...love.

from itertools import product as cartprod
from pgmpy.factors.discrete import TabularCPD


def cpd(
    variable: str,
    table: dict,
    parents: list[str] | None = None,
) -> TabularCPD:
    """
    Build a TabularCPD from a readable dictionary.

    Prior:
        cpd("Car Door", {"1st": 1/3, "2nd": 1/3, "3rd": 1/3})

    Conditional (parents listed in same order as the tuple keys):
        cpd("grade", {
            ("easy", "low"):  {"A": 0.1, "B": 0.1, "C": 0.8},
            ("easy", "mid"):  {"A": 0.1, "B": 0.1, "C": 0.8},
            ("easy", "high"): {"A": 0.1, "B": 0.1, "C": 0.8},
            ("hard", "low"):  {"A": 0.1, "B": 0.1, "C": 0.8},
            ("hard", "mid"):  {"A": 0.1, "B": 0.1, "C": 0.8},
            ("hard", "high"): {"A": 0.1, "B": 0.1, "C": 0.8},
        }, parents=["diff", "intel"])
    """
    first_key = next(iter(table))

    # --- Prior (no parents) ---
    if not isinstance(first_key, tuple):
        states = list(table.keys())
        values = [[p] for p in table.values()]
        return TabularCPD(
            variable=variable,
            variable_card=len(states),
            values=values,
            state_names={variable: states},
        )

    # --- Conditional ---
    if parents is None:
        raise ValueError(
            "Pass parents=['name1', 'name2', ...] for conditional CPDs"
        )

    # Discover parent states in order of first appearance
    n_parents = len(first_key)
    parent_states: list[list[str]] = [
        [] for _ in range(n_parents)
    ]
    for combo in table:
        for i, s in enumerate(combo):
            if s not in parent_states[i]:
                parent_states[i].append(s)
    child_states = list(next(iter(table.values())).keys())

    # Columns in pgmpy order: last parent changes fastest
    combos = list(cartprod(*parent_states))
    values = [
        [table[combo][cs] for combo in combos]
        for cs in child_states
    ]

    state_names = {variable: child_states}
    for name, states in zip(parents, parent_states):
        state_names[name] = states
    return TabularCPD(
        variable=variable,
        variable_card=len(child_states),
        values=values,
        evidence=parents,
        evidence_card=[len(s) for s in parent_states],
        state_names=state_names,
    )

One part of my LLM-assisted dev process is to try to only use code that I could see myself modifying. So I asked:

functional style, comprehensions wherever possible, use the toolz library for dict & list operations (along with itertools)

And I got back something I liked a lot better:

from itertools import product as cartprod

from pgmpy.factors.discrete import TabularCPD
from toolz import first, merge, valmap
from toolz.curried import get


def cpd(
    variable: str,
    table: dict,
    parents: list[str] | None = None,
) -> TabularCPD:
    """
    Build a TabularCPD from a readable dictionary.

    Prior:
        cpd("Car Door", {"1st": 1/3, "2nd": 1/3, "3rd": 1/3})

    Conditional:
        cpd("grade", {
            ("easy", "low"):  {"A": 0.1, "B": 0.1, "C": 0.8},
            ("easy", "mid"):  {"A": 0.1, "B": 0.1, "C": 0.8},
            ("hard", "low"):  {"A": 0.1, "B": 0.1, "C": 0.8},
            ("hard", "mid"):  {"A": 0.1, "B": 0.1, "C": 0.8},
        }, parents=["diff", "intel"])
    """
    if not isinstance(first(table), tuple):
        return TabularCPD(
            variable=variable,
            variable_card=len(table),
            values=[[p] for p in table.values()],
            state_names={variable: list(table.keys())},
        )

    if parents is None:
        raise ValueError(
            "Pass parents=['name1', 'name2', ...] for conditional CPDs"
        )

    n_parents = len(first(table))

    parent_states = [
        list(dict.fromkeys(combo[i] for combo in table))
        for i in range(n_parents)
    ]

    child_states = list(first(table.values()).keys())
    combos = list(cartprod(*parent_states))

    values = [
        [table[combo][cs] for combo in combos]
        for cs in child_states
    ]

    return TabularCPD(
        variable=variable,
        variable_card=len(child_states),
        values=values,
        evidence=parents,
        evidence_card=[len(s) for s in parent_states],
        state_names=merge(
            {variable: child_states},
            dict(zip(parents, parent_states)),
        ),
    )

along with this nice explanation!

Cleaned up with toolz:
- first(table) / first(table.values()) instead of the next(iter(...)) noise
- merge(dict1, dict2) to combine state_names instead of mutating a dict
- dict.fromkeys(...) for the order-preserving unique-states extraction: this replaces the imperative if s not in seen loop with a one-liner comprehension

I left cartprod and the list comprehensions for values / parent_states as-is since they're already declarative and toolz doesn't have anything that would make them cleaner. The get import is there if you want to use it downstream but I didn't force it in where it wouldn't help.
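In case the dict.fromkeys trick is unfamiliar, here's a tiny standalone sketch of what that parent_states comprehension is doing. The tuple keys below are made up to mirror the docs example; the point is that dicts preserve insertion order (guaranteed since Python 3.7), so building a dict from the values and listing its keys removes duplicates without reordering anything.

```python
# Hypothetical tuple keys, shaped like the conditional table passed to cpd().
combos = [
    ("easy", "low"), ("easy", "mid"), ("easy", "high"),
    ("hard", "low"), ("hard", "mid"), ("hard", "high"),
]

# One list of unique states per parent position, in order of first appearance.
parent_states = [
    list(dict.fromkeys(combo[i] for combo in combos))
    for i in range(len(combos[0]))
]

print(parent_states)
```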

I then interactively checked that the object built by the helper is equal to the one built the "canonical" way, and it passed!

cpd("grade", {
    ("easy", "low"):  {"A": 0.1, "B": 0.1, "C": 0.8},
    ("easy", "mid"):  {"A": 0.1, "B": 0.1, "C": 0.8},
    ("easy", "high"): {"A": 0.1, "B": 0.1, "C": 0.8},
    ("hard", "low"):  {"A": 0.1, "B": 0.1, "C": 0.8},
    ("hard", "mid"):  {"A": 0.1, "B": 0.1, "C": 0.8},
    ("hard", "high"): {"A": 0.1, "B": 0.1, "C": 0.8},
}, parents=["diff", "intel"]) == TabularCPD(
     variable="grade",
     variable_card=3,
     values=[
         [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
         [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
         [0.8, 0.8, 0.8, 0.8, 0.8, 0.8],
     ],
     evidence=["diff", "intel"],
     evidence_card=[2, 3],
     state_names={
         "diff": ["easy", "hard"],
         "intel": ["low", "mid", "high"],
         "grade": ["A", "B", "C"],
     },
)
True

I then passed Claude the official docs to see if it spotted any non-obvious ways that it might not be equivalent:

here's the official docs. Double check that this works https://pgmpy.org/examples/How%20to%20define%20CPDs.html

It didn't come up with anything. That makes me confident enough for a tool that I'll be using interactively, though not even close to confident enough if this were supposed to go into production.
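One caveat on the equality check above: every column in that example is the same (0.1, 0.1, 0.8), so a helper that got the column order wrong would still have passed. Here's a sketch of a stricter check in plain Python. build_values is a hypothetical stand-in that repeats only the values-matrix logic from cpd() (no pgmpy involved), and the probabilities are made-up numbers chosen so that every column is different.

```python
from itertools import product


def build_values(table: dict) -> list[list[float]]:
    """Hypothetical stand-in for the values-matrix step inside cpd()."""
    n_parents = len(next(iter(table)))
    parent_states = [
        list(dict.fromkeys(combo[i] for combo in table))
        for i in range(n_parents)
    ]
    child_states = list(next(iter(table.values())))
    combos = list(product(*parent_states))
    # One row per child state, one column per parent-state combination.
    return [[table[combo][cs] for combo in combos] for cs in child_states]


# Made-up probabilities; every column is distinct.
table = {
    ("easy", "low"):  {"A": 0.1, "B": 0.2, "C": 0.7},
    ("easy", "mid"):  {"A": 0.2, "B": 0.3, "C": 0.5},
    ("easy", "high"): {"A": 0.6, "B": 0.3, "C": 0.1},
    ("hard", "low"):  {"A": 0.0, "B": 0.1, "C": 0.9},
    ("hard", "mid"):  {"A": 0.1, "B": 0.4, "C": 0.5},
    ("hard", "high"): {"A": 0.3, "B": 0.5, "C": 0.2},
}

# The same numbers written by hand in pgmpy's layout (last parent fastest).
expected = [
    [0.1, 0.2, 0.6, 0.0, 0.1, 0.3],  # A
    [0.2, 0.3, 0.3, 0.1, 0.4, 0.5],  # B
    [0.7, 0.5, 0.1, 0.9, 0.5, 0.2],  # C
]

assert build_values(table) == expected
```

Running the same kind of distinct-column table through the real cpd() and comparing it against a hand-written TabularCPD would be the natural next step.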

Anyway, already much handier! But I think I could be even lazier - that dictionary sure looks annoying to write by hand. So I asked Claude:

give me another little helper function that, given a dictionary with a list of the states and the hierarchy, will give me a blank dictionary (so entries/tuples with keys being None) that I can fill in by hand to pass to the new function. ie, I would pass it the state_names var from the Package's object


So your workflow becomes:

Copy-paste that output, replace the Nones with probabilities, and pass it to cpd(). The parents list is just list(parent_states.keys()).
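A minimal sketch of what such a helper could look like. This is not the code Claude returned: blank_table, its signature, and the choice to pass the child states separately from a parent_states dict are all assumptions for illustration, picked to line up with the parent_states.keys() remark above.

```python
from itertools import product
from pprint import pprint


def blank_table(
    child_states: list[str],
    parent_states: dict[str, list[str]],
) -> dict:
    """Hypothetical helper: a fill-in-the-blanks table for cpd()."""
    if not parent_states:
        # Prior: one blank per child state.
        return {state: None for state in child_states}
    # Conditional: one blank distribution per parent-state combination,
    # enumerated with the last parent changing fastest.
    return {
        combo: {state: None for state in child_states}
        for combo in product(*parent_states.values())
    }


parent_states = {"diff": ["easy", "hard"], "intel": ["low", "mid", "high"]}
template = blank_table(["A", "B", "C"], parent_states)

# sort_dicts=False keeps the keys in the order they were generated.
pprint(template, sort_dicts=False)
```

Each inner dict is built fresh, so filling in one combination doesn't touch the others.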

Much nicer!

(credit to the great Simon Willison for giving some cool examples of "blog posts about using an LLM to do something"! https://simonwillison.net/)