# Learning Probabilistic Deterministic Automata

Implementation of the (Palmer & Goldberg, 2005) PAC algorithm.

Lately I have been working on a problem that required learning a *Probabilistic Deterministic Finite Automaton* (PDFA).

## PDFA

A *Probabilistic Deterministic Finite Automaton* is a tuple $\mathcal{A} = (Q, \Sigma, q_0, q_f, \tau, \gamma)$, where:

- $Q$ is the set of states.
- $\Sigma$ is the alphabet.
- $q_0 \in Q$ is the initial state.
- $q_f \not\in Q$ is the final state.
- $\tau: Q \times \Sigma \to Q$ is the transition function.
- $\gamma: Q \times \Sigma \to [0, 1]$ is a function that associates each transition $(q, \sigma)$ with the probability that the transition is taken.

Other requirements for a valid PDFA are:

- $\forall q.\sum_{\sigma \in \Sigma} \gamma(q, \sigma) = 1$, that is, the probabilities of the outgoing transitions from a given state $q$ must sum to $1$.
- when $\tau(q, \sigma)$ is undefined, $\gamma(q, \sigma) = 0$.
- $\forall q. \exists s\in\Sigma^*. \tau(q, s) = q_f \wedge \gamma(q, s) > 0$, i.e. $q_f$ is reachable with probability non-zero from any state.
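These requirements are easy to check mechanically. A minimal sketch in Python, using a toy automaton encoded as a nested dict (state → character → (next state, probability)); this encoding is just an assumption for the example:

```python
import math

# Toy PDFA: states {0, 1}, final state q_f = 2 (not in Q), alphabet {0, 1}.
delta = {
    0: {0: (1, 0.5), 1: (0, 0.5)},
    1: {0: (2, 0.7), 1: (1, 0.3)},
}
FINAL = 2

def is_valid_pdfa(delta, final):
    # Requirement 1: outgoing probabilities of every state sum to 1.
    for q, out in delta.items():
        if not math.isclose(sum(p for _, p in out.values()), 1.0):
            return False
    # Requirement 3: the final state is reachable with non-zero
    # probability from every state (search over positive-probability edges).
    for q in delta:
        seen, frontier = {q}, [q]
        while frontier:
            cur = frontier.pop()
            for nxt, p in delta.get(cur, {}).values():
                if p > 0 and nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        if final not in seen:
            return False
    return True

print(is_valid_pdfa(delta, FINAL))  # True for the toy automaton
```

Requirement 2 ($\gamma(q, \sigma) = 0$ where $\tau$ is undefined) holds implicitly here, since undefined transitions are simply absent from the dict.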

A PDFA $\mathcal{A}$ defines a probability distribution over strings in $\Sigma^*$. Notice that the reachability of the final state $q_f$ ensures that any execution of $\mathcal{A}$ will halt with probability $1$.

It is useful to extend $\tau$ and $\gamma$ to strings, for succinctness. For $s = \sigma_1\sigma_2\cdots\sigma_k \in \Sigma^*$:

- $\tau(q, s) = \tau(\tau(q, \sigma_1), \sigma_2\cdots\sigma_k)$
- $\gamma(q, s) = \gamma(q, \sigma_1) \cdot \gamma(\tau(q, \sigma_1), \sigma_2\cdots\sigma_k)$

Let $D_{\mathcal{A}}$ be the probability distribution over strings in $\Sigma^*$ defined by $\mathcal{A}$.

We have:

$D_{\mathcal{A}}(s) = \gamma(q_0,s)$ for $s$ such that $\tau(q_0,s) = q_f$, and $D_{\mathcal{A}}(s) = 0$ otherwise.
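Concretely, the extended $\tau$, $\gamma$ and the resulting $D_{\mathcal{A}}$ take only a few lines of Python. A sketch over a toy automaton encoded as a nested dict (an assumption for the example, not the library's API):

```python
# Toy PDFA: states {0, 1}, final state q_f = 2, alphabet {0, 1}.
delta = {
    0: {0: (1, 0.5), 1: (0, 0.5)},
    1: {0: (2, 0.7), 1: (1, 0.3)},
}
FINAL = 2

def tau(q, s):
    """Extended transition function over a string s."""
    for sigma in s:
        if q == FINAL or sigma not in delta.get(q, {}):
            return None  # undefined
        q = delta[q][sigma][0]
    return q

def gamma(q, s):
    """Probability of following the path labelled by s from q."""
    prob = 1.0
    for sigma in s:
        if q == FINAL or sigma not in delta.get(q, {}):
            return 0.0
        q, p = delta[q][sigma]
        prob *= p
    return prob

def D(s):
    # D_A(s) = gamma(q_0, s) when tau(q_0, s) = q_f, else 0.
    return gamma(0, s) if tau(0, s) == FINAL else 0.0

print(D((0, 0)))  # 0.5 * 0.7 = 0.35
```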

Moreover, we define the *variation distance* $L_1$ between two distributions
$D_1$ and $D_2$ over $\Sigma^*$ as
$L_1(D_1, D_2) = \sum_{s\in\Sigma^*} | D_1(s) - D_2(s) |$.

The following definition of the $L_{\infty}$-norm between two distributions $D$ and $D^\prime$ will also be useful later:

$L_{\infty}(D, D^\prime) = \max_{s\in\Sigma^*} |D(s) - D^\prime(s)|$
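For distributions with finite support, both distances are one-liners. A minimal sketch, assuming the distributions are given as dicts mapping strings to probabilities:

```python
def l1(d1, d2):
    """Variation distance: sum of absolute differences over the union support."""
    support = set(d1) | set(d2)
    return sum(abs(d1.get(s, 0.0) - d2.get(s, 0.0)) for s in support)

def linf(d1, d2):
    """L-infinity norm: maximum absolute difference over the union support."""
    support = set(d1) | set(d2)
    return max(abs(d1.get(s, 0.0) - d2.get(s, 0.0)) for s in support)

d1 = {"a": 0.5, "ab": 0.5}
d2 = {"a": 0.25, "ab": 0.5, "b": 0.25}
print(l1(d1, d2))    # 0.5
print(linf(d1, d2))  # 0.25
```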

## PDFAs in Python

I implemented a tiny Python library to represent and manipulate PDFAs.

In particular:

- The set of states is the set of integers $\{ 0, \dots, nb\_states - 1\}$ (with $nb\_states > 0$).
- The alphabet is the set of integers $\{0, \dots, alphabet\_size - 1\}$.
- The initial state is always $q_0 = 0$.
- The final state is always $q_f = nb\_states$.
- The transition function is a nested dictionary:
  - at the first level, states are keys and the dicts of outgoing transitions are values;
  - a dict of outgoing transitions has characters $\sigma$ as keys and tuples $(q^\prime, p)$ of next state and probability as values.

For example, to create an object representing the PDFA shown above (with $\zeta = 0.4$ and $\zeta^\prime = 0.7$):
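Since states, characters, and transitions are all plain integers and dicts, such an automaton can be written down directly. A sketch for a hypothetical 2-state automaton (the shape of the automaton in the figure, and the exact `PDFA` constructor signature, are assumptions here):

```python
zeta, zeta_prime = 0.4, 0.7

nb_states = 2        # states {0, 1}; the final state is q_f = 2
alphabet_size = 2    # characters {0, 1}

# transition_dict[q][sigma] = (next_state, probability).
# The shape of this automaton is an assumed example, not necessarily
# the one in the figure.
transition_dict = {
    0: {0: (1, zeta), 1: (0, 1 - zeta)},
    1: {0: (2, zeta_prime), 1: (1, 1 - zeta_prime)},
}

# Sanity check: outgoing probabilities sum to 1 for every state.
for q, out in transition_dict.items():
    assert abs(sum(p for _, p in out.values()) - 1.0) < 1e-9

# With the library, one would then build the automaton roughly as:
# automaton = PDFA(nb_states, alphabet_size, transition_dict)
```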

There is also a helper function to render the automaton with Graphviz.

## Learning a PDFA

As stated earlier, the library also implements a PDFA learning algorithm, proposed in (Palmer & Goldberg, 2005). There exist other approaches in the literature, e.g. restricted to *acyclic* PDFAs, with applications to speech and handwriting recognition. In all cases, the goal is to learn a hypothesis distribution $\hat{D}$ close to the true distribution $D$, i.e. such that

$L(D, \hat{D}) = \sum_{s\in\Sigma^*} |D(s) - \hat{D}(s)| \le \varepsilon$

All the mentioned algorithms are *Probably Approximately Correct* for some probability $\delta$ of failure and some error tolerance $\varepsilon$.

An important concept in the context of PDFA learning is the *distinguishability* of
any pair of states. Formally, we say that a pair of states
$(q_1, q_2)$ is *$\mu$-distinguishable* if

$L_{\infty}(D_{q_1}, D_{q_2}) = \max_{s\in\Sigma^*} |D_{q_1}(s) - D_{q_2}(s)| \ge \mu$

where $D_{q}$ is the distribution represented by the automaton starting from $q$. The idea is that $q_1$ and $q_2$ should have sufficiently different suffix distributions in order to be considered separate states.
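Since $\Sigma^*$ is infinite, the $L_\infty$ distance between two state distributions can only be approximated in practice. A minimal sketch that enumerates strings up to a fixed length (giving a lower bound on the true $L_\infty$), over the same kind of toy dict encoding used above, which is an assumption for the example:

```python
from itertools import product

# Toy PDFA: states {0, 1}, final state q_f = 2, alphabet {0, 1}.
delta = {
    0: {0: (1, 0.5), 1: (0, 0.5)},
    1: {0: (2, 0.7), 1: (1, 0.3)},
}
FINAL = 2
ALPHABET = (0, 1)

def D_from(q0, s):
    """Probability that the automaton started in q0 generates exactly s."""
    q, prob = q0, 1.0
    for sigma in s:
        if q == FINAL or sigma not in delta.get(q, {}):
            return 0.0
        q, p = delta[q][sigma]
        prob *= p
    return prob if q == FINAL else 0.0

def linf_lower_bound(q1, q2, max_len=8):
    """Max |D_q1(s) - D_q2(s)| over all strings of length <= max_len."""
    best = 0.0
    for k in range(max_len + 1):
        for s in product(ALPHABET, repeat=k):
            best = max(best, abs(D_from(q1, s) - D_from(q2, s)))
    return best

# The witness string here is (0,): D_from(1, (0,)) = 0.7, D_from(0, (0,)) = 0.
print(linf_lower_bound(0, 1))  # 0.7
```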

The algorithm is too complex to explain in full detail here. It takes as input the following parameters:

- $|\Sigma|$: the alphabet size.
- $n$: an upper bound on the number of states of the target automaton.
- $\mu$: a lower bound on distinguishability.

Plus the PAC parameters $\delta$ and $\varepsilon$.

It is composed of two parts:

- estimation of a *subgraph* $H$ of the underlying graph of the PDFA;
- estimation of the probabilities on the edges.

Here is what the two algorithms do:

- Algorithm 1: find a subgraph $H$ of $\mathcal{A}$.
- Algorithm 2: find the probabilities $\gamma(q,\sigma)$ for all $q, \sigma$.

For both algorithms, we need to generate a certain amount of samples from the true PDFA $\mathcal{A}$.
However, although the number of samples is polynomial
in all the parameters of the algorithm,
the amount of samples needed for the PAC guarantees is *computationally prohibitive*
even for reasonable assignments of those parameters.
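Generating samples is straightforward: repeatedly follow transitions according to $\gamma$ until $q_f$ is reached, which happens with probability $1$. A minimal sketch with a toy dict-encoded automaton (an assumption for the example, not the library's sampler):

```python
import random

# Toy PDFA: states {0, 1}, final state q_f = 2, alphabet {0, 1}.
delta = {
    0: {0: (1, 0.5), 1: (0, 0.5)},
    1: {0: (2, 0.7), 1: (1, 0.3)},
}
FINAL = 2

def sample(rng):
    """Draw one string from the distribution defined by the toy PDFA."""
    q, word = 0, []
    while q != FINAL:
        symbols = list(delta[q])
        weights = [delta[q][s][1] for s in symbols]
        sigma = rng.choices(symbols, weights=weights)[0]
        word.append(sigma)
        q = delta[q][sigma][0]
    return tuple(word)

rng = random.Random(42)
samples = [sample(rng) for _ in range(10_000)]
# Empirical probability of the shortest accepted string (0, 0),
# whose true probability is 0.5 * 0.7 = 0.35:
print(samples.count((0, 0)) / len(samples))
```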

For example, say we want to learn an automaton with $n=2$, $|\Sigma| = 2$, $\mu = 0.1$, and PAC parameters $\varepsilon = 0.1$ and $\delta^\prime = \delta^{\prime\prime} = 0.1$ ($\delta^\prime$ is the probability of failure of Algorithm 1, whereas $\delta^{\prime\prime}$ is the probability of failure of Algorithm 2; they are such that $\delta^\prime + \delta^{\prime\prime} = \delta$).

The number of samples required by Algorithm 1 can be computed from the bounds in the paper; for these parameters, the result stored in `N` is `68173280` ($\approx 68 \cdot 10^6$ samples).

For Algorithm 2, the analogous computation yields `N = 294072829470708` ($\approx 3 \cdot 10^{15}$ samples).

As said, these numbers are not practical for any personal computer. However, for some examples, a much lower number of samples can be enough, as shown in the next section.

## Implementation of the learning algorithm

In the same repository (in src/learning-pdfas/), you can find the implementation of the (Palmer & Goldberg, 2005) algorithm.

You should execute the code snippets in the previous sections
in order to get the `automaton` variable.

The `generator` is an object that generates samples from a `PDFA` instance.
`SimpleGenerator` is just a wrapper around an automaton, whereas
`MultiprocessedGenerator` parallelizes the sampling using multiprocessing.

Notice the parameters `n1_max_debug`, `n2_max_debug` and `m0_max_debug`.
They are used to cap the number of samples for Algorithm 1,
the number of samples for Algorithm 2, and
the value $m_0$, respectively.

You can see the same example in this notebook.

## Conclusion

This library provides a simple Python implementation of Probabilistic Deterministic Finite Automata,
together with a PAC learning algorithm to learn them from samples.