Steven Morse personal website and research notes

Python class for Hawkes processes

This post is about a stochastic process called the Hawkes process. I offer some Python code for generating synthetic sequences and doing parameter estimation, and also cover some theoretical preliminaries.

I am using the Hawkes process in some on-going research — I found that it is popular enough to have a large, interdisciplinary literature, but specialized enough that the available software packages are few and far between. At the end of this blog, I link to several repos already out there for this family of models. They are high quality, production-level packages, and miles beyond what I offer here, but are in my opinion overkill for someone just getting started and needing some basic code to fiddle with.

I’m hoping to bridge this gap by offering both a beginner-friendly math exposition and an immediately useable, if relatively basic, code library in Python.

Hawkes Poisson comparison


We can define a point process as a random and finite series of events governed by a probablistic rule. For example, the ubiquitous Poisson process is a series of points along the nonnegative real line such that the probability of \(k\) points on any interval length \(n\) is given by a Poisson distribution with parameter \(\lambda n\). The bottom series of points in the figure above is an example of a Poisson process (with rate \(\lambda=0.1\)).

We may even consider a \(U\)-dimensional Poisson process, with \(U\) different Poisson processes generating events in \(\mathbb{R}^U\), each with a different rate \(\lambda_u\). Now, the overall number of points in a particular interval (or now more appropriately, volume) is again given by a Poisson distribution with parameter \(\lambda n\), where the rate \(\lambda = \lambda_1 + ... + \lambda_u\) (the Poisson superposition theorem). This additive property allows us to compute the probability that a particular event originated from a particular dimension \(u_i\) as \(\lambda_{u_i} / \lambda\).

You can also show that the number of events in disjoint subsets are independent of each other. This leads to the critical observation that the Poisson process is memoryless. Another way to state this is that the interarrival time between two successive events is an exponential random variable, and therefore the probability of the next interarrival time is independent of the previous. This memoryless property makes the Poisson process an extremely tractable modeling tool.

Hawkes process

However, the memoryless property of Poisson processes means that it is unable to capture a dependence on history, or in other words, interaction between events. For example, we may want the event of an arrival to increase the probability of arrivals in the next small interval of time.

For this, we introduce the Hawkes process, which gives an additive, decaying increase to the intensity function for each new arrival. Now, the intensity function is only conditionally Poisson: that is, given the history of events \(\{t_i\}\) up to \(t\), the conditional intensity at \(t\), that is \(\lambda(t ; t_i < t)\), is Poisson.

Definition. Hawkes process. Consider a sequence of events \(\{(t_i, u_i)\}_{i=1}^n\) consisting of a time \(t_i\) and dimension \(u_i\) (i.e. the \(i\)-th event occurred at time \(t_i\) in dimension \(u_i\)), for \(t_i\in\mathbb{R}^+\) and \(u_i\in \mathcal{U}=\{1,2,...,U\}\). This sequence is a Hawkes process if the conditional intensity function has the parameterized form

\[\lambda_u (t; \Theta) = \mu_u + \sum_{i:t_i < t} h_{uu_i}(t-t_i; \theta_{uu_i})\]

where \(\Theta=(\mu, \theta)\) are the model parameters and \(H=[h_{ij}]\), \(h_*(t):\mathbb{R}^+ \rightarrow \mathbb{R}^+\) is the matrix of triggering kernels (also sometimes called the excitation function or decay kernel) which is varying with \(u\) and \(u_i\).

In other words, imagine \(U\) streams of events, all happening simultaneously. And imagine that, within a stream \(u\), any previous event \(t_i\) has the capability of increasing the likelihood of a future event occurring at time \(t\), according to some function \(h_{uu}(t-t_i)\) which we control with parameters \(\theta_{uu}\). Further, imagine that streams can influence not only themselves, but also each other, according to the same function, and parameters \(\theta_{u_i u_j}\) (for the effect of stream \(j\) on \(i\)).

We’d like to model the triggering kernel \(h(\cdot)\) as something that decays with time: as previous events become more and more distant, they have less and less effect on the present. A natural choice is the exponential function \(h(t)=\omega e^{-\omega t}\). We may also separate the exponential function itself from a scaling factor, so we write:

\[h_{uu'}(t) = \alpha_{uu'} \omega_{uu'} e^{-\omega_{uu'} t}\]

and we might even set \(\omega_{uu'} = \omega\) globally, and only tune the \(\alpha\)’s (which still gives us \(U^2+U+1\) parameters to learn).

An example

A univariate example is given above, and compared to a Poisson process with the same base rate \(\mu\). In the Hawkes process, we see that events cause a spike in the intensity \(\lambda(t)\), which frequently leads to more events, more spikes … this results in a burst of activity (which happens to well-model the actual way many event sequences behave, such as the communication habits of people).

Multivariate example

In the multivariate example above, we have 3 different streams of events, call them \(e_0\), \(e_1\), and \(e_2\). We have engineered it by setting the parameters such that \(e_0\) has an influence on itself and \(e_1\), and \(e_1\) has an influence on \(e_2\). We also set the background rate to zero for all the streams except for \(e_0\) — this way, if we see an event on \(e_1\) or \(e_2\) we know it came from a previous event on the parent stream, not a random event.

This creates a cascading effect that we see almost immediately: a background event in \(e_0\) causes a “child” event in \(e_1\), which then leads to a child event in \(e_2\). Notice that in the second cascade, there is no event in \(e_2\) even though it experienced a spike in probability of an event happening. Because \(e_0\) is self-exciting, we see temporal clustering there which translates to bursts in its child processes.

Generating synthetic sequences

How did we generate the data in the figures above? A well-worn approach is known as Ogata’s thinning method, which essentially generates new events from an exponential distribution parameterized by the Hawkes intensity at that time, but then rejects some events with some probability that decreases as the time since the last event increases.

In the multivariate case, this is only slightly more complicated, since we must also attribute each generated event to a particular dimension based on the proportional likelihood the new event came from that dimension. In expositions like this excellent slide deck, these two steps are called Reject-Attribute.

The details are apparent in the repo, but let us mention two important modifications.

The algorithm as typically described requires \(O(n^2 U^2)\) operations to draw \(n\) samples over \(U\) dimensions, which is prohibitive for large \(n\) or \(U\). Instead, we modify an approach mentioned in this paper. Given the rates at the last event \(t_k\) (which note do not include effects of \(t_k\)), we can calculate \(\lambda(t)\) for \(t>t_k\) by

\[\lambda_u(t) = \mu_u + e^{-\omega(t-t_k)}\big(a_{uu_k} \omega + (\lambda_u(t_k) - \mu_u)\big)\]

which we can do in \(O(1)\), and only requires saving the rates at the most recent event. Note also that when \(0 < t-t_k < \epsilon\), this reduces to

\[\lambda_u(t) = \lambda_u(t_k) + a_{uu_k} \omega\]

or in other words, the previous rate plus the maximum contribution the event \(t_k\) can make since it has just occurred.

Secondly, we find that texts describing the algorithm typically frame the attribution/rejection test as finding an index \(n_0\) such that a uniformly random number on \([0,1]\) is between the normalized successive sums of intensities around that index. But note that this entire procedure amounts to a weighted random sample over the integers \(1,2,...,U+1\) where the probabilities are the normalized rates, and selecting \(U+1\) is equivalent to the “rejection” condition. This allows us to use optimized package software for weighted random samples, instead of something like a for-loop (as is present in even production-level Hawkes process software, like the hawkes package in R see below).

Parameter estimation

Lastly, we will mention a few approaches for estimating the parameters of a Hawkes process. It would be impossible to reasonably cover the entire literature, or even a single method, in any depth here, but I’ll try to cover the major “streams” (haha) of technique (including the approach I take in the repo’d code).

LL contours

The sheer number of parameters also leads to massive overfitting and problems of sparsity. As a result, most methods apply some type of regularization (typically on the \(\alpha\)’s), such as an L2-norm (here), or L1-norm and nuclear norm (here), etc.

This repo includes this MAP EM approach. It is not very sophisticated: it treats \(\omega\) as a global hyperparameter, does not incorporate a prior on the background rates \(\mu\), and I think the expression I am using for the “complete data log-likelihood” in the multivariate case is actually a tight lower bound on the real value as they show here.

Hopefully it is helpful/interesting to you! Please consider citing our preprint if you find the MAP EM approach useful to your own research. Feel free to send any feedback/questions, via Twitter or an email (see my about page).

Further resources

Some of what’s out there, that I’ve found:

Many of Nan Du’s colleagues/collaborators at GA Tech have done work in this area too — I particularly liked Ke Zhou’s ADMM-MM technique for parameter estimation — but I was unable to find code repositories for this other work.

If you know of other useful repositories, I would love to hear from you and at a minimum, add to this list!