I like to write short tutorials on applied math concepts, with a large dose of tinkering, in Jupyter/IPython notebooks. These are mostly to help re-educate future me, but are also hopefully helpful to other random people on the internet. Instead of just sharing the actual notebooks on my GitHub (which would probably actually be more helpful to some…), I reformat them as blog posts and put them on this blog.
There are prefab libraries out there that already do this – the main one is `nbconvert` – but they are, in my experience, way more horsepower than I actually need, a little finicky, require lots of dependencies, and are not necessarily easily customizable. Okay, they’re great actually, but I like to roll my own, and so I wrote my own converter that I’ll share the bones of here. Most of this is fairly straightforward, but the image conversion I thought was particularly clever and worth sharing.
The first amazing thing to note is that a Jupyter notebook is just a big JSON file. Try opening a notebook as raw text and you’ll see something like:
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [ ... ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {},
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\n",
        "import numpy as np\n",
        "import scipy.stats\n",
        "from scipy.spatial import distance_matrix"
      ]
    }, ...
The `cells` entry is a list of the notebook’s cells, each tagged as `markdown` or `code`, with the `source` entry giving the text. When you run the notebook, it executes the code cells and stores their output in `outputs`. Simple as that.
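As a quick sanity check of that structure, here is a minimal sketch that parses a notebook-shaped JSON string and tallies its cells by type. The tiny `raw` notebook and the `count_cell_types` helper are made up for illustration; with a real file you would read the text from the `.ipynb` instead.

```python
import json
from collections import Counter

# a tiny, hand-written stand-in for the contents of an .ipynb file
raw = '''
{
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["# Title\\n", "Some text"]},
    {"cell_type": "code", "execution_count": 1, "metadata": {},
     "outputs": [], "source": ["import numpy as np\\n", "np.pi"]}
  ]
}
'''

def count_cell_types(book):
    """Tally the cells of a parsed notebook by their `cell_type`."""
    return Counter(cell['cell_type'] for cell in book['cells'])

book = json.loads(raw)
counts = count_cell_types(book)

# `source` is a list of lines, so joining with '' recovers the cell text
first_source = ''.join(book['cells'][0]['source'])
```

Note that each entry in `source` already carries its own trailing newline, which is why an empty-string join is enough to rebuild the cell.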
So, at a high level, all we have to do to convert the notebook is to iteratively append the markdown (directly) and the code (with some formatting) from this JSON into a markdown file. We’ll hit a wrinkle with the image outputs (i.e. plots), but more on this later.
To start, we’ll create a simple Python script named `converter.py` with the starting structure:
import json    # to parse the notebook's JSON
import os      # to deal with the filesystem
import sys     # to grab command-line arguments
import re      # for some REGEX
import base64  # to deal with images

def main(fname):
    # fname is the filename of our notebook
    with open(fname) as f:
        book = json.load(f)

    # grab the name of the notebook to use as the name of the md file
    name = fname[:-6]  # -6 cuts off the .ipynb extension
    md_file_name = name + '.md'

    # make a subdirectory for this markdown file and embedded images
    os.mkdir(name)

    # `text` will be a huge string with the entire markdown of our file
    text = ''

    # TODO: build the markdown header and append to `text`

    # iterate through `cells`
    for cell in book['cells']:
        # TODO: convert code/markdown input and output to `text`
        pass

    with open(name + '/' + md_file_name, 'w') as f:
        f.write(text)

# standard boilerplate to run `main` when `converter.py` is run from
# the command line
if __name__ == "__main__":
    # sys.argv[1] will give us the first argument passed to converter.py
    # note: argv[0] is "converter.py" itself
    main(sys.argv[1])
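The `fname[:-6]` slice assumes the argument is a bare `something.ipynb` filename. If you might pass a path with folders in it, one possible alternative (just a sketch, not what the script here does) is to let `os.path` do the splitting. `notebook_name` is a hypothetical helper name.

```python
import os

def notebook_name(fname):
    """Return the notebook's base name without directory or extension."""
    base = os.path.basename(fname)       # drop any leading folders
    stem, _ext = os.path.splitext(base)  # split off the '.ipynb'
    return stem

name = notebook_name('posts/fourier.ipynb')
```

With this version the output folder would be created relative to wherever you run the script, rather than next to the notebook, so pick whichever behavior suits your setup.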
Now all we have to do is fill in the TODO’s above for the header, code sections, and markdown sections, which we’ll do now.
The header is the easiest part. My blog uses Jekyll, and so needs a short header (the “front matter”) with pre-set labels to define the layout, title, tags, etc. for the post, which Jekyll then compiles into the static HTML for the actual post.
The header looks something like this:
---
layout: post
title: "Fourier Transforms (with Python examples)"
categories: journal
date: 2024-04-06
tags: ['python', 'mathematics']
---
We can use Python’s string formatting (`str.format`) and some educated guesses at what the different entries will be to do a pretty good first pass at this, knowing we’ll inevitably want to go in and manually edit afterward.
# (in main)
# it's important to avoid extra spaces (or even tabs!)
# in this as they'll show up in the md
post_header = '---\n\
layout: post\n\
title: "{title}"\n\
categories: journal\n\
date: {date}\n\
tags: {tags}\n\
---'
post_title = name
post_date = '2024-xx-xx'
post_tags = ['python', 'mathematics']
text += post_header.format(title=post_title, date=post_date, tags=post_tags)
text += '\n\n'
That’s it! You could probably hang some bells and whistles on this by using today’s date (`datetime`), or some fancy word frequency count for the tags … or just adjust manually later.
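For the today’s-date idea, a minimal sketch might look like the following. The `build_header` helper and its arguments are hypothetical; the template is the same header as above, just assembled from a list of lines so stray indentation can’t sneak in.

```python
import datetime

# same header as above, built line by line
post_header = '\n'.join([
    '---',
    'layout: post',
    'title: "{title}"',
    'categories: journal',
    'date: {date}',
    'tags: {tags}',
    '---',
])

def build_header(title, tags, date=None):
    """Fill in the header, defaulting the date to today (YYYY-MM-DD)."""
    if date is None:
        date = datetime.date.today().isoformat()
    return post_header.format(title=title, date=date, tags=tags)

# pass an explicit date, or leave it off to stamp the post with today
header = build_header('fourier', ['python', 'mathematics'], date='2024-04-06')
header_today = build_header('fourier', ['python', 'mathematics'])
```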
OK, recall that in our `main` function we’re looping through the list of `cells` in the notebook’s JSON. Inside that loop, we’ll add an `if/elif` block to select on whether we’ve got code or markdown.
# (in main)
for cell in book['cells']:
    # code block
    if cell['cell_type'] == 'code':
        pass  # TODO
    elif cell['cell_type'] == 'markdown':
        pass  # TODO
That markdown block we’ll handle now. Really all we have to do is something like this:
# get content of markdown
temp = ''.join(e for e in cell['source'])
# add it to the `text` master
text += '\n'
text += temp
text += '\n'
But for my blog, I have two additional things to deal with. First, for whatever reason I decided to set MathJax to require two `$`s, whereas Jupyter just wants one, so I have to add in this insane Regex maneuver to add in the extra dollar signs:
# replace $...$ with $$...$$ for my jekyll build :|
# REGEX explainer of r'([^\$])(\$[^\$]+\$)([^\$])'
# grabs 3 groups: preceding char, $...$, trail char. ignores $$
temp = re.sub(
    r'([^\$])(\$[^\$]+\$)([^\$])',
    lambda mo: mo.group(1) + '$' + mo.group(2) + '$' + mo.group(3),
    temp
)
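One limitation of this pattern is that it consumes the character on either side of the math, so a `$...$` at the very start or end of a cell, or two inline expressions separated by a single character, can get missed. A possible alternative, sketched here under the assumption that inline math never contains a `$`, uses lookarounds instead (`double_dollars` is a hypothetical helper name):

```python
import re

def double_dollars(md):
    """Turn inline $...$ into $$...$$, leaving existing $$...$$ alone."""
    # (?<!\$) : the opening $ is not preceded by another $
    # (?!\$)  : the closing $ is not followed by another $
    return re.sub(r'(?<!\$)\$([^$]+)\$(?!\$)', r'$$\1$$', md)

result = double_dollars('$a$ $b$ and $$c$$')
print(result)  # $$a$$ $$b$$ and $$c$$
```

Because lookarounds only peek at the neighboring characters without consuming them, back-to-back expressions like `$a$ $b$` are each matched on their own.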
Another little quirk I discovered is that MathJax doesn’t like the `|` symbol, so I convert that to the explicit `\vert`, with more Regex:
temp = re.sub(
    r'([\|]+)(.)',
    lambda mo: r'\vert'*len(mo.group(1)) + (' ' if mo.group(2) == ' ' else ' ' + mo.group(2)),
    temp
)
which is actually a bit buggy I’ve found, but mostly works. Anyway, you add these Regex maneuvers in and then go through the `text += temp` stuff.
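Part of what makes this replacement fragile is that it touches every `|` in the cell, including the ones in markdown tables. One way to narrow it, as a sketch that assumes math is always delimited by `$...$` or `$$...$$`, is to find the math spans first and only swap the bars inside them (`vert_in_math` and `MATH_SPAN` are hypothetical names, not part of the converter):

```python
import re

# matches $$...$$ (possibly spanning lines) or $...$
MATH_SPAN = re.compile(r'\$\$.+?\$\$|\$[^$]+\$', re.DOTALL)

def vert_in_math(md):
    """Replace | with \\vert, but only inside math spans."""
    def fix(mo):
        # the trailing space keeps \vert from fusing with the next symbol
        return mo.group(0).replace('|', r'\vert ')
    return MATH_SPAN.sub(fix, md)

result = vert_in_math('| a | $P(A|B)$ |')
print(result)  # | a | $P(A\vert B)$ |
```

The table pipes outside the dollar signs are left alone, while the conditional-probability bar inside the math gets converted.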
For the last bit, we’ll convert the code. I saved this for last because we’ve also got to handle images, which are tricky.
For the inputs, it’s easy enough, you can just do something like:
text += '\n```python\n'
text += ''.join(e for e in cell['source'])
text += '\n```\n\n'
For the outputs, first note there could be multiple (for example text and image), so we’ll do a loop and handle both. I use a `try/except` because sometimes the output will have an empty or missing `data` entry (which raises a `KeyError`), and this seemed the easiest way to handle that.
for o in cell['outputs']:
    try:
        # handle code outputs
        if 'text/plain' in o['data']:
            text += '\n```\n'
            # each line already ends in its own newline, so join with ''
            text += ''.join(e for e in o['data']['text/plain'])
            text += '\n```\n'
        # handle images
        if 'image/png' in o['data']:
            pass  # TODO
    except KeyError:
        pass
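One thing the `try/except` quietly skips: in the version 4 notebook format, anything a cell prints is stored as a `stream` output, which keeps its text under a `text` key and has no `data` entry at all. If you want printed output in the post too, a minimal sketch could look like this (`output_text` is a hypothetical helper, and error outputs and other MIME types are simply ignored here):

```python
def output_text(o):
    """Return the plain text of one notebook output, or '' if it has none."""
    if o.get('output_type') == 'stream':
        # printed text (stdout/stderr) lives directly under 'text'
        lines = o.get('text', '')
    else:
        # execute_result / display_data keep their content under 'data'
        lines = o.get('data', {}).get('text/plain', '')
    # the value may be one string or a list of lines with trailing newlines
    return lines if isinstance(lines, str) else ''.join(lines)

# hand-written example outputs, one of each common kind
outputs = [
    {'output_type': 'stream', 'name': 'stdout', 'text': ['hello\n', 'world\n']},
    {'output_type': 'execute_result', 'execution_count': 1,
     'metadata': {}, 'data': {'text/plain': ['3.14']}},
    {'output_type': 'display_data', 'metadata': {},
     'data': {'image/png': 'iVBORw0KGgo='}},
]
texts = [output_text(o) for o in outputs]
```

Using `.get` with defaults means a missing key just yields an empty string, so no `try/except` is needed for this part.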
Finally, for images, let’s first note that images are actually stored in the JSON as a giant string! Take a look at an example:
"outputs": [
  {
    "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAtEAAAHSCAYAAAAqtZc0AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4 ...
It turns out this is just the bytes of the PNG file, encoded as base64 text. Through a little StackOverflowing, I found you can decode this and write it out as an image file (like a PNG) like this:
# grab the base64-encoded image string
s = o['data']['image/png']
# save image
ifname = f'image{img_k}.png'
with open(name + '/' + ifname, 'wb') as f:
    # encode converts string->bytes, decodebytes undoes the base64
    f.write(base64.decodebytes(s.encode('latin-1')))
img_k += 1
where `img_k` is just a counter I defined earlier in the function to keep running track of which image we’re on. So this saves the file to the subdirectory we created earlier. We now just need to put a little HTML in our markdown to display the image.
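To see that the giant string really is just a base64-encoded PNG, here is a small check using only the first 16 characters of the example string above. Every PNG file starts with the same 8-byte signature, so decoding the start of the string should reproduce it. (I use `b64decode` here in place of `decodebytes`; both live in the standard `base64` module.)

```python
import base64

# first 16 characters of the "image/png" string shown earlier
prefix = 'iVBORw0KGgoAAAAN'

# the fixed 8-byte signature that begins every PNG file
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'

# 16 base64 characters decode to 12 raw bytes
raw = base64.b64decode(prefix)
print(raw[:8] == PNG_SIGNATURE)  # True
```

This is also a handy way to sanity-check an output before writing it to disk, if you ever run into a notebook whose image data looks off.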
# add image include to markdown text
img_embed_stem = r'https://stmorse.github.io/images/2024/' + name + '/'
img_format = '<img align="center" width="90%" \
src="{stem}{filename}" alt="{alttext}">'
text += '\n'
text += img_format.format(stem=img_embed_stem, filename=ifname, alttext=ifname)
text += '\n'
And that’s it!
To run this, just make sure the `converter.py` script is in the same folder as your notebook (or modify the filepath accordingly) and do
~/yourfolder $ python3 converter.py yournotebook.ipynb
And it will create a folder called `yournotebook` with a `yournotebook.md` file and (if applicable) a bunch of `.png` files.
You’ll probably have to do some editing of the markdown before it’s ready to publish, but I’d say this script gets you 90 percent of the way there. And, I dunno, I just like knowing exactly what the script is doing.
Hope this was helpful to you!
Here’s the full code just for good measure:
import json
import re
import base64
import os
import sys

def main(fname):
    with open(fname) as f:
        book = json.load(f)

    name = fname[:-6]
    post_title = name
    post_date = '2024-xx-xx'
    post_tags = ['python', 'mathematics']
    post_header = '---\n\
layout: post\n\
title: "{title}"\n\
categories: journal\n\
date: {date}\n\
tags: {tags}\n\
---'
    md_file_name = name + '.md'
    img_embed_stem = r'https://stmorse.github.io/images/2024/' + name + '/'
    img_format = '<img align="center" width="90%" \
src="{stem}{filename}" alt="{alttext}">'

    # make subdirectory
    os.mkdir(name)

    text = ''
    img_k = 0
    text += post_header.format(title=post_title, date=post_date, tags=post_tags)
    text += '\n\n'

    for cell in book['cells']:
        # code block
        if cell['cell_type'] == 'code':
            text += '\n```python\n'
            text += ''.join(e for e in cell['source'])
            text += '\n```\n\n'

            # handle text or image outputs
            for o in cell['outputs']:
                try:
                    # handle code outputs
                    if 'text/plain' in o['data']:
                        text += '\n```\n'
                        # each line already ends in its own newline
                        text += ''.join(e for e in o['data']['text/plain'])
                        text += '\n```\n'

                    # handle images
                    if 'image/png' in o['data']:
                        # grab the base64-encoded image string
                        s = o['data']['image/png']

                        # save image
                        ifname = f'image{img_k}.png'
                        with open(name + '/' + ifname, 'wb') as f:
                            # encode converts string->bytes, decodebytes undoes the base64
                            f.write(base64.decodebytes(s.encode('latin-1')))
                        img_k += 1

                        # add image include to markdown text
                        text += '\n'
                        text += img_format.format(stem=img_embed_stem, filename=ifname, alttext=ifname)
                        text += '\n'
                except KeyError:
                    pass

        # markdown block
        elif cell['cell_type'] == 'markdown':
            # get content of markdown
            temp = ''.join(e for e in cell['source'])

            # replace $...$ with $$...$$ for my jekyll build :|
            # REGEX explainer of r'([^\$])(\$[^\$]+\$)([^\$])'
            # grabs 3 groups: preceding char, $...$, trail char. ignores $$
            temp = re.sub(
                r'([^\$])(\$[^\$]+\$)([^\$])',
                lambda mo: mo.group(1) + '$' + mo.group(2) + '$' + mo.group(3),
                temp
            )

            # TODO: this is buggy
            # my jekyll build also doesn't like |
            # replace with \vert
            temp = re.sub(
                r'([\|]+)(.)',
                lambda mo: r'\vert'*len(mo.group(1)) + (' ' if mo.group(2) == ' ' else ' ' + mo.group(2)),
                temp
            )

            text += '\n'
            text += temp
            text += '\n'

    with open(name + '/' + md_file_name, 'w') as f:
        f.write(text)

if __name__ == "__main__":
    main(sys.argv[1])