Pinecast’s Renumbering Tool: How it works

Matt Basta
Published in Pinecast
8 min read · Mar 19, 2020

The Pinecast Episode Renumbering tool is a fairly recent feature that podcasters have at their disposal. It allows podcasters to update their episode titles in bulk to account for Apple’s 2019 guidelines.

The renumbering tool in action, automatically removing prefixes from episode titles

What does it do?

Apple’s guidelines essentially boil down to a couple of things:

  1. Start using the season and episode number tags in RSS feeds
  2. Remove your season and episode numbers from your episode titles

While this is a straightforward ask on paper, there’s a great deal of technical shizzle-wizzle that needs to happen behind the scenes.

The big challenge is that most podcast apps—including old versions of Apple’s own podcast apps—don’t support the new episode and season number tags. If we remove the season and episode numbers from the episode titles, they’ll be gone for all apps, not just Apple Podcasts.

Fortunately, Apple provides the <itunes:title> tag in their RSS spec. It lets us set a separate title for Apple-compatible apps while leaving the regular <title> untouched for everyone else. It might look like this:

Before:

<item>
  <title>Episode 1: The first episode</title>
  ...
</item>

After:

<item>
  <title>Episode 1: The first episode</title>
  <itunes:title>The first episode</itunes:title>
  <itunes:episode>1</itunes:episode>
  ...
</item>

We could just as easily expose the value of the <itunes:title> field as an extra text box (as some podcast hosts do), but this is an extra maintenance burden and opens the door to awkward bugs and support issues.

How we handle it

We support a notion of episode title prefixes, which are templates that get filled in with season and/or episode numbers. We automatically take the title of the episode and, in the old-fashioned <title> tag, put the formatted prefix in front of it. The bare title goes into <itunes:title>. In the example above, the prefix might look like Episode {episode}: .
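To make the idea concrete, here is a minimal sketch of how a prefix template could be applied. It is an illustration rather than Pinecast’s actual code: the function name legacy_title, the {season} placeholder, and the choice to skip the prefix for unnumbered episodes are all assumptions.

```python
from typing import Optional


def legacy_title(
    prefix_template: str,
    title: str,
    episode: Optional[int] = None,
    season: Optional[int] = None,
) -> str:
    """Build the text for the plain <title> tag from the bare episode title."""
    if episode is None:
        # Assumption: episodes without a number (trailers, bonus episodes)
        # are left unprefixed.
        return title
    # str.format ignores keyword arguments the template does not use,
    # so a template with only {episode} works as well.
    return prefix_template.format(episode=episode, season=season) + title


print(legacy_title("Episode {episode}: ", "The first episode", episode=1))
```

The bare title would go into <itunes:title> unchanged, so only the legacy tag carries the prefix.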

Why it’s a challenge

If you started your podcast before Apple’s guidelines went into effect—or you weren’t following their guidance—it’s very toilsome to change over each and every episode of your podcast. For shows with lots of episodes, this can take hours of manual effort.

What we did

To make this process easier, Pinecast now features a three-step wizard.

  1. Set up the prefix
  2. Apply season and episode numbers to episodes
  3. Remove season and episode numbers from episode titles

The first step, setting the prefix, is easy: for us, it’s the same settings form that we already have on the Settings tab of podcast dashboards.

The prefix settings step of the tool

For the second step, there’s a little bit more involved. We need to apply season and episode numbers to all of the existing episodes in the feed.

Here, we order all episodes from the podcast chronologically (oldest to newest). We start with season 1 episode 1 and count up from there to populate the default episode numbers, which is a pretty close approximation!

The tricks that make this view useful are in how the user interacts with it:

  1. If the user sets the episode number for an episode, we continue counting upwards from that number for the following episodes.
  2. If the user sets the season number for an episode, we reset the episode number to 1. All subsequent episodes inherit the new season number.
  3. If an episode is marked as a trailer or bonus episode, it doesn’t get an episode number, allowing some numbers to be skipped.
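The three rules above can be sketched in a few lines. This is a simplified model under stated assumptions, not the tool’s real implementation: the Episode class, its field names, and the decision that trailers and bonus episodes keep the current season are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Episode:
    title: str
    kind: str = "full"  # "full", "trailer", or "bonus"
    season_override: Optional[int] = None   # set when the user edits the season
    episode_override: Optional[int] = None  # set when the user edits the number


def assign_numbers(episodes: List[Episode]) -> List[Tuple[int, Optional[int]]]:
    """Return (season, episode number) for episodes sorted oldest to newest."""
    season, next_number = 1, 1
    numbered = []
    for ep in episodes:
        if ep.season_override is not None:
            # Rule 2: a new season restarts the count at 1.
            season, next_number = ep.season_override, 1
        if ep.kind != "full":
            # Rule 3: trailers and bonus episodes get no episode number.
            numbered.append((season, None))
            continue
        if ep.episode_override is not None:
            # Rule 1: keep counting upwards from the number the user typed.
            next_number = ep.episode_override
        numbered.append((season, next_number))
        next_number += 1
    return numbered


show = [
    Episode("a"),
    Episode("b"),
    Episode("teaser", kind="trailer"),
    Episode("c", episode_override=10),
    Episode("d"),
    Episode("e", season_override=2),
    Episode("f"),
]
print(assign_numbers(show))
```

Because every edit only changes the running counters, a single correction flows down to every later episode, which is why a handful of tweaks is usually enough.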

With relatively few tweaks, a podcaster can make their way through an entire podcast in a matter of seconds, rather than minutes or hours.

The last step, the removal of the prefixes from the episode titles, is the most technically complicated.

Here, we show each episode along with its newly set season and episode number. Each row includes a text box for editing the title manually. If we’re able to detect the prefix, the row also shows a suggested title and a button to apply the suggestion.

In many cases, the suggestions are pretty darn good (most podcasts we tested had their prefixes correctly detected 70% of the time) and can allow the podcaster to avoid lots of manual futzing. The whole process takes just a minute or two, even for a podcast with hundreds of episodes.

Detecting prefixes

Detecting and removing prefixes was a major point of interest for us when designing the episode renumbering tool. If we couldn’t automate this process, at least partially, the tool would be very underwhelming and wouldn’t save nearly as much time.

So, how do we do it? Well, there are some tricks.

Step 1: Lexing

Lexing is a technique for taking a string of text and breaking it into smaller tagged chunks. It’s essential for parsing languages, but in this case, we’re using the output of the lexer to analyze the title.

To perform the lexing process, we categorize chunks of text using these categories, in order:

  1. Anything that’s not a digit, up to the end of the title
  2. Numbers followed by spaces
  3. Numbers followed by symbols and spaces
  4. Numbers not followed by spaces or symbols
  5. Characters (letters)
  6. Spaces
  7. Symbols
  8. Any other character

These chunks of text are called “tokens” and let us understand what is in the title of a show.
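Here is a rough sketch of a lexer along these lines, written with Python’s re module. The regular expressions and token names are our best guess at the eight categories listed above, not the patterns the tool actually uses, and the sample title is invented.

```python
import re

# One (name, pattern) entry per category, in the same order as the list above.
# Alternatives in a regex are tried left to right, which gives us the ordering.
TOKEN_RULES = [
    ("rest", r"[^\d]+\Z"),                 # no digits left, up to the end of the title
    ("number_space", r"\d+\s+"),           # numbers followed by spaces
    ("number_symbol", r"\d+[^\w\s]+\s*"),  # numbers followed by symbols and spaces
    ("number", r"\d+"),                    # numbers on their own
    ("letters", r"[^\W\d_]+"),             # characters (letters)
    ("space", r"\s+"),
    ("symbol", r"[^\w\s]+"),
    ("other", r"."),                       # any other single character
]
LEXER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_RULES),
    re.DOTALL,
)


def lex(title):
    """Split a title into (token type, text) pairs, scanning left to right."""
    return [(match.lastgroup, match.group()) for match in LEXER.finditer(title)]


print(lex("ABTS-12: The title"))
```

Since the last rule matches any single character, every position in the title produces a token and nothing is silently dropped.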

To the left, you can see an illustration of this on a sample title. We match chunks of the text against the rules above by scanning left to right. Each resulting chunk (or “token”) is highlighted with a color. The color represents the type of token, which is the useful part for this analysis—we’ll talk about that in a minute.

What you can see is the “ABTS” at the start of the title gets matched as characters, followed by a symbol token, followed by some numbers suffixed with spaces and symbols. After that, we have “anything that’s not a digit up to the end of the title”.

We perform this tokenization for every title in the podcast. Our next step is to figure out which tokens at the start of a title might be a prefix.

To figure this out, we need to look at all of the episodes together: by themselves, each individual title doesn’t contain enough information for us to decide whether something is a prefix or not. If we compare the episodes, we can find patterns, and use those patterns to identify a common prefix.

Where this becomes important is in podcasts that change their prefix. This is very common: for season 1, a podcast might be formatted like “Ep1: Title” but season 2 is formatted as “S2E1 - Title”. Bonus episodes which don’t have a season or episode number might have their own format as well.

We identify these patterns by creating a tree structure: each node in the tree represents a token type. Each “layer” of the tree represents the next token in a title. This data structure is similar in spirit to a trie. Unlike a trie, however, we are storing token types rather than characters.

To build the tree, we take each title and its tokenized representation as a 2-tuple (a “title pair”) and put it into the root of the tree. If we’ve seen another title pair that starts with the same token type, we create a new sub-node of the root, put the new and existing pairs into the sub-node, and remove the two from the root. We repeat this process for every title pair. The result ends up looking like this:

The constructed tree
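The construction described above might be sketched like this. The Node class and build_tree function are hypothetical names, the token type sequences are written out by hand rather than produced by a lexer, and the real code may organize the tree differently.

```python
class Node:
    def __init__(self, token_type=None):
        self.token_type = token_type  # None for the root
        self.children = {}            # token type -> Node
        self.pairs = []               # title pairs that currently rest at this node

    def insert(self, pair, depth=0):
        _title, types = pair
        if depth == len(types):
            # No tokens left: the pair stops here.
            self.pairs.append(pair)
            return
        next_type = types[depth]
        child = self.children.get(next_type)
        if child is not None:
            child.insert(pair, depth + 1)
            return
        # Look for a pair already resting here that continues with the same type.
        for other in self.pairs:
            if len(other[1]) > depth and other[1][depth] == next_type:
                self.pairs.remove(other)
                child = self.children[next_type] = Node(next_type)
                child.insert(other, depth + 1)
                child.insert(pair, depth + 1)
                return
        # Nothing to share a branch with yet, so the pair waits at this node.
        self.pairs.append(pair)


def build_tree(title_pairs):
    root = Node()
    for pair in title_pairs:
        root.insert(pair)
    return root


pairs = [
    ("Ep1: Pilot", ("letters", "number_symbol", "rest")),
    ("Ep2: Second", ("letters", "number_symbol", "rest")),
    ("Just a title", ("rest",)),
]
root = build_tree(pairs)
print(list(root.children), root.pairs)
```

A title that shares nothing with the others stays at the root, while titles with a common shape get pushed down the same branch one token type at a time.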

By constructing this tree, we can now look for “branches” that carry enough weight to be considered a pattern. The tree here in our simplified example has two branches: green-indigo-orange and green-red-green-red-orange.

We have a few rules for deciding which branches are meaningful enough:

  1. There must be at least two nodes in a branch.
  2. After the third node, there must be at least two title pairs.
  3. There must be a number in one of the first four tokens of a branch.

In this case, both of our branches qualify: both have at least three nodes, and both have at least two title pairs.
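One possible reading of these three rules, expressed as a single check. The function name, the token type names, and the way the title pairs are counted are assumptions made for illustration. The rule about the third node in particular could be interpreted in more than one way.

```python
# Hypothetical names for the token types that contain digits.
NUMBER_TYPES = {"number", "number_space", "number_symbol"}


def is_meaningful(branch, pairs_past_third_node):
    """Decide whether a branch (a sequence of token types) looks like a pattern.

    pairs_past_third_node is the number of title pairs that are still on the
    branch once we are past its third node.
    """
    return (
        len(branch) >= 2                                     # rule 1
        and pairs_past_third_node >= 2                       # rule 2
        and any(t in NUMBER_TYPES for t in branch[:4])       # rule 3
    )


print(is_meaningful(("letters", "symbol", "number_symbol"), 2))
```

The number requirement is what keeps ordinary titles that merely begin with the same kinds of words from being mistaken for a numbering prefix.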

In non-invented examples, these trees can be quite hairy! Inconsistencies between episodes (e.g., leaving out a space or a colon or other symbol) cause the tree to have more branches (and branches to have branches). However, we’re able to infer that while a main branch might have multiple smaller branches, all of the title pairs in the smaller branches can be rolled up to the main branch.

We now walk the tree. We start at the root and go through each of the branches in order. At each node, we decide whether—at that point—we’ve found a prefix. When we’ve found one, we mark it and move on to the next branch. When we’re finished, we have a list of all of the prefixes, formatted as token types (illustrated above as colors).

Now, the only thing left to do is apply this back to our original goal: stripping prefixes. For each title, we look to see whether the tokenized representation of the title starts with one of the prefixes that we identified. “Does green-indigo-orange-gray start with either green-indigo-orange or green-red-green-red-orange?” “Yes, green-indigo-orange!” “Great, take the ‘green-indigo-orange’ part of your title and offer to trim it.”

Through this process, we’ve identified that the pattern of “characters-symbols-numbers+symbols+spaces” is a prefix (because it appears as a prefix for a number of other titles), and we can offer to remove that automatically.
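The final matching step could look something like the sketch below. It assumes tokens are (token type, text) pairs and that prefixes are tuples of token types. The function name, the token type names, and the choice to try longer prefixes first are ours, not necessarily what the tool does.

```python
def suggest_title(tokens, prefixes):
    """Return the title with a detected prefix trimmed, or None if nothing matches."""
    types = tuple(token_type for token_type, _text in tokens)
    # Assumption: when several prefixes match, prefer the longest one.
    for prefix in sorted(prefixes, key=len, reverse=True):
        # Require at least one token to remain so we never suggest an empty title.
        if len(prefix) < len(types) and types[: len(prefix)] == prefix:
            return "".join(text for _type, text in tokens[len(prefix):])
    return None


known_prefixes = [
    ("letters", "symbol", "number_symbol"),
    ("letters", "number", "letters", "number", "number_symbol"),
]
title_tokens = [
    ("letters", "ABTS"),
    ("symbol", "-"),
    ("number_symbol", "12: "),
    ("rest", "The title"),
]
print(suggest_title(title_tokens, known_prefixes))
```

Returning None when nothing matches lets the interface simply leave the suggestion button off for that episode.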

Wrap up

All of the code for this process is open source and available on GitHub.

Pinecast is a podcast hosting service that offers great features like the episode renumbering tool. We’re releasing new features just like this all the time, making it easier to manage your show and keep it up to date with the latest and greatest recommendations from podcast authorities. Check it out at https://www.pinecast.com/
