Podcasts, Hosts, and Encoding

Published in Pinecast, Jun 15, 2017

For a long time now, I’ve been wanting to write an article about how to properly encode an MP3 file for use in a podcast. For something so fundamental to the medium, it is a surprisingly difficult task: most easy-to-use tools are expensive, and free tools are overcomplicated and confusing to set up.

Recently I got a Pinecast support ticket from a customer with quite large MP3 files. The files were larger than his plan would allow, and he was out of surge. When I made some suggestions about his encoding settings, I discovered that the tool he was using didn’t even have those options! Beyond my surprise that a paid piece of software lacked these features, I was astounded that there was no good, free software that made this process simple. The software that I did find wasn’t cross-platform, and couldn’t be used by my customer anyway.

I set out to create a tool to do this work for you, now released in beta as the Pinecoder, but I needed to know how exactly we should be crafting these MP3s. I wanted to have data to back up my choices for what a properly encoded MP3 looks like. To do that, I crawled the list of top podcasts on iTunes (or Apple Podcasts, depending on who you ask) and did a bit of an analysis.

Getting Data to Analyze

First, I scraped the iTunes charts with a simple bit of JavaScript in my devtools console:

Array.from($('[target=itunes]')).map(a => a.href)

This yielded the URLs of each show’s iTunes page. I copied the resulting JSON to a text file. An iTunes podcast URL looks like this:

https://itunes.apple.com/us/podcast/sworn/id1243525941?mt=2

The important part here is the id... bit towards the end. That’s the ID of the show in iTunes, and allows us to get the feed URL by passing it to this endpoint:

https://itunes.apple.com/lookup?id=1243525941&entity=podcast

Notice the ID from above in the id query string parameter. This endpoint returns a JSON blob containing the URL of the podcast’s feed.

Next, I simply extracted the URL, MIME type, and content length of the first <enclosure> tag in each feed. From here, it’s simple to cURL each file.

import json
import re
from urllib.request import urlopen

import requests
from defusedxml.minidom import parseString as parseXMLString

# RegExp to extract IDs from iTunes URLs
id_extractor = re.compile(r'(?:/id)(\w+)\b')

# top100.json is the JSON from itunescharts.com
top100 = json.load(open('top100.json'))
for itunes_url in top100:
    itunes_id = id_extractor.search(itunes_url).group(1)
    lookup_url = 'https://itunes.apple.com/lookup?id=%s&entity=podcast' % itunes_id
    output = requests.get(lookup_url).json()
    feed_url = output['results'][0]['feedUrl']
    print('feed:', feed_url)

    feed = requests.get(feed_url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X)'}).text

    try:
        parsed_feed = parseXMLString(feed)
    except Exception as e:
        print(feed_url)
        print(e)
        break
    enclosure = parsed_feed.getElementsByTagName('enclosure')[0]
    audio_url = enclosure.getAttribute('url')
    print('audio:', audio_url)

Interesting Metadata

The output of the script above looks something like this:

feed: https://url.to/the/feed
audio: https://url.to/the/audio
...

Using a combination of grep, awk, and wc, I did all of the counting shown below.
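The same kind of counting can be sketched in Python. This is a minimal illustration rather than the commands I actually ran: the function name `count_hosts` and the sample lines (with their example.* hostnames) are made up for the example.

```python
from collections import Counter
from urllib.parse import urlparse

def count_hosts(output_lines, prefix='audio:'):
    """Tally the hostname of every URL on lines starting with `prefix`."""
    hosts = Counter()
    for line in output_lines:
        if line.startswith(prefix):
            url = line[len(prefix):].strip()
            hosts[urlparse(url).hostname] += 1
    return hosts

# Hypothetical sample of the script's output, for illustration only.
sample_output = [
    'feed: https://feeds.example.com/show-a',
    'audio: https://media.example.org/show-a/ep1.mp3',
    'feed: https://feeds.example.com/show-b',
    'audio: https://media.example.org/show-b/ep9.mp3',
    'audio: https://cdn.example.net/show-c/ep3.mp3',
]
host_counts = count_hosts(sample_output)
```

Calling `host_counts.most_common()` then lists the hosts from most to least frequent, which is roughly what a `sort | uniq -c` pipeline would give you.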

The first interesting tidbit is that one of the feeds failed to download because I was accessing it using Python Requests’ default user agent string. This is probably a bad practice on the part of the website owner (feeds are cheap to generate, and banning programmatic access to them is unwise). It was simple to update the script to pass a custom user agent string.

Next, out of the 100 podcasts I crawled, 44 of them are using Podtrac. Podtrac offers podcast analytics, though it’s unclear how exactly they do a better job than other analytics platforms. In my own experience, the numbers behind Podtrac are useful because they’re trusted by advertisers, though perhaps not because of technical merit.

Almost a quarter of the podcasts were hosted on Libsyn. 15 of the 100 are hosted by NPR. Three of the 100 are using SoundCloud (I honestly expected more). Only one podcast of the bunch uses Podbean. These numbers might be higher because of feed proxies like Feedburner, though I did not investigate further.

22 of the 100 podcasts were behind Feedburner. This is very curious, as Feedburner provides little value when used with most modern hosting services. Feedburner prevents hosts from measuring subscribers in any meaningful way, since it proxies the feed. At Pinecast, we recommend that users do not use Feedburner in conjunction with our service or any other podcast hosting service. From a technical perspective, Feedburner is an old Google acquisition (2007) that has fallen by the wayside — little has gone into its upkeep, and given Google’s track record with minor services, it could go away at any time.

Analyzing the Audio

I went about investigating the encoding of the audio files. To do this, I used the Python library ffprobe3, which conveniently wraps the ffmpeg tool ffprobe. Running ffprobe on a file produces output like this:

> ffprobe audio/246.mp3 -hide_banner -show_streams
Input #0, mp3, from 'audio/246.mp3':
  Metadata:
    title           : #246: My Pen Pal
    artist          : This American Life
    album_artist    : Chicago Public Media
    TS2             : Chicago Public Media
    genre           : Podcast
    comment         : © 1995-2017 Ira Glass
    TSP             : This American Life
    date            : 2017
  Duration: 01:00:16.55, start: 0.025056, bitrate: 64 kb/s
    Stream #0:0: Audio: mp3, 44100 Hz, mono, s16p, 64 kb/s
    Metadata:
      encoder         : LAME3.99r
    Side data:
      replaygain: track gain - -3.300000, track peak - unknown, album gain - unknown, album peak - unknown, 
    Stream #0:1: Video: png, rgb24(pc), 3000x3000, 90k tbr, 90k tbn, 90k tbc
    Metadata:
      comment         : Other
[STREAM]
index=0
codec_name=mp3
codec_long_name=MP3 (MPEG audio layer 3)
profile=unknown
codec_type=audio
codec_time_base=1/44100
codec_tag_string=[0][0][0][0]
codec_tag=0x0000
sample_fmt=s16p
sample_rate=44100
channels=1
channel_layout=mono
bits_per_sample=0
id=N/A
r_frame_rate=0/0
avg_frame_rate=0/0
time_base=1/14112000
start_pts=353600
start_time=0.025057
duration_ts=51036733440
duration=3616.548571
bit_rate=64000
max_bit_rate=N/A
bits_per_raw_sample=N/A
nb_frames=N/A
nb_read_frames=N/A
nb_read_packets=N/A
TAG:encoder=LAME3.99r
[SIDE_DATA]
side_data_type=Replay Gain
[/SIDE_DATA]
[/STREAM]
...

One initial problem that I encountered is ffprobe3 being unable to parse the [SIDE_DATA] blocks at the end of the above snippet. I don’t really care about them for the purposes of this article, so I monkey-patched ffprobe3 to ignore them:

import ffprobe3

orig = ffprobe3.ffprobe.FFStream.__init__
def replacement(self, data_lines):
    data_lines = [x for x in data_lines if not x.startswith('[')]
    return orig(self, data_lines)
ffprobe3.ffprobe.FFStream.__init__ = replacement

Simple enough.
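An alternative that avoids the parsing problem altogether is to ask ffprobe for JSON and read it directly. The sketch below is not what I used for this analysis. It assumes ffprobe is on your PATH, and the helper names `probe_streams` and `first_audio_stream` are invented for the example.

```python
import json
import subprocess

def probe_streams(path):
    """Run ffprobe on `path` and return its stream list as parsed JSON."""
    result = subprocess.run(
        ['ffprobe', '-v', 'error', '-print_format', 'json', '-show_streams', path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)['streams']

def first_audio_stream(streams):
    """Pick the first audio stream and normalize the fields of interest."""
    for stream in streams:
        if stream.get('codec_type') == 'audio':
            bit_rate = stream.get('bit_rate')
            return {
                'codec': stream.get('codec_name'),
                'channels': stream.get('channels'),
                # ffprobe reports bit_rate as a string in its JSON output
                'bitrate_kbps': int(bit_rate) // 1000 if bit_rate else None,
            }
    return None

# Hand-written stand-in for ffprobe's JSON, mirroring the values shown above.
sample_streams = [
    {'index': 0, 'codec_name': 'mp3', 'codec_type': 'audio', 'channels': 1, 'bit_rate': '64000'},
    {'index': 1, 'codec_name': 'png', 'codec_type': 'video'},
]
summary = first_audio_stream(sample_streams)
```

With a real file, `first_audio_stream(probe_streams('audio/246.mp3'))` would skip the embedded cover art stream and return only the audio details.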

I also used exiftool, a common file analysis tool, to get more information about the audio files. Ffprobe doesn’t report information about constant versus variable bitrate or the channel mode used by a file, so exiftool was necessary. I simply used subprocess.Popen to invoke exiftool and scrape relevant information from stdout.

I used Python’s collections.Counter to tally up the bitrates, max bitrates, and codecs of each of the audio files. I also used csv.writer to output a CSV for use in a spreadsheet.
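Here is roughly what that tallying looks like. The records below are hypothetical stand-ins for the values pulled from ffprobe and exiftool, and an in-memory buffer takes the place of the real output file.

```python
import csv
import io
from collections import Counter

# Hypothetical per-file results; the real values came from ffprobe and exiftool.
files = [
    {'name': 'a.mp3', 'codec': 'mp3', 'bitrate': 128000, 'channels': 2},
    {'name': 'b.mp3', 'codec': 'mp3', 'bitrate': 64000, 'channels': 1},
    {'name': 'c.mp3', 'codec': 'mp3', 'bitrate': 128000, 'channels': 2},
]

# Tally how many files use each bitrate and each codec.
bitrates = Counter(f['bitrate'] for f in files)
codecs = Counter(f['codec'] for f in files)

# Write one row per file; StringIO stands in for a file opened with newline=''.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['name', 'codec', 'bitrate', 'channels'])
for f in files:
    writer.writerow([f['name'], f['codec'], f['bitrate'], f['channels']])
csv_text = buffer.getvalue()
```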

Codecs

First, I wanted to look at the most basic information: which codec the files used. I wasn’t sure what to expect here. On one hand, I fully expected 100% of the audio files to be encoded as MP3. MP3 is overwhelmingly dominant in podcasting. On the other hand, I expected to see some unusual entries, like AAC.

The result was whelming. 99 of the 100 files were MP3. The last, as it turns out, was an M4V file containing H.264 video and AAC audio. It contained a recording of the Apple WWDC event, unsurprisingly from the Apple Keynotes feed.

It was a bit surprising to see a video podcast coming from iTunes. Support for video in podcasts is dodgy at best, but I suppose if you’re Apple and you only list it in Apple Podcasts, you can be fairly sure it’s going to play on your users’ devices.

For the sake of sanity, I excluded the video file from the rest of the analysis.

Conclusion: Use MP3 to encode your podcasts. Avoid AAC.

CBR vs. VBR

The next thing I was interested in is the type of encoding used for those MP3 files. There are two options:

Constant Bitrate (CBR)
Variable Bitrate (VBR)

The difference is simple: with CBR encoding, one second of audio will always take the same amount of data, regardless of where in the audio file you find it. For example, a 128kbps audio file will take 128 kilobits to store one second of audio. Want to find the start of the audio at 00:00:05? Skip to the 5 × 128 kilobits (640 kilobits) mark and you’ll find it. VBR, on the other hand, allows the encoder to turn the bitrate up and down depending on what the audio contains. A second of silence might be encoded at low quality while music immediately after it would be bumped up to a higher quality. Adjusting the bitrate dynamically allows the files to be smaller by only increasing quality for audio that needs it.

With a CBR file, skipping forward or backward is easy because you can calculate exactly where to jump to. With VBR, skipping ahead ten seconds in a nominally 128kbps file might mean skipping up to 1280 kilobits, but that might be too much if the quality is lowered within those ten seconds. This also means that the duration of the audio file can’t be determined by looking at the file size. With CBR, you simply divide the file size by the bitrate: that’s the number of seconds long the audio is. With VBR, the same calculation can be off substantially, because no single bitrate describes the whole file. Instead, VBR-encoded files need to list their duration in the file’s metadata, though this can be complicated and difficult to do with most encoding tools.
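The CBR arithmetic is simple enough to write down. This sketch assumes one kilobit is 1000 bits and ignores ID3 tags and frame boundaries, so treat the numbers as approximations. The function names are mine.

```python
def cbr_byte_offset(seconds, bitrate_kbps):
    """Byte offset of `seconds` into CBR audio data (headers and tags ignored)."""
    bits = seconds * bitrate_kbps * 1000
    return bits // 8

def cbr_duration_seconds(audio_bytes, bitrate_kbps):
    """Duration of CBR audio data, given its size in bytes."""
    return audio_bytes * 8 / (bitrate_kbps * 1000)

# 5 seconds into a 128kbps file: 640 kilobits, or 80,000 bytes.
offset = cbr_byte_offset(5, 128)

# 57,600,000 bytes of 128kbps audio works out to exactly one hour.
duration = cbr_duration_seconds(57_600_000, 128)
```

Neither function has a VBR equivalent, which is why VBR files depend on duration metadata instead.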

I used exiftool to extract this information, and it was done very simply: it will only output the string “VBR” if the MP3 is encoded with VBR:

from subprocess import Popen, PIPE

# Run the tool
proc = Popen('exiftool "%s"' % path, stdout=PIPE, shell=True)

# Check for VBR in the stdout
stdout_lines = iter(proc.stdout.readline, b'')
is_vbr = any(': VBR' in a.decode('UTF-8') for a in stdout_lines)

# Close the output
proc.stdout.close()

98 of the 99 audio files were CBR-encoded. Only one was VBR-encoded.

Update: Due to a bug in the analysis script, this post originally claimed that fifteen podcasts used VBR encoding. After correcting the code, only one was found to be VBR-encoded.

In my own experience, VBR can provide dramatic savings over CBR. VBR is well-documented as a good practice. If you don’t believe me, take Jeff Atwood’s word for it.

Update: This is a controversial viewpoint. You should read my followup post about VBR.

Conclusion: Consider using VBR if the trade-offs aren’t offensive to you.

Channels

Within an audio file, a channel is something akin to an “audio feed.” Each channel produces sound. A mono audio file has a single channel, while a stereo audio file has two (one for each ear). I have always suggested that podcasters encode their content as mono rather than stereo. The reasoning is simple: most listeners just won’t be able to tell, you probably aren’t mixing your audio for two channels, and increasing the channel count requires more bits to achieve the same quality. Even still, three quarters of the files were encoded with two channels.

How the number of channels affects file size is a bit complicated. A two-channel audio file encoded at 128kbps consumes the same amount of space as a one-channel audio file encoded at 128kbps. Each channel in the two-channel stereo file, though, effectively gets half of the bitrate — the result is lower quality audio. The rules are a bit fuzzy here: besides stereo (where each ear’s audio is a channel) and mono (where there is a single channel for both ears’ audio) there is a channel mode called “joint stereo.” Joint stereo generally stores the sum of the left and right channels and the difference between the two. Since the left and right channels are likely very similar, more bits can be spent on the sum and fewer bits can be spent on the difference. The result is — usually — higher quality audio at the same bitrate as “vanilla” stereo encoding.
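To make the sum-and-difference idea concrete, here is a toy version operating on raw integer samples. Real MP3 encoders apply the transform with scaling inside the codec, not on plain sample lists, so this is only a simplified illustration. The sample values are invented.

```python
def to_mid_side(left, right):
    """Convert left/right samples into (sum, difference) lists."""
    mid = [l + r for l, r in zip(left, right)]
    side = [l - r for l, r in zip(left, right)]
    return mid, side

def from_mid_side(mid, side):
    """Recover the original left/right samples from sum and difference."""
    left = [(m + s) // 2 for m, s in zip(mid, side)]
    right = [(m - s) // 2 for m, s in zip(mid, side)]
    return left, right

# Hypothetical integer samples: the two channels are nearly identical.
left = [100, 220, -340, 75]
right = [98, 221, -338, 75]
mid, side = to_mid_side(left, right)
```

The `side` list holds only tiny values, which is why an encoder can spend very few bits on it. When both channels are identical, `side` is all zeroes.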

At the end of the day, channels are a complicated matter. Minimizing the number of channels is ultimately best, but how do we know for sure? There are a few scenarios for how an MP3’s channels can be put together:

One Channel: This is simplest, and easiest to get right.
Two Distinct Channels: This is simple, but bad for podcasts. A stereo track requires twice the space to achieve the same quality as the equivalent mono audio.
Two Identical Channels (Faux-Stereo): This is almost certainly the result of a mistake. Faux-stereo is when a single audio channel is duplicated as the left and right channels of a stereo MP3. A faux-stereo audio file is audibly indistinguishable from a mono audio file, but is encoded as two channels instead of one.

Let’s figure out what’s going on with all of these two-channel audio files and see whether there are any obvious errors.

Testing for Faux-Stereo

It turns out that there are exactly zero tools for checking whether a two-channel MP3 file is faux-stereo. One such tool exists for WAV files, though: zrtstr from indiscipline on GitHub. For this experiment, the plan is to convert MP3 to WAV, then use zrtstr to compare the two channels for faux stereo.

The first step is getting zrtstr to run. Since I’m not on Windows, there is no binary. I installed the Rust compiler, but the compilation failed with an error in one of the dependencies. After some investigation, I found that the offending code had been rewritten in a newer version of the package. Bumping the version and deleting the Cargo.lock file made the compilation process succeed.

Once I got the decoding and analysis process automated with another Python script, I was surprised at the results. Of the 99 audio files, 5 of them were indeed faux stereo as reported by zrtstr.
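For the curious, the core of a faux-stereo check can be approximated with Python’s standard wave module. This is a sketch of the idea and not zrtstr’s actual implementation. It assumes 16-bit PCM stereo WAV input, and both `is_faux_stereo` and `make_wav` are names I made up for the example.

```python
import io
import struct
import wave

def is_faux_stereo(wav_file, tolerance=0):
    """Return True if a 16-bit stereo WAV has matching left/right channels."""
    with wave.open(wav_file, 'rb') as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError('expected 16-bit stereo PCM')
        while True:
            frames = wav.readframes(4096)
            if not frames:
                return True
            # Samples are interleaved: left, right, left, right, ...
            samples = struct.unpack('<%dh' % (len(frames) // 2), frames)
            for left, right in zip(samples[0::2], samples[1::2]):
                if abs(left - right) > tolerance:
                    return False

def make_wav(pairs):
    """Build an in-memory 16-bit stereo WAV from (left, right) sample pairs."""
    buffer = io.BytesIO()
    with wave.open(buffer, 'wb') as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)
        wav.setframerate(44100)
        wav.writeframes(b''.join(struct.pack('<hh', l, r) for l, r in pairs))
    buffer.seek(0)
    return buffer
```

`is_faux_stereo` accepts either a path or a file-like object, so `is_faux_stereo('episode.wav')` works on a file decoded from MP3, while `make_wav` is only there to build small test inputs.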

The first file was a 128kbps stereo MP3. To my surprise, the file sounded fine and was not an unreasonable size. The trick here is that the file used joint stereo: when a faux-stereo file is encoded as joint stereo, the sum of the left and right channels is numerically double that of a mono channel, and the difference channel is just a bunch of zeroes. That second channel is essentially silence, which is easily compressed to almost nothing. In the end, there is a minor amount of overhead from encoding the second joint stereo channel, but not enough to matter too much. The producers could probably increase the quality of the audio marginally by simply encoding as mono instead, or encode as mono and decrease the bitrate slightly.

The second, third, and fourth files were the same. The fifth, though, exhibited the exact characteristics of faux-stereo. It uses the “vanilla” stereo channel mode with identical left and right channels. The file is small, clocking in at around 20MB at 256kbps, but this is because the audio itself was only a few minutes long. Encoded at half the bitrate with a single channel, it would easily fit in 10MB instead.

Conclusion: Never use vanilla stereo as your channel mode. Joint stereo will produce better quality audio and save heartache if you make a mistake. Mono will never do you wrong.

Almost-Faux-Stereo

It seemed unlikely to me that any podcasts in the iTunes charts would make such a mistake, but seeing that at least one did, I’m inclined to think that there are others.

Looking at the other two-channel audio files, many looked like this going into zrtstr:

857053 / 85705344 [=>---------------------] 1.00 % 65071217067.80/s 0s 
File is not double mono, channels are different!

zrtstr takes chunks of each channel of the file (in blobs of 1% of the file’s duration) and compares them. If it finds any substantial differences, it’ll bail at that point, like in the example above. Consider examples like this, though:

16763124 / 79824431 [======>-----------------------] 21.00 % 30667033.39/s 2s 
File is not double mono, channels are different!

In that instance, we got through 21% of the file (16 megabytes) before we found any differences between the channels! This could mean a few things:

Glitches in the audio or corruption in the file caused a difference between channels.
Ads injected by the hosting platform are stereo, while the rest of the episode is mono.
Certain clips of background music or other imported audio are stereo, while the rest of the episode is mono.
Numeric rounding during the MP3 to WAV conversion led to just enough difference between the channels to cause zrtstr to detect a difference.

zrtstr has a function which allows you to specify the amount of tolerance in amplitude difference that’s allowed when comparing channels. I increased the tolerance by 10x and some files progressed further, but the tool found no additional faux-stereo files.

Because some of the files progressed further, it’s not impossible that the stereo component to the two-channel files is background audio. I attempted to note the start and end of background audio to try to pinpoint where stereo audio might start or end, but many podcasts blend multiple tracks together making it very difficult to identify by ear. I could not find any audio files (through manual listening) that appeared to be stereo as a result of an injected advertisement.

Of the 68-odd non-faux-stereo two-channel audio files, 20 did not contain true stereo audio for at least one percent of the file. That is, zrtstr made it more than one percent of the way through the file before it found a difference between the left and right channels. One audio file made it 77% through before zrtstr found a difference!
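The “how far did it get” measurement can be expressed as a small function over two lists of samples. This is my own sample-level approximation, not zrtstr’s code (which works in 1% chunks), and the sample values are invented.

```python
def matching_prefix_percent(left, right, tolerance=0):
    """Percentage of samples read before the channels first differ."""
    total = min(len(left), len(right))
    if total == 0:
        return 100.0
    for index in range(total):
        if abs(left[index] - right[index]) > tolerance:
            return 100.0 * index / total
    return 100.0

# Hypothetical samples: the first two match, the third differs by 1,
# and the fourth differs widely.
left = [10, 20, 30, 40]
right = [10, 20, 31, 90]
strict = matching_prefix_percent(left, right)                # stops at the third sample
lenient = matching_prefix_percent(left, right, tolerance=1)  # stops at the fourth sample
```

Raising the tolerance lets the comparison run further into the file, which matches what I saw when I loosened zrtstr’s threshold.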

Conclusion: Encode your audio as mono unless your primary source audio is stereo. If you do not have multiple microphones or pan tracks to one channel or the other, stereo encoding will only decrease the quality of your output.

Let’s talk about bitrate

As mentioned above, bitrate represents the number of bits required to encode one second of audio. Calculating the effects of bitrate on file size is difficult, and determining its impact on quality is tricky as well. Bitrate isn’t a great measure of audio quality. Having a second channel decreases the quality of the audio at a particular bitrate, but the amount that it decreases depends on the channel mode and the contents of the audio itself.

In my analysis, the most common bitrate was 128kbps with a majority of 57 audio files. 192kbps came in with 12 and 64kbps had 10. A number of other strange bitrates had fewer than five each.

I also broke down bitrates by the number of channels. 54 of the 73 two-channel files were 128kbps (54/57 128kbps audio files were two-channel). 9 of the two-channel files were 192kbps (or 75% of 192kbps files). Unexpectedly, 64kbps was the dominant bitrate for single-channel audio files with eight files, followed by 48kbps, 128kbps, 96kbps, and 192kbps.
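A breakdown like this falls out of a few Counter objects. The pairs below are hypothetical and are not the data set from this post. `describe` is a helper name I chose for the example.

```python
from collections import Counter

# Hypothetical (channels, bitrate_kbps) pairs, not the article's data set.
files = [(2, 128), (2, 128), (2, 128), (1, 128), (2, 192), (1, 64), (1, 64), (1, 48)]

pairs = Counter(files)                             # files per (channels, bitrate)
per_channels = Counter(ch for ch, _ in files)      # files per channel count
per_bitrate = Counter(rate for _, rate in files)   # files per bitrate

def describe(channels, bitrate):
    """Return (count, share of that channel count, share of that bitrate)."""
    count = pairs[(channels, bitrate)]
    return (count,
            count / per_channels[channels],
            count / per_bitrate[bitrate])

stereo_128 = describe(2, 128)
```

In this toy data, three of the four two-channel files are 128kbps, and three of the four 128kbps files are two-channel, so `stereo_128` holds both shares side by side.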

As-is, this doesn’t mean a lot. The juicy details are in the breakdown of bitrates by channel mode. That will tell us a few things:

Are particular channel modes biased towards higher or lower quality?
Are certain bitrates necessary to compensate for quality issues introduced by the chosen channel mode?
What are the most common bitrates?

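To break this down, I’ve created something of a histogram showing the percentage of audio files encoded with each channel mode at all of the notable bitrates that I encountered. That sounds crazy, but it’ll make a bit more sense in chart form:

[Chart: percentage of audio files at each notable bitrate, by channel mode: mono in yellow, stereo in green, joint stereo in blue]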

What we can see here is that mono audio files (yellow) are generally encoded at a lower bitrate. Over half of the mono audio files are encoded at less than 100kbps. Curiously, there are a few encoded at 192kbps. With a single channel, this is considered “transparent” audio: that is, you couldn’t tell the difference between 192kbps single-channel MP3 audio and the uncompressed original. This is ostensibly overkill.

Stereo audio files (green) are exclusively above 100kbps, which makes sense since the bitrate is effectively halved. Only 12% of files were encoded as vanilla stereo. Given the availability of joint stereo, there is no good reason to use vanilla stereo encoding.

Joint stereo audio files (blue) are almost exclusively encoded at 128kbps. This is a great balance of quality and file size. Joint stereo quality degrades more slowly than vanilla stereo quality as you decrease the bitrate. The handful of files encoded at bitrates greater than 128kbps (11 of 61) could probably be safely lowered to 128kbps without any substantial decrease in quality.

To answer our questions from above:

Are particular channel modes biased towards higher or lower quality?
It certainly seems that way. Mono audio overwhelmingly leans towards lower bitrates. It’s tough to know whether this is just savvy producers using best practices or whether it’s folks just using the lowest bitrate that sounds good. Seeing that vanilla stereo is exclusively 128kbps and up, though, the theory is all but confirmed.
Are certain bitrates necessary to compensate for quality issues introduced by the chosen channel mode?
Yes, absolutely. There’s no reason to encode anything at 256kbps MP3 for podcasting purposes, unless you’re not encoding your audio correctly. The stereo and joint stereo files that use 192kbps bitrates and up (16% of stereo and 12% of joint stereo) are clearly attempting to compensate for their channel mode choice.
What are the most common bitrates?
128kbps is the hands-down de facto standard across the board. For mono audio, 64kbps and 96kbps are also quite popular, though 64kbps might be a little rough on the ears for discerning listeners.

Conclusion: Use mono encoding with a bitrate at 128kbps or 96kbps, or joint stereo encoding at 128kbps. Don’t use vanilla stereo.

Wrap Up

Without the raw source audio, it’s impossible to draw many real conclusions about whether any of the files are optimally encoded. I could, for example, re-encode the CBR files as VBR and vice versa, but this would yield inaccurate results, as the input for each re-encoded file would be lower quality than the source the original was encoded from.

That said, given the results of my analysis and research, I’d make the following recommendations:

Use a single channel if possible.
Use VBR if you understand and don’t mind the tradeoffs.
If you have two-channel audio, check whether you really need it. If you do, use the joint stereo channel mode.
Use 128kbps for your bitrate. If you don’t mind a few stray audio artifacts and you’ve got single-channel audio, give 96kbps a try.

All of this has fed into the choices we’ve made with Pinecoder, and will be our guide as we continue to improve it in the future.

I started this exercise thinking I knew how MP3 encoding worked, and how podcasters were encoding their content. I left with a completely new understanding and appreciation for all of this, and hopefully the technical fruits of this post will help podcasters and other tinkerers alike.

If you enjoyed this post and want to check out Pinecast, you can sign up with no credit card required to try it out for as long as you like. If you decide that a paid plan is right for you, you can use the code pinecoder for 50% off your first two months of service on any plan through the end of July.