This lesson has passed peer-review! See the publication in JOSE.

Defensive programming

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • How can I make my programs more reliable?

Objectives
  • Explain what an assertion is.

  • Add assertions that check the program’s state is correct.

  • Debug Python scripts using the pdb library.

  • Identify sources of more advanced lessons on code testing.

Scientist’s nightmare

If you needed any motivation to learn and employ the principles of defensive programming, look no further than this article. It documents the experience of a researcher who had to retract five published papers - three of which were in Science - because his code had inadvertently switched the rows and columns of a data table.

In the code/ directory, we have written plot_precipitation_climatology_mask.py, which allows one to mask either land or ocean. How can we be sure that it’s producing reliable results?

The first step toward getting the right answers from our programs is to assume that mistakes will happen and to guard against them. This is called defensive programming, and the most common way to do it is to add assertions to our code so that it checks itself as it runs. An assertion is simply a statement that something must be true at a certain point in a program. When Python sees one, it evaluates the assertion’s condition. If it’s true, Python does nothing, but if it’s false, Python halts the program immediately and prints the error message if one is provided.

To demonstrate an assertion in action, consider this piece of code that halts as soon as the loop encounters a rainfall observation value that isn’t positive:

rainfall_obs = [1.5, 2.3, 0.7, -0.2, 4.4]
total = 0.0
for ob in rainfall_obs:
    assert ob >= 0.0, 'Rainfall observations should only contain positive values'
    total += ob
print('total rainfall is:', total)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-19-33d87ea29ae4> in <module>()
      2 total = 0.0
      3 for ob in rainfall_obs:
----> 4     assert ob > 0.0, 'Rainfall observations should only contain positive values'
      5     total += ob
      6 print('total rainfall is:', total)

AssertionError: Rainfall observations should only contain positive values

Programs like the Firefox browser are full of assertions: 10-20% of the code they contain are there to check that the other 80-90% are working correctly.

To see how assertions might be useful in the context of the plot_precipitation_climatology.py script, let’s try the following at the command line:

$ python plot_precipitation_climatology_mask.py data/pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412.nc JJA pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412-JJA-clim_land-mask.png --mask data/sftlf_fx_ACCESS-CM2_historical_r1i1p1f1_gn.nc Land

If we view the resulting image, we can see that the ocean has been masked, even though we specified the land at the command line.

Ocean masked rainfall plot

When confronted with perplexing code behaviour like this, it can be useful to insert a tracer into your scripts using the Python debugger:

import pdb

...
def apply_mask(darray, sftlf_file, realm):
    """Mask ocean or land using a sftlf (land surface fraction) file.
    
    Args:
      darray (xarray.DataArray): Data to mask
      sftlf_file (str): Land surface fraction file
      realm (str): Realm to mask
    
    """
   
    dset = xr.open_dataset(sftlf_file)
    pdb.set_trace()    
    if realm == 'land':
        masked_darray = darray.where(dset['sftlf'].data < 50)
    else:
        masked_darray = darray.where(dset['sftlf'].data > 50)   
   
    return masked_darray

...

When we run the script, it will stop at the tracer and allow us to interrogate the code:

$ python plot_precipitation_climatology.py data/pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412.nc JJA pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412-JJA-clim_land-mask.png --mask data/sftlf_fx_ACCESS-CM2_historical_r1i1p1f1_gn.nc Land
> /Users/irv033/Desktop/data-carpentry/plot_precipitation_climatology.py(40)apply_mask()
-> if realm == 'land':
(Pdb) print(realm)
Land
(Pdb) 'Land' == 'land'
False

The problem appears to be that Python strings are case sensitive, which means we should have entered land as opposed to Land at the command line. We can fix this issue while in debug mode and then step through the code line by line (using n) to make sure the correct where statement is executed.

(Pdb) realm = 'land'
(Pdb) n
> /Users/irv033/Desktop/data-carpentry/plot_precipitation_climatology.py(41)apply_mask()
-> masked_darray = darray.where(dset['sftlf'].data < 50)

Once we’re satisfied, we can enter c to run the remainder of the script (it’s q to quit at any time).

To avoid making this case sensitive mistake in future, we should now remove the debugging tracer and replace it with an assertion to catch invalid inputs,

...
def apply_mask(darray, sftlf_file, realm):
    """Mask ocean or land using a sftlf (land surface fraction) file.
    
    Args:
      darray (xarray.DataArray): Data to mask
      sftlf_file (str): Land surface fraction file
      realm (str): Realm to mask
    
    """
   
    dset = xr.open_dataset(sftlf_file)
    assert realm in ['land', 'ocean'], """Valid realms are 'land' or 'ocean'"""   
    if realm == 'land':
        masked_darray = darray.where(dset['sftlf'].data < 50)
    else:
        masked_darray = darray.where(dset['sftlf'].data > 50)   
   
    return masked_darray

...

test to make sure it’s working,

$ python plot_precipitation_climatology.py data/pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412.nc JJA pr_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_201001-201412-JJA-clim_land-mask.png --mask data/sftlf_fx_ACCESS-CM2_historical_r1i1p1f1_gn.nc Land
Traceback (most recent call last):
  File "plot_precipitation_climatology.py", line 120, in <module>
    main(args)
  File "plot_precipitation_climatology.py", line 91, in main
    clim = apply_mask(clim, sftlf_file, realm)
  File "plot_precipitation_climatology.py", line 39, in apply_mask
    assert realm in ['land', 'ocean'], """Valid realms are 'land' or 'ocean'"""
AssertionError: Valid realms are 'land' or 'ocean'

and then commit the changes to git and push to GitHub.

$ git add plot_precipitation_climatology.py
$ git commit -m "Added realm check"
$ git push origin master

Testing and continuous integration

An assertion checks that something is true at a particular point in the program. For programs that are more complex (or research critical) than plot_precipitation_climatology.py, it’s a good idea to take the next step and check the overall behavior of entire pieces (or units) of code. Related concepts like unit testing and continuous integration are beyond the scope of this lesson, but Research Software Engineering With Python has a chapter on testing that is well worth a read.

Add your own assertions

Add some more assertions to your copy of plot_precipitation_climatology.py. Once you’re done, commit the changes to git and push to GitHub.

Solution

There are many examples of assertions that could be added, but the most critical is to check the units of the input data before converting from kg m-2 s-1 to mm day-1.

...

def convert_pr_units(darray):
    """Convert kg m-2 s-1 to mm day-1.
   
    Args:
      darray (xarray.DataArray): Precipitation data
   
    """
   
    assert darray.units == 'kg m-2 s-1', "Program assumes input units are kg m-2 s-1"
   
    darray.data = darray.data * 86400
    darray.attrs['units'] = 'mm/day'
   
    return darray

...

plot_precipitation_climatology.py

At the conclusion of this lesson your plot_precipitation_climatology.py script should look something like the following:

import pdb
import argparse

import xarray as xr
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np
import cmocean


def convert_pr_units(darray):
    """Convert kg m-2 s-1 to mm day-1.
    
    Args:
      darray (xarray.DataArray): Precipitation data
   
    """
   
    assert darray.units == 'kg m-2 s-1', "Program assumes input units are kg m-2 s-1"

    darray.data = darray.data * 86400
    darray.attrs['units'] = 'mm/day'
   
    return darray


def apply_mask(darray, sftlf_file, realm):
    """Mask ocean or land using a sftlf (land surface fraction) file.
   
    Args:
      darray (xarray.DataArray): Data to mask
      sftlf_file (str): Land surface fraction file
      realm (str): Realm to mask
   
    """
  
    dset = xr.open_dataset(sftlf_file)
    assert realm in ['land', 'ocean'], """Valid realms are 'land' or 'ocean'"""
    if realm == 'land':
        masked_darray = darray.where(dset['sftlf'].data < 50)
    else:
        masked_darray = darray.where(dset['sftlf'].data > 50)   
   
    return masked_darray


def create_plot(clim, model, season, gridlines=False, levels=None):
    """Plot the precipitation climatology.
    
    Args:
      clim (xarray.DataArray): Precipitation climatology data
      model (str): Name of the climate model
      season (str): Season
     
    Kwargs:
      gridlines (bool): Select whether to plot gridlines
      levels (list): Tick marks on the colorbar    
    
    """

    if not levels:
        levels = np.arange(0, 13.5, 1.5)
       
    fig = plt.figure(figsize=[12,5])
    ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180))
    clim.sel(season=season).plot.contourf(ax=ax,
                                          levels=levels,
                                          extend='max',
                                          transform=ccrs.PlateCarree(),
                                          cbar_kwargs={'label': clim.units},
                                          cmap=cmocean.cm.haline_r)
    ax.coastlines()
    if gridlines:
        plt.gca().gridlines()
    
    title = f'{model} precipitation climatology ({season})'
    plt.title(title)


def main(inargs):
    """Run the program."""

    dset = xr.open_dataset(inargs.pr_file)
    
    clim = dset['pr'].groupby('time.season').mean('time', keep_attrs=True)
    clim = convert_pr_units(clim)

    if inargs.mask:
        sftlf_file, realm = inargs.mask
        clim = apply_mask(clim, sftlf_file, realm)

    create_plot(clim, dset.attrs['source_id'], inargs.season,
                gridlines=inargs.gridlines, levels=inargs.cbar_levels)
    plt.savefig(inargs.output_file, dpi=200)


if __name__ == '__main__':
    description='Plot the precipitation climatology for a given season.'
    parser = argparse.ArgumentParser(description=description)
   
    parser.add_argument("pr_file", type=str, help="Precipitation data file")
    parser.add_argument("season", type=str, help="Season to plot")
    parser.add_argument("output_file", type=str, help="Output file name")

    parser.add_argument("--gridlines", action="store_true", default=False,
                        help="Include gridlines on the plot")
    parser.add_argument("--cbar_levels", type=float, nargs='*', default=None,
                        help='list of levels / tick marks to appear on the colorbar')
    parser.add_argument("--mask", type=str, nargs=2,
                        metavar=('SFTLF_FILE', 'REALM'), default=None,
                        help="""Provide sftlf file and realm to mask ('land' or 'ocean')""")

    args = parser.parse_args()
  
    main(args)

Key Points

  • Program defensively, i.e., assume that errors are going to arise, and write code to detect them when they do.

  • Put assertions in programs to check their state as they run, and to help readers understand how those programs are supposed to work.

  • The pdb library can be used to debug a Python script by stepping through line-by-line.

  • Software Carpentry has more advanced lessons on code testing.