Data provenance
Overview
Teaching: 10 min
Exercises: 20 minQuestions
How can keep track of my data processing steps?
Objectives
Automate the process of recording the history of what was entered at the command line to produce a given data file or image.
We’ve now successfully created a command line program - plot_precipitation_climatology.py
-
that calculates and plots the precipitation climatology for a given season.
The last step is to capture the provenance of that plot.
In other words, we need a log of all the data processing steps
that were taken from the intial download of the data file to the end result
(i.e. the PNG image).
The simplest way to do this is to follow the lead of the NCO and CDO command line tools, which insert a record of what was executed at the command line into the history attribute of the output netCDF file. If fact, they were both used in the pre-processing of the data files used in these lessons:
import xarray as xr
accessesm15_pr_file = 'data/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc'
dset = xr.open_dataset(accessesm15_pr_file)
print(dset.attrs['history'])
Tue Jan 12 14:50:35 2021: ncatted -O -a history,pr,d,, pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc
Tue Jan 12 14:48:10 2021: cdo seldate,2010-01-01,2014-12-31 /g/data/fs38/publications/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r1i1p1f1/Amon/pr/gn/latest/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_185001-201412.nc pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc
2019-11-15T04:32:57Z ; CMOR rewrote data to be consistent with CMIP6, CF-1.7 CMIP-6.2 and CF standards.
Fortunately, there is a Python package called cmdline-provenance that creates NCO/CDO-style records of what was executed at the command line. We can use it to add to the command log, before inserting the updated log into the metadata associated with our output image file.
To start, let’s import the cmdline_provenance
library with the other imports at the top of our
plot_precipitation_climatology.py
script,
import cmdline_provenance as cmdprov
and then make the following updates to the original line of code responsible for saving the image to file:
new_log = cmdprov.new_log(infile_logs={inargs.pr_file: dset.attrs['history']})
pdb.set_trace()
plt.savefig(inargs.output_file, metadata={'History': new_log}, dpi=200)
When we execute plot_precipitation_climatology.py
,
cmdprov.new_log
will create a record of what was entered at the command line.
The name of the input precipitation file and its history attribute
have been provided using the infile_logs
argument,
so that cmdprov.new_log
can append the history of that file to the new command log.
The metadata
argument for plt.savefig
has then been used to save the new command log to the image metadata.
To see this in action, we’ve added a Python debugger tracer to the script:
$ python plot_precipitation_climatology.py data/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc SON pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412-SON-clim.png
> /Users/damien/Desktop/data-carpentry/plot_precipitation_climatology.py(98)main()
-> plt.savefig(inargs.output_file, metadata={'History': new_log}, dpi=200)
(Pdb) new_log
'Mon Feb 08 09:45:17 2021: /Users/damien/opt/anaconda3/envs/pyaos-lesson/bin/python plot_precipitation_climatology.py data/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc SON pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412-SON-clim.png\nTue Jan 12 14:50:35 2021: ncatted -O -a history,pr,d,, pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc\nTue Jan 12 14:48:10 2021: cdo seldate,2010-01-01,2014-12-31 /g/data/fs38/publications/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r1i1p1f1/Amon/pr/gn/latest/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_185001-201412.nc pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc\n2019-11-15T04:32:57Z ; CMOR rewrote data to be consistent with CMIP6, CF-1.7 CMIP-6.2 and CF standards. \n'
The log has been successfully updated, so let’s have the debugger continue to the end of the script:
(Pdb) c
Now that we’ve written the command log to our PNG file,
we need a way to view the metadata of image files.
There are a number of different programs available to do this,
but they can often be tricky to install.
Fortunately,
conda
is used to install programs written in many different languages,
not just Python.
There are installation recipes
for a command line program called exiftool
on anaconda.org,
so let’s go ahead and install that:
$ conda install exiftool
Once installed, we can use it to view the metadata associated with our image file:
$ exiftool pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412-SON-clim.png
ExifTool Version Number : 12.17
File Name : pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412-SON-clim.png
Directory : .
File Size : 315 KiB
File Modification Date/Time : 2021:02:08 10:16:56+11:00
File Access Date/Time : 2021:02:08 10:16:58+11:00
File Inode Change Date/Time : 2021:02:08 10:16:56+11:00
File Permissions : rw-r--r--
File Type : PNG
File Type Extension : png
MIME Type : image/png
Image Width : 2400
Image Height : 1000
Bit Depth : 8
Color Type : RGB with Alpha
Compression : Deflate/Inflate
Filter : Adaptive
Interlace : Noninterlaced
Software : Matplotlib version3.3.3, https://matplotlib.org/
History : Mon Feb 08 09:45:17 2021: /Users/damien/opt/anaconda3/envs/pyaos-lesson/bin/python code/plot_precipitation_climatology.py data/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc SON pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412-SON-clim.png.Tue Jan 12 14:50:35 2021: ncatted -O -a history,pr,d,, pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc.Tue Jan 12 14:48:10 2021: cdo seldate,2010-01-01,2014-12-31 /g/data/fs38/publications/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r1i1p1f1/Amon/pr/gn/latest/pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_185001-201412.nc pr_Amon_ACCESS-ESM1-5_historical_r1i1p1f1_gn_201001-201412.nc.2019-11-15T04:32:57Z ; CMOR rewrote data to be consistent with CMIP6, CF-1.7 CMIP-6.2 and CF standards. .
Pixels Per Unit X : 7874
Pixels Per Unit Y : 7874
Pixel Units : meters
Image Size : 2400x1000
Megapixels : 2.4
Now that we’ve successfully added a command log to our PNG image, we might want to think about how our script should handle other image formats (e.g. PDF, EPS)? For PNG files you can pick whatever metadata keys you like (hence we picked “History”), but other formats only allow specific keys. For now we can add an assertion so that the program halts if someone tries to generate a format that isn’t PNG, and in the exercises we’ll add more valid formats to the script.
image_format = inargs.output_file.split('.')[-1])
assert image_format == 'png', 'Only valid output format is .png'
Putting this altogether
(and removing the debugging tracer),
here’s what the main
function in plot_precipitation_climatology.py
looks like:
def main(inargs):
"""Run the program."""
dset = xr.open_dataset(inargs.pr_file)
clim = dset['pr'].groupby('time.season').mean('time', keep_attrs=True)
clim = convert_pr_units(clim)
if inargs.mask:
sftlf_file, realm = inargs.mask
clim = apply_mask(clim, sftlf_file, realm)
create_plot(clim, dset.attrs['source_id'], inargs.season,
gridlines=inargs.gridlines, levels=inargs.cbar_levels)
image_format = inargs.output_file.split('.')[-1]
assert image_format == 'png', 'Only valid image format is .png'
new_log = cmdprov.new_log(infile_logs={inargs.pr_file: dset.attrs['history']})
plt.savefig(inargs.output_file, metadata={'History': new_log}, dpi=200)
Handling different image formats
The
plt.savefig
documentation provides information on the metadata keys accepted by PNG, PDF, EPS and PS image formats.Using that information as a guide, add a new function called
get_log_and_key
to theplot_precipitation_climatology.py
script. It should take the precipitation file name, history attribute, and plot type as arguments and return the updated command log and the appropriate metadata key for PNG, PDF, EPS or PS image formats.When you’re done, the final lines of the
main
function should read as follows:log_key, new_log = get_log_and_key(inargs.pr_file, dset.attrs['history'], inargs.output_file.split('.')[-1]) plt.savefig(inargs.output_file, metadata={log_key: new_log}, dpi=200)
Solution
The new function could read as follows:
def get_log_and_key(pr_file, history_attr, plot_type): """Get key and command line log for image metadata. Different image formats allow different metadata keys. Args: pr_file (str): Input precipitation file history_attr (str): History attribute from pr_file plot_type (str): File format for output image """ valid_keys = {'png': 'History', 'pdf': 'Title', 'eps': 'Creator', 'ps' : 'Creator'} assert plot_type in valid_keys.keys(), f"Image format not one of: {*[*valid_keys],}" log_key = valid_keys[plot_type] new_log = cmdprov.new_log(infile_logs={pr_file: history_attr}) return log_key, new_log
Writing the command log to netCDF files
Instead of calculating the seasonal climatology and creating a plot all in the one script, we could have decided to make it a two step process. The first script in the process could take the original precipitation data file and output a new netCDF file containing the seasonal climatology, and the second script could take the seasonal climatology file and create a plot from that.
If the first script reads as follows, how would you edit it so that an updated command log is included in the history attribute of the output netCDF file?
import argparse import numpy as np import xarray as xr def convert_pr_units(darray): """Convert kg m-2 s-1 to mm day-1. Args: darray (xarray.DataArray): Precipitation data """ assert darray.units == 'kg m-2 s-1', "Program assumes input units are kg m-2 s-1" darray.data = darray.data * 86400 darray.attrs['units'] = 'mm/day' return darray def main(inargs): """Run the program.""" in_dset = xr.open_dataset(inargs.pr_file) clim = in_dset['pr'].groupby('time.season').mean('time', keep_attrs=True) clim = convert_pr_units(clim) out_dset = clim.to_dataset() out_dset.attrs = in_dset.attrs out_dset.to_netcdf(inargs.output_file) if __name__ == '__main__': description='Calculate the seasonal precipitation climatology.' parser = argparse.ArgumentParser(description=description) parser.add_argument("pr_file", type=str, help="Precipitation data file") parser.add_argument("output_file", type=str, help="Output file name") args = parser.parse_args() main(args)
Solution
The beginning of the script would need to be updated to import the
cmdline_provenance
library:import cmdline_provenance as cmdprov
The body of the
main
function would then need to be updated to write the new command log to the history attribute of the output dataset:out_dset.attrs = in_dset.attrs new_log = cmdprov.new_log(infile_logs={inargs.pr_file: in_dset.attrs['history']}) out_dset.attrs['history'] = new_log out_dset.to_netcdf(inargs.output_file)
plot_precipitation_climatology.py
At the conclusion of this lesson your
plot_precipitation_climatology.py
script should look something like the following:import pdb import argparse import xarray as xr import cartopy.crs as ccrs import matplotlib.pyplot as plt import numpy as np import cmocean import cmdline_provenance as cmdprov def convert_pr_units(darray): """Convert kg m-2 s-1 to mm day-1. Args: darray (xarray.DataArray): Precipitation data """ assert darray.units == 'kg m-2 s-1', "Program assumes input units are kg m-2 s-1" darray.data = darray.data * 86400 darray.attrs['units'] = 'mm/day' return darray def apply_mask(darray, sftlf_file, realm): """Mask ocean or land using a sftlf (land surface fraction) file. Args: darray (xarray.DataArray): Data to mask sftlf_file (str): Land surface fraction file realm (str): Realm to mask """ dset = xr.open_dataset(sftlf_file) assert realm in ['land', 'ocean'], """Valid realms are 'land' or 'ocean'""" if realm == 'land': masked_darray = darray.where(dset['sftlf'].data < 50) else: masked_darray = darray.where(dset['sftlf'].data > 50) return masked_darray def create_plot(clim, model, season, gridlines=False, levels=None): """Plot the precipitation climatology. Args: clim (xarray.DataArray): Precipitation climatology data model (str): Name of the climate model season (str): Season Kwargs: gridlines (bool): Select whether to plot gridlines levels (list): Tick marks on the colorbar """ if not levels: levels = np.arange(0, 13.5, 1.5) fig = plt.figure(figsize=[12,5]) ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180)) clim.sel(season=season).plot.contourf(ax=ax, levels=levels, extend='max', transform=ccrs.PlateCarree(), cbar_kwargs={'label': clim.units}, cmap=cmocean.cm.haline_r) ax.coastlines() if gridlines: plt.gca().gridlines() title = f'{model} precipitation climatology ({season})' plt.title(title) def get_log_and_key(pr_file, history_attr, plot_type): """Get key and command line log for image metadata. Different image formats allow different metadata keys. Args: pr_file (str): Input precipitation file history_attr (str): History attribute from pr_file plot_type (str): File format for output image """ valid_keys = {'png': 'History', 'pdf': 'Title', 'eps': 'Creator', 'ps' : 'Creator'} assert plot_type in valid_keys.keys(), f"Image format not one of: {*[*valid_keys],}" log_key = valid_keys[plot_type] new_log = cmdprov.new_log(infile_logs={pr_file: history_attr}) return log_key, new_log def main(inargs): """Run the program.""" dset = xr.open_dataset(inargs.pr_file) clim = dset['pr'].groupby('time.season').mean('time', keep_attrs=True) clim = convert_pr_units(clim) if inargs.mask: sftlf_file, realm = inargs.mask clim = apply_mask(clim, sftlf_file, realm) create_plot(clim, dset.attrs['source_id'], inargs.season, gridlines=inargs.gridlines, levels=inargs.cbar_levels) plt.savefig(inargs.output_file, dpi=200) log_key, new_log = get_log_and_key(inargs.pr_file, dset.attrs['history'], inargs.output_file.split('.')[-1]) plt.savefig(inargs.output_file, metadata={log_key: new_log}, dpi=200) if __name__ == '__main__': description='Plot the precipitation climatology for a given season.' parser = argparse.ArgumentParser(description=description) parser.add_argument("pr_file", type=str, help="Precipitation data file") parser.add_argument("season", type=str, help="Season to plot") parser.add_argument("output_file", type=str, help="Output file name") parser.add_argument("--gridlines", action="store_true", default=False, help="Include gridlines on the plot") parser.add_argument("--cbar_levels", type=float, nargs='*', default=None, help='list of levels / tick marks to appear on the colorbar') parser.add_argument("--mask", type=str, nargs=2, metavar=('SFTLF_FILE', 'REALM'), default=None, help="""Provide sftlf file and realm to mask ('land' or 'ocean')""") args = parser.parse_args() main(args)
Key Points
It is possible (in only a few lines of code) to record the provenance of a data file or image.