Xarray for multidimensional gridded data¶

In last week's lecture, we saw how Pandas provided a way to keep track of additional "metadata" surrounding tabular datasets, including "indexes" for each row and labels for each column. These features, together with Pandas' many useful routines for all kinds of data munging and analysis, have made Pandas one of the most popular python packages in the world.

However, not all Earth science datasets easily fit into the "tabular" model (i.e. rows and columns) imposed by Pandas. In particular, we often deal with multidimensional data. By multidimensional data (also often called N-dimensional), I mean data with many independent dimensions or axes. For example, we might represent Earth's surface temperature $T$ as a three dimensional variable

$$ T(x, y, t) $$

where $x$ is longitude, $y$ is latitude, and $t$ is time.

The point of xarray is to provide pandas-level convenience for working with this type of data.

xarray data model

Learning Goals for Xarray¶

Because of the importance of xarray for data analysis in geoscience, we are going to spend a long time on it. The goals of the next three lessons include the following.

Lesson 1: Xarray Fundamentals¶

Dataset Creation¶

Describe the core xarray data structures, the DataArray and the Dataset, and the components that make them up, including: Data Variables, Dimensions, Coordinates, Indexes, and Attributes
Create xarray DataArrays and DataSets out of raw numpy arrays
Create xarray objects with and without indexes
Load xarray datasets from netCDF files and openDAP servers
View and set attributes

Basic Indexing and Interpolation¶

Select data by position using .isel with values or slices
Select data by label using .sel with values or slices
Select timeseries data by date/time with values or slices
Use nearest-neighbor lookups with .sel
Mask data with .where
Interpolate data in one and several dimensions

Basic Computation¶

Do basic arithmetic with DataArrays and Datasets
Use numpy universal function on DataArrays and Datasets, or use corresponding built-in xarray methods
Combine multiple xarray objects in arithmetic operations and understand how they are broadcasted / aligned
Perform aggregation (reduction) along one or multiple dimensions of a DataArray or Dataset

Basic Plotting¶

Use built-in xarray plotting for 1D and 2D DataArrays
Customize plots with options

Lesson 2: Advanced Usage¶

Xarray's groupby, resample, and rolling¶

Split xarray objects into groups using groupby
Apply reduction operations to groups (e.g. mean)
Apply non-reducing functions to groups (e.g. standardize)
Use groupby with time coordinates (e.g. to create climatologies)
Use artimetic between GroupBy objects and regular DataArrays / Datasets
Use groupby_bins to aggregate data in bins
Use resample on time dimensions
Use rolling to apply rolling aggregations

Merging Combining Datasets¶

Concatentate DataArrays and Datasets along a new or existing dimension
Merge multiple datasets with different variables
Add a new data variable to an existing Dataset

Reshaping Data¶

Transpose dimension order
Swap coordinates
Expand and squeeze dimensions
Convert between DataArray and Dataset
Use stack and unstack to transform data

Advanced Computations¶

Use differentiate to take derivatives of data
Use apply_ufunc to apply custom or specialized operations to data

Plotting¶

Show multiple line plots over a dimension using the hue keyword
Create multiple 2D plots using faceting

Lesson 1: Xarray Fundamentals¶

Xarray data structures¶

Like Pandas, xarray has two fundamental data structures:

a DataArray, which holds a single multi-dimensional variable and its coordinates
a Dataset, which holds multiple variables that potentially share the same coordinates

DataArray¶

A DataArray has four essential attributes:

values: a numpy.ndarray holding the array’s values
dims: dimension names for each axis (e.g., ('x', 'y', 'z'))
coords: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
attrs: an OrderedDict to hold arbitrary metadata (attributes)

Let's start by constructing some DataArrays manually

In [1]:

import numpy as np
import xarray as xr
from matplotlib import pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (8,5)

A simple DataArray without dimensions or coordinates isn't much use.

In [2]:

da = xr.DataArray([9, 0, 2, 1, 0])
da

Out[2]:


array([9, 0, 2, 1, 0])
Dimensions without coordinates: dim_0

We can add a dimension name...

In [3]:

da = xr.DataArray([9, 0, 2, 1, 0], dims=['x'])
da

Out[3]:


array([9, 0, 2, 1, 0])
Dimensions without coordinates: x

But things get most interesting when we add a coordinate:

In [4]:

da = xr.DataArray([9, 0, 2, 1, 0],
                  dims=['x'],
                  coords={'x': [10, 20, 30, 40, 50]})
da

Out[4]:


array([9, 0, 2, 1, 0])
Coordinates:
  * x        (x) int64 10 20 30 40 50

Xarray has built-in plotting, like pandas.

In [5]:

da.plot(marker='o')

Out[5]:

[]

Multidimensional DataArray¶

If we are just dealing with 1D data, Pandas and Xarray have very similar capabilities. Xarray's real potential comes with multidimensional data.

Let's go back to the multidimensional ARGO data we loaded in the numpy lesson. If you haven't already downloaded it, you can do so at the command line with

curl -O http://www.ldeo.columbia.edu/~rpa/argo_float_4901412.npz

We reload this data and examine its keys.

In [6]:

argo_data = np.load('argo_float_4901412.npz')
argo_data.keys()

Out[6]:

['S', 'T', 'levels', 'lon', 'date', 'P', 'lat']

The values of the argo_data object are numpy arrays.

In [7]:

S = argo_data.f.S
T = argo_data.f.T
P = argo_data.f.P
levels = argo_data.f.levels
lon = argo_data.f.lon
lat = argo_data.f.lat
date = argo_data.f.date
print(S.shape, lon.shape, date.shape)

(78, 75) (75,) (75,)

Let's organize the data and coordinates of the salinity variable into a DataArray.

In [8]:

da_salinity = xr.DataArray(S, dims=['level', 'date'],
                           coords={'level': levels,
                                   'date': date},)
da_salinity

Out[8]:


array([[ 35.638939,  35.514957,  35.572971, ...,  35.820938,  35.777939,
         35.668911],
       [ 35.633938,  35.521957,  35.573971, ...,  35.810932,  35.583897,
         35.667912],
       [ 35.681946,  35.525959,  35.572971, ...,  35.795929,  35.662907,
         35.665913],
       ..., 
       [ 34.915859,  34.923904,  34.923904, ...,  34.934811,  34.940811,
         34.946808],
       [ 34.915859,  34.923904,  34.921906, ...,  34.932808,  34.93681 ,
         34.94381 ],
       [ 34.917858,  34.923904,  34.923904, ...,        nan,  34.93681 ,
               nan]])
Coordinates:
  * level    (level) int64 0 1 2 3 4 5 6 7 8 9 ... 68 69 70 71 72 73 74 75 76 77
  * date     (date) datetime64[ns] 2012-07-13T22:33:06.019200 ... 2014-07-24T03:02:33.014400

In [9]:

da_salinity.plot(yincrease=False)

Out[9]:

Attributes can be used to store metadata. What metadata should you store? The CF Conventions are a great resource for thinking about climate metadata. Below we define two of the required CF-conventions attributes.

In [10]:

da_salinity.attrs['units'] = 'PSU'
da_salinity.attrs['standard_name'] = 'sea_water_salinity'
da_salinity

Out[10]:


array([[ 35.638939,  35.514957,  35.572971, ...,  35.820938,  35.777939,
         35.668911],
       [ 35.633938,  35.521957,  35.573971, ...,  35.810932,  35.583897,
         35.667912],
       [ 35.681946,  35.525959,  35.572971, ...,  35.795929,  35.662907,
         35.665913],
       ..., 
       [ 34.915859,  34.923904,  34.923904, ...,  34.934811,  34.940811,
         34.946808],
       [ 34.915859,  34.923904,  34.921906, ...,  34.932808,  34.93681 ,
         34.94381 ],
       [ 34.917858,  34.923904,  34.923904, ...,        nan,  34.93681 ,
               nan]])
Coordinates:
  * level    (level) int64 0 1 2 3 4 5 6 7 8 9 ... 68 69 70 71 72 73 74 75 76 77
  * date     (date) datetime64[ns] 2012-07-13T22:33:06.019200 ... 2014-07-24T03:02:33.014400
Attributes:
    units:          PSU
    standard_name:  sea_water_salinity

Datasets¶

A Dataset holds many DataArrays which potentially can share coordinates. In analogy to pandas:

pandas.Series : pandas.Dataframe :: xarray.DataArray : xarray.Dataset

Constructing Datasets manually is a bit more involved in terms of syntax. The Dataset constructor takes three arguments:

data_vars should be a dictionary with each key as the name of the variable and each value as one of:
- A DataArray or Variable
- A tuple of the form (dims, data[, attrs]), which is converted into arguments for Variable
- A pandas object, which is converted into a DataArray
- A 1D array or list, which is interpreted as values for a one dimensional coordinate variable along the same dimension as it’s name
coords should be a dictionary of the same form as data_vars.
attrs should be a dictionary.

Let's put together a Dataset with temperature, salinity and pressure all together

In [11]:

argo = xr.Dataset(
        data_vars={'salinity':    (('level', 'date'), S),
                   'temperature': (('level', 'date'), T),
                   'pressure':    (('level', 'date'), P)},
        coords={'level': levels,
                'date': date})
argo

Out[11]:


Dimensions:      (date: 75, level: 78)
Coordinates:
  * level        (level) int64 0 1 2 3 4 5 6 7 8 ... 69 70 71 72 73 74 75 76 77
  * date         (date) datetime64[ns] 2012-07-13T22:33:06.019200 ... 2014-07-24T03:02:33.014400
Data variables:
    salinity     (level, date) float64 35.64 35.51 35.57 35.4 ... nan 34.94 nan
    temperature  (level, date) float64 18.97 18.44 19.1 19.79 ... nan 3.714 nan
    pressure     (level, date) float64 6.8 6.1 6.5 5.0 ... 2e+03 nan 2e+03 nan

What about lon and lat? We forgot them in the creation process, but we can add them after the fact.

In [12]:

argo.coords['lon'] = lon
argo

Out[12]:


Dimensions:      (date: 75, level: 78, lon: 75)
Coordinates:
  * level        (level) int64 0 1 2 3 4 5 6 7 8 ... 69 70 71 72 73 74 75 76 77
  * date         (date) datetime64[ns] 2012-07-13T22:33:06.019200 ... 2014-07-24T03:02:33.014400
  * lon          (lon) float64 -39.13 -37.28 -36.9 ... -33.83 -34.11 -34.38
Data variables:
    salinity     (level, date) float64 35.64 35.51 35.57 35.4 ... nan 34.94 nan
    temperature  (level, date) float64 18.97 18.44 19.1 19.79 ... nan 3.714 nan
    pressure     (level, date) float64 6.8 6.1 6.5 5.0 ... 2e+03 nan 2e+03 nan

That was not quite right...we want lon to have dimension date:

In [13]:

del argo['lon']
argo.coords['lon'] = ('date', lon)
argo.coords['lat'] = ('date', lat)
argo

Out[13]:


Dimensions:      (date: 75, level: 78)
Coordinates:
  * level        (level) int64 0 1 2 3 4 5 6 7 8 ... 69 70 71 72 73 74 75 76 77
  * date         (date) datetime64[ns] 2012-07-13T22:33:06.019200 ... 2014-07-24T03:02:33.014400
    lon          (date) float64 -39.13 -37.28 -36.9 ... -33.83 -34.11 -34.38
    lat          (date) float64 47.19 46.72 46.45 46.23 ... 42.6 42.46 42.38
Data variables:
    salinity     (level, date) float64 35.64 35.51 35.57 35.4 ... nan 34.94 nan
    temperature  (level, date) float64 18.97 18.44 19.1 19.79 ... nan 3.714 nan
    pressure     (level, date) float64 6.8 6.1 6.5 5.0 ... 2e+03 nan 2e+03 nan

Coordinates vs. Data Variables¶

Data variables can be modified through arithmentic operations or other functions. Coordinates are always keept the same.

In [14]:

argo * 10000

Out[14]:


Dimensions:      (date: 75, level: 78)
Coordinates:
  * level        (level) int64 0 1 2 3 4 5 6 7 8 ... 69 70 71 72 73 74 75 76 77
  * date         (date) datetime64[ns] 2012-07-13T22:33:06.019200 ... 2014-07-24T03:02:33.014400
    lat          (date) float64 47.19 46.72 46.45 46.23 ... 42.6 42.46 42.38
    lon          (date) float64 -39.13 -37.28 -36.9 ... -33.83 -34.11 -34.38
Data variables:
    salinity     (level, date) float64 3.564e+05 3.551e+05 ... 3.494e+05 nan
    temperature  (level, date) float64 1.897e+05 1.844e+05 ... 3.714e+04 nan
    pressure     (level, date) float64 6.8e+04 6.1e+04 6.5e+04 ... nan 2e+07 nan

Clearly lon and lat are coordinates rather than data variables. We can change their status as follows:

In [15]:

argo = argo.set_coords(['lon', 'lat'])
argo

Out[15]:


Dimensions:      (date: 75, level: 78)
Coordinates:
  * level        (level) int64 0 1 2 3 4 5 6 7 8 ... 69 70 71 72 73 74 75 76 77
  * date         (date) datetime64[ns] 2012-07-13T22:33:06.019200 ... 2014-07-24T03:02:33.014400
    lon          (date) float64 -39.13 -37.28 -36.9 ... -33.83 -34.11 -34.38
    lat          (date) float64 47.19 46.72 46.45 46.23 ... 42.6 42.46 42.38
Data variables:
    salinity     (level, date) float64 35.64 35.51 35.57 35.4 ... nan 34.94 nan
    temperature  (level, date) float64 18.97 18.44 19.1 19.79 ... nan 3.714 nan
    pressure     (level, date) float64 6.8 6.1 6.5 5.0 ... 2e+03 nan 2e+03 nan

The * symbol in the representation above indicates that level and date are "dimension coordinates" (they describe the coordinates associated with data variable axes) while lon and lat are "non-dimension coordinates". We can make any variable a non-dimension coordiante.

Alternatively, we could have assigned directly to coords as follows:

In [16]:

argo.coords['lon'] = ('date', lon)
argo.coords['lat'] = ('date', lat)

Working with Labeled Data¶

Xarray's labels make working with multidimensional data much easier.

Selecting Data (Indexing)¶

We can always use regular numpy indexing and slicing on DataArrays

In [17]:

argo.salinity[2].plot()

Out[17]:

[]

In [18]:

argo.salinity[:, 10].plot()

Out[18]:

[]

However, it is often much more powerful to use xarray's .sel() method to use label-based indexing.

In [19]:

argo.salinity.sel(level=2).plot()

Out[19]:

[]

In [20]:

argo.salinity.sel(date='2012-10-22').plot(y='level', yincrease=False)

Out[20]:

[]

.sel() also supports slicing. Unfortunately we have to use a somewhat awkward syntax, but it still works.

In [21]:

argo.salinity.sel(date=slice('2012-10-01', '2012-12-01')).plot()

Out[21]:

.sel() also works on the whole Dataset

In [22]:

argo.sel(date='2012-10-22')

Out[22]:


Dimensions:      (date: 1, level: 78)
Coordinates:
  * level        (level) int64 0 1 2 3 4 5 6 7 8 ... 69 70 71 72 73 74 75 76 77
  * date         (date) datetime64[ns] 2012-10-22T02:50:32.006400
    lon          (date) float64 -32.97
    lat          (date) float64 44.13
Data variables:
    salinity     (level, date) float64 35.47 35.47 35.47 ... 34.93 34.92 nan
    temperature  (level, date) float64 17.13 17.13 17.13 ... 3.736 3.639 nan
    pressure     (level, date) float64 6.4 10.3 15.4 ... 1.9e+03 1.951e+03 nan

Computation¶

Xarray dataarrays and datasets work seamlessly with arithmetic operators and numpy array functions.

In [23]:

temp_kelvin = argo.temperature + 273.15
temp_kelvin.plot(yincrease=False)

Out[23]:

We can also combine multiple xarray datasets in arithemtic operations

In [24]:

g = 9.8
buoyancy = g * (2e-4 * argo.temperature - 7e-4 * argo.salinity)
buoyancy.plot(yincrease=False)

Out[24]:

Broadcasting¶

Broadcasting arrays in numpy is a nightmare. It is much easier when the data axes are labeled!

This is a useless calculation, but it illustrates how perfoming an operation on arrays with differenty coordinates will result in automatic broadcasting

In [25]:

level_times_lat = argo.level * argo.lat
level_times_lat

Out[25]:


array([[    0.   ,     0.   ,     0.   , ...,     0.   ,     0.   ,     0.   ],
       [   47.187,    46.716,    46.45 , ...,    42.601,    42.457,    42.379],
       [   94.374,    93.432,    92.9  , ...,    85.202,    84.914,    84.758],
       ..., 
       [ 3539.025,  3503.7  ,  3483.75 , ...,  3195.075,  3184.275,  3178.425],
       [ 3586.212,  3550.416,  3530.2  , ...,  3237.676,  3226.732,  3220.804],
       [ 3633.399,  3597.132,  3576.65 , ...,  3280.277,  3269.189,  3263.183]])
Coordinates:
  * level    (level) int64 0 1 2 3 4 5 6 7 8 9 ... 68 69 70 71 72 73 74 75 76 77
  * date     (date) datetime64[ns] 2012-07-13T22:33:06.019200 ... 2014-07-24T03:02:33.014400
    lon      (date) float64 -39.13 -37.28 -36.9 -36.89 ... -33.83 -34.11 -34.38
    lat      (date) float64 47.19 46.72 46.45 46.23 ... 42.72 42.6 42.46 42.38

In [26]:

level_times_lat.plot()

Out[26]:

Reductions¶

Just like in numpy, we can reduce xarray DataArrays along any number of axes:

In [27]:

argo.temperature.mean(axis=0).dims

Out[27]:

('date',)

In [28]:

argo.temperature.mean(axis=1).dims

Out[28]:

('level',)

However, rather than performing reductions on axes (as in numpy), we can perform them on dimensions. This turns out to be a huge convenience

In [29]:

argo_mean = argo.mean(dim='date')
argo_mean

Out[29]:


Dimensions:      (level: 78)
Coordinates:
  * level        (level) int64 0 1 2 3 4 5 6 7 8 ... 69 70 71 72 73 74 75 76 77
Data variables:
    salinity     (level) float64 35.91 35.9 35.9 35.9 ... 34.94 34.94 34.93
    temperature  (level) float64 17.6 17.57 17.51 17.42 ... 3.789 3.73 3.662
    pressure     (level) float64 6.435 10.57 15.54 ... 1.95e+03 1.999e+03

In [30]:

argo_mean.salinity.plot(y='level', yincrease=False)

Out[30]:

[]

In [31]:

argo_std = argo.std(dim='date')
argo_std.salinity.plot(y='level', yincrease=False)

Out[31]:

[]

Loading Data from netCDF Files¶

NetCDF (Network Common Data Format) is the most widely used format for distributing geoscience data. NetCDF is maintained by the Unidata organization.

Below we quote from the NetCDF website:

NetCDF (network Common Data Form) is a set of interfaces for array-oriented data access and a freely distributed collection of data access libraries for C, Fortran, C++, Java, and other languages. The netCDF libraries support a machine-independent format for representing scientific data. Together, the interfaces, libraries, and format support the creation, access, and sharing of scientific data.

NetCDF data is:

Self-Describing. A netCDF file includes information about the data it contains.

Portable. A netCDF file can be accessed by computers with different ways of storing integers, characters, and floating-point numbers.

Scalable. A small subset of a large dataset may be accessed efficiently.

Appendable. Data may be appended to a properly structured netCDF file without copying the dataset or redefining its structure.

Sharable. One writer and multiple readers may simultaneously access the same netCDF file.

Archivable. Access to all earlier forms of netCDF data will be supported by current and future versions of the software.

Xarray was designed to make reading netCDF files in python as easy, powerful, and flexible as possible. (See xarray netCDF docs for more details.)

Below we download and load some the NASA GISSTemp global temperature anomaly dataset.

In [32]:

! wget https://data.giss.nasa.gov/pub/gistemp/gistemp1200_ERSSTv5.nc.gz
! gunzip gistemp1200_ERSSTv5.nc.gz

In [33]:

ds = xr.open_dataset('gistemp1200_ERSSTv5.nc')
ds

Out[33]:


Dimensions:      (lat: 90, lon: 180, nv: 2, time: 1665)
Coordinates:
  * lat          (lat) float32 -89.0 -87.0 -85.0 -83.0 ... 83.0 85.0 87.0 89.0
  * lon          (lon) float32 -179.0 -177.0 -175.0 -173.0 ... 175.0 177.0 179.0
  * time         (time) datetime64[ns] 1880-01-15 1880-02-15 ... 2018-09-15
Dimensions without coordinates: nv
Data variables:
    time_bnds    (time, nv) int32 ...
    tempanomaly  (time, lat, lon) float32 ...
Attributes:
    title:        GISTEMP Surface Temperature Analysis
    institution:  NASA Goddard Institute for Space Studies
    source:       http://data.giss.nasa.gov/gistemp/
    Conventions:  CF-1.6
    history:      Created 2018-10-12 16:36:37 by SBBX_to_nc 2.0 - ILAND=1200,...

In [34]:

ds.tempanomaly.isel(time=-1).plot()

Out[34]:

In [35]:

ds.tempanomaly.mean(dim=('lon', 'lat')).plot()

Out[35]:

[]

In [ ]: