Spatio-temporal trend analysis of temperature using Python

Tutors: Alexander Kmoch¹, Holger Virro¹, and Evelyn Uuemaa¹

¹Landscape Geoinformatics Lab, Chair of Geoinformatics, Department of Geography, University of Tartu

Introduction

Under the threat of climate change, we need to quantify spatio-temporal trends in temperature and rainfall patterns to understand and evaluate the potential impacts of climate change on ecosystem services, energy fluxes and biogeochemical processes. A wide range of global- and regional-scale climate data is freely available in gridded format and can be used for spatio-temporal analysis. These data can be derived directly from satellite remote sensing or from reanalysis of weather station time series. Additionally, some gridded datasets are developed by combining weather station time series with remote sensing data.

Aim of the workshop

The goal of this session is to teach participants how to handle NetCDF datasets, apply the Mann-Kendall (MK) test and calculate Sen's slope (SS) values on a gridded climate dataset (please check the additional info below).

We will use the gridded daily mean temperature product from the European Climate Assessment and Dataset (E-OBS version v23.1e) (Cornes et al., 2018). E-OBS is developed by interpolating climate data from weather stations provided by the national meteorological agencies of European countries. The product is available at a grid resolution of 0.1° or 0.25° and covers the period from 1950 to the present. For the workshop, we will use the 0.1° resolution dataset (about 11 km at the equator) from January 2001 to December 2020, already aggregated into monthly mean temperature (°C). The figure below illustrates how we will calculate MK and SS at the pixel scale on a monthly basis.

[Figure: calculation of MK and Sen's slope at the pixel scale for each month (trend_calc.png)]

Mann-Kendall test for trend analysis

Various studies have combined spatial datasets with nonparametric statistical methods, such as the Mann-Kendall (MK) test (Mann, 1945; Kendall, 1957), to infer temporal trends in time series of climatic variables (Atta-ur-Rahman and Dawood, 2017; Jaagus et al., 2017; Silva Junior et al., 2018). As MK is a nonparametric test, it does not require the data to be normally distributed, but it does assume that the time series is not autocorrelated.

The null hypothesis of the MK test is that the time series has no trend; the alternative hypothesis is that it has a trend (upward or downward). The test first calculates the differences between all pairs of earlier and later data points in the time series. It then calculates the variance of these differences, which is subsequently used to compute the Z test statistic. Finally, the statistical significance of the trend is quantified by calculating the probability associated with the normalized Z statistic. Z > 0 indicates an increasing trend and Z < 0 a decreasing trend. Usually, trends are considered significant at the 95% confidence level.

In addition to detecting a trend, it is also possible to quantify its magnitude. The magnitude can be estimated with the nonparametric Sen statistic, more specifically the Theil–Sen estimator (Sen, 1968), which is given by the median of the slopes of all pairs of points. To calculate Sen's slope, the time series data are ordered as a function of time, and a confidence interval is provided for each slope value (as in the MK test).
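
To make these steps concrete, here is a minimal NumPy sketch of the quantities described above (the S statistic, its variance without tie correction, the normalized Z score and the Theil–Sen slope). It is for illustration only; the pyMannKendall library used later in the workshop handles ties and other edge cases properly.

# Illustrative sketch of the MK test statistic and Sen's slope (no tie correction)
import numpy as np

def mk_sketch(x):
    x = np.asarray(x, dtype=float)
    n = len(x)
    # S: sum of the signs of all later-minus-earlier differences
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    # Variance of S, assuming no tied values
    var_s = n * (n - 1) * (2 * n + 5) / 18
    # Normalized Z statistic (with continuity correction)
    if s > 0:
        z = (s - 1) / np.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / np.sqrt(var_s)
    else:
        z = 0.0
    # Sen's slope: median of the slopes of all pairs of points
    slopes = [(x[j] - x[i]) / (j - i) for i in range(n - 1) for j in range(i + 1, n)]
    return s, z, np.median(slopes)

# Example: 20 noisy "annual" values with a weak upward trend
rng = np.random.default_rng(0)
print(mk_sketch(np.arange(20) * 0.05 + rng.normal(0, 0.3, 20)))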

NetCDF files

NetCDF is a widely used file format that allows multidimensional data to be stored in a single file. For more information about NetCDF and Python, see the YouTube tutorial produced by the Copernicus Marine Service.
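
As a quick illustration (outside the workshop workflow, and with a hypothetical local file name), a NetCDF file can be opened directly with Xarray:

# Open a local NetCDF file directly with Xarray (file name is hypothetical)
import xarray as xr

ds = xr.open_dataset('monthly_temp_baltics.nc', engine='h5netcdf')
print(ds)  # dimensions, coordinates, data variables and global attributes
ds.close()

In the workshop itself we will instead load the data through an Intake catalog, as shown below.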

Reading the NetCDF file with Xarray

We will start by importing all Python libraries used in this workshop. Data loading will be handled by Intake. For data processing and wrangling we are going to use the Xarray, Pandas and NumPy libraries. Matplotlib will be used for plotting purposes. Trend analysis will be carried out using the pyMannKendall library.

# Import libraries

# Data loading
import intake
import intake_xarray # Xarray wrapper for Intake
import h5netcdf # for NetCDF reading

# Processing
import xarray as xr
import pandas as pd
import numpy as np
import rioxarray # for CRS management in Xarray

# Trend analysis
import pymannkendall as mk

# Plotting
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import collections
import matplotlib.patches as mpatches
import calendar

Our prepared (smaller) NetCDF data files are located in the online workshop GitHub repository, and their metadata is described in a separate catalog file, workshop_catalog.yaml, also located in the workshop's GitHub repository.

print('https://raw.githubusercontent.com/LandscapeGeoinformatics/biogeomon_2022_pangeo/main/resources/workshop_catalog.yaml')
https://raw.githubusercontent.com/LandscapeGeoinformatics/biogeomon_2022_pangeo/main/resources/workshop_catalog.yaml

Let us open the catalog file and list its content. We will use the Intake library for loading the data. It allows for convenient data import from external locations.

# Open catalog file
cat = intake.open_catalog('https://raw.githubusercontent.com/LandscapeGeoinformatics/biogeomon_2022_pangeo/main/resources/workshop_catalog.yaml')
list(cat)
['monthly_precip_baltics', 'monthly_temp_baltics']

We are going to use only the temperature file named monthly_temp_baltics.nc in this workshop. First, let us see the metadata of this NetCDF.

# Read metadata
temp_data = cat.monthly_temp_baltics
temp_data
monthly_temp_baltics:
  args:
    urlpath: https://github.com/LandscapeGeoinformatics/biogeomon_2022_pangeo/raw/main/resources/monthly_temp_baltics.nc
  description: Monthly temperature data for the Baltics
  driver: intake_xarray.netcdf.NetCDFSource
  metadata:
    catalog_dir: https://raw.githubusercontent.com/LandscapeGeoinformatics/biogeomon_2022_pangeo/main/resources

The metadata includes the following parameters:
  • urlpath: the location of the NetCDF file
  • description: a short description of the data
  • driver: the driver to be used when importing the data
  • catalog_dir: the location of the catalog file
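
These fields can also be inspected programmatically. As a small aside (relying on Intake's describe method), the catalog entry can be returned as a plain Python dictionary:

# Show the catalog entry (args, driver, description, metadata) as a dictionary
display(temp_data.describe())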

Now we can read the NetCDF with monthly temperature data for the Baltic states for the period 2001–2020. We are going to read the data as an Xarray dataset. The Xarray library allows for convenient manipulation of multidimensional arrays; its two main data objects, DataArray and Dataset, borrow elements from the Rasterio, Pandas and NumPy libraries.

Xarray is a part of the Pangeo big data geoscience environment and is particularly useful for working with large datasets. As our dataset is not quite that big, we will only use some of the main Xarray functionalities for our analysis.

As the driver used for reading the file is already specified in the catalog, we can simply use read to import the temperature data.

# Import temperature data as an Xarray dataset
temp_ds = temp_data.read()
temp_ds
<xarray.Dataset>
Dimensions:      (latitude: 57, longitude: 72, ym: 240)
Coordinates:
  * latitude     (latitude) float64 53.95 54.05 54.15 ... 59.35 59.45 59.55
  * longitude    (longitude) float64 21.05 21.15 21.25 ... 27.95 28.05 28.15
  * ym           (ym) object '2001-01' '2001-02' ... '2020-11' '2020-12'
    spatial_ref  int32 0
Data variables:
    temp_mean    (latitude, longitude, ym) float32 nan nan nan ... nan nan nan
Attributes:
    grid_mapping:  spatial_ref

Our dataset has three dimensions, each with its own set of coordinates. The spatial dimensions consist of 57 latitude and 72 longitude coordinates, meaning that there are a total of 57×72=4104 unique locations with temperature information in the grid. The third dimension, ym, is an array of all 240 year–month pairs (e.g. 2001-01) covering the 20-year study period.

The fourth coordinate, spatial_ref, is not a separate dimension but an indicator that a coordinate reference system (CRS) has been assigned to the data. The attributes of this coordinate hold the CRS information, and the dataset's grid_mapping attribute points to it. If we display the attrs property, we can see that the CRS in this case is WGS84.

# Display CRS information
display(temp_ds['spatial_ref'].attrs)
{'crs_wkt': 'GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4326"]]',
 'semi_major_axis': 6378137.0,
 'semi_minor_axis': 6356752.314245179,
 'inverse_flattening': 298.257223563,
 'reference_ellipsoid_name': 'WGS 84',
 'longitude_of_prime_meridian': 0.0,
 'prime_meridian_name': 'Greenwich',
 'geographic_crs_name': 'WGS 84',
 'grid_mapping_name': 'latitude_longitude',
 'spatial_ref': 'GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AXIS["Latitude",NORTH],AXIS["Longitude",EAST],AUTHORITY["EPSG","4326"]]'}
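
Because rioxarray is imported above, the same information can also be read through its CRS accessor; a quick alternative check:

# rioxarray parses the spatial_ref/grid_mapping information into a CRS object
display(temp_ds.rio.crs)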

The data variable temp_mean contains the monthly mean temperatures of the Baltics for the period 2001–2020. Since each unique location has a time series of 240 months, the variable contains a total of 57×72×240=984960 values.
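
We can verify these numbers directly from the array's shape and size:

# Shape is (latitude, longitude, ym) and size is the total number of values
print(temp_ds['temp_mean'].shape)  # (57, 72, 240)
print(temp_ds['temp_mean'].size)   # 984960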

For a quick visualization of our data, let us plot the mean temperature for January 2001. We can use the sel function and select the month via the ym coordinate. We also need to include the data variable temp_mean in our plotting call.

# Select a specific month
temp_ds.sel(ym='2001-01')
<xarray.Dataset>
Dimensions:      (latitude: 57, longitude: 72)
Coordinates:
  * latitude     (latitude) float64 53.95 54.05 54.15 ... 59.35 59.45 59.55
  * longitude    (longitude) float64 21.05 21.15 21.25 ... 27.95 28.05 28.15
    ym           <U7 '2001-01'
    spatial_ref  int32 0
Data variables:
    temp_mean    (latitude, longitude) float32 nan nan nan nan ... nan nan nan
Attributes:
    grid_mapping:  spatial_ref
# Select and plot a specific month
temp_ds.sel(ym='2001-01')['temp_mean'].plot()
<matplotlib.collections.QuadMesh at 0x19f3be9ce80>
[Figure: map of mean temperature for January 2001 (workshop_19_1.png)]

Getting data ready for trend analysis

Now that we have familiarized ourselves with the dataset, let us prepare it for trend analysis.

This preprocessing phase consists of:
  • converting the Xarray Dataset into a Pandas DataFrame
  • grouping the data by location (grid cell) and month
  • extracting the grouped temperature values as a NumPy array

We will start by converting the dataset into a Pandas DataFrame using the corresponding function to_dataframe. Using reset_index will flatten the DataFrame, so that we have a separate row for each of our 984960 monthly temperature records.

# Convert the xarray to Pandas DF
temp_df = temp_ds.to_dataframe().reset_index() # flatten
display(temp_df)
latitude longitude ym temp_mean spatial_ref
0 53.949861 21.04986 2001-01 NaN 0
1 53.949861 21.04986 2001-02 NaN 0
2 53.949861 21.04986 2001-03 NaN 0
3 53.949861 21.04986 2001-04 NaN 0
4 53.949861 21.04986 2001-05 NaN 0
... ... ... ... ... ...
984955 59.549861 28.14986 2020-08 NaN 0
984956 59.549861 28.14986 2020-09 NaN 0
984957 59.549861 28.14986 2020-10 NaN 0
984958 59.549861 28.14986 2020-11 NaN 0
984959 59.549861 28.14986 2020-12 NaN 0

984960 rows × 5 columns

The input data includes a large number of missing values, indicated by NaN in the temp_mean column. These are grid cells located outside the study area (e.g. over the Baltic Sea or in neighboring countries).
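
A quick check shows how many rows are affected; the count should equal the difference between the total number of rows and the rows that remain after dropping the missing values below:

# Number of rows with a missing temp_mean value
print(temp_df['temp_mean'].isna().sum())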

We should remove them with dropna before moving on.

# Drop rows with missing temperature values
temp_df = temp_df.dropna().reset_index()
temp_df
index latitude longitude ym temp_mean spatial_ref
0 6000 53.949861 23.54986 2001-01 -1.155484 0
1 6001 53.949861 23.54986 2001-02 -2.777500 0
2 6002 53.949861 23.54986 2001-03 0.088065 0
3 6003 53.949861 23.54986 2001-04 8.223666 0
4 6004 53.949861 23.54986 2001-05 13.019676 0
... ... ... ... ... ... ...
611275 980635 59.549861 26.34986 2020-08 16.812258 0
611276 980636 59.549861 26.34986 2020-09 13.711668 0
611277 980637 59.549861 26.34986 2020-10 9.299999 0
611278 980638 59.549861 26.34986 2020-11 5.165666 0
611279 980639 59.549861 26.34986 2020-12 0.163548 0

611280 rows × 6 columns

Let us also split the ym column into separate year and month columns for more convenient grouping later on. We can use the function split for that and specify the delimiter in the original column.

# Extract year and month from the observation date as new columns
temp_df[['year', 'month']] = temp_df['ym'].str.split('-', expand=True) # expand=True tells Pandas that there are two output columns
temp_df['year'] = temp_df['year'].astype(int) # astype(int) will convert string column into integer
temp_df['month'] = temp_df['month'].astype(int)
display(temp_df)
index latitude longitude ym temp_mean spatial_ref year month
0 6000 53.949861 23.54986 2001-01 -1.155484 0 2001 1
1 6001 53.949861 23.54986 2001-02 -2.777500 0 2001 2
2 6002 53.949861 23.54986 2001-03 0.088065 0 2001 3
3 6003 53.949861 23.54986 2001-04 8.223666 0 2001 4
4 6004 53.949861 23.54986 2001-05 13.019676 0 2001 5
... ... ... ... ... ... ... ... ...
611275 980635 59.549861 26.34986 2020-08 16.812258 0 2020 8
611276 980636 59.549861 26.34986 2020-09 13.711668 0 2020 9
611277 980637 59.549861 26.34986 2020-10 9.299999 0 2020 10
611278 980638 59.549861 26.34986 2020-11 5.165666 0 2020 11
611279 980639 59.549861 26.34986 2020-12 0.163548 0 2020 12

611280 rows × 8 columns

In order to make sure that the monthly temperature values end up in the correct order during grouping we should sort the DataFrame by year and month. This can be done using the sort_values function.

# Sort the DataFrame by year and month in ascending order
temp_df.sort_values(['year', 'month'], ascending=[True, True], inplace=True) # inplace=True tells Pandas to modify the existing object

Now we can aggregate the DataFrame by location and month. We will use groupby to determine the columns used for grouping and assign temp_mean as the target column.

The Pandas apply function can be used to apply an arbitrary function to each group (or row) of a DataFrame. In this case, we pass np.array to apply so that the grouped temperature values of each location–month combination are collected into a NumPy array.

# Group locations by month and collect observation values to NumPy array
temp_group_df = temp_df\
    .groupby(['latitude', 'longitude', 'month'])['temp_mean']\
    .apply(np.array)\
    .reset_index() # flatten the DataFrame
display(temp_group_df)
latitude longitude month temp_mean
0 53.949861 23.54986 1 [-1.155484, -1.569032, -5.0525813, -6.9348392,...
1 53.949861 23.54986 2 [-2.7775, 2.2021434, -6.039285, -1.8844824, -5...
2 53.949861 23.54986 3 [0.08806454, 3.3738706, 0.8622579, 1.8580645, ...
3 53.949861 23.54986 4 [8.223666, 8.266, 5.5366664, 7.551, 7.8069997,...
4 53.949861 23.54986 5 [13.019676, 15.822902, 14.108708, 10.822258, 1...
... ... ... ... ...
30559 59.549861 26.34986 8 [15.805161, 18.406773, 16.10516, 17.048388, 16...
30560 59.549861 26.34986 9 [12.335667, 11.58, 11.987667, 12.545667, 13.21...
30561 59.549861 26.34986 10 [8.4554825, 1.8396775, 4.703548, 6.5590324, 7....
30562 59.549861 26.34986 11 [0.766, -1.2219999, 2.9746668, 0.9679999, 3.94...
30563 59.549861 26.34986 12 [-6.6154838, -7.694516, 0.27806458, 0.46129018...

30564 rows × 4 columns

Let us check the cell content in the array column for the first row. It contains 20 years' worth of monthly mean temperature values for January at latitude 53.9499 and longitude 23.5499.

We can use iloc to select a DataFrame row by index.

# Display first row
display(temp_group_df.iloc[0])

# Display cell content in the grouped column
display(temp_group_df.iloc[0]['temp_mean'])
latitude                                             53.949861
longitude                                             23.54986
month                                                        1
temp_mean    [-1.155484, -1.569032, -5.0525813, -6.9348392,...
Name: 0, dtype: object
array([ -1.155484  ,  -1.569032  ,  -5.0525813 ,  -6.9348392 ,
        -0.04483882,  -7.9774194 ,   1.5177417 ,  -0.8593547 ,
        -3.7196774 , -10.487097  ,  -2.7848382 ,  -3.04      ,
        -5.996774  ,  -5.3016133 ,  -0.33290324,  -5.7196774 ,
        -4.384839  ,  -1.638387  ,  -4.256451  ,   2.0070968 ],
      dtype=float32)

Trend analysis using the Mann-Kendall (MK) test

Now we have the temperature arrays in a column and can move on with the trend analysis. We are going to use the pyMannKendall library for that.

Although there are 13 different versions of the MK test available in pyMannKendall, the original_test will suffice for now.

The test returns a named tuple containing the following parameters:
  • trend: the direction of the trend (increasing, decreasing or no trend)
  • h: True if a trend is present, False if it is absent
  • p: the p-value of the significance test
  • z: the normalized test statistic
  • Tau: Kendall's Tau
  • s: the Mann-Kendall score
  • var_s: the variance of S
  • slope: the Theil–Sen estimator (slope)
  • intercept: the intercept of the Kendall–Theil robust line

# Check the ouput of the MK test function
array = temp_group_df.iloc[0]['temp_mean']
result = mk.original_test(array, alpha=0.05)
display(result)
Mann_Kendall_Test(trend='no trend', h=False, p=0.9741177473553713, z=0.03244428422615251, Tau=0.010526315789473684, s=2.0, var_s=950.0, slope=0.03997344138858082, intercept=-3.759586398254384)
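
As noted earlier, the classic MK test assumes that the series is not autocorrelated. If that assumption is questionable, pyMannKendall also provides modified variants; for example, the Hamed and Rao variance-correction test can be applied to the same array (shown here only as an aside, we will stick with original_test below):

# Modified MK test that corrects the variance of S for autocorrelation
display(mk.hamed_rao_modification_test(array, alpha=0.05))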

We will loop over the arrays in the temp_mean column, applying the MK test each time using apply. The significance level (0.05 in our case) is set with the alpha parameter.

This can take some time as we need to repeat the process as many times as there are rows, i.e. 30564 iterations in total.

%%time
# Apply MK test rowwise and collect result into new column
temp_group_df['result'] = temp_group_df['temp_mean'].apply(lambda x: mk.original_test(x, alpha=0.05))
display(temp_group_df)
latitude longitude month temp_mean result
0 53.949861 23.54986 1 [-1.155484, -1.569032, -5.0525813, -6.9348392,... (no trend, False, 0.9741177473553713, 0.032444...
1 53.949861 23.54986 2 [-2.7775, 2.2021434, -6.039285, -1.8844824, -5... (no trend, False, 0.2299690777236405, 1.200438...
2 53.949861 23.54986 3 [0.08806454, 3.3738706, 0.8622579, 1.8580645, ... (no trend, False, 0.2299690777236405, 1.200438...
3 53.949861 23.54986 4 [8.223666, 8.266, 5.5366664, 7.551, 7.8069997,... (no trend, False, 0.7702877264736667, 0.291998...
4 53.949861 23.54986 5 [13.019676, 15.822902, 14.108708, 10.822258, 1... (no trend, False, 0.9741177473553713, 0.032444...
... ... ... ... ... ...
30559 59.549861 26.34986 8 [15.805161, 18.406773, 16.10516, 17.048388, 16... (no trend, False, 0.721176307474874, 0.3568871...
30560 59.549861 26.34986 9 [12.335667, 11.58, 11.987667, 12.545667, 13.21... (no trend, False, 0.07435290536856365, 1.78443...
30561 59.549861 26.34986 10 [8.4554825, 1.8396775, 4.703548, 6.5590324, 7.... (no trend, False, 0.5376032363488861, 0.616441...
30562 59.549861 26.34986 11 [0.766, -1.2219999, 2.9746668, 0.9679999, 3.94... (no trend, False, 0.07435290536856365, 1.78443...
30563 59.549861 26.34986 12 [-6.6154838, -7.694516, 0.27806458, 0.46129018... (no trend, False, 0.18344722817247305, 1.33021...

30564 rows × 5 columns

CPU times: total: 45.2 s
Wall time: 45.6 s

For our purposes, we will only extract parameters trend, p and slope, which have the indices 0, 2 and 7, respectively. We can use these indices to extract the parameters into separate columns from the MK test result column.

temp_group_df['trend'] = temp_group_df['result'].apply(lambda x: x[0])
temp_group_df['p'] = temp_group_df['result'].apply(lambda x: x[2])
temp_group_df['slope'] = temp_group_df['result'].apply(lambda x: x[7])
temp_group_df
latitude longitude month temp_mean result trend p slope
0 53.949861 23.54986 1 [-1.155484, -1.569032, -5.0525813, -6.9348392,... (no trend, False, 0.9741177473553713, 0.032444... no trend 0.974118 0.039973
1 53.949861 23.54986 2 [-2.7775, 2.2021434, -6.039285, -1.8844824, -5... (no trend, False, 0.2299690777236405, 1.200438... no trend 0.229969 0.251141
2 53.949861 23.54986 3 [0.08806454, 3.3738706, 0.8622579, 1.8580645, ... (no trend, False, 0.2299690777236405, 1.200438... no trend 0.229969 0.135081
3 53.949861 23.54986 4 [8.223666, 8.266, 5.5366664, 7.551, 7.8069997,... (no trend, False, 0.7702877264736667, 0.291998... no trend 0.770288 0.012048
4 53.949861 23.54986 5 [13.019676, 15.822902, 14.108708, 10.822258, 1... (no trend, False, 0.9741177473553713, 0.032444... no trend 0.974118 0.000922
... ... ... ... ... ... ... ... ...
30559 59.549861 26.34986 8 [15.805161, 18.406773, 16.10516, 17.048388, 16... (no trend, False, 0.721176307474874, 0.3568871... no trend 0.721176 0.018852
30560 59.549861 26.34986 9 [12.335667, 11.58, 11.987667, 12.545667, 13.21... (no trend, False, 0.07435290536856365, 1.78443... no trend 0.074353 0.066889
30561 59.549861 26.34986 10 [8.4554825, 1.8396775, 4.703548, 6.5590324, 7.... (no trend, False, 0.5376032363488861, 0.616441... no trend 0.537603 0.042174
30562 59.549861 26.34986 11 [0.766, -1.2219999, 2.9746668, 0.9679999, 3.94... (no trend, False, 0.07435290536856365, 1.78443... no trend 0.074353 0.129285
30563 59.549861 26.34986 12 [-6.6154838, -7.694516, 0.27806458, 0.46129018... (no trend, False, 0.18344722817247305, 1.33021... no trend 0.183447 0.145511

30564 rows × 8 columns
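
As an aside, because the result object is a named tuple (as its repr above shows), the same fields could equally be extracted by attribute name instead of by position:

# Equivalent extraction using attribute access on the named tuple
temp_group_df['trend'] = temp_group_df['result'].apply(lambda x: x.trend)
temp_group_df['p'] = temp_group_df['result'].apply(lambda x: x.p)
temp_group_df['slope'] = temp_group_df['result'].apply(lambda x: x.slope)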

We can use Pandas function value_counts to display how many grid cells ended up in each of the trend classes. We will do this for each month, so we are going to group temp_group_df again by month and loop through it.

# Loop through months and display significant trend count
for month, df in temp_group_df.groupby('month'):
    print('Significant trends in month number ' + str(month))
    display(df['trend'].value_counts())
    print('\n')
Significant trends in month number 1
no trend    2547
Name: trend, dtype: int64
Significant trends in month number 2
no trend    2547
Name: trend, dtype: int64
Significant trends in month number 3
no trend    2547
Name: trend, dtype: int64
Significant trends in month number 4
no trend      2536
increasing      11
Name: trend, dtype: int64
Significant trends in month number 5
no trend    2547
Name: trend, dtype: int64
Significant trends in month number 6
no trend      1531
increasing    1016
Name: trend, dtype: int64
Significant trends in month number 7
no trend      2436
decreasing     111
Name: trend, dtype: int64
Significant trends in month number 8
no trend    2547
Name: trend, dtype: int64
Significant trends in month number 9
no trend      1589
increasing     958
Name: trend, dtype: int64
Significant trends in month number 10
no trend      2469
increasing      78
Name: trend, dtype: int64
Significant trends in month number 11
increasing    2067
no trend       480
Name: trend, dtype: int64
Significant trends in month number 12
no trend      2543
increasing       4
Name: trend, dtype: int64

We can see that in most months the large majority of grid cells show no significant trend during the study period; November is a notable exception, with an increasing trend detected in most cells.

Next we will convert the parameter trend into a new numeric column, which is required for plotting later on.

Let us first create a function that assigns a new numeric trend class based on p and slope. Negative significant trends will be indicated by -1, positive trends by 1 and 0 will be assigned to locations where no trend was detected.

# Function for numeric trend conversion
def numeric_trend(slope, p):
    if slope < 0 and p < 0.05:
        return -1
    elif slope > 0 and p < 0.05:
        return 1
    else:
        return 0

Now we can use apply with a lambda function to generate the numeric_trend column. A lambda is a small anonymous function that apply will call on each row.

In our case, we take the values of the slope and p columns of each row as the input parameters of the numeric_trend function created in the previous step. The parameter axis=1 tells apply to iterate over rows rather than columns.

Some examples of lambda use can be found at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html.

# Create a numeric version of the trend column
temp_group_df['numeric_trend'] = temp_group_df.apply(lambda x: numeric_trend(x['slope'], x['p']), axis=1)
display(temp_group_df)
latitude longitude month temp_mean result trend p slope numeric_trend
0 53.949861 23.54986 1 [-1.155484, -1.569032, -5.0525813, -6.9348392,... (no trend, False, 0.9741177473553713, 0.032444... no trend 0.974118 0.039973 0
1 53.949861 23.54986 2 [-2.7775, 2.2021434, -6.039285, -1.8844824, -5... (no trend, False, 0.2299690777236405, 1.200438... no trend 0.229969 0.251141 0
2 53.949861 23.54986 3 [0.08806454, 3.3738706, 0.8622579, 1.8580645, ... (no trend, False, 0.2299690777236405, 1.200438... no trend 0.229969 0.135081 0
3 53.949861 23.54986 4 [8.223666, 8.266, 5.5366664, 7.551, 7.8069997,... (no trend, False, 0.7702877264736667, 0.291998... no trend 0.770288 0.012048 0
4 53.949861 23.54986 5 [13.019676, 15.822902, 14.108708, 10.822258, 1... (no trend, False, 0.9741177473553713, 0.032444... no trend 0.974118 0.000922 0
... ... ... ... ... ... ... ... ... ...
30559 59.549861 26.34986 8 [15.805161, 18.406773, 16.10516, 17.048388, 16... (no trend, False, 0.721176307474874, 0.3568871... no trend 0.721176 0.018852 0
30560 59.549861 26.34986 9 [12.335667, 11.58, 11.987667, 12.545667, 13.21... (no trend, False, 0.07435290536856365, 1.78443... no trend 0.074353 0.066889 0
30561 59.549861 26.34986 10 [8.4554825, 1.8396775, 4.703548, 6.5590324, 7.... (no trend, False, 0.5376032363488861, 0.616441... no trend 0.537603 0.042174 0
30562 59.549861 26.34986 11 [0.766, -1.2219999, 2.9746668, 0.9679999, 3.94... (no trend, False, 0.07435290536856365, 1.78443... no trend 0.074353 0.129285 0
30563 59.549861 26.34986 12 [-6.6154838, -7.694516, 0.27806458, 0.46129018... (no trend, False, 0.18344722817247305, 1.33021... no trend 0.183447 0.145511 0

30564 rows × 9 columns
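
For larger tables, the same classification could also be done without a rowwise apply by using vectorized NumPy selection; a short sketch using the same column names:

# Vectorized alternative to the rowwise apply (same -1/0/1 coding)
conditions = [
    (temp_group_df['slope'] < 0) & (temp_group_df['p'] < 0.05),
    (temp_group_df['slope'] > 0) & (temp_group_df['p'] < 0.05),
]
temp_group_df['numeric_trend'] = np.select(conditions, [-1, 1], default=0)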

Before plotting the trend results we need to convert our DataFrame back into an Xarray dataset. This can be done with to_xarray, but first we need to use set_index and assign an index to the DataFrame. The columns used for the index will end up as dimensions in the resulting Xarray dataset.

Let us also assign a CRS with write_crs from rioxarray, which is a supplementary Xarray library used for spatial operations.

# Convert the DataFrame back to xarray Dataset and assign CRS
temp_group_df.set_index(['latitude', 'longitude', 'month'], inplace=True)
temp_array = temp_group_df.to_xarray()
temp_array.rio.write_crs('epsg:4326', inplace=True)
display(temp_array)
<xarray.Dataset>
Dimensions:        (latitude: 57, longitude: 72, month: 12)
Coordinates:
  * latitude       (latitude) float64 53.95 54.05 54.15 ... 59.35 59.45 59.55
  * longitude      (longitude) float64 21.05 21.15 21.25 ... 27.95 28.05 28.15
  * month          (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
    spatial_ref    int32 0
Data variables:
    temp_mean      (latitude, longitude, month) object nan nan nan ... nan nan
    result         (latitude, longitude, month) object nan nan nan ... nan nan
    trend          (latitude, longitude, month) object nan nan nan ... nan nan
    p              (latitude, longitude, month) float64 nan nan nan ... nan nan
    slope          (latitude, longitude, month) float64 nan nan nan ... nan nan
    numeric_trend  (latitude, longitude, month) float64 nan nan nan ... nan nan

Plotting the trend analysis results

Plotting slopes

Finally, we can plot our results.

First, let us take a look at the spatial and temporal pattern of the slope parameter. This will give an idea of where and when temperature has changed the most. The change is expressed in °C/year.

Before plotting, let us create a list of months for the iteration.

# List of months used for iterating
months = temp_array['month']
display(months)
<xarray.DataArray 'month' (month: 12)>
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12], dtype=int64)
Coordinates:
  * month        (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
    spatial_ref  int32 0

For plotting monthly slope values indicating the intensity of temperature change, we will create a Matplotlib figure object fig consisting of a 4×3 grid of axes.

While iterating over each of the 12 months and their corresponding axes, we:
  • select the corresponding dataset slice from temp_array using the sel command
  • plot the slope values for that particular month
  • add a colorbar next to each subplot showing the magnitude of temperature change

# Degree symbol for colorbar
degree_symbol = r'$\degree$'  # raw string so the backslash is not treated as an escape

# Create figure and iterate over months to plot slope values in cells
fig, axes = plt.subplots(4, 3, figsize=(12, 10))
for month, ax in zip(months, axes.flatten()):
    ds = temp_array.sel(month=month) # select the month
    ds['slope'].plot(ax=ax, cmap='RdBu_r', cbar_kwargs={'label': '{}C/year'.format(degree_symbol)}) # plot slope values
    month_name = calendar.month_name[int(month)] # get month name
    ax.set_title(month_name)

# Add the title
fig_title = 'Median annual temperature change in the Baltics during 2001-2020'
fig.suptitle(fig_title, y=1, size=14)

# Reduce whitespace around the plot
fig.tight_layout()
[Figure: monthly maps of Sen's slope values, default colorbar (workshop_54_0.png)]

For setting the range of the colorbar, we need to know the full range of slope values in the data by checking the minimum and maximum values.

display(temp_array['slope'].min())
display(temp_array['slope'].max())
<xarray.DataArray 'slope' ()>
array(-0.17666797)
Coordinates:
    spatial_ref  int32 0
<xarray.DataArray 'slope' ()>
array(0.30160443)
Coordinates:
    spatial_ref  int32 0

We will create a colorbar normalization object called divnorm to set the range. Using the TwoSlopeNorm class with the parameters vmin=-0.2 and vmax=0.4 should cover the range of slope values. As our colormap is diverging, we also need to set vcenter=0.

divnorm = mcolors.TwoSlopeNorm(vmin=-0.2, vcenter=0, vmax=0.4)

We can now recreate the plot with our custom colorbar range by specifying the norm parameter.

# Degree symbol for colorbar
degree_symbol = r'$\degree$'  # raw string so the backslash is not treated as an escape

# Create figure and iterate over months to plot slope values in cells
fig, axes = plt.subplots(4, 3, figsize=(12, 10))
for month, ax in zip(months, axes.flatten()):
    ds = temp_array.sel(month=month) # select the month
    ds['slope'].plot(ax=ax, cmap='RdBu_r', norm=divnorm, cbar_kwargs={'label': '{}C/year'.format(degree_symbol)}) # plot slope values with custom colorbar range
    month_name = calendar.month_name[int(month)] # get month name
    ax.set_title(month_name)

# Add the title
fig_title = 'Median annual temperature change in the Baltics during 2001-2020'
fig.suptitle(fig_title, y=1, size=14)

# Reduce whitespace around the plot
fig.tight_layout()
[Figure: monthly maps of Sen's slope values with a fixed colorbar range (workshop_60_0.png)]

From the plots we can see that the majority of months show an increasing trend, which is strongest in February and December. Although parts of the Baltics show a slight decreasing trend in some other months (e.g. the western part in January), the only month with a clear decreasing trend is July. No clear spatial pattern exists in the locations of the detected significant trends.

Automatic plotting

We have now learned to plot the hard way by customizing the figures manually. Xarray and Matplotlib can actually do certain things for us when plotting, so that quick plots can be displayed with a single line of code.

For this, we use the imshow command and specify the coordinates with longitude and latitude. The col parameter makes sure that we get a plot for each month, and col_wrap tells Matplotlib to plot three months per row.

# Plot monthly slope values automatically
temp_array['slope'].plot.imshow('longitude', 'latitude', col='month', col_wrap=3)
<xarray.plot.facetgrid.FacetGrid at 0x14acb42f448>
[Figure: facet grid of monthly slope maps (workshop_68_1.png)]

References

  • Atta-ur-Rahman, and Dawood, M. (2017). Spatio-statistical analysis of temperature fluctuation using Mann–Kendall and Sen’s slope approach. Clim. Dyn. 48, 783–797. doi:10.1007/s00382-016-3110-y.

  • Cornes, R. C., van der Schrier, G., van den Besselaar, E. J. M., and Jones, P. D. (2018). An Ensemble Version of the E-OBS Temperature and Precipitation Data Sets. J. Geophys. Res. Atmos. 123, 9391–9409. doi:10.1029/2017JD028200.

  • Jaagus, J., Sepp, M., Tamm, T., Järvet, A., and Mõisja, K. (2017). Trends and regime shifts in climatic conditions and river runoff in Estonia during 1951–2015. Earth Syst. Dyn. 8, 963–976. doi:10.5194/esd-8-963-2017.

  • Kendall, M. G. (1957). Rank Correlation Methods. Biometrika 44, 298. doi:10.2307/2333282.

  • Mann, H. B. (1945). Nonparametric Tests Against Trend. Econometrica 13, 245–259.

  • Sen, P. K. (1968). Estimates of the Regression Coefficient Based on Kendall's Tau. J. Am. Stat. Assoc. 63, 1379–1389.

  • Silva Junior, C. H. L., Almeida, C. T., Santos, J. R. N., Anderson, L. O., Aragão, L. E. O. C., and Silva, F. B. (2018). Spatiotemporal rainfall trends in the Brazilian legal Amazon between the years 1998 and 2015. Water (Switzerland) 10, 1–16. doi:10.3390/w10091220.