# AES Impact Evaluation With Integrated Farm Data: Combining Statistical Matching and Propensity Score Matching


## Abstract


## 1. Introduction

## 2. Methods

#### 2.1. Statistical Matching Background

#### 2.2. Statistical Matching and Propensity Score Matching Combined

#### 2.3. Hot Deck Technique and Distance Function Applied

Let **X** = $\{{X}_{1},\dots ,{X}_{l},\dots ,{X}_{L}\}$ (with ${X}_{l}^{R}$ being the vector of dimension (${n}_{R}\times 1$) and ${X}_{l}^{D}$ the vector of dimension (${n}_{D}\times 1$)) be the set of variables observed in both R and D. $\underset{{n}_{R}\times P}{\mathit{Z}}=\{{Z}_{1}^{R},\dots ,{Z}_{p}^{R},\dots ,{Z}_{P}^{R}\}$ (with ${Z}_{p}^{R}$ being the vector of dimension (${n}_{R}\times 1$)) and $\underset{{n}_{D}\times M}{\mathit{K}}=\{{K}_{1}^{D},\dots ,{K}_{m}^{D},\dots ,{K}_{M}^{D}\}$ (with ${K}_{m}^{D}$ being the vector of dimension (${n}_{D}\times 1$)) are two sets of variables observed only in R and only in D, respectively. Hence, across the two data sets (R and D), we have at our disposal a set of jointly observed variables (**X**) and two sets of variables that are exclusively observed (**Z** and **K**). Therefore, let {$\underset{{n}_{R}\times L}{{\mathit{X}}^{R}}$, $\underset{{n}_{R}\times P}{{\mathit{Z}}^{R}}$} be the recipient data set R and {$\underset{{n}_{D}\times L}{{\mathit{X}}^{D}}$, $\underset{{n}_{D}\times M}{{\mathit{K}}^{D}}$} the donor data set D. Finally, let the i-th and the j-th units (i.e., observations) be collected in R and D, respectively, with $i=1,\dots ,{n}_{R}$ and $j=1,\dots ,{n}_{D}$. The aim is to integrate the recipient data set with some variables of interest observed only in the donor in order to obtain, in the most general case, the resulting synthetic (complete) data set: {$\underset{{n}_{R}\times L}{{\mathit{X}}^{R}}$, $\underset{{n}_{R}\times P}{{\mathit{Z}}^{R}}$, $\underset{{n}_{R}\times M}{{\mathit{K}}^{D}}$}.

- R and D are two data sets containing information on two representative samples of the same target population [27].
- R ∪ D must be considered as a unique sample of ${n}_{R}+{n}_{D}$ i.i.d. observations from the joint distribution of (**X**, **Z**, **K**) [27].
- R and D can have any dimensionality, i.e., ${n}_{R}$ and ${n}_{D}$ need not satisfy the condition ${n}_{R}$ ≤ ${n}_{D}$ [52].
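The setting described above can be illustrated with a small sketch. It assumes an unconstrained nearest-neighbour distance hot deck with a Manhattan distance on the common variables **X**; the function name `nnd_hot_deck`, the toy values, and the choice of distance are assumptions made for illustration and are not taken from the article.

```python
def nnd_hot_deck(x_r, x_d, k_d):
    """Impute donor-only variables K into the recipient data set.

    x_r: n_R rows of the L common variables observed in the recipient R
    x_d: n_D rows of the L common variables observed in the donor D
    k_d: n_D rows of the M variables observed only in D
    Returns the n_R imputed rows of K and the chosen donor indices.
    """
    imputed, donor_idx = [], []
    for xi in x_r:
        # Manhattan distance from recipient i to every donor j
        # (an illustrative choice of distance function).
        dists = [sum(abs(a - b) for a, b in zip(xi, xj)) for xj in x_d]
        # Unconstrained matching: the closest donor is taken, so a donor
        # may be reused and n_R may exceed n_D.
        j = dists.index(min(dists))
        donor_idx.append(j)
        imputed.append(k_d[j])
    return imputed, donor_idx


# Toy data (hypothetical): L = 2 common variables, P = 1, M = 1.
x_r = [[1.0, 10.0], [2.0, 40.0], [1.0, 12.0], [3.0, 80.0]]
z_r = [[0.0], [1.0], [0.0], [1.0]]
x_d = [[1.0, 11.0], [2.0, 42.0], [3.0, 75.0]]
k_d = [[100.0], [200.0], [300.0]]

k_imputed, donor_idx = nnd_hot_deck(x_r, x_d, k_d)
# Synthetic (complete) data set {X^R, Z^R, K^D} with n_R rows.
synthetic = [xi + zi + ki for xi, zi, ki in zip(x_r, z_r, k_imputed)]
print(donor_idx)
print(synthetic)
```

In this toy case the recipient has more units than the donor (four against three), so one donor record is used twice, which is only possible because the matching is unconstrained.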


#### 2.4. The PSM Estimator

Let ${\mathbf{X}}_{i}^{psm}$ be a sub-set of the variables observed in the new synthetic (complete) data set. These variables can be chosen among all those originally observed in R and those imputed from D, except the variables already used as matching variables in the SM imputation procedure. Hence, if the new synthetic (complete) data set is {$\underset{{n}_{R}\times L}{{\mathit{X}}^{R}}$, $\underset{{n}_{R}\times P}{{\mathit{Z}}^{R}}$, $\underset{{n}_{R}\times M}{{\mathit{K}}^{D}}$}, the ${\mathbf{X}}_{i}^{psm}$ variables can be chosen among these sets of variables with the exception of the previously used matching variables. In our application, for example, several variables are observed in both the recipient data set R and the donor data set D. Among these, the matching variables chosen for the SM imputation procedure are the farm specialization and the farm TAA, meaning that the sub-set ${\mathbf{X}}_{i}^{psm}$ could potentially consist of all the variables originally observed in R and the variables imputed from D, with the exception of the farm specialization and the farm TAA (and the selected treatment and outcome variables).
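The selection rule for ${\mathbf{X}}_{i}^{psm}$ amounts to a set difference over the columns of the synthetic data set. The sketch below is illustrative only: `farm_spec` and `taa` are hypothetical column names standing in for farm specialization and TAA, and the short column list is a made-up subset that does not say which variables come from R or from D.

```python
# Columns of a hypothetical synthetic (complete) data set.
synthetic_columns = [
    "farm_spec", "taa",            # matching variables used in the SM step
    "owner_edu", "legal_status",   # other variables of the synthetic data set
    "size_esu", "gfi",
    "T",                           # treatment indicator
    "land_rent_in",                # outcome variable
]

matching_vars = {"farm_spec", "taa"}
treatment, outcome = "T", "land_rent_in"

# Candidate PSM covariates: everything except the matching variables,
# the treatment, and the outcome.
excluded = matching_vars | {treatment, outcome}
psm_covariates = [c for c in synthetic_columns if c not in excluded]
print(psm_covariates)
```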

4. The assignment to the treatment is independent of the potential outcomes conditional on the covariates [38]:$$({Y}_{0},{Y}_{1})\perp T|X,$$
5. The probability of treatment assignment, conditional on the covariates, lies strictly between 0 and 1 [38]:$$0<Pr(T=1|X)<1,$$
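In applied work, assumption 5 is often checked by restricting the analysis to a region of common support of the estimated PS. The sketch below assumes a simple min-max rule (the overlap of the PS ranges of treated and control units); this rule, the function name `common_support`, and the toy scores are illustrative assumptions, and the rule used in the article's software may differ.

```python
def common_support(ps, t):
    """Return the (low, high) bounds of the min-max common support."""
    treated = [p for p, ti in zip(ps, t) if ti == 1]
    controls = [p for p, ti in zip(ps, t) if ti == 0]
    # Overlap of the two PS ranges.
    low = max(min(treated), min(controls))
    high = min(max(treated), max(controls))
    return low, high


# Hypothetical propensity scores and treatment indicators.
ps = [0.20, 0.50, 0.90, 0.10, 0.30, 0.60]
t = [1, 1, 1, 0, 0, 0]

low, high = common_support(ps, t)
on_support = [low <= p <= high for p in ps]
print(low, high)
print(on_support)
```

Units whose score falls outside the bounds (here the treated unit at 0.90 and the control unit at 0.10) are set aside before the treatment effect is estimated.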


If the treatment assignment is strongly ignorable given **X**, then it is strongly ignorable given any balancing score and, at any value of a balancing score, the difference in means between treatment and control units is an unbiased estimate of the average treatment effect at that value [16].
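The balancing-score result motivates the usual two-step procedure: estimate the PS, then compare each treated unit with control units that have a similar score. The following sketch uses simulated data with a single covariate, a logit fitted by Newton-Raphson, and one-to-one nearest-neighbour matching with replacement. All names, the data-generating process, and the matching rule are assumptions for illustration; they do not reproduce the article's estimator settings.

```python
import math
import random


def fit_logit(x, t, n_iter=25):
    """Fit Pr(T=1|x) = sigmoid(b0 + b1*x) by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ti in zip(x, t):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += ti - p            # gradient, intercept
            g1 += (ti - p) * xi     # gradient, slope
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def att_nearest_neighbour(ps, t, y):
    """ATT from one-to-one nearest-neighbour matching with replacement."""
    controls = [(p, yi) for p, ti, yi in zip(ps, t, y) if ti == 0]
    diffs = []
    for p, ti, yi in zip(ps, t, y):
        if ti == 1:
            # Closest control on the estimated propensity score.
            _, y_match = min(controls, key=lambda c: abs(c[0] - p))
            diffs.append(yi - y_match)
    return sum(diffs) / len(diffs)


# Simulated data: treatment and outcome both depend on x (confounding).
random.seed(0)
n, true_effect = 1000, 2.0
x = [random.gauss(0.0, 1.0) for _ in range(n)]
t = [1 if random.random() < 1.0 / (1.0 + math.exp(-0.5 * xi)) else 0 for xi in x]
y = [1.0 + true_effect * ti + xi + random.gauss(0.0, 0.5) for xi, ti in zip(x, t)]

b0, b1 = fit_logit(x, t)
ps = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]

n_t = sum(t)
naive = (sum(yi for yi, ti in zip(y, t) if ti) / n_t
         - sum(yi for yi, ti in zip(y, t) if not ti) / (n - n_t))
att = att_nearest_neighbour(ps, t, y)
print(f"naive difference: {naive:.3f}, matched ATT: {att:.3f}")
```

Because the covariate raises both the probability of treatment and the outcome, the unmatched difference in means overstates the simulated effect, while the matched estimate should lie close to it.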

## 3. Empirical Application

#### 3.1. Data Description

#### 3.2. NNDC–MS Combination Application

#### 3.3. PSM Application

## 4. Results

#### 4.1. Data Integration Results

#### 4.2. PSM Results

## 5. Discussion

## 6. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## Abbreviations

Abbreviation | Definition |
---|---|
AESs | Agri-environmental Schemes |
CAP | Common Agricultural Policy |
DID | Difference-in-Differences |
EU | European Union |
FADN | Farm Accountancy Data Network |
FSS | Farm Structure Survey |
NUTS | Nomenclature des unités territoriales statistiques |
PGs | Public Goods |
PSM | Propensity Score Matching |
PS | Propensity Score |
RDPs | Rural Development Plans |
RL | Record Linkage |
SFP | Single Farm Payment |
SM | Statistical Matching |
SUD | Statistical Up(down)scaling |
TAA | Total Agricultural Area |
UAA | Utilised Agricultural Area |

**Figure 1.** Integration example from a donor data set to a recipient one by means of the statistical matching (SM) methodology.

**Figure 2.** Integration from the donor data set Farm Accountancy Data Network (FADN) 2009 to the recipient data set “CAP-IRE 2009” of the variables of interest for the propensity score matching (PSM) analysis.

**Figure 3.** Graphical analysis of the variable distributions pre- and post-imputation. TAA: total agricultural area.

**Figure 5.** PS distribution between the treated and control groups both on and off the region of common support (NEW CAP-IRE 2009 data set).

T | Frequency | Percent |
---|---|---|
0 | 178 | 62.46 |
1 | 107 | 37.54 |
Total | 285 | 100.00 |

**Table 2.** Covariates used for the propensity score (PS) estimation (NEW CAP-IRE 2009 data set). Coef. = coefficients; Std. Err. = Standard Error; z = value of the test statistic; P > |z| = p-value; Conf. Interval = Confidence Interval.

T | Coef. | Std. Err. | z | $P>\lvert z\rvert$ | [95% Conf. Interval] | |
---|---|---|---|---|---|---|
owner_agri_edu | −0.476215 | 0.340462 | −1.40 | 0.162 | −1.143509 | 0.191078 |
owner_edu | 0.115175 | 0.117493 | 0.98 | 0.327 | −0.115107 | 0.345456 |
legal_status | 0.778176 | 0.303509 | 2.56 | 0.010 | 0.183310 | 1.373043 |
organic_prod | 0.920301 | 0.424769 | 2.17 | 0.030 | 0.087770 | 1.752832 |
sfp_ha | −1.015806 | 0.375366 | −2.71 | 0.007 | −1.751510 | −0.280103 |
sfp_eur | 0.002397 | 0.001040 | 2.30 | 0.021 | 0.000357 | 0.004436 |
size_esu | 0.000295 | 0.000277 | 1.06 | 0.288 | −0.000249 | 0.000838 |
uaa_irr | 0.014194 | 0.007912 | 1.79 | 0.073 | −0.001314 | 0.0297015 |
gfi | −0.000016 | 0.000008 | −2.08 | 0.038 | −0.000032 | −0.000001 |
ffi | 0.000015 | 0.000008 | 1.92 | 0.054 | −0.000000 | 0.000031 |
awu_total_input | 0.513978 | 0.183611 | 2.80 | 0.005 | 0.154109 | 0.873847 |
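The columns of Table 2 are linked by standard relations: the test statistic is the coefficient divided by its standard error, the p-value is the two-sided tail probability of a standard normal, and the 95% interval is the coefficient plus or minus about 1.96 standard errors. The sketch below checks these relations on the legal_status row; the helper name `wald_summary` is hypothetical, and the normal approximation is assumed because the table reports z statistics.

```python
import math


def wald_summary(coef, se, z_crit=1.959964):
    """Return z, the two-sided p-value, and the 95% confidence interval."""
    z = coef / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p, (coef - z_crit * se, coef + z_crit * se)


# legal_status row of Table 2.
z, p, (lo, hi) = wald_summary(0.778176, 0.303509)
print(f"z = {z:.2f}, p = {p:.3f}, 95% CI = [{lo:.6f}, {hi:.6f}]")
```

The computed z and p-value agree with the reported 2.56 and 0.010, and the interval matches the reported bounds up to rounding of the inputs.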

**Table 3.** Estimated PS in the region of common support (NEW CAP-IRE 2009 data set). Obs. = number of observations; Std. Dev. = Standard Deviation.

| | Percentiles | Smallest | | |
---|---|---|---|---|
1% | 0.155566 | 0.152536 | | |
5% | 0.173274 | 0.154019 | | |
10% | 0.197719 | 0.155566 | | |
25% | 0.240842 | 0.155584 | | |
50% | 0.345959 | 0.381021 | | |
| | | Largest | | |
75% | 0.487506 | 0.894942 | Obs. | 279 |
90% | 0.651458 | 0.899009 | Std. Dev. | 0.175115 |
95% | 0.750122 | 0.922094 | Variance | 0.030665 |
99% | 0.899009 | 0.928727 | Pseudo R${}^{2}$ | 0.1840 |

Inferior of PS Block | T = 0 | T = 1 | Total |
---|---|---|---|
0.152536 | 15 | 6 | 21 |
0.2 | 66 | 24 | 90 |
0.3 | 43 | 33 | 76 |
0.4 | 35 | 30 | 65 |
0.6 | 10 | 8 | 18 |
0.8 | 3 | 6 | 9 |
Total | 172 | 107 | 279 |
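A cross-tabulation of this kind can be produced by assigning each unit to the block whose lower bound is the largest one not exceeding its PS. The sketch below uses the block bounds listed in the table; the scores and treatment indicators are hypothetical, and the helper `tabulate_blocks` is an illustrative name rather than part of the software used in the article.

```python
from bisect import bisect_right


def tabulate_blocks(ps, t, lower_bounds):
    """Count control (T=0) and treated (T=1) units per PS block."""
    counts = [[0, 0] for _ in lower_bounds]
    for p, ti in zip(ps, t):
        # Index of the last lower bound that does not exceed p.
        block = bisect_right(lower_bounds, p) - 1
        if block < 0:
            continue  # score below the lowest bound: not tabulated
        counts[block][ti] += 1
    return counts


lower_bounds = [0.152536, 0.2, 0.3, 0.4, 0.6, 0.8]  # bounds from the table
# Hypothetical scores and treatment indicators, for illustration only.
ps = [0.16, 0.25, 0.27, 0.35, 0.45, 0.59, 0.65, 0.85, 0.90]
t = [0, 0, 1, 1, 0, 1, 0, 1, 1]

block_counts = tabulate_blocks(ps, t, lower_bounds)
for lb, (n0, n1) in zip(lower_bounds, block_counts):
    print(f"{lb:.6f}  T=0: {n0}  T=1: {n1}  total: {n0 + n1}")
```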

T | Coef. | Std. Err. | z | $P>\lvert z\rvert$ | [95% Conf. Interval] | |
---|---|---|---|---|---|---|
owner_agri_edu | −0.233312 | 0.316948 | −0.74 | 0.462 | −0.854522 | 0.387897 |
owner_edu | 0.128069 | 0.110731 | 1.16 | 0.247 | −0.088961 | 0.345099 |
legal_status | 0.916068 | 0.287123 | 3.19 | 0.001 | 0.353318 | 1.478818 |
organic_prod | 0.914884 | 0.409481 | 2.23 | 0.025 | 0.112316 | 1.717453 |
ffi | −0.000002 | 0.000003 | −0.68 | 0.497 | −0.000009 | 0.000004 |

**Table 6.** Average treatment effect on treated (ATT) estimated on the land_rent_in outcome variable in the synthetic (complete) NEW CAP-IRE 2009 data set. T-stat = test statistic.

Variable | Sample | Treated | Controls | Difference | Std. Err. | T-stat |
---|---|---|---|---|---|---|
land_rent_in | Unmatched | 8.30841 | 7.18539 | 1.12302 | 2.78536 | 0.40 |
| | ATT | 8.35577 | 12.31989 | −3.96412 | 2.93514 | −1.35 |
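As a consistency check on the ATT row of Table 6, assume that the Difference column is the treated mean minus the matched-control mean and that the T-stat is the ratio of Difference to Std. Err. (the usual reading of this kind of output, which is not stated explicitly in the table). The reported figures then satisfy:

$$\widehat{ATT}=8.35577-12.31989=-3.96412,\qquad t=\frac{-3.96412}{2.93514}\approx -1.35.$$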

**Table 7.** ATT estimated on the ghi outcome variable in the synthetic (complete) NEW CAP-IRE 2009 data set.

Variable | Sample | Treated | Controls | Difference | Std. Err. | T-stat |
---|---|---|---|---|---|---|
ghi | Unmatched | 0.18454 | 0.13551 | 0.04903 | 0.02675 | 1.83 |
| | ATT | 0.15188 | 0.18839 | −0.03651 | 0.05547 | −0.66 |

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

