Interpretable Machine Learning for Legume Yield Prediction Using Satellite Remote Sensing Data

Bibliographic Details
Main Authors: Theodoros Petropoulos, Lefteris Benos, Remigio Berruto, Gabriele Miserendino, Vasso Marinoudi, Patrizia Busato, Chrysostomos Zisis, Dionysis Bochtis
Format: Article
Language: English
Published: MDPI AG 2025-06-01
Series: Applied Sciences
Online Access: https://www.mdpi.com/2076-3417/15/13/7074
Description
Summary: Accurate crop yield prediction is vital for optimizing agricultural productivity. Machine learning (ML) has shown promise in this field; however, its application to legume crops, especially lupin, remains limited, and many models lack interpretability, which hinders real-world adoption. To bridge this gap in the literature, an interpretable ML framework was developed for predicting lupin yield using Sentinel-2 remote sensing data integrated with georeferenced yield measurements. Data preprocessing involved computing vegetation indices, removing outliers, addressing multicollinearity, normalizing feature scales, and applying data augmentation techniques to correct target imbalance. Subsequently, six ML models representing different algorithmic strategies were evaluated. Among them, XGBoost showed the best performance (R² = 0.8756) and low error values across the MAE, MSE, and RMSE metrics. To enhance model transparency, SHapley Additive exPlanations (SHAP) values were applied to interpret the feature contributions of the XGBoost model. The Enhanced Vegetation Index (EVI) and Normalized Difference Vegetation Index (NDVI) were found to be key predictors of crop yield; both correlated positively with yield, with higher values reflecting greater vegetation vigor. These were followed by B03 (green) and B12 (short-wave infrared), which capture reflectance properties associated with chlorophyll activity and water content, respectively. Both properties substantially influence photosynthetic efficiency and plant health, ultimately affecting yield potential.
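The two vegetation indices named in the summary can be illustrated with a minimal sketch. It uses the standard NDVI and EVI definitions and assumes the usual Sentinel-2 band mapping (B08 for near-infrared, B04 for red, B02 for blue) with surface reflectance already scaled to the 0-1 range. The record does not state which bands or coefficients the authors used, so the function names, the band mapping, and the sample reflectance values below are illustrative assumptions only.

```python
import math


def ndvi(nir: float, red: float) -> float:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    denom = nir + red
    # Guard against division by zero for no-data pixels.
    return (nir - red) / denom if denom != 0 else math.nan


def evi(nir: float, red: float, blue: float,
        gain: float = 2.5, c1: float = 6.0, c2: float = 7.5,
        soil: float = 1.0) -> float:
    """Enhanced Vegetation Index with the commonly used default coefficients.

    EVI = gain * (NIR - Red) / (NIR + c1 * Red - c2 * Blue + soil)
    """
    denom = nir + c1 * red - c2 * blue + soil
    return gain * (nir - red) / denom if denom != 0 else math.nan


# Hypothetical reflectance values (assumed, not taken from the article):
# a dense canopy pixel and a sparse canopy pixel.
dense = {"nir": 0.45, "red": 0.05, "blue": 0.03}
sparse = {"nir": 0.10, "red": 0.08, "blue": 0.05}

for name, px in (("dense", dense), ("sparse", sparse)):
    print(name, round(ndvi(px["nir"], px["red"]), 3),
          round(evi(px["nir"], px["red"], px["blue"]), 3))
```

With these assumed inputs the dense pixel scores higher on both indices, in line with the summary's statement that higher index values reflect greater vegetation vigor.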
ISSN: 2076-3417