Document Relevance Filtering by Natural Language Processing and Machine Learning: A Multidisciplinary Case Study of Patents

The exponential growth of patent datasets poses a significant challenge in filtering relevant documents for research and innovation. Traditional semantic search methods based on keywords often fail to capture the complexity and variability in multidisciplinary terminology, leading to inefficiencies....
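The record's full description compares models by precision, recall, and F1-score. As a minimal sketch of how those three scores are computed for a binary relevance filter (this is not the article's code, and the function name `relevance_scores` and the label lists below are hypothetical, made up for illustration only):

```python
def relevance_scores(y_true, y_pred):
    """Return (precision, recall, f1) for binary labels, where 1 = relevant."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # relevant, kept
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # irrelevant, kept
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # relevant, dropped

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels for eight documents (illustrative, not data from the study).
y_true = [1, 1, 1, 0, 0, 1, 0, 0]  # manual relevance judgments
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]  # output of some filter

precision, recall, f1 = relevance_scores(y_true, y_pred)
print(precision, recall, f1)
```

Precision penalizes irrelevant documents that pass the filter, recall penalizes relevant documents that are dropped, and F1 is their harmonic mean.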

Bibliographic Details
Main Author: Raj Bridgelall
Format: Article
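Language: English
Published: MDPI AG 2025-02-01
Series: Applied Sciences
Subjects: document search; supervised machine learning; unsupervised machine learning; natural language processing; latent Dirichlet allocation; non-negative matrix factorization
Online Access: https://www.mdpi.com/2076-3417/15/5/2357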
author Raj Bridgelall
collection DOAJ
description The exponential growth of patent datasets poses a significant challenge in filtering relevant documents for research and innovation. Traditional semantic search methods based on keywords often fail to capture the complexity and variability of multidisciplinary terminology, leading to inefficiencies. This study addresses the problem by systematically evaluating supervised and unsupervised machine learning (ML) techniques for document relevance filtering across five technology domains: solid-state batteries, electric vehicle chargers, connected vehicles, electric vertical takeoff and landing aircraft, and light detection and ranging (LiDAR) sensors. The contributions include benchmarking the performance of 10 classical models, including extreme gradient boosting, random forest, and support vector machines; a deep artificial neural network; and three natural language processing methods: latent Dirichlet allocation, non-negative matrix factorization, and k-means clustering of a manifold-learned reduced feature dimension. Applying these methods to more than 4200 patents filtered from a database of 9.6 million patents revealed that most supervised ML models outperform the unsupervised methods. On average, seven supervised ML models achieved significantly higher precision, recall, and F1-scores across all technology domains, whereas the unsupervised methods showed variability depending on domain characteristics. These results offer a practical framework for optimizing document relevance filtering, enabling researchers and practitioners to efficiently manage large datasets and enhance innovation.
format Article
id doaj-art-1e245c8b33134a4fbe4ee5fc1b3aa5e4
institution DOAJ
issn 2076-3417
language English
publishDate 2025-02-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj-art-1e245c8b33134a4fbe4ee5fc1b3aa5e4 | 2025-08-20T02:59:07Z | eng | MDPI AG | Applied Sciences | 2076-3417 | 2025-02-01 | vol. 15, no. 5, art. 2357 | 10.3390/app15052357 | Document Relevance Filtering by Natural Language Processing and Machine Learning: A Multidisciplinary Case Study of Patents | Raj Bridgelall (Department of Transportation and Supply Chain, College of Business, North Dakota State University, P.O. Box 6050, Fargo, ND 58108-6050, USA) | https://www.mdpi.com/2076-3417/15/5/2357 | document search; supervised machine learning; unsupervised machine learning; natural language processing; latent Dirichlet allocation; non-negative matrix factorization
title Document Relevance Filtering by Natural Language Processing and Machine Learning: A Multidisciplinary Case Study of Patents
topic document search
supervised machine learning
unsupervised machine learning
natural language processing
latent Dirichlet allocation
non-negative matrix factorization
url https://www.mdpi.com/2076-3417/15/5/2357