Document Relevance Filtering by Natural Language Processing and Machine Learning: A Multidisciplinary Case Study of Patents
The exponential growth of patent datasets poses a significant challenge in filtering relevant documents for research and innovation. Traditional semantic search methods based on keywords often fail to capture the complexity and variability in multidisciplinary terminology, leading to inefficiencies....
| Main Author: | Raj Bridgelall |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-02-01 |
| Series: | Applied Sciences |
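| Subjects: | document search; supervised machine learning; unsupervised machine learning; natural language processing; latent Dirichlet allocation; non-negative matrix factorization |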
| Online Access: | https://www.mdpi.com/2076-3417/15/5/2357 |
| _version_ | 1850030887384645632 |
|---|---|
| author | Raj Bridgelall |
| collection | DOAJ |
| description | The exponential growth of patent datasets poses a significant challenge in filtering relevant documents for research and innovation. Traditional semantic search methods based on keywords often fail to capture the complexity and variability in multidisciplinary terminology, leading to inefficiencies. This study addresses the problem by systematically evaluating supervised and unsupervised machine learning (ML) techniques for document relevance filtering across five technology domains: solid-state batteries, electric vehicle chargers, connected vehicles, electric vertical takeoff and landing aircraft, and light detection and ranging (LiDAR) sensors. The contributions include benchmarking the performance of 10 classical models, including extreme gradient boosting, random forest, and support vector machines; a deep artificial neural network; and three natural language processing methods: latent Dirichlet allocation, non-negative matrix factorization, and k-means clustering of a manifold-learned reduced feature dimension. Applying these methods to more than 4200 patents filtered from a database of 9.6 million patents revealed that most supervised ML models outperformed the unsupervised methods. An average of seven supervised ML models achieved significantly higher precision, recall, and F1-scores across all technology domains, while the unsupervised methods showed variability depending on domain characteristics. These results offer a practical framework for optimizing document relevance filtering, enabling researchers and practitioners to efficiently manage large datasets and enhance innovation. |
| format | Article |
| id | doaj-art-1e245c8b33134a4fbe4ee5fc1b3aa5e4 |
| institution | DOAJ |
| issn | 2076-3417 |
| language | English |
| publishDate | 2025-02-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Applied Sciences |
| spelling | doaj-art-1e245c8b33134a4fbe4ee5fc1b3aa5e4; 2025-08-20T02:59:07Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2025-02-01; 15; 5; 2357; 10.3390/app15052357; Document Relevance Filtering by Natural Language Processing and Machine Learning: A Multidisciplinary Case Study of Patents; Raj Bridgelall (Department of Transportation and Supply Chain, College of Business, North Dakota State University, P.O. Box 6050, Fargo, ND 58108-6050, USA); https://www.mdpi.com/2076-3417/15/5/2357; document search; supervised machine learning; unsupervised machine learning; natural language processing; latent Dirichlet allocation; non-negative matrix factorization |
| title | Document Relevance Filtering by Natural Language Processing and Machine Learning: A Multidisciplinary Case Study of Patents |
| topic | document search; supervised machine learning; unsupervised machine learning; natural language processing; latent Dirichlet allocation; non-negative matrix factorization |
| url | https://www.mdpi.com/2076-3417/15/5/2357 |
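
The description above compares models by precision, recall, and F1-score. As a minimal sketch of how those three metrics are computed for a binary relevance filter, the following uses only the Python standard library. The function name `relevance_scores`, the `Scores` container, and the example labels are hypothetical illustrations and are not taken from the study.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Scores:
    precision: float
    recall: float
    f1: float


def relevance_scores(y_true: Sequence[int], y_pred: Sequence[int]) -> Scores:
    """Compute precision, recall, and F1 for binary relevance labels.

    Labels are 1 (relevant) or 0 (not relevant). This is an illustrative
    helper, not the evaluation code used in the study.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # relevant, kept
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # irrelevant, kept
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # relevant, dropped

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return Scores(precision, recall, f1)


# Hypothetical labels for eight patents (made-up data, for illustration only).
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]
scores = relevance_scores(y_true, y_pred)
print(scores)
```

In this made-up example the filter keeps three of the four relevant patents and one irrelevant patent, so precision, recall, and F1 all equal 0.75.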