An in‐depth study of the effects of methods on the dataset selection of public development projects

Abstract Public development projects (PDPs) and documented public development projects (DPDPs) are two types of projects that can provide valuable information on how developers and users participate in OSS projects. However, it is hard for researchers to effectively select PDPs and DPDPs due to the...

Bibliographic Details
Main Authors: Can Cheng, Bing Li, Zengyang Li, Peng Liang, Xu Yang
Format: Article
Language: English
Published: Wiley 2022-04-01
Series: IET Software
Subjects: data mining, software engineering
Online Access: https://doi.org/10.1049/sfw2.12050
author Can Cheng
Bing Li
Zengyang Li
Peng Liang
Xu Yang
collection DOAJ
description Abstract Public development projects (PDPs) and documented public development projects (DPDPs) are two types of projects that can provide valuable information on how developers and users participate in OSS projects. However, it is hard for researchers to effectively select PDPs and DPDPs due to the lack of specific project selection methods for these two types of projects. To address this problem, a standard dataset was labelled, and the baseline methods (i.e. selecting projects according to a single feature like star number) under 60 configurations and the machine learning methods under 18 configurations were tested to identify the best configurations in precision and F‐measure for selecting PDPs and DPDPs. The results show that (1) to select PDPs or DPDPs with a high precision, the baseline method is the best, with a precision of 0.877 (PDPs) and 0.831 (DPDPs); (2) to select PDPs or DPDPs with a high F‐measure, the machine learning methods are the best, with an F‐measure of 0.817 (PDPs) and 0.789 (DPDPs); (3) existing sample selection strategies can be combined with the machine learning methods, and the precision of selecting PDPs can be increased by 6.39%–41.33% and the precision of selecting DPDPs can be increased by 35.50%–269.02%.
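The baseline method named in the abstract (selecting projects by a single feature such as star number) and the two reported metrics can be illustrated with a small sketch. The project records, the threshold values, and the names `Project`, `select_by_threshold` and `precision_recall_f1` are assumptions made for illustration only and are not taken from the paper; the F‐measure is assumed here to be the balanced F1 score.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    stars: int
    is_pdp: bool  # hand-assigned ground-truth label (hypothetical data)


def select_by_threshold(projects, min_stars):
    """Baseline: keep every project whose star count reaches the threshold."""
    return [p for p in projects if p.stars >= min_stars]


def precision_recall_f1(projects, selected):
    """Score a selection against the ground-truth labels."""
    selected_names = {p.name for p in selected}
    tp = sum(1 for p in projects if p.is_pdp and p.name in selected_names)
    fp = len(selected_names) - tp
    fn = sum(1 for p in projects if p.is_pdp and p.name not in selected_names)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labelled dataset (invented values, not the paper's data).
PROJECTS = [
    Project("a", 1200, True),
    Project("b", 800, True),
    Project("c", 500, False),
    Project("d", 50, True),
    Project("e", 10, False),
]

if __name__ == "__main__":
    for min_stars in (0, 600):
        chosen = select_by_threshold(PROJECTS, min_stars)
        p, r, f1 = precision_recall_f1(PROJECTS, chosen)
        print(f"min_stars={min_stars}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

On this toy data a strict threshold yields high precision but misses a true PDP, while a loose threshold recovers every PDP at lower precision, which is the kind of trade-off the abstract's precision and F‐measure comparison is about.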
format Article
id doaj-art-6fe40b02bee84944a4c421ab4e5c2cf5
institution Kabale University
issn 1751-8806
1751-8814
language English
publishDate 2022-04-01
publisher Wiley
record_format Article
series IET Software
spelling doaj-art-6fe40b02bee84944a4c421ab4e5c2cf5
2025-02-03T01:29:38Z
eng
Wiley
IET Software
1751-8806
1751-8814
2022-04-01
Volume 16, Issue 2, pp. 146-166
10.1049/sfw2.12050
An in‐depth study of the effects of methods on the dataset selection of public development projects
Can Cheng (School of Computer Science, Wuhan University, Wuhan, China)
Bing Li (School of Computer Science, Wuhan University, Wuhan, China)
Zengyang Li (School of Computer Science, Central China Normal University, Wuhan, China)
Peng Liang (School of Computer Science, Wuhan University, Wuhan, China)
Xu Yang (Huawei Technologies, Nanjing, China)
https://doi.org/10.1049/sfw2.12050
data mining
software engineering
title An in‐depth study of the effects of methods on the dataset selection of public development projects
topic data mining
software engineering
url https://doi.org/10.1049/sfw2.12050