Research Article
Abdalla Bala and Alain Abra
Abstract
Multi-organizational repositories, in particular those based on voluntary data contributions such as the repository of the International Software Benchmarking Standards Group (ISBSG), may be missing a large number of values in many of their data fields and may also include outliers. This paper identifies a number of data quality issues associated with the ISBSG repository that can compromise the results obtained by users who rely on it for benchmarking or for building estimation models. We propose a number of criteria and techniques for preprocessing the data in order to improve the quality of the samples selected for detailed statistical analysis, and we present a multiple imputation (MI) strategy for dealing with datasets that have missing values.