Document Type : Research Article

Authors

1 Faculty of Literature and Humanities, Hakim Sabzevari University, Sabzevar, Iran

2 Department of Energy Economics, Allameh Tabatabai University, Tehran, Iran

Abstract

In predictive modeling, overfitting poses a significant risk, particularly when the number of features exceeds the number of observations, a common scenario in high-dimensional datasets. To mitigate this risk, feature selection is employed to enhance model generalizability by reducing the dimensionality of the data. Numerous feature selection methods have been investigated in the literature, and choosing the most accurate ones for forecasting can be challenging [1]. This study therefore evaluates the stability of well-known feature selection techniques with respect to varying data volumes, focusing on time series similarity methods. Using a dataset comprising the closing, opening, high, and low prices of stocks from 100 high-income companies listed in the Fortune Global 500, it compares several feature selection methods, including variance thresholds, edit distance, and Hausdorff distance. The aim is to identify methods that show minimal sensitivity to the quantity of data, ensuring robust and reliable predictions, which is crucial for financial forecasting. Results indicate that, among the tested feature selection strategies, the variance, edit distance, and Hausdorff methods exhibit the least sensitivity to changes in data volume. These methods therefore provide a dependable way to reduce the feature space without significantly compromising predictive accuracy. The study highlights the effectiveness of time series similarity methods in feature selection and underlines their potential in applications involving fluctuating datasets, such as financial markets or dynamic economic conditions.
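
The kind of comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: features are ranked by variance or by their Hausdorff distance to a target series, the k best are kept, and stability is measured as the Jaccard overlap between the features selected on the full data and on a shorter prefix. The function names (variance_select, hausdorff_select, stability_across_sizes) and the synthetic data are hypothetical.

```python
import numpy as np

def variance_select(X, y, k):
    """Indices of the k columns of X with the largest variance (y is unused)."""
    variances = X.var(axis=0)
    return set(np.argsort(variances)[-k:].tolist())

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two 1-D series,
    each treated as a set of (time index, value) points."""
    pa = np.column_stack([np.arange(len(a)), a])
    pb = np.column_stack([np.arange(len(b)), b])
    # Pairwise Euclidean distances between every point of a and every point of b.
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def hausdorff_select(X, y, k):
    """Indices of the k columns of X closest to the target y (assumed selection rule)."""
    dists = np.array([hausdorff(X[:, j], y) for j in range(X.shape[1])])
    return set(np.argsort(dists)[:k].tolist())

def jaccard(s1, s2):
    """Jaccard overlap of two index sets."""
    return len(s1 & s2) / len(s1 | s2)

def stability_across_sizes(select, X, y, k, sizes):
    """Overlap between the selection on all rows and on the first n rows, per n."""
    full = select(X, y, k)
    return {n: jaccard(full, select(X[:n], y[:n], k)) for n in sizes}

if __name__ == "__main__":
    # Synthetic random-walk "prices"; purely illustrative data.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 8)).cumsum(axis=0)
    y = X[:, 0] + rng.normal(scale=0.1, size=200)
    for name, select in [("variance", variance_select), ("hausdorff", hausdorff_select)]:
        print(name, stability_across_sizes(select, X, y, 3, [50, 100, 150]))
```

An overlap close to 1 at every sample size would indicate a method whose selected features change little as the data volume shrinks, which is the notion of low sensitivity the abstract refers to.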

Keywords

References

[1] Y. Hmamouche, P. Przymus, A. Casali, and L. Lakhal, GFSM: a feature selection method for improving time series forecasting, Int. J. Adv. Syst. Meas., (2017).
[2] E. W. Newell and Y. Cheng, Mass cytometry: blessed with the curse of dimensionality, Nat. Immunol., 17 (2016), pp. 890–895. doi:10.1038/ni.3485.
[3] B. Remeseiro and V. Bolon-Canedo, A review of feature selection methods in medical applications, Comput. Biol. Med., 112 (2019). doi:10.1016/j.compbiomed.2019.103375.
[4] E. Ergüner Özkoç, Clustering of Time-Series Data, IntechOpen, (2021). doi:10.5772/intechopen.84490.
[5] A. Alqahtani, M. Ali, X. Xie, and M. W. Jones, Deep Time-Series Clustering: A Review, Electronics, 10 (23) (2021), 3001. doi:10.3390/electronics10233001.
[6] J. L. Vermeulen, Geometric similarity measures and their applications [dissertation], Utrecht University, (2023).
[7] H. Xie, J. Li, and H. Xue, A survey of dimensionality reduction techniques based on random projection, arXiv, (2017). Available from: https://arxiv.org/abs/1706.04371.
[8] X. Zhu, Y. Wang, Y. Li, Y. Tan, G. Wang, and Q. Song, A new unsupervised feature selection algorithm using similarity-based feature clustering, Comput. Intell., 35 (1) (2019), pp. 2–22. doi:10.1111/coin.12192.
[9] P. Mitra, C. A. Murthy, and S. K. Pal, Unsupervised feature selection using feature similarity, IEEE Trans. Pattern Anal. Mach. Intell., 24 (3) (2002), pp. 301–312. doi:10.1109/34.990133.
[10] Q. Yu, S. Jiang, R. Wang, and H. Wang, A feature selection approach based on a similarity measure for software defect prediction, Front. Inf. Technol. Electron. Eng., 18 (11) (2017), pp. 1744–1753. doi:10.1631/FITEE.1601322.
[11] Y. Shi, C. Zu, M. Hong, L. Zhou, L. Wang, X. Wu, et al., ASMFS: Adaptive-similarity-based multi-modality feature selection for classification of Alzheimer’s disease, Pattern Recognit., 126 (2022), 108566. doi:10.1016/j.patcog.2022.108566.
[12] X. Fu, F. Tan, H. Wang, Y. Zhang, and R. W. Harrison, Feature similarity based redundancy reduction for gene selection, In: Proceedings of the International Conference on Data Mining (Dmin), (2006), pp. 357–360.
[13] A. Vabalas, E. Gowen, E. Poliakoff, and A. J. Casson, Machine learning algorithm validation with a limited sample size, PLoS One, 14 (11) (2019), e0224365.
[14] G. L. Perry and M. E. Dickson, Using machine learning to predict geomorphic disturbance: The effects of sample size, sample prevalence, and sampling strategy, J. Geophys. Res. Earth Surf., 123 (11) (2018), pp. 2954–2970. doi:10.1029/2018JF004640.
[15] Z. Cui and G. Gong, The effect of machine learning regression algorithms and sample size on individualized behavioral prediction with functional connectivity features, Neuroimage, 178 (2018), pp. 622–637. doi:10.1016/j.neuroimage.2018.06.001.
[16] L. I. Kuncheva, C. E. Matthews, A. Arnaiz-Gonzalez, and J. J. Rodríguez, Feature selection from high-dimensional data with very low sample size: A cautionary tale, arXiv, (2020). Available from: https://arxiv.org/abs/2008.12025.
[17] L. I. Kuncheva and J. J. Rodríguez, On feature selection protocols for very low-sample-size data, Pattern Recognit., 81 (2018), pp. 660–673. doi:10.1016/j.patcog.2018.03.012.
[18] J. Doak, An evaluation of feature selection methods and their application to computer security [Technical Report], CSE-92-18, (1992).
[19] H. Liu and L. Yu, Toward integrating feature selection algorithms for classification and clustering, IEEE Trans. Knowl. Data Eng., 17 (4) (2005), pp. 491–502. doi:10.1109/TKDE.2005.66.
[20] C. F. Tsai and Y. T. Sung, Ensemble feature selection in high dimension, low sample size datasets: Parallel and serial combination approaches, Knowl. Based Syst., 203 (2020), 106097. doi:10.1016/j.knosys.2020.106097.
[21] U. Mori, A. Mendiburu, and J. A. Lozano, Similarity measure selection for clustering time series databases, IEEE Trans. Knowl. Data Eng., 28 (1) (2015), pp. 181–195. doi:10.1109/TKDE.2015.2462369.
[22] M. Goldani, A review of time series similarity methods, In: Proceedings of the 3rd International Conference on Innovation in Business Management and Economics, (2022).
[23] S. Palkhiwala, M. Shah, and M. Shah, Analysis of machine learning algorithms for predicting a student’s grade, J. Data Inf. Manag., 4 (2022), pp. 329–341. doi:10.1007/s42488-022-00078-2.
[24] A. C. Rencher and W. F. Christensen, Methods of Multivariate Analysis, 3rd ed., Hoboken: John Wiley & Sons, 2012.