Electronics, Journal Year: 2025, Volume and Issue: 14(10), P. 1919
Published: May 9, 2025
This study introduces two novel data reduction approaches for efficient sentiment analysis: High-Distance Sentiment Vectors (HDSV) and Centroid Sentiment Embedding Vectors (CSEV). By leveraging embedding space characteristics from DistilBERT, HDSV selects maximally separated sample pairs, while CSEV computes representative centroids for each class. We evaluate these methods on three benchmark datasets: SST-2, Yelp, and Sentiment140. Our results demonstrate remarkable efficiency, reducing training samples to just 100 while maintaining performance comparable to full-dataset training. Notable findings include achieving 88.93% accuracy on SST-2 (compared with 90.14% on the full data), and both methods showing improved cross-dataset generalization, with less than a 2% accuracy drop in domain transfer tasks versus 11.94% for full-dataset training. The proposed methods enable significant storage savings, with datasets compressed to roughly 1% of their original size, making them particularly valuable for resource-constrained environments. Our findings advance the understanding of data requirements in sentiment analysis, demonstrating that strategically selected minimal training data can achieve robust and generalizable classification while promoting more sustainable machine learning practices.
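The abstract names the two reduction ideas without giving their exact selection rules, so the following is only a minimal sketch of how they could look and not the authors' implementation. It assumes Euclidean distance, binary labels (0 = negative, 1 = positive), and a greedy pair selection in which each sample is used at most once. Synthetic Gaussian vectors stand in for DistilBERT sentence embeddings, and all function names (`class_centroids`, `farthest_pairs`, `nearest_centroid_predict`) are hypothetical.

```python
import numpy as np


def class_centroids(embeddings, labels):
    """Centroid-style reduction: one mean embedding per class."""
    return {int(c): embeddings[labels == c].mean(axis=0) for c in np.unique(labels)}


def farthest_pairs(embeddings, labels, n_pairs):
    """High-distance-style reduction (assumed rule): greedily keep the
    (negative, positive) index pairs with the largest Euclidean distance,
    using each sample at most once. Assumes binary labels 0/1."""
    neg = np.flatnonzero(labels == 0)
    pos = np.flatnonzero(labels == 1)
    # Distance between every negative and every positive sample.
    diff = embeddings[neg][:, None, :] - embeddings[pos][None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    order = np.argsort(dist, axis=None)[::-1]  # flat indices, largest first
    used_neg, used_pos, pairs = set(), set(), []
    for flat in order:
        i, j = np.unravel_index(flat, dist.shape)
        i, j = int(i), int(j)
        if i in used_neg or j in used_pos:
            continue
        used_neg.add(i)
        used_pos.add(j)
        pairs.append((int(neg[i]), int(pos[j])))
        if len(pairs) == n_pairs:
            break
    return pairs


def nearest_centroid_predict(centroids, queries):
    """Assign each query vector to the class of its closest centroid."""
    classes = sorted(centroids)
    stacked = np.stack([centroids[c] for c in classes])
    dist = np.linalg.norm(queries[:, None, :] - stacked[None, :, :], axis=-1)
    return np.array(classes)[dist.argmin(axis=1)]


# Synthetic stand-in for sentence embeddings: two well-separated blobs.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(-2.0, 1.0, (50, 8)), rng.normal(2.0, 1.0, (50, 8))])
lab = np.array([0] * 50 + [1] * 50)

centroids = class_centroids(emb, lab)          # 2 vectors replace 100 samples
pairs = farthest_pairs(emb, lab, n_pairs=5)    # 10 samples replace 100
preds = nearest_centroid_predict(centroids, emb)
print("pairs kept:", pairs)
print("nearest-centroid accuracy on the toy data:", (preds == lab).mean())
```

In this sketch the reduced set is either the two centroid vectors or the ten samples in the selected pairs; in a real pipeline either one would then be used to train or define a classifier in place of the full training set.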
Language: English