Green Data Pipelines: Energy-Aware Data Engineering for Large-Scale AI Workloads
Keywords:
Hyperparameter Optimization, Class Imbalance, Sustainable Computing, Machine Learning, Green Data Pipeline, XGBoost, Random Forest, Feature Selection, Energy-Aware Computing, AI Workload Classification

Abstract
The growing computational demands of modern cloud environments, driven by the increasing adoption of artificial intelligence workloads, call for data pipelines that support sustainable computing. Existing AI-based workload classification models, however, suffer from shortcomings such as overlapping features and imbalanced classes, which prevent them from reaching high precision and energy efficiency. Combining modern feature selection methods, class balancing techniques, and machine learning models, this research develops an optimised Green Data Pipeline that improves workload identification while reducing operational costs. The methodology begins with exploratory data analysis (EDA) to assess relationships between features, followed by dimensionality reduction using principal component analysis (PCA) and recursive feature elimination (RFE), and class balancing through the Synthetic Minority Oversampling Technique (SMOTE). Logistic Regression, LightGBM, Random Forest, and XGBoost are then evaluated to determine the most suitable model for energy-aware AI workload classification. Results show that tree-based models outperform linear classifiers: Random Forest achieved 97% accuracy, while XGBoost reached 100%, a figure that indicates possible overfitting. Bayesian Optimisation is used for hyperparameter tuning, and regularisation methods are integrated into the model framework to prevent overfitting. Performance is quantified using classification reports, confusion matrices, and ROC-AUC scores. The most effective energy-efficient AI workload classification method is an optimised ensemble of Random Forest and XGBoost combined with feature selection and data balancing. Future work in the field should address adaptive AI workload classification and the operational deployment of energy-optimised workloads in the cloud.
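As a concrete illustration of the pipeline summarised above, the following is a minimal sketch combining SMOTE oversampling, RFE feature selection, and a Random Forest classifier with scikit-learn and imbalanced-learn. The synthetic dataset, feature counts, and hyperparameter values are illustrative assumptions standing in for the paper's actual workload data and configuration.

```python
# Minimal sketch of the described pipeline, assuming scikit-learn and
# imbalanced-learn. Synthetic data stands in for the real workload
# dataset; all hyperparameter values are illustrative.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn's Pipeline allows samplers
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Stand-in for the AI workload dataset: three imbalanced workload classes.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           n_classes=3, weights=[0.6, 0.3, 0.1],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

pipeline = Pipeline(steps=[
    ("smote", SMOTE(random_state=42)),          # oversample minority classes
    ("rfe", RFE(RandomForestClassifier(n_estimators=100, random_state=42),
                n_features_to_select=10)),      # drop redundant features
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42)),
])

pipeline.fit(X_train, y_train)   # SMOTE is applied to training data only
print(classification_report(y_test, pipeline.predict(X_test)))
```

Using imbalanced-learn's Pipeline rather than scikit-learn's keeps SMOTE inside the fitting step, so oversampling never leaks into the held-out test set.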
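The abstract also reports Bayesian Optimisation for hyperparameter tuning with regularisation to curb overfitting. One common realisation is sketched below using Optuna's TPE sampler, a form of Bayesian (sequential model-based) optimisation, to tune an XGBoost classifier including its L2 regularisation term (reg_lambda); the library choice, search ranges, and trial budget are assumptions, as the paper does not specify them.

```python
# Hedged sketch of Bayesian hyperparameter tuning for XGBoost via Optuna's
# TPE sampler; search ranges and trial budget are illustrative assumptions.
# Reuses X_train, y_train from the preceding pipeline sketch.
import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 8),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        # L2 regularisation: penalises large leaf weights to reduce overfitting
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-2, 10.0, log=True),
    }
    model = XGBClassifier(random_state=42, eval_metric="mlogloss", **params)
    # Cross-validated accuracy is more robust than a single train/test split
    return cross_val_score(model, X_train, y_train, cv=3,
                           scoring="accuracy").mean()

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=25)
print(study.best_params, study.best_value)
```

Scoring each trial by cross-validated accuracy rather than a single split helps flag configurations whose perfect test scores, like the 100% reported for XGBoost, may reflect overfitting.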