An integrated approach based on Gaussian noises-based data augmentation method and AdaBoost model to predict faecal coliforms in rivers with small dataset


Publication Details

Output type: Journal article

UM6P affiliated Publication?: Yes

Author list: EL Bilali A., Taleb A., Bahlaoui M.A., Brouziyne Y.

Publisher: Elsevier

Publication year: 2021

Journal: Journal of Hydrology (0022-1694)

Volume number: 599

ISSN: 0022-1694


Languages: English (EN-GB)



Machine learning (ML) techniques can be valuable for modelling faecal contamination in rivers because they overcome some limitations of process-based models. However, this approach requires sufficiently large datasets for training and validation to avoid over-fitting. This study attempts to overcome the small-dataset limitation by relying on data augmentation. To that end, adaptive boosting (AdaBoost) models were trained and integrated into the data augmentation method to generate 600 virtual samples from 40 original samples. The results revealed that the proposed method significantly improved the accuracy (RMSE = 0.716 ln(CFU/100 ml), where CFU denotes colony-forming units) and the generalization ability of the AdaBoost model for predicting faecal coliforms in rivers, compared with the baseline model developed with the small dataset alone (RMSE = 2.348 ln(CFU/100 ml)). However, the study also showed that generating and using too many virtual samples could degrade the generalization ability of the ML model; the optimal number of virtual samples is about 337–415. Overall, the results provide new insights for improving the prediction accuracy of the health risk related to faecal coliforms in raw water used for drinking purposes when only a small dataset is available. The developed method can broaden the application of ML in water resources and environmental sciences when it is impossible to obtain the large dataset that ML models require. © 2021 Elsevier B.V.
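The augmentation idea summarised in the abstract can be illustrated with a short Python sketch. This is not the authors' implementation. The function names, the resampling scheme, the `noise_scale` value, and the example numbers are all assumptions made for illustration. It also assumes that the target is stored as ln(CFU/100 ml), which is suggested by the units of the reported RMSE.

```python
import math
import random
import statistics


def augment_with_gaussian_noise(samples, n_virtual, noise_scale=0.05, seed=0):
    """Create n_virtual noisy copies of randomly chosen original samples.

    samples: list of (features, target) pairs; features is a list of floats.
    noise_scale: ASSUMED fraction of each variable's standard deviation
                 used as the standard deviation of the added noise.
    """
    rng = random.Random(seed)
    n_features = len(samples[0][0])
    # The spread of each variable sets the magnitude of its noise.
    feature_sd = [
        statistics.pstdev([x[j] for x, _ in samples]) for j in range(n_features)
    ]
    target_sd = statistics.pstdev([y for _, y in samples])

    virtual = []
    for _ in range(n_virtual):
        x, y = rng.choice(samples)  # pick one original sample
        x_new = [v + rng.gauss(0.0, noise_scale * sd) for v, sd in zip(x, feature_sd)]
        y_new = y + rng.gauss(0.0, noise_scale * target_sd)
        virtual.append((x_new, y_new))
    return virtual


def rmse_ln(observed_cfu, predicted_cfu):
    """RMSE between the natural logs of observed and predicted counts (CFU/100 ml)."""
    squared = [
        (math.log(o) - math.log(p)) ** 2 for o, p in zip(observed_cfu, predicted_cfu)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical data: two made-up predictors per sample, target = ln(CFU/100 ml).
original = [
    ([7.2, 18.5], math.log(1200.0)),
    ([7.8, 21.0], math.log(300.0)),
    ([6.9, 16.0], math.log(5400.0)),
    ([7.5, 19.5], math.log(800.0)),
]
virtual = augment_with_gaussian_noise(original, n_virtual=60)
training_set = original + virtual  # original plus virtual samples
```

In a workflow like the one described, `training_set` would then be passed to a boosting regressor (for example, scikit-learn's `AdaBoostRegressor`), and `rmse_ln` would be computed on held-out original samples only. Because the abstract reports that too many virtual samples can hurt generalization, `n_virtual` is best treated as a parameter to tune rather than a value to maximise.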



Last updated on 2021-10-27 at 23:21