A

In a typical machine learning project, the dataset is divided into three unequal parts: train sets (the largest part), validation, and test (also called holdout) sets. A train set is used for model training. Validation and test sets are both used to test a model’s performance with a difference that the validation set is used to choose between several models with different hyperparameters. A holdout set is then used to get the final accuracy. Test sets should never be used for model training, tuning, or for selecting between models.

Why is it important?

Dividing data into several parts allows data scientists to effectively train and evaluate machine learning models without mixing the data. Such an approach guarantees unbiased results since the model is tested on previously unseen data. It leads to greater confidence more accurate results.

A

B

C

D

E

F

I

L

M

N

O

P

Q

R

S

T

U

Train, validation, and test sets

APPLICATION FORM

APPLICATION FORM