
Information spill from test set in training

I understand that in a supervised context it is crucial to avoid any information spill from the test set about the relationship between test features and test labels; otherwise we cannot estimate the generalization error. But it is not clear to me why information about the distribution of the features cannot be used. Consider, for example, the following scenarios, where I have a labeled training set and a set of data whose unknown labels I want to predict:

  • For standardization, I can get a better estimate of the mean and standard deviation of each feature if I use both the training data and the data with unknown labels.
  • If I'm building a text representation, say tf-idf, I can better estimate the document frequency of each word, and thus build a better representation, if I also count word occurrences in the texts with unknown labels (see the sketch below).
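As a concrete sketch of the two variants (a hypothetical illustration using scikit-learn; the data and names are placeholders, not from the original post):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

# Labeled training data and the data whose labels we want to predict.
X_train = np.random.rand(1000, 5)        # placeholder features
X_unlabeled = np.random.rand(200, 5)

# Option A: estimate mean/std on the training data only.
scaler_a = StandardScaler().fit(X_train)

# Option B: estimate mean/std on training + unlabeled data.
# No labels are touched, so no label information can spill.
scaler_b = StandardScaler().fit(np.vstack([X_train, X_unlabeled]))

# Same idea for tf-idf: document frequencies can be counted on the
# combined corpus before transforming either part.
train_texts = ["a training document", "another training document"]
unlabeled_texts = ["a document with an unknown label"]
vectorizer = TfidfVectorizer().fit(train_texts + unlabeled_texts)
X_text = vectorizer.transform(train_texts)
```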

Since I don't even know the labels in the second set, I don't see how these practices could let me cheat and get better accuracy on that set: I never used the labels, because they were not available. The same logic would seem to apply to a validation set: I could perform standardization using the validation set as well. Since this uses no information about the validation labels, I don't see how it could worsen generalization. Yet I have the impression we were taught never to use data from a test or validation set to pre-process the data. Is my argument about information spill incorrect?

Is my argument about information spill incorrect?

No, your argument is essentially correct: there is no information spill as long as you don't use the labels and the val/test set is drawn from the same distribution. However, if the training set is large, you usually get equally good or better estimates (for normalization, the model, etc.) by sticking to the training data alone.
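A quick numerical illustration of that point (a made-up sanity check, not from the original answer): when the training set is large, folding in a handful of extra samples barely moves the estimated statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=3.0, scale=2.0, size=50_000)   # large training set
test = rng.normal(loc=3.0, scale=2.0, size=10)        # tiny unlabeled batch

# The training mean is already a good estimate; ten extra
# samples change it only negligibly.
print(train.mean())                              # ~3.0
print(np.concatenate([train, test]).mean())      # nearly identical
```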

Yet I have the impression we were taught never to use data from a test or validation set to pre-process the data.

In the end, whether to use the val/test data depends on the in-the-wild test setting:

  • if, in the test setting you are preparing for, you get all the test samples at once and the size of that set is on the order of the training set, then using the unlabeled data is fine. Note that the assumption is that your training set distribution already represents the (total) test distribution well, so the unlabeled data has little to add.
  • however, a more common setting is that you get the test samples one by one or in very small batches. Here, normalization statistics from the (much larger) training data are a better estimate of the true distributional statistics. Also, test samples are usually not buffered or saved, so each test sample is treated ‘atomically’, i.e. without knowledge of the others (see the sketch after this list).
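A minimal sketch of that second setting, assuming scikit-learn (the model and names are illustrative): the statistics are fitted once on the training data, frozen, and then applied to each arriving sample in isolation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 5))   # large training set (placeholder)
y_train = rng.integers(0, 2, size=10_000)

# Fit the normalization statistics once, on the training data only.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

def predict_one(x):
    """Treat a single in-the-wild sample 'atomically': normalize it with
    the frozen training statistics; never buffer it or pool it with
    other test samples."""
    return model.predict(scaler.transform(x.reshape(1, -1)))

x_new = rng.normal(size=5)               # one sample arriving in the wild
print(predict_one(x_new))
```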

Please note that the ‘test’ set setup you use for the final model evaluation should mimic the exact in-the-wild setting; otherwise you’re comparing apples and oranges in terms of model performance.

