
splitting and standardization

Good Afternoon,
I was wondering what the best procedure is.
Imagine we have a full training sample X_total. Is one of these two methods right or wrong:

1) Splitting the data into X_train and X_validation, standardizing each with its respective mean and stdev, and running accuracy estimates by training with X_train and testing on X_validation

2) Standardizing the full sample with the mean and stdev of X_total, then splitting into X_train and X_validation, then running the model for accuracy estimates

In both cases we could replicate the procedure with the X_test set, either by
1) standardizing the unseen test data and the train data using the mean and stdev of the features of both sets combined, then training and testing the model, or
2) standardizing the train and test sets separately, then training and testing the model.

I was also wondering, for the report, whether we should report the test accuracy we get from our validation testing or from the final test on test.csv. I believe we should choose the model with cross-validation on our train.csv using the validation error estimates, and not with the test error on test.csv, which only allows us to estimate the generalization error of the best model chosen. But maybe I'm wrong.
Thank you for your help

Hope this answer to a related question on Discord is useful,

https://discordapp.com/channels/755347620175675452/755347620175675461/769473824445038628

Copied below,

"Ok, so to clarify and summarise, you should keep the following two points in mind

(A) Performance on the test set should be a measure of performance on unseen data. Hence you should not use the test set at any point during the model building process.

(B) Whatever transformation you apply on training data before feeding them to the model, you should apply the same transformation on the test data before using the model on them for prediction.

While normalising the data for training, in keeping with (A), you should find and apply the mean and standard deviation only on the training data. In keeping with (B) you should apply those same values of mean and standard deviation that you found on the training data to normalise the test data.

While cross-validating, the train-validation split in each fold is simulating a train-test split. Thus, by the same reasoning, for each fold you should find the parameters on the train split of that fold, normalise it, and apply those same parameters to normalise the validation split of that fold. "
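For concreteness, here is a minimal sketch of this fold-wise standardization using scikit-learn (the Pipeline/StandardScaler/LogisticRegression choices and the placeholder data are just for illustration, not part of the original answer):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data just to make the sketch runnable; substitute your own X_total / y_total.
rng = np.random.default_rng(0)
X_total = rng.normal(size=(200, 5))
y_total = rng.integers(0, 2, size=200)

# With the scaler inside the pipeline, each CV fold computes mean/stdev on its
# train split only and applies those values to its validation split.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])

scores = cross_val_score(model, X_total, y_total, cv=5)
print("CV accuracy estimate:", scores.mean())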

Thank you, I have two follow-up questions:
1) But can't we define the model building process itself to depend on the test data? By that I mean we create a model which itself 1) takes in the new test data features, 2) uses them for standardization, 3) reruns the ML algorithm, and 4) applies the retrained model to predict the label variables of the test set?

2) Suppose we did carry out CV: when we retrain on the full training set, would we also re-standardize using the mean and variance of the full training set?

I guess my question comes down to: why can't we have a model that recalibrates when it needs to predict Y_test by using the newly given X_test (which becomes known before prediction)? And if we do have such a model, then shouldn't we standardize before splitting to emulate the model training process?

You can have a model that re-calibrates based on a set of points before making predictions on it, but this is not a common case. We use a test set as a way to evaluate the model. In real settings, when we actually use the model, we usually make predictions on single instances or a small set of instances, and re-calibration would be of limited use there.

If, however, you were to use such a model that makes predictions on sets of points, the right way to evaluate it would be to have multiple such 'test sets', each of which you standardise independently (the standardisation being part of the model).

Standardising before splitting and training the model can result in some information leaking from the test set, so this should not be done.
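As a rough sketch of the leakage-free order (the names and placeholder data are hypothetical, and scikit-learn is used only for illustration): split first, then fit the scaler on the training part alone.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_total = rng.normal(size=(200, 5))      # placeholder features
y_total = rng.integers(0, 2, size=200)   # placeholder labels

# Split FIRST, then compute the standardization parameters on the train split only.
X_train, X_val, y_train, y_val = train_test_split(
    X_total, y_total, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)   # mean/stdev come from X_train alone
X_train_std = scaler.transform(X_train)
X_val_std = scaler.transform(X_val)      # same parameters reused; X_val never influences them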

why can't we have a model that recalibrates when it needs to predict Y_test by using the newly given X_test (which becomes known before prediction)?
And if we do have such a model, then shouldn't we standardize before splitting to emulate the model training process?

Good questions!

Generally, there is no "transduction" allowed between test samples. I.e. you have to assume that the model only receives one test sample at a time (in the wild), and has to make a prediction for it.

This also explains the reasoning behind computing the normalization parameters on the training set and merely applying them to the test sample(s). The test set used for model evaluation (MSE, accuracy, F1 score, ...) contains many samples in order to get a representative estimate, but this is different from the in-the-wild goal of single (test) sample prediction.
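A tiny sketch of what single-sample prediction in the wild looks like (the names scaler, clf and predict_one are hypothetical; both objects are assumed to have been fitted on the training data beforehand):

def predict_one(x_new, scaler, clf):
    # Only the mean/stdev stored at training time are available here; there is
    # no batch of test points from which to recompute them.
    x_std = scaler.transform(x_new.reshape(1, -1))
    return clf.predict(x_std)[0]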

Thank you so much for your quick replies!
I was wondering about the case when we have a labelled data set and we split it into train and validation to evaluate performance. We standardize using X_train and use those parameters for X_validation. Once we have chosen our model and retrain on the full labelled data set before testing on unseen data, we should also restandardize, right? And this time use the concatenation [X_train, X_validation], and use that new mean and stdev for X_test?

Once we have chosen our model and retrain on the full labelled data set before testing on unseen data, we should also restandardize, right? And this time use the concatenation [X_train, X_validation], and use that new mean and stdev for X_test?

Correct.

Note that the normalization/standardization parameters for the (cross-validation) X_train and the (final) [X_train, X_validation] are expected to be very similar. This is because X_train (and X_validation) should be a good representation of the full dataset [X_train, X_validation] for cross-validation to work in the first place.
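A minimal sketch of this final step (scikit-learn and the placeholder arrays are purely illustrative): refit the scaler and the model on all labelled data, then reuse those parameters on the unseen test set.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Placeholder arrays standing in for the real splits.
rng = np.random.default_rng(0)
X_train, X_validation = rng.normal(size=(160, 5)), rng.normal(size=(40, 5))
y_train, y_validation = rng.integers(0, 2, size=160), rng.integers(0, 2, size=40)
X_test = rng.normal(size=(50, 5))

# After model selection, restandardize using ALL labelled data and retrain on it.
X_full = np.vstack([X_train, X_validation])
y_full = np.concatenate([y_train, y_validation])

scaler = StandardScaler().fit(X_full)        # new mean/stdev from [X_train, X_validation]
clf = LogisticRegression().fit(scaler.transform(X_full), y_full)

y_test_pred = clf.predict(scaler.transform(X_test))  # same parameters applied to X_test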
