Model evaluation with cross-validation

Learning Objectives

  • Learners can define and identify overfitting.
  • Learners can implement split-half cross-validation to evaluate model error.

How do we know that a model is good enough?

In the previous section, we managed to reduce the SSE substantially relative to the linear model by fitting a non-linear one, but the fit is still not perfect. How about trying a more complicated model, with more parameters? Another function that is often used to fit psychometric data is the Weibull cumulative distribution function, named after the great Swedish mathematician and engineer Waloddi Weibull:
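
The exact parameterization varies between sources; one common form of the Weibull function used for psychometric data, with scale \(\alpha\), shape \(k\), guess rate \(\gamma\), and lapse rate \(\lambda\), is:

\[ W(x) = \gamma + (1 - \gamma - \lambda)\left(1 - e^{-(x/\alpha)^{k}}\right) \]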

As you can see, this function has several more parameters to account for features of the data that may interest us.

Let’s see what that might look like:
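
Here is a minimal sketch of such a plot, assuming the parameterization above; the function name weibull, the parameter values, and the axis labels are illustrative, not necessarily those used in the original code:

```python
import numpy as np
import matplotlib.pyplot as plt

def weibull(x, alpha, k, guess, lapse):
    # Weibull psychometric function in the parameterization assumed above:
    # alpha = scale, k = shape, guess = lower asymptote, lapse = lapse rate
    return guess + (1 - guess - lapse) * (1 - np.exp(-(x / alpha) ** k))

# A few illustrative parameter settings, with a guess rate of 0.5 (as in a
# two-alternative forced-choice task) and a small lapse rate
x_plot = np.linspace(0.01, 1, 100)
fig, ax = plt.subplots()
for alpha, k in [(0.3, 1.5), (0.3, 3.0), (0.6, 3.0)]:
    ax.plot(x_plot, weibull(x_plot, alpha, k, 0.5, 0.05),
            label=f"alpha={alpha}, k={k}")
ax.set_xlabel("x")
ax.set_ylabel("Proportion correct")
ax.legend()
plt.show()
```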

Figure: Weibull functions

We go ahead and fit this function as well, using curve_fit:
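
A sketch of what this could look like, assuming x and y are the arrays from the previous section (the stimulus values and the corresponding proportions of correct responses), weibull is the function defined above, and the starting guesses in p0 are purely illustrative:

```python
from scipy.optimize import curve_fit

# Fit the Weibull function to the data; p0 gives rough starting guesses
# for (alpha, k, guess, lapse)
params_weibull, cov_weibull = curve_fit(weibull, x, y, p0=[0.5, 3, 0.5, 0.05])
print(params_weibull)
```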

And examine the results:
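
For example, reusing the names from the sketches above:

```python
fig, ax = plt.subplots()
ax.plot(x, y, marker='o', linestyle='')            # the data
ax.plot(x_plot, weibull(x_plot, *params_weibull))  # the fitted curve
ax.set_xlabel("x")
ax.set_ylabel("Proportion correct")
plt.show()
```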

Figure: Weibull fits

Calculating the error, we find that the SSE is smaller for this model:
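
As in the previous section, the SSE is the sum of the squared residuals; with the names assumed above, this could be computed as:

```python
# Residuals between the data and the model's predictions at the measured x values
residuals_weibull = y - weibull(x, *params_weibull)
SSE_weibull = np.sum(residuals_weibull ** 2)
print(SSE_weibull)
```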

Should we switch over to this model? It seems to be doing better in fitting the data.

Overfitting

One of the reasons we should be suspicious of the more accurate second model is that the Weibull function has more parameters. This is sometimes referred to as the “degrees of freedom” of the model; intuitively, the more parameters a function has, the more freely it can adjust its estimated \(y\) values across the range of \(x\) values. Let’s examine an extreme case of that: fitting a polynomial of degree 6 to these data (7 parameters, including \(\beta_0\)):
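
One way to do this, sticking with the assumed variable names, is with NumPy’s polynomial-fitting routine:

```python
# Fit a polynomial of degree 6 (7 coefficients, including the constant term)
beta_poly = np.polyfit(x, y, 6)
# Model predictions at the measured x values
y_poly = np.polyval(beta_poly, x)
```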

This model has even smaller error on the data:
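
For example:

```python
SSE_poly = np.sum((y - y_poly) ** 2)
print(SSE_poly)
```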

But examining the curves of the fits, you can see that it does some absurd things in order to fit the data:

Figure: Fits of polynomials with degree 6

That is, it bends to match not only the large-scale trends in the data, but also the noise associated with each data point. This match to the noise that characterizes the sample is called “overfitting”.

Overfitting generally becomes worse as a model becomes more flexible, and flexibility can be roughly quantified by the number of parameters in the model. For example, a polynomial of degree 6 will always fit the data at least as accurately as a polynomial of degree 5, even when the extra accuracy comes entirely from fitting the noise in the sample.
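
A quick way to see this on the same (assumed) data is to compare the SSE of the two polynomial fits directly:

```python
# The degree-5 polynomial is a special case of the degree-6 polynomial (with
# the highest-order coefficient set to 0), so the degree-6 fit can never have
# a larger SSE on the data it was fit to
for degree in [5, 6]:
    beta = np.polyfit(x, y, degree)
    print(degree, np.sum((y - np.polyval(beta, x)) ** 2))
```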

Overcoming overfitting – model selection with cross-validation

One strategy to overcome overfitting is to separate the noise in the data used to fit the model from the noise in the data used to evaluate the model. This is called “cross-validation”. We fit the model to one sample, and then evaluate it on another. If we haven’t collected two separate samples, it might still be safe to assume that the noise in each individual observation (in the case of these experiments, each trial) is independent, and we can generate two samples by splitting the data into sub-samples. There are different ways to split up the data, but in this case it makes sense to separately look at odd trials and at even trials, which minimizes serial-order effects (such as how tired the person got).
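
The details of the split depend on how the raw trials are stored. As a rough sketch, assuming hypothetical per-trial arrays trial_x (the stimulus value on each trial) and trial_correct (1 for a correct response, 0 otherwise), a split into odd and even trials could look like:

```python
# Hypothetical per-trial arrays, not necessarily the names used earlier:
# trial_x holds the stimulus value presented on each trial, trial_correct
# holds 1 for a correct response and 0 otherwise
x_trials_1, y_trials_1 = trial_x[0::2], trial_correct[0::2]  # odd trials
x_trials_2, y_trials_2 = trial_x[1::2], trial_correct[1::2]  # even trials

def proportions(x_trials, y_trials):
    # Collapse trial-level responses into a proportion correct per stimulus value
    levels = np.unique(x_trials)
    props = np.array([y_trials[x_trials == level].mean() for level in levels])
    return levels, props

x_1, y_1 = proportions(x_trials_1, y_trials_1)
x_2, y_2 = proportions(x_trials_2, y_trials_2)
```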

We refer to the sample used for fitting the model as the “training set”. Using the training set, we find the parameters of the model:
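
For example, using the first sub-sample, with the names from the sketch above and the same illustrative starting guesses:

```python
# Fit the Weibull function to the training set only
params_1, cov_1 = curve_fit(weibull, x_1, y_1, p0=[0.5, 3, 0.5, 0.05])
```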

The left-out sample is called the “testing set”. We use this sample to calculate the error:
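
Continuing the sketch, with the names assumed above:

```python
# Error on the held-out testing set, using the parameters estimated from
# the training set
SSE_test_2 = np.sum((y_2 - weibull(x_2, *params_1)) ** 2)
```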

In order for this to be cross-validation, we would want to also do it the other way around: use set #2 as the training set and evaluate on set #1 as the testing set:
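
In the same vein:

```python
# Swap the roles of the two sub-samples
params_2, cov_2 = curve_fit(weibull, x_2, y_2, p0=[0.5, 3, 0.5, 0.05])
SSE_test_1 = np.sum((y_1 - weibull(x_1, *params_2)) ** 2)
```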

To evaluate the model overall, we can average the errors across the two splits:
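
Continuing the sketch above:

```python
# The cross-validated error is the average of the two test-set errors
SSE_cv = (SSE_test_1 + SSE_test_2) / 2
print(SSE_cv)
```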

Cross-validation

  1. Write the code to cross-validate the parallel condition (bonus points if you write a function that can do both without changes!)
  2. Evaluate the Weibull model as well. Which model do you think is better?

Next, let’s summarize the points we learned. We’ll also see a list of resources for further learning about the topics we discussed.

Click here for the next section.