up SearchFeedback[help] CPMCnet

Model Selection


The following describes in some detail the statistical approach to selection of model-free parameters outlined in our paper Mandel et al. (1995) J. Mol. Biol. 246, 144-163.
It may be helpful for you to have this paper at hand as you go through what follows. Also, please refer to the ModelFree man pages.

If you have determined the 3 relaxation parameters R1, R2 and NOE, then you can fit no more than 3 model-free parameters (in addition to the overall rotational diffusion time constant, tm) to these data. There are 5 possible sets of model-free parameters that can be fit to 3 experimental data points (the "binary" short-hand notation given within parentheses below is also used by the ModelFree program as input to the -s flag):

  1. (00100) S2
  2. (00110) S2, te
  3. (00101) S2, Rex
  4. (00111) S2, te, Rex
  5. (01110) Sf2, S2, te
We want to determine which of these sets best describe the experimental data for each individual residue of our protein. The model selection is carried out in the following step-by-step fashion. See the flowchart in Figure 9 of Mandel et al., 1995.
  1. First, each of the 5 different models is fit to the experimental data, while keeping tm fixed at the value obtained from the R2/R1 ratio. For each model, 500 (although you can choose any number you like; a larger number gives more reliable statistics, but takes longer time to run) randomly distributed synthetic data sets are created using Monte Carlo simulations. In these first 5 runs, the same set of model-free parameters are fitted to the simulated data as to the experimental data. These 5 runs correspond to the 5 square boxes of the flowchart.

    The output file from ModelFree (specified by the -o flag) contains the SSE (sum squared error; called gamma in Mandel et al., 1995) of the actual fit, as well as the SSE(0.95), i.e. the 95 percentile SSE from the fits against the synthetic data (the X percentile corresponds to the (1-X/100) critical value). If you choose to evaluate the goodness of fit by comparing with the simulated SSE of some other (than the 95) percentile, then you can obtain this SSE(X/100) from the simulation output file (specified by the -e flag), which contains a row number, the SSE and optimized model-free parameters (one line per individual synthetic data set). Simply sort the simulation output file after increasing SSE, and calculate which line number in the sorted file corresponds to the X percentile. E.g. if you have simulated 300 data sets, then the 90 percentile corresponds to line 270; read off the SSE(90) from that line. Here are two scripts that will do this for you:

    sort_ext
    get_percentile

    In summary, for these ModelFree runs the command line flags that we need to specify are:

    -i (input file name)
    -o (output file name)
    -e (extension to the simulation output file names)

  2. Next, 2 ModelFree runs are carried out, both using model 1 (00100) to create the synthetic data sets, while one uses model 2 (00110) and the other model 3 (00101) for fitting to the synthetic data sets. Here, the (00100) setting is done in the input file (as usual), whereas the (00110) and (00101) settings are specified on the command line after the -s flag. These 2 runs are not explicitly included in the flowchart, but the results are used to evaluate the F-tests of the second level branch points.

    One important point here is that you need to use the same random number generator seed for these runs as you used (or: as the ModelFree program picked if you had set "seed" = 0 in the input file) for the model 1 run under step 1 above. This is because we want the distributions of the synthetic data sets to be identical in the two cases.

    For these ModelFree runs the command line flags that we need to specify are:

    -i (input file name)
    -o (output file name)
    -e (extension to the simulation output file names)
    -s (a string xxxxx in which x = 0 or 1)

  3. Now we have generated all the simulated data we need in order to test all of the branch-point conditions of the flowchart.

    First we test whether model 1 (00100) represents our experimental data sufficiently well: Is SSE < SSE(X/100)? (this is the first branch point of the flowchart ). If this condition is true, then we accept model 1 for this residue without further testing if any of the other models are "even better". This is a consequence of "Occam's razor", which loosely can be stated: "if a given model explains the experimental data sufficiently well, then there is no reason to apply models that include a larger number of parameters".

    Next, we check whether SSE < SSE(X/100) for models 2 (00110) and 3 (00101). We also need to calculate the F-statistic in order to assess whether an improvement in fit really is significant, or merely arises because of the random statistical reduction in SSE that follows upon incorporation of additional parameters. The F-statistic is defined as:

    F = p2 * ( SSE_1 - SSE_2 ) / ( SSE_2 * ( p1-p2 ) ),

    where SSE_1 and SSE_2 are the SSEs for models with p1 and p2 degrees of freedom (p1 > p2). The F-statistic calculated from the SSEs of the fits to the experimental data should be compared with the (1-X/100) critical value of the F-statistic calculated from the SSEs of the fits to the simulated data. More precisely, the F(1-X/100) is calculated by taking the SSE_1 value obtained from step 1 (model 1), and the SSE_2 value obtained from step 2, as described above. Thus, we evaluate: F > F(1-X/100)?
    In stead of calculating the critical value of the F-statistic from the simulated data, you could look this up in the appropriate table in your favorite statistics text book. Tabulated values have been calculated assuming that the data have a perfectly normal distribution, which is not a priori true for our relaxation data; however, as noted in Mandel et al. (1995), we have found that this is most often the case. We think that it is good practice to actually calculate the F-statistic from the simulated data.

    The rest of the selection scheme should be obvious from the flowchart and the discussion under Materials & Methods of Mandel et al. (1995). Once you have determined which model best fits each residue, you should carry out one single ModelFree run that simultaneously fits the global tm and the appropriate model-free parameters for each residue.


    Hierarchical links

    Over to the ModelFree page

    Over to the CurveFit page

    Over to the relax_scripts page

    Palmer Group Home Page


    Updated 9/23/97 by Arthur G. Palmer (agp6@columbia.edu)