
Canadian Strategic Highway Research Program
C-SHRP Bayesian Modelling: A User's Guide
Chapter Four
10 STEP TEMPLATE - MODEL EVALUATION
This chapter deals with Steps 8, 9 and 10 of the template: interpreting the results of Bayesian regression. This includes comparing the performance of the various experts' models, choosing a representative prior, evaluating the model results, and planning further iterations of the model.

4.1 Use Model to Predict Performance - Step 8

4.1.1 Compare Model Performance

In the previous step, a separate Bayesian regression was conducted for each expert by combining that expert's prior with the experimental data. This results in a separate Bayesian model for each expert. In Step 8 the coefficient values and model performance are compared for the experts' prior models and for the data. The purpose of this phase is ultimately to reach a consensus on the experts' priors. There are a number of useful methods for comparing model performance.

Table 4-1 : Comparing Model Performance
Table of Regression Coefficients

Using a table of regression coefficients is the simplest way to compare several regression models. Four experts were interviewed for the C-LTPP rutting model, so a separate prior was developed for each expert. Table 4-2 presents the results for each individual expert's model, including the mean, standard deviation, and t-test value for each regression coefficient. From this table, the similarities and differences between priors can be assessed. For example, it can be seen that Expert #1 does not believe Percent Air Voids has any influence on rutting whatsoever (i.e. a regression coefficient of zero), whereas the other experts consider it to be an important factor. Expert #1 also feels that thickness has a negative influence on rutting (i.e. less rutting with greater overlay thickness) while the other three experts believe the opposite is true. Note that although the results from the regression on the experimental data are included in the table, the purpose at this point is mainly to compare the different priors.

Prediction Sensitivity Line Plot

Another way to compare the various priors is to use a prediction sensitivity diagram. This diagram is based on a 'base case', 'high', and 'low' setting for each independent variable. A sensitivity analysis is conducted by varying one variable at a time between its base, high and low settings while holding all other variables at their base settings. The results can be plotted in a diagram that graphically shows the sensitivity of the different priors. An example created for the rutting model is shown in Figure 4-1. Note how easily conflicting views can be spotted on the diagram. For example, there is disagreement about the sign of the coefficient on thickness. Expert #1 feels that higher air void content will have no effect on rutting, whereas the other three experts feel that it does have an effect. Note also the steep slope of the age lines, which indicates a marked sensitivity of the models to age.
Figure 4-1 : Prediction Sensitivity Diagram
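The one-variable-at-a-time procedure behind a prediction sensitivity diagram can be sketched in a few lines. The coefficients and the low/base/high settings below are hypothetical values chosen for illustration; they are not taken from Table 4-2 or from any expert's prior, and the linear form is an assumption.

```python
# Hypothetical coefficients for one expert's linear rutting model (illustrative only).
coefficients = {"intercept": 2.0, "age": 0.8, "air_voids": 0.5, "thickness": -0.02}

# Hypothetical (low, base, high) settings for each independent variable.
settings = {
    "age": (2.0, 10.0, 20.0),
    "air_voids": (3.0, 5.0, 8.0),
    "thickness": (40.0, 60.0, 100.0),
}


def predict(coefs, values):
    """Evaluate a linear model: intercept plus the sum of coefficient * value."""
    return coefs["intercept"] + sum(coefs[name] * x for name, x in values.items())


def sensitivity(coefs, levels):
    """Vary one variable at a time over (low, base, high); hold the rest at base."""
    base = {name: lv[1] for name, lv in levels.items()}
    table = {}
    for name, lv in levels.items():
        row = []
        for value in lv:
            trial = dict(base)
            trial[name] = value
            row.append(predict(coefs, trial))
        table[name] = tuple(row)
    return table


result = sensitivity(coefficients, settings)
for name, (low, base, high) in result.items():
    print(f"{name:10s} low={low:6.2f}  base={base:6.2f}  high={high:6.2f}")
```

Running the same function once per expert and plotting the low, base and high predictions for each variable gives one set of lines per prior, which is the kind of comparison the diagram is meant to support.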
There may be other diagrams of particular interest to help compare model performance. In this case, a key concern is how the different models forecast rutting over time. A diagram of rutting versus age was plotted to assess the age variable. This is shown in Figure 4-2. More involved comparison diagrams can also be created, such as ones that compare the variance of the coefficients in addition to the mean.

4.1.2 Selecting a Representative Prior

Ideally, the comparison process described in the previous section will result in a natural consensus of the experts. In practice, however, this is seldom the case. There are often significant inconsistencies among experts that need to be discussed with the experts in order to be resolved. A group discussion is often useful once all of the experts have developed their initial priors, and inconsistencies among experts should be discussed in that session. In certain cases, experts whose views are in conflict with the majority may have a low degree of confidence in their assessment and be receptive to the views of others. In other cases, experts who disagree with the group may point to past experience that confirms their prior. This may result in the group changing its opinion.

Ultimately the goal is either to choose a single expert whose responses are most representative, or to pool the responses from a number of consistent experts and create a prior from this pool. The analyst should be aware that it is generally considered unacceptable to pool responses containing significant inconsistencies, such as opposing signs or great differences in sensitivity to individual variables. This forces an artificial consensus. Instead, disagreements should be thought through, discussed and resolved to the group's satisfaction. Once a representative prior has been selected, a Bayesian regression should be conducted before proceeding to Step 9.
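Before pooling, a simple screening pass can flag variables on which experts disagree about the sign of a coefficient. The following is a minimal sketch, assuming each prior has been reduced to a set of mean coefficients. The numbers are hypothetical and only mimic the pattern described for the rutting example; the function names are illustrative and are not part of XLBayes.

```python
# Hypothetical prior mean coefficients for four experts (illustrative only).
priors = {
    "Expert 1": {"age": 0.9, "air_voids": 0.0, "thickness": -0.03},
    "Expert 2": {"age": 1.1, "air_voids": 0.6, "thickness": 0.02},
    "Expert 3": {"age": 1.0, "air_voids": 0.5, "thickness": 0.01},
    "Expert 4": {"age": 0.8, "air_voids": 0.7, "thickness": 0.02},
}


def sign(x):
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_conflicts(expert_priors):
    """List variables for which some experts are positive and others negative."""
    variables = next(iter(expert_priors.values())).keys()
    conflicts = {}
    for var in variables:
        signs = {name: sign(coefs[var]) for name, coefs in expert_priors.items()}
        if 1 in signs.values() and -1 in signs.values():
            conflicts[var] = signs
    return conflicts


conflicts = sign_conflicts(priors)
for var, signs in conflicts.items():
    print(f"Opposing signs on {var}: {signs}")
```

A flagged variable is a prompt for group discussion rather than a reason to drop an expert automatically. A zero coefficient, such as the air voids value in this hypothetical set, is not flagged as a sign conflict here and would need a separate check on differences in sensitivity.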
In the case of the rutting example, Expert #3's prior was selected because it was the most representative of the group's overall opinion. Another useful example that describes the process of reaching a representative prior can be found in the joint C-SHRP/agency applications (available on the CD-ROM version of this guide):

Alberta - Predicting Roughness Progression on AC Overlays (pp. 10-12) (Kurlanda 1995).

The purpose of evaluating the model results is to draw conclusions about the Bayesian posterior result. This step emphasizes comparisons between the data, prior and posterior. The comparisons are ultimately used in Step 10, where the need for additional iterations of analysis is assessed.

4.2.1 Data/Prior/Posterior PDF Plots

An important output of XLBayes is the set of PDF (Probability Density Function) plots for each coefficient in the model. These plots graphically compare the distributions of the same coefficient when based on the data alone, the prior alone, or the Bayesian posterior. Table 4-3 shows the results of a Bayesian regression performed using Expert #3's prior and the experimental data. Figure 4-3 shows the PDF plot for the air voids coefficient based on the results presented in the table.

Under the assumptions of both classical linear regression and the Bayesian regression used by C-SHRP, the model coefficients are Student-t distributed. The width of the bell-shaped curve shows the confidence in the estimate of a coefficient: the narrower the curve, the greater the confidence. These plots are used extensively in understanding the results of a Bayesian regression. An introduction to PDF plots is contained in the single variable regression model example of Appendix A. Section 4.2.2, which follows, provides a guide to understanding PDF plots by showing the sensitivity of the posterior coefficients to changes in the certainty (i.e. variance/covariance matrix) of the prior. Additional examples based on the rutting example are provided in Section 4.2.3.

Table 4-3 - Model Coefficient Means
Figure 4-3 - PDF Plot for Air Voids Coefficient
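To make the shape of such a plot concrete, the density of a Student-t distributed coefficient can be evaluated directly from its centre, scale and degrees of freedom. The sketch below uses only the Python standard library. The three (mean, scale, degrees of freedom) triples are hypothetical and are not the values in Table 4-3; treating the reported standard deviation as the scale of the t distribution is an assumption that is close only when the degrees of freedom are large.

```python
import math


def t_pdf(x, mean, scale, dof):
    """Density of a Student-t distribution with the given centre, scale and dof."""
    z = (x - mean) / scale
    log_norm = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_norm - (dof + 1) / 2 * math.log1p(z * z / dof)) / scale


# Hypothetical (mean, scale, degrees of freedom) for one coefficient.
curves = {
    "data": (0.10, 0.30, 250),
    "prior": (0.60, 0.20, 30),
    "posterior": (0.40, 0.15, 280),
}

grid = [i / 10 for i in range(-5, 16)]  # coefficient values from -0.5 to 1.5
for label, (mean, scale, dof) in curves.items():
    densities = [t_pdf(x, mean, scale, dof) for x in grid]
    peak_x = grid[densities.index(max(densities))]
    print(f"{label:9s} peak near {peak_x:4.1f}, peak density {max(densities):5.2f}")
```

Plotting the three density lists against the grid gives the overlaid curves; a smaller scale produces a taller, narrower curve, which is how greater confidence in a coefficient appears on the plot.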
4.2.2 Understanding PDF Plot Results

A key to understanding the results of a Bayesian regression is to understand the sensitivity of the results to the certainty of the coefficients in either the prior or the data. In the following example a sensitivity analysis is performed using two additional Bayesian regression runs on the rutting example. In the first run we arbitrarily double the standard deviation (i.e. quadruple the variance) in the prior variance-covariance matrix. In doing this we simulate the effect of Expert #3 being less certain about the value of the coefficients. This run is named the 'increased uncertainty' run. The results of the increased uncertainty run are presented in Table 4-4.

Table 4-4 : Increased Uncertainty Run
We can see the effect that increased uncertainty in the prior has on the air voids coefficient in the PDF plot from the increased uncertainty run, Figure 4-4. The effect has been to shift the posterior mean away from the prior and towards the data. This is an intuitively reasonable result.

However, the shifts for other variables in our example are less easy to understand. For example, the mean of the posterior coefficient on age (Table 4-4) does not lie between the mean for the prior and the mean for the data. The reason for this is that the Bayesian regression process is multi-dimensional. It is not a simple averaging process. Section 4.2.3 contains a further discussion of this situation and other possible PDF results.

The posterior variance has decreased. The exact reasons for this are quite involved, as can be seen in the equation for the posterior variance, Equation 17 of Appendix B.

Figure 4-4 - PDF for %Air Voids, Increased Uncertainty Run
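The multi-dimensional effect can be reproduced with a deliberately simplified two-coefficient example. The sketch below assumes a conjugate normal update with known residual variance, in which the posterior precision is the sum of the prior and data precisions. This is a textbook simplification and is not claimed to be the exact calculation in Appendix B or in XLBayes; all numbers are hypothetical.

```python
def inv2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def posterior_mean(prior_mean, prior_cov, data_mean, data_cov):
    """Precision-weighted combination of prior and data coefficient estimates."""
    p0 = inv2(prior_cov)
    pd = inv2(data_cov)
    p = [[p0[i][j] + pd[i][j] for j in range(2)] for i in range(2)]
    rhs = [x + y for x, y in zip(matvec(p0, prior_mean), matvec(pd, data_mean))]
    return matvec(inv2(p), rhs)


# Hypothetical means: prior and data disagree on coefficient 1 but agree on coefficient 2.
prior_mean = [2.0, 1.0]
data_mean = [3.0, 1.0]
prior_cov = [[1.0, 0.0], [0.0, 1.0]]   # prior coefficients uncorrelated
data_cov = [[1.0, 0.9], [0.9, 1.0]]    # data estimates strongly correlated

post = posterior_mean(prior_mean, prior_cov, data_mean, data_cov)
print(f"coefficient 1: prior 2.00, data 3.00, posterior {post[0]:.3f}")
print(f"coefficient 2: prior 1.00, data 1.00, posterior {post[1]:.3f}")
```

In this sketch the prior and the data agree exactly on the second coefficient, yet its posterior mean moves away from both. The disagreement on the first coefficient is carried across by the correlation in the data estimates, which is one way a posterior mean can fall outside the range spanned by the prior and the data.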
As a second sensitivity case, consider what happens if we decrease the uncertainty in the prior. For the 'decreased uncertainty' run, the prior variances and covariances were reduced to one-quarter of their original values. The results of this analysis for the air voids coefficient are shown in Figure 4-5. Again we have a very intuitive result, with the posterior coefficient tending towards the prior. The numerical results of the decreased uncertainty case are contained in Table 4-5. Note that most coefficients tend strongly toward the prior estimates in this case.

In this case the variance of the residual for the posterior has increased over the base case. While the equation for the posterior variance (Equation 17 of Appendix B) is very complex, an increase in the residual variance of the posterior is common in cases where the prior and the data are in strong disagreement. It is therefore reasonable that the residual variance of the posterior is the highest in this case. Although the prior and data means have not changed, the difference in opinion between the prior and data is the largest in the decreased uncertainty run because the prior viewpoint is at its most confident.

Figure 4-5 - PDF for %Air Voids, Decreased Uncertainty Run
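The direction of these shifts can be illustrated for a single coefficient using a precision-weighted average, in which each source is weighted by the inverse of its variance. This is a one-dimensional simplification with hypothetical numbers, assuming a known residual variance. It shows only the movement of the posterior mean and does not reproduce the residual variance behaviour described above, which depends on the full posterior equations.

```python
def posterior(prior_mean, prior_var, data_mean, data_var):
    """Combine prior and data estimates of one coefficient by precision weighting."""
    w_prior = 1.0 / prior_var
    w_data = 1.0 / data_var
    var = 1.0 / (w_prior + w_data)
    mean = var * (w_prior * prior_mean + w_data * data_mean)
    return mean, var


# Hypothetical estimates of a single coefficient.
prior_mean, prior_var = 0.80, 0.04
data_mean, data_var = 0.20, 0.04

# Scale the prior variance: x4 (doubled standard deviation), x1 (base), x0.25.
runs = {}
for label, factor in (("increased uncertainty", 4.0),
                      ("base", 1.0),
                      ("decreased uncertainty", 0.25)):
    runs[label] = posterior(prior_mean, prior_var * factor, data_mean, data_var)
    print(f"{label:22s} posterior mean = {runs[label][0]:.2f}")
```

With the prior variance quadrupled the posterior mean sits closer to the data estimate, and with the prior variance quartered it sits closer to the prior estimate, matching the pattern seen for the air voids coefficient.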
4.2.3 PDF Plot Results for the Rutting Example

As mentioned in the previous section, some PDF results are more difficult to explain than the simple case where the posterior mean lies between the prior and the data. In this section three examples from the rutting model are discussed. The discussion is intended as an aid for readers in interpreting their own model results.

Figure 4-6 - PDF Plot for Percent Crushed Aggregate

The first example deals with the PDF plot for the crushed aggregate coefficient, Figure 4-6. Bayesian regression has a pronounced effect on the outcome of this coefficient. If classical regression were used, the mean value of the coefficient would be near zero, and the t-test results on the variable would show that it is not statistically significant. This is evident because the PDF plot for the data-based coefficient straddles the zero axis, indicating that the coefficient is not significantly different from zero. In contrast, the prior definitively shows that increasing the amount of crushed aggregate in the asphalt concrete should decrease the amount of rutting. The Bayesian posterior reflects this view and results in a markedly different value for the coefficient than does classical regression.

The reason for this outcome may be that the data for the rutting model are 'premature'. Even though there are some 250 data points in the database, all of these data points are for overlays in the early stages of their life (1-4 years). However, the prior is based on a wide range of ages (2-20 years). In this case, Bayesian regression was used in part to compensate for the lack of data on older pavements.

Figure 4-7 - PDF Plot for Traffic Coefficient (Log(Traf))

The second example shows that the experts and the data reflect very different opinions with respect to the influence of traffic on rutting (Figure 4-7). The prior reflects a good deal of conviction that increased traffic will lead to increased rutting (i.e. a very high t-statistic).
The data shows this to a much lesser degree. There are a number of potential explanations for this. One possibility is that overlays and the pavement structure beneath them tend to be designed for the traffic situations they face. One can expect that the designers of overlays used in high traffic situations will control design variables, including variables which may not be included in the regression model, such that overlays in high traffic areas will be more rutting resistant than those in low traffic areas. Thus rutting may have been, to some degree at least, designed out of the experimental database.

In contrast, when the experts created their priors using the full matrix orthogonal method, they dealt with artificial design scenarios. Thus the potential effects of traffic were not masked by control of other variables. It may be the case that many of the overlays described in the orthogonal matrix, such as poorly rut-resistant designs in high traffic areas, would simply never be built. A model that forecasts this situation well may in fact be irrelevant. This again calls attention to the fact that one must be very aware of how the model is to be used when it is designed.

Another possible difficulty with the traffic variable in this case is a poor functional form. Intuitively, a rutting model should address cumulative effects of some sort. The simple linear form used is only crudely capable of addressing cumulative effects. The simple form is a particular issue because of the large age difference between the data and the prior. The average age of pavements in the experimental database was two years, whereas in the prior it is about ten years. Since one would only expect, say, four to five millimetres of rutting at an age of two years, the coefficient on annual traffic has to be very small at this age. On the other hand, a larger coefficient may work well in an equation designed for older overlays. This problem could be investigated further by normalizing the dependent variable to age.
For example, average yearly rutting progression (i.e. rut depth divided by age) could be used as the dependent variable and age could be removed as an independent variable.

Figure 4-8 - PDF Plot for Age Coefficient

The final example, Figure 4-8, shows the Bayesian posterior lying to the right of a strongly overlapping prior and data result. This counterintuitive result is likely a symptom of other strong disagreements between the prior and the data result. It is possible because the Bayesian regression process is multi-dimensional and not a simple averaging process.

The net conclusion for the rutting model is that there are some fairly serious disagreements between the prior and the data. This may or may not be due to actual disagreement between the experts and the experimental results. As discussed previously, the reasons for this may involve the functional form and the differences in inference space between the data and the prior.

While this may seem like a dismal result, one should keep in mind that the Bayesian regression process has been highly valuable. The information contained in the prior was a key factor in understanding the shortcomings of this particular rutting model. The classical regression model on its own, because it is based on only four years' worth of data, has very limited predictive power. In this case none of the classical regression model, the prior, or the posterior model seems to be conclusive. Another iteration of the process with an improved functional form and a more clearly defined inference space is probably warranted.

To facilitate the process of evaluating a Bayesian regression result, a checklist has been created as part of the 10 Step Template. While the template obviously cannot address all the particular issues that may arise in any given analysis, it is a good starting point. The five checks in Table 4-6 can be applied to each coefficient.

Table 4-6 - Coefficient Checklist
Generally the sign of each coefficient in a regression model should be clear and intuitively reasonable. The first basic test of coefficient results is therefore a check for a correct sign. If the data, prior or posterior has an incorrect sign, the reason for this should be determined. One indicator of the seriousness of the problem is the Student's t value, the ratio of the coefficient's mean to its standard deviation. If the t value is low, less than say 1 or 2, the incorrect sign may be less of a concern as the coefficient is not statistically significant. In this case, it may be reasonable to eliminate the variable and re-run the analysis. An incorrect sign on a coefficient in the experimental data model combined with a correct sign in the prior and posterior models may be an entirely reasonable result for a model based on a limited amount of experimental data. This is in fact the expected result if Bayesian regression is being used to overcome a small sample size problem.

The magnitude of the coefficient in the prior is usually the basis for a test of a rational coefficient magnitude in either the posterior or the data-derived coefficient. A large discrepancy such as the one which occurred with the traffic coefficient for the rutting model (Figure 4-7) would be an obvious cause of concern. Smaller discrepancies may also be a cause for concern in certain situations. A comparison and sensitivity analysis of dependent variable predictions is useful to investigate whether the magnitude of a coefficient is rational.

The t-test is used to determine whether a regression coefficient is significantly different from zero (although it is possible to test for other values as well, zero is the most common). The t-value for a regression coefficient is calculated by dividing the mean of the regression coefficient by its standard deviation:

$$ t = \frac{\bar{b}_n}{s_{b_n}} $$    (Equation 4.1)

where $\bar{b}_n$ is the mean of regression coefficient bn and $s_{b_n}$ is its standard deviation. The null hypothesis in this test is H0 : bn = 0, which is tested against the alternative hypothesis H1 : bn ≠ 0.
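The t-value calculation and the two-sided test at the 5% level can be sketched as follows. The coefficient means and standard deviations are hypothetical. The critical values are the standard two-sided 5% entries from a t-table (2.571 for 5 degrees of freedom, 2.306 for 8, and 1.96 when the degrees of freedom are very large); the function names are illustrative only.

```python
# Two-sided 5% critical values from a standard t-table, keyed by degrees of freedom.
CRITICAL_T_5PCT = {5: 2.571, 8: 2.306}
LARGE_SAMPLE_CRITICAL = 1.96  # normal approximation for very large degrees of freedom


def t_value(mean, std_dev):
    """t-value of a coefficient: its mean divided by its standard deviation."""
    return mean / std_dev


def critical_t(dof=None):
    """Return the 5% critical value; dof=None means 'very large'."""
    if dof is None:
        return LARGE_SAMPLE_CRITICAL
    if dof not in CRITICAL_T_5PCT:
        raise ValueError("look up the critical value for this dof in a t-table")
    return CRITICAL_T_5PCT[dof]


def is_significant(mean, std_dev, dof=None):
    """Reject H0: bn = 0 when |t| exceeds the critical value."""
    return abs(t_value(mean, std_dev)) > critical_t(dof)


# Hypothetical coefficient with mean 0.50 and standard deviation 0.22 (t is about 2.27).
print(f"t = {t_value(0.50, 0.22):.2f}")
print("significant with very large dof:", is_significant(0.50, 0.22))
print("significant with 8 dof:         ", is_significant(0.50, 0.22, dof=8))
```

The same t-value can therefore pass the test with a large sample and fail it with a small one, which is why the degrees of freedom should be checked before applying the rule of thumb of 2.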
At the 5% level of significance, where the number of degrees of freedom is very large (i.e. the t distribution is approximately the same as the normal distribution), the critical value of t is +/- 1.96, or approximately +/- 2. If t is greater than 2 or less than -2, we reject the null hypothesis and accept that the estimate of bn is statistically significant. We can be confident that the variable in question has an actual influence on the model result. The larger the absolute value of t, the more confident we are about the coefficient's value and its significance. If t is between 2 and -2, we cannot reject the null hypothesis, and the estimate of bn is not statistically significant. We are not confident that the variable in question actually influences the model result. The value calculated for the coefficient may differ from zero only due to chance.

Where the number of degrees of freedom is not large, particularly below 15, the critical value of t is greater than 2 because of the heavier tails of the t distribution. Where there are 8 degrees of freedom, the critical value of t is +/- 2.3, and with only 5 degrees of freedom the critical value is +/- 2.57. These critical values are available from tables in most statistics textbooks.

If the regression coefficients in the prior and posterior are not statistically significant, it may be useful to re-run the analysis after excluding the variable. If the standard error term of the model does not increase significantly, the excluded variable may not be a statistically significant contributory variable.

The ideal result is for the data and prior to reinforce each other, resulting in a posterior coefficient that has a smaller standard error than either one individually. This is not always the case, however, and the posterior may in fact have a larger standard error. Irrespective of how much the variance has changed, it is desirable that the coefficients in the posterior model all be statistically significant.

4.2.4.4 Which Information Does the Posterior Reflect?
The posterior coefficients may tend toward either the prior or the data. It is useful to evaluate each coefficient to determine on which view the posterior relies more heavily. This is useful in understanding the differences between the prior and the data.

4.2.4.5 Standard Error of the Posterior Model

The statistical performance of a classical regression model is typically measured by evaluating Se, r², the F-statistic and the t-statistic for each regression coefficient. In Bayesian regression only Se and the t-statistic can be evaluated. Neither r² nor the F-statistic can be calculated because they rely on experimental data, which does not exist for the posterior result.

Standard Error of the Residual - Se

The standard error of the residual, Se, is a basic measure of regression model performance. The standard error (or standard deviation) of the residual is simply the square root of the residual variance, Se². The lower Se, the closer the predictions made by the model are to actual observations of the dependent variable, and therefore the better the model. Under the assumptions of regression, the residual has a mean of zero and is normally distributed. Thus the confidence interval for forecasts made by the model can be calculated using a table of areas under the normal curve, which is contained in most statistics texts. For example, the 95% confidence interval for a forecast corresponds to the mean forecast plus or minus 1.96 times the standard deviation of the residual.

Figure 4-9 - Confidence Interval for Model Prediction
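The interval calculation can be written out directly. The sketch below takes a residual variance and a mean forecast and returns the 95% interval, using the residual variance of 10.95 reported in this chapter for the rutting posterior as input. The mean forecast of 8.0 mm is a hypothetical value, and the calculation follows the simple rule stated above, so it reflects residual scatter only and not uncertainty in the coefficients themselves.

```python
import math


def forecast_interval(mean_forecast, residual_variance, z=1.96):
    """Return (low, high, se) for a forecast, using the normal multiplier z."""
    se = math.sqrt(residual_variance)  # standard error of the residual, Se
    half_width = z * se
    return mean_forecast - half_width, mean_forecast + half_width, se


# Residual variance from the rutting posterior; the 8.0 mm forecast is hypothetical.
low, high, se = forecast_interval(8.0, 10.95)
print(f"Se = {se:.2f} mm")
print(f"95% interval: {low:.2f} mm to {high:.2f} mm")
```

Whether an interval of this width is acceptable depends on the decision the forecast supports, which is the test of Se applied in this step.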
Assuming all the model coefficients seem reasonable, a final issue in evaluating the model is the standard error of forecasts made with the posterior. The general test for Se is whether the confidence interval on the model forecasts is narrow enough for the purpose for which the model is intended. In the rutting example, the residual variance for the posterior is 10.95. This corresponds to an Se of 3.3 mm. The 95% confidence interval on predictions made with the model would therefore be approximately plus or minus 6.5 mm (1.96 times 3.3 mm). The confidence intervals for predictions may be determined using the prediction feature of XLBayes. Provided the confidence interval on the posterior is acceptable, the modelling process may be complete.

The following selected additional examples on model evaluation are taken from the joint C-SHRP/agency application reports (available on the CD-ROM version of this user's guide):

Alberta - Roughness Progression of AC Overlays (pp. 12-15) (Kurlanda 1995).
New Brunswick - Rutting in Alternative AC Overlay Methods (pp. 28-31) (Jackart 1995).

At this stage in the process, other model runs can be performed to try to improve model results. Major possibilities include adding or removing independent variables, modifying the functional form and encoding additional experts. In the longer term, additional experimental data can be collected. This additional data can be used to update the regression model. The experimental data and the posterior should become more and more definitive as more data is collected. Ultimately, with sufficient amounts of experimental data, the data alone will be significant and definitive, thus eliminating the need for prior models and Bayesian regression.

A useful example that details multiple iterations of a regression model can be found in the following report from the joint C-SHRP/agency applications (available on the CD-ROM version of this user's guide):

Manitoba - Benkelman Beam Rebound in AC Overlays (pp. 8-12) (Kavanagh 1995).

References

English, J., Predicting the Compressive Strength of High-Performance Silica Fume Concrete by Bayesian Methods, Joint C-SHRP/Newfoundland Bayesian Application, Canadian Strategic Highway Research Program, Transportation Association of Canada, Ottawa, 1995.

Jackart, M., MacPherson-Munn, P., Callaghan, L., Prediction of Rutting in Alternative Asphalt Concrete Overlay Methods, Joint C-SHRP/New Brunswick Bayesian Application, Canadian Strategic Highway Research Program, Transportation Association of Canada, Ottawa, 1995.

Kavanagh, L., Benkelman Beam Rebound - AC Overlay Model, Joint C-SHRP/Manitoba Bayesian Application, Canadian Strategic Highway Research Program, Transportation Association of Canada, Ottawa, 1995.

Kurlanda, Marian H., Kajner, L., Predicting Roughness Progression of Asphalt Overlays, Joint C-SHRP/Alberta Bayesian Application, Canadian Strategic Highway Research Program, Transportation Association of Canada/Alberta Transportation and Utilities, Ottawa, 1995.

Vemax Management Inc., C-LTPP Bayesian Analysis Project - Consolidated Working File, Canadian Strategic Highway Research Program, Transportation Association of Canada, Ottawa, 1994.