
Canadian Strategic Highway Research
Program
C-SHRP Bayesian Modelling:
A User's Guide
Chapter Two

10 STEP TEMPLATE - MODEL DESIGN

The 10 Step Template for Bayesian regression was developed by C-SHRP during the 1990-1994 pavement performance modelling project conducted for the C-LTPP experiment. The methods and issues raised in each of the 10 modelling steps are based on C-SHRP's experience with this project. The objective of the template is to provide the user with a step-by-step methodology for performing Bayesian regression using BSTAT or XLBayes. The template is intended as a guide only, and variations on the approach may be more appropriate for particular applications.

The template comprises 10 steps, as shown in Table 2-1. It is presented in this guide as Chapters 2, 3 and 4.

Table 2-1: 10 Step Bayesian Modelling Template
This user's guide uses the C-LTPP rutting model as an example throughout each of the 10 steps. The model was developed for the C-LTPP modelling project using Bayesian regression and data obtained from the C-LTPP database (C-SHRP 1996).

For those readers unfamiliar with highway pavement design, 'rutting' refers to the concave wheel tracks (i.e. ruts) that develop as vehicles repeatedly travel on the highway. Rutting is a problem in older highways because deep ruts allow water to collect, which can create a safety hazard for motorists. The C-LTPP experiment monitors the performance of 65 highway test sections consisting of asphaltic concrete (AC) overlays over AC pavements. The C-LTPP database contains time series data on performance indicators for overlays. For example, rut depth would be measured on an annual basis for each of the test sections.

2.2 Decide What You Want to Model - Step 1

The first step in building a Bayesian regression model is to clearly define what to model, the context in which the model will be applied, and the prospective inference space of the model. Addressing these issues up front by asking the 'difficult questions' will help focus the model development process and go a long way towards ensuring the development of a successful model. Some general questions to consider are:
The first question is a practical issue to consider at the outset of the modelling process. If the required experimental data does not exist, or cannot be obtained, it may not be practical to proceed further. The same is true for the second question: without the benefit of prior information, it would not be advisable to proceed with a Bayesian regression.

The last question addresses the importance of deciding beforehand how the model will be used in practice. Specifically, what predictions will the model be used to make? What will these predictions be used for? There is a good chance that asking this question will bring to light conflicting and possibly unobtainable objectives. These conflicts need to be reconciled in the planning stages.

The output of Step 1 of the template should be summarized in a written statement that describes the objectives of the model and any assumptions regarding the inference space. This statement will be used in the encoding package described in Section 3.4.1.

The following section, which discusses deciding what to model, is taken from the Developing the Bayesian Rutting Model Working Paper (Sparks et al., September 1993).
Three additional examples on defining what to model can be found in the Joint C-SHRP/Agency Bayesian Application reports (available on the CD-ROM version of this guide):

Saskatchewan - Subgrade Shear Failures (pp. 1-2) (Widger 1995)
Quebec - Pavement Performance in Frost Conditions (pp. 1-5) (Doré 1995)
Ontario - Deterioration of AC Surfaces Containing Steel Slag (pp. 1-6) (Afrani 1995)

2.3 Select a Dependent Variable - Step 2

The second step in performing a Bayesian regression is to identify a dependent variable, Y, that best meets the model objectives described in Step 1. Potential dependent variables are usually identified once Step 1 is complete. Prospective dependent variables can also be identified through a literature search or through interviews with available experts. Candidate variables should be evaluated according to:
Other considerations are whether to use absolute or delta measurements, and whether to pre-process or otherwise transform raw measurements.

Availability

The first consideration, the availability of experimental data, can be addressed by reconciling dependent variable data requirements with databases available in-house or through other agencies. If the required data is not available and cannot be obtained in the future, a different dependent variable needs to be considered.

Inference Space

The inference space of the available data is also a consideration in choosing the dependent variable. The basic question in the C-LTPP rutting example is whether there is data available for rutting on the types of overlays we wish to model. The upper and lower bounds of the independent variables should encompass the types of situations we would like to forecast. These bounds represent limitations on the operating range of the model. If the range does not extend far enough in either direction, the analyst may try to bridge the gap with additional data or with expert judgment.

One of the reasons for using Bayesian regression is to build a model applicable outside the inference space of available data. In the rutting model example, the C-LTPP experimental data contained rut depths ranging from 0 to 10 mm. However, a model relevant to a wider range of rut depths was desired. The Bayesian prior was therefore developed based on situations that would result in rut depths of up to 50 mm.

Other inference space considerations are the assumptions and conditions on the model. For the rutting model, a key assumption made was that rutting would only occur in the overlay layer. A common inference space condition for pavements is a geographic one. Temperature, moisture and construction materials can significantly affect performance and vary from region to region.
If these factors are not included as independent variables, the operating range of the model will be limited to the same geographic region as the data.

Time may also be a factor in the inference space. Equipment, methods and materials change over time. A historical database extending back 20 years may or may not be relevant today if design and construction methods are significantly different.

Measurement

The next consideration is the way in which the dependent variable is measured. It is critical that the dependent variable be well defined. A problem faced by C-SHRP in the rutting example was that rut depth can be measured in several different ways on a road section. A basic consideration is whether the rut depth measurements represent the worst spot on the highway test section or the average for that section. If worst spot is used, the finished model will be appropriate for making extreme or worst case predictions only. Conversely, if average measurements are used, the finished model will be best suited for predicting the central tendency of rutting. Related issues include the number of rut depths measured to calculate the average, whether the left or right wheelpath is used, and which lane to measure rutting in. The length of straight edge used to measure the rut depth is also important, since a different depth will be measured depending on straight edge length.

Absolute vs. Delta

Another issue to contend with when selecting a dependent variable is whether the model will be used to make absolute or incremental (delta) predictions. This is important in pavement performance models because predictions are almost always made as a function of time. When making absolute predictions over time, age must be included as an independent variable. A delta model may or may not use an explicit age variable since predictions are made relative to the previous time period. Figure 2-1 shows the difference between absolute and delta predictions.
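The distinction between absolute and delta measurements can be sketched in a few lines of Python. This is a minimal illustration only: the rut depth values, the variable names and the helper `to_delta` are hypothetical and are not taken from the C-LTPP database, BSTAT or XLBayes.

```python
# Hypothetical annual rut depth readings (mm) for one test section.
# Illustrative values only, not C-LTPP data.
ages_years = [1, 2, 3, 4, 5]
rut_depth_mm = [1.0, 2.5, 3.5, 5.0, 6.0]


def to_delta(values):
    """Return the change between each pair of consecutive measurements."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


delta_rut_mm = to_delta(rut_depth_mm)

# Absolute observations pair each age with the total rut depth at that age.
absolute_obs = list(zip(ages_years, rut_depth_mm))

# Delta observations pair the age at the end of each period with the
# increase in rut depth over that period; the first reading has no delta.
delta_obs = list(zip(ages_years[1:], delta_rut_mm))

print("absolute:", absolute_obs)
print("delta:   ", delta_obs)
```

Note that a delta series has one fewer observation than the absolute series it is derived from, and that summing the deltas and adding the first reading recovers the last absolute reading.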
A final issue to consider when selecting a dependent variable is whether raw experimental data will be used or whether it will be processed, averaged or otherwise manipulated first. In the delta model, for example, the raw dependent variable data (measured in absolute units) must be processed by calculating the change between related time periods.

To complete Step 2 of the template, it is useful to prepare a written summary that describes the selected dependent variable. This information will be used in the expert judgment encoding package. The definition used for the dependent variable in the C-LTPP rutting model was:
Two other practical examples dealing with Step 2 of the template can be found in the following reports (available on the CD-ROM version of the user's guide):

Saskatchewan - Subgrade Shear Failures (p. 11) (Widger 1995)
Ontario - Deterioration of AC Surfaces Containing Steel Slag (pp. 1-6) (Afrani 1995)

2.4 Select Model Type - Step 3

This step presents a choice between classical and Bayesian regression and between mechanistic and empirical model forms. In the current context it is assumed that Bayesian regression has been selected. The choice between mechanistic and empirical models depends on the availability of a theoretical equation or formula (i.e. mechanism) to describe the performance of the dependent variable and the availability of data to fill in the variables in this equation.

The following example illustrates the difference between an empirical model form and a mechanistic one. A basic empirical model form for the rutting model is a simple linear equation such as Equation 2.1.
Equation 2.1
Empirical model forms may also be complex. For example, it may be found through trial and error, or judged from experience that rut depth is more closely correlated with the square root of overlay thickness and the Log10 of age. If this were the case, the following empirical form could be used:
Equation 2.2
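As a sketch only, a form matching the verbal description above (square root of overlay thickness, log10 of age) could be written as follows. The coefficient symbols b_0, b_1 and b_2, and the inclusion of an intercept term, are assumptions made here for illustration and may differ from the guide's own notation.

```latex
\text{Rut Depth} = b_0 + b_1 \sqrt{\text{Overlay Thickness}} + b_2 \log_{10}(\text{Age})
```

Although the independent variables are transformed, the expression remains linear in the coefficients, which is what allows them to be estimated by regression.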
In both equations, the forms are empirical because the independent variables are derived from observations and have no particular basis in theory. An example empirical-mechanistic model is shown in Equation 2.3. The model is one that has been proposed for alligator-type cracking (another pavement distress) by Lytton (1989).
Equation 2.3
The cluster term variables in Equation 2.3 were not determined empirically. Rather, they were derived from theory which explains pavement fatigue in a road structure. This type of equation is termed 'empirical-mechanistic' because the coefficients bi are determined from data using regression, as opposed to being based on theory.

The advantage of an empirical-mechanistic model is that it can address complex interactions between independent variables, such as indicated by cluster term x2 in Equation 2.3. A disadvantage of these models is that the cluster terms may require independent variables which are not available in the database. Another disadvantage is that theory to derive these models is often not available.

2.5 Select Independent Variables - Step 4

The next step in the template is to select independent variables that are strong indicators of the behavior of the selected dependent variable. The selection process has two parts:
2.5.1 Enumerating Independent Variables

The process of selecting the independent variables begins with enumerating prospective variables thought to influence the selected dependent variable. Prospective variables can be identified by interviewing experts, doing a literature search or analyzing available data.

Initially, the best approach is to ask experts for their input. There are several reasons to bring experts in at this early stage. If they are unable to provide a list of variables directly they will, at the very least, be able to provide guidance about other information sources. Experts will also provide practical guidance, such as listing only independent variables for which they know data exists. Further, they will provide theoretical guidance, such as describing different theoretical causes for a phenomenon and the different contributory variables involved in each case. An example of the latter in pavement performance modelling could be the different causes of rutting, such as structural rutting versus instability of the asphaltic concrete. The independent variables involved are quite different depending on which phenomenon is being modelled. Lastly, it is important to have experts involved in the development process early on, especially if they are candidates for the expert judgment encoding process and for using and/or endorsing the model.

In the case of the C-LTPP rutting model, a total of 25 variables were identified as having some influence on rutting performance. Often, only a small number of the candidate independent variables identified in the enumeration phase can actually be used to build the model. A rule of thumb is that the number of independent variables should not exceed about 4 to 6 if the prior is to be based on expert judgment. The primary reason for this is that the level of effort required to encode expert judgment using the orthogonal method (see Section 3.3) increases exponentially as the number of independent variables increases.
It may be very difficult to encode on too many independent variables. Thus the next step is to pare down the list of prospective independent variables to a manageable number.

2.5.2 Evaluating Independent Variables Subjectively

The most common method of selecting independent variables from the pool of candidates is to have experts rank them and/or to eliminate highly correlated variables until the number of remaining variables is sufficiently small. Another method is to base selection on a literature search.

Ranking

Using experts to rank candidate variables is an effective way of selecting independent variables. This process is philosophically consistent with expert judgment based priors and Bayesian regression. The process works by having each expert rank candidate variables in order of contributory significance. The experts are also asked to provide feedback on whether they feel the variables can be readily measured. This allows an expert to state their preference for a particular variable and gives the modelling team an idea of the level of effort required to obtain the necessary experimental data. The ranked responses from one or more experts are aggregated and analyzed in order to determine the consensus rank of the eligible independent variables.

A good example of this is the systematic approach developed for the joint C-SHRP/Alberta Bayesian Application (Kurlanda and Kajner, 1995). Several experts were given a list of 15 variables and asked to rank each variable on a scale of 0 (remove) to 100 (keep). The experts' responses for each variable were averaged and the variables with the highest average ranks were selected for analysis.

Literature Search

The approach used for the C-LTPP rutting model involved a literature search for rutting models developed by other researchers. Many papers were reviewed and a count was kept of the number of times certain variables appeared. The following conclusions were reached:
2.5.3 Evaluating Independent Variables Using Correlation

Another approach to evaluating independent variables is purely empirical. In this approach, the experimental data is used to evaluate the correlation of the independent variables with the dependent variable. Two methods can be used to do this: evaluating correlation coefficients or using scatter diagrams.

Correlation Coefficient

The correlation coefficient is a non-dimensional (i.e. unitless) measure of the strength of the correlation between two variables. Unlike covariance, the correlation coefficients for different pairs of variables can be readily compared, with a coefficient of larger absolute value indicating a stronger correlation. Variables to be used as independent variables in the regression equation should have a strong correlation with the dependent variable. Equation 2.4 is used to calculate the correlation coefficient. Most textbooks on statistics cover correlation theory in depth.
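The correlation coefficient described here is commonly computed as the Pearson product-moment coefficient. In a standard textbook notation (assumed here, and possibly different from the guide's own symbols), for n paired observations (x_i, y_i) with sample means \bar{x} and \bar{y}:

```latex
r_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is proportional to the covariance of X and Y. Dividing by the two root sums of squares removes the units, which is why coefficients for different pairs of variables can be compared directly.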
The correlation coefficient always has a value between -1 and 1. A coefficient value near 0 indicates no correlation. The closer the coefficient is to 1 or -1, the greater the strength of the relationship. For the relationship between prospective independent variables and the dependent variable, strong positive or negative relationships are equally desirable. Correlation coefficients are compared in relative terms with each other. Variables having the highest correlation with the dependent variable should be selected.

The Excel software program used to run XLBayes and B-STAT contains a correlation analysis tool. This tool can be used to produce a matrix of correlation coefficients, showing the correlation between all possible pairs of variables in the data set. The reader should consult either the Excel manual or the 'Help' facility for further information on using the correlation analysis tool.

For the C-LTPP rutting model, the correlation coefficient matrix for 18 prospective independent variables was evaluated. Although the selection of independent variables was primarily based on a literature search, the correlation matrix was also a consideration. To illustrate the use of correlation matrices, the matrix for the six independent variables and the rut depth dependent variable used in the C-LTPP rutting model is presented in Table 2-2. The matrix is symmetrical because the correlation coefficient is always the same for two variables, irrespective of the order of the variables in calculating the coefficient. The matrix is presented here in lower triangular form with the (redundant) information in the upper right hand corner omitted.

In addition to correlation coefficients between the dependent variable and each of the six independent variables, the correlation matrix shows the correlation between independent variables. One issue to be aware of is the potential for problems when independent variables are strongly correlated with each other.
The reason for this is that, because both variables tend to change consistently in the same manner, it is unclear which variable actually accounts for the change in performance. It is a tenet of linear regression that independent variables not be strongly correlated. If two independent variables are strongly correlated, it may be preferable to include only one of these variables, and not both, in the regression equation. This may apply to the Age and Log Traffic variables in Table 2-2.

Scatter Diagrams

A second approach to selecting independent variables based on correlation is to use scatter diagrams. Scatter diagrams are generated by plotting data in a pair-wise fashion. Possible scatter plot combinations that may be looked at include:
The plots are inspected for noticeable trends indicating a possible relationship. Independent variables that are judged to have the strongest relationships with the dependent variable are selected. In the rutting example introduced previously, a scatter plot of rut depth vs. age (Figure 2-2) shows a clear correlation. As indicated by the steadily increasing rut depth with age, age will likely be a strong contributory variable in the model.

The primary drawback of the scatter plot method is that the plots often show a shotgun-like pattern with no discernible trend. This is because a scatter plot presents only a two-dimensional cross section of the entire database, which does not account for the influence of other variables. An advantage of scatter plots is that they may reveal relationships and other features which would not be detected from the correlation coefficient matrix, such as variable range and distribution. The correlation coefficient is a measure of linear correlation only and will not indicate a non-linear relationship between variables, such as a quadratic one (i.e. the dependent variable Y has a linear relationship with the square of Xi).

Table 2-3 shows the variables used in the C-LTPP rutting model. Table 2-4 shows a sample of the data extracted from the C-LTPP database. The database used for the rutting model can be found in Appendix C. The first three independent (contributory) variables are measured properties of the AC mixture that can be controlled in its manufacture. The number of traffic loads that travel over a highway is often measured in KESAL's, which stands for 'Thousands of Equivalent Single Axle Loads'. A single large truck might consist of a few ESAL's and a passenger car only a small fraction of an ESAL.
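The lower triangular correlation matrix and the linear-only caveat discussed above can be illustrated with a short Python sketch. It is not the Excel correlation analysis tool: the function names `pearson_r` and `lower_triangular_matrix` are hypothetical, and the data values are invented for illustration rather than drawn from the C-LTPP database.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lower_triangular_matrix(columns):
    """Map each variable to its correlations with itself and earlier variables."""
    names = list(columns)
    return {
        name: [pearson_r(columns[name], columns[other]) for other in names[: i + 1]]
        for i, name in enumerate(names)
    }


# Hypothetical observations (illustrative values only).
data = {
    "rut_depth": [1.1, 2.0, 2.8, 4.1, 5.2, 5.9],
    "age": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "log_traffic": [2.0, 2.3, 2.5, 2.6, 2.9, 3.0],
}
matrix = lower_triangular_matrix(data)
for name, row in matrix.items():
    print(f"{name:>12}  " + "  ".join(f"{r:6.2f}" for r in row))

# A quadratic relationship that is symmetric about zero: y is fully
# determined by x, yet the linear correlation coefficient is zero.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [v ** 2 for v in x]
r_linear = pearson_r(x, y)
r_with_square = pearson_r([v ** 2 for v in x], y)
print("r(x, y)   =", r_linear)
print("r(x^2, y) =", r_with_square)
```

In this contrived case the coefficient between x and y is zero even though y depends entirely on x, while correlating y with the squared term recovers a coefficient of one. A scatter plot of the same data would make the curved relationship obvious.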
Table 2-4: Sample of C-LTPP Experimental Database for Rutting in AC Overlays
In the C-LTPP modelling project, formal definitions of these variables were provided for the encoding package (see Section 3.4.1) used to interview the experts. As an example, the following definition was used for the Percent Retained independent variable:

Percent Retained

"Percent Retained" is defined as the percentage of aggregate in the AC mix retained on a #4 sieve (on a dry weight basis). Percent retained is calculated for the top layer of the overlay only. The gradation of the mix is determined using AASHTO T30-871 on bulk AC samples.

Additional examples that provide a good discussion on selecting independent variables can be found in the following joint C-SHRP/agency application reports (available on the CD-ROM version of this manual):

Alberta - Predicting Roughness Progression on AC Overlays (pp. 3-6) (Kurlanda 1995)
Nova Scotia - Evaluation of Rutting in Nova Scotia Special "B" AC Overlays (pp. 5-6) (Ramia 1995)

2.6 Postulate Functional Form - Step 5

Step 5 of the template is concerned with the functional form of the relationship between the selected dependent variable, Y, and independent variables, Xi. The functional form is used in the analysis of both the experimental data and the prior information. There are many different possible functional forms. The most basic is additive-linear:
Equation 2.5

The benefit of the additive-linear form is that it is easy to understand and the coefficients (bi's) can be readily interpreted as proportionality constants between each independent variable and the dependent variable. For example, consider a simple model such as Rut Depth = 1.5 × Age. The coefficient for age can be directly interpreted as meaning that rut depth progresses at a rate of 1.5 mm per year.

In general, the functional form can have a relatively free-form structure as long as each term is made up of a coefficient multiplied by an arbitrary function of independent variables. Some examples are:
There is no definitive method for selecting the functional form. One approach has been simply to use the additive-linear form out of convenience. At a later time the additive-linear form can be enhanced by transforming individual variables or clustering two or more variables together. The functional form used for the C-LTPP rutting model has a simple additive-linear form.

It is useful to try to incorporate the underlying performance mechanism, even if approximate, in a regression model. In the rutting example, it was noted that traffic has a cumulative effect which was not represented well by the average annual traffic variable. It is intuitively reasonable that a cluster term which considers cumulative traffic would be better than simply considering age and traffic as separate independent variables. The form used in the second iteration of the model replaced the separate age and traffic terms with a cluster term of Age × ESAL's in order to incorporate the desired cumulative traffic effect.

In a regression analysis, the functional form must postulate variable interaction, such as the cumulative effect of traffic. Bayesian and classical regression determine only linear coefficients for the variables. Thus determining variable interaction is a trial and error process, with successive functional forms postulated and evaluated.

Another approach to determining functional form is to do a literature review. This approach allows an agency to apply its own data and judgment to a published model. The result will be a model specific to the agency's regional location. It also gives the agency an opportunity to compare its model results with others.

When the functional form has been selected, the experimental dependent and independent data must be transformed accordingly. For example, in the C-LTPP rutting model the raw traffic data (in the form of KESAL's) was log transformed.
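The data transformation step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the field names, the `transform` helper and the numeric values are all hypothetical, and the base-10 logarithm is assumed for the log transform. It is not the procedure used by BSTAT or XLBayes.

```python
import math

# Hypothetical raw records for illustration only; these values are
# not taken from the C-LTPP database.
raw_records = [
    {"age_years": 2.0, "annual_kesal": 100.0, "rut_depth_mm": 2.0},
    {"age_years": 5.0, "annual_kesal": 1000.0, "rut_depth_mm": 6.5},
    {"age_years": 8.0, "annual_kesal": 250.0, "rut_depth_mm": 7.0},
]


def transform(record):
    """Build the regression terms for one record under the assumed form."""
    return {
        # Log-transformed traffic (base 10 is an assumption made here).
        "log_traffic": math.log10(record["annual_kesal"]),
        # Cluster term: age multiplied by annual traffic, standing in
        # for cumulative traffic over the life of the overlay.
        "age_x_traffic": record["age_years"] * record["annual_kesal"],
        # The dependent variable is carried through unchanged.
        "rut_depth_mm": record["rut_depth_mm"],
    }


transformed = [transform(record) for record in raw_records]
for row in transformed:
    print(row)
```

Each transformed column then enters the regression as an ordinary term with its own linear coefficient, so trying a different functional form amounts to changing `transform` and re-running the analysis.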
Selected additional examples that offer a good discussion on postulating functional form are available on the CD-ROM version of this manual:

Nova Scotia - Evaluation of Rutting in Nova Scotia's Special "B" AC Overlays (p. 7) (Ramia 1995)
Quebec - Pavement Performance in Frost Conditions (pp. 13-16) (Doré 1995)
Manitoba - Benkelman Beam Rebound - AC Overlay Model (pp. 5-6) (Kavanagh 1995)

Summary of Chapter 2

This chapter has presented the basic steps involved in regression model design. The process is essentially identical for both Bayesian and classical regression. The choices made in the design phase are critical, as a successful outcome will be highly dependent on appropriate variable and functional form selection. Once the design is complete, the prior may be developed, the experimental data assembled and the model analyzed.

References

Afrani, I., Bradbury, A., Hajek, J., Deterioration of Asphaltic Concrete Surfaces Containing Steel Slag, Joint C-SHRP/Ontario Bayesian Application, Canadian Strategic Highway Research Program, Transportation Association of Canada, Ottawa, 1995.

Canadian Strategic Highway Research Program, C-LTPP Database User's Guide, Transportation Association of Canada, Ottawa, 1996.

Doré, G., The Development of Models to Predict Pavement Performance in Frost Conditions, Joint C-SHRP/Quebec Bayesian Application, Canadian Strategic Highway Research Program, Transportation Association of Canada, Ottawa, 1995.

Kavanagh, L., Benkelman Beam Rebound - AC Overlay Model, Joint C-SHRP/Manitoba Bayesian Application, Canadian Strategic Highway Research Program, Transportation Association of Canada, Ottawa, 1995.

Kurlanda, Marian H., Kajner, L., Predicting Roughness Progression of Asphalt Overlays, Joint C-SHRP/Alberta Bayesian Application, Canadian Strategic Highway Research Program, Transportation Association of Canada/Alberta Transportation and Utilities, Ottawa, 1995.

Lytton, R., Technical Memorandum for SHRP Contract 89-P-020, 1989.

Ramia, A. P., Ali, N., Speiran, K., Evaluation of Rutting in Nova Scotia's Special "B" Asphalt Concrete Overlays, Joint C-SHRP/Nova Scotia Bayesian Application, Canadian Strategic Highway Research Program, Transportation Association of Canada, Ottawa, 1995.

Sparks, G., Nickeson, M., Kajner, L., Kaweski, D., Jorgenson, J., Bayesian Rutting Model Working Paper: Developing the Regression Model, Canadian Strategic Highway Research Program, Transportation Association of Canada, Ottawa, 1993.

Vemax Management Inc./Decision Focus Inc., Training Sessions in Bayesian Methods and Software, Canadian Strategic Highway Research Program, Transportation Association of Canada, Ottawa, 1995.

Vemax Management Inc., C-LTPP Bayesian Analysis Project - Consolidated Working File, Canadian Strategic Highway Research Program, Transportation Association of Canada, Ottawa, 1994.

Widger, A., Schmidt, R., Subgrade Shear Failures: Joint C-SHRP Saskatchewan Bayesian Application, Canadian Strategic Highway Research Program, Transportation Association of Canada, Ottawa, 1995.