***************************************************************************; ** PROGRAM: control_3MRCV.sas **; ** AUTHOR: Christopher R. Bilder **; ** Department of Statistics **; ** University of Nebraska-Lincoln **; ** chris@chrisbilder.com **; ** DATE: 3-23-06 **; ** PURPOSE: Control program for other programs that do the following: **; ** Fit a generalized loglinear model via the marginal modeling **; ** method and use the bootstrap and Rao-Scott approximations **; ** to find a p-value for testing the model of interest vs. the **; ** saturated model. This program is for only 3 MRCVs. **; ** NOTES: **; ** 1) Do not distribute copies of this program without the permission of**; ** the author. **; ** 2) Copyright 2006 Christopher R. Bilder **; ** 3) For more information on these procedures, see **; ** Bilder, C. R. and Loughin, T. M. (2006). Modeling Association **; ** between Two or More Categorical Variables that Allow for **; ** Multiple Category Choices. Submitted to Communications in Stat.**; ** 4) This is the control program for the analysis. The program **; ** reads in code from Gange_3MRCV.sas, data_gen_resample.sas, and **; ** boot_3MRCV.sas to do the analysis. **; ** 5) The data set needs to be in a multinomial format and each row **; ** represents a subject sampled. **; ** 7) The statements: **; ** options nomprint nosymbolgen nomlogic; **; ** options nosource nonotes; **; ** may be helpful to prevent the LOG window from having too much **; ** information displayed to it. Of course, this should only be used **; ** when one is confident the user is running the program correctly! **; ** 8) A form of conditional resampling is done here to make sure the **; ** marginal table has at least one positive response per item. This **; ** is done to keep the resampled marginal table the same size as what**; ** was observed. See the &totsim and &numbsim variables to control **; ** this. **; ** 9) This program can take some time to run. I STRONGLY recommend **; ** running a test case with a small number of resamples and a small **; ** number of iterations for the Gange algorithm. The program takes **; ** about 6 hours on a 2.4GHZ processor with 1GB memory computer. **; ** 10) With respect to specifying the model, the data set is transformed **; ** in the program to the following format: **; ** **; ** W Y Z wi yj zk COUNT **; ** 1 1 1 0 0 0 115 **; ** 1 1 1 0 0 1 8 **; ** 1 1 1 0 1 0 88 **; ** 1 1 1 0 1 1 28 **; ** 1 1 1 1 0 0 11 **; ** 1 1 1 1 0 1 2 **; ** 1 1 1 1 1 0 21 **; ** 1 1 1 1 1 1 6 **; ** 1 1 2 0 0 0 93 **; ** 1 1 2 0 0 1 30 **; ** ... **; ** **; **; ** The W, Y, and Z denote the item numbers. The wi, yj, and zk **; ** denote the 0 or 1 item responses. Thus, the first row count above**; ** denotes the W_1 = 0, Y_1 = 0, Z_1 = 0 sub-table cell count of 115.**; ** Models can be specified in a similar manner as shown in **; ** control_2MRCV.sas for the MODEL statement of PROC GENMOD. For **; ** example, W*Y*Z accounts for the gamma_ijk terms. **; ** PROC GENMOD Model statements are given in the Settings section **; ** below to model the data in this format. **; ** 11) These programs have only been tested in SAS version 8.02 **; ** 12) The code in boot_3MRCV.sas is not quite general enough to use with**; ** any data set. Code in the DATA STEP to form pred_sort2 and the **; ** DATA STEP to form sasdata_one3 would need to be changed to **; ** correspond to the correct number of W, Y, and Z items. **; ** 13) The RS MACRO calculates the V matrix of Appendix A. The program **; ** is currently set up to read the matrix in from v_mat_m.txt which **; ** contains the matrix from the Section 4.2 data set. This is done **; ** since it can take 20 minutes for the matrix to be calculated **; ** in PROC IML and it does not change for each model. When a **; ** different data set is used, one should changed some of the **; ** code in the RS MACRO (this part is commented - see macro for more **; ** information). **; ***************************************************************************; dm 'log;clear;output;clear;'; options ps=70 ls=100 pageno=1; goptions reset=all border ftext=swiss gunit=cm htext=0.4 htitle=0.5; options source notes; options mprint symbolgen mlogic; *Suppress information sent to the RESULTS window (the program will be slower without it); ODS noresults; title1 'Chris Bilder''s marginal modeling of 3 MRCVs program'; ************************************************************************************; *Read in other programs containing the code; * Do not forget to change the folder from where I stored the programs on my computer; * to where you are storing the programs on your computer!; *Part of Gange algorithm which finds a set of multinomial probabilities satisfying the * null hypothesis model; %include "C:\chris\UNL\research\NSF_grant_SES-0207212\paper_communications\revision\programs\Gange_3MRCV.sas"; *Uses the multinomial probabilities from the Gange program to take the resamples; %include "C:\chris\UNL\research\NSF_grant_SES-0207212\paper_communications\revision\programs\data_gen_resample.sas"; *Performs the bootstrap and Rao-Scott (RS) methods, also finds the standardized Pearson residuals; %include "C:\chris\UNL\research\NSF_grant_SES-0207212\paper_communications\revision\programs\boot_3MRCV_final.sas"; ***************************************************************************************; *Settings; *Number of resamples to take. This should be greater than or equal to the number actually desired.; * Resamples are not used if there are no positive responses for an item. This is done to keep; * the marginal table the same size as what was observed.; * I recommend using 6,000 with this data set.; %let totsim=6; *Number of resamples actually to use. For example, B = 5,000 are used for the 2 MRCV example in; * the paper. I recommend trying a low number (like 5) first in order to estimate how long * it will take for 5,000.; %let numbsim=5; *Number of row items for W; %let nrow2=3; *Number of column items for Y; %let ncol2=4; *Number of strata items for Z; %let nstrat2=5; *Convergence criterion for Gange program - max(abs(tau*_i - tau*_(i+1))) <= tol where; * tau*_i is a vector of multinomial probabilities for the ith iteration; %let tol = 0.00000001; *Maximum Gange iterations for iterative proportional fitting; * I recommend using at least 250, but test (time it) first with a smaller number.; %let gange_iter = 5; *Constant to add to 0 cells in the item response table; %let constant_zero = 0.5; *Level of significance used with bootstrap methods; %let alpha = 0.05; *Class statement to use in PROCs GENMOD and GLMMOD (2nd PROC is used to find the X matrix); %let class = w y z wi yj zk; *Complete independence model; %let comp_ind_model = w*y*z wi(w*y*z) yj(w*y*z) zk(w*y*z); *Ho model - everything after complete independence model; %let model = wi*yj wi*yj(y) wi*yj(w) wi*yj(y*w) yj*zk yj*zk(z) yj*zk(y) yj*zk(z*y); *Final model; *%let model = wi*yj wi*yj(y) wi*yj(w) wi*yj(y*w) yj*zk yj*zk(z) yj*zk(y); *%let model = wi*yj wi*yj(y) wi*yj(w) wi*yj(y*w) yj*zk yj*zk(z); *%let model = wi*yj wi*yj(y) wi*yj(w) wi*yj(y*w) yj*zk yj*zk(y); *%let model = wi*yj wi*yj(y) wi*yj(w) wi*yj(y*w) yj*zk; *%let model = ; *For complete independence model; *NOTE: The final model above with constant_zero = 0.5 will result in a few standardized Pearson ; * residuals which are greater than 2.576 in absolute value. These residuals are all ; * associated with the 0 cell count sub-tables. When this constant is lowered (say to 0.001),; * there are no more standardized Pearson residuals greater than 2.576 in absolute value; * The other models do have standardized Pearson residuals greater than 2.576 when adding; * this constant and these residuals are not associated with the 0 cell count sub-tables.; *Number of sub-tables - no changes need to be made here; %let ncol=%eval(&nrow2 + &ncol2 + &nstrat2); *Total number of multinomial probabilities - no changes need to be made here; %let numbp=%eval(2**&ncol); ***************************************************************************************; *Read in the data set and get it into the correct format; *W is test waste, Y is waste storage method, and Z is sources of veterinary information; data sasdata_one; infile 'C:\chris\UNL\research\NSF_grant_SES-0207212\paper_communications\revision\programs\3_MRCV_example_data.txt' firstobs=2; input w1-w3 y1-y4 z1-z5; run; *Add a dummy variable to the data set; * The programs are actually partially set up for a simulation study. By using just one iteration ; * and a few other modifications (already done here), the programs can be used for an example data set; data sasdata_two; set sasdata_one; iter = 1; run; *This DATA STEP reads in the V matrix (cov(m)) given in Appendix A. The boot_3MRCV.sas contains code in its * RS macro to calculate it, but this can take some time. Since the matrix does not * change for each model, I calculated it once and then just read it in for whatever model under * investigation. ; PROC IMPORT OUT= WORK.COV_M_SET DATAFILE= "C:\chris\UNL\research\NSF_grant_SES-0207212\paper_communications\revision\programs\v_mat_m.csv" DBMS=CSV REPLACE; GETNAMES=YES; DATAROW=2; RUN; ***************************************************************; * Bootstrap; *%SIM_LOOP(data set, seed number for resamples, sample size); * - Performs the bootstrap methods; * - This is located in the boot_3MRCV.sas program; %SIM_LOOP(sasdata_two, 636265496, 279); *%RS(sample size) - Finds the Rao-Scott (RS) adjusted Pearson statistic and corresponding p-value; * - Also finds the standardized Pearson residuals; * - This is located in the boot_3MRCV.sas program; %RS(279); ***************************************************************; *Sound to let user know the program is done; data _null_; call sound(300, 1000); run; quit;