Title Page
Abstract
Contents
Chapter 1. Introduction 19
1.1. Statement of the problem 19
1.2. Context of the study 22
1.3. Objectives of the study and research questions 27
1.4. Organization of the dissertation 30
Chapter 2. Literature review 32
2.1. Historical overview of speaking assessment 32
2.1.1. From interviewer-led to group interview 33
2.1.2. Assessment via multimedia 34
2.1.3. Virtual environment (VE)-based testing 39
2.2. Models of L2 speaking test performance 45
2.2.1. McNamara's model (1996) 46
2.2.2. Skehan's model (1998) 47
2.2.3. Fulcher's extended model (2003) 48
2.2.4. Implications for MAR-based speaking tests 50
2.3. Affordances of augmented reality (AR) technology 51
2.3.1. Integration of text and picture comprehension 52
2.3.2. Social cues: personalization, embodiment and voice 53
2.3.3. Animation 54
2.3.4. Implication: Connection to language assessment design 55
2.4. Task characteristics framework for test design 58
2.4.1. Characteristics of the setting 59
2.4.2. Characteristics of the test rubrics 60
2.4.3. Characteristics of the input and expected response 65
2.4.4. Relationship between input and response 67
2.5. Test method characteristics 69
2.6. Validation framework 71
2.6.1. Historical overview 71
2.6.2. Constructing an Assessment Use Argument (AUA) 73
2.6.3. Interpretation/Use Argument (I/UA) structure 74
2.6.4. Validation framework for MARST 79
Chapter 3. Methodology 82
3.1. Test development 82
3.1.1. Domain analysis 83
3.1.2. Speaking construct 86
3.1.3. Test structure 87
3.1.4. Test task specifications 89
3.2. Participants 94
3.3. Data analysis 96
3.3.1. CTT and MTMM analysis 96
3.3.2. MFRM analysis 99
3.3.3. Questionnaires and interview data 101
3.3.4. Speaking response data 102
Chapter 4. Results 105
4.1. Descriptive analysis 105
4.1.1. Item analysis 107
4.1.2. Inter-rater reliability 110
4.1.3. Score reliability 112
4.2. MTMM analysis and test comparability 115
4.2.1. Correlation matrix 116
4.2.2. Factor analysis 118
4.3. MFRM analysis 124
4.3.1. Fit statistics 124
4.3.2. Interaction analysis 140
4.3.3. Analysis of unusual responses in MFRM 147
4.4. Analysis of the testing process 152
4.4.1. Perceptions of MAR mode 152
4.4.2. Analysis of speaking responses 158
4.5. Summary 176
Chapter 5. Validation 179
5.1. Validity argument 179
5.2. Analysis of target domain 182
5.3. Assessment records: Evaluation and generalization 183
5.4. Test interpretations: Explanation and extrapolation 189
5.4.1. Meaningfulness 189
5.4.2. Impartiality 193
5.4.3. Generalizability 195
5.4.4. Relevance and sufficiency 200
5.5. Decisions and test use: Utilization 202
5.6. Consequences 204
5.7. Summary of the validity argument 206
Chapter 6. Discussion 214
6.1. Summary of results for research questions 214
6.2. Validation issues 221
6.2.1. Integrating MAR technology in L2 assessment 221
6.2.2. Mode effect on test construct 224
6.2.3. Mode effect on test task 228
6.2.4. Mode effect on test-takers 231
6.2.5. Control of variabilities of test conditions 233
Chapter 7. Conclusion 235
7.1. Technological implications of MARST 235
7.2. Pedagogical implications of MARST 237
7.3. Limitations and suggestions for future research 238
Bibliography 241
Abstract in Korean (국문 초록) 259
Appendices 262
Appendix 1. Test design 262
Appendix 2. Questionnaires 269
Appendix 3. Mean scores of four dimensions (item easiness) 269
Appendix 4. Item-total correlation (item discrimination) 269
Appendix 5. Measure of agreement (Cohen's Kappa, p < .001) 270
Appendix 6. Predicted reliability for different test lengths (Spearman-Brown Prophecy formula) 270
Appendix 7. Unexpected responses (32 residuals) in MFRM analysis 270
Appendix 8. Misfit cases of test-takers' ability measures in MFRM analysis 271
Appendix 9. Sample transcripts of spoken responses to Task 3 271
Appendix 10. Main text (pp. 83–87, High School English, Han et al., YBM Holdings, 2018) 274
Appendix 11. One-way ANOVA test result 276
Table 1. Summary of selected evaluative research on test mode 45
Table 2. Relationship between rating categories and assessment features 61
Table 3. Relationship between assessment features and learning contents 63
Table 4. Features of the relationship between input and response of speaking test formats 69
Table 5. Test method characteristics and advantages and limitations of MARST 70
Table 6. Comparison of the validation frameworks developed by Bachman and Palmer (2010) and Kane (1992; 2006; 2013) 81
Table 7. Relevant English Language Speaking Achievement Standards in the 2015 revised national curriculum of high school 83
Table 8. Test structure 88
Table 9. Rating procedure 96
Table 10. MTMM Design for quantitative analysis 97
Table 11. Post-test questionnaires 102
Table 12. Item statistics for MARST tasks (N=194) 106
Table 13. Cronbach's alpha and ICC for rater agreement of MARST tasks (N=194) 106
Table 14. Statistics for MARST rating criteria (N=194) 107
Table 15. Item-total statistics (Accuracy) 113
Table 16. Item-total statistics (Fluency) 113
Table 17. Item-total statistics (Content) 114
Table 18. Item-total statistics (Sum scores) 114
Table 19. Correlation matrix 117
Table 20. Descriptive statistics of the MARST tasks 118
Table 21. Correlations of the MARST tasks 119
Table 22. KMO and Bartlett's test 120
Table 23. Communalities 120
Table 24. Variance explained 120
Table 25. Component matrix 121
Table 26. Descriptive statistics of various speaking measures 122
Table 27. Correlations of various speaking measures 122
Table 28. KMO and Bartlett's test 123
Table 29. Communalities 123
Table 30. Variance explained 123
Table 31. Component matrix 123
Table 32. Summary of test-taker facet statistics 129
Table 33. Frequencies (%) of test-taker fit mean square statistics (N=194) 130
Table 34. Task measurement report 131
Table 35. Rater measurement report 133
Table 36. Category (Accuracy) scale statistics 135
Table 37. Category (Fluency) scale statistics 135
Table 38. Category (Content) scale statistics 135
Table 39. Rating category measurement report 136
Table 40. Summary of test-takers' questionnaire results (N=199) 153
Table 41. Summary of teachers' questionnaire results (N=17) 156
Table 42. Token and type of the speaking response corpus in three proficiency groups 162
Table 43. Top ten keywords in three proficiency groups 163
Table 44. Top ten 4-grams in three proficiency groups 167
Table 45. Frequency of modality across three proficiency groups 174
Table 46. Summary of articulating the validity argument of the MARST test use 209
Table 47. Integrating MAR technology into developing language assessment 222
Table 48. MAR-mediated competence in connection to communicative competence and interactive competence for L2 speaking assessment 227
Figure 1. Proficiency and its relationship with performance 46
Figure 2. Model of oral test performance 47
Figure 3. Extended model of speaking test performance 49
Figure 4. Framework of the MAR-integrated L2 speaking assessment 57
Figure 5. Inferential links from consequences to assessment performance 73
Figure 6. Sketch of the MARST interpretive argument 76
Figure 7. Screenshots of the pre-test stage on the MAR app 89
Figure 8. Screenshots of Task 1 on the MAR app 90
Figure 9. Screenshots of Task 2-1 and 2-2 on the MAR app 92
Figure 10. Screenshots of Task 3 on the MAR app 93
Figure 11. A screenshot of ratings in the MAR app 98
Figure 12. Questionnaires in the MAR app 102
Figure 13. Histograms of scores of three rating categories 107
Figure 14-a. Item easiness (Sum) 108
Figure 14-b. Item easiness (Categories) 108
Figure 15. Item discrimination (Item-total correlation) 109
Figure 16. Measure of agreement (Cohen's Kappa coefficient, p < .001) 111
Figure 17. Predicted reliability (Spearman-Brown prophecy formula) 115
Figure 18. Scree plot (Four MARST tasks) 121
Figure 19. Scree plot (Various speaking measures) 122
Figure 20. Item characteristic curve (ICC) of test scores and 95% CI 125
Figure 21. All facet vertical rulers 126
Figure 22. Category (Accuracy) scale structure 136
Figure 23. Category (Fluency) scale structure 137
Figure 24. Category (Content) scale structure 137
Figure 25. Probability curve (ACC) 139
Figure 26. Probability curve (FLU) 139
Figure 27. Probability curve (CONT) 139
Figure 28. Interaction statistics between the MAR mode and gender 141
Figure 29. Interaction statistics between the MAR mode and region 141
Figure 30. Interaction statistics between the MAR mode and rating criteria 142
Figure 31. Interaction statistics between the MAR mode and task 143
Figure 32. Interaction statistics between the MAR mode and test-takers' proficiency level 143
Figure 33. Interaction statistics between rater and test-taker gender 144
Figure 34. Interaction statistics between rater and test-taker region 144
Figure 35. Interaction statistics between rater and rating criteria 145
Figure 36. Interaction statistics between rater and task 146
Figure 37. Teachers' workshop for practicing the MARST 158
Figure 38. A screenshot of loading a target corpus in the Corpus manager menu 160
Figure 39. A screenshot of loading a reference corpus in the Corpus manager menu 160