Title Page
Abstract
Contents
Chapter 1. Introduction 19
1.1. Statement of the problem 19
1.2. Context of the study 22
1.3. Objectives of the study and research questions 27
1.4. Organization of the dissertation 30
Chapter 2. Literature review 32
2.1. Historical overview of speaking assessment 32
2.1.1. From interviewer-led to group interview 33
2.1.2. Assessment via multimedia 34
2.1.3. Virtual environment (VE)-based testing 39
2.2. Models of L2 speaking test performance 45
2.2.1. McNamara's model (1996) 46
2.2.2. Skehan's model (1998) 47
2.2.3. Fulcher's extended model (2003) 48
2.2.4. Implications for MAR-based speaking tests 50
2.3. Affordances of augmented reality (AR) technology 51
2.3.1. Integration of text and picture comprehension 52
2.3.2. Social cues: personalization, embodiment and voice 53
2.3.3. Animation 54
2.3.4. Implication: Connection to language assessment design 55
2.4. Task characteristics framework for test design 58
2.4.1. Characteristics of the setting 59
2.4.2. Characteristics of the test rubrics 60
2.4.3. Characteristics of the input and expected response 65
2.4.4. Relationship between input and response 67
2.5. Test method characteristics 69
2.6. Validation framework 71
2.6.1. Historical overview 71
2.6.2. Constructing an Assessment Use Argument (AUA) 73
2.6.3. Interpretation/Use Argument (I/UA) structure 74
2.6.4. Validation framework for MARST 79
Chapter 3. Methodology 82
3.1. Test development 82
3.1.1. Domain analysis 83
3.1.2. Speaking construct 86
3.1.3. Test structure 87
3.1.4. Test task specifications 89
3.2. Participants 94
3.3. Data analysis 96
3.3.1. CTT and MTMM analysis 96
3.3.2. MFRM analysis 99
3.3.3. Questionnaires and interview data 101
3.3.4. Speaking response data 102
Chapter 4. Results 105
4.1. Descriptive analysis 105
4.1.1. Item analysis 107
4.1.2. Inter-rater reliability 110
4.1.3. Score reliability 112
4.2. MTMM analysis and test comparability 115
4.2.1. Correlation matrix 116
4.2.2. Factor analysis 118
4.3. MFRM analysis 124
4.3.1. Fit statistics 124
4.3.2. Interaction analysis 140
4.3.3. Analysis of unusual responses in MFRM 147
4.4. Analysis of the testing process 152
4.4.1. Perceptions of MAR mode 152
4.4.2. Analysis of speaking responses 158
4.5. Summary 176
Chapter 5. Validation 179
5.1. Validity argument 179
5.2. Analysis of target domain 182
5.3. Assessment records: Evaluation and generalization 183
5.4. Test interpretations: Explanation and extrapolation 189
5.4.1. Meaningfulness 189
5.4.2. Impartiality 193
5.4.3. Generalizability 195
5.4.4. Relevance and sufficiency 200
5.5. Decisions and test use: Utilization 202
5.6. Consequences 204
5.7. Summary of the validity argument 206
Chapter 6. Discussion 214
6.1. Summary of results for research questions 214
6.2. Validation issues 221
6.2.1. Integrating MAR technology in L2 assessment 221
6.2.2. Mode effect on test construct 224
6.2.3. Mode effect on test task 228
6.2.4. Mode effect on test-takers 231
6.2.5. Control of variabilities of test conditions 233
Chapter 7. Conclusion 235
7.1. Technological implications of MARST 235
7.2. Pedagogical implications of MARST 237
7.3. Limitations and suggestions for future research 238
Bibliography 241
Abstract in Korean (국문 초록) 259
Appendices 262
Appendix 1. Test design 262
Appendix 2. Questionnaires 269
Appendix 3. Mean scores of four dimensions (item easiness) 269
Appendix 4. Item-total correlation (item discrimination) 269
Appendix 5. Measure of agreement (Cohen's Kappa, p < .001) 270
Appendix 6. Predicted reliability for different test lengths (Spearman-Brown Prophecy formula) 270
Appendix 7. Unexpected responses (32 residuals) in MFRM analysis 270
Appendix 8. Misfit cases of test-takers' ability measures in MFRM analysis 271
Appendix 9. Sample transcripts of spoken responses to Task 3 271
Appendix 10. Main text (pp. 83–87, High School English, Han et al., YBM Holdings, 2018) 274
Appendix 11. One-way ANOVA test result 276
Table 1. Summary of selected evaluative research on test mode 45
Table 2. Relationship between rating categories and assessment features 61
Table 3. Relationship between assessment features and learning contents 63
Table 4. Features of the relationship between input and response of speaking test formats 69
Table 5. Test method characteristics and advantages and limitations of MARST 70
Table 6. Comparison of the validation frameworks developed by Bachman and Palmer (2010) and Kane (1992; 2006; 2013) 81
Table 7. Relevant English Language Speaking Achievement Standards in the 2015 revised national curriculum of high school 83
Table 8. Test structure 88
Table 9. Rating procedure 96
Table 10. MTMM Design for quantitative analysis 97
Table 11. Post-test questionnaires 102
Table 12. Item statistics for MARST tasks (N=194) 106
Table 13. Cronbach's alpha and ICC for rater agreement of MARST tasks (N=194) 106
Table 14. Statistics for MARST rating criteria (N=194) 107
Table 15. Item-total statistics (Accuracy) 113
Table 16. Item-total statistics (Fluency) 113
Table 17. Item-total statistics (Content) 114
Table 18. Item-total statistics (Sum scores) 114
Table 19. Correlation matrix 117
Table 20. Descriptive statistics of the MARST tasks 118
Table 21. Correlations of the MARST tasks 119
Table 22. KMO and Bartlett's test 120
Table 23. Communalities 120
Table 24. Variance explained 120
Table 25. Component matrix 121
Table 26. Descriptive statistics of various speaking measures 122
Table 27. Correlations of various speaking measures 122
Table 28. KMO and Bartlett's test 123
Table 29. Communalities 123
Table 30. Variance explained 123
Table 31. Component matrix 123
Table 32. Summary of test-taker facet statistics 129
Table 33. Frequencies (%) of test-taker fit mean square statistics (N=194) 130
Table 34. Task measurement report 131
Table 35. Rater measurement report 133
Table 36. Category (Accuracy) scale statistics 135
Table 37. Category (Fluency) scale statistics 135
Table 38. Category (Content) scale statistics 135
Table 39. Rating category measurement report 136
Table 40. Summary of test-takers' questionnaire results (N=199) 153
Table 41. Summary of teachers' questionnaire results (N=17) 156
Table 42. Token and type of the speaking response corpus in three proficiency groups 162
Table 43. Top ten keywords in three proficiency groups 163
Table 44. Top ten 4-grams in three proficiency groups 167
Table 45. Frequency of modality across three proficiency groups 174
Table 46. Summary of articulating the validity argument of the MARST test use 209
Table 47. Integrating MAR technology into developing language assessment 222
Table 48. MAR-mediated competence in connection to communicative competence and interactive competence for L2 speaking assessment 227
Figure 1. Proficiency and its relationship with performance 46
Figure 2. Model of oral test performance 47
Figure 3. Extended model of speaking test performance 49
Figure 4. Framework of the MAR-integrated L2 speaking assessment 57
Figure 5. Inferential links from consequences to assessment performance 73
Figure 6. Sketch of the MARST interpretive argument 76
Figure 7. Screenshots of the pre-test stage on the MAR app 89
Figure 8. Screenshots of Task 1 on the MAR app 90
Figure 9. Screenshots of Task 2-1 and 2-2 on the MAR app 92
Figure 10. Screenshots of Task 3 on the MAR app 93
Figure 11. A screenshot of ratings in the MAR app 98
Figure 12. Questionnaires in the MAR app 102
Figure 13. Histograms of scores of three rating categories 107
Figure 14-a. Item easiness (Sum) 108
Figure 14-b. Item easiness (Categories) 108
Figure 15. Item discrimination (Item-total correlation) 109
Figure 16. Measure of agreement (Cohen's Kappa coefficient, p < .001) 111
Figure 17. Predicted reliability (Spearman-Brown prophecy formula) 115
Figure 18. Scree plot (Four MARST tasks) 121
Figure 19. Scree plot (Various speaking measures) 122
Figure 20. Item characteristic curve (ICC) of test scores and 95% CI 125
Figure 21. All facet vertical rulers 126
Figure 22. Category (Accuracy) scale structure 136
Figure 23. Category (Fluency) scale structure 137
Figure 24. Category (Content) scale structure 137
Figure 25. Probability curve (ACC) 139
Figure 26. Probability curve (FLU) 139
Figure 27. Probability curve (CONT) 139
Figure 28. Interaction statistics between the MAR mode and gender 141
Figure 29. Interaction statistics between the MAR mode and region 141
Figure 30. Interaction statistics between the MAR mode and rating criteria 142
Figure 31. Interaction statistics between the MAR mode and task 143
Figure 32. Interaction statistics between the MAR mode and test-takers' proficiency level 143
Figure 33. Interaction statistics between rater and test-taker gender 144
Figure 34. Interaction statistics between rater and test-taker region 144
Figure 35. Interaction statistics between rater and rating criteria 145
Figure 36. Interaction statistics between rater and task 146
Figure 37. Teachers' workshop for practicing the MARST 158
Figure 38. A screenshot of loading a target corpus in the Corpus manager menu 160
Figure 39. A screenshot of loading a reference corpus in the Corpus manager menu 160