Title Page
Abstract
Contents
Chapter 1. Introduction 15
1.1. Motivation 15
1.2. Contributions 16
1.3. Dissertation Structure 18
Chapter 2. Background 19
2.1. In-context Learning 19
2.2. Manual Prompt Tuning 21
2.3. Automatic Prompt Engineering 21
2.3.1. Gradient-based Prompt Engineering 22
2.3.2. Gradient-free Prompt Engineering 23
2.4. Language Model as a Service 26
2.5. Computation Costs of Automatic Prompt Engineering Methods 26
2.6. Efficiency of Prompt Engineering Methods 29
2.7. Meta-learning 29
Chapter 3. Context Regularization for Gradient-based Prompt Tuning 31
3.1. Motivation 31
3.2. Approach 32
3.2.1. Context attuning 34
3.2.2. Context filtering 37
3.2.3. Practical objectives 38
3.3. Experimental Setup 40
3.4. Experimental Results 43
3.4.1. Main result 43
3.4.2. How many examples to concatenate? 46
3.4.3. The impact of each regularizer 49
3.4.4. Measurement of bias caused by prompts 50
3.4.5. Which examples to concatenate? 51
3.4.6. Few-shot in-context learning 52
3.5. Discussion 53
Chapter 4. Meta-Learning of Prompt Generation for In-context Learning 56
4.1. Motivation 56
4.2. MetaL-Prompt 57
4.2.1. Prompt generation model (PGM) 59
4.2.2. Trainable padding 61
4.2.3. Prompt design 61
4.3. Experimental Setup 62
4.4. Experimental Results 69
4.4.1. Prompt-only In-context Learning 69
4.4.2. Additional demonstrations with generated prompts 71
4.4.3. Transferability to different test settings 72
4.4.4. Comparison between various prompt designs 73
4.5. Discussion 75
Chapter 5. Conclusion 78
5.1. Conclusion 78
5.2. Future Work 79
5.2.1. Interpretation of in-context learning 79
5.2.2. Applicability of automatic prompt engineering according to tasks 80
5.2.3. Location of adaptation modules 80
Bibliography 82
초록 (Abstract in Korean) 94
List of Tables
Table 2.1. Classification of automatic prompt engineering methods based on gradient usage and computation cost for prompt creation. Methods that are... 22
Table 2.2. Number of examples that each prompt tuning method processed for the experiments in the original papers. Note that we present the additional costs... 27
Table 3.1. Hyperparameter search space for P-tuning, Softprompt, and Prefix-tuning at GPT2-XL and GPT-J 42
Table 3.2. Learning rate for P-tuning, Softprompt, and Prefix-tuning at GPT2-XL and GPT-J 42
Table 3.3. Comparison of zero-shot evaluation results between three baselines and the same baselines with CoRe applied, across various NLU datasets. As the evaluation metric, we use average accuracy. We highlight the better one among... 44
Table 3.4. The performance of P-tuning (s=1) and CoRe on P-tuning where s is the sequence size. We highlight the best performance among all sequence sizes. 47
Table 3.5. The performance of Softprompt (s=1) and CoRe on Softprompt where s is the sequence size. We highlight the best performance among all sequence sizes. 48
Table 3.6. The performance of Prefix-tuning (s=1) and CoRe on Prefix-tuning where s is the sequence size. We highlight the best performance among all... 49
Table 3.7. Ablation studies on the effect of the context attuning regularizer (A) and the context filtering regularizer (F) where the sequence size is 2. 50
Table 3.8. Comparison of one-step generalization ratio [40] before and after applying CoRe. 51
Table 3.9. Comparison of zero-shot performance of various sampling methods considering semantic similarity. Standard is CoRe without such sampling. 51
Table 4.1. The number of datasets for each task setting. There is no overlap between the meta-learning datasets and the evaluation datasets in each setting. 63
Table 4.2. Hyperparameters for the baselines. 66
Table 4.3. Number of examples that the baselines process with the forward and backward passes for each task. 67
Table 4.4. Comparison of evaluation results on the prompt-only setting between MetaL-Prompt and the baselines. All examples are used for prompt generation or tuning, and no additional demonstration is provided for in-context learning. 68
Table 4.5. Comparison of performance across various distributions of examples for prompting and demonstration. We use GPT2-XL for the LM and PGMs. 71
Table 4.6. Results for transferability of the PGMs. We evaluate the PGMs for GPT2-XL on test settings different from the training settings. 73
Table 4.7. Ablation studies on the effect of our prompt design. We evaluate various prompt designs on GPT2-XL and cls→cls. 74
List of Figures
Figure 2.1. Illustration of three categories of gradient-free methods. 24
Figure 3.1. Abstracted illustration of the two regularizers, context attuning (red line) and context filtering (blue line), of CoRe with sentiment analysis data.... 32
Figure 3.2. Zero-shot evaluation on varying sequence sizes. Accuracy for each dataset is normalized with respect to its accuracy when the sequence size is 1. 46
Figure 3.3. Comparison of few-shot in-context learning performance between P-tuning and P-tuning with CoRe, normalized by zero-shot performance of P-... 53
Figure 4.1. A workflow of MetaL-Prompt on LMaaS. 58
Figure 4.2. An illustration of meta-learning of a prompt generation model. 59