Fine-Tune Language Models: Instruction Tuning

Natural Language Processing

Posted by Cheng Kang on 2022-09-10

You can find the code here:

This article uncovers Instruction Tuning, a new paradigm proposed by Quoc V. Le's team at Google. A related application is InstructionNER: A Multi-Task Instruction-Based Generative Framework for Few-shot NER. This blog aims to show how to use the Instruction Tuning strategy in practice.

Requirements:

  • Python/3.6.8-foss-2019a
  • pip install -r requirements.txt
  • mt5-base or mt5-large

Outline:

  1. Transfer Data to Instruction
  2. Pretrain on Professional Language Data
  3. Fine-Tune on Instruction
  4. Tips

1. Transfer Data to Instruction

This turns the task into a seq2seq problem. For example, we use the template [Find the entity of {Thing} in article {Ants hurriedly carry the evening, a long road, holding the endless happiness of returning.}?] as a seq2seq task to replace the entity-extraction task of extracting {Thing: ant, road} from {Ants hurriedly carry the evening, a long road, holding the endless happiness of returning.}. You can also transfer CoLA, STS, RTE, and other tasks to instructions by constructing a proper seq2seq template, as sketched below.
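
As a quick illustration of the same idea on a sentence-pair task, here is a hypothetical STS-style conversion; the sentences, labels, and template are made up for illustration:

# Hypothetical STS sample: the gold label becomes the seq2seq target.
sts_sample = {
    "text_a": "A man is playing a guitar.",
    "text_b": "A person plays an instrument.",
    "label": "similar",
}
template = "[{}] and [{}], are the two sentences similar? Options: {}"
instruction = template.format(sts_sample["text_a"], sts_sample["text_b"], "similar/dissimilar")
target = sts_sample["label"]
print(instruction)  # the model input
print(target)       # the model output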

The following class transfers NER data into NERInstruction format.
You can set "label_mappings" to whatever label set you need, e.g. Abstract/Time/Quantifier/Address/Measurement/Thing/Special Name/Organization/Name.

from typing import Dict, List


class NERInstruction(Instruction):  # Instruction is the base class defined in the repo
    def __init__(self, data_list: List, verbalizer: Dict, instruction: str, keys_order: List[str], data_type: str):
        super(NERInstruction, self).__init__(data_list, verbalizer, instruction, keys_order, data_type)

    def transform2instruction(self):
        examples = []
        for sample in self.data_list:
            example = {k: v for k, v in sample.items()}
            # Join the gold entities into one string: this becomes the seq2seq target.
            example["target"] = ",".join(example["entities"]) if "entities" in example and type(example["entities"]) is list else ""
            # Map the raw entity type to its verbalized label name.
            example["entity_type"] = self.verbalizer[example["entity_type"]]
            # All candidate labels, joined so they can be shown to the model as options.
            example["verbalizer"] = "/".join(list(set(self.verbalizer.values())))
            # Fill the prompt template with the fields in the configured order.
            example["instruction"] = self.instruction.format(*[
                example[k] for k in self.keys_order
            ])
            example["data_type"] = self.data_type
            examples.append(example)
        return examples

From Entity Extraction:

{
    "context": "Ants hurriedly carry the evening, a long road, holding the endless happiness of returning.",
    "entity_type": "Thing",
    "entities": [
    "road",
    "ant"
    ],
    "ID": "NER_SANWEN_001"
}

To Instruction:

{
    "context": "Ants hurriedly carry the evening, a long road, holding the endless happiness of returning.",
    "entity_type": "Thing",
    "entities": [
        "road",
        "ant"
    ],
    "ID": "NER_SANWEN_001",
    "target": "road, ant",
    "verbalizer": "Abstract/Time/Quantifier/Address/Measurement/Thing/Special Name/Organization/Name",
    "instruction": "Find the entity of {Thing} in article {Ants hurriedly carry the evening, a long road, holding the endless happiness of returning.}?",
    "data_type": "ner"
}
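
A minimal usage sketch tying the class to this example. The identity verbalizer mapping and the English template here are assumptions for illustration (the repo's actual prompts are in Chinese, see Section 2), and it assumes the base Instruction class simply stores its constructor arguments as attributes:

samples = [{
    "context": "Ants hurriedly carry the evening, a long road, holding the endless happiness of returning.",
    "entity_type": "Thing",
    "entities": ["road", "ant"],
    "ID": "NER_SANWEN_001",
}]
verbalizer = {"Thing": "Thing"}  # raw label -> verbalized label (assumed identity mapping)
ner = NERInstruction(
    data_list=samples,
    verbalizer=verbalizer,
    instruction="Find the entity of {{{}}} in article {{{}}}?",  # double braces render literal { }
    keys_order=["entity_type", "context"],
    data_type="ner",
)
print(ner.transform2instruction()[0]["instruction"])
# Find the entity of {Thing} in article {Ants hurriedly carry the evening, ...}?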

2. Pretrain on Professional Language Data

The pretraining language dataset contains domain knowledge from healthcare, medicine, insurance, finance, and law.

The available instruction classes are: 1. ClassificationInstruction, 2. NLIInstruction, 3. STSInstruction, 4. PARAInstruction, 5. WeiboEmotionInstruction, 6. C3Instruction, 7. MRCInstruction, 8. WSCInstruction, 9. KEYSInstruction, 10. SUMMInstruction, 11. NERInstruction. Each task type maps to a prompt template and an instruction class, as configured below:
# Maps each task type to its prompt template, field order, and Instruction subclass
# (the subclasses are defined elsewhere in the repo).
dataset2instruction = {
    "senti": {
        "prompt": "【{}】这篇文章的情感态度是什么?{}",  # "What is the sentiment of this article? {options}"
        "keys_order": ["text_a", "verbalizer"],
        "instruction": ClassificationInstruction,
        "data_type": "classification",
    },
    "cls": {
        "prompt": "【{}】这篇文章的类别是什么?{}",  # "What is the category of this article? {options}"
        "keys_order": ["text_a", "verbalizer"],
        "instruction": ClassificationInstruction,
        "data_type": "classification",
    },
    "app": {
        "prompt": "【{}】这篇文章的类别是什么?{}",  # same prompt as "cls"
        "keys_order": ["text_a", "verbalizer"],
        "instruction": ClassificationInstruction,
        "data_type": "classification",
    },
    "news": {
        "prompt": "【{}】这篇文章的类别是什么?{}",  # same prompt as "cls"
        "keys_order": ["text_a", "verbalizer"],
        "instruction": ClassificationInstruction,
        "data_type": "classification",
    },
    "intent": {
        "prompt": "【{}】这句话的意图是什么?{}",  # "What is the intent of this sentence? {options}"
        "keys_order": ["text_a", "verbalizer"],
        "instruction": ClassificationInstruction,
        "data_type": "classification",
    },
    "nli": {
        "prompt": "【{}】和【{}】,以上两句话的逻辑关系是什么?{}",  # "What is the logical relation between the two sentences above? {options}"
        "keys_order": ["text_a", "text_b", "verbalizer"],
        "instruction": NLIInstruction,
        "data_type": "classification",
    },
    "sts": {
        "prompt": "【{}】和【{}】,以上两句话的内容是否相似?{}",  # "Are the two sentences above similar in content? {options}"
        "keys_order": ["text_a", "text_b", "verbalizer"],
        "instruction": STSInstruction,
        "data_type": "classification",
    },
    "para": {
        "prompt": "【{}】和【{}】,以上两句话的内容是否相似?{}",  # same prompt as "sts"
        "keys_order": ["text_a", "text_b", "verbalizer"],
        "instruction": PARAInstruction,
        "data_type": "classification",
    },
    "mrc": {
        "prompt": "阅读文章【{}】问题【{}】的答案是什么?",  # "Read the article; what is the answer to the question?"
        "keys_order": ["context", "question"],
        "instruction": MRCInstruction,
        "data_type": "mrc",
    },
    "ner": {
        "prompt": "找出【{}】这篇文章中所有【{}】类型的实体?",  # "Find all entities of the given type in this article?"
        "keys_order": ["context", "entity_type"],
        "instruction": NERInstruction,
        "data_type": "ner",
    },
    "summ": {
        "prompt": "【{}】这篇文章的摘要是什么?",  # "What is the summary of this article?"
        "keys_order": ["passage"],
        "instruction": SUMMInstruction,
        "data_type": "summ",
    },
    "keys": {
        "prompt": "【{}】这篇文章的关键词是什么?",  # "What are the keywords of this article?"
        "keys_order": ["text_a"],
        "instruction": KEYSInstruction,
        "data_type": "keys",
    },
    "wsc": {
        "prompt": "文章【{}】中【{}】的是【{}】吗?{}",  # "In the article, does {span2} refer to {span1}? {options}"
        "keys_order": ["text", "target/span2_text", "target/span1_text", "verbalizer"],
        "instruction": WSCInstruction,
        "data_type": "classification",
    },
    "yesno": {
        "prompt": "阅读文章【{}】问题【{}】?{}",  # "Read the article; question: {...}? {options}"
        "keys_order": ["text_b", "text_a", "verbalizer"],
        "instruction": MRCInstruction,
        "data_type": "classification",
    },
    "c3": {
        "prompt": "阅读文章【{}】问题【{}】{}",  # "Read the article; question: {...} {choices}"
        "keys_order": ["context", "question", "choice"],
        "instruction": C3Instruction,
        "data_type": "classification",
    },
    "weibo_emotion": {
        "prompt": "【{}】这篇文章的情感态度是什么?{}",  # same prompt as "senti"
        "keys_order": ["text_a", "verbalizer"],
        "instruction": WeiboEmotionInstruction,
        "data_type": "classification",
    },
    "lsht": {
        "prompt": "【{}】这篇文章的类别是什么?{}",  # same prompt as "cls"
        "keys_order": ["content", "verbalizer"],
        "instruction": ClassificationInstruction,
        "data_type": "classification",
    }
}


def instruction_format(data_dict: Dict) -> List[Dict]:
    # Datasets that need a dedicated template instead of the default one for their task type.
    special_datasets = {
        "dureader_yesno": "yesno",
        "c3": "c3",
        "NLPCC2014_Weibo_Emotion_classification": "weibo_emotion",
        "NLPCC2014_LSHT_sample": "lsht"
    }
    instruction_data = []
    for data_type, type_dict in data_dict.items():
        for data_name, data_info in type_dict.items():
            label_mappings = data_info.get("label_mappings")
            data_list = data_info["data_list"]
            format_info = dataset2instruction[special_datasets.get(data_name, data_type)]
            instruction_processor = format_info["instruction"](
                data_list,
                label_mappings,
                format_info["prompt"],
                format_info["keys_order"],
                format_info["data_type"]
            )
            instruction_data.extend(instruction_processor.transform2instruction())
            print(instruction_data[-1])  # show the last converted example of each dataset

    return instruction_data
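
A hypothetical call; the nesting (task type → dataset name → info dict) is inferred from the loop above, and "sanwen" is a made-up dataset name:

data_dict = {
    "ner": {
        "sanwen": {  # hypothetical dataset name
            "label_mappings": {"Thing": "Thing"},
            "data_list": [{
                "context": "Ants hurriedly carry the evening, a long road, holding the endless happiness of returning.",
                "entity_type": "Thing",
                "entities": ["road", "ant"],
                "ID": "NER_SANWEN_001",
            }],
        }
    }
}
instruction_data = instruction_format(data_dict)  # prints the last converted example per dataset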

3. Fine-Tune on Instruction
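
The converted instruction data is fed to mT5 as ordinary seq2seq pairs: the instruction string is the encoder input and the target string is the decoder output. Below is a minimal fine-tuning sketch using HuggingFace transformers and PyTorch; it assumes instruction_data from the previous step, and the batch size, learning rate, and epoch count are placeholders rather than the blog's settings:

import torch
from torch.utils.data import DataLoader
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def collate(batch):
    # Encoder input is the instruction, decoder target is the target string.
    inputs = tokenizer([ex["instruction"] for ex in batch], padding=True,
                       truncation=True, max_length=512, return_tensors="pt")
    targets = tokenizer([ex["target"] for ex in batch], padding=True,
                        truncation=True, max_length=64, return_tensors="pt")
    labels = targets.input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    inputs["labels"] = labels
    return inputs

loader = DataLoader(instruction_data, batch_size=8, shuffle=True, collate_fn=collate)
model.train()
for epoch in range(3):
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()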

Few-Shot
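
In the few-shot setting, the instruction-tuned model is fine-tuned further on only a handful of labeled instruction examples per target task (the budget, e.g. 16 or 32 per label, is up to you) before evaluation.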

Zero-Shot
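
In the zero-shot setting, the model is evaluated directly on task types it never saw during instruction tuning, so the instruction text alone has to convey what the model should do. A hypothetical inference call, reusing the model and tokenizer from the sketch above:

model.eval()
prompt = "Find the entity of {Thing} in article {Ants hurriedly carry the evening, a long road, holding the endless happiness of returning.}?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))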

4. Tips

1. Clean and Process Data: check the distribution of the data, remove abnormal symbols, and augment the data (by repetition or by adding new, relevant data); see the sketch below.
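
A minimal cleaning pass might look like the following; what counts as an "abnormal symbol" is corpus-specific, so treat the regexes as placeholders:

import re

def clean_text(text: str) -> str:
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", " ", text)  # drop control characters
    text = re.sub(r"\s{2,}", " ", text)                    # collapse runs of whitespace
    return text.strip()

print(clean_text("Ants\x07 hurriedly   carry the evening."))
# -> "Ants hurriedly carry the evening."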

2. Redesign the Label: a variation of SpEx-BERT, in which each entity type is assigned its own specific dimension.

3. Fusion of Language Models: combine several BERT-base encoders stripped of their fully connected layers with a BERT-large fully connected layer and the embedding.
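
One possible reading of this tip, sketched below with assumed module names and sizes rather than the blog's exact recipe: run several BERT encoders with their task heads removed, concatenate their pooled outputs, and put a single fully connected layer on top.

import torch
import torch.nn as nn
from transformers import BertModel

class FusionModel(nn.Module):
    """Hypothetical fusion of two BERT-base encoders sharing one classification head."""
    def __init__(self, num_labels: int = 2):
        super().__init__()
        self.encoder_a = BertModel.from_pretrained("bert-base-uncased")
        self.encoder_b = BertModel.from_pretrained("bert-base-uncased")
        # One fully connected head over the concatenated pooled outputs.
        self.classifier = nn.Linear(768 * 2, num_labels)

    def forward(self, input_ids, attention_mask):
        pooled_a = self.encoder_a(input_ids, attention_mask=attention_mask).pooler_output
        pooled_b = self.encoder_b(input_ids, attention_mask=attention_mask).pooler_output
        return self.classifier(torch.cat([pooled_a, pooled_b], dim=-1))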