Fine-Tune Language Models: Instruction Tuning

Natural Language Processing

Posted by Cheng Kang on 2022-09-10

You can find the code here:

This article uncovers Instruction Tuning, a new paradigm proposed by Quoc V. Le's team at Google. A related application is InstructionNER: A Multi-Task Instruction-Based Generative Framework for Few-shot NER. This blog aims to show how to use the Instruction Tuning strategy in practice.

Requirements:

  • Python/3.6.8-foss-2019a
  • pip install -r requirements.txt
  • mt5-base or mt5-large

Outline:

  1. Transfer Data to Instruction
  2. Pretrain on Professional Language Data
  3. Fine-Tune on Instruction
  4. Tips

1. Transfer Data to Instruction

This turns the task into a seq2seq problem. For example, we use the template [Find the entity of {Thing} in article {Ants hurriedly carry the evening, a long road, holding the endless happiness of returning.}?] as a seq2seq task to replace the entity-extraction task of extracting {Thing: ant, road} from {Ants hurriedly carry the evening, a long road, holding the endless happiness of returning.}. You can also transfer CoLA, STS, RTE, and other tasks to instructions by constructing a proper seq2seq template, as sketched below.
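
As a quick illustration of the same idea on a sentence-pair task, here is a hypothetical STS-style conversion; the sentences, labels, and template are made up for illustration:

# Hypothetical STS sample: the gold label becomes the seq2seq target.
sts_sample = {
    "text_a": "A man is playing a guitar.",
    "text_b": "A person plays an instrument.",
    "label": "similar",
}
template = "[{}] and [{}], are the two sentences similar? Options: {}"
instruction = template.format(sts_sample["text_a"], sts_sample["text_b"], "similar/dissimilar")
target = sts_sample["label"]
print(instruction)  # the model input
print(target)       # the model output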

The following class transfers NER data into NERInstruction format.
You can set "label_mappings" to whatever label set you need, e.g. Abstract/Time/Quantifier/Address/Measurement/Thing/Special Name/Organization/Name.

from typing import Dict, List


class NERInstruction(Instruction):  # Instruction is the base class defined in the repo
    def __init__(self, data_list: List, verbalizer: Dict, instruction: str, keys_order: List[str], data_type: str):
        super(NERInstruction, self).__init__(data_list, verbalizer, instruction, keys_order, data_type)

    def transform2instruction(self):
        examples = []
        for sample in self.data_list:
            example = {k: v for k, v in sample.items()}
            # Join the gold entities into one string: this becomes the seq2seq target.
            example["target"] = ",".join(example["entities"]) if "entities" in example and type(example["entities"]) is list else ""
            # Map the raw entity type to its verbalized label name.
            example["entity_type"] = self.verbalizer[example["entity_type"]]
            # All candidate labels, joined so they can be shown to the model as options.
            example["verbalizer"] = "/".join(list(set(self.verbalizer.values())))
            # Fill the prompt template with the fields in the configured order.
            example["instruction"] = self.instruction.format(*[
                example[k] for k in self.keys_order
            ])
            example["data_type"] = self.data_type
            examples.append(example)
        return examples

From Entity Extraction:

{
    "context": "Ants hurriedly carry the evening, a long road, holding the endless happiness of returning.",
    "entity_type": "Thing",
    "entities": [
    "road",
    "ant"
    ],
    "ID": "NER_SANWEN_001"
}

To Instruction:

{
    "context": "Ants hurriedly carry the evening, a long road, holding the endless happiness of returning.",
    "entity_type": "Thing",
    "entities": [
        "road",
        "ant"
    ],
    "ID": "NER_SANWEN_001",
    "target": "road, ant",
    "verbalizer": "Abstract/Time/Quantifier/Address/Measurement/Thing/Special Name/Organization/Name",
    "instruction": "Find the entity of {Thing} in article {Ants hurriedly carry the evening, a long road, holding the endless happiness of returning.}?",
    "data_type": "ner"
}
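
A minimal usage sketch tying the class to this example. The identity verbalizer mapping and the English template here are assumptions for illustration (the repo's actual prompts are in Chinese, see Section 2), and it assumes the base Instruction class simply stores its constructor arguments as attributes:

samples = [{
    "context": "Ants hurriedly carry the evening, a long road, holding the endless happiness of returning.",
    "entity_type": "Thing",
    "entities": ["road", "ant"],
    "ID": "NER_SANWEN_001",
}]
verbalizer = {"Thing": "Thing"}  # raw label -> verbalized label (assumed identity mapping)
ner = NERInstruction(
    data_list=samples,
    verbalizer=verbalizer,
    instruction="Find the entity of {{{}}} in article {{{}}}?",  # double braces render literal { }
    keys_order=["entity_type", "context"],
    data_type="ner",
)
print(ner.transform2instruction()[0]["instruction"])
# Find the entity of {Thing} in article {Ants hurriedly carry the evening, ...}?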

2. Pretrain on Professional Language Data

The pretraining language dataset contains domain knowledge from healthcare, medicine, insurance, finance, and law.

The available instruction classes are: 1. ClassificationInstruction, 2. NLIInstruction, 3. STSInstruction, 4. PARAInstruction, 5. WeiboEmotionInstruction, 6. C3Instruction, 7. MRCInstruction, 8. WSCInstruction, 9. KEYSInstruction, 10. SUMMInstruction, 11. NERInstruction. Each task type maps to a prompt template and an instruction class, as configured below:
# Maps each task type to its prompt template, field order, and Instruction subclass
# (the subclasses are defined elsewhere in the repo).
dataset2instruction = {
    "senti": {
        "prompt": "【{}】这篇文章的情感态度是什么?{}",  # "What is the sentiment of this article? {options}"
        "keys_order": ["text_a", "verbalizer"],
        "instruction": ClassificationInstruction,
        "data_type": "classification",
    },
    "cls": {
        "prompt": "【{}】这篇文章的类别是什么?{}",  # "What is the category of this article? {options}"
        "keys_order": ["text_a", "verbalizer"],
        "instruction": ClassificationInstruction,
        "data_type": "classification",
    },
    "app": {
        "prompt": "【{}】这篇文章的类别是什么?{}",  # same prompt as "cls"
        "keys_order": ["text_a", "verbalizer"],
        "instruction": ClassificationInstruction,
        "data_type": "classification",
    },
    "news": {
        "prompt": "【{}】这篇文章的类别是什么?{}",  # same prompt as "cls"
        "keys_order": ["text_a", "verbalizer"],
        "instruction": ClassificationInstruction,
        "data_type": "classification",
    },
    "intent": {
        "prompt": "【{}】这句话的意图是什么?{}",  # "What is the intent of this sentence? {options}"
        "keys_order": ["text_a", "verbalizer"],
        "instruction": ClassificationInstruction,
        "data_type": "classification",
    },
    "nli": {
        "prompt": "【{}】和【{}】,以上两句话的逻辑关系是什么?{}",  # "What is the logical relation between the two sentences above? {options}"
        "keys_order": ["text_a", "text_b", "verbalizer"],
        "instruction": NLIInstruction,
        "data_type": "classification",
    },
    "sts": {
        "prompt": "【{}】和【{}】,以上两句话的内容是否相似?{}",  # "Are the two sentences above similar in content? {options}"
        "keys_order": ["text_a", "text_b", "verbalizer"],
        "instruction": STSInstruction,
        "data_type": "classification",
    },
    "para": {
        "prompt": "【{}】和【{}】,以上两句话的内容是否相似?{}",  # same prompt as "sts"
        "keys_order": ["text_a", "text_b", "verbalizer"],
        "instruction": PARAInstruction,
        "data_type": "classification",
    },
    "mrc": {
        "prompt": "阅读文章【{}】问题【{}】的答案是什么?",  # "Read the article; what is the answer to the question?"
        "keys_order": ["context", "question"],
        "instruction": MRCInstruction,
        "data_type": "mrc",
    },
    "ner": {
        "prompt": "找出【{}】这篇文章中所有【{}】类型的实体?",  # "Find all entities of the given type in this article?"
        "keys_order": ["context", "entity_type"],
        "instruction": NERInstruction,
        "data_type": "ner",
    },
    "summ": {
        "prompt": "【{}】这篇文章的摘要是什么?",  # "What is the summary of this article?"
        "keys_order": ["passage"],
        "instruction": SUMMInstruction,
        "data_type": "summ",
    },
    "keys": {
        "prompt": "【{}】这篇文章的关键词是什么?",  # "What are the keywords of this article?"
        "keys_order": ["text_a"],
        "instruction": KEYSInstruction,
        "data_type": "keys",
    },
    "wsc": {
        "prompt": "文章【{}】中【{}】的是【{}】吗?{}",  # "In the article, does {span2} refer to {span1}? {options}"
        "keys_order": ["text", "target/span2_text", "target/span1_text", "verbalizer"],
        "instruction": WSCInstruction,
        "data_type": "classification",
    },
    "yesno": {
        "prompt": "阅读文章【{}】问题【{}】?{}",  # "Read the article; question: {...}? {options}"
        "keys_order": ["text_b", "text_a", "verbalizer"],
        "instruction": MRCInstruction,
        "data_type": "classification",
    },
    "c3": {
        "prompt": "阅读文章【{}】问题【{}】{}",  # "Read the article; question: {...} {choices}"
        "keys_order": ["context", "question", "choice"],
        "instruction": C3Instruction,
        "data_type": "classification",
    },
    "weibo_emotion": {
        "prompt": "【{}】这篇文章的情感态度是什么?{}",  # same prompt as "senti"
        "keys_order": ["text_a", "verbalizer"],
        "instruction": WeiboEmotionInstruction,
        "data_type": "classification",
    },
    "lsht": {
        "prompt": "【{}】这篇文章的类别是什么?{}",  # same prompt as "cls"
        "keys_order": ["content", "verbalizer"],
        "instruction": ClassificationInstruction,
        "data_type": "classification",
    }
}


def instruction_format(data_dict: Dict) -> List[Dict]:
    # Datasets that need a dedicated template instead of the default one for their task type.
    special_datasets = {
        "dureader_yesno": "yesno",
        "c3": "c3",
        "NLPCC2014_Weibo_Emotion_classification": "weibo_emotion",
        "NLPCC2014_LSHT_sample": "lsht"
    }
    instruction_data = []
    for data_type, type_dict in data_dict.items():
        for data_name, data_info in type_dict.items():
            label_mappings = data_info.get("label_mappings")
            data_list = data_info["data_list"]
            format_info = dataset2instruction[special_datasets.get(data_name, data_type)]
            instruction_processor = format_info["instruction"](
                data_list,
                label_mappings,
                format_info["prompt"],
                format_info["keys_order"],
                format_info["data_type"]
            )
            instruction_data.extend(instruction_processor.transform2instruction())
            print(instruction_data[-1])  # show the last converted example of each dataset

    return instruction_data
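
A hypothetical call; the nesting (task type → dataset name → info dict) is inferred from the loop above, and "sanwen" is a made-up dataset name:

data_dict = {
    "ner": {
        "sanwen": {  # hypothetical dataset name
            "label_mappings": {"Thing": "Thing"},
            "data_list": [{
                "context": "Ants hurriedly carry the evening, a long road, holding the endless happiness of returning.",
                "entity_type": "Thing",
                "entities": ["road", "ant"],
                "ID": "NER_SANWEN_001",
            }],
        }
    }
}
instruction_data = instruction_format(data_dict)  # prints the last converted example per dataset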

3. Fine-Tune on Instruction
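
The converted instruction data is fed to mT5 as ordinary seq2seq pairs: the instruction string is the encoder input and the target string is the decoder output. Below is a minimal fine-tuning sketch using HuggingFace transformers and PyTorch; it assumes instruction_data from the previous step, and the batch size, learning rate, and epoch count are placeholders rather than the blog's settings:

import torch
from torch.utils.data import DataLoader
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def collate(batch):
    # Encoder input is the instruction, decoder target is the target string.
    inputs = tokenizer([ex["instruction"] for ex in batch], padding=True,
                       truncation=True, max_length=512, return_tensors="pt")
    targets = tokenizer([ex["target"] for ex in batch], padding=True,
                        truncation=True, max_length=64, return_tensors="pt")
    labels = targets.input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    inputs["labels"] = labels
    return inputs

loader = DataLoader(instruction_data, batch_size=8, shuffle=True, collate_fn=collate)
model.train()
for epoch in range(3):
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()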

Few-Shot
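
In the few-shot setting, the instruction-tuned model is fine-tuned further on only a handful of labeled instruction examples per target task (the budget, e.g. 16 or 32 per label, is up to you) before evaluation.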

Zero-Shot
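
In the zero-shot setting, the model is evaluated directly on task types it never saw during instruction tuning, so the instruction text alone has to convey what the model should do. A hypothetical inference call, reusing the model and tokenizer from the sketch above:

model.eval()
prompt = "Find the entity of {Thing} in article {Ants hurriedly carry the evening, a long road, holding the endless happiness of returning.}?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))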

4. Tips

1. Clean and Process Data: check the distribution of the data, remove abnormal symbols, and augment the data (by repetition or by adding new, relevant data); see the sketch below.
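
A minimal cleaning pass might look like the following; what counts as an "abnormal symbol" is corpus-specific, so treat the regexes as placeholders:

import re

def clean_text(text: str) -> str:
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", " ", text)  # drop control characters
    text = re.sub(r"\s{2,}", " ", text)                    # collapse runs of whitespace
    return text.strip()

print(clean_text("Ants\x07 hurriedly   carry the evening."))
# -> "Ants hurriedly carry the evening."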

2. Redesign the Label: a variation of SpEx-BERT, in which each entity type is assigned its own specific dimension.

3. Fusion of Language Models: combine several BERT-base encoders stripped of their fully connected layers with a BERT-large fully connected layer and the embedding.
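
One possible reading of this tip, sketched below with assumed module names and sizes rather than the blog's exact recipe: run several BERT encoders with their task heads removed, concatenate their pooled outputs, and put a single fully connected layer on top.

import torch
import torch.nn as nn
from transformers import BertModel

class FusionModel(nn.Module):
    """Hypothetical fusion of two BERT-base encoders sharing one classification head."""
    def __init__(self, num_labels: int = 2):
        super().__init__()
        self.encoder_a = BertModel.from_pretrained("bert-base-uncased")
        self.encoder_b = BertModel.from_pretrained("bert-base-uncased")
        # One fully connected head over the concatenated pooled outputs.
        self.classifier = nn.Linear(768 * 2, num_labels)

    def forward(self, input_ids, attention_mask):
        pooled_a = self.encoder_a(input_ids, attention_mask=attention_mask).pooler_output
        pooled_b = self.encoder_b(input_ids, attention_mask=attention_mask).pooler_output
        return self.classifier(torch.cat([pooled_a, pooled_b], dim=-1))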