This post fine-tunes GPT2 on a classical Chinese poetry dataset, covering how to call the GPT2 model, how to build the dataset, and how to run the fine-tuning.
1. Inference with the GPT2 Chinese Model

1.1 Downloading the GPT2 Model

For the detailed download procedure, see the first post in the LLM application series.
Download gpt2-chinese-cluecorpussmall from HF-mirror.
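The original defers the download steps to an earlier post. As a minimal sketch, one way to fetch the model through HF-mirror with huggingface_hub is shown below; the repo id uer/gpt2-chinese-cluecorpussmall and the local target directory are assumptions to adjust to your setup.

```python
import os

# Route Hugging Face downloads through the HF-mirror endpoint.
# Must be set before importing huggingface_hub.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import snapshot_download

# repo_id and local_dir are illustrative assumptions
snapshot_download(
    repo_id="uer/gpt2-chinese-cluecorpussmall",
    local_dir="gpt2-chinese-cluecorpussmall",
)
```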
The main steps are: load the model and tokenizer, build a text-generation pipeline, and call the model to generate text. The code is as follows:
```python
from transformers import GPT2LMHeadModel, BertTokenizer, TextGenerationPipeline

# Path to the downloaded gpt2-chinese-cluecorpussmall model
# (the original left the path empty; point this at your local directory)
model = GPT2LMHeadModel.from_pretrained('gpt2-chinese-cluecorpussmall')
tokenizer = BertTokenizer.from_pretrained('gpt2-chinese-cluecorpussmall')
print(model)

# Build the text-generation pipeline on CPU
text_generator = TextGenerationPipeline(model, tokenizer, device='cpu')

# Generate three samples from the same prompt
for i in range(3):
    print(text_generator("这是很久之前的事情了,", max_length=100, do_sample=True))
```
The output is shown below. Because do_sample is enabled, the three generated texts are all different.
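As a side note on do_sample: the pipeline forwards generation keyword arguments to model.generate, so sampling can be tuned or switched off. A minimal sketch using the text_generator defined above (the parameter values are illustrative):

```python
# Greedy decoding: the same prompt always yields the same continuation
print(text_generator("这是很久之前的事情了,", max_length=100, do_sample=False))

# Sampling with temperature / top_k gives varied outputs on each call
print(text_generator("这是很久之前的事情了,", max_length=100,
                     do_sample=True, temperature=0.9, top_k=50))
```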
2. Training the GPT2 Chinese Model Locally

2.1 Preparing the Data

Prepare a corpus of the kind of text you want GPT2 to generate, for example classical Chinese poems; little to no annotation is needed.
The dataset and code can be obtained from 微调GPT中文生成模型,生成古诗风格资源-CSDN文库.
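The original does not show the corpus format. Judging from the dataset class in the next subsection (readlines plus strip), chinese_poems.txt is assumed to hold one poem per line, roughly like this (the two sample lines are illustrative, taken from well-known poems):

```
床前明月光，疑是地上霜。举头望明月，低头思故乡。
白日依山尽，黄河入海流。欲穷千里目，更上一层楼。
```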
2.2 Building the Dataset Class

The dataset class is constructed as follows:
```python
from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self):
        # Each line of chinese_poems.txt is one poem (one training sample)
        with open(file='./data/chinese_poems.txt', encoding='utf-8') as file:
            lines = file.readlines()
        lines = [i.strip() for i in lines]
        self.lines = lines

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, item):
        return self.lines[item]


if __name__ == "__main__":
    dataset = MyDataset()
    for data in dataset:
        print(data)
```
2.3 Training the Model

The code for train.py is as follows:
```python
from transformers import AdamW
from transformers.optimization import get_scheduler
import torch
from data import MyDataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.utils.data import DataLoader

dataset = MyDataset()

model = AutoModelForCausalLM.from_pretrained('gpt2-chinese-cluecorpussmall')
tokenizer = AutoTokenizer.from_pretrained('gpt2-chinese-cluecorpussmall')


def collate_fn(data):
    # Tokenize a batch of poems, padding/truncating to at most 512 tokens
    data = tokenizer.batch_encode_plus(
        batch_text_or_text_pairs=data,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )
    # For causal LM training, the labels are the input ids themselves
    data['labels'] = data['input_ids'].clone()
    return data


data_loader = DataLoader(
    dataset=dataset,
    batch_size=2,
    shuffle=True,
    drop_last=True,
    collate_fn=collate_fn
)
print(len(data_loader))


def train():
    EPOCH = 3000
    global model
    DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(DEVICE)

    optimizer = AdamW(model.parameters(), lr=2e-5)
    scheduler = get_scheduler(name='linear',
                              num_warmup_steps=0,
                              num_training_steps=len(data_loader),
                              optimizer=optimizer)

    model.train()
    for epoch in range(EPOCH):
        for i, data in enumerate(data_loader):
            # Move the batch to the training device
            for k in data.keys():
                data[k] = data[k].to(DEVICE)

            out = model(**data)
            loss = out['loss']

            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            model.zero_grad()

            if i % 50 == 0:
                # Token-level accuracy: compare shifted labels with predictions,
                # ignoring padding positions (token id 0)
                labels = data["labels"][:, 1:]
                out = out["logits"].argmax(dim=2)[:, :-1]

                select = labels != 0
                labels = labels[select]
                out = out[select]
                del select

                acc = (labels == out).sum().item() / labels.numel()
                lr = optimizer.state_dict()["param_groups"][0]['lr']
                print(f"epoch:{epoch},batch:{i},loss:{loss.item()},lr:{lr},acc:{acc}")

                # Periodically save the fine-tuned weights
                torch.save(model.state_dict(), "params/net.pt")
                print("权重保存成功!")


if __name__ == '__main__':
    train()
```
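The original stops at saving the weights. As a minimal sketch, assuming the params/net.pt file produced by train.py, the fine-tuned weights can be loaded back into the base model and used with the same pipeline as in section 1 (the prompt is illustrative):

```python
import torch
from transformers import GPT2LMHeadModel, BertTokenizer, TextGenerationPipeline

# Base architecture and tokenizer, then overwrite with the fine-tuned weights
model = GPT2LMHeadModel.from_pretrained('gpt2-chinese-cluecorpussmall')
tokenizer = BertTokenizer.from_pretrained('gpt2-chinese-cluecorpussmall')
model.load_state_dict(torch.load("params/net.pt", map_location="cpu"))
model.eval()

text_generator = TextGenerationPipeline(model, tokenizer, device='cpu')
print(text_generator("白日依山尽,", max_length=50, do_sample=True))
```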
3. Additional Notes

3.1 Training Data for a Generative Model

The training data carries no separate labels, only text: the text itself serves as the label. The model predicts the next token from the preceding tokens, the following text acts as the target, and the loss is computed between prediction and target.
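Concretely, when labels are set equal to input_ids (as in collate_fn above), Hugging Face causal LM models shift the labels internally: position t is predicted from tokens 0..t-1, and cross-entropy is computed over the shifted pair. A minimal sketch of the equivalent manual computation (the sample sentence is illustrative):

```python
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2-chinese-cluecorpussmall')
model = AutoModelForCausalLM.from_pretrained('gpt2-chinese-cluecorpussmall')

enc = tokenizer("白日依山尽,黄河入海流。", return_tensors="pt")
out = model(input_ids=enc["input_ids"], labels=enc["input_ids"])

# Manual equivalent: predict token t from tokens < t, then cross-entropy
logits = out.logits[:, :-1, :]        # predictions for positions 0..n-2
targets = enc["input_ids"][:, 1:]     # the "next token" at each position
manual_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                              targets.reshape(-1))

print(out.loss, manual_loss)          # the two values should match
```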