it is time to fine-tune the LLM

LLMs excel at general-context tasks, but they may not always align with a specific task or domain, so we can fine-tune a pre-existing model to improve the LLM's performance and accuracy in a particular field.

so I will learn LLM fine-tuning in this blog and start my learning record.

the key steps of LLM fine-tuning

  1. select a model. To tune a model on my laptop, with 16 GB of RAM and 6 GB of GPU memory, I need to pick one that can be tuned successfully on my machine without a CUDA out-of-memory error (and one that already performs well in the desired field).

  2. prepare a relevant dataset.

  3. preprocess the dataset: clean, re-format, and split it to meet the input format requirements.

  4. fine-tune, making the model more adapted and specialized for the particular domain or application.

  5. after fine-tuning, we obtain the parameter changes that adapt the model to the target domain, and we apply (or merge) those changes into the model we actually use.

different tuning types

  1. full fine-tuning: updates all model weights, effectively creating a new version of the model.

  2. parameter-efficient fine-tuning (PEFT): more efficient than full fine-tuning. PEFT only updates a subset of the parameters, which greatly reduces the demand for computational resources.

LoRA and QLoRA are the most widely used and effective PEFT methods.

LoRA

LoRA (Low-Rank Adaptation) uses two smaller low-rank matrices as the adapter.
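a minimal sketch of the idea in plain torch (the sizes below are made up for illustration): instead of updating the full weight matrix W, LoRA trains two small matrices A and B whose product has the same shape as W:

import torch

d, r = 4096, 8                  # hidden size and LoRA rank (illustrative values)
W = torch.randn(d, d)           # frozen pre-trained weight
A = torch.randn(r, d) * 0.01    # small trainable matrix
B = torch.zeros(d, r)           # small trainable matrix, initialized to zero
alpha = 16                      # scaling factor

# effective weight used at inference time: only A and B are ever trained
W_adapted = W + (alpha / r) * (B @ A)
print(W_adapted.shape)          # torch.Size([4096, 4096])

the adapter only stores A and B (2 * d * r values) instead of the full d * d matrix, which is why it is so cheap to train and to share.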

after LoRA fine-tuning on a specific task, we get a LoRA adapter; combining the adapter with the original LLM gives us a model that is more proficient in that specific domain.
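a hedged sketch of that combination step with the peft library, assuming an adapter was already trained and saved to a placeholder path ./my-adapter:

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
# attach the trained adapter weights on top of the frozen base model
tuned = PeftModel.from_pretrained(base, "./my-adapter")
# optionally fold the low-rank update into the base weights for easier deployment
merged = tuned.merge_and_unload()
merged.save_pretrained("./phi-2-merged")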

QLoRA

Quantized LoRA.

QLoRA works at lower precision: the pre-trained model is loaded into GPU memory with quantized (for example 4-bit) weights, in contrast to LoRA, which keeps the base model at full or half precision. Lower precision means less memory overhead.
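in code, QLoRA usually means loading the base model in 4-bit (with the same kind of BitsAndBytesConfig we set up later in this post) and then attaching LoRA adapters on top; a minimal sketch, with illustrative hyperparameters and target module names that vary by model:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", quantization_config=bnb_config, device_map={"": 0}
)

# re-enable gradients where needed and cast norm layers for stable k-bit training
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                 # rank of the adapter matrices (illustrative)
    lora_alpha=32,        # scaling factor (illustrative)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # check model.named_modules() for your model
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights are trainable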

environment preparation

we use a Jupyter notebook to practice ML/AI/DL…

we can use a Kaggle notebook, Colab, or a local environment.

install the necessary libraries for fine-tuning:

!pip install -q -U bitsandbytes transformers peft accelerate datasets scipy einops evaluate trl rouge_score

or you can use my pyproject.toml:

[project]
name = "llm"
version = "0.1.0"
description = "Add your description here"
authors = [{ name = "Invoker-pray", email = "jiaohongbao04@gmail.com" }]
dependencies = [
"transformers>=4.52.3",
"tensorflow>=2.19.0",
"torch>=2.7.0",
"torchvision>=0.22.0",
"torchaudio>=2.7.0",
"pandas>=2.2.3",
"sentencepiece>=0.2.0",
"tokenizers>=0.21.1",
"datasets>=3.6.0",
"evaluate>=0.4.3",
"scikit-optimize>=0.10.2",
"optuna>=4.3.0",
"tqdm>=4.67.1",
"jupyter>=1.1.1",
"accelerate>=1.7.0",
"bitsandbytes>=0.46.0",
"peft>=0.15.2",
"trl>=0.18.1",
"scipy>=1.15.3",
"nodejs>=0.1.1",
"gradio>=5.31.0",
"transformers-stream-generator>=0.0.5",
"einops>=0.8.1",
"requests>=2.32.3",
"huggingface-hub>=0.32.2",
"pyopenssl>=25.1.0",
"hf-xet>=1.1.2",
"tf-keras>=2.19.0",
"tiktoken>=0.9.0",
"dataset>=1.6.2",
"rouge-score>=0.1.2",
]
readme = "README.md"
requires-python = ">= 3.8"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.rye]
managed = true
dev-dependencies = []

[tool.hatch.metadata]
allow-direct-references = true

[tool.hatch.build.targets.wheel]
packages = ["src/llm"]

and then we can import the packages to confirm the environment works.

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    Trainer,
    GenerationConfig,
)
from tqdm import tqdm
from trl import SFTTrainer
import torch
import time
import pandas as pd
import numpy as np
from huggingface_hub import interpreter_login

interpreter_login()
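interpreter_login() prompts for a Hugging Face access token interactively; if you prefer not to be prompted, a token can also be passed directly (a sketch; the token string below is a placeholder):

from huggingface_hub import login

# hypothetical token value; use your own token from huggingface.co/settings/tokens
login(token="hf_xxx")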

Weights & Biases is a visualization tool: you can monitor the training process through values like loss and accuracy, and it also offers model version management, hyperparameter management, and more.

sometimes we do not want to use W&B, for example when the model is very simple, or we don't want to create an account, connect to the internet, or upload our logs; in that case we can disable it via:

import os
os.environ['WANDB_DISABLED'] = "true"
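another option, if you only want to keep W&B out of a single training run, is to tell the Trainer not to report to any logger (a sketch, assuming transformers' TrainingArguments; the output_dir is a placeholder):

from transformers import TrainingArguments

# report_to="none" disables the W&B (and other) logging integrations for this run
training_args = TrainingArguments(
    output_dir="./outputs",
    report_to="none",
)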

loading the dataset

we can use a dataset from Hugging Face, like:

huggingface_dataset_name = "neil-code/dialogsum-test"
dataset = load_dataset(huggingface_dataset_name)
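after loading, it is worth a quick look at the splits and one sample (the field names below match what this dataset provides and what we use later; adjust them for your own data):

print(dataset)                    # shows the available splits and their sizes
sample = dataset["test"][0]
print(sample["dialogue"][:200])   # the conversation to summarize
print(sample["summary"])          # the reference summary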

or we can use our custom dataset.

local dataset preparation and loading

for a local-data demonstration, I will use my WeChat chat history as the dataset.

like this:

import json
from datasets import load_dataset
dataset = load_dataset("json", data_files="train_data1.json")

and we can take a look at what is inside the data:

'🤔\n?外面什么阴风\n杀软verilog实验\n@Invoker\u2005强芯的verilog做到哪了\nChat History for Group Chat\n荷塘月色: 话说有没有玩过galgame的同学 可以告诉我白屏是酱紫放的嘛\n荷塘月色: [Photo]\n荷塘月色: 就是那个出货的时候 那个白屏是怎么闪的\n荷塘月色: 我看网上说什么先闪两下 然后快速闪三下 最后慢慢闪一下这种\n荷塘月色: emm所以有样例视频嘛(如果能放出来的话)...\n@上岸复旦cos小仓朝日\u2005\n不知道,我快进的没注意这些\n表情\n下班\n若智实验', '表情', 'You changed the group name to "看看群uのflag"\n表情\n我只要50是因为我似透了(恼)\n不能再拖延了\n唉唉,我得2月发邮件才有用', '事已至此先练坐位体前屈', '?\n帮我测', '唉小萝莉jhb', '表情', '表情\n立即出分的吗', '体测要带啥\n校园卡', '无所谓,我会赖掉1k', '反正只有毕业一个需求\n表情\n不是哥们,我跳远怎么才这么点', '图片\n奏了', '这跳远和50m的机子绝对有问题\n我草这跳远\n真有问题\n表情\n我感觉我会拼尽全力无法及格\n210及格\n你得跳220\n不是 我跳远测完了\n还差两个跑步\n跳远给我干没分了\n什么意思\n第一次180第二次1cm\n表情\n好像是踩线了\n神人项目\n这50也是傻逼啊\n你也坠机了吗\n他最好真跟保研没关系\n不然哥们不读研了\n傻逼\n"姚维颢" recalled a message\n我备战秋招去了\n要是没保上的话\n表情\n唉唉cs✌🏻本科学历有班上', '👍🏻\n表情\nbyd怎么模电也有带隙基准', '表情', '这是什么群', '这两天在调作息()\n已摆烂\n←到处觅食', '实习倒计时3day', '我有春游综合症\n现在什么都不想干', '我也空虚\n我也空虚\n表情\n我空虚一个月了\n啥正事没干\n表情\n天天打守望先锋\n看直播\n《近月少女的礼仪2》将于5月9日推出!中文版特典活动将于5月8日登场!欢迎提前加入愿望单!本作是由日本美少女游戏资深大厂 Navel 制作的超人气女装系列作品的第三部。国际中文版除收录游戏本篇中的内容外,也将包括Limited Edition追加的『樱小路露娜接后日谈的后日谈《对好答案后是美妙的问候》以及『堂兄妹理论及其中心』两个特别篇内容以及全角色语音内容。此外,游戏将拥有超越原版的1920X1080的高清画质,内置简繁日三版本,预定支持Steam成就、集换卡牌、云存储等功能,并将支持大量额外内容。https://www.bilibili.com/video/BV1mNQGYUEbj\nSteam:https://store.steampowered.com/app/3446150 \n官网商城:https://store.hikarifield.co.jp/shop/tsukiniyorisou_2nd  \n\n本作由铃平广与西又葵继续联袂负责原画与角色设计;由系列前作《近月1》、《少女理论》的核心剧本家东之助独立执笔;遥空、奏雨、桐谷华、川岛莉乃等人气声优为本作新角色献声。主题曲《Glitter》由知名歌姬美乡秋演唱。  \n\n《近月少女的礼仪2》曾于美少女游戏大赏中荣获综合第1、剧本第1、系统第8、作画第3、音乐第4、影片第6,角色樱小路露娜第1、八日堂朔莉第2名、艾斯特第5、大藏瑠美音第8等诸多奖项。并曾于萌系游戏大赏中荣获年度纯爱系作品金奖。\n\n#HIKARI FIELD# #hikarifield# #steam# #Navel# #近月少女的礼仪# #樱小路露娜# #视觉小说# #视觉小说游戏# #galgame# #GAL# #女装#\n送我\n不如去玩恋爱泡馍\n感觉跳票了\n下一位\n好像快有demo了\n虽然我也觉得跳了',

my dataset has two features:

  • input: the message someone sent to me.

  • output: the message I sent in response.

this is how the data is generated:


import pandas as pd
import json
from datasets import load_dataset
import os

folder_path = "csv"

all_conversations = []

# walk the exported csv folder and collect (sender, message) pairs for each chat
for dirpath, dirnames, filenames in os.walk(folder_path):
    for filename in filenames:
        if filename.endswith(".csv"):
            file_path = os.path.join(dirpath, filename)
            df = pd.read_csv(file_path)
            all_conversation = []
            for i in range(len(df) - 1):
                if df.iloc[i]["is_sender"] == 0:
                    all_conversation.append([0, df.iloc[i]["msg"]])
                else:
                    all_conversation.append([1, df.iloc[i]["msg"]])
            all_conversations.append(all_conversation)

# merge consecutive messages from the same sender into a single turn
convs = []
for conversation in all_conversations:
    conv = []
    buffer = []
    flag_init = 1
    before = 100
    for message in conversation:
        sender, msg = message
        if flag_init == 1:
            before = sender
            buffer.append(msg)
            flag_init = 0
        else:
            if sender == before:
                buffer.append(msg)
            else:
                conv.append([before, buffer])
                buffer = []
                buffer.append(msg)
                before = sender
    convs.append(conv)

# join each turn's messages with newlines
all = []
for conv in convs:
    one = []
    for message in conv:
        message[1] = "\n".join(message[1])
        one.append(message)
    all.append(one)

print(all)


pairs = []

# drop a leading turn sent by me, so every conversation starts with the other person
for i in range(len(convs)):
    if all[i][0][0] == 1:
        all[i].pop(0)


# only keep consecutive (other person -> me) turns as input/output pairs
for conv in all:
    for i in range(len(conv) - 1):
        sender, msg = conv[i]
        next_sender, next_msg = conv[i + 1]

        if sender == 0 and next_sender == 1:
            pairs.append({
                "input": msg,
                "output": next_msg
            })

#print(pairs)

# save as a JSON file
with open("train_data1.json", "w", encoding="utf-8") as f:
    json.dump(pairs, f, ensure_ascii=False, indent=2)

# load as a Hugging Face dataset
dataset = load_dataset("json", data_files="train_data1.json")
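the SFTTrainer we imported earlier trains on text samples, so at training time each input/output pair has to be rendered into a single prompt string; a minimal sketch of one possible template (the field names match the JSON above, but the template wording is my own, not from the original tutorial):

def format_example(example):
    # render one chat pair into an instruction-style training string
    return {
        "text": f"Instruct: Reply to the following message.\n"
                f"{example['input']}\nOutput:\n{example['output']}"
    }

train_dataset = dataset["train"].map(format_example)
print(train_dataset[0]["text"])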


create bitsandbytes configuration

we need a configuration class that specifies how we want to do the quantization.

we can use BitsAndBytesConfig to load the model in 4-bit format.

compute_dtype = getattr(torch, "float16")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)
key words explanation

  • compute_dtype = getattr(torch, "float16"): gets the float16 attribute from torch and assigns it to compute_dtype (equivalent to torch.float16).

  • bnb_config = BitsAndBytesConfig(): a Hugging Face class that holds the quantization parameters used when loading the model.

  • load_in_4bit: load the model weights in 4-bit.

  • bnb_4bit_quant_type='nf4': use the 'nf4' (NormalFloat4) quantization format.

  • bnb_4bit_compute_dtype=compute_dtype: the data type used during computation.

  • bnb_4bit_use_double_quant: whether to apply double quantization (quantizing the quantization constants to save a little more memory).

loading the pre-trained model

we use neil-code/dialogsum-test as the dataset, so we need a model that is strong at English language processing; we can use Phi-2 (2.7 billion parameters).


model_name = 'microsoft/phi-2'
device_map = {"": 0}
original_model = AutoModelForCausalLM.from_pretrained(model_name,
                                                      device_map=device_map,
                                                      quantization_config=bnb_config,
                                                      trust_remote_code=True,
                                                      use_auth_token=True)
key words explanation

  • model_name = 'microsoft/phi-2': selects the model. If it is not found on the local machine, it will be downloaded from the Hugging Face Hub.

  • device_map = {"": 0}: load the whole model onto GPU 0. With accelerate it can also be "auto".

  • original_model = AutoModelForCausalLM.from_pretrained(): instantiates the causal language model from the pre-trained weights.

  • quantization_config: the quantization configuration (our bnb_config above).

  • trust_remote_code: allow execution of custom code from the model repository.

  • use_auth_token: use the token of the currently logged-in account.
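once the quantized model is loaded, you can sanity-check that it actually fits in the 6 GB of GPU memory (get_memory_footprint is a transformers method that returns the size in bytes; the exact number depends on your setup):

# rough size of the loaded 4-bit weights in GB
print(f"{original_model.get_memory_footprint() / 1e9:.2f} GB")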

tokenization

a neural network can only process numerical tensors; it cannot deal with text directly. So we tokenize the text, transforming it into token IDs, for example:

"你好吗"

-> ["你","好","吗"]

-> [101,872,1962,1408,102]

there are different ways to tokenize:

  • word-level tokenizer: splits on words, suitable for languages separated by spaces.

  • character-level tokenizer: splits into characters, suitable for Chinese, Japanese.

  • subword tokenizer: a mix of both.

so we can configure the tokenizer via:

tokenizer = AutoTokenizer.from_pretrained(model_name,
                                          trust_remote_code=True,
                                          padding_side="left",
                                          add_eos_token=True,
                                          add_bos_token=True,
                                          use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
key words explanation

  • AutoTokenizer.from_pretrained(): loads the tokenizer from Hugging Face.

  • model_name: the same model as before, so the matching tokenizer is loaded.

  • trust_remote_code: allow execution of custom code from the model repository.

  • padding_side: whether padding for the input is added on the left or the right side.

  • add_eos_token / add_bos_token: eos = end-of-sequence token, bos = begin-of-sequence token.

  • use_fast: whether to use the fast (Rust-based) tokenizer implementation.

  • tokenizer.pad_token: some models do not define a pad_token and will raise an error, so we reuse the eos_token as padding.
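a quick round-trip check of the tokenizer (the exact IDs differ by model; this only illustrates encode/decode):

ids = tokenizer("How are you?")["input_ids"]
print(ids)                                              # token IDs (may include the bos/eos tokens enabled above)
print(tokenizer.decode(ids, skip_special_tokens=True))  # back to text: "How are you?"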

test the original model

check that the model can be used normally before fine-tuning.

%%time
from transformers import set_seed
seed = 42
set_seed(seed)

index = 10

prompt = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

formatted_prompt = f"Instruct: Summarize the following conversation.\n{prompt}\nOutput:\n"
res = gen(original_model,formatted_prompt,100, tokenizer, )
#print(res[0])
output = res[0].split('Output:\n')[1]

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{formatted_prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')
key words explanation

  • %%time: Jupyter magic that measures how long the cell takes to execute.

  • seed / set_seed: fixing the seed makes the LLM output reproducible.

  • index = 10: select a sample.

  • formatted_prompt = f"": combines the dialogue into an instruction-style input.

  • res = gen(): calls the generation helper, passing the model, prompt, maximum generation length, and tokenizer.

  • output = …: extracts the content after "Output:" as the generated summary text.

  • dash_line: a separator line for printing.

and here is a typical gen function:


def gen(model, prompt, max_new_tokens=100, tokenizer=None,
        device='cuda' if torch.cuda.is_available() else 'cpu'):
    """
    use the specified model and tokenizer to generate text.

    parameters:
        model: a transformers model that has already been loaded (by AutoModelForCausalLM)
        prompt: the input prompt
        max_new_tokens: maximum number of tokens to generate
        tokenizer: the tokenizer to use
        device: fallback device ('cuda' by default, 'cpu' if no GPU is available);
                only used when the model was not loaded with a device_map

    returns:
        the generated text (a list, usually of length 1)
    """
    if tokenizer is None:
        raise ValueError("tokenizer cannot be None.")

    # a model loaded with device_map (or 4-bit quantization) is already placed on the GPU
    # and cannot be moved with .to(); only move it when it has no device map
    if getattr(model, "hf_device_map", None) is None:
        model.to(device)

    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            eos_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

the next blog in this series is here.

learned from:

https://dassum.medium.com/fine-tune-large-language-model-llm-on-a-custom-dataset-with-qlora-fb60abdeba07