evaluate the model qualitatively

we need to evaluate the model's quality, and sometimes make further adjustments to the model's architecture, hyperparameters, or datasets based on the results.

we perform inference on the same input as before, this time with the PEFT model.
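the gen() helper used in the code below was defined in the training part of this article. As a rough reminder, a minimal sketch could look like the following; the signature gen(model, prompt, max_new_tokens, tokenizer) is assumed from the calls here, and the original helper may differ in details:

```python
import torch

def gen(model, prompt, max_new_tokens, tokenizer):
    # Tokenize the prompt and move it to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Generate up to max_new_tokens new tokens (greedy decoding).
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Return decoded strings (prompt + completion), matching how the result is indexed below.
    return [tokenizer.decode(ids, skip_special_tokens=True) for ids in output_ids]
```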

%%time
from transformers import set_seed
set_seed(seed)

index = 5
dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"

# generate with the PEFT model (up to 100 new tokens) and keep only the text after 'Output:\n'
peft_model_res = gen(ft_model, prompt, 100, tokenizer)
peft_model_output = peft_model_res[0].split('Output:\n')[1]
#print(peft_model_output)
# the fine-tuned model may keep generating after the summary; clip at the '###' end marker
prefix, success, result = peft_model_output.partition('###')

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'PEFT MODEL:\n{prefix}')

diff between evaluate and train:

| diffs | train | eval | reason |
| --- | --- | --- | --- |
| model | origin_model | ft_model | to show the difference after PEFT |
| output | raw text output | clipped with partition() | the PEFT model may emit extra tokens after the summary, so the output should be cleaned |
| gen() | pass tokenizer | none | packaged in PEFT |
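As the table notes, partition() is used to clip everything the model emits after the '###' end marker. A quick illustration with a made-up output string (str.partition is a Python builtin that splits on the first occurrence of the separator):

```python
# Hypothetical raw output from the PEFT model.
peft_model_output = "Amanda baked cookies and will bring some to Jerry tomorrow.\n### Instruction: ..."

# Keep only the text before the first '###'.
prefix, sep, rest = peft_model_output.partition('###')
print(prefix)  # -> "Amanda baked cookies and will bring some to Jerry tomorrow.\n"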

the output here:

(screenshot: PEFT-model output)

evaluate the model quantitatively with the ROUGE metric

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics, and a software package, used for evaluating automatic summarization and machine translation in natural language processing.

it compares the produced/generated summaries against baseline reference summaries, which are usually written by humans.

we can use a small sample of test inputs for the evaluation.
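A minimal, self-contained example of how the rouge metric from the evaluate library scores a prediction against a reference; the sentences are made up, and the call mirrors the one used further below:

```python
import evaluate

rouge = evaluate.load('rouge')

scores = rouge.compute(
    predictions=["Amanda baked cookies and will bring Jerry some tomorrow."],
    references=["Amanda baked cookies and will bring some to Jerry tomorrow."],
    use_aggregator=True,
    use_stemmer=True,
)
# Returns a dict of F-measures with keys 'rouge1', 'rouge2', 'rougeL', 'rougeLsum'.
print(scores)
```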

load the original model:


original_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map='auto',
    quantization_config=bnb_config,  # quantization config defined earlier
    trust_remote_code=True,
    use_auth_token=True,
)
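base_model_id and bnb_config come from the earlier training sections. For context, a typical 4-bit QLoRA quantization config looks roughly like the sketch below; this is an assumption for illustration, and the exact values used earlier may differ:

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load the base model in 4-bit precision
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization, as used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype for the dequantized weights
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)
```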

compare the results:

import pandas as pd

# fetch 10 dialogues and their summaries from dataset['test'] to generate against and compare.
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
peft_model_summaries = []

# generate a summary for each dialogue with both models.
for idx, dialogue in enumerate(dialogues):
    human_baseline_text_output = human_baseline_summaries[idx]
    # create the prompt
    prompt = f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"

    # original model's result
    original_model_res = gen(original_model, prompt, 100, tokenizer)
    original_model_text_output = original_model_res[0].split('Output:\n')[1]

    # peft model's result
    peft_model_res = gen(ft_model, prompt, 100, tokenizer)
    peft_model_output = peft_model_res[0].split('Output:\n')[1]
    print(peft_model_output)
    # clip everything after the '###' end marker
    peft_model_text_output, success, result = peft_model_output.partition('###')

    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

# create a dataframe for comparison.
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns=['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])
df
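By default pandas truncates long text cells, so it can help to widen the column display before inspecting the summaries. This is optional and uses a standard pandas setting:

```python
import pandas as pd

# Show the full text of each summary instead of truncating it with '...'.
pd.set_option('display.max_colwidth', None)
df
```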

import evaluate
import numpy as np

# use ROUGE to compare the summaries of the original model and the PEFT model against the human baseline.

rouge = evaluate.load('rouge')

# evaluate the original model.
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)
# evaluate the peft model.
peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

# output the ROUGE scores.
print('ORIGINAL MODEL:')
print(original_model_results)
print('PEFT MODEL:')
print(peft_model_results)

print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")
# compute the improvement of the peft model over the original model.
improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

so we can see a significant improvement of the PEFT model over the original model, expressed here as the absolute percentage gain on each ROUGE metric.

files download:

- notebook: ipynb-kaggle and ipynb-local
- environment: pyproject.toml

learn from:

https://dassum.medium.com/fine-tune-large-language-model-llm-on-a-custom-dataset-with-qlora-fb60abdeba07