evaluate the model qualitatively

we need to evaluate the model's quality, and sometimes make further adjustments to the model's architecture, hyperparameters, or datasets based on the results.

we perform inference on the same input as before, this time with the PEFT model.
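the gen() helper used in the code below was defined in the training part of this article. As a rough reminder, a minimal sketch could look like the following; the signature gen(model, prompt, max_new_tokens, tokenizer) is assumed from the calls here, and the original helper may differ in details:

```python
import torch

def gen(model, prompt, max_new_tokens, tokenizer):
    # Tokenize the prompt and move it to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Generate up to max_new_tokens new tokens (greedy decoding).
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Return decoded strings (prompt + completion), matching how the result is indexed below.
    return [tokenizer.decode(ids, skip_special_tokens=True) for ids in output_ids]
```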

%%time
from transformers import set_seed
set_seed(seed)

index = 5
dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"

# generate with the PEFT model (up to 100 new tokens) and keep only the text after 'Output:\n'
peft_model_res = gen(ft_model, prompt, 100, tokenizer)
peft_model_output = peft_model_res[0].split('Output:\n')[1]
#print(peft_model_output)
# the fine-tuned model may keep generating after the summary; clip at the '###' end marker
prefix, success, result = peft_model_output.partition('###')

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'PEFT MODEL:\n{prefix}')

diff between evaluate and train:

| diffs | train | eval | reason |
| --- | --- | --- | --- |
| model | origin_model | ft_model | to show the difference after PEFT |
| output | raw text output | clipped with partition() | the PEFT model may emit extra tokens after the summary, so the output should be cleaned |
| gen() | pass tokenizer | none | packaged in PEFT |
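As the table notes, partition() is used to clip everything the model emits after the '###' end marker. A quick illustration with a made-up output string (str.partition is a Python builtin that splits on the first occurrence of the separator):

```python
# Hypothetical raw output from the PEFT model.
peft_model_output = "Amanda baked cookies and will bring some to Jerry tomorrow.\n### Instruction: ..."

# Keep only the text before the first '###'.
prefix, sep, rest = peft_model_output.partition('###')
print(prefix)  # -> "Amanda baked cookies and will bring some to Jerry tomorrow.\n"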

the output here:

(screenshot: PEFT-model output)

evaluate the model quantitatively with the ROUGE metric

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics, and a software package, used for evaluating automatic summarization and machine translation in natural language processing.

it compares the produced/generated summaries against baseline reference summaries, which are usually written by humans.

we can use a small sample of test inputs for the evaluation.
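A minimal, self-contained example of how the rouge metric from the evaluate library scores a prediction against a reference; the sentences are made up, and the call mirrors the one used further below:

```python
import evaluate

rouge = evaluate.load('rouge')

scores = rouge.compute(
    predictions=["Amanda baked cookies and will bring Jerry some tomorrow."],
    references=["Amanda baked cookies and will bring some to Jerry tomorrow."],
    use_aggregator=True,
    use_stemmer=True,
)
# Returns a dict of F-measures with keys 'rouge1', 'rouge2', 'rougeL', 'rougeLsum'.
print(scores)
```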

load the original model:


original_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map='auto',
    quantization_config=bnb_config,  # quantization config defined earlier
    trust_remote_code=True,
    use_auth_token=True,
)
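base_model_id and bnb_config come from the earlier training sections. For context, a typical 4-bit QLoRA quantization config looks roughly like the sketch below; this is an assumption for illustration, and the exact values used earlier may differ:

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load the base model in 4-bit precision
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization, as used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype for the dequantized weights
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)
```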

compare the results:

import pandas as pd

# fetch 10 dialogues and their summaries from dataset['test'] to generate against and compare.
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
peft_model_summaries = []

# generate a summary for each dialogue with both models.
for idx, dialogue in enumerate(dialogues):
    human_baseline_text_output = human_baseline_summaries[idx]
    # create the prompt
    prompt = f"Instruct: Summarize the following conversation.\n{dialogue}\nOutput:\n"

    # original model's result
    original_model_res = gen(original_model, prompt, 100, tokenizer)
    original_model_text_output = original_model_res[0].split('Output:\n')[1]

    # peft model's result
    peft_model_res = gen(ft_model, prompt, 100, tokenizer)
    peft_model_output = peft_model_res[0].split('Output:\n')[1]
    print(peft_model_output)
    # clip everything after the '###' end marker
    peft_model_text_output, success, result = peft_model_output.partition('###')

    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

# create a dataframe for comparison.
zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns=['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])
df
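By default pandas truncates long text cells, so it can help to widen the column display before inspecting the summaries. This is optional and uses a standard pandas setting:

```python
import pandas as pd

# Show the full text of each summary instead of truncating it with '...'.
pd.set_option('display.max_colwidth', None)
df
```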

import evaluate
import numpy as np

# use ROUGE to compare the summaries of the original model and the PEFT model against the human baseline.

rouge = evaluate.load('rouge')

# evaluate the original model.
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)
# evaluate the peft model.
peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

# output the ROUGE scores.
print('ORIGINAL MODEL:')
print(original_model_results)
print('PEFT MODEL:')
print(peft_model_results)

print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")
# compute the improvement of the peft model over the original model.
improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

so we can see a significant improvement of the PEFT model over the original model, expressed here as the absolute percentage gain on each ROUGE metric.

files download:

- notebook: ipynb-kaggle and ipynb-local
- environment: pyproject.toml

learn from:

https://dassum.medium.com/fine-tune-large-language-model-llm-on-a-custom-dataset-with-qlora-fb60abdeba07