dataset pre-processing

we should format the prompt in a way that the model can recognize. we should refer to the Hugging Face model documentation to check which format we need to transform the original dataset into.

for Phi-2, you can provide the prompt as a standalone question, for example:

Write a detailed analogy between mathematics and a lighthouse.

where the model generates the text after the ".". To encourage the model to write more concise answers, you can also try the following QA format using "Instruct: <prompt>\nOutput:"

Instruct: Write a detailed analogy between mathematics and a lighthouse.

Output: Mathematics is like a lighthouse. Just as a lighthouse guides ships safely to shore, mathematics provides a guiding light in the world of numbers and logic. It helps us navigate through complex problems and find solutions. Just as a lighthouse emits a steady beam of light, mathematics provides a consistent framework for reasoning and problem-solving. It illuminates the path to understanding and helps us make sense of the world around us.

where the model generates the text after "Output:".

we should convert the dialogue-summary pairs into explicit instructions for the LLM.

def create_prompt_formats(sample):
    """
    Format the relevant fields of the sample ('dialogue', 'summary'),
    then concatenate them using two newline characters.
    :param sample: sample dictionary
    """
    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INSTRUCTION_KEY = "### Instruct: Summarize the below conversation."
    RESPONSE_KEY = "### Output:"
    END_KEY = "### End"

    blurb = f"\n{INTRO_BLURB}"
    instruction = f"{INSTRUCTION_KEY}"
    input_context = f"{sample['dialogue']}" if sample["dialogue"] else None
    response = f"{RESPONSE_KEY}\n{sample['summary']}"
    end = f"{END_KEY}"

    parts = [part for part in [blurb, instruction, input_context, response, end] if part]

    formatted_prompt = "\n\n".join(parts)
    sample["text"] = formatted_prompt

    return sample

to format the sample: the input is a dictionary; we format the instruction, the input dialogue, and the output, and concatenate them with two "\n".

the prompt tags are:


INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
INSTRUCTION_KEY = "### Instruct: Summarize the below conversation."
RESPONSE_KEY = "### Output:"
END_KEY = "### End"

these tags help the model recognize the meaning of each part of the prompt.

to create blurb and instruction:


blurb = f"\n{INTRO_BLURB}"
instruction = f"{INSTRUCTION_KEY}"

get the input content, and build the response and end parts:


input_context = f"{sample['dialogue']}" if sample["dialogue"] else None


response = f"{RESPONSE_KEY}\n{sample['summary']}"

end = f"{END_KEY}"

and then concatenate all the parts into one string in the specified format, stored in sample["text"].
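
for a hypothetical toy sample (dialogue and summary invented for illustration), the resulting "text" field would look like this:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruct: Summarize the below conversation.

#Person1#: Hi, what time does the meeting start?
#Person2#: It starts at 3 pm in room 201.

### Output:
#Person2# tells #Person1# that the meeting starts at 3 pm in room 201.

### End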

tokenize the input prompts

we need to ensure that the input sequences will not surpass the model’s maximum token limit.


from functools import partial
from transformers import AutoTokenizer

# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def get_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(model.config, length_setting, None)
        if max_length:
            print(f"Found max length: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length


def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenize a batch of samples.
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )


# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, seed, dataset):
    """Format & tokenize the dataset so it is ready for training.
    :param tokenizer (AutoTokenizer): model tokenizer
    :param max_length (int): maximum number of tokens to emit from the tokenizer
    """

    # Add the prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats)  # , batched=True)

    # Apply preprocessing to each batch of the dataset & remove the 'id', 'topic', 'dialogue', 'summary' fields
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=['id', 'topic', 'dialogue', 'summary'],
    )

    # Filter out samples whose input_ids exceed max_length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)

    # Shuffle dataset
    dataset = dataset.shuffle(seed=seed)

    return dataset

function explanation
get_max_length() tries to read the maximum sequence length from the model config; if none is found, it falls back to 1024 as the default.
preprocess_batch() tokenizes a batch of data with max_length and truncation=True, so the number of tokens never exceeds the limit.
preprocess_dataset() is the top-level function:
dataset.map(create_prompt_formats) formats each sample and generates the "text" field.
partial() builds a tokenization function bound to the tokenizer, which encodes "text"; the map call then removes the columns that are no longer needed.
filter() drops samples whose tokenized input is too long.
shuffle() randomizes the order of the samples.

and run this code to get train_dataset and eval_dataset:

## Pre-process dataset
max_length = get_max_length(original_model)
print(max_length)

train_dataset = preprocess_dataset(tokenizer, max_length, seed, dataset['train'])
eval_dataset = preprocess_dataset(tokenizer, max_length, seed, dataset['validation'])
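
optionally, as a quick sanity check (a small sketch), you can inspect one processed sample; after preprocessing, the remaining columns are the formatted "text" plus the tokenizer outputs:

# inspect the processed dataset: row count, columns, and one formatted prompt
print(train_dataset)
print(train_dataset[0]["text"][:200])
print(len(train_dataset[0]["input_ids"]))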

preparing the model for QLoRA

from peft import prepare_model_for_kbit_training
original_model = prepare_model_for_kbit_training(original_model)

prepare_model_for_kbit_training can initialize a model for QLoRA by setting up the necessary configurations.
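
note that QLoRA assumes the base model was loaded in 4-bit. The bnb_config used later when reloading Phi-2 is presumably defined in the previous part of this series; a minimal sketch, with assumed values (not necessarily the author's original settings), could look like:

# assumed 4-bit quantization config for QLoRA (values are illustrative)
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NF4 quantization, standard for QLoRA
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype for matmuls
)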

setup PEFT for Fine-tuning

define LoRA config for Fine-tuning the base model.

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

config = LoraConfig(
    r=32,  # Rank
    lora_alpha=32,
    target_modules=[
        'q_proj',
        'k_proj',
        'v_proj',
        'dense'
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

# 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
original_model.gradient_checkpointing_enable()

peft_model = get_peft_model(original_model, config)

parameters explanation
r defines the dimension of the adapters to be trained; r is the rank of the low-rank matrices used in the adapters.
lora_alpha is the scaling factor for the learned weights. The weight update is scaled by alpha/r, so a higher alpha gives more weight to the LoRA activations.
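
conceptually, the LoRA update works like this (a toy sketch, not the PEFT implementation):

# conceptual LoRA update: W_eff = W + (lora_alpha / r) * (B @ A)
import torch

d, r, lora_alpha = 64, 32, 32        # toy dimension; r and alpha match the config above
W = torch.randn(d, d)                # frozen base weight
A = torch.randn(r, d) * 0.01         # trainable low-rank factor
B = torch.zeros(d, r)                # trainable low-rank factor, starts at zero
scaling = lora_alpha / r             # 32 / 32 = 1.0

W_eff = W + scaling * (B @ A)        # only A and B (2*d*r params) are trained, not W (d*d)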

after the PEFT model is prepared, we can use the following helper



def print_number_of_trainable_model_parameters(model):
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable_params}")
    print(f"All parameters: {total_params}")
    print(f"Percentage of trainable parameters: {100 * trainable_params / total_params:.2f}%")


print_number_of_trainable_model_parameters(peft_model)

to see how many parameters are trainable.
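
peft also ships a built-in helper that reports the same numbers:

# equivalent built-in helper from the peft library
peft_model.print_trainable_parameters()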

train PEFT Adapter

Define training arguments and create Trainer instance:


import time
import transformers
from transformers import TrainingArguments

output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    warmup_steps=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=1000,
    learning_rate=2e-4,
    optim="paged_adamw_8bit",
    logging_steps=25,
    logging_dir="./logs",
    save_strategy="steps",
    save_steps=25,
    eval_strategy="steps",
    eval_steps=25,
    do_eval=True,
    gradient_checkpointing=True,
    report_to="none",
    overwrite_output_dir=True,
    group_by_length=True,
)

peft_model.config.use_cache = False

peft_trainer = transformers.Trainer(
    model=peft_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=peft_training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
key words explanation
paged_adamw_8bit saves GPU memory by using an 8-bit paged AdamW optimizer.
gradient_accumulation_steps = 4 accumulates gradients over 4 steps (an effective batch size of 4 with a per-device batch size of 1), which makes training feasible on a personal laptop.
max_steps = 1000 limits the number of training steps.
logging_steps, save_steps, eval_steps make it convenient to monitor training.
group_by_length = True groups samples of similar length to improve training efficiency.
gradient_checkpointing = True saves GPU memory by recomputing activations during the backward pass.
peft_model.config.use_cache = False disables the key/value cache, which is not compatible with gradient checkpointing during training.

about 'labels': unsupervised tasks like causal language modeling do not need a separate label column; a feature can simply be copied as the label (here the data collator copies input_ids). Supervised tasks with a distinct target need an explicit label for every sample in the dataset.
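
a tiny sketch of what DataCollatorForLanguageModeling does with mlm=False (it copies input_ids into labels so the model learns to predict each next token; it assumes tokenizer.pad_token was set earlier, e.g. to the eos token):

# demonstrate the collator: labels are a copy of input_ids for causal LM
collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
batch = collator([{"input_ids": [10, 11, 12, 13]}])
print(batch["labels"])  # same ids as input_ids; padded positions (if any) would become -100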

train model

peft_trainer.train()

the most exciting part (and the most torturous part if it is not your first time).


resume from checkpoint

if the training run is interrupted, losing all progress is painful, so we should set up a mechanism to resume from the latest checkpoint and avoid that situation.

import os
import time
import transformers
from transformers import TrainingArguments, Trainer

# directory where checkpoints are saved
# note: to actually resume a previous run, reuse that run's directory instead of creating a new timestamped one
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

# before training, check whether a checkpoint already exists
def find_latest_checkpoint(output_dir):
    if not os.path.exists(output_dir):
        return None
    checkpoints = [d for d in os.listdir(output_dir) if d.startswith("checkpoint-")]
    if not checkpoints:
        return None
    latest_checkpoint = max(checkpoints, key=lambda x: int(x.split("-")[1]))
    return os.path.join(output_dir, latest_checkpoint)

resume_from_checkpoint = find_latest_checkpoint(output_dir)

# training arguments, configured to support resuming
peft_training_args = TrainingArguments(
    output_dir=output_dir,
    warmup_steps=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    max_steps=1000,
    learning_rate=2e-4,
    optim="paged_adamw_8bit",
    logging_steps=25,
    logging_dir="./logs",
    save_strategy="steps",  # save checkpoints by steps
    save_steps=25,          # save a checkpoint every 25 steps
    eval_strategy="steps",
    eval_steps=25,
    do_eval=True,
    gradient_checkpointing=True,
    report_to="none",
    overwrite_output_dir=True,  # allow overwriting the output directory
    group_by_length=True,
    load_best_model_at_end=False,  # set to True if you want the best checkpoint loaded after training
    seed=42,  # fixed random seed
)

# disable the cache (avoids conflicts with gradient checkpointing)
peft_model.config.use_cache = False

# trainer initialization, supports resuming
peft_trainer = Trainer(
    model=peft_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=peft_training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# resume training (from the latest checkpoint if one was found)
peft_trainer.train(resume_from_checkpoint=resume_from_checkpoint)



inference of PEFT model

after training, we can use the model for inference.

we should attach the trained adapter to the original Phi-2 model to get the new proficiency in the specified domain.

load original model:


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model_id = "microsoft/phi-2"
base_model = AutoModelForCausalLM.from_pretrained(base_model_id,
                                                  device_map='auto',
                                                  quantization_config=bnb_config,
                                                  trust_remote_code=True,
                                                  use_auth_token=True)

configure eval_tokenizer:

eval_tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True, trust_remote_code=True, use_fast=False)
eval_tokenizer.pad_token = eval_tokenizer.eos_token


from peft import PeftModel

ft_model = PeftModel.from_pretrained(base_model,
                                     "/kaggle/working/peft-dialogue-summary-training-1705417060/checkpoint-1000",  # PEFT adapter path
                                     torch_dtype=torch.float16,  # model precision
                                     is_trainable=False)  # inference only
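
a minimal inference sketch (the dialogue string here is a hypothetical example; the prompt mirrors the training format):

# run the fine-tuned model on a toy dialogue
dialogue = "#Person1#: Hi, what time does the meeting start?\n#Person2#: It starts at 3 pm in room 201."
prompt = f"### Instruct: Summarize the below conversation.\n\n{dialogue}\n\n### Output:\n"

inputs = eval_tokenizer(prompt, return_tensors="pt").to(ft_model.device)
ft_model.eval()
with torch.no_grad():
    output_ids = ft_model.generate(**inputs, max_new_tokens=100)
print(eval_tokenizer.decode(output_ids[0], skip_special_tokens=True))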

the next blog in this series is here.
learned from:

https://dassum.medium.com/fine-tune-large-language-model-llm-on-a-custom-dataset-with-qlora-fb60abdeba07