We should format the prompt in a way the model can recognize. Refer to the Hugging Face model documentation to check the format the original dataset should be transformed into.
For Phi-2, it should be:
```
Write a detailed analogy between mathematics and a lighthouse.
```
where the model generates the text after the ".". To encourage the model to write more concise answers, you can also try the following QA format using "Instruct: <prompt>\nOutput:":
```
Instruct: Write a detailed analogy between mathematics and a lighthouse.
Output: Mathematics is like a lighthouse. Just as a lighthouse guides ships safely to shore, mathematics provides a guiding light in the world of numbers and logic. It helps us navigate through complex problems and find solutions. Just as a lighthouse emits a steady beam of light, mathematics provides a consistent framework for reasoning and problem-solving. It illuminates the path to understanding and helps us make sense of the world around us.
```
where the model generates the text after "Output:".
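As a quick sanity check before fine-tuning, here is a minimal sketch of loading Phi-2 from the Hugging Face Hub and prompting it in this format (the generation parameters are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.float32)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

prompt = "Instruct: Write a detailed analogy between mathematics and a lighthouse.\nOutput:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```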
We should convert the dialogue-summary / prompt-response pairs into explicit instructions for the LLM:
```python
def create_prompt_formats(sample):
    """
    Format various fields of the sample ('instruction', 'output'),
    then concatenate them using two newline characters.
    :param sample: Sample dictionary
    """
    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INSTRUCTION_KEY = "### Instruct: Summarize the below conversation."
    RESPONSE_KEY = "### Output:"
    END_KEY = "### End"

    # Build each section, skip any that are empty, and join with blank lines
    blurb = f"{INTRO_BLURB}"
    instruction = f"{INSTRUCTION_KEY}"
    input_context = f"{sample['dialogue']}" if sample["dialogue"] else None
    response = f"{RESPONSE_KEY}\n{sample['summary']}"
    end = f"{END_KEY}"

    parts = [part for part in [blurb, instruction, input_context, response, end] if part]
    sample["text"] = "\n\n".join(parts)
    return sample
```
To format a sample: the input is a dictionary; we format the instruction and output fields and concatenate them with two `\n` characters.
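For example, applying it to a made-up sample (the dialogue and summary here are invented for illustration):

```python
sample = {
    "dialogue": "#Person1#: Did you finish the report?\n#Person2#: Yes, I sent it this morning.",
    "summary": "#Person2# tells #Person1# the report was sent in the morning.",
}
print(create_prompt_formats(sample)["text"])
```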
The prompt tags are:
```python
INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
INSTRUCTION_KEY = "### Instruct: Summarize the below conversation."
RESPONSE_KEY = "### Output:"
END_KEY = "### End"
```
These tags help the model recognize the meaning of each section of the prompt.
```python
# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
from functools import partial
from transformers import AutoTokenizer

def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, seed, dataset):
    """Format & tokenize the dataset so it is ready for training
    :param tokenizer (AutoTokenizer): Model tokenizer
    :param max_length (int): Maximum number of tokens to emit from the tokenizer
    """
    # Add the formatted prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats)  # , batched=True)

    # Apply preprocessing to each batch of the dataset
    # and remove the raw 'id', 'topic', 'dialogue', 'summary' fields
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=['id', 'topic', 'dialogue', 'summary'],
    )

    # Filter out samples whose input_ids exceed max_length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)

    # Shuffle with the given seed and return the processed dataset
    dataset = dataset.shuffle(seed=seed)
    return dataset
```
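`preprocess_batch` is referenced above but not shown; in the Dolly source it is roughly a thin wrapper around the tokenizer, applied to the `text` field produced by `create_prompt_formats`:

```python
def preprocess_batch(batch, tokenizer, max_length):
    # Tokenize the formatted prompt, truncating to max_length tokens
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )
```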
Define the dimensions of the adapters to be trained. `r` is the rank of the low-rank matrices used in the adapters.
`lora_alpha`
The scaling factor for the learned weights. The weight update matrix is scaled by `alpha/r`, so the higher the alpha, the more weight is given to the LoRA activations.
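As a concrete sketch of these settings, here is a hypothetical `LoraConfig`, assuming Phi-2's attention projection module names (`q_proj`, `k_proj`, `v_proj`, `dense`) and illustrative values for `r` and `lora_alpha`:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,                   # rank of the low-rank update matrices
    lora_alpha=32,          # updates are scaled by lora_alpha / r
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # assumed Phi-2 projections
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)  # wrap the base model loaded earlier
```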
After the PEFT model is prepared, we can use the following to check how many of its parameters are actually trainable:
```python
def print_number_of_trainable_model_parameters(model):
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable_params}")
    print(f"All parameters: {total_params}")
    print(f"Percentage of trainable parameters: {100 * trainable_params / total_params:.2f}%")
```
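Calling it on the PEFT-wrapped model shows how small the trainable fraction is:

```python
print_number_of_trainable_model_parameters(peft_model)
```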
About `labels`: unsupervised tasks do not need an added label column, or can simply copy an existing feature as the label (as in language modeling); supervised tasks need an explicit label for every sample in the dataset.
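As a sketch of the unsupervised case, assuming the tokenized dataset from above, the labels can simply mirror `input_ids` (the `Trainer` shifts them internally when computing the causal LM loss):

```python
# Copy input_ids as labels for causal language modeling
dataset = dataset.map(lambda sample: {"labels": sample["input_ids"]})
```

Alternatively, `transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)` produces the labels on the fly at batching time.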
Train the model:

```python
peft_trainer.train()
```
The most exciting part (and the most torturous part, if this is not your first time).
Resume from checkpoint
If training is interrupted, losing all progress is painful, so we should build a mechanism to resume from the latest checkpoint.
```python
import os
import time
import transformers
from transformers import TrainingArguments, Trainer

# Create a directory to save checkpoints
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

# Before training, check whether a checkpoint already exists
def find_latest_checkpoint(output_dir):
    if not os.path.exists(output_dir):
        return None
    checkpoints = [d for d in os.listdir(output_dir) if d.startswith("checkpoint-")]
    if not checkpoints:
        return None
    # Checkpoint directories are named "checkpoint-<step>"; pick the highest step
    latest_checkpoint = max(checkpoints, key=lambda x: int(x.split("-")[1]))
    return os.path.join(output_dir, latest_checkpoint)
```
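With this helper in place, we can pass the latest checkpoint (if any) to `Trainer.train`, which accepts a `resume_from_checkpoint` argument. A minimal wiring, reusing the `peft_trainer` from above:

```python
latest_checkpoint = find_latest_checkpoint(output_dir)
if latest_checkpoint is not None:
    print(f"Resuming training from {latest_checkpoint}")
    peft_trainer.train(resume_from_checkpoint=latest_checkpoint)
else:
    print("No checkpoint found, training from scratch")
    peft_trainer.train()
```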