[NLP] Fine-tuning a Model Loaded from Hugging Face and Uploading It to the Hub (SQuAD v1.1 Dataset)
2023. 2. 7. 16:18
Loading the dataset (SQuAD v1.1)
In [ ]:
from datasets import load_dataset
raw_datasets = load_dataset("squad")
Found cached dataset squad
Inspecting the dataset
In [ ]:
raw_datasets
Out[ ]:
DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})
In [ ]:
raw_datasets["train"].filter(lambda x: len(x["answers"]["text"]) != 1)
Loading cached processed dataset
Out[ ]:
Dataset({ features: ['id', 'title', 'context', 'question', 'answers'], num_rows: 0 })
In [ ]:
{'text': ['Santa Clara, California', "Levi's Stadium", "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."], 'answer_start': [403, 355, 355]}
Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.
Where did Super Bowl 50 take place?
Loading the pre-trained model
- model_checkpoint = <<model_name>>
In [ ]:
from transformers import AutoTokenizer
model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
Applying the tokenizer and checking the result
In [ ]:
context = raw_datasets["train"][0]["context"]
question = raw_datasets["train"][0]["question"]
inputs = tokenizer(question, context)
tokenizer.decode(inputs["input_ids"])
Out[ ]:
'[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive ( and in a direct line that connects through 3 statues and the Gold Dome ), is a simple, modern stone statue of Mary. [SEP]'
In [ ]:
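# The arguments were cut off in the export. max_length=100 and stride=50 are assumed values
# that are consistent with the four overlapping chunks printed below.
inputs = tokenizer(
    question, context, max_length=100, truncation="only_second",
    stride=50, return_overflowing_tokens=True,
)
for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))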
[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basi [SEP]
[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin [SEP]
[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive ( and in a direct line that connects through 3 [SEP]
[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP]. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive ( and in a direct line that connects through 3 statues and the Gold Dome ), is a simple, modern stone statue of Mary. [SEP]
In [ ]:
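# The arguments were cut off in the export; max_length=100 and stride=50 are assumed values.
inputs = tokenizer(
    question, context, max_length=100, truncation="only_second",
    stride=50, return_overflowing_tokens=True, return_offsets_mapping=True,
)
inputs.keys()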
Out[ ]:
dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])
In [ ]:
inputs["overflow_to_sample_mapping"]
Out[ ]:
[0, 0, 0, 0]
In [ ]:
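# The call was cut off in the export. The slice [2:6] matches the answers cell that follows;
# max_length=100 and stride=50 are assumed values.
inputs = tokenizer(
    raw_datasets["train"][2:6]["question"], raw_datasets["train"][2:6]["context"],
    max_length=100, truncation="only_second", stride=50,
    return_overflowing_tokens=True, return_offsets_mapping=True,
)
print(f"The 4 examples give {len(inputs['input_ids'])} features.")
print(f"Here is where each comes from: {inputs['overflow_to_sample_mapping']}.")
The 4 examples give 19 features.
Here is where each comes from: [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3].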
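The output above shows one long context being split into several overlapping features. The following is a minimal sketch of that sliding-window idea in plain Python, not the tokenizer's actual implementation. The function name `sliding_windows` is hypothetical, and the sketch ignores the question and special tokens that the real tokenizer also counts toward `max_length`. As in the tokenizer, `stride` is taken to mean the number of tokens shared by consecutive windows.

```python
def sliding_windows(tokens, max_length, stride):
    """Split tokens into windows of at most max_length that overlap by stride tokens."""
    if stride >= max_length:
        raise ValueError("stride must be smaller than max_length")
    step = max_length - stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break  # the last window already reaches the end of the sequence
        start += step
    return windows


context_tokens = list(range(10))  # stand-in for the token ids of one context
windows = sliding_windows(context_tokens, max_length=4, stride=2)
# Every window comes from the same example, which is what
# overflow_to_sample_mapping records for real tokenizer output.
sample_mapping = [0] * len(windows)
```

Each window repeats the last `stride` tokens of the previous one, so an answer that straddles a window boundary can still appear whole in at least one feature.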
In [ ]:
answers = raw_datasets["train"][2:6]["answers"]
start_positions = []
end_positions = []
for i, offset in enumerate(inputs["offset_mapping"]):
    sample_idx = inputs["overflow_to_sample_mapping"][i]
    answer = answers[sample_idx]
    start_char = answer["answer_start"][0]
    end_char = answer["answer_start"][0] + len(answer["text"][0])
    sequence_ids = inputs.sequence_ids(i)
    # Find the start and the end of the context
    idx = 0
    while sequence_ids[idx] != 1:
        idx += 1
    context_start = idx
    while sequence_ids[idx] == 1:
        idx += 1
    context_end = idx - 1
    # If the answer is not fully inside the context, the label is (0, 0)
    if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
        start_positions.append(0)
        end_positions.append(0)
    else:
        # Otherwise, the labels are the start and end token indexes of the answer
        idx = context_start
        while idx <= context_end and offset[idx][0] <= start_char:
            idx += 1
        start_positions.append(idx - 1)
        idx = context_end
        while idx >= context_start and offset[idx][1] >= end_char:
            idx -= 1
        end_positions.append(idx + 1)
start_positions, end_positions
Out[ ]:
([83, 51, 19, 0, 0, 64, 27, 0, 34, 0, 0, 0, 67, 34, 0, 0, 0, 0, 0], [85, 53, 21, 0, 0, 70, 33, 0, 40, 0, 0, 0, 68, 35, 0, 0, 0, 0, 0])
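The labeling loop above converts a character span into token positions by walking the offset mapping. The sketch below isolates that step on a tiny hand-made offset list so the logic can be checked without a tokenizer. The function name `char_span_to_token_span` and the toy offsets are assumptions made for illustration only.

```python
def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Return (start_token, end_token) for a character span, or (0, 0) if it is cut off."""
    # The answer must lie completely inside this feature's context window
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0
    # Move right until a token starts after the answer start, then step back one
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1
    # Move left until a token ends before the answer end, then step forward one
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


# Toy feature: [CLS], two question tokens, [SEP], four context tokens, [SEP]
toy_offsets = [(0, 0), (0, 4), (5, 7), (0, 0), (0, 3), (4, 8), (9, 14), (15, 20), (0, 0)]
inside = char_span_to_token_span(toy_offsets, 4, 7, start_char=4, end_char=14)
cut_off = char_span_to_token_span(toy_offsets, 4, 7, start_char=18, end_char=25)
```

Here the span covering characters 4 to 14 maps to context tokens 5 and 6, while the span that runs past character 20 is not fully contained and gets the (0, 0) label.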
In [ ]:
idx = 0
sample_idx = inputs["overflow_to_sample_mapping"][idx]
answer = answers[sample_idx]["text"][0]
start = start_positions[idx]
end = end_positions[idx]
labeled_answer = tokenizer.decode(inputs["input_ids"][idx][start : end + 1])
print(f"Theoretical answer: {answer}, labels give: {labeled_answer}")
Theoretical answer: the Main Building, labels give: the Main Building
In [ ]:
idx = 4
sample_idx = inputs["overflow_to_sample_mapping"][idx]
answer = answers[sample_idx]["text"][0]
decoded_example = tokenizer.decode(inputs["input_ids"][idx])
print(f"Theoretical answer: {answer}, decoded example: {decoded_example}")
Theoretical answer: a Marian place of prayer and reflection, decoded example: [CLS] What is the Grotto at Notre Dame? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grot [SEP]
- Define two constants that set the maximum length and the stride of the sliding window
- Add a little cleanup before tokenizing
Preprocessing the train dataset for model training
In [ ]:
max_length = 384
stride = 128
def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    # The tokenizer arguments were cut off in the export. They are reconstructed from
    # max_length, stride, and the keys popped below; padding="max_length" is an assumption.
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []
    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)
        # Find the start and the end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1
        # If the answer is not fully inside the context, the label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)
            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
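Building the preprocessed dataset for training
In [ ]:
# The map() arguments were cut off in the export. batched=True and remove_columns are needed
# because the function returns more rows than it receives.
train_dataset = raw_datasets["train"].map(
    preprocess_training_examples,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)
len(raw_datasets["train"]), len(train_dataset)
Loading cached processed dataset
Out[ ]:
(87599, 88729)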
Preprocessing the validation dataset for model evaluation
In [ ]:
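def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    # The tokenizer arguments were cut off in the export; padding="max_length" is an assumption.
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []
    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        # Remember which example each feature came from
        example_ids.append(examples["id"][sample_idx])
        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        # Keep offsets only for context tokens; all other positions become None
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]
    inputs["example_id"] = example_ids
    return inputs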
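Building the preprocessed dataset for evaluation
In [ ]:
# The map() arguments were cut off in the export. batched=True and remove_columns are needed
# because the function returns more rows than it receives.
validation_dataset = raw_datasets["validation"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=raw_datasets["validation"].column_names,
)
len(raw_datasets["validation"]), len(validation_dataset)
Loading cached processed dataset
Out[ ]:
(10570, 10822)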
Fine-tuning the model
In [ ]:
from torch.utils.data import DataLoader
from transformers import default_data_collator
validation_set = validation_dataset.remove_columns(["example_id", "offset_mapping"])
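# The train loader's arguments were cut off in the export. They are assumed to mirror
# the eval loader, with shuffling added.
train_dataloader = DataLoader(
    train_dataset, shuffle=True, collate_fn=default_data_collator, batch_size=8
)
eval_dataloader = DataLoader(
    validation_set, collate_fn=default_data_collator, batch_size=8
)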
In [ ]:
from transformers import AutoModelForQuestionAnswering
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight'] - This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
In [ ]:
# Load the optimizer
from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=2e-5)
In [ ]:
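from accelerate import Accelerator
# The positional True is Accelerator's first parameter, device_placement, which already defaults to True.
accelerator = Accelerator(True)
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)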
In [ ]:
from transformers import get_scheduler
num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch
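# The arguments were cut off in the export; a linear schedule without warmup is assumed here.
lr_scheduler = get_scheduler(
    "linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)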
Evaluating the model
In [ ]:
from datasets import load_metric
metric = load_metric("squad")
/tmp/ipykernel_21228/214284904.py:3: FutureWarning: load_metric is deprecated and will be removed in the next major version of datasets. Use 'evaluate.load' instead, from the new library 🤗 Evaluate: https://huggingface.co/docs/evaluate
metric = load_metric("squad")
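The "squad" metric loaded above reports exact match and F1. As a rough illustration of what those two scores measure, here is a simplified sketch in plain Python. It follows the normalization steps of the SQuAD v1.1 evaluation (lowercasing, stripping punctuation and articles, collapsing whitespace), but it is not the library's code. The function names are chosen for this example, and the scores are returned as fractions rather than percentages.

```python
import collections
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction, truth):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)
```

When an example has several reference answers, as in the validation example shown earlier, the evaluation takes the best score over the references.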
In [ ]:
n_best = 20
max_answer_length = 30
In [ ]:
from tqdm.auto import tqdm
import collections
import numpy as np
def compute_metrics(start_logits, end_logits, features, examples):
    # Map each example id to the indexes of the features created from it
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)
    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []
        # Loop over all features associated with this example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]
            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully inside the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a negative length or longer than max_answer_length
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue
                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)
        # Keep the answer with the best score, or an empty string if none is valid
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})
    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    return metric.compute(predictions=predicted_answers, references=theoretical_answers)
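The core of compute_metrics is the search over the n_best start and end positions. The sketch below isolates that search for a single feature so it can be tried on small arrays. The function name `best_span` is hypothetical, and the sketch leaves out the offset check that the full function uses to discard positions outside the context.

```python
import numpy as np


def best_span(start_logit, end_logit, n_best=20, max_answer_length=30):
    """Return (start_index, end_index, score) of the best valid span, or None."""
    # Indexes of the n_best highest logits, highest first
    start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
    end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
    best = None
    for start_index in start_indexes:
        for end_index in end_indexes:
            # Skip spans with negative length or longer than max_answer_length
            if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                continue
            score = float(start_logit[start_index] + end_logit[end_index])
            if best is None or score > best[2]:
                best = (start_index, end_index, score)
    return best


toy_start = np.array([0.1, 2.0, 0.3, 0.2])
toy_end = np.array([0.0, 0.5, 3.0, 0.1])
result = best_span(toy_start, toy_end, n_best=2, max_answer_length=3)
```

Restricting the search to the n_best candidates on each side keeps the number of pairs at n_best squared per feature instead of the square of the sequence length.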
Logging in to Hugging Face
- The code in the cell below is written for Jupyter Notebook.
- To log in from a terminal instead, run 'huggingface-cli login' and enter your token.
In [ ]:
from huggingface_hub import notebook_login
notebook_login()
Creating and deleting a repository
- create_repo(REPO_NAME): creates a repository
- delete_repo(REPO_NAME): deletes a repository
In [ ]:
from huggingface_hub import create_repo, delete_repo
create_repo("huggingface-tutorial")  # Create the repository
# delete_repo("huggingface-tutorial")  # Delete the repository
Out[ ]:
RepoUrl('https://huggingface.co/sillon/huggingface-tutorial', endpoint='https://huggingface.co', repo_type='model', repo_id='sillon/huggingface-tutorial')
In [ ]:
from huggingface_hub import Repository, get_full_repo_name
model_name = "huggingface-tutorial"
repo_name = get_full_repo_name(model_name)
Out[ ]:
In [ ]:
output_dir = "huggingface-tutorial"
repo = Repository(output_dir, clone_from=repo_name)
Cloning https://huggingface.co/sillon/huggingface-tutorial into local empty directory.
In [ ]:
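from tqdm.auto import tqdm
import torch
import numpy as np
progress_bar = tqdm(range(num_training_steps))
for epoch in range(num_train_epochs):
    # Training
    # The optimization steps were lost in the export; this is the usual Accelerate pattern.
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
    # Evaluation
    model.eval()
    start_logits = []
    end_logits = []
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)
        # Collecting the logits was lost in the export; np.concatenate below requires it.
        start_logits.append(accelerator.gather(outputs.start_logits).cpu().numpy())
        end_logits.append(accelerator.gather(outputs.end_logits).cpu().numpy())
    start_logits = np.concatenate(start_logits)
    end_logits = np.concatenate(end_logits)
    start_logits = start_logits[: len(validation_dataset)]
    end_logits = end_logits[: len(validation_dataset)]
    metrics = compute_metrics(
        start_logits, end_logits, validation_dataset, raw_datasets["validation"]
    )
    print(f"epoch {epoch}:", metrics)
    # Save and upload
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        repo.push_to_hub(commit_message=f"Training in progress epoch {epoch}", blocking=False)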
In [ ]:
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
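Once the model has been pushed, you can see it in your Hugging Face repository.
With the model uploaded to your account, you can load it from other code through the Hugging Face API
and use it right away.
The model address is written as [account name]/[model name].
Example 1. sillon/huggingface-tutorial
Example 2. sillon/linux_test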
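As a rough sketch of loading the uploaded model from other code, the snippet below wraps the transformers `pipeline` helper in a function. The function name `load_qa_pipeline` is hypothetical. Two assumptions apply: the repository from Example 1 is still available, and the tokenizer is taken from the base checkpoint "bert-base-cased", because the notebook above saves only the model to the repository.

```python
def load_qa_pipeline(model_id="sillon/huggingface-tutorial", tokenizer_id="bert-base-cased"):
    """Build a question-answering pipeline from a model stored on the Hugging Face Hub."""
    # Imported here so that defining the function does not require transformers
    from transformers import pipeline

    # model_id uses the [account name]/[model name] form described above;
    # tokenizer_id is an assumption, since the repository may hold only model weights
    return pipeline("question-answering", model=model_id, tokenizer=tokenizer_id)


# Usage (downloads the model, so it needs network access):
# qa = load_qa_pipeline()
# qa(question="Where did Super Bowl 50 take place?", context="...")
```

The call returns a dictionary whose "answer" field holds the predicted text span.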