728x90

- 기존 COLAB 환경에서 TPU로 학습된 코드를 GPU 환경에서 실행되도록 하였습니다.

- 마지막에 직접 커스텀 데이터셋 (본문과 질문)을 넣으면 해당 모델을 통해 기계독해를 하도록 구현한 것을 추가하였습니다.

- 평가함수에 대한 구현은 아직 미흡합니다.

코드 구현

korquad

In [1]:

import tensorflow as tf
print(tf.__version__)

2023-01-26 11:16:14.870332: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0

2.5.0

In [25]:

# !wget https://korquad.github.io/dataset/KorQuAD_v1.0_train.json -O KorQuAD_v1.0_train.json
# !wget https://korquad.github.io/dataset/KorQuAD_v1.0_dev.json -O KorQuAD_v1.0_dev.json

In [2]:

import os
import json
import numpy as np
from tqdm import tqdm
from pathlib import Path
from transformers import BertTokenizerFast
import tensorflow as tf

In [3]:

def read_squad(path):
    path = Path(path)
    with open(path, 'rb') as f:
        squad_dict = json.load(f)

    contexts = []
    questions = []
    answers = []
    for group in squad_dict['data']:
        for passage in group['paragraphs']:
            context = passage['context']
            for qa in passage['qas']:
                question = qa['question']
                for answer in qa['answers']:
                    contexts.append(context)
                    questions.append(question)
                    answers.append(answer)

    return contexts, questions, answers

train_contexts, train_questions, train_answers = read_squad('KorQuAD_v1.0_train.json')
val_contexts, val_questions, val_answers = read_squad('KorQuAD_v1.0_dev.json')

In [4]:

print('훈련 데이터의 본문 개수 :', len(train_contexts))
print('훈련 데이터의 질문 개수 :', len(train_questions))
print('훈련 데이터의 답변 개수 :', len(train_answers))
print('테스트 데이터의 본문 개수 :', len(val_contexts))
print('테스트 데이터의 질문 개수 :', len(val_questions))
print('테스트 데이터의 답변 개수 :', len(val_answers))

훈련 데이터의 본문 개수 : 60407
훈련 데이터의 질문 개수 : 60407
훈련 데이터의 답변 개수 : 60407
테스트 데이터의 본문 개수 : 5774
테스트 데이터의 질문 개수 : 5774
테스트 데이터의 답변 개수 : 5774

In [5]:

print('첫번째 샘플의 본문')
print('-----------------')
print(train_contexts[0])

첫번째 샘플의 본문
-----------------
1839년 바그너는 괴테의 파우스트을 처음 읽고 그 내용에 마음이 끌려 이를 소재로 해서 하나의 교향곡을 쓰려는 뜻을 갖는다. 이 시기 바그너는 1838년에 빛 독촉으로 산전수전을 다 걲은 상황이라 좌절과 실망에 가득했으며 메피스토펠레스를 만나는 파우스트의 심경에 공감했다고 한다. 또한 파리에서 아브네크의 지휘로 파리 음악원 관현악단이 연주하는 베토벤의 교향곡 9번을 듣고 깊은 감명을 받았는데, 이것이 이듬해 1월에 파우스트의 서곡으로 쓰여진 이 작품에 조금이라도 영향을 끼쳤으리라는 것은 의심할 여지가 없다. 여기의 라단조 조성의 경우에도 그의 전기에 적혀 있는 것처럼 단순한 정신적 피로나 실의가 반영된 것이 아니라 베토벤의 합창교향곡 조성의 영향을 받은 것을 볼 수 있다. 그렇게 교향곡 작곡을 1839년부터 40년에 걸쳐 파리에서 착수했으나 1악장을 쓴 뒤에 중단했다. 또한 작품의 완성과 동시에 그는 이 서곡(1악장)을 파리 음악원의 연주회에서 연주할 파트보까지 준비하였으나, 실제로는 이루어지지는 않았다. 결국 초연은 4년 반이 지난 후에 드레스덴에서 연주되었고 재연도 이루어졌지만, 이후에 그대로 방치되고 말았다. 그 사이에 그는 리엔치와 방황하는 네덜란드인을 완성하고 탄호이저에도 착수하는 등 분주한 시간을 보냈는데, 그런 바쁜 생활이 이 곡을 잊게 한 것이 아닌가 하는 의견도 있다.

In [6]:

print('첫번째 샘플의 질문')
print('-----------------')
print(train_questions[0])

첫번째 샘플의 질문
-----------------
바그너는 괴테의 파우스트를 읽고 무엇을 쓰고자 했는가?

In [7]:

print('첫번째 샘플의 답변')
print('-----------------')
print(train_answers[0])

첫번째 샘플의 답변
-----------------
{'text': '교향곡', 'answer_start': 54}

In [8]:

def add_end_idx(answers, contexts):
    for answer, context in zip(answers, contexts):
        answer['text'] = answer['text'].rstrip()
        gold_text = answer['text']
        start_idx = answer['answer_start']
        end_idx = start_idx + len(gold_text)

        assert context[start_idx:end_idx] == gold_text, "end_index 계산에 에러가 있습니다."
        answer['answer_end'] = end_idx

add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)

In [9]:

print('첫번째 샘플의 답변')
print('-----------------')
print(train_answers[0])

첫번째 샘플의 답변
-----------------
{'text': '교향곡', 'answer_start': 54, 'answer_end': 57}

In [10]:

tokenizer = BertTokenizerFast.from_pretrained('klue/bert-base')

train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)

In [11]:

def add_token_positions(encodings, answers):
    start_positions = []
    end_positions = []
    deleting_list = []
    

    for i in tqdm(range(len(answers))):
        start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
        end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))

        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
            deleting_list.append(i)

        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length
            if i not in deleting_list:
              deleting_list.append(i)

    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})
    return deleting_list

In [12]:

deleting_list_for_train = add_token_positions(train_encodings, train_answers)
deleting_list_for_test = add_token_positions(val_encodings, val_answers)

100%|██████████| 60407/60407 [00:00<00:00, 454631.67it/s]
100%|██████████| 5774/5774 [00:00<00:00, 470259.84it/s]

In [13]:

train_answers[0]

Out[13]:

{'text': '교향곡', 'answer_start': 54, 'answer_end': 57}

In [14]:

def delete_samples(encodings, deleting_list):
  input_ids = np.delete(np.array(encodings['input_ids']), deleting_list, axis=0)
  attention_masks = np.delete(np.array(encodings['attention_mask']), deleting_list, axis=0)
  start_positions = np.delete(np.array(encodings['start_positions']), deleting_list, axis=0)
  end_positions = np.delete(np.array(encodings['end_positions']), deleting_list, axis=0)

  X_data = [input_ids, attention_masks]
  y_data = [start_positions, end_positions]

  return X_data, y_data

In [15]:

X_train, y_train = delete_samples(train_encodings, deleting_list_for_train)
X_test, y_test = delete_samples(val_encodings, deleting_list_for_test)

MODEL¶

In [16]:

from transformers import TFBertModel
class TFBertForQuestionAnswering(tf.keras.Model):
    def __init__(self, model_name):
        super(TFBertForQuestionAnswering, self).__init__()
        self.bert = TFBertModel.from_pretrained(model_name, from_pt=True)
        self.qa_outputs = tf.keras.layers.Dense(2,
                                                kernel_initializer=tf.keras.initializers.TruncatedNormal(0.02),
                                                name='qa_outputs')
        self.softmax = tf.keras.layers.Activation(tf.keras.activations.softmax)

    def call(self, inputs):
        input_ids, attention_mask = inputs
        outputs = self.bert(input_ids, attention_mask=attention_mask)
    
        sequence_output = outputs[0]

        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = tf.split(logits, 2, axis=-1)

        # start_logits = (batch_size, sequence_length,)
        # end_logits = (batch_size, sequence_length,)
        start_logits = tf.squeeze(start_logits, axis=-1)
        end_logits = tf.squeeze(end_logits, axis=-1)

        start_probs = self.softmax(start_logits)
        end_probs = self.softmax(end_logits)

        return start_probs, end_probs

In [17]:

strategy = tf.distribute.OneDeviceStrategy(device='GPU') # GPU 사용

2023-01-26 11:16:33.193212: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2023-01-26 11:16:33.252630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:19:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.725GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2023-01-26 11:16:33.252818: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties: 
pciBusID: 0000:1a:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.725GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2023-01-26 11:16:33.252974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 2 with properties: 
pciBusID: 0000:67:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.725GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2023-01-26 11:16:33.253122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 3 with properties: 
pciBusID: 0000:68:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.725GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2023-01-26 11:16:33.253148: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2023-01-26 11:16:33.256513: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2023-01-26 11:16:33.256592: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2023-01-26 11:16:33.276605: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2023-01-26 11:16:33.276896: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2023-01-26 11:16:33.277324: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2023-01-26 11:16:33.279173: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2023-01-26 11:16:33.279338: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2023-01-26 11:16:33.280571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1, 2, 3
2023-01-26 11:16:33.281448: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-26 11:16:33.755117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:19:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.725GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2023-01-26 11:16:33.755299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties: 
pciBusID: 0000:1a:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.725GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2023-01-26 11:16:33.755444: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 2 with properties: 
pciBusID: 0000:67:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.725GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2023-01-26 11:16:33.755584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 3 with properties: 
pciBusID: 0000:68:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.725GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2023-01-26 11:16:33.756587: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1, 2, 3
2023-01-26 11:16:33.756634: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2023-01-26 11:16:34.876366: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-01-26 11:16:34.876401: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0 1 2 3 
2023-01-26 11:16:34.876407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N N N N 
2023-01-26 11:16:34.876410: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1:   N N N N 
2023-01-26 11:16:34.876412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 2:   N N N N 
2023-01-26 11:16:34.876415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 3:   N N N N 
2023-01-26 11:16:34.877860: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22312 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:19:00.0, compute capability: 8.6)
2023-01-26 11:16:34.878401: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 22312 MB memory) -> physical GPU (device: 1, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:1a:00.0, compute capability: 8.6)
2023-01-26 11:16:34.878808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 22312 MB memory) -> physical GPU (device: 2, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:67:00.0, compute capability: 8.6)
2023-01-26 11:16:34.879207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 22293 MB memory) -> physical GPU (device: 3, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:68:00.0, compute capability: 8.6)

In [18]:

with strategy.scope():
  model = TFBertForQuestionAnswering("klue/bert-base")
  optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
  loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
  model.compile(optimizer=optimizer, loss=loss)

2023-01-26 11:16:37.338100: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2023-01-26 11:16:37.970159: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2023-01-26 11:16:37.970216: I tensorflow/stream_executor/cuda/cuda_blas.cc:1838] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'bert.embeddings.position_ids', 'cls.predictions.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.

In [110]:

history = model.fit(
    X_train,
    y_train,
    epochs=3,
    verbose=1,
    batch_size=16,
)
     

2023-01-25 21:29:41.933407: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:695] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Did not find a shardable source, walked to a node which is not a dataset: name: "FlatMapDataset/_9"
op: "FlatMapDataset"
input: "PrefetchDataset/_8"
attr {
  key: "Targuments"
  value {
    list {
    }
  }
}
attr {
  key: "f"
  value {
    func {
      name: "__inference_Dataset_flat_map_slice_batch_indices_72133"
    }
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
        dim {
          size: -1
        }
      }
    }
  }
}
attr {
  key: "output_types"
  value {
    list {
      type: DT_INT64
    }
  }
}
. Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new `tf.data.Options()` object then setting `options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA` before applying the options object to the dataset via `dataset.with_options(options)`.

Epoch 1/3
WARNING:tensorflow:The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
WARNING:tensorflow:AutoGraph could not transform <bound method Socket.send of <zmq.Socket(zmq.PUSH) at 0x7f82a16fb400>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING: AutoGraph could not transform <bound method Socket.send of <zmq.Socket(zmq.PUSH) at 0x7f82a16fb400>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.
WARNING:tensorflow:From /home/suyeon/anaconda3/envs/py39/lib/python3.9/site-packages/tensorflow/python/ops/array_ops.py:5043: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_model_11/bert/pooler/dense/kernel:0', 'tf_bert_model_11/bert/pooler/dense/bias:0'] when minimizing the loss.
WARNING:tensorflow:The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
WARNING:tensorflow:The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_model_11/bert/pooler/dense/kernel:0', 'tf_bert_model_11/bert/pooler/dense/bias:0'] when minimizing the loss.
3759/3759 [==============================] - 1597s 422ms/step - loss: 1.1263 - output_1_loss: 0.5031 - output_2_loss: 0.6232
Epoch 2/3
3759/3759 [==============================] - 1588s 422ms/step - loss: 0.6203 - output_1_loss: 0.2627 - output_2_loss: 0.3576
Epoch 3/3
3759/3759 [==============================] - 1588s 422ms/step - loss: 0.4417 - output_1_loss: 0.1826 - output_2_loss: 0.2591

In [112]:

model.save_weights('./mrc_model')

In [19]:

def modeling():
    model = TFBertForQuestionAnswering("klue/bert-base")
    optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
    model.compile(optimizer=optimizer, loss=loss)
    return model
    

In [21]:

# 모델을 저장하고 다시 불러올 때 사용
model = modeling()
model.load_weights('./mrc_model')

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'bert.embeddings.position_ids', 'cls.predictions.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.

Out[21]:

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f3a006e9dc0>

In [22]:

def predict_test_data_by_idx(idx):
  context = tokenizer.decode(X_test[0][idx]).split('[SEP] ')[0]
  question = tokenizer.decode(X_test[0][idx]).split('[SEP] ')[1]
  print('본문 :', context[5:])
  print('질문 :', question)
  answer_encoded = X_test[0][idx][y_test[0][idx]:y_test[1][idx]+1]
  print('정답 :',tokenizer.decode(answer_encoded))
  output = model([tf.constant(X_test[0][idx])[None, :], tf.constant(X_test[1][idx])[None, :]])
  start = tf.math.argmax(tf.squeeze(output[0]))
  end = tf.math.argmax(tf.squeeze(output[1]))+1
  answer_encoded = X_test[0][idx][start:end]
  print('예측 :',tokenizer.decode(answer_encoded))
  print('----------------------------------------')

In [23]:

for i in range(0, 3):
  predict_test_data_by_idx(i)

본문 : 1989년 2월 15일 여의도 농민 폭력 시위를 주도한 혐의 ( 폭력행위등처벌에관한법률위반 ) 으로 지명수배되었다. 1989년 3월 12일 서울지방검찰청 공안부는 임종석의 사전구속영장을 발부받았다. 같은 해 6월 30일 평양축전에 임수경을 대표로 파견하여 국가보안법위반 혐의가 추가되었다. 경찰은 12월 18일 ~ 20일 사이 서울 경희대학교에서 임종석이 성명 발표를 추진하고 있다는 첩보를 입수했고, 12월 18일 오전 7시 40분 경 가스총과 전자봉으로 무장한 특공조 및 대공과 직원 12명 등 22명의 사복 경찰을 승용차 8대에 나누어 경희대학교에 투입했다. 1989년 12월 18일 오전 8시 15분 경 서울청량리경찰서는 호위 학생 5명과 함께 경희대학교 학생회관 건물 계단을 내려오는 임종석을 발견, 검거해 구속을 집행했다. 임종석은 청량리경찰서에서 약 1시간 동안 조사를 받은 뒤 오전 9시 50분 경 서울 장안동의 서울지방경찰청 공안분실로 인계되었다.
질문 : 임종석이 여의도 농민 폭력 시위를 주도한 혐의로 지명수배 된 날은?
정답 : 1989년 2월 15일
예측 : 1989년 2월 15일
----------------------------------------
본문 : 1989년 2월 15일 여의도 농민 폭력 시위를 주도한 혐의 ( 폭력행위등처벌에관한법률위반 ) 으로 지명수배되었다. 1989년 3월 12일 서울지방검찰청 공안부는 임종석의 사전구속영장을 발부받았다. 같은 해 6월 30일 평양축전에 임수경을 대표로 파견하여 국가보안법위반 혐의가 추가되었다. 경찰은 12월 18일 ~ 20일 사이 서울 경희대학교에서 임종석이 성명 발표를 추진하고 있다는 첩보를 입수했고, 12월 18일 오전 7시 40분 경 가스총과 전자봉으로 무장한 특공조 및 대공과 직원 12명 등 22명의 사복 경찰을 승용차 8대에 나누어 경희대학교에 투입했다. 1989년 12월 18일 오전 8시 15분 경 서울청량리경찰서는 호위 학생 5명과 함께 경희대학교 학생회관 건물 계단을 내려오는 임종석을 발견, 검거해 구속을 집행했다. 임종석은 청량리경찰서에서 약 1시간 동안 조사를 받은 뒤 오전 9시 50분 경 서울 장안동의 서울지방경찰청 공안분실로 인계되었다.
질문 : 1989년 6월 30일 평양축전에 대표로 파견 된 인물은?
정답 : 임수경
예측 : 임수경
----------------------------------------
본문 : 1989년 2월 15일 여의도 농민 폭력 시위를 주도한 혐의 ( 폭력행위등처벌에관한법률위반 ) 으로 지명수배되었다. 1989년 3월 12일 서울지방검찰청 공안부는 임종석의 사전구속영장을 발부받았다. 같은 해 6월 30일 평양축전에 임수경을 대표로 파견하여 국가보안법위반 혐의가 추가되었다. 경찰은 12월 18일 ~ 20일 사이 서울 경희대학교에서 임종석이 성명 발표를 추진하고 있다는 첩보를 입수했고, 12월 18일 오전 7시 40분 경 가스총과 전자봉으로 무장한 특공조 및 대공과 직원 12명 등 22명의 사복 경찰을 승용차 8대에 나누어 경희대학교에 투입했다. 1989년 12월 18일 오전 8시 15분 경 서울청량리경찰서는 호위 학생 5명과 함께 경희대학교 학생회관 건물 계단을 내려오는 임종석을 발견, 검거해 구속을 집행했다. 임종석은 청량리경찰서에서 약 1시간 동안 조사를 받은 뒤 오전 9시 50분 경 서울 장안동의 서울지방경찰청 공안분실로 인계되었다.
질문 : 임종석이 여의도 농민 폭력 시위를 주도한 혐의로 지명수배된 연도는?
정답 : 1989년
예측 : 1989년
----------------------------------------

In [24]:

def predict_answer(idx):
  context = tokenizer.decode(X_test[0][idx]).split('[SEP] ')[0]
  question = tokenizer.decode(X_test[0][idx]).split('[SEP] ')[1]
#   print('본문 :', context[5:])
#   print('질문 :', question)
  real_answer_encoded = X_test[0][idx][y_test[0][idx]:y_test[1][idx]+1]
  real_answer_decoded = tokenizer.decode(real_answer_encoded)
#   print('정답 :',tokenizer.decode(answer_encoded))
  output = model([tf.constant(X_test[0][idx])[None, :], tf.constant(X_test[1][idx])[None, :]])
  start = tf.math.argmax(tf.squeeze(output[0]))
  end = tf.math.argmax(tf.squeeze(output[1]))+1
  pred_answer_encoded = X_test[0][idx][start:end]
  pred_answer_decoded = tokenizer.decode(pred_answer_encoded)
  return [real_answer_decoded, pred_answer_decoded]

In [25]:

predict = []
except_idx = []
from tqdm import tqdm
for i in tqdm(range(len(val_questions))):
    try:
        predict.append(predict_answer(i))
    except:
        except_idx.append(i)

100%|██████████| 5774/5774 [06:33<00:00, 14.69it/s]

In [124]:

predict[:30]

Out[124]:

[['1989년 2월 15일', '1989년 2월 15일'],
 ['임수경', '임수경'],
 ['1989년', '1989년'],
 ['학생회관 건물 계단', '경희대학교 학생회관'],
 ['서울지방경찰청 공안분실', '서울지방경찰청 공안분실'],
 ['임종석', '임종석'],
 ['여의도 농민 폭력 시위', '여의도 농민 폭력 시위'],
 ['허영', '허영'],
 ['10차 개헌안 발표', '10차 개헌안 발표'],
 ['제89조', '제89조'],
 ['허영', '허영'],
 ['미국 육군 부참모 총장', '미국 육군 부참모 총장'],
 ['초대 국무장관직', '국무장관직'],
 ['로널드 레이건 대통령', '로널드 레이건'],
 ['알렉산더 메이그스 헤이그 2세', '알렉산더 메이그스 헤이그 2세'],
 ['미국 육군 부참모 총장', '미국 육군 부참모 총장'],
 ['1924년 12월 2일', '1924년 12월 2일'],
 ['국무장관', '국무장관'],
 ['경고 : 현실주의, 레이건과 외교 정책', '경고 : 현실주의, 레이건과 외교 정책'],
 ['퍼트리샤 앤토이넷 폭스', '퍼트리샤 앤토이넷 폭스'],
 ['1944년', '1944년'],
 ['3명', '3명'],
 ['노터데임 대학교', '노터데임 대학교'],
 ['퍼트리샤 앤토이넷 폭스', '퍼트리샤 앤토이넷 폭스'],
 ['노터데임 대학교', '노터데임 대학교'],
 ['정통 제병 연합부대', '정통 제병 연합부대'],
 ['퍼트리샤 앤토이넷 폭스', '퍼트리샤 앤토이넷 폭스'],
 ['1979년', '1979년'],
 ['닉슨 대통령', '닉슨'],
 ['5년', '5년']]

In [128]:

len(except_idx)

Out[128]:

In [127]:

print(len(val_questions),len(y_test[0]),len(y_test[1]))

5774 5717 5717

In [72]:

len(X_test[0])

Out[72]:

평가하기¶

MRC 평가방법¶

Exact Match / F1 Score

Ground-truth와 얼마나 일치하는 지를 측정하므로 extractive answer 및 multiple-choice answer dataset에 적합한 평가

ROUGE-L / BLEU

Ground-truth와 얼마나 overlap이 있는 지를 측정하며, 답변을 새로이 생성해내는 descriptive answer dataset에 적합한 평가

In [148]:

from collections import Counter
import string
import re

def normalize_answer(s):
    """Lower text and remove punctuation, articles and extra whitespace."""
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)

    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

def f1_score(prediction, ground_truth):
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

def exact_match_score(prediction, ground_truth):
    return (normalize_answer(prediction) == normalize_answer(ground_truth))

def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    scores_for_ground_truths = []
    for ground_truth in ground_truths:
        score = metric_fn(prediction, ground_truth)
        scores_for_ground_truths.append(score)
    return max(scores_for_ground_truths)

In [26]:

# f1_sum = 0

# for i in range(len(predict)):
#   f1 = metric_max_over_ground_truths(f1_score, predict[i][0], predict[i][1])
#   f1_sum += f1
# print("f1 score : ", f1_sum / len(predict))

학습된 모델에 새로운 데이터로 예측하기¶

In [47]:

input_contexts = ["1989년 2월 15일 여의도 농민 폭력 시위를 주도한 혐의 ( 폭력행위등처벌에관한법률위반 ) 으로 지명수배되었다. 1989년 3월 12일 서울지방검찰청 공안부는 임종석의 사전구속영장을 발부받았다. 같은 해 6월 30일 평양축전에 임수경을 대표로 파견하여 국가보안법위반 혐의가 추가되었다. 경찰은 12월 18일 ~ 20일 사이 서울 경희대학교에서 임종석이 성명 발표를 추진하고 있다는 첩보를 입수했고, 12월 18일 오전 7시 40분 경 가스총과 전자봉으로 무장한 특공조 및 대공과 직원 12명 등 22명의 사복 경찰을 승용차 8대에 나누어 경희대학교에 투입했다. 1989년 12월 18일 오전 8시 15분 경 서울청량리경찰서는 호위 학생 5명과 함께 경희대학교 학생회관 건물 계단을 내려오는 임종석을 발견, 검거해 구속을 집행했다. 임종석은 청량리경찰서에서 약 1시간 동안 조사를 받은 뒤 오전 9시 50분 경 서울 장안동의 서울지방경찰청 공안분실로 인계되었다."
                    , "1945년 8월 15일 제2차 세계대전에서 일본이 패망하자 한국은 해방을 맞았다. 해방부터 대한민국과 조선민주주의인민공화국이 성립되는 시기까지를 군정기 또는 해방정국이라 한다. 해방정국에서 남과 북은 극심한 좌우 갈등을 겪었다. 북에서는 조만식과 같은 우익 인사에 대한 탄압이 있었고, 남에서는 여운형과 같은 중도 좌파 정치인이 암살되었다. 국제사회에서는 모스크바 3상회의를 통해 소련과 미국에서 통일 임시정부 수립을 위한 신탁통치를 계획했지만, 한국에서의 극심한 반대와 함께 미소공동위원회의 결렬로 폐기되었다. 1948년 대한민국과 조선민주주의인민공화국의 정부가 별도로 수립되면서 일본군의 무장해제를 위한 군사분계선이었던 38선은 두 개의 국가로 분단되는 기준이 되었다. 분단 이후 양측 간의 긴장이 이어졌고 수시로 국지적인 교전이 있었다. 1950년 6월 25일 조선인민군이 일제히 38선을 넘어 대한민국을 침략하여 한국 전쟁이 발발하였다."]
input_questions =  ["1989년 6월 30일 평양축전에 대표로 파견 된 인물은? "
                    , "한국전쟁이 발발한 시기는?"]

In [48]:

input_encodings = tokenizer(input_contexts, input_questions, truncation=True, padding=True)

In [49]:

def encoding_input(encodings):
  input_ids = np.array(encodings['input_ids'])
  attention_masks = np.array(encodings['attention_mask'])

  input_data = [input_ids, attention_masks]
  return input_data

In [50]:

input_data = encoding_input(input_encodings)

In [51]:

def predict_test_data_by_idx(idx):
  context = tokenizer.decode(input_data[0][idx]).split('[SEP] ')[0]
  question = tokenizer.decode(input_data[0][idx]).split('[SEP] ')[1]
  print('본문 :', context[5:])
  print('질문 :', question)
  output = model([tf.constant(input_data[0][idx])[None, :], tf.constant(input_data[1][idx])[None, :]])
  start = tf.math.argmax(tf.squeeze(output[0]))
  end = tf.math.argmax(tf.squeeze(output[1]))+1
  answer_encoded = input_data[0][idx][start:end]
  print('예측 :',tokenizer.decode(answer_encoded))
  print('----------------------------------------')

In [52]:

for i in range(len(input_questions)):
  predict_test_data_by_idx(i)

In [ ]:

REFERENCE

원본 코드

https://github.com/ukairia777/tensorflow-nlp-tutorial/blob/main/18.%20Fine-tuning%20BERT%20(Cls%2C%20NER%2C%20NLI)/18-7.%20kor_bert_question_answering_tpu.ipynb

GitHub - ukairia777/tensorflow-nlp-tutorial: tensorflow를 사용하여 텍스트 전처리부터, Topic Models, BERT, GPT와

tensorflow를 사용하여 텍스트 전처리부터, Topic Models, BERT, GPT와 같은 최신 모델의 다운스트림 태스크들을 정리한 Deep Learning NLP 저장소입니다. - GitHub - ukairia777/tensorflow-nlp-tutorial: tensorflow를 사용하

github.com

데이터셋

https://korquad.github.io/KorQuad%201.0/

1.0 (한국어)

What is KorQuAD 1.0? KorQuAD 1.0은 한국어 Machine Reading Comprehension을 위해 만든 데이터셋입니다. 모든 질의에 대한 답변은 해당 Wikipedia article 문단의 일부 하위 영역으로 이루어집니다. Stanford Question Answer

korquad.github.io