[NLP Project] 3. Preparing Data for Training (Training Data and Test Data)

sillon 2022. 11. 5. 23:48

tokenization.py

from tensorflow.keras.preprocessing.text import Tokenizer

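# Tokenize the data so that only the cleaned, most frequent words are kept
def Token(sentences, ner_tags):
    max_words = 4000
    # Words outside the most frequent ones are mapped to the OOV token '-'
    src_tokenizer = Tokenizer(num_words=max_words, oov_token='-')
    src_tokenizer.fit_on_texts(sentences)

    tar_tokenizer = Tokenizer()
    tar_tokenizer.fit_on_texts(ner_tags)

    vocab_size = max_words
    # +1 because index 0 is reserved for padding
    tag_size = len(tar_tokenizer.word_index) + 1
    # tag_size is returned as well because main.py needs it for one-hot encoding
    return src_tokenizer, tar_tokenizer, tag_size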
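
To make the indexing convention concrete, here is a rough pure-Python sketch of how a frequency-ranked word index can be built. It is a simplification for illustration, not the Keras implementation: `build_word_index` and the toy corpus are hypothetical, and the real `Tokenizer` also lowercases and filters text. The point is only that index 1 goes to the OOV token, more frequent words get smaller indices, and index 0 is left free for padding, which is why the tag size gets a `+ 1`.

```python
from collections import Counter

def build_word_index(sentences, oov_token="-"):
    """Rank words by frequency. Index 1 is the OOV token; index 0 stays free for padding."""
    counts = Counter(word for sentence in sentences for word in sentence)
    word_index = {oov_token: 1}
    # More frequent words receive smaller indices, starting from 2
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

# Hypothetical toy corpus (not from the project's data)
toy_sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]
word_index = build_word_index(toy_sentences)

# Hypothetical tag index without an OOV token, like tar_tokenizer
tag_index = {"o": 1, "b-per": 2, "i-per": 3}
tag_size = len(tag_index) + 1  # 3 tags + the padding index 0 = 4 classes
```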

The tokens produced here now need to be turned into data in a form suitable for training.

main.py

from data_load import file_load, tag_split
from tokenization import Token

if __name__ == "__main__":
    file_path = "data/train_data.txt"
    tagged_sentences = file_load(file_path)
    sentences, ner_tags = tag_split(tagged_sentences)
    src_tokenizer, tar_tokenizer, tag_size = Token(sentences, ner_tags)

    # Convert the data into integer sequences so it can be used for training
    X_train = src_tokenizer.texts_to_sequences(sentences)
    y_train = tar_tokenizer.texts_to_sequences(ner_tags)

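Using the tokenizers returned by the Token function, the texts_to_sequences function converts the sentences and tags into arrays of integer indices.

Printing just one row of the result looks like this.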

# X_train[0]
[19, 19, 85, 19, 19, 19, 19, 19, 19, 19]

The sentence has been tokenized properly and stored as an array!
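
For intuition, the conversion can be sketched in plain Python as below. This is a simplified illustration, not Keras's code: `to_sequences` and the small `word_index` are hypothetical. It assumes the behaviour I understand `texts_to_sequences` to have when `num_words` and `oov_token` are both set, namely that unknown words and words whose index is `num_words` or larger are replaced by the OOV index.

```python
def to_sequences(sentences, word_index, num_words, oov_index=1):
    """Map each word to its index; unknown or too-rare words become the OOV index."""
    sequences = []
    for sentence in sentences:
        ids = []
        for word in sentence:
            idx = word_index.get(word)
            if idx is None or idx >= num_words:
                idx = oov_index
            ids.append(idx)
        sequences.append(ids)
    return sequences

# Hypothetical word index; "dog" has index 5, which is cut off by num_words=5
word_index = {"-": 1, "the": 2, "sat": 3, "cat": 4, "dog": 5}
seqs = to_sequences([["the", "cat", "sat"], ["the", "fish", "dog"]], word_index, num_words=5)
print(seqs)  # [[2, 4, 3], [2, 1, 1]]
```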

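Next comes padding.

What is padding?

In natural language processing, sentences (or documents) often differ in length.

A machine, however, can treat documents of identical length as a single matrix and process them together in one batch.

In other words, to enable parallel computation, the lengths of the sentences have to be artificially made equal.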
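
As a minimal sketch of the idea, the hypothetical helper `pad_post` below pads short sequences with zeros at the end and cuts long ones down to `maxlen`. It assumes `maxlen` is positive and mimics what I understand to be the Keras defaults when only `padding="post"` is given: padding value 0, and truncation from the front (`truncating="pre"`).

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value`; sequences longer than maxlen lose their first items."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # keep the last maxlen items (truncate from the front)
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

# Hypothetical index sequences of different lengths
padded = pad_post([[2, 4, 3], [2, 1], [5, 6, 7, 8, 9]], maxlen=4)
print(padded)  # [[2, 4, 3, 0], [2, 1, 0, 0], [6, 7, 8, 9]]
```

Every row now has length 4, so the whole batch can be stored as one matrix.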

    # Convert the data into integer sequences so it can be used for training
    X_train = src_tokenizer.texts_to_sequences(sentences)
    y_train = tar_tokenizer.texts_to_sequences(ner_tags)
    # Pad every sequence to the same length
    max_len = 70
    X_train = pad_sequences(X_train, padding="post", maxlen=max_len)
    y_train = pad_sequences(y_train, padding="post", maxlen=max_len)
    # Split into training and test data
    X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=.2, random_state=111)

    y_train = to_categorical(y_train, num_classes=tag_size)  # one-hot encoding
    y_test = to_categorical(y_test, num_classes=tag_size)
    print(X_train.shape, X_test.shape)
    print(y_train.shape, y_test.shape)

After padding, I used scikit-learn's train_test_split function to split the data into training data and test data.
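
Conceptually, the split shuffles the sample indices once and applies the same shuffled order to both the inputs and the labels, so each sentence stays paired with its tags. Here is a rough standard-library sketch of that idea. `split_train_test` is a hypothetical helper, and its shuffle will not reproduce the exact rows that scikit-learn picks for the same seed.

```python
import math
import random

def split_train_test(X, y, test_size=0.2, seed=111):
    """Shuffle indices once, then slice X and y with the same indices."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Hypothetical toy data: label i * 10 belongs to sample i
X = list(range(10))
y = [x * 10 for x in X]
X_train, X_test, y_train, y_test = split_train_test(X, y)
print(len(X_train), len(X_test))  # 8 2
```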

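I didn't pull this part out into its own function, but if it starts to look messy I should probably factor it out.

The full script and its output are as follows.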

main.py

from data_load import file_load, tag_split
from tokenization import Token
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
import numpy as np


if __name__ == "__main__":
    file_path = "data/train_data.txt"
    tagged_sentences = file_load(file_path)
    sentences, ner_tags = tag_split(tagged_sentences)
    src_tokenizer, tar_tokenizer, tag_size = Token(sentences, ner_tags)

    # Convert the data into integer sequences so it can be used for training
    X_train = src_tokenizer.texts_to_sequences(sentences)
    y_train = tar_tokenizer.texts_to_sequences(ner_tags)
    # Pad every sequence to the same length
    max_len = 70
    X_train = pad_sequences(X_train, padding="post", maxlen=max_len)
    y_train = pad_sequences(y_train, padding="post", maxlen=max_len)
    # Split into training and test data
    X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=.2, random_state=111)

    # Convert the integer tag indices into one-hot vectors
    y_train = to_categorical(y_train, num_classes=tag_size)
    y_test = to_categorical(y_test, num_classes=tag_size)
    print(X_train.shape, X_test.shape)
    print(y_train.shape, y_test.shape)

(72000, 70) (18000, 70)
(72000, 70, 30) (18000, 70, 30)
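
The extra third dimension in the label shapes comes from one-hot encoding: each of the 70 tag indices per sentence becomes a vector of length tag_size (30 here, counting the padding index 0). Here is a small plain-Python sketch of that transformation. `one_hot` is a hypothetical helper that returns nested lists of ints, whereas `to_categorical` returns a NumPy array.

```python
def one_hot(padded_tags, num_classes):
    """Turn every tag index into a vector with a single 1 at that index."""
    return [
        [[1 if c == tag else 0 for c in range(num_classes)] for tag in seq]
        for seq in padded_tags
    ]

# Hypothetical padded tag sequences: 2 sentences, 3 positions, 4 classes
y = [[1, 2, 0], [3, 0, 0]]
encoded = one_hot(y, num_classes=4)
shape = (len(encoded), len(encoded[0]), len(encoded[0][0]))
print(shape)          # (2, 3, 4)
print(encoded[0][0])  # [0, 1, 0, 0]
```

Padded positions (index 0) end up as the vector with a 1 in the first slot, so padding is treated as its own class.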
