sillon coding

기존의 인터넷 상에 있던 코드들은 모두 옛날 코드인지 잘 적용이 안됐었다. 긴 시간 구현해보고 파이토치로도 구현해보았는데 제대로 코드조차 실행이 안됐다. 사실은 다른 오픈소스를 다운받아 해결할 수 있었겠지만, 그렇게 하면 공부가 절대 되지 않을 것 같아 직접 발로 뛰며 구현했다. (사실 앉아만 있었음) 버트 인풋 만들기에서 조금 더 추가한 부분이 있다. 'token_type_ids' [CLS] 부분과 [SEP] 부분까지의 문장을 구분한다. 사실 내가 전처리한 데이터 셋에는 데이터셋 처음 부분은 [CLS], 마지막 부분은 [SEP]라서 문장 자체에서 이렇게 구분하진 않아도 된다. 그래도 일단... 인풋에 필요하다고 하니 넣어주기로 한다. # 'token_type_ids' ( [CLS]와 [SEP]를 구분해..

보호되어 있는 글입니다.

학습을 위해 데이터를 분리하여주겠습니다. from sklearn.model_selection import train_test_split # train 데이터에서 10% 만큼을 validation 데이터로 분리 tr_inputs, val_inputs, tr_tags, val_tags = train_test_split(input_ids, tags, random_state=222, test_size=0.1) tr_masks, val_masks, _, _ = train_test_split(attention_masks, input_ids, random_state=222, test_size=0.1) main.py import pickle import numpy as np from tensorflow.keras.pr..

pickle file save & load main.py import pickle import numpy as np def file_open(filePath): f =open(filePath,"rb") data = pickle.load(f) f.close() return data if __name__ == "__main__": tokened_data = file_open('data/token_data.pkl') tokenized_texts = [token_label_pair[0] for token_label_pair in tokened_data] labels = [token_label_pair[1] for token_label_pair in tokened_data] ## padding print(np.q..

전처리한 데이터를 바탕으로 태그와 문장을 토큰화 해보겠습니다. 문장 토큰화 tokenization.py from transformers import BertTokenizer from transformers import BertTokenizer def tokenize_and_preserve_labels(sentence, text_labels): tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased') tokenized_sentence = [] labels = [] for word, label in zip(sentence, text_labels): tokenized_word = tokenizer.tokenize(word) n_subw..

이전에 구현했던 코드에서 데이터 전처리를 조금 더 진행해보겠다. data_load.py에서 구현했던 tag_split() 함수에서 해당 문자에서 한글, 영어, 소문자, 대문자 이외의 단어를 제거하는 함수를 작성해보겠다. cng_sentence = [] for str in sentence: # 한글, 영어, 소문자, 대문자 이외의 단어 제거 str = re.sub(r'[^ㄱ-ㅣ가-힣0-9a-zA-Z.]', "", str) cng_sentence.append(str) # 전처리한 데이터를 바탕으로 단어와 태그 나누기 def tag_split(tagged_sentences): sentences, ner_tags = [],[] for tagged_sentence in tagged_sentences: senten..

f1score란.. 블로그에서 해당 내용에 대해 다룬 게시글이 있으므로 나중에 링크 걸어두겠다. https://bestkcs1234.tistory.com/61 Tensorflow addons 을 이용한 F1 score 출력 tensorflow는 모델을 compile 할 때 다양한 metrics를 제공해준다. 다음과 같이 설정하면 매 epoch 마다 해당하는 수치들을 계산하여 출력해준다. 하지만 f1 score는 tf.metrics에서 찾아볼 수 없는데 Tensorflow a bestkcs1234.tistory.com from sklearn.metrics import classification_report,f1_score # f1 score p = model.predict(np.array(X_test)) ..

모델에 attention 층 추가 model.add(Bidirectional(LSTM(units=128, return_sequences=True, dropout=0.2, recurrent_dropout=0.2))) model.add(SeqSelfAttention(attention_activation='sigmoid')) model.add(Bidirectional(LSTM(units=64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2))) model.add(SeqSelfAttention(attention_activation='sigmoid')) model.add(Dense(1, activation='sigmoid')) model.py from ke..

지난번 게시글에 학습한 모델을 바탕으로 새로운 input값을 넣어 개채명을 인식해보겠습니다. 지난번 코드와 출력결과는 아래와 같습니다. input_predict.py from keras.preprocessing.text import text_to_word_sequence from tensorflow.keras.preprocessing.text import Tokenizer from tensorflow.keras.preprocessing.sequence import pad_sequences from tensorflow.keras.models import load_model import numpy as np def tokenize(samples): tokenizer = Tokenizer() tokenizer..

모델명: test_model.h5 파라미터 설정 test-train split : 8:2 Optimaizer: Adam Epoch = 3 batch_size=128 Learning_rate = 0.001 성능 loss: 0.1447 - accuracy: 0.7625 모델명: test_model2.h5 파라미터 설정 test-train split : 8:2 Optimaizer: Adam Epoch = 20 batch_size=128 Learning_rate = 0.005 성능 loss: 0.1796 - accuracy: 0.7463 학습로그 더보기 Epoch 1/20 2022-11-06 01:29:32.795455: I tensorflow/stream_executor/platform/default/dso_..

출력한 값을 토대로 정답을 예측하는 함수를 만들었다. 이번에는 모델을 학습 시키고 저장한 뒤 로드한 모델로 출력하는 함수까지 구현하였다. def model_predict(src_tokenizer,tar_tokenizer,model): # 예측 idx2word = src_tokenizer.index_word idx2ner = tar_tokenizer.index_word idx2ner[0] = 'PAD' i = 10 y_predicted = model.predict(np.array([X_test[i]])) y_predicted = np.argmax(y_predicted, axis = -1) # 가장 높은 확률을 추출 (예측값) true = np.argmax(y_test[i],-1) # 실제 값 print("..

모델 구축에는 keras 함수를 이용하였다. 그리고 옵티마이저는 Adam 모델의 구성은 다음과 같다. 1. 입력을 실수 벡터로 임베딩 2. 양방향 LSTM 구성 3. Dense Layer를 통한 각 태그에 속할 확률을 예측 TimeDistributed는 상위 layer의 출력이 step에 따라 여러개로 출력되어 이를 적절하게 분배해주는 역할을 한다. model.py from keras.models import Sequential from keras.layers import Dense, Embedding, LSTM, Bidirectional, TimeDistributed from keras.optimizers import Adam def modeling(vocab_size, max_len,tag_size)..

티스토리툴바