[NLP] Building a Custom Named Entity Recognition Dataset from Collected Data - (1)


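from konlpy.tag import Mecab

# Load the entity dictionary (label -> list of surface forms)
ner_dict = {"PERSON": ["John", "Mary", "Peter"],
            "ORGANIZATION": ["Google", "Microsoft", "Apple"],
            "LOCATION": ["New York", "London", "Paris"]}

# Input text
text = "John works at Google in New York."

# Split the text into tokens (morphemes)
tokenizer = Mecab()
words = tokenizer.morphs(text)
print(words)

# Tag each token using the entity dictionary, producing exactly one tag per token
tags = []
for word in words:
    tag = "O"  # default when the token is not in any entity list
    for ner in ner_dict:
        if word in ner_dict[ner]:
            tag = ner
            break
    tags.append(tag)

# Note: tokens are compared one at a time, so a multi-word entry such as
# "New York" is not matched once the tokenizer splits it into "New" and "York".

# Print the tagged text
print("Tagged text:", " ".join([word + " " + tag for word, tag in zip(words, tags)]))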

The text is tokenized with the Mecab morphological analyzer, and each token is then tagged by looking it up in the entity dictionary.
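Dictionary lookup on single tokens cannot label an entry that spans several tokens, such as "New York". One possible way around this, sketched below, is to match dictionary entries as token sequences (longest match first) and emit BIO-style tags (`B-` for the first token of an entity, `I-` for the rest). This is only an illustrative sketch and not the method used in this post: the BIO scheme, the helper names `simple_tokenize`, `build_phrase_table`, and `bio_tag`, and the regex tokenizer standing in for Mecab are all assumptions made so the example runs without extra installation.

```python
import re

# Same illustrative dictionary as above (label -> list of surface forms)
ner_dict = {
    "PERSON": ["John", "Mary", "Peter"],
    "ORGANIZATION": ["Google", "Microsoft", "Apple"],
    "LOCATION": ["New York", "London", "Paris"],
}


def simple_tokenize(text):
    # Stand-in tokenizer (assumption): words and punctuation become separate tokens
    return re.findall(r"\w+|[^\w\s]", text)


def build_phrase_table(ner_dict):
    # Map each entry's token tuple to its label, e.g. ("New", "York") -> "LOCATION"
    return {
        tuple(simple_tokenize(phrase)): label
        for label, phrases in ner_dict.items()
        for phrase in phrases
    }


def bio_tag(tokens, phrase_table):
    # Hypothetical helper: greedy longest-match lookup that returns one tag per token
    max_len = max(len(key) for key in phrase_table)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = phrase_table.get(tuple(tokens[i:i + n]))
            if label:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                break
        else:
            # No dictionary entry starts at this token
            i += 1
    return tags


tokens = simple_tokenize("John works at Google in New York.")
tags = bio_tag(tokens, build_phrase_table(ner_dict))
for token, tag in zip(tokens, tags):
    print(token, tag)
```

With a real morphological analyzer, the dictionary entries would need to be tokenized with that same analyzer so that the token tuples line up with the tokens of the input text.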
