What Is Tokenization?: Splitting Text into Sentences/Words (feat. Corpus) - nltk.sent_tokenize, okt.morphs, okt.pos, sent_tokenize, word_tokenize

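Tokenization

Splitting text into sentences/words

Tokenization splits a corpus into tokens at the sentence or word level, where each token is a contiguous string that contains no separator.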

* Corpus: a large collection of text data used in natural language processing.
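To make the "contiguous string with no separator" idea concrete, here is a minimal sketch using only the standard library. The separator set (`SEPARATORS`) and the helper name `split_on_separators` are assumptions chosen for illustration, not part of any library.

```python
import re

# Hypothetical separator set for illustration: whitespace plus a few punctuation marks.
SEPARATORS = r"[\s.,!?]+"


def split_on_separators(text):
    """Return the maximal runs of characters that contain no separator."""
    # re.split can leave empty strings at the edges, so drop them.
    return [piece for piece in re.split(SEPARATORS, text) if piece]


corpus = "I am studying NLP. Tokenization is important."
tokens = split_on_separators(corpus)
print(tokens)
# ['I', 'am', 'studying', 'NLP', 'Tokenization', 'is', 'important']
```

Changing the separator set changes what counts as a token, which is why the choice of unit (sentence, word, morpheme) matters.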

- Depending on the tokenization unit, text is split into sentences or words. Sentence-level splitting uses sent_tokenize:

# Requires NLTK's Punkt sentence-tokenizer data, e.g. nltk.download('punkt')
# (newer NLTK releases ask for 'punkt_tab' instead).
from nltk.tokenize import sent_tokenize

text = """
Natural Language Processing is interesting.
I am studying NLP.
Tokenization is important.
"""

sentences = sent_tokenize(text)

print(sentences)

[
 'Natural Language Processing is interesting.',
 'I am studying NLP.',
 'Tokenization is important.'
]
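To get a feel for what a sentence splitter has to do, here is a rough standard-library sketch. It is not how NLTK works internally (sent_tokenize relies on the pretrained Punkt model); the function name `naive_sent_tokenize` and its "split after `.`, `!` or `?`" rule are assumptions made for illustration.

```python
import re


def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when whitespace follows (a rough heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


text = """
Natural Language Processing is interesting.
I am studying NLP.
Tokenization is important.
"""

print(naive_sent_tokenize(text))
# ['Natural Language Processing is interesting.', 'I am studying NLP.', 'Tokenization is important.']

# The heuristic breaks on abbreviations, because it treats every period as a boundary.
print(naive_sent_tokenize("Dr. Kim studies NLP."))
# ['Dr.', 'Kim studies NLP.']
```

The second call shows why a simple rule is not enough for real text, and why a trained sentence tokenizer is usually the safer choice.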

from nltk.tokenize import word_tokenize

sentence = "Natural Language Processing is interesting."

tokens = word_tokenize(sentence)

print(tokens)

['Natural', 'Language', 'Processing', 'is', 'interesting', '.']
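Note that the period came out as its own token. A plain whitespace split would not do that. The sketch below contrasts the two using only the standard library; `simple_word_tokenize` is a hypothetical helper, and NLTK's own rules are more elaborate (contractions, for instance), so results can differ on harder input.

```python
import re


def simple_word_tokenize(sentence):
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", sentence)


sentence = "Natural Language Processing is interesting."

whitespace_tokens = sentence.split()
regex_tokens = simple_word_tokenize(sentence)

print(whitespace_tokens)
# ['Natural', 'Language', 'Processing', 'is', 'interesting.']
print(regex_tokens)
# ['Natural', 'Language', 'Processing', 'is', 'interesting', '.']
```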

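- Whitespace-based (eojeol-level) tokenization suits English, while morpheme-level tokenization is the better fit for Korean. Splitting a Korean sentence on whitespace looks like this: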

sentence = "자연어 처리는 매우 재미있습니다."

tokens = sentence.split()

print(tokens)
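One reason whitespace splitting falls short for Korean is that particles stay attached to the word in front of them. The sketch below uses three made-up sentences (an assumption for illustration) to show that the noun 학생 never shows up as a token of its own.

```python
from collections import Counter

# Toy sentences, invented for illustration: the same noun with three different particles.
sentences = ["학생이 공부한다.", "학생은 공부한다.", "학생을 가르친다."]

# Count every whitespace-separated token across the sentences.
counts = Counter(token for s in sentences for token in s.split())

print(counts["학생"])    # 0: the bare noun is never a token
print(counts["학생이"])  # 1: each noun+particle form is a separate token
print(len(counts))       # 5 distinct tokens in total
```

With morpheme-level tokenization the noun and its particle would be separated, so all three sentences would share a single 학생 token.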

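- Subword tokenization resembles morpheme tokenization, but it splits on statistical patterns rather than on meaning. For morpheme-level tokens use okt.morphs; for part-of-speech tagging use okt.pos.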

from konlpy.tag import Okt

okt = Okt()

sentence = "자연어처리는 매우 재미있습니다."

tokens = okt.morphs(sentence)

print(tokens)

['자연어', '처리', '는', '매우', '재미있습니다', '.']
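For contrast with the morpheme output above, here is what "statistical rather than meaning-based" can look like for subwords. This is a simplified sketch in the style of byte-pair encoding (BPE), one common subword method; the post itself does not name a specific algorithm, and the word counts and helper names below are made up for illustration.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest frequency-weighted count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy word frequencies (invented), with each word starting as single characters.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    merges.append(pair)

print(merges)
print(list(words))
```

After three merges the frequent ending "est" has become a single symbol, purely because it occurs often and not because the code knows it is a suffix.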

from konlpy.tag import Okt

okt = Okt()

sentence = "학생들이 열심히 공부한다."

tokens = okt.pos(sentence)

print(tokens)

[
 ('학생', 'Noun'),
 ('들', 'Suffix'),
 ('이', 'Josa'),
 ('열심히', 'Adverb'),
 ('공부', 'Noun'),
 ('한다', 'Verb'),
 ('.', 'Punctuation')
]
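A common next step with (token, tag) pairs is to keep only certain parts of speech. The sketch below copies the tagged output above into a list so it runs without KoNLPy installed; `keep_tags` is a hypothetical helper, not a KoNLPy function.

```python
# Tagged tokens copied from the okt.pos output above.
tagged = [
    ("학생", "Noun"), ("들", "Suffix"), ("이", "Josa"), ("열심히", "Adverb"),
    ("공부", "Noun"), ("한다", "Verb"), (".", "Punctuation"),
]


def keep_tags(tagged_tokens, wanted):
    """Keep only the tokens whose part-of-speech tag is in `wanted`."""
    return [word for word, tag in tagged_tokens if tag in wanted]


nouns = keep_tags(tagged, {"Noun"})
content_words = keep_tags(tagged, {"Noun", "Verb", "Adverb"})

print(nouns)
# ['학생', '공부']
print(content_words)
# ['학생', '열심히', '공부', '한다']
```

Dropping particles and punctuation this way is one simple form of cleanup before vectorization.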

- The English examples use the NLTK (Natural Language Toolkit) library: sent_tokenize, word_tokenize

from nltk.tokenize import sent_tokenize, word_tokenize

text = """
Python is easy.
NLP is powerful.
"""

sentences = sent_tokenize(text)

for sentence in sentences:
    print(word_tokenize(sentence))
