Natural Language Processing : Text Preprocessing : Stopword Removal (Stopword)

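Definition

In the previous posts we removed noise and normalized text by dropping very short words and rarely occurring words, unifying upper and lower case, and applying stemming and lemmatization.

Some words, however, appear frequently in a text only because they are conventional or grammatical, and they contribute little to its actual meaning. Words that occur often but carry no real weight in semantic analysis are called stopwords, and stopword removal is the step that filters them out.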
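To make the idea concrete, here is a minimal sketch that counts word frequencies in a short made-up passage. The sample text and the regex-based tokenizer are assumptions chosen for illustration, not part of the original post; the point is only that the most frequent tokens tend to be function words rather than content words.

```python
import re
from collections import Counter

# Made-up sample text, used only to illustrate the frequency pattern.
text = (
    "The history of the city is the history of the river. "
    "The river is the reason the city is here."
)

# Deliberately simple tokenizer: lowercase, then keep alphabetic runs only.
tokens = re.findall(r"[a-z]+", text.lower())

counts = Counter(tokens)
print(counts.most_common(2))  # [('the', 7), ('is', 3)]
```

Both of the top entries are words that say nothing about the topic of the passage, which is why they are good candidates for a stopword list.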

1. Checking the stopword list in nltk

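import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stopword corpus
print(stopwords.words('english'))  # NLTK's built-in English stopword list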

2. Removing stopwords with nltk

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assumes the 'stopwords' corpus and the Punkt tokenizer data have already
# been fetched with nltk.download().
example = "Family is not an important thing. It's everything."
stop_words = set(stopwords.words('english'))  # a set makes membership tests fast

word_tokens = word_tokenize(example)

# Keep only the tokens that are not in the stopword set (case-sensitive match)
result = []
for w in word_tokens:
    if w not in stop_words:
        result.append(w)

print(word_tokens)
print(result)

"""
['Family', 'is', 'not', 'an', 'important', 'thing', '.', 'It', "'s", 'everything', '.']
['Family', 'important', 'thing', '.', 'It', "'s", 'everything', '.']
"""

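Two details in the output above are worth noticing. "It" survives because NLTK's list is lowercase and the membership test is case-sensitive, and "not" is removed, which flips the meaning of the sentence. The sketch below shows one possible way to handle both. The small stopword set, the `keep` set, and the `remove_stopwords` helper are hypothetical names introduced for illustration; they are not part of NLTK.

```python
# Hypothetical hand-picked stopword set for illustration only,
# standing in for a real list such as stopwords.words('english').
STOP_WORDS = {"is", "an", "it", "not", "'s"}

# Words we choose to protect even though they are in the stopword set.
KEEP = {"not"}


def remove_stopwords(tokens, stop_words, keep=frozenset()):
    """Drop tokens whose lowercase form is a stopword, unless it is in keep."""
    return [
        t for t in tokens
        if t.lower() not in stop_words or t.lower() in keep
    ]


tokens = ['Family', 'is', 'not', 'an', 'important', 'thing', '.',
          'It', "'s", 'everything', '.']
filtered = remove_stopwords(tokens, STOP_WORDS, KEEP)
print(filtered)
# ['Family', 'not', 'important', 'thing', '.', 'everything', '.']
```

Whether negations should be kept depends on the task; for sentiment analysis they usually matter, while for topic-style analysis they often do not.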
cf) Korean stopword list

https://www.ranks.nl/stopwords/korean
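A list like the one linked above can be loaded into a set and applied in the same way as the English example. The following is a rough sketch under simplifying assumptions: the three stopwords and the sample sentence are hand-picked for illustration and are not copied from the linked list, and the sentence is split on whitespace only.

```python
# Hypothetical, hand-picked stopwords for illustration;
# in practice this set would be filled from a full Korean stopword list.
korean_stop_words = {"그", "정말", "그리고"}

sentence = "그 영화는 정말 재미있었고 그리고 배우들의 연기도 정말 좋았다"

# Whitespace splitting is a simplification. Korean particles attach to words,
# so a morphological analyzer is normally used for tokenization instead.
tokens = sentence.split()

result = [t for t in tokens if t not in korean_stop_words]
print(result)
# ['영화는', '재미있었고', '배우들의', '연기도', '좋았다']
```

Because particles stay attached after a plain split (for example "영화는"), stopwords such as particles can only be matched once the text has been tokenized at the morpheme level.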

