Understanding the 20 Newsgroups data

import pandas as pd
from sklearn.datasets import fetch_20newsgroups
# %matplotlib inline : IPython magic that renders plot output inline in the notebook page
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

news = fetch_20newsgroups(subset='train') # subset='train' returns only the training data

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)

print(news.keys())

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

print('Number of training samples : {}'.format(len(news.target)))
print('Number of topics : {}'.format(len(news.target_names)))
print('List of topics : {}'.format(news.target_names))

Number of training samples : 11314
Number of topics : 20
List of topics : ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

Goal : given the body of an email-style post from the test data, predict which of the 20 topics it belongs to.

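# news.target[0] is the label index of sample 0; use it to look up the topic name.
print('Topic of sample 0 : ', news.target_names[news.target[0]])
print('Topic index of sample 0 : ', news.target[0])
print('Content of sample 0 : \n\n', news.data[0])

Topic of sample 0 :  rec.autos
Topic index of sample 0 :  7
Content of sample 0 : 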

 From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----

# Build a new DataFrame holding each post's text and its label
data = pd.DataFrame(news.data,  columns=['email'])
data['target'] = pd.Series(news.target)

data.head()

# data.isnull() marks every null entry in the DataFrame as True.
# any() then returns True for a column if at least one of its entries is True.
print(data.isnull().any())

email     False
target    False
dtype: bool
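To see what the check above reports when a value really is missing, here is a minimal sketch on a made-up three-row frame. The `toy` data is an assumption for illustration only; it is not part of the newsgroups dataset.

```python
import pandas as pd

# Toy frame with one missing value in the 'email' column (made-up data).
toy = pd.DataFrame({
    "email": ["first post", None, "third post"],
    "target": [0, 1, 1],
})

null_mask = toy.isnull()       # element-wise True/False
has_null = null_mask.any()     # per column: is at least one value null?
null_counts = null_mask.sum()  # per column: how many values are null?

print(has_null)
print(null_counts)
```

`any()` only says whether a column contains a null, while `sum()` on the same mask says how many, which is often the more useful number when cleaning is needed.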

nunique() : returns the number of distinct values, so duplicates are counted only once.

print('Number of unique samples : {}'.format(data['email'].nunique()))
print('Number of unique topics : {}'.format(data['target'].nunique()))

Number of unique samples : 11314
Number of unique topics : 20
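Since the unique count equals the total of 11314, the training set contains no exact duplicate posts. A small sketch of how `nunique()` behaves when duplicates do exist, using a made-up Series (`toy` is hypothetical):

```python
import pandas as pd

# Made-up posts; "post A" appears twice.
toy = pd.Series(["post A", "post B", "post A", "post C"])

n_total = len(toy)                     # all rows
n_unique = toy.nunique()               # distinct values only
n_dupes = int(toy.duplicated().sum())  # rows that repeat an earlier value

print(n_total, n_unique, n_dupes)
```

When `n_total` and `n_unique` differ, `toy.drop_duplicates()` would be one way to remove the repeats before training.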

print(type(data['target']))
# Bar chart of the number of samples per target class
data['target'].value_counts().plot(kind='bar')

<class 'pandas.core.series.Series'>

[Figure: bar chart of sample counts per target class]

# groupby('target') groups the rows that share the same target value.
# size() returns the number of rows in each of those groups.
# reset_index(name='count') turns the result into a DataFrame with a 'count' column.
print(data.groupby('target').size().reset_index(name='count'))

    target  count
0        0    480
1        1    584
2        2    591
3        3    590
4        4    578
5        5    593
6        6    585
7        7    594
8        8    598
9        9    597
10      10    600
11      11    595
12      12    591
13      13    594
14      14    593
15      15    599
16      16    546
17      17    564
18      18    465
19      19    377
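The same per-class counts can be reached in two ways. The sketch below runs both on a made-up five-row frame (`toy` is an assumption, not the real data) to show that `groupby(...).size()` and `value_counts()` agree once the latter is sorted by label.

```python
import pandas as pd

# Made-up frame: label 2 appears three times, labels 0 and 1 once each.
toy = pd.DataFrame({
    "email": ["a", "b", "c", "d", "e"],
    "target": [2, 0, 2, 1, 2],
})

# Route 1: group rows by label, count rows per group, name the count column.
counts = toy.groupby("target").size().reset_index(name="count")

# Route 2: value_counts() sorts by frequency, so re-sort by label to compare.
same = toy["target"].value_counts().sort_index()

print(counts)
print(same)
```

In the real table above the classes are fairly balanced: most have close to 600 samples, and the smallest (target 19) has 377.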

newsdata_test = fetch_20newsgroups(subset='test', shuffle=True) # subset='test' returns only the test data
train_email = data['email'] # text of the training posts
train_label = data['target'] # labels of the training posts
test_email = newsdata_test.data # text of the test posts
test_label = newsdata_test.target # labels of the test posts

Preprocessing with Tokenizer

max_words = 10000 # maximum number of words to use in this exercise
num_classes = 20 # number of labels

# mode is one of 'count', 'binary', 'tfidf', 'freq'.
# The values stored in the resulting matrix depend on the chosen mode.
def prepare_data(train_data, test_data, mode):
    t = Tokenizer(num_words=max_words)
    t.fit_on_texts(train_data) # build word_index from the training data only
    X_train = t.texts_to_matrix(train_data, mode=mode)
    X_test = t.texts_to_matrix(test_data, mode=mode)

    return X_train, X_test, t.index_word
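To make the difference between the modes concrete, here is a simplified pure-Python stand-in for what `texts_to_matrix` produces in the `binary`, `count` and `freq` modes. It is a sketch, not the Keras implementation: the helper `to_matrix`, the tiny `docs` and `vocab`, and the whitespace tokenization are all assumptions made for illustration. The `tfidf` mode, which additionally weights each count by how rare the word is across documents, is left out.

```python
from collections import Counter

def to_matrix(docs, vocab, mode):
    """Simplified stand-in for Tokenizer.texts_to_matrix (not the Keras code)."""
    # Column 0 stays unused, mirroring Keras word indices that start at 1.
    index = {word: i + 1 for i, word in enumerate(vocab)}
    matrix = []
    for doc in docs:
        # Keep only tokens that are in the vocabulary.
        tokens = [tok for tok in doc.lower().split() if tok in index]
        row = [0.0] * (len(vocab) + 1)
        for word, c in Counter(tokens).items():
            if mode == "binary":
                row[index[word]] = 1.0              # word present or not
            elif mode == "count":
                row[index[word]] = float(c)         # raw occurrence count
            elif mode == "freq":
                row[index[word]] = c / len(tokens)  # count / document length
            else:
                raise ValueError("unsupported mode: " + mode)
        matrix.append(row)
    return matrix

docs = ["the car the road", "car"]
vocab = ["the", "car", "road"]

binary = to_matrix(docs, vocab, "binary")
count = to_matrix(docs, vocab, "count")
freq = to_matrix(docs, vocab, "freq")
print(binary, count, freq, sep="\n")
```

Note that every `freq` row sums to at most 1, so for long posts the individual entries become very small. That may be one reason the `freq` mode trains more slowly in the experiment further down, though this sketch does not prove it.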

X_train, X_test, index_to_word = prepare_data(train_email, test_email, 'binary')
# One-hot encode the integer labels into vectors of length num_classes
y_train = to_categorical(train_label, num_classes)
y_test = to_categorical(test_label, num_classes)

print('Shape of training text matrix : {}'.format(X_train.shape))
print('Shape of training labels : {}'.format(y_train.shape))
print('Shape of test text matrix : {}'.format(X_test.shape))
print('Shape of test labels : {}'.format(y_test.shape))

Shape of training text matrix : (11314, 10000)
Shape of training labels : (11314, 20)
Shape of test text matrix : (7532, 10000)
Shape of test labels : (7532, 20)
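The label shape `(11314, 20)` comes from one-hot encoding: each integer label becomes a length-20 vector with a single 1. A minimal NumPy sketch of the same idea, using the first five training labels (7, 4, 4, 1, 14) as input. The `np.eye` indexing trick is a stand-in chosen for illustration, not how `to_categorical` is implemented.

```python
import numpy as np

labels = np.array([7, 4, 4, 1, 14])  # first five training labels
num_classes = 20

# Row k of the identity matrix is the one-hot vector for class k.
one_hot = np.eye(num_classes, dtype="float32")[labels]

print(one_hot.shape)              # one row per sample, one column per class
print(np.argmax(one_hot, axis=1)) # argmax recovers the integer labels
```

This format matches the `categorical_crossentropy` loss used below, which expects one probability target per class.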

print('Word at frequency rank 1 : {}'.format(index_to_word[1]))
print('Word at frequency rank 9999 : {}'.format(index_to_word[9999]))

Word at frequency rank 1 : the
Word at frequency rank 9999 : mic

Text classification with an MLP (Multilayer Perceptron)

from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential

def fit_and_evaluate(X_train, y_train, X_test, y_test):
    model = Sequential()
    model.add(Dense(256, input_shape=(max_words,), activation='relu')) # max_words is the length of one row of X_train
    model.add(Dropout(0.5))
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation='softmax'))
    # Multi-class classification
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(X_train, y_train, batch_size=128, epochs=5, verbose=1, validation_split=0.1)
    score = model.evaluate(X_test, y_test, batch_size=128, verbose=0)
    return score[1] # score[0] is the loss, score[1] is the accuracy
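The last layer uses softmax and the model is trained with categorical cross-entropy. The sketch below spells out both formulas in plain Python on made-up numbers; the function names and the three-class `logits` are assumptions for illustration, not Keras internals.

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs, eps=1e-7):
    # Only the true class (where one_hot is 1) contributes to the sum.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))

logits = [2.0, 1.0, 0.1]  # made-up scores for three classes
probs = softmax(logits)

loss_correct = categorical_crossentropy([1, 0, 0], probs)  # true class has highest score
loss_wrong = categorical_crossentropy([0, 0, 1], probs)    # true class has lowest score

# Loss of a model that guesses uniformly over 20 classes: ln(20).
uniform = [1 / 20] * 20
target = [1] + [0] * 19
chance_loss = categorical_crossentropy(target, uniform)

print(probs, loss_correct, loss_wrong, chance_loss)
```

The chance-level loss ln(20) is about 2.996, which is a handy reference when reading the training logs: a first-epoch loss near that value means the model is still close to guessing.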

modes = ['binary', 'count', 'tfidf', 'freq'] # keep the four modes in a list

for mode in modes: # repeat the steps below for each of the four modes
    X_train, X_test, _ = prepare_data(train_email, test_email, mode) # preprocess the data according to the mode
    score = fit_and_evaluate(X_train, y_train, X_test, y_test) # train and evaluate the model
    print(mode+' mode test accuracy:', score)

Train on 10182 samples, validate on 1132 samples
Epoch 1/5
10182/10182 [==============================] - 5s 539us/sample - loss: 2.2860 - accuracy: 0.3311 - val_loss: 0.9490 - val_accuracy: 0.8180
Epoch 2/5
10182/10182 [==============================] - 5s 462us/sample - loss: 0.8610 - accuracy: 0.7624 - val_loss: 0.4673 - val_accuracy: 0.8852
Epoch 3/5
10182/10182 [==============================] - 4s 429us/sample - loss: 0.4350 - accuracy: 0.8850 - val_loss: 0.3712 - val_accuracy: 0.8949
Epoch 4/5
10182/10182 [==============================] - 4s 417us/sample - loss: 0.2510 - accuracy: 0.9370 - val_loss: 0.3136 - val_accuracy: 0.9117
Epoch 5/5
10182/10182 [==============================] - 4s 415us/sample - loss: 0.1820 - accuracy: 0.9533 - val_loss: 0.2977 - val_accuracy: 0.9178
binary mode test accuracy: 0.8296601
Train on 10182 samples, validate on 1132 samples
Epoch 1/5
10182/10182 [==============================] - 7s 696us/sample - loss: 2.6556 - accuracy: 0.2591 - val_loss: 1.5433 - val_accuracy: 0.7253
Epoch 2/5
10182/10182 [==============================] - 5s 509us/sample - loss: 1.4116 - accuracy: 0.6343 - val_loss: 0.7099 - val_accuracy: 0.8507
Epoch 3/5
10182/10182 [==============================] - 4s 435us/sample - loss: 0.7594 - accuracy: 0.8127 - val_loss: 0.5012 - val_accuracy: 0.8781
Epoch 4/5
10182/10182 [==============================] - 5s 446us/sample - loss: 0.4919 - accuracy: 0.8747 - val_loss: 0.4105 - val_accuracy: 0.8905
Epoch 5/5
10182/10182 [==============================] - 4s 437us/sample - loss: 0.4235 - accuracy: 0.9159 - val_loss: 0.3882 - val_accuracy: 0.8949
count mode test accuracy: 0.8232873
Train on 10182 samples, validate on 1132 samples
Epoch 1/5
10182/10182 [==============================] - 5s 500us/sample - loss: 2.2481 - accuracy: 0.3538 - val_loss: 0.7963 - val_accuracy: 0.8507
Epoch 2/5
10182/10182 [==============================] - 4s 393us/sample - loss: 0.8688 - accuracy: 0.7658 - val_loss: 0.4372 - val_accuracy: 0.8931
Epoch 3/5
10182/10182 [==============================] - 4s 388us/sample - loss: 0.4344 - accuracy: 0.8865 - val_loss: 0.3540 - val_accuracy: 0.9108
Epoch 4/5
10182/10182 [==============================] - 4s 392us/sample - loss: 0.2918 - accuracy: 0.9272 - val_loss: 0.3376 - val_accuracy: 0.9081
Epoch 5/5
10182/10182 [==============================] - 4s 414us/sample - loss: 0.2167 - accuracy: 0.9485 - val_loss: 0.3181 - val_accuracy: 0.9090
tfidf mode test accuracy: 0.8360329
Train on 10182 samples, validate on 1132 samples
Epoch 1/5
10182/10182 [==============================] - 5s 505us/sample - loss: 2.9761 - accuracy: 0.0909 - val_loss: 2.9225 - val_accuracy: 0.2367
Epoch 2/5
10182/10182 [==============================] - 4s 436us/sample - loss: 2.7270 - accuracy: 0.1906 - val_loss: 2.4198 - val_accuracy: 0.3931
Epoch 3/5
10182/10182 [==============================] - 5s 470us/sample - loss: 2.2311 - accuracy: 0.3121 - val_loss: 1.9362 - val_accuracy: 0.5936
Epoch 4/5
10182/10182 [==============================] - 5s 495us/sample - loss: 1.7848 - accuracy: 0.4584 - val_loss: 1.5035 - val_accuracy: 0.6528
Epoch 5/5
10182/10182 [==============================] - 4s 405us/sample - loss: 1.4103 - accuracy: 0.5782 - val_loss: 1.1839 - val_accuracy: 0.7253
freq mode test accuracy: 0.68016464
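Two numbers in the logs above are easy to check by hand: the train/validation split sizes and the ranking of the four modes. The sketch below copies the reported accuracies into a dict; the variable names are made up, and the split arithmetic assumes Keras truncates the product to an integer, which matches the logged 10182 / 1132.

```python
n_train_total = 11314
validation_split = 0.1

# validation_split=0.1 holds out 10% of the training samples for validation.
n_fit = int(n_train_total * (1 - validation_split))
n_val = n_train_total - n_fit

# Test accuracies copied from the run above.
test_accuracy = {
    "binary": 0.8296601,
    "count": 0.8232873,
    "tfidf": 0.8360329,
    "freq": 0.68016464,
}

# Modes sorted from best to worst test accuracy.
ranking = sorted(test_accuracy, key=test_accuracy.get, reverse=True)

print(n_fit, n_val)
for mode in ranking:
    print("{:<6} {:.4f}".format(mode, test_accuracy[mode]))
```

In this single run, `tfidf`, `binary` and `count` land within about 1.3 percentage points of each other, while `freq` is clearly behind after 5 epochs. Its loss was still falling at epoch 5, so more epochs might narrow the gap; that is not tested here.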

Output of data.head() :

                                               email  target
0  From: lerxst@wam.umd.edu (where's my thing)\nS...       7
1  From: guykuo@carson.u.washington.edu (Guy Kuo)...       4
2  From: twillis@ec.ecn.purdue.edu (Thomas E Will...       4
3  From: jgreen@amber (Joe Green)\nSubject: Re: W...       1
4  From: jcm@head-cfa.harvard.edu (Jonathan McDow...      14


월곡동 로봇팔의 대학원일지

DL : Keras : 20 Newsgroups classification project
