code - 1
from sklearn.model_selection import train_test_split

def data_split(examples, labels, train_frac, random_state=None):
    ''' https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
    param examples: Data to be split
    param labels: Labels corresponding to examples
    param train_frac: Ratio of train set to whole dataset
    param random_state: Seed passed to train_test_split for reproducibility

    Randomly split dataset, based on these ratios:
        'train': train_frac
        'valid': (1-train_frac) / 2
        'test':  (1-train_frac) / 2

    Eg: passing train_frac=0.8 gives a 80% / 10% / 10% split
    '''
    # A float train_size must lie strictly between 0 and 1
    assert 0 < train_frac < 1, "Invalid training set fraction"

    # First split off the training set, then halve the remainder into validation and test
    X_train, X_tmp, Y_train, Y_tmp = train_test_split(
        examples, labels, train_size=train_frac, random_state=random_state)
    X_val, X_test, Y_val, Y_test = train_test_split(
        X_tmp, Y_tmp, train_size=0.5, random_state=random_state)

    return X_train, X_val, X_test, Y_train, Y_val, Y_test
code - 2
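from sklearn.model_selection import train_test_split

# X, y: features and labels, assumed to be defined already
# Hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Split 20% of the remaining training data off as the validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)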
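There are broadly two approaches.
The first, as in code-1, is to define a new helper function.
The second, as in code-2, is to call the train_test_split function from sklearn.model_selection (formerly sklearn.cross_validation, which was removed in scikit-learn 0.20) twice:
first split the data into train and test, then split the train portion again into train and validation. Note that with test_size=0.2 in both calls, the final proportions are 64% train, 16% validation, and 20% test, not 80% / 10% / 10%.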
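To check what proportions the two-step split in code-2 actually produces, here is a minimal sketch on toy data. The dataset below (1000 rows of made-up features and alternating labels) and the `sizes` dictionary are assumptions for illustration only; the split calls mirror code-2.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset: 1000 examples with two features each, and binary labels.
X = [[i, i + 1] for i in range(1000)]
y = [i % 2 for i in range(1000)]

# Step 1: hold out 20% of all data as the test set (1000 -> 800 train / 200 test).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Step 2: hold out 20% of the remaining training data as the validation set
# (800 -> 640 train / 160 validation).
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=1)

sizes = {"train": len(X_train), "valid": len(X_val), "test": len(X_test)}
print(sizes)  # {'train': 640, 'valid': 160, 'test': 200}
```

If an 80% / 10% / 10% split is wanted instead, the second call would need to split the held-out 20% in half (as code-1 does) rather than splitting the training portion again.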