Logistic Regression


Load the data (Setosa removed, so this is binary classification)

import pandas as pd
import numpy as np
iris = pd.read_csv('iris_short2.csv')

Convert to a NumPy array

iris_np = iris.values

The rows are sorted by Species, so shuffle them randomly

np.random.shuffle(iris_np)
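
If the shuffle needs to be reproducible, a seeded generator can replace the global random state. The sketch below uses a small stand-in array; the name `demo` and the seed 303 are illustrative assumptions, not part of the original post.

```python
import numpy as np

# Stand-in for iris_np: 6 rows, 3 columns.
demo = np.arange(18).reshape(6, 3)
original_rows = {tuple(row) for row in demo}

# A seeded Generator makes the shuffle repeatable across runs.
rng = np.random.default_rng(303)
rng.shuffle(demo)  # shuffles rows (axis 0) in place; each row stays intact

shuffled_rows = {tuple(row) for row in demo}
```

Note that train_test_split also shuffles by default (shuffle=True), so the explicit shuffle mainly matters when the array is sliced by hand.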

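Independent / dependent variables

iris_features = iris_np[:,:-1].astype(float) # independent variables (the four measurements)
iris_labels = iris_np[:,-1] # dependent variable (Species)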

Label-encode the dependent variable

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
input_classes = ['versicolor','virginica']
le.fit(input_classes)
iris_labels = le.transform(iris_labels)
iris_labels


# array([1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
#        0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0,
#        1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
#        1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0,
#        1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0])
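
To see which integer stands for which species, the fitted encoder can be inspected. A minimal sketch using the same two class names as above; the `mapping`, `encoded` and `decoded` names are introduced here for illustration.

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(['versicolor', 'virginica'])

# classes_ is sorted, so the position in this array is the encoded integer.
mapping = {name: int(code) for code, name in enumerate(le.classes_)}

encoded = le.transform(['virginica', 'versicolor', 'virginica'])
decoded = le.inverse_transform(encoded)
print(mapping)
```

Because the classes are sorted alphabetically, versicolor becomes 0 and virginica becomes 1, which is why y=1 means virginica in the interpretation further down.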

train / test split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris_features, iris_labels, test_size=0.3, random_state=303)
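
With only 100 rows, the class ratio in a 30-row test set can drift by chance. The sketch below shows the optional stratify argument on stand-in data; the arrays and the assumption of 50 samples per class are illustrative, not taken from the CSV.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 100 rows, 4 features, 50 samples per class (assumed).
X = np.arange(400, dtype=float).reshape(100, 4)
y = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=303, stratify=y
)

# With stratify=y both splits keep the 50/50 class ratio.
test_counts = np.bincount(y_test)
print(X_train.shape, X_test.shape, test_counts)
```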

Logistic model(no penalty)

from sklearn.linear_model import LogisticRegression

# penalty=None: no regularization
# (scikit-learn releases before 1.2 take the string penalty='none' instead)
lr_base = LogisticRegression(penalty=None)

lr_base.fit(X_train, y_train)
# LogisticRegression(penalty=None)

lr_base.coef_
# array([[-647.48802098, -582.10834216,  961.09948106,  766.1531375 ]])

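Interpreting the signs is what matters here.
b1 and b2 are negative, so with the other features held fixed, a larger sepal length or width lowers the probability of y=1 (virginica);
b3 and b4 are positive, so a larger petal length or width raises the probability of y=1 (virginica).

Note: a coefficient with a large absolute value suggests that its feature has a strong influence on the prediction, but magnitudes are only comparable across features measured on the same scale. The very large values here (in the hundreds) most likely reflect fitting without regularization on classes that are nearly linearly separable, so read the signs rather than the magnitudes.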
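
The sign interpretation follows from the model form P(y=1 | x) = sigmoid(b0 + b·x). The sketch below checks this against predict_proba and shows the effect of raising one feature. It assumes scikit-learn's bundled iris data as a stand-in for iris_short2.csv, and it uses the default L2-regularized model (not the unpenalized one above) so that the coefficients stay moderate.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumption: the bundled iris data stands in for iris_short2.csv.
data = load_iris()
mask = data.target != 0            # drop setosa (class 0)
X = data.data[mask]
y = data.target[mask] - 1          # versicolor -> 0, virginica -> 1

model = LogisticRegression(max_iter=1000).fit(X, y)
w, b = model.coef_[0], model.intercept_[0]

# P(y=1 | x) = sigmoid(b + w . x); this should match predict_proba.
manual = 1.0 / (1.0 + np.exp(-(X @ w + b)))
sklearn_proba = model.predict_proba(X)[:, 1]

# Raise the feature with the largest coefficient by one unit for the first sample.
j = int(np.argmax(w))
x_before = X[0].copy()
x_after = x_before.copy()
x_after[j] += 1.0
p_before, p_after = model.predict_proba([x_before, x_after])[:, 1]

# exp(w_j) is the factor by which the odds of y=1 change per unit of feature j.
odds_ratio = np.exp(w[j])
print(p_before, p_after, odds_ratio)
```

A positive coefficient raises the probability of virginica when its feature grows, a negative one lowers it, and exp(coefficient) gives the size of the effect on the odds scale.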

test

lr_base.score(X_test, y_test)

y_predictions = lr_base.predict(X_test)

y_predictions
# array([0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0,
#        0, 0, 0, 0, 1, 0, 1, 0])

accuracy

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_predictions)
# 0.9666666666666667
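
Accuracy is simply the fraction of matching predictions, so 0.9667 on 30 test rows means 29 correct and 1 wrong; lr_base.score returns the same mean accuracy. The sketch below uses hypothetical labels, not the post's data, to show the calculation by hand next to a confusion matrix.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 1, 0])

manual_accuracy = np.mean(y_true == y_pred)   # 5 correct out of 6
sk_accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(manual_accuracy, sk_accuracy)
print(cm)
```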

Logistic model(penalty : L1)

# C: 1 / penalty strength (lambda)
# solver: optimization algorithm (saga: a variant of Stochastic Average Gradient that supports the L1 penalty)

lr_p1 = LogisticRegression(C=0.3, penalty='l1', solver='saga', max_iter=1000)

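C is the inverse of the regularization strength, so a smaller C means a stronger penalty.
With L1 (Lasso), a stronger penalty drives more coefficients to exactly zero, which effectively removes those features;
with L2 (Ridge), it shrinks the weights closer to zero, typically without making them exactly zero.
multi_class='ovr' (one vs. rest) computes, for each data point, the probability that it belongs to a given class versus the probability that it does not.
solver selects the algorithm used to search for the optimal parameters.

saga is a variant of SAG (Stochastic Average Gradient), not plain stochastic gradient descent, and it also supports the L1 penalty.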
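
The effect of C on an L1 model can be seen by counting non-zero coefficients over a range of values. This is a sketch under assumptions: scikit-learn's bundled iris data stands in for iris_short2.csv, the features are standardized to help saga converge, and the C grid is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris data stands in for iris_short2.csv.
data = load_iris()
mask = data.target != 0                      # drop setosa
X = StandardScaler().fit_transform(data.data[mask])
y = data.target[mask] - 1                    # versicolor -> 0, virginica -> 1

c_values = [0.01, 0.1, 0.3, 1.0, 10.0]       # arbitrary grid, small C = strong penalty
nonzero_counts = []
for c in c_values:
    model = LogisticRegression(C=c, penalty='l1', solver='saga', max_iter=5000)
    model.fit(X, y)
    nonzero_counts.append(int(np.count_nonzero(model.coef_)))

for c, n in zip(c_values, nonzero_counts):
    print(f"C={c:<5} non-zero coefficients: {n}")
```

Smaller values of C should generally leave fewer non-zero coefficients, which is the feature-removal behaviour described above.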

model fit

lr_p1.fit(X_train, y_train)

y_predictions = lr_p1.predict(X_test)

lr_p1.coef_
# array([[0.        , 0.        , 2.45410396, 0.18317547]])
# b1 and b2 are eliminated (set exactly to zero)

lr_p1.score(X_test, y_test)
# 0.9333333333333333

Logistic model(penalty : L2)

lr_p2 = LogisticRegression(C=0.5, penalty='l2', solver='sag', max_iter=1000)

lr_p2.fit(X_train, y_train)

lr_p2.score(X_test, y_test)

lr_p2.coef_
# array([[-0.02618001, -0.31438273,  2.02752722,  1.41780166]])
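
Unlike L1, the L2 penalty shrinks all coefficients toward zero rather than removing any. The sketch below compares the coefficient norm for three values of C. It assumes scikit-learn's bundled iris data as a stand-in for iris_short2.csv and uses the default lbfgs solver instead of sag; the C grid is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumption: the bundled iris data stands in for iris_short2.csv.
data = load_iris()
mask = data.target != 0            # drop setosa
X = data.data[mask]
y = data.target[mask] - 1          # versicolor -> 0, virginica -> 1

c_values = [0.01, 0.5, 100.0]      # arbitrary grid, small C = strong penalty
norms = []
for c in c_values:
    # L2 is the default penalty; the default lbfgs solver is used here.
    model = LogisticRegression(C=c, max_iter=5000).fit(X, y)
    norms.append(float(np.linalg.norm(model.coef_)))

for c, n in zip(c_values, norms):
    print(f"C={c:<6} ||coef||_2 = {n:.4f}")
```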
