
Data Preprocessing [Predicting Employees Who Want to Change Jobs] (2)

by BaekDaBang 2024. 6. 9.

1. Data Preprocessing

(1) Check the data types of the train and test data

train.info()

test.info()

(2) index, enrollee_id, city

These columns are treated as having no bearing on the prediction, so they are dropped.

# Drop index, enrollee_id, city (the first three columns)
x_train = x_train.iloc[:,3:]
x_test = x_test.iloc[:,3:]
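Slicing by position assumes the three columns always come first. As a minimal sketch on a made-up toy frame (the values and the `DROP_COLS` name are illustrative, not from the original data), the same columns can be dropped by name, which keeps working if the column order changes:

```python
import pandas as pd

# Toy frame: column names mirror the post, values are made up for illustration.
toy = pd.DataFrame({
    "index": [0, 1],
    "enrollee_id": [101, 102],
    "city": ["city_1", "city_2"],
    "gender": ["Male", "Female"],
    "training_hours": [36, 47],
})

# Hypothetical constant listing the columns to discard.
DROP_COLS = ["index", "enrollee_id", "city"]

# Drop by name; errors="ignore" skips any listed column that is absent.
by_name = toy.drop(columns=DROP_COLS, errors="ignore")

# Positional slicing as used in the post, for comparison.
by_position = toy.iloc[:, 3:]
```

On this toy frame both approaches leave the same two columns.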

 

(3) gender, relevent_experience, enrolled_university, education_level, major_discipline, company_type, training_hours

These columns have dtype object, so they are converted to integer labels with LabelEncoder.

from sklearn.preprocessing import LabelEncoder

# gender
le = LabelEncoder()
x_train['gender'] = le.fit_transform(x_train['gender'])
x_test['gender'] = le.transform(x_test['gender'])

# relevent_experience
le = LabelEncoder()
x_train['relevent_experience'] = le.fit_transform(x_train['relevent_experience'])
x_test['relevent_experience'] = le.transform(x_test['relevent_experience'])

# enrolled_university
le = LabelEncoder()
x_train['enrolled_university'] = le.fit_transform(x_train['enrolled_university'])
x_test['enrolled_university'] = le.transform(x_test['enrolled_university'])

# education_level
le = LabelEncoder()
x_train['education_level'] = le.fit_transform(x_train['education_level'])
x_test['education_level'] = le.transform(x_test['education_level'])

# major_discipline
le = LabelEncoder()
x_train['major_discipline'] = le.fit_transform(x_train['major_discipline'])
x_test['major_discipline'] = le.transform(x_test['major_discipline'])

# company_type
le = LabelEncoder()
x_train['company_type'] = le.fit_transform(x_train['company_type'])
x_test['company_type'] = le.transform(x_test['company_type'])

# training_hours
le = LabelEncoder()
x_train['training_hours'] = le.fit_transform(x_train['training_hours'])
x_test['training_hours'] = le.transform(x_test['training_hours'])
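The seven blocks above repeat one pattern: fit the encoder on the train column, then reuse it on the test column. A possible way to write that once is a small loop. This is only a sketch on toy data; `encode_columns` and the toy frames are hypothetical names introduced here, and it assumes every category in the test column also occurs in train, since `LabelEncoder.transform` raises an error on unseen labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data, made up for illustration.
toy_train = pd.DataFrame({
    "gender": ["Male", "Female", "Other"],
    "company_type": ["Pvt Ltd", "NGO", "Pvt Ltd"],
})
toy_test = pd.DataFrame({
    "gender": ["Female", "Male"],
    "company_type": ["NGO", "Pvt Ltd"],
})

def encode_columns(train, test, columns):
    """Fit one LabelEncoder per column on train and apply it to both frames."""
    train, test = train.copy(), test.copy()
    encoders = {}
    for col in columns:
        le = LabelEncoder()
        train[col] = le.fit_transform(train[col])  # learn the classes from train only
        test[col] = le.transform(test[col])        # reuse the same mapping on test
        encoders[col] = le                         # keep the encoder for later inspection
    return train, test, encoders

enc_train, enc_test, encoders = encode_columns(
    toy_train, toy_test, ["gender", "company_type"]
)
```

Keeping the fitted encoders in a dict makes it possible to look up `encoders["gender"].classes_` later and see which integer corresponds to which category.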

 

(4) experience

x_train['experience'].unique()

# experience (train)
xx = x_train['experience'].copy()

xx[xx == '>20'] = 21
xx[xx == '<1'] = 0
xx[xx.isnull()] = -1

x_train['experience'] = xx.astype('int64')

# experience (test)
xx = x_test['experience'].copy()

xx[xx == '>20'] = 21
xx[xx == '<1'] = 0
xx[xx.isnull()] = -1

x_test['experience'] = xx.astype('int64')
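Because the train and test blocks are identical, the recoding can also be wrapped in one function and applied to both. The sketch below is an alternative formulation on a toy series, not the post's code; `encode_experience` is a hypothetical helper name. It follows the same rules: '>20' becomes 21, '<1' becomes 0, and missing values become -1.

```python
import numpy as np
import pandas as pd

def encode_experience(s: pd.Series) -> pd.Series:
    """Map '>20' -> 21, '<1' -> 0, missing -> -1, and parse the rest as integers."""
    recoded = s.replace({">20": "21", "<1": "0"})
    # to_numeric parses the digit strings and leaves missing values as NaN.
    return pd.to_numeric(recoded).fillna(-1).astype("int64")

# Toy values shaped like the 'experience' column.
toy = pd.Series([">20", "<1", "5", np.nan], dtype=object)
encoded = encode_experience(toy)
```

With such a helper, the two blocks would reduce to one call each, for example `x_train['experience'] = encode_experience(x_train['experience'])`.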

 

(5) company_size

x_train['company_size'].unique()

# company_size (train)
xx = x_train['company_size'].copy()

xx[xx == '10000+'] = 1
xx[xx == '5000-9999'] = 2
xx[xx == '1000-4999'] = 3
xx[xx == '500-999'] = 4
xx[xx == '100-500'] = 5
xx[xx == '50-99'] = 6
xx[xx == '10/49'] = 7
xx[xx == '<10'] = 8
xx[xx.isnull()] = -1

x_train['company_size'] = xx.astype('int64')

# company_size (test)
xx = x_test['company_size'].copy()

xx[xx == '10000+'] = 1
xx[xx == '5000-9999'] = 2
xx[xx == '1000-4999'] = 3
xx[xx == '500-999'] = 4
xx[xx == '100-500'] = 5
xx[xx == '50-99'] = 6
xx[xx == '10/49'] = 7
xx[xx == '<10'] = 8
xx[xx.isnull()] = -1

x_test['company_size'] = xx.astype('int64')
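The size buckets form a fixed lookup table, so one way to express the same encoding is a dictionary passed to `Series.map`. This is a sketch on toy data; `SIZE_CODES` and `encode_company_size` are names introduced here for illustration. One behavioral difference to be aware of: with `map`, any string not in the dictionary also becomes -1, whereas the code above would fail at `astype('int64')` on an unexpected string.

```python
import numpy as np
import pandas as pd

# Same codes as in the post: larger companies get smaller numbers.
SIZE_CODES = {
    "10000+": 1,
    "5000-9999": 2,
    "1000-4999": 3,
    "500-999": 4,
    "100-500": 5,
    "50-99": 6,
    "10/49": 7,   # spelled with a slash, as in the post's code
    "<10": 8,
}

def encode_company_size(s: pd.Series) -> pd.Series:
    """Look up each bucket; missing or unknown values become -1."""
    return s.map(SIZE_CODES).fillna(-1).astype("int64")

# Toy values shaped like the 'company_size' column.
toy = pd.Series(["10000+", "10/49", "<10", np.nan], dtype=object)
encoded = encode_company_size(toy)
```

The same dictionary-plus-`map` pattern would apply to last_new_job with its own lookup table.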

 

(6) last_new_job

x_train['last_new_job'].unique()

# last_new_job (train)
xx = x_train['last_new_job'].copy()

xx[xx == '>4'] = 5
xx[xx == '4'] = 4
xx[xx == '3'] = 3
xx[xx == '2'] = 2
xx[xx == '1'] = 1
xx[xx == 'never'] = 0
xx[xx.isnull()] = -1

x_train['last_new_job'] = xx.astype('int64')

# last_new_job (test)
xx = x_test['last_new_job'].copy()

xx[xx == '>4'] = 5
xx[xx == '4'] = 4
xx[xx == '3'] = 3
xx[xx == '2'] = 2
xx[xx == '1'] = 1
xx[xx == 'never'] = 0
xx[xx.isnull()] = -1

x_test['last_new_job'] = xx.astype('int64')

 

 

(7) Check the data types of the train and test data again

x_train.info()

x_test.info()
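Reading the `info()` output by eye works, but the same check can be done programmatically. The sketch below, on made-up toy frames, lists any column that is still non-numeric and confirms that train and test share the same columns; `non_numeric_columns` is a hypothetical helper, not part of the original post.

```python
import pandas as pd

def non_numeric_columns(df: pd.DataFrame) -> list:
    """Return the names of columns whose dtype is not numeric."""
    return [col for col in df.columns if not pd.api.types.is_numeric_dtype(df[col])]

# Toy frames: 'company_size' was deliberately left unencoded in the train frame.
toy_train = pd.DataFrame({
    "experience": [21, 0, -1],
    "company_size": ["<10", "50-99", None],
})
toy_test = pd.DataFrame({
    "experience": [3, 5],
    "company_size": [8, 6],
})

leftover_train = non_numeric_columns(toy_train)
leftover_test = non_numeric_columns(toy_test)
same_columns = list(toy_train.columns) == list(toy_test.columns)
```

An empty list for both frames, together with matching column lists, suggests the data is ready to be passed to a scikit-learn estimator.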

 

2. Train & Test

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

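# Fit QDA on the preprocessed training data
clf = QuadraticDiscriminantAnalysis()
clf.fit(x_train, y_train)
# Predicted labels for the test set (the predictions are stored under the name y_test)
y_test = clf.predict(x_test).astype('int64')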
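The test set here has no labels to compare against, so one way to get a rough idea of model quality before predicting is to hold out part of the training data. The following is a sketch on synthetic data generated with `make_classification`, not the post's dataset; the split ratio and random seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for (x_train, y_train).
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=4, n_redundant=0, random_state=0
)

# Hold out 25% of the labeled rows for validation, keeping the class ratio.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = QuadraticDiscriminantAnalysis()
clf.fit(X_fit, y_fit)

# Score on rows the model has not seen during fitting.
val_pred = clf.predict(X_val)
val_acc = accuracy_score(y_val, val_pred)
```

With the real data, `X` and `y` would be replaced by `x_train` and `y_train`, and the model could be refit on the full training set before predicting on `x_test`.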