Feature Extract : Speech [Music Genre Classification] (4)

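1. Playing the music data

# Imports used throughout this notebook
import os

import librosa
import numpy as np
import pandas as pd
from tqdm import tqdm

# With IPython.display you can listen to the audio clips used for training.
from IPython.display import Audio, display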

filename = '/kaggle/input/2024-ml-project4/train/train_000.wav'
y, sr = librosa.load(filename)
audio_wdt = Audio(data=y,rate=sr)
display(audio_wdt)

2. Loading the csv files used for training and submission

# Load the csv file with the ground-truth genre of each training clip
train_info_csv = pd.read_csv('train_labels.csv')

# Load the csv file used for submission
submit = pd.read_csv('submit.csv')
test_info_csv = submit['id']

# Check which genres have to be classified
train_info_csv['genre'].unique()

3. Loading the data and extracting handcrafted features

Empty Module #1: extract_rhythm_features()

The rhythm domain offers two tempogram features: the autocorrelation tempogram and the Fourier tempogram. The baseline function in this module extracts only the autocorrelation tempogram.
The goal of this first Empty Module is to compute the feature from the time series (y, sr) returned by librosa.load().

(1-1) librosa.load(): load the file_path passed to extract_rhythm_features() as a time series sampled at 22050 samples per second.
(1-2) librosa.onset.onset_strength(): pass the loaded time series to the function to extract onset_envelope.
(1-3) librosa.feature.tempogram(): pass onset_envelope and sr to the function to extract the autocorrelation tempogram.
(1-4) Average the autocorrelation tempogram over the time axis and take the absolute value to obtain tempogram_feature. The autocorrelation tempogram is already real-valued, so the absolute value only matters for the complex-valued Fourier tempogram.

Refer to the function that follows for the details of each step.

# Empty Module #1
def extract_rhythm_features(file_path):
    # Declare the feature list to return
    feature = []

    # (1-1) librosa.load(): load file_path as a time series sampled at 22050 samples per second.
    y, sr = librosa.load(file_path, sr=22050)

    # (1-2) librosa.onset.onset_strength(): extract onset_envelope from the loaded time series.
    onset_envelope = librosa.onset.onset_strength(y=y, sr=sr)

    # autocorrelation tempogram
    # (1-3) librosa.feature.tempogram(): pass onset_envelope and sr to extract the autocorrelation tempogram. Check the function's keyword arguments carefully.
    tempogram = librosa.feature.tempogram(onset_envelope=onset_envelope, sr=sr)  # (384, 1293)

    # (1-4) Average over the time axis and take the absolute value to obtain tempogram_feature.
    # The absolute value is only needed for complex input such as the Fourier tempogram.
    tempogram_feature = np.abs(np.mean(tempogram, axis=1))  # (384, 1293) -> (384, )

    feature = tempogram_feature

    return feature

Empty Module #2: extract_spectral_features()

The second Empty Module extracts two features from the spectral domain: MFCC and the chromagram.
Here the goal is to obtain spectrogram, power_spectrogram, melspectrogram, and melspectrogram_db in sequence with librosa functions,
and to compute the two features from those intermediate results.

(2-1) librosa.load(): load the file_path passed to extract_spectral_features() as a time series sampled at 22050 samples per second.
(2-2) librosa.stft(): pass the loaded time series to the function to compute the STFT, then take the absolute value to obtain spectrogram.
(2-3) Square spectrogram to obtain power_spectrogram.
(2-4) librosa.feature.melspectrogram(): pass power_spectrogram to the function to extract melspectrogram.
(2-5) librosa.power_to_db(): pass melspectrogram to the function to obtain melspectrogram_db, converted to the dB scale.
(2-6) librosa.feature.chroma_stft(): pass the power_spectrogram from (2-3) to the function to extract the chromagram.
(2-7) Average the chromagram over the time axis so it can be used as a feature.
(2-8) librosa.feature.mfcc(): pass the melspectrogram_db from (2-5) to the function to extract the MFCCs.
(2-9) Average the MFCCs over the time axis so they can be used as a feature.
(2-10) Concatenate the chromagram and MFCC vectors so they can be returned as a single feature.

Refer to the function that follows for the details of each step.
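
Steps (2-3) and (2-5) are plain arithmetic on the spectrogram. The following NumPy sketch illustrates them on toy values. It mirrors what I understand to be the default behaviour of librosa.power_to_db() (ref=1.0, amin=1e-10, top_db=80.0); the function name power_to_db_sketch and the toy array are assumptions for illustration, not part of the baseline.

```python
import numpy as np


def power_to_db_sketch(S, ref=1.0, amin=1e-10, top_db=80.0):
    """Convert a power spectrogram to decibels: 10 * log10(S / ref)."""
    S = np.asarray(S, dtype=float)
    log_spec = 10.0 * np.log10(np.maximum(amin, S))
    log_spec -= 10.0 * np.log10(np.maximum(amin, ref))
    # Clip everything that lies more than top_db below the peak.
    return np.maximum(log_spec, log_spec.max() - top_db)


magnitude = np.array([[1.0, 10.0], [0.1, 0.0]])  # toy |STFT| values
power = magnitude ** 2                           # step (2-3)
power_db = power_to_db_sketch(power)             # step (2-5)
print(power_db)
```

A tenfold increase in magnitude corresponds to +20 dB in power, and the silent bin is clipped to 80 dB below the peak rather than going to minus infinity.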

# Empty Module #2
def extract_spectral_features(file_path):
    # Declare the feature list to return
    feature = []

    # (2-1) librosa.load(): load file_path as a time series sampled at 22050 samples per second.
    y, sr = librosa.load(file_path, sr=22050)

    # (2-2) librosa.stft(): compute the STFT of the time series and take the absolute value to obtain a real-valued spectrogram.
    # * Set the hyperparameter n_fft so that each analysis window spans about 93 ms. See the official documentation for the meaning of n_fft.
    # * int(0.093 * 22050) evaluates to 2050; the power of two 2048 (about 92.9 ms) is the more common choice.
    n_fft = int(0.093 * sr)
    spectrogram = np.abs(librosa.stft(y, n_fft=n_fft))

    # (2-3) Square the spectrogram to obtain power_spectrogram.
    power_spectrogram = spectrogram**2

    # (2-4) librosa.feature.melspectrogram(): pass power_spectrogram to extract melspectrogram.
    melspectrogram = librosa.feature.melspectrogram(S=power_spectrogram, sr=sr)

    # (2-5) librosa.power_to_db(): convert melspectrogram to the dB scale to obtain melspectrogram_db.
    melspectrogram_db = librosa.power_to_db(melspectrogram)

    # chromagram
    # (2-6) librosa.feature.chroma_stft(): pass the power_spectrogram from (2-3) to extract the chromagram. Check the function's keyword arguments carefully.
    chromagram = librosa.feature.chroma_stft(S=power_spectrogram, sr=sr)  # (12, 1293)

    # (2-7) Average the chromagram over the time axis to obtain chromagram_feature.
    chromagram_feature = np.mean(chromagram, axis=1)  # (12, )

    # mfcc
    # (2-8) librosa.feature.mfcc(): pass the melspectrogram_db from (2-5) to extract the MFCCs. Check the function's keyword arguments carefully.
    mfcc = librosa.feature.mfcc(S=melspectrogram_db, sr=sr, n_mfcc=20)  # (20, 1293)

    # (2-9) Average the MFCCs over the time axis to obtain mfcc_feature.
    mfcc_feature = np.mean(mfcc, axis=1)  # (20, )

    # (2-10) Concatenate the chromagram and MFCC vectors into a single feature.
    feature = np.concatenate((chromagram_feature, mfcc_feature), axis=0)

    return feature
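
The relation between n_fft and the 93 ms window in step (2-2) is simple arithmetic: the window duration is n_fft / sr seconds, and the STFT has n_fft // 2 + 1 frequency bins. The short sketch below compares the value computed in the function with the power of two 2048; the helper name window_ms is an assumption for illustration.

```python
sr = 22050


def window_ms(n_fft, sr):
    """Duration of one analysis window in milliseconds."""
    return 1000.0 * n_fft / sr


n_fft_baseline = int(0.093 * sr)  # value computed in extract_spectral_features()
n_fft_pow2 = 2048                 # nearest power of two

for n_fft in (n_fft_baseline, n_fft_pow2):
    # n_fft, window length in ms, number of frequency bins
    print(n_fft, round(window_ms(n_fft, sr), 2), n_fft // 2 + 1)
```

Both values give a window of roughly 93 ms. Powers of two are usually preferred because FFT implementations tend to be fastest for them.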

def feature_loader(data_info, split=None, rootpath=None, domain=None):
    split = split.upper()
    info_dict = {}
    
    if split=='TRAIN':
        train_path = os.path.join(rootpath, 'train')
        file_list = data_info['id']
        label_list = data_info['genre']
        
        for file, label in zip(tqdm(file_list), label_list):
            # Skip the corrupted wav file
            if file == 'train_412.wav':
                continue   
                
            file_dict = {}     
            file_dict['label'] = label
            
            file_path = os.path.join(train_path, file)
            if domain == 'spectral':
                features = extract_spectral_features(file_path)
            elif domain == 'rhythm':
                features = extract_rhythm_features(file_path)
            else:
                raise Exception("Check domain")
                
            file_dict['features'] = features
            info_dict[file] = file_dict
            
        return info_dict
        
    elif split=='TEST':
        test_path = os.path.join(rootpath, 'test')
        file_list = data_info
        
        for file in tqdm(file_list):
            file_dict = {}
            file_path = os.path.join(test_path, file)
            
            if domain == 'spectral':
                features = extract_spectral_features(file_path)
            elif domain == 'rhythm':
                features = extract_rhythm_features(file_path)
            else:
                raise Exception("Check domain")
                
            file_dict['features'] = features
            info_dict[file] = file_dict
            
        return info_dict
    
    else:
        raise Exception("Check split")

rootpath = 'music_genre_classification'

# 'spectral' or 'rhythm'
domain = 'spectral'

# Extract the features of the selected domain; a dictionary is returned
train_data = feature_loader(train_info_csv, split='train', rootpath=rootpath, domain=domain)
test_data = feature_loader(test_info_csv, split='test', rootpath=rootpath, domain=domain)
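
The loader extracts one domain per call. If both domains are wanted at once, one option is to call it twice and concatenate the per-file vectors. The sketch below assumes dictionaries shaped like the return value of feature_loader(); the helper name combine_feature_dicts, the placeholder label, and the zero/one arrays are hypothetical stand-ins for real extracted features.

```python
import numpy as np


def combine_feature_dicts(first, second):
    """Concatenate the 'features' vectors of two loader dictionaries file by file."""
    combined = {}
    for name, entry in first.items():
        merged = dict(entry)  # keeps 'label' when it is present
        merged['features'] = np.concatenate(
            (entry['features'], second[name]['features']), axis=0
        )
        combined[name] = merged
    return combined


# Placeholder dictionaries with the baseline feature sizes
# (12 chroma + 20 MFCC = 32 spectral values, 384 tempogram values).
spectral_data = {'train_000.wav': {'label': 'genre_a', 'features': np.zeros(32)}}
rhythm_data = {'train_000.wav': {'label': 'genre_a', 'features': np.ones(384)}}

combined = combine_feature_dicts(spectral_data, rhythm_data)
print(combined['train_000.wav']['features'].shape)
```

Because the two blocks have very different sizes and value ranges, scaling each block before concatenation may be worth trying for classifiers that are sensitive to feature scale.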

4. Building the train and test data and designing the classifier

# Collect the features and labels stored in the dictionary into lists for training.

x_train = []
y_train = []

for key in train_data.keys():
    x_train.append(train_data[key]['features'])
    y_train.append(train_data[key]['label'])
    
x_train = np.asarray(x_train) # (799, feature_size)
y_train = np.asarray(y_train) # (799, )

# The genres are strings, so apply label encoding before training.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_train = le.fit_transform(y_train)
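
As a small illustration of what the encoder does, the sketch below runs LabelEncoder on a few hypothetical genre strings (they are made-up examples, not the actual label set of the dataset). Classes are sorted alphabetically and mapped to integers, and inverse_transform() restores the original strings.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, used only to show the mapping.
labels = np.asarray(['rock', 'jazz', 'blues', 'rock'])

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)
decoded = le_demo.inverse_transform(encoded)

print(list(le_demo.classes_))  # alphabetically sorted class names
print(encoded.tolist())        # integer codes in the original order
print(decoded.tolist())        # back to strings
```

The same fitted encoder has to be kept around so that integer predictions can be converted back to genre names before writing the submission file.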

# Collect the features stored in the dictionary into a list for testing.

x_test = []

for key in test_data.keys():
    x_test.append(test_data[key]['features'])
    
x_test = np.asarray(x_test) # (200, feature_size)

Empty Module #3: RandomForestClassifier(random_state=seed)

The third Empty Module designs the classifier that uses the extracted features.
The baseline classifier is RandomForestClassifier(random_state=seed); the baseline score is obtained without any hyperparameter tuning of the classifier.

Declare the classifier, fit it on the extracted features and labels, and then predict.

# Empty Module #3
from sklearn.ensemble import RandomForestClassifier

# Random seed for reproducibility (42 is an assumed value; the post does not show where seed is defined)
seed = 42

# Declare the classifier
rfc = RandomForestClassifier(random_state=seed)

# Fit the classifier
rfc.fit(x_train, y_train)

# Predict
pred_rfc = rfc.predict(x_test)

# Convert the integer predictions back to genre names and save the submission file
submit['genre'] = le.inverse_transform(pred_rfc)
submit.to_csv(f'{domain}_feature_baseline.csv', index=False)

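5. Checking performance on the train data (Optional)

The accuracy in this section is measured on the same data the classifier was trained on, so it overestimates how well the model generalizes. Setting aside a validation set and measuring performance on it is also a good approach.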

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

train_pred = rfc.predict(x_train)

print(f"Accuracy score: {accuracy_score(y_train, train_pred) * 100:.2f}%")
print(confusion_matrix(y_train, train_pred))
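
A hold-out validation set can be carved out with train_test_split(). The sketch below is a minimal example under stated assumptions: the random arrays x_demo and y_demo stand in for x_train and y_train, the 80/20 split ratio is arbitrary, and the labels are random on purpose so that the gap between train and validation accuracy is easy to see.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Random stand-ins for x_train / y_train (200 clips, 32 features, 4 classes).
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 32))
y_demo = rng.integers(0, 4, size=200)

# Stratify so that each class keeps the same proportion in both splits.
x_tr, x_val, y_tr, y_val = train_test_split(
    x_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

clf = RandomForestClassifier(random_state=0)
clf.fit(x_tr, y_tr)

train_acc = accuracy_score(y_tr, clf.predict(x_tr))
val_acc = accuracy_score(y_val, clf.predict(x_val))
print(f"train accuracy: {train_acc:.2f}, validation accuracy: {val_acc:.2f}")
```

A random forest can fit even random labels almost perfectly on its own training data, which is why the validation score is the more informative number when comparing feature sets.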

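6. Ways to improve performance

0) Understand what aspect of the audio data each baseline feature captures

1) Use other features provided by the librosa library

2) Combine the spectral and rhythm features used in the baseline

3) Build the final feature with a method other than the simple concatenation used in the baseline code

** There are many other approaches, such as tuning the hyperparameters of the audio feature extraction, data preprocessing, data augmentation, and classifier optimization, but given the purpose of the term project, please focus on 1).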
