Dataset 불러오기

Pytorch에서 제공하는 Dataset, Dataloader를 이용하여 데이터셋을 불러오고 모델에 적용시킬 수 있도록 하는 것이 이번 포스팅의 목적이다.

데이터 로드를 위한 Class 만들기

custom dataset을 만들기 위해서는 class 형태로 만드는 것이 규칙처럼 되어있다. 이 class를 구성하는 함수에는 크게 3가지가 있다.

•

__init__(self)

◦

data를 초기화하고 저장된 위치에서 불러오는 함수이다.

•

__getitem__(self, index)

◦

특정 데이터에 접근하기 위해 사용하는 함수이다.

•

__len__(self)

◦

dataset의 size를 알려주는 함수이다.

실제 코드를 보면서 확인해보자

import torch
import torchvision
from torch.utils.data import Dataset, DataLoader
import numpy as np
import math

class WineDataset(Dataset):

    # 1) data 초기화 및 불러오기
    def __init__(self):
				# delimiter = 구분요소 / skiprows = dataset이 아닌부분 무시
        xy = np.loadtxt('./path/to/wine.csv', delimiter=',', dtype=np.float32, skiprows=1)
        self.n_samples = xy.shape[0]

				# class 와 features label을 구별하고 torch tensor 형태로 만들자
        self.x_data = torch.from_numpy(xy[:, 1:]) # size [n_samples, n_features]
        self.y_data = torch.from_numpy(xy[:, [0]]) # size [n_samples, 1]

		# 2) 특정 데이터에 접근하는 함수
    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]

    # 3) dataset의 길이 출력하는 함수
    def __len__(self):
        return self.n_samples

dataset = WineDataset()

# Load whole dataset with DataLoader
# shuffle: shuffle data, good for training
# num_workers: faster loading with multiple subprocesses
train_loader = DataLoader(dataset=dataset,
                          batch_size=4,
                          shuffle=True,
                          num_workers=2)

# Training Loop 돌리기
num_epochs = 2
total_samples = len(dataset)
n_iterations = math.ceil(total_samples/4) #올림 연산 -> 자연수로 만듬

print(total_samples, n_iterations) #전체 data 개수 및 iteration 나타냄

for epoch in range(num_epochs):

# enumerate() 함수는 인자로 넘어온 목록을 기준으로 인덱스와 원소를 
# 차례대로 접근하게 해주는 반복자(iterator) 객체를 반환해주는 함수입니다.
    for i, (inputs, labels) in enumerate(train_loader):
        
        # here: 178 samples, batch_size = 4, n_iters=178/4=44.5 -> 45 iterations
        if (i+1) % 5 == 0:
            print(f'Epoch: {epoch+1}/{num_epochs}, Step {i+1}/{n_iterations}| Inputs {inputs.shape} | Labels {labels.shape}')
Python
복사

Dataset And Dataloader - PyTorch Beginner 09 | Python Engineer

Learn all the basics you need to get started with this deep learning framework! In this part we see how we can use the built-in Dataset and DataLoader classes and improve our pipeline with batch training. See how we can write our own Dataset class and use available built-in datasets.

https://www.python-engineer.com/courses/pytorchbeginner/09-dataset-and-dataloader/