Understanding Dataset Shift (미완성)

Dataset Shift란? Training dataset과 test dataset이 다른 분포를 가지는 것을 말한다.

Definition 1. Dataset shift appears when training and test joint distributions are different. That is $\mathrm{P}_{\text {tra }}(\mathrm{y}, \mathrm{x}) \neq \mathrm{P}_{\text {tst }}(\mathrm{y}, \mathrm{x})$

•

Kaggle Challenges 에도 자주 등장하는 문제상황 중 하나이다.

•

dataset shift 이외에도 concept shift, changing of environments .. 등등으로 불리운다.

•

dataset shift의 문제는 아래와 같은 이유들 떄문에 발생한다.
(1) input features가 이용되는 방법에 따라
(2) Training datasets과 test datasets를 뽑는 방법에 따라
(3) 데이터 희소성
(4) 비정상적인 환경으로 인한 데이터 분포 변화
(5) network상의 activation pattern의 변화 

왜 dataset shift 문제는 중요할까요?

dataset shift를 의심할수 있는 이상징후들

1. Covariate Shift

2. Prior Probability Shift

covariate shift가 feature 분포 xxx의 변화에 집중했다면, (입력의 분포가 어떻게 변함?)
Prior Probability Shift는 class 변수 yyy의 변화에 집중한다. (출력의 분포가 어떻게 변함?)

Definition 3. Prior probability shift appears only in $\mathrm{Y} \rightarrow \mathrm{X}$ problems, and is defined as the case where $\mathrm{P}_{\mathrm{tra}}(\mathrm{x} \mid \mathrm{y})=\mathrm{P}_{\mathrm{tst}}(\mathrm{x} \mid \mathrm{y}) \text { and } P_{\mathrm{tra}}(y) \neq P_{\mathrm{tst}}(y)$

3. Concept Drift

Definition 4. Concept shift is defined as

\bullet P_{\text {tra }}(y \mid x) \neq P_{\text {tst }}(y \mid x) \text { and } P_{\text {tra }}(x)=P_{\text {tst }}(x) \text { in } X \rightarrow Y \text { problems. } \\ \bullet P_{\text {tra }}(x \mid y) \neq P_{\text {tst }}(x \mid y) \text { and } P_{\text {tra }}(y)=P_{\text {tst }}(y) \text { in } Y \rightarrow X \text { problems. }

Dataset Shift의 원인

1. Sample selection bias

데이터를 다루거나 알고리즘을 조정할 때 생기는 편향이 아니고 데이터 수집이나 라벨링 과정에서 생기는 결함을 의미합니다. 모집단에서 일정한 형태가 아닌 방법으로 뽑기 때문에 필연적으로 학습 도중 편향이 생길 수 밖에 없습니다.

Definition 5. Sample selection bias, in general, causes the data in the training set to follow $P_{\text {tra }}=P(s=1 \mid x, y)$ , while the data in the test set follows $P_{\text {tst }}=P(y, x)$ . where s is a binary selection variable that decides whether a datum is included in the training sample process( $s$ = 1) or rejected from it ( $s$ = 0)

\mathrm{P}_{\text {tra }}=\mathrm{P}(\mathrm{s}=1 \mid \mathrm{y}, \mathrm{x}) \mathrm{P}(\mathrm{y} \mid \mathrm{x}) \mathrm{P}(\mathrm{x}) \text { and } \mathrm{P}_{\mathrm{tst}}=\mathrm{P}(\mathrm{y} \mid \mathrm{x}) \mathrm{P}(\mathrm{x}) \text { in } \mathrm{X} \rightarrow \mathrm{Y} \text { problems. }\\ \mathrm{P}_{\text {tra }}=\mathrm{P}(\mathrm{s}=1 \mid \mathrm{y}, \mathrm{x}) \mathrm{P}(\mathrm{x} \mid \mathrm{y}) \mathrm{P}(\mathrm{y}) \text { and } \mathrm{P}_{\mathrm{tst}}=\mathrm{P}(\mathrm{x} \mid \mathrm{y}) \mathrm{P}(\mathrm{y}) \text { in } \mathrm{Y} \rightarrow \mathrm{X} \text { problems. }

이러한 현상은 특히 개체 수가 불균형한 분류 과정일 때 큰 영향을 준다. 소수의 class들은 수가 적기 때문에 하나의 분류 오류에도 민감하게 반응하게 된다. 이러한 경우 퍼포먼스에 큰 영향을 주게 된다.

2. Non-Stationary Environments

실제 상황에서, data 자체가 정적이지 않은 경우가 많다. 스팸 필터링과 같은 adversarial classification 문제에서 야기됩니다.

.가장 관련성이 높은 비상정 시나리오 중 하나는 스팸 필터링 및 네트워크 침입 탐지와 같은 적대적 분류 문제를 포함한다.이러한 유형의 문제는 머신러닝 분야에서 점점 더 많은 관심을 받고 있으며, 기존 분류기의 학습된 개념을 중심으로 작업하려는 적이 존재하기 때문에 대개 정지하지 않은 환경에 대처한다. 기계 학습 과제 측면에서, 이 상대는 테스트 세트가 훈련 세트와 달라지도록 뒤틀려서 가능한 모든 종류의 데이터 세트 이동을 도입한다.