슬램과 함께 춤을

Linear Algebra

2. Least Squares Problem

슬램과 함께 춤을

Linear Algebra

2. Least Squares Problem

👞

2. Least Squares Problem

태그

Least Squares

Gram-Schmidt

QR-decomposition

Inner Product

벡터

\mathbf{u}, \mathbf{v} \in \mathbb{R}^{n}

가 주어졌을 때 각 벡터는

n \times 1

행렬로 생각할 수 있고,

\mathbf{u}^{T}

는

1 \times n

행렬로 생각할 수 있다.

\mathbf{u}^{T}\mathbf{v}

를 inner product 혹은 dot product 로 정의하며

\mathbf{u} \cdot \mathbf{v}

로 표기한다.

inner product는 아래와 같은 성질을 만족한다. (

\mathbf{u}, \mathbf{v}

는 벡터,

c

는 스칼라)

\begin{array}{l}\mathbf{u} \cdot \mathbf{v}=\mathbf{v} \cdot \mathbf{u} \\(\mathbf{u}+\mathbf{v}) \cdot \mathbf{w}=\mathbf{u} \cdot \mathbf{w}+\mathbf{v} \cdot \mathbf{w} \\(c \mathbf{u}) \cdot \mathbf{v}=c(\mathbf{u} \cdot \mathbf{v})=\mathbf{u} \cdot(c \mathbf{v}) \\\mathbf{u} \cdot \mathbf{u} \geq \mathbf{0}, \text { and } \mathbf{u} \cdot \mathbf{u}=\mathbf{0} \text { if and only if } \mathbf{u}=\mathbf{0}\end{array}

또한 inner product는 각 벡터의 norm과 angle을 이용해 표현할 수도 있다.

\mathbf{u} \cdot \mathbf{v}=\|\mathbf{u}\|\|\mathbf{v}\| \cos \theta

이 때, 만약

\mathbf{u} \cdot \mathbf{v}=0

을 만족한다면 서로 orthogonal 하다고 한다.

\theta = 90\degree

를 만족하기 때문에 내적값이 0이 된 것이며 벡터

\mathbf{u}, \mathbf{v}

는 서로 수직이다.

Vector Norm

\mathbf{v} \in \mathbb{R}^{n}

에 대하여, length (혹은 norm)은

\mathbf{v} \cdot \mathbf{v}

의 제곱근으로 정의되며, 양수의 스칼라값으로 정의된다.

\|\mathbf{v}\|=\sqrt{\mathbf{v} \cdot \mathbf{v}}=\sqrt{v_{1}^{2}+v_{2}^{2}+\cdots+v_{n}^{2}} \text { and }\|\mathbf{v}\|^{2}=\mathbf{v} \cdot \mathbf{v}

이러한 norm은 기하학적으로 원점으로부터

\mathbf{v}

까지를 잇는 선의 길이를 의미한다.

또한 임의의 scalar 값

c

에 대해,

c\mathbf{v}

의 length는

\mathbf{v}

의 길이에

|c|

만큼 곱한 값이다.

\|c \mathbf{v}\|=|c|\|\mathbf{v}\|

참고로, length가 1인 벡터를 unit vector 라고 한다.

vector를 normalize 한다는 뜻은, 벡터

\mathbf{v}

가 주어졌을 때, 그것의 length로 나누어주는 것을말한다.

\mathbf{u}=\frac{1}{\|\mathbf{v}\|} \mathbf{v}

\mathbf{u}

는

\mathbf{v}

와 같은 방향을 가리키며, 길이만 다르게 된다. (

\mathbf{v}

의 length는 1)

Distance between Vectors in $\mathbb{R}^{n}$

벡터

\mathbf{u}, \mathbf{v} \in \mathbb{R}^{n}

에 대해 distance between

\mathbf{u}\,\,\text{and}\,\, \mathbf{v}

는 아래와 같이 정의됩니다.

\operatorname{dist}(\mathbf{u}, \mathbf{v})=\|\mathbf{u}-\mathbf{v}\|

Least Squares 문제

변수의 개수(

n

) 보다 방정식의 개수(

m

)가 더 많은 경우, Over-determined Linear System이다. 일반적으로 이러한 경우는 해가 없다. 하지만 우리는 근사적이라도 값을 구하고 싶다!

Overdetermined system

A \mathbf{x} \simeq \mathbf{b}

에 대하여, (

A \in \mathbb{R}^{m \times n}, \mathbf{b} \in \mathbb{R}^{m}, \text { and } m \gg n

) 이러한 시스템의 해

\hat{\mathbf{x}}

는 아래 와 같이 정의 된다.

\hat{\mathbf{x}}=\arg \min _{\mathbf{x}}\|\mathbf{b}-A \mathbf{x}\|

어떤

\mathbf{x}

를 선택하더라도, 벡터

A \mathbf{x}

는

\text {Col } A

의 영역 안에 존재하게 될 것이다. 따라서, 우린 벡터

\mathbf{b}

와 가장 가깝게 위치하는

\text {Col } A

위의 한 점을 찾는것, 그 때의

\hat{\mathbf{x}}

를 찾는것이 목표이다.

위의 설명을 그림으로 표현하면 아래와 같다.

이 때, 점

\hat{\mathbf{b}}

은

\text {Col } A

에 존재하는 점 중 점

\mathbf{b}

와 가장 가까운 점이다. 이러한 조건 하에서 벡터

\mathbf{b}-A \widehat{\mathbf{x}}

는

\text {Col } A

와 반드시 orthogonal 하게 된다. 여기서 벡터

\mathbf{a}_n

은

\text {Col } A

를 구성하는 basis 벡터를 말한다.

(\mathbf{b}-A \hat{\mathbf{x}}) \perp\left(x_{1} \mathbf{a}_{1}+x_{2} \mathbf{a}_{2} \cdots+x_{p} \mathbf{a}_{n}\right) \text { for any vector } \mathbf{x}

이와 동일하게 서술하면,

\begin{array}{c}(\mathbf{b}-A \widehat{\mathbf{x}}) \perp \mathbf{a}_{1} \\(\mathbf{b}-A \hat{\mathbf{x}}) \perp \mathbf{a}_{2} \\\vdots \\(\mathbf{b}-A \widehat{\mathbf{x}}) \perp \mathbf{a}_{m}\end{array}

\begin{array}{c}\mathbf{a}_{1}^{\top}(\mathbf{b}-A \hat{\mathbf{x}})=0 \\\mathbf{a}_{2}^{\top}(\mathbf{b}-A \hat{\mathbf{x}})=0 \\\vdots \\\mathbf{a}_{m}^{\top}(\mathbf{b}-A \hat{\mathbf{x}})=0\end{array}

따라서 least squres problem을 푸는 식이 완성된다.

A^{\top}(\mathbf{b}-A \hat{\mathbf{x}})=0

Normal Equation

앞서 제시한 Least Squares Problem을 푸는 방법 중 하나가 바로 normal equation입니다. 주어진 least square problem

A \mathbf{x} \simeq \mathbf{b}

의 식을 풀기 위해 정답에 가장 가까운 해를 구할 것 입니다.

A^{\top} A \hat{\mathbf{x}}=A^{\top} \mathbf{b}

해를 구하기 위해 사용하는 식이 위에 제시한 normal equation 입니다.

이러한 형태는 새로운 Linear System

C \mathbf{x}=\mathbf{d}

의 꼴로 표현할 수 있습니다.

C=A^{\top} A \in \mathbb{R}^{n \times n} \quad\quad \mathbf{d}=A^{\top} \mathbf{b} \in \mathbb{R}^{n}

만약

C=A^{T} A

가 invertible 하다면, 그 해는 아래와 같이 구해질 수 있다.

\widehat{\mathbf{x}}=\left(A^{\top} A\right)^{-1} A^{\top} \mathbf{b}

따라서 앞서 주어진 그림을 살펴보았을 때, 점

\mathbf{b}

에서

\operatorname{Col} A

로의 orthogonal projection 과정을 아래와 같이 나타낼 수 있다.

\hat{\mathbf{b}}=f(\mathbf{b})=A \hat{\mathbf{x}}=A\left(A^{\top} A\right)^{-1} A^{\top} \mathbf{b}

만약 $C=A^{\top} A$ 가 invertible 하지 않다면?

•

CCC 가 invertible 하지 않기 위해서는 행렬 AAA 가 Linearly dependent 해야한다.
(즉, lineary dependent하면 invertible 하다)

•

이러한 경우 해는 항상 무수히 많은 경우가 된다! (→ 해가 없는 경우는 사실상 존재하지 않는다)

•

일반적으로, 현실 데이터셋을 다룰 때 invertible 하지 않은 경우는 없다! (거의 95% 이상)
데이터셋이 많아지면 많아질 수록 feature 간에 linearly dependent한 경우가 높은 확률로 사라지기 때문이다.

Orthogonal and Orthonormal Sets

벡터의 집합

\left\{\mathbf{u}_{1}, \ldots, \mathbf{u}_{p}\right\} \text { in } \mathbb{R}^{n}

에 대햐여, 각 벡터쌍들이

\mathbf{u}_{i} \cdot \mathbf{u}_{j}=0 \text { whenever } i \neq j

를 만족하면 orthogonal set이라고 정의한다.

벡터의 집합

\left\{\mathbf{u}_{1}, \ldots, \mathbf{u}_{p}\right\} \text { in } \mathbb{R}^{n}

에 대하여, orthogonal set이라고 가정했을 때, 각 벡터들이 unit vector이다면, orthonormal set이라고 정의한다.

벡터들이 Orthogonal하다면 linearly independent하다고 볼수 있지만, 벡터들이 linearly independent하다고 해서 Orthogonal 하다고 할수 없다.

Orthogonal Projection $\hat{\mathbf{y}}$ of $\mathbf{y}$

(1) line에 projection하기

one-dimensional subspace

L=\operatorname{Span}\{\mathbf{u}\}

에 projection을 한다면 아래와 같은 식으로 표현 가능하다

\hat{\mathbf{y}}=\operatorname{proj}_{L} \mathbf{y}=\frac{\mathbf{y} \cdot \mathbf{u}}{\mathbf{u} \cdot \mathbf{u}} \mathbf{u}

만약, 벡터

\mathbf{u}

가 unit vector라면, 아래와 같이 표현할 수도 있다.

\hat{\mathbf{y}}=\operatorname{proj}_{L} \mathbf{y}=(\mathbf{y} \cdot \mathbf{u}) \mathbf{u}

(2) Plane에 projection하기

two-dimensional subspace

W=\operatorname{Span}\{\mathbf{u_1,u_2}\}

에 projection 한다면 아래와 같이 표현한다.

\hat{\mathbf{y}}=\operatorname{proj}_{L} \mathbf{y}=\frac{\mathbf{y} \cdot \mathbf{u}_{1}}{\mathbf{u}_{1} \cdot \mathbf{u}_{1}} \mathbf{u}_{1}+\frac{\mathbf{y} \cdot \mathbf{u}_{2}}{\mathbf{u}_{2} \cdot \mathbf{u}_{2}} \mathbf{u}_{2}

만약, 벡터

\mathbf{u_1,u_2}

가 unit vector라면, 아래와 같이 표현할 수도 있다.

\hat{\mathbf{y}}=\operatorname{proj}_{L} \mathbf{y}=\left(\mathbf{y} \cdot \mathbf{u}_{1}\right) \mathbf{u}_{1}+\left(\mathbf{y} \cdot \mathbf{u}_{2}\right) \mathbf{u}_{2}

Transformation: Orthogonal Projection

위에 제시한 orthogonal projection 과정을 식으로 표현하면,

\begin{aligned}\hat{\mathbf{b}} &=f(\mathbf{b})=\left(\mathbf{b} \cdot \mathbf{u}_{1}\right) \mathbf{u}_{1}+\left(\mathbf{b} \cdot \mathbf{u}_{2}\right) \mathbf{u}_{2} \\&=\left(\mathbf{u}_{1}^{T} \mathbf{b}\right) \mathbf{u}_{1}+\left(\mathbf{u}_{2}^{T} \mathbf{b}\right) \mathbf{u}_{2} \\&=\mathbf{u}_{1}\left(\mathbf{u}_{1}^{T} \mathbf{b}\right)+\mathbf{u}_{2}\left(\mathbf{u}_{2}^{T} \mathbf{b}\right) \\&=\left(\mathbf{u}_{1} \mathbf{u}_{1}^{T}\right) \mathbf{b}+\left(\mathbf{u}_{2} \mathbf{u}_{2}^{T}\right) \mathbf{b} \\&=\left(\mathbf{u}_{1} \mathbf{u}_{1}^{T}+\mathbf{u}_{2} \mathbf{u}_{2}^{T}\right) \mathbf{b}\end{aligned} \\ \left[\begin{array}{ll} \mathbf{u}_{1} & \mathbf{u}_{2} \end{array}\right]\left[\begin{array}{l} \mathbf{u}_{1}^{T} \\ \mathbf{u}_{2}^{T} \end{array}\right] \mathbf{b}=U U^{T} \mathbf{b} \Rightarrow \text { linear transformation! }

Gram-Schmidt Orthogonalization

이 방법은

\mathbb{R}^{n}

상의 subspace에서 orthogonal 하거나 orthonormal한 basis를 만들기 위해 사용되는 단순한 알고리즘입니다.

W=\operatorname{Span}\left\{\mathbf{x}_{1}, \mathbf{x}_{2}\right\}

하는 상황을 가정하였을 때,

\mathbf{x}_{1}=\left[\begin{array}{l}3 \\6 \\0\end{array}\right] , \mathbf{x}_{2}=\left[\begin{array}{l}1 \\2 \\2\end{array}\right]

이다. 이러한 상황에서 subspace

W

에서의 orthogonal basis를 만들기 위해 사용한다.

\mathbf{v}_{1}=\mathbf{x}_{1}

이라고 가정한 뒤,

\mathbf{v}_{2}

를

\mathbf{x}_{1}

에 orthogonal한 벡터라고 생각한다면, 아래 식과 같이 표현할 수 있다.

\mathbf{v}_{2}=\mathbf{x}_{2}-\frac{\mathbf{x}_{2} \cdot \mathbf{x}_{1}}{\mathbf{x}_{1} \cdot \mathbf{x}_{1}} \mathbf{x}_{1}=\left[\begin{array}{l}1 \\2 \\2\end{array}\right]-\frac{15}{45}\left[\begin{array}{l}3 \\6 \\0\end{array}\right]=\left[\begin{array}{l}0 \\0 \\2\end{array}\right]

실제로

\mathbf{v}_{1}\cdot\mathbf{v}_{2}

를 행해보면 값이 0이 나온다는 것을 알 수있다.

QR Factorization

위에서 언급한 Gram-Schmidt 방법을 이용해 orthonormal vector를 구했는데, 이를 이용해 행렬을 분해하는 방법이 QR Factorization (= QR decomposition)이다. 수학적인 정의는 아래와 같다.

행렬 AAA 가 m×nm \times nm×n 의 크기를 갖는 행렬로 주어지고, 
각 column들이 linearly independent하다고 가정하면, 
A=QRA = QRA=QR 의 형태로 표현할 수 있다. QQQ 는 m×nm \times nm×n 크기의 행렬이고,
 각 column들은 Col⁡A\operatorname{Col} AColA 를 구성하는 orthonormal basis이다. 
RRR 은 n×nn \times nn×n 크기의 upper trinangular 형태를 가지는 행렬이며, 역행렬이 존재한다.

분해하는 과정을 짧게 나마 서술해보면.

주어진 행렬의 column vector들을 나눠놓은 후 각각 Gram-Schmidt 방법을 적용시킨다.

orthogonalize 된 벡터들을 normalize 하기 위해 각각의 norm으로 나눠준다 (크기를 1로 만듬= 정규화)

이 벡터들을 순서대로 모으면 행렬 QQQ 가 되고, 이 행렬은 orthonormal 하다.

orthonormal 행렬의 성질 QQ⊤=Q⊤Q=IQ Q^{\top}=Q^{\top} Q=IQQ⊤=Q⊤Q=I 을 이용해 R=Q−1A=Q⊤AR = Q^{-1} A=Q^{\top} AR=Q−1A=Q⊤A 를 유도할 수 있습니다. 이를 이용해 행렬 RRR 을 구하면 factorization이 끝난다.