Big Picture¶
- Dataset: $D = \{ (x^{(i)}, y^{(i)}) \}_{i=1}^{N}, \; x^{(i)} \in \mathbb{R}^{M}, \; y^{(i)} \in \mathbb{R}$
- Goal: Predict real-valued outputs
- Objective function: $$ \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} e_i^2 = \frac{1}{N} \sum_{i=1}^{N} \left(y^{(i)} - (w^T x^{(i)} + b)\right)^2 $$
- Notation:
- $x' = [1, x_1, ..., x_M]^T$
- $\theta = [b, w_1, ..., w_M]^T$
- $h_{\theta} (x') = \theta ^{T} x'$
- Key Idea: Given objective function $J(\theta): \mathbb{R}^M \rightarrow \mathbb{R}$, find $\hat{\theta} = \arg\min_{\theta \in \mathbb{R}^M} J(\theta)$
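A minimal numpy sketch of the objective above, assuming `X` holds the $x^{(i)}$ as rows and `theta` follows the $[b, w_1, ..., w_M]^T$ convention (the names are illustrative, not from the notes):

```python
import numpy as np

def mse(theta, X, y):
    """J(theta): mean squared error of the linear hypothesis h_theta(x') = theta^T x'."""
    # Prepend a column of ones so each row becomes x' = [1, x_1, ..., x_M]
    X_prime = np.hstack([np.ones((X.shape[0], 1)), X])
    residuals = y - X_prime @ theta        # e_i = y^(i) - theta^T x'^(i)
    return np.mean(residuals ** 2)
```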
Examples¶
KNN Regression¶
Algorithm 1: k=1 Nearest Neighbor Regression
    - Train: store all (x, y) pairs
    - Predict: pick the nearest x in training data and return its y

Algorithm 2: k=2 NN Weighted Regression
    - Train: store all (x, y) pairs
    - Predict: pick the nearest two instances $x^{(n1)}$ and $x^{(n2)}$ in training data and return the weighted average of their y values
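A rough sketch of both algorithms, assuming Euclidean distance. The exact weighting scheme for Algorithm 2 is not specified in these notes, so inverse-distance weights are used here as one plausible choice:

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=1, weighted=False):
    """Predict a real value for x_query from its k nearest stored (x, y) pairs."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distance to every training point
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    if not weighted:
        return y_train[nearest].mean()                 # with k=1: just return that neighbor's y
    w = 1.0 / (dists[nearest] + 1e-12)                 # assumed weighting: inverse distance
    return np.sum(w * y_train[nearest]) / np.sum(w)    # weighted average of neighbor y values
```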
Decision Tree Regression¶
- Decision tree for classification / regression
- During learning, choose the attribute that minimizes an appropriate splitting criterion (e.g. mean squared error, mean absolute error)
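As a sketch of the splitting criterion, the snippet below scores a single candidate split by the size-weighted mean squared error of the two children (each child predicts the mean of its own y values); the learner would choose the attribute and threshold that minimize this score. Function and variable names are illustrative:

```python
import numpy as np

def split_mse(x_feature, y, threshold):
    """Size-weighted MSE of splitting one feature at `threshold` (lower is better)."""
    left = x_feature <= threshold
    score = 0.0
    for mask in (left, ~left):
        if mask.any():
            # each child predicts the mean of its y values; accumulate its weighted MSE
            score += mask.sum() * np.mean((y[mask] - y[mask].mean()) ** 2)
    return score / len(y)
```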
Optimization¶
Unconstrained Optimization¶
- Def: minimize (or maximize) a function with no constraints on its inputs
- Given objective function $J(\theta): \mathbb{R}^M \rightarrow \mathbb{R}$, find $\hat{\theta} = \arg\min_{\theta \in \mathbb{R}^M} J(\theta)$
- Linear Regression as Function Approximation: ![[Screen Shot 2024-02-19 at 05.00.05.png]]
Method 1: Gradient Descent¶
- Algorithm (a numpy sketch is given after the figures below):
    - Choose initial point $\vec{\theta}_0$
    - Repeat for $t = 1, 2, ...$
        - a) Compute gradient $$\vec{g} = \nabla J(\vec{\theta}) = \begin{bmatrix} \frac{\partial J(\vec{\theta})}{\partial \theta_1} \\ \frac{\partial J(\vec{\theta})}{\partial \theta_2} \\ \vdots \\ \frac{\partial J(\vec{\theta})}{\partial \theta_M} \end{bmatrix}$$
        - b) Choose step size $\gamma_t$
        - c) Update $\vec{\theta} \leftarrow \vec{\theta} - \gamma_t \cdot \vec{g}$
    - Return $\vec{\theta}$ once a stopping criterion is met
- Remarks:
    - Initial point
        - chosen randomly
        - all zeros
    - Step size
        - a fixed value, e.g. 0.1
        - set a schedule $\gamma_t=\frac{\gamma_0}{(t-1)\gamma_0+1}$
        - line search
    - Stopping criteria
        - $\vec{g}\approx 0$
        - e.g. $||\nabla J(\vec{\theta})||_2 < \epsilon$ for $\epsilon = 10^{-8}$
![[Screen Shot 2024-02-19 at 05.44.41.png]] ![[Screen Shot 2024-02-19 at 05.44.52.png]] ![[Screen Shot 2024-02-19 at 05.45.02.png]]
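Here is the code sketch referenced above: a minimal numpy version of the algorithm, using the decaying step-size schedule and the gradient-norm stopping criterion from the remarks. The gradient function shown is for the linear regression MSE; names and default values are illustrative:

```python
import numpy as np

def gradient_descent(grad_J, theta0, gamma0=0.1, max_iters=1000, eps=1e-8):
    """Minimize J by following -grad J with step size gamma_t = gamma0 / ((t-1)*gamma0 + 1)."""
    theta = np.array(theta0, dtype=float)
    for t in range(1, max_iters + 1):
        g = grad_J(theta)                          # (a) compute gradient
        if np.linalg.norm(g) < eps:                # stopping criterion: ||grad J||_2 < eps
            break
        gamma_t = gamma0 / ((t - 1) * gamma0 + 1)  # (b) step size from the schedule above
        theta = theta - gamma_t * g                # (c) update
    return theta

def mse_grad(theta, X_prime, y):
    """Gradient of the MSE objective, where X_prime has rows x' = [1, x_1, ..., x_M]."""
    N = len(y)
    return (2.0 / N) * X_prime.T @ (X_prime @ theta - y)
```

Called as, e.g., `gradient_descent(lambda th: mse_grad(th, X_prime, y), np.zeros(X_prime.shape[1]))`, this starts from the all-zeros initial point mentioned in the remarks.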
Convexity¶
- A function $f : \mathbb{R}^D \rightarrow \mathbb{R}$ is convex if $\forall x^{(1)}, x^{(2)} \in \mathbb{R}^D$ and $0 \leq c \leq 1$, $f(cx^{(1)} + (1 - c)x^{(2)}) \leq cf(x^{(1)}) + (1 - c)f(x^{(2)})$
- Gradient descent is a local optimization algorithm: it converges to a local minimum (if it converges at all). It works great if the objective function is convex, since every local minimum of a convex function is a global minimum, but it is not ideal if the objective function is non-convex.
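A standard argument (not spelled out in these notes) for why the linear regression MSE is convex: writing $X$ for the design matrix whose rows are the $x'^{(i)}$ (defined formally in the next section), the Hessian of the objective is positive semidefinite,

$$\nabla^2_{\theta} J(\theta) = \frac{2}{N} X^T X, \qquad v^T \left(\frac{2}{N} X^T X\right) v = \frac{2}{N} \|Xv\|_2^2 \ge 0 \quad \forall v,$$

so gradient descent on this objective (with a suitable step size) converges to a global minimum.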
Method 2: Closed Form¶
- Idea: find the critical points of the objective function, specifically the ones where $\nabla J(\theta) = 0$ (the vector of all zeros), and check if any of them are local minima
- Notation
- design matrix / target vector ![[Screen Shot 2024-03-26 at 10.19.56.png]]
- Minimizing MSE
- ![[Screen Shot 2024-03-26 at 10.20.44.png]]
- $\hat{\theta} = (X^TX)^{-1}X^Ty$
Questions
1. Is $X^TX$ invertible?
    1. When $N \gg D + 1$, $X^TX$ is (almost always) full rank and therefore invertible!
    2. If $X^TX$ is not invertible (occurs when one of the features is a linear combination of the others), then there are infinitely many solutions
2. If so, how computationally expensive is inverting $X^TX$?
    1. $X^TX \in \mathbb{R}^{(D+1)\times(D+1)}$, so inverting $X^TX$ takes $O(D^3)$ time…
        1. Computing $X^TX$ takes $O(ND^2)$ time
        2. Can use gradient descent to (potentially) speed things up when $N$ and $D$ are large!
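A minimal numpy sketch of the closed-form solution, assuming `X` holds the raw features as rows (names are illustrative). Solving the least-squares problem with `lstsq` mirrors $\hat{\theta} = (X^TX)^{-1}X^Ty$ but avoids forming an explicit inverse and also handles the rank-deficient (non-invertible) case:

```python
import numpy as np

def closed_form_theta(X, y):
    """Least-squares theta_hat for the design matrix with a prepended bias column."""
    X_prime = np.hstack([np.ones((X.shape[0], 1)), X])  # rows are [1, x_1, ..., x_D]
    # Equivalent to (X'^T X')^{-1} X'^T y when X'^T X' is invertible,
    # but numerically preferable to calling np.linalg.inv directly.
    theta_hat, *_ = np.linalg.lstsq(X_prime, y, rcond=None)
    return theta_hat
```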