10 Learning Theory

Questions

  1. Given a classifier with zero training error, what can we say about the true error (a.k.a. generalization error)? (Sample Complexity, Realizable Case)
  2. Given a classifier with low training error, what can we say about the true error (a.k.a. generalization error)? (Sample Complexity, Agnostic Case)
  3. Is there a theoretical justification for regularization to avoid overfitting? (Structural Risk Minimization)

PAC Model (Function Approximation View)

  • Given: $x^{(i)} \sim p^*(\cdot)$, $\quad y^{(i)} = c^*(x^{(i)})$
  • Learning algorithm chooses the $h \in \mathcal{H}$ with lowest training error: $\hat{h} = \underset{h \in \mathcal{H}}{\arg\min}\ \hat{R}(h)$
  • Goal: choose an $h$ with low generalization error $R(h)$
  • Only concerned with binary classification

Two types of Error

  1. True error (expected risk) $$ R(h) = P_{x \sim p^*(x)}(c^*(x) \neq h(x)) $$
  2. Train error (empirical risk) $$ \begin{align} \hat{R}(h) &= P_{x \sim S}(c^*(x) \neq h(x)) \\ &= \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(c^*(x^{(i)}) \neq h(x^{(i)})) \\ &= \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(y^{(i)} \neq h(x^{(i)})) \end{align} $$
     • Intuition: with a large enough sample size, the train error gets close to the true error.
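
To make the two risks concrete, here is a minimal sketch (assuming numpy and a toy setup: $x \sim \text{Uniform}(0,1)$ with hypothetical threshold functions `c_star` and `h` that disagree on an interval of probability mass 0.2); the empirical risk converges to the true risk as $N$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy target c* and hypothesis h (both simple thresholds).
c_star = lambda x: (x > 0.3).astype(int)  # true concept
h = lambda x: (x > 0.5).astype(int)       # errs exactly on x in (0.3, 0.5]

# True error under x ~ Uniform(0, 1) is P(0.3 < x <= 0.5) = 0.2.
for N in [10, 100, 10_000, 1_000_000]:
    x = rng.uniform(0, 1, size=N)
    emp_risk = np.mean(c_star(x) != h(x))  # R_hat(h): average 0/1 loss on S
    print(f"N={N:>9}: empirical risk = {emp_risk:.4f} (true risk = 0.2000)")
```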

Three Hypotheses of Interest

  1. true function $c^*$: $y^{(i)} = c^*(x^{(i)}), \forall i$
  2. expected risk minimizer $h^* = \underset{h \in \mathcal{H}}{\mathrm{argmin}} \ R(h)$
  3. empirical risk minimizer $\hat{h} = \underset{h \in \mathcal{H}}{\mathrm{argmin}} \ \hat{R}(h)$
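
A sketch distinguishing the three hypotheses (assuming numpy; the grid of threshold classifiers, the true threshold 0.37, and the 10% label-noise rate are all illustrative choices): $\hat{h}$ is fit on a small training set, while $h^*$ is approximated by minimizing risk on a huge sample.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy setup: x ~ Uniform(0,1), true concept c*(x) = 1[x > 0.37],
# with labels flipped w.p. 0.1 (so even h* has nonzero expected risk).
def sample(N):
    x = rng.uniform(0, 1, size=N)
    y = (x > 0.37).astype(int)
    flip = rng.uniform(size=N) < 0.1
    return x, np.where(flip, 1 - y, y)

# Finite hypothesis class: thresholds on a coarse grid, h_t(x) = 1[x > t].
thresholds = np.linspace(0, 1, 101)
def risk(t, x, y):
    return np.mean((x > t).astype(int) != y)

x_tr, y_tr = sample(50)            # small training set S
x_big, y_big = sample(1_000_000)   # huge sample approximates expected risk

h_hat  = thresholds[np.argmin([risk(t, x_tr, y_tr) for t in thresholds])]
h_star = thresholds[np.argmin([risk(t, x_big, y_big) for t in thresholds])]
print(f"empirical risk minimizer h_hat: t = {h_hat:.2f}")
print(f"expected risk minimizer h*   : t = {h_star:.2f} (near 0.37)")
```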

PAC Learning

Setup

  • PAC = Probably Approximately Correct
  • A PAC learner yields a hypothesis $h \in H$ which is
    • approximately correct $R(h) \approx R(h^*)$
    • with high probability $Pr(R(h) \approx R(h^*))\approx 1$
  • PAC Criterion: $\forall h \in H, Pr(|R(h)-\hat{R}(h)|<\epsilon)\geq 1-\delta$
  • Sample Complexity: the minimum number of training examples, $N$, such that the PAC criterion is satisfied
  • Consistent Learner: $h \in H$ is consistent if $\hat{R}(h)=0$

Four Cases

![[Screen Shot 2024-03-20 at 05.09.21.png]]

  • Realizable: $c^* \in H$
  • Agnostic: $c^*$ is not necessarily in $H$
  • Finite H: e.g., H = the set of all decision trees of depth D over binary feature vectors of length M
  • Infinite H: e.g., H = the set of all linear decision boundaries in M dimensions

Theorem 1

  • (Realizable case, finite H) $N \geq \frac{1}{\epsilon}\left(\ln(|\mathcal{H}|) + \ln\left(\frac{1}{\delta}\right)\right)$ labeled examples suffice so that, with probability at least $1-\delta$, every $h \in \mathcal{H}$ with $\hat{R}(h)=0$ has $R(h) \leq \epsilon$
  • Example: for conjunctions over $M$ binary features, $|H|=3^M$, so $\ln(|H|)=M\cdot \ln 3$ and $N \propto M$ (# of features)
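
A quick sketch (plain Python; the $\epsilon$ and $\delta$ values are illustrative) evaluating this sample-size bound for the conjunction example, showing the linear growth in $M$:

```python
import math

# Realizable-case sample complexity bound (Theorem 1):
#   N >= (1/eps) * (ln|H| + ln(1/delta))
# For conjunctions over M binary features, |H| = 3^M, so ln|H| = M ln 3.
eps, delta = 0.1, 0.05
for M in [10, 20, 50, 100]:
    ln_H = M * math.log(3)
    N = math.ceil((ln_H + math.log(1 / delta)) / eps)
    print(f"M={M:>3}: ln|H| = {ln_H:7.2f}, N >= {N}")
```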

Proof

Corollary

  • Given training dataset S, $|S|=N$, all $h\in H$ with $\hat{R}(h)=0$ have $R(h) \leq \frac{1}{N} \left( \ln(|\mathcal{H}|) + \ln\left(\frac{1}{\delta}\right) \right)$ with probability at least $1-\delta$.

Theorem 2

  • (Agnostic case, finite H) $N \geq \frac{1}{2\epsilon^2}\left(\ln(|\mathcal{H}|) + \ln\left(\frac{2}{\delta}\right)\right)$ labeled examples suffice so that, with probability at least $1-\delta$, every $h \in \mathcal{H}$ has $|R(h)-\hat{R}(h)| \leq \epsilon$

Corollary

  • Given training dataset S, $|S|=N$, all $h\in H$ have $R(h) \leq \hat{R}(h) + \sqrt{\frac{1}{2N} \left( \ln(|\mathcal{H}|) + \ln\left(\frac{2}{\delta}\right) \right)}$ with probability at least $1-\delta$.
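
A small sketch (plain Python; $|H| = 3^{20}$ reuses the conjunction example and $\delta = 0.05$ is illustrative) comparing the two corollaries: the realizable bound shrinks like $1/N$, while the agnostic gap shrinks only like $1/\sqrt{N}$.

```python
import math

# Evaluate both corollary bounds for a finite class with ln|H| = 20 ln 3.
ln_H, delta = 20 * math.log(3), 0.05

for N in [100, 1_000, 10_000, 100_000]:
    realizable = (ln_H + math.log(1 / delta)) / N                   # R(h) bound when R_hat(h) = 0
    agnostic   = math.sqrt((ln_H + math.log(2 / delta)) / (2 * N))  # additive gap in R(h) <= R_hat(h) + gap
    print(f"N={N:>7}: realizable bound = {realizable:.4f}, "
          f"agnostic gap = {agnostic:.4f}")
```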

VC Dimension

  • For infinite H, $|H|$ no longer measures how many kinds of functions the class can model; the VC dimension replaces $\ln(|H|)$ as the measure of capacity
  • Labeling

    • $\mathcal{S} = \{x^{(1)}, \ldots, x^{(N)}\}$
    • The set of labelings induced by H on S is: $\mathcal{H}(\mathcal{S}) = \{ [h(x^{(1)}), \ldots, h(x^{(N)})] \mid h \in \mathcal{H} \}$, $|\mathcal{H}(\mathcal{S})|\leq 2^{|\mathcal{S}|}$
  • Shattering: H shatters S if it can realize every possible arrangement (labeling) of those points into the 2 categories

  • H(S) is the set of all labelings induced by H on S

    • If |S| = N, then $|H(S)|\leq 2^N$
    • H shatters S if $|H(S)|= 2^N$
  • VC(H): the size of the largest set S that can be shattered by H. If H can shatter arbitrarily large finite sets, then VC(H) = $\infty$

  • To prove that VC(H) = d, you need to show both (see the sketch after this list):
    1. $VC(H)\geq d$: ∃ some set of d data points that H can shatter, and
    2. $VC(H)\leq d$: ∄ a set of d + 1 data points that H can shatter
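
A brute-force sketch of this recipe (assuming numpy; H is taken to be the 1-D threshold classifiers $h_t(x) = \mathbb{1}[x \geq t]$, for which VC(H) = 1, and H(S) is approximated by sweeping $t$ over a fine grid):

```python
import numpy as np

# H(S) for threshold classifiers h_t(x) = 1[x >= t], approximated by a grid of t.
def labelings(points, thresholds):
    out = set()
    for t in thresholds:
        out.add(tuple(int(x >= t) for x in points))
    return out

thresholds = np.linspace(-2, 2, 4001)

S1 = [0.0]          # one point: both labelings (0,) and (1,) are achievable
S2 = [0.0, 1.0]     # two points: would need all four labelings

for S in (S1, S2):
    got = labelings(S, thresholds)
    need = 2 ** len(S)
    print(f"|S|={len(S)}: |H(S)| = {len(got)} of {need} -> "
          f"{'shattered' if len(got) == need else 'NOT shattered'}")
# One point is shattered, but the labeling (1, 0) is impossible when
# x1 < x2, so no 2-point set is shattered: VC(H) = 1.
```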

Examples

  1. Linear Classifier (e.g., Perceptron):
    • Model: A straight line dividing the plane.
    • VC Dimension: 3 (in 2 dimensions; $M+1$ in $M$ dimensions).
  2. Polynomial of degree d:
    • Model: A curve that can be as complex as a polynomial of degree d (thresholded, in one dimension).
    • VC Dimension: d + 1
  3. Decision Trees:
    • Model: A tree structure where each node represents a decision based on some feature.
    • VC Dimension: Depends on the depth and features used, but generally increases with more nodes.
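
A sketch checking the linear-classifier example in 2 dimensions (assuming numpy and scipy; separability is tested as an LP feasibility problem). Shattering one 3-point set shows $VC(H) \geq 3$; the failure on one 4-point configuration (the XOR labeling) only illustrates the $VC(H) \leq 3$ direction, which formally requires that no 4-point set is shattered.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(X, y):
    """Feasible iff some w, b satisfy y_i (w.x_i + b) >= 1 for all i
    (equivalent to strict linear separability, by rescaling)."""
    signs = 2 * np.asarray(y) - 1                               # {0,1} -> {-1,+1}
    A = -signs[:, None] * np.hstack([X, np.ones((len(X), 1))])  # vars: (w1, w2, b)
    res = linprog(c=[0, 0, 0], A_ub=A, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0                                      # 0 = feasible

def shattered(X):
    return all(separable(X, y) for y in product([0, 1], repeat=len(X)))

tri  = np.array([[0, 0], [1, 0], [0, 1]])          # 3 non-collinear points
quad = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])  # XOR labeling is not separable
print("3 points shattered:", shattered(tri))       # True  -> VC >= 3
print("4 points shattered:", shattered(quad))      # False (e.g., labels 1,1,0,0)
```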

Theorem 3

![[Screen Shot 2024-03-20 at 08.24.25.png]]

Corollary

![[Screen Shot 2024-03-20 at 08.24.17.png]]

Theorem 4

![[Screen Shot 2024-03-20 at 08.24.38.png]]

Corollary

![[Screen Shot 2024-03-20 at 08.24.53.png]]

Approximation Generalization Tradeoff

  • A larger (more expressive) H tends to lower the approximation error $R(h^*)$ but widen the generalization gap $R(\hat{h}) - \hat{R}(\hat{h})$; a smaller H does the opposite.

![[Screen Shot 2024-03-20 at 08.26.38.png]] ![[Screen Shot 2024-03-20 at 08.26.50.png]]

Model Selection

  • Structural Risk Minimization: choose the hypothesis (or hypothesis-class complexity) minimizing train error plus the complexity penalty from the generalization bound; this is the theoretical justification for regularization asked about in Question 3.

![[Screen Shot 2024-03-20 at 08.27.12.png]] ![[Screen Shot 2024-03-20 at 08.27.36.png]]