Algorithm Pseudocode¶

def set_hyperparameters(k, d):
    store k (# of hyperparams)
    store d(·,·)

def train(D):
    store D

def h(x'):
    Let S = the set of k points in D nearest to x' according to distance function d(u, v) 
    Let v = majority_vote(S) 
    return v

Decision Boundaries¶

Decision rule / Hypothesis $h: \mathbb{X} \rightarrow \mathbb{Y}$, $h: \mathbb{R}^M \rightarrow {+, - }$
Decision boundaries Ex: 2D Binary classifier: Linear / quadratic

How to handle ties? - Consider another point - Remove farthest of k points - Weight votes by distance - Consider another distance metric

Evaluation / Computational Efficiency¶

1NN classifier: - requires no training - always zero training error (A data point is always its own nearest neighbor)

Suppose N training examples, M features, the computational complexity of 1NN model is:

Task	Naive	k-d Tree
Train	O(1)	~O(MN logN)
Predict	O(MN)	~O($2^M \log(N)$) on avg
- Problem: fast for small M, very slow for large M
- In practice: use stochastic approximations
## Theoretical Guarantees

Let h(x) be a Nearest Neighbor (k=1) binary classifier. As the number of training examples N goes to infinity… $$error_{true} (h) < 2 \times Bayes \space Error \space Rate$$ Bayes error rate: "the best you could possibly do"

Hyper-parameters¶

Distance function¶

Distance function: $d: \mathbb{R}^{M} \times \mathbb{R}^{M} \rightarrow \mathbb{R}$
Euclidean distance: $$ d(u, v) = \sqrt{\sum_{m=1}^{M}(u_m - v_m)^2 } $$

Setting $k$¶

Theorem: If $k$ is some function of N s.t. $k(N)\rightarrow \infty$ and $\frac{k(N)}{N}\rightarrow 0$ as $N\rightarrow \infty$, then true error of a kNN model $\rightarrow$ Bayes error rate

Inductive Bias¶

Similar points should have similar labels.
All dimensions are created equally.

Pros and Cons¶

Pros:
- Intuitive / explainable
- No training / retraining
- Provably near-optimal in terms of true error rate
Cons:
- Computationally expensive, large runtime
  - Always needs to store all data: $O(ND)$
  - Finding the k closest points in D dimensions: $O(ND+N\log(k))$
- Affected by feature scale