In the class, we have shown that the SVC corresponds to the following optimisation problem: \[
\underset{(\beta_0, \beta, \xi) \in \Re \times \Re^p \times \Re^n_{\geq 0}}{\min} \, \frac{1}{2} \| \beta \|_2^2 + \rho \sum_{i = 1}^n \xi_i, \quad \text{s.t. } y_i (\beta_0 + \beta^\top x_i) \geq 1 - \xi_i, \, \forall i = 1, \ldots, n \,,
\] where \(\rho \geq 0\) is a tuning cost hyperparameter. Explain how \(\rho\) controls the bias-variance trade-off of the SVC.
Let \(x = (x_1, x_2)^\top \in \Re^2\) and \(\Phi(x) = (x_1, x_2, x_1 x_2)^\top \in \Re^3\). Give the expression for the kernel \(\mathcal{k}(x, x')\) of \(\Phi\).
Show that the function \(\mathcal{k}(x, x') = \cos(x - x')\) for \(x, x' \in \Re\) is a kernel for some \(\Phi\), and that \(\mathcal{k}\) is a symmetric and positive definite kernel.
Show
\(\cos(x - x')\) is a kernel for some feature map \(\Phi(x)\)
\(k(x,x')\) is symmetric and positive definite
Know
\(k(x,x') = \Phi(x)^\top \Phi(x')\)
Question 1.c
Symmetry, PSD
For any \(r \in \mathbb{N}\), \(x_1, \ldots, x_r \in \Re\),
Need: For any \(r, \in \mathbb{N}\), \(x_1, \ldots, x_r \in \Re\), \(v \in \Re^r\), \(v^\top K v \geq 0\)
\[
v^\top \mathcal K v= v^\top [\Phi(x_1), \ldots, \Phi(x_r)]^\top [\Phi(x_1), \ldots, \Phi(x_r)] v = \tilde v^\top \tilde v \geq 0 \,,
\]
for \[
\tilde v = [\Phi(x_1), \ldots, \Phi(x_r)] v \,.
\]
Question 1.d
Consider the polynomial kernel \[
\mathcal{k}(x, x') = (c + \gamma x^\top x')^d
\] for \(d = 2\), \(\gamma > 0\) and \(c \ge 0\), where \(x, x' \in \Re^p\). Show the feature map \(\Phi\) of this kernel.
Consider the SVM with a radial basis kernel, which takes the form \[
\mathcal{k}(x, x') = \exp\left(-\gamma \|x - x'\|_2^2\right),
\] where \(\gamma > 0\). Explain how the parameter \(\gamma\) influences the bias and variance of the SVM classifier.
If data change, the decsion boundary will change drastically (high variance)
If \(\gamma \to 0\): \(k(x,x_i) \to 1\), so prediction function becomes constant
Low variance, high bias
Generally, for large gamma, more points around \(x\) will contribute to score, lower variance but bias increases
Learnings
How kernels relate to inner product
How differnt kernels and hyperparameters give different decision boundaries
Soft-margin SVC/SVM cost parameter
Question 2
SVM and XOR data. Let \(f ∶ \{0, 1\}^2 \to \{0, 1\}\) be the exclusive OR function, where $f (a,b) = 1 if (a,b) = (0, 1) or (a,b) = (1, 0), and \(f (a, b) = 0\) otherwise. Consider the XOR dataset \((x_1, y_1), \ldots , (x_4, x_4)\), defined such that
and would need \(\beta_0 = \beta_1 = \beta_2 = 0\), and \(H(0, [0,0]^\top)\) is not a \(1\)D hyperplane.
Question 2
Consider the SVM using the quadratic polynomial kernel \[
\mathcal{k}(x,x') = (c + \gamma x^\top x')^2
\] for \(\gamma > 0\), \(c \geq 0\). Show that in the transformed feature domain using this kernel, the observation points are linearly separable.