Serre's Conjecture

Serre’s (modularity) conjecture, first made clear in his 1975 paper, Valeurs propres des opérateurs de Hecke modulo $\ell$, states the following:

Conjecture. (Serre) Let $\rho : \mathrm{Gal}(\overline{\mathbb{Q}}/\mathbb{Q}) \to \mathrm{GL}_2(\overline{\mathbb{F}}_p)$ be a continuous, odd, irreducible Galois representation. Then there exists a normalized eigenform $f \in S_{k(\rho)}(N(\rho), \epsilon(\rho); \overline{\mathbb{F}}_p)$ with associated Galois representation $\rho_f$ such that $\rho_f \cong \rho$. Furthermore, $N(\rho)$ and $k(\rho)$ are the minimal level and weight for which there exists such a form $f$.

The whole point of this log is to go through the nuts and bolts to really appreciate this conjecture (which is now a theorem!) and see some of its applications. In fact, Fermat's Last Theorem (FLT) is a direct application, so the conjecture gives another proof of FLT.

Modular Forms


Let $\mathfrak{h}$ denote the upper half plane, i.e. $\mathfrak{h} = \{z \in \mathbb{C} : \mathrm{Im}(z) > 0\}$. Take $z \in \mathfrak{h}$ and let $a, b, c, d \in \mathbb{R}$ with $ad - bc > 0$. Then $\frac{az+b}{cz+d} \in \mathfrak{h}$: we have $\mathrm{Im}\left(\frac{az+b}{cz+d}\right) = \frac{(ad-bc)\,\mathrm{Im}(z)}{|cz+d|^2}$ with $ad - bc > 0$ and $|cz+d|^2 > 0$, so $\mathrm{Im}\left(\frac{az+b}{cz+d}\right) > 0$, i.e. $\frac{az+b}{cz+d} \in \mathfrak{h}$. Thereby $\mathrm{SL}_2(\mathbb{Z}) = \Gamma(1)$ acts on $\mathfrak{h}$ via $(\gamma, z) \mapsto \frac{az+b}{cz+d}$, where $\gamma = \left(\begin{smallmatrix} a & b \\ c & d \end{smallmatrix}\right)$. Thereby we get an equivalence relation on $\mathfrak{h}$ via $z \sim z'$ if there exists $\gamma \in \Gamma(1)$ such that $\gamma z = z'$. We write $\Gamma(1) \backslash \mathfrak{h}$ to emphasize this equivalence relation, and the set $\mathfrak{h}/\Gamma(1)$ denotes the set of orbits of this action. Importantly, there is a bijection $j : \mathfrak{h}/\Gamma(1) \to \mathbb{C}$, and $\mathfrak{h}/\Gamma(1)$ is isomorphic to a compact Riemann surface with one point missing. We can further extend the bijection $j$ to an isomorphism $j : \mathfrak{h}/\Gamma(1) \cup \{\infty\} \to \mathbb{P}^1(\mathbb{C})$ of compact Riemann surfaces. To go into more detail, write $\mathfrak{h}^* = \mathfrak{h} \cup \mathbb{P}^1(\mathbb{Q})$, and note that $\Gamma(1)$ acts on $\mathbb{P}^1(\mathbb{Q})$ via $\left(\left(\begin{smallmatrix} a & b \\ c & d \end{smallmatrix}\right), (x_0 : x_1)\right) \mapsto (ax_0 + bx_1 : cx_0 + dx_1)$. Defining $X(1) = \mathfrak{h}^*/\Gamma(1)$, we then have that $X(1) = \mathfrak{h}/\Gamma(1) \cup \{\infty\}$ is a compact Riemann surface.
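To see where the imaginary-part identity above comes from, note that
$$\frac{az+b}{cz+d} = \frac{(az+b)(c\bar{z}+d)}{|cz+d|^2}, \qquad \mathrm{Im}\big((az+b)(c\bar{z}+d)\big) = ad\,\mathrm{Im}(z) + bc\,\mathrm{Im}(\bar{z}) = (ad-bc)\,\mathrm{Im}(z),$$
and dividing by $|cz+d|^2$ gives the claimed formula.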

Note that $-I$, where $I$ is the identity of $\Gamma(1)$, acts trivially on the upper half-plane, as $(-I)z = \frac{-z+0}{0-1} = z$, and the group $\mathrm{PSL}_2(\mathbb{Z}) := \mathrm{SL}_2(\mathbb{Z})/\{\pm I\} = \Gamma(1)/\{\pm I\}$ acts on $\mathfrak{h}$ in a natural way.

Definition. Let $k$ be a nonnegative integer. A holomorphic function $f$ on $\mathfrak{h}$ is a *weak modular form* of weight $k$ if $f(\gamma z) = (cz+d)^k f(z)$ for all $z \in \mathfrak{h}$ and all $\gamma = \left(\begin{smallmatrix} a & b \\ c & d \end{smallmatrix}\right) \in \Gamma(1)$, where $\gamma z = \frac{az+b}{cz+d}$.

If $k$ were odd, then taking $\gamma = -I$ we would have $f(\gamma z) = f(z) = (-1)^k f(z) = -f(z)$, and so $f = 0$. Therefore if $f \neq 0$, then $k$ must be even; the weight of any nonzero weak modular form is even. In order to check that $f$ is a weak modular form, it suffices to check that $f(z+1) = f(z)$ and $f(-1/z) = z^k f(z)$, since $\Gamma(1)$ is generated by $T = \left(\begin{smallmatrix} 1 & 1 \\ 0 & 1 \end{smallmatrix}\right)$ and $S = \left(\begin{smallmatrix} 0 & -1 \\ 1 & 0 \end{smallmatrix}\right)$. As any weak modular form $f$ satisfies $f(z+1) = f(z)$, it is periodic with period $1$ as a function of the real part of $z$, so we obtain a Fourier series: we can write $f(z) = \sum_{n \in \mathbb{Z}} a_n q^n$ where $q = e^{2\pi i z}$. If $a_n = 0$ for all $n < 0$, then $f$ is said to be holomorphic at infinity.

Definition. A weak modular form $f$ of weight $k$ is a modular form of weight $k$ if it is holomorphic at infinity. If, furthermore, we have that $a_0 = 0$ in its $q$-expansion, then $f$ is said to be a cusp form.

Let $k$ be a nonnegative even integer. We write $M_k(\Gamma(1))$ to denote the vector space of modular forms of weight $k$ for $\Gamma(1)$, and we write $S_k(\Gamma(1))$ to denote the space of cusp forms of weight $k$ for $\Gamma(1)$.
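For example, the Eisenstein series $E_4(z) = 1 + 240\sum_{n \geq 1} \sigma_3(n) q^n$ (where $\sigma_3(n)$ is the sum of the cubes of the divisors of $n$) is a modular form of weight $4$, so $E_4 \in M_4(\Gamma(1))$, while the discriminant form $\Delta(z) = q\prod_{n \geq 1}(1 - q^n)^{24} = q - 24q^2 + 252q^3 - \cdots$ has $a_0 = 0$ in its $q$-expansion and is a cusp form of weight $12$, so $\Delta \in S_{12}(\Gamma(1))$.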

Lemma. Let $f$ be a modular form of weight $k$ and $g$ a modular form of weight $k'$, both for $\Gamma(1)$. Then $fg$ is a modular form of weight $k + k'$. Furthermore, $\bigoplus_{k=0}^{\infty} M_k(\Gamma(1))$ is a graded algebra.

Proof. Write $h = fg$. Then $h(\gamma z) = f(\gamma z)g(\gamma z) = (cz+d)^k (cz+d)^{k'} f(z) g(z) = (cz+d)^{k+k'} h(z)$. Now write $f(z) = \sum_{n \in \mathbb{Z}} a_n q^n$ and $g(z) = \sum_{n \in \mathbb{Z}} b_n q^n$. So $h(z) = \sum_{n \in \mathbb{Z}} c_n q^n$ with $c_n = \sum_{m \in \mathbb{Z}} a_m b_{n-m} = \sum_{m \geq 0} a_m b_{n-m} + \sum_{m < 0} a_m b_{n-m}$. If $n < 0$, then in the first sum $n - m < 0$, so $b_{n-m} = 0$ since $g$ is a modular form, while in the second sum $a_m = 0$ since $f$ is a modular form; hence $c_n = 0 + 0 = 0$. Therefore $h = fg$ is a modular form of weight $k + k'$. We have that $\bigoplus_{k=0}^{\infty} M_k(\Gamma(1))$ is a graded algebra as $f \in M_k(\Gamma(1))$ and $g \in M_{k'}(\Gamma(1))$ imply $fg \in M_{k+k'}(\Gamma(1))$, where addition is defined componentwise and $M_0(\Gamma(1)) \cong \mathbb{C}$ acts as the scalars. The last claim follows from the following argument: let $f \in M_0(\Gamma(1))$. Then $f(\gamma z) = f(z)$ for all $\gamma \in \Gamma(1)$, so in particular $f$ is periodic under $T : z \mapsto z+1$. Hence $f(z) = F(q)$ where $q = e^{2\pi i z}$, and holomorphy at the cusp implies that $F$ extends holomorphically to $q = 0$, so $F$ is holomorphic on the open unit disk. By $\Gamma(1)$-invariance, every value of $f$ is already attained on the closure of the standard fundamental domain, whose image under $z \mapsto q$ (together with $q = 0$) is a compact subset of the open disk; hence $|F|$ attains its maximum modulus at an interior point, and by the maximum modulus principle $F$ (and hence $f$) must be constant.

Definition. An n-dimensional


How do models learn?

Introduction

Interestingly, a significant portion of machine learning revolves around optimization, because a model needs to learn to improve its predictions of the target outputs; in machine-learning-adjacent job postings, you might see that they want someone with experience in convex optimization specifically. We'll take on this challenge of teaching a model to learn in a mathematical spirit, and this journal entry aims to explain it concisely but also, hopefully, clearly.

We measure how well our model's predictions match the target outputs with a loss function, and the goal of learning is to find the model parameters (e.g., weights, biases) which minimize the loss function. In practice, one of the most common strategies for optimization is the technique of gradient descent, the term "gradient" making some allusion to analysis.

Suppose we have a dataset $\mathcal{D}_N = \{(x_i, y_i)\}_{i=1}^{N}$ where each $x_i \in \mathbb{R}^d$ is an input/feature vector and $y_i$ is the corresponding target output (e.g. $y_i$ could be real-valued for regression, or a class label for classification). We propose a model function $f_\theta : \mathbb{R}^d \to \mathbb{R}^k$ governed by parameters $\theta \in \mathbb{R}^m$; this model function could be a linear model such as linear regression, or a neural network (NN). Now define a loss function $L(f_\theta(x), y)$ which measures the discrepancy between the prediction, $f_\theta(x)$, and the true label, $y$. For classification, a common choice is the cross-entropy loss; for regression, we typically use the squared error $L(f_\theta(x), y) = (f_\theta(x) - y)^2$, whose average over the dataset is the mean squared error (MSE); and for a linear model we have $f_\theta(x) = w^T x + b$ and $\theta = \{w, b\}$, where $w$ is a vector of coefficients and $b$ is the bias. In general, the overall cost (or objective) function is typically expressed as an average loss over all training samples:
$$C(\theta) = \frac{1}{N}\sum_{i=1}^{N} L(f_\theta(x_i), y_i),$$ and our aim is to find $\theta^* = \arg\min_{\theta} C(\theta)$.
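To make the notation concrete, here is a minimal NumPy sketch of the cost $C(\theta)$ for a linear model $f_\theta(x) = w^T x + b$ with MSE loss (the toy data and shapes below are purely illustrative):

```python
import numpy as np

def cost(w: np.ndarray, b: float, X: np.ndarray, y: np.ndarray) -> float:
    """Average MSE loss C(theta) over the dataset, with theta = {w, b}."""
    preds = X @ w + b                    # f_theta(x_i) for every row x_i of X
    return float(np.mean((preds - y) ** 2))

# Toy dataset: N = 4 samples, d = 2 features (values are illustrative).
X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5]])
y = np.array([5.0, 0.0, 7.0, -2.0])

print(cost(np.zeros(2), 0.0, X, y))      # cost at theta = 0
```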

Into gradient descent

The gradient descent technique is an iterative process for finding (local) minima of a real-valued function, and in our narrow case we want to minimize $C(\theta)$. Recall that the gradient of $C(\theta)$ is just $\nabla_\theta C(\theta) = \left(\frac{\partial C}{\partial \theta_1}, \frac{\partial C}{\partial \theta_2}, \ldots, \frac{\partial C}{\partial \theta_m}\right)$. Importantly, this gradient vector points in the direction of steepest increase of $C$, which is the essential property behind gradient descent: if you want to move $\theta$ in a direction that reduces $C(\theta)$, you should step against (i.e., in the negative direction of) the gradient. So, with this in mind, we set up the update rule $\theta_{\text{new}} = \theta_{\text{old}} - \eta \nabla_\theta C(\theta)$, where $\eta > 0$ is the learning rate (typically a small scalar). We perform this update (often called the "gradient descent step") many times until we (hopefully) converge to a minimum:

$$\theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta C(\theta^{(t)}).$$
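As a small sketch of the update rule by itself, here is gradient descent applied to the toy objective $C(\theta) = (\theta_1 - 3)^2 + (\theta_2 + 1)^2$ (the objective, learning rate, and iteration count are illustrative choices):

```python
import numpy as np

def C(theta: np.ndarray) -> float:
    return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2

def grad_C(theta: np.ndarray) -> np.ndarray:
    # Gradient of C: (2(theta_1 - 3), 2(theta_2 + 1)).
    return np.array([2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)])

theta = np.zeros(2)    # theta^(0)
eta = 0.1              # learning rate
for t in range(200):
    theta = theta - eta * grad_C(theta)   # theta^(t+1) = theta^(t) - eta * grad C(theta^(t))

print(theta)           # approaches the minimizer (3, -1)
```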

In an ideal world, to check that we have a minimum on our hands, we would want to check that we have a stationary point, $\nabla_\theta C(\theta) = 0$, but we would also need to check that the Hessian matrix of $C(\theta)$, $\nabla^2_\theta C(\theta)$, is positive semidefinite (positive definiteness would guarantee a strict local minimum). However, machine learning gives rise to a computationally difficult world in which this isn't usually the method/heuristic we use to verify that we have a minimum. In practical machine learning, especially in high-dimensional (lots and lots of parameters, i.e. $\theta \in \mathbb{R}^m$ where $m$ is big), non-convex landscapes, we instead rely on stopping criteria: useful heuristics which inform us that our iterative procedure is "good enough" or unlikely to produce something substantially better. Some common stopping criteria are (a code sketch follows the list):

  1. Stop when $\|\nabla_\theta C(\theta)\|$ drops below a small threshold (e.g. $10^{-5}$). The reasoning behind this is that a very small gradient suggests we are near a stationary point (although it could be a minimum, maximum, or saddle).

  2. Stop when $|C(\theta^{(t+1)}) - C(\theta^{(t)})|$ becomes negligibly small. If the objective is no longer decreasing significantly, further iterations may yield minimal improvement.

  3. Stop when $\|\theta^{(t+1)} - \theta^{(t)}\|$ is below a threshold. If the parameters hardly change with each step, it indicates that the algorithm is converging or stuck.

  4. In practice, due to time or resource limits, we might set a maximum number of epochs or a time budget.
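Here is the sketch promised above of how the first three criteria might be checked inside a training loop (the threshold values and the helper name are illustrative, not a standard API):

```python
import numpy as np

def should_stop(grad: np.ndarray,
                cost_prev: float, cost_curr: float,
                theta_prev: np.ndarray, theta_curr: np.ndarray,
                grad_tol: float = 1e-5,
                cost_tol: float = 1e-8,
                step_tol: float = 1e-8) -> bool:
    """Return True if any of stopping criteria 1-3 is met."""
    small_gradient = np.linalg.norm(grad) < grad_tol                  # criterion 1
    small_decrease = abs(cost_curr - cost_prev) < cost_tol            # criterion 2
    small_step = np.linalg.norm(theta_curr - theta_prev) < step_tol   # criterion 3
    return small_gradient or small_decrease or small_step
```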

What’s also important to note here is that your strategy/technique for gradient descent will vary depending on whether you’re in a convex or non-convex environment. In the non-convex world, functions like those that arise in deep neural networks have many stationary points, including local minima and saddle points. So a small gradient indicates that you’ve found a stationary point, but this is not guaranteed to be a global or even a local minimum. Formally proving you’ve reached the global minimum is usually intractable for non-convex problems, but the standard convergence heuristics above are sufficient in most machine learning contexts to declare that “training is done.”

Example. Implementing gradient descent for single variable linear regression:
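One way this might look, as a minimal NumPy sketch (the synthetic data, learning rate, and stopping threshold are illustrative choices):

```python
import numpy as np

# Synthetic single-variable data: y = 3x + 2 plus a little noise (illustrative values).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 3.0 * x + 2.0 + 0.1 * rng.normal(size=100)

# Model: f_theta(x) = w*x + b, cost C(theta) = MSE over the dataset.
w, b = 0.0, 0.0
eta = 0.1            # learning rate
tol = 1e-8           # stopping threshold on the change in cost (criterion 2)
max_epochs = 10_000  # criterion 4: hard cap on iterations

prev_cost = np.inf
for epoch in range(max_epochs):
    pred = w * x + b
    err = pred - y
    cost = np.mean(err ** 2)

    # Gradients of the MSE cost with respect to w and b.
    grad_w = 2.0 * np.mean(err * x)
    grad_b = 2.0 * np.mean(err)

    # Gradient descent step: theta <- theta - eta * grad C(theta).
    w -= eta * grad_w
    b -= eta * grad_b

    # Stop when the cost is no longer decreasing significantly.
    if abs(prev_cost - cost) < tol:
        break
    prev_cost = cost

print(f"w = {w:.3f}, b = {b:.3f}, cost = {cost:.5f}")   # w, b should approach 3 and 2
```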

Convex vs. non-convex

In optimization theory, the convex vs. non-convex distinction is crucial, fundamentally altering how easy or hard it is to find (as well as verify) a global minimum. To further dig into why this is crucial, recall that a function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if for all $x, y \in \mathbb{R}^n$ and all $\lambda \in [0,1]$, $f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$ holds. Geometrically, this property says that any line segment between two points on the graph of $f$ lies on or above the graph. Also, convex functions have the property that any local minimum is automatically a global minimum.
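For instance, $f(x) = x^2$ is convex: for any $x, y \in \mathbb{R}$ and $\lambda \in [0,1]$,
$$\lambda x^2 + (1-\lambda)y^2 - (\lambda x + (1-\lambda)y)^2 = \lambda(1-\lambda)(x-y)^2 \geq 0,$$
so the defining inequality holds.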

Lemma. If f is convex, then a local minimum is in fact a global minimum.

Proof. Let $f$ be a convex function, and let $x^*$ be a local minimum. By definition of a local minimum, there exists a neighborhood $\mathcal{N}$ around $x^*$ such that $f(x^*) \leq f(x)$ for all $x \in \mathcal{N}$. Suppose, for contradiction, that $x^*$ is not a global minimum. Then there exists some $y$ such that $f(y) < f(x^*)$. By convexity of $f$, for any $\lambda \in (0,1)$: $f(\lambda x^* + (1-\lambda)y) \leq \lambda f(x^*) + (1-\lambda)f(y) < \lambda f(x^*) + (1-\lambda)f(x^*) = f(x^*)$. For $\lambda$ sufficiently close to $1$, $\lambda x^* + (1-\lambda)y$ lies in the neighborhood $\mathcal{N}$, contradicting the fact that $x^*$ is a local minimum. Therefore, $x^*$ must be a global minimum.
