Serre's Conjecture

Serre’s (modularity) conjecture, first made clear in his 1975 paper, Valeurs propres des opérateurs de Hecke modulo $\ell$, states the following:

Conjecture. (Serre) Let $\rho : \mathrm{Gal}(\overline{\mathbb{Q}}/\mathbb{Q}) \to \mathrm{GL}_2(\overline{\mathbb{F}}_p)$ be a continuous, odd, irreducible Galois representation. Then there exists a normalized eigenform $f \in S_{k(\rho)}(N(\rho), \epsilon(\rho); \overline{\mathbb{F}}_p)$ with associated Galois representation $\rho_f$ such that $\rho_f \cong \rho$. Furthermore, $N(\rho)$ and $k(\rho)$ are the minimal level and weight for which there exists such a form $f$.

The whole point of this log is to go through the nuts and bolts to really appreciate this conjecture (which is now a theorem!) and see some of its applications. In fact, Fermat's Last Theorem (FLT) is a direct application, so the conjecture gives another proof of FLT.

Modular Forms


Let $\mathfrak{h}$ denote the upper half plane, i.e. $\mathfrak{h} = \{z \in \mathbb{C} : \mathrm{Im}(z) > 0\}$. Take $z \in \mathfrak{h}$ and let $a, b, c, d \in \mathbb{R}$ with $ad - bc > 0$. Then $\frac{az+b}{cz+d} \in \mathfrak{h}$: we have $\mathrm{Im}\left(\frac{az+b}{cz+d}\right) = \frac{(ad-bc)\,\mathrm{Im}(z)}{|cz+d|^2}$ with $ad - bc > 0$ and $|cz+d|^2 > 0$, so $\mathrm{Im}\left(\frac{az+b}{cz+d}\right) > 0$, i.e. $\frac{az+b}{cz+d} \in \mathfrak{h}$. Thereby $\mathrm{SL}_2(\mathbb{Z}) = \Gamma(1)$ acts on $\mathfrak{h}$ via $(\gamma, z) \mapsto \frac{az+b}{cz+d}$, where $\gamma = \left(\begin{smallmatrix} a & b \\ c & d \end{smallmatrix}\right)$. Thereby we get an equivalence relation on $\mathfrak{h}$ via $z \sim z'$ if there exists $\gamma \in \Gamma(1)$ such that $\gamma z = z'$. We write $\Gamma(1) \backslash \mathfrak{h}$ to emphasize this equivalence relation, and the set $\mathfrak{h}/\Gamma(1)$ denotes the set of orbits of this action. Importantly, there is a bijection $j : \mathfrak{h}/\Gamma(1) \to \mathbb{C}$, and $\mathfrak{h}/\Gamma(1)$ is isomorphic to a compact Riemann surface with one point missing. We can further extend the bijection $j$ to an isomorphism $j : \mathfrak{h}/\Gamma(1) \cup \{\infty\} \to \mathbb{P}^1(\mathbb{C})$ of compact Riemann surfaces. To go into more detail, write $\mathfrak{h}^* = \mathfrak{h} \cup \mathbb{P}^1(\mathbb{Q})$, and note that $\Gamma(1)$ acts on $\mathbb{P}^1(\mathbb{Q})$ via $\left(\left(\begin{smallmatrix} a & b \\ c & d \end{smallmatrix}\right), (x_0 : x_1)\right) \mapsto (ax_0 + bx_1 : cx_0 + dx_1)$. Defining $X(1) = \mathfrak{h}^*/\Gamma(1)$, we then have that $X(1) = \mathfrak{h}/\Gamma(1) \cup \{\infty\}$ is a compact Riemann surface.
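To see where the imaginary-part identity above comes from, note that
$$\frac{az+b}{cz+d} = \frac{(az+b)(c\bar{z}+d)}{|cz+d|^2}, \qquad \mathrm{Im}\big((az+b)(c\bar{z}+d)\big) = ad\,\mathrm{Im}(z) + bc\,\mathrm{Im}(\bar{z}) = (ad-bc)\,\mathrm{Im}(z),$$
and dividing by $|cz+d|^2$ gives the claimed formula.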

Note that $-I$, where $I$ is the identity of $\Gamma(1)$, acts trivially on the upper half-plane, as $(-I)z = \frac{-z+0}{0-1} = z$, and the group $\mathrm{PSL}_2(\mathbb{Z}) := \mathrm{SL}_2(\mathbb{Z})/\{\pm I\} = \Gamma(1)/\{\pm I\}$ acts on $\mathfrak{h}$ in a natural way.

Definition. Let $k$ be a nonnegative integer. A holomorphic function $f$ on $\mathfrak{h}$ is a *weak modular form* of weight $k$ if $f(\gamma z) = (cz+d)^k f(z)$ for all $z \in \mathfrak{h}$ and all $\gamma = \left(\begin{smallmatrix} a & b \\ c & d \end{smallmatrix}\right) \in \Gamma(1)$, where $\gamma z = \frac{az+b}{cz+d}$.

If $k$ were odd, then taking $\gamma = -I$ we would have $f(\gamma z) = f(z) = (-1)^k f(z) = -f(z)$, and so $f = 0$. Therefore if $f \neq 0$, then $k$ must be even; the weight of any nonzero weak modular form is even. In order to check that $f$ is a weak modular form, it suffices to check that $f(z+1) = f(z)$ and $f(-1/z) = z^k f(z)$, since $\Gamma(1)$ is generated by $T = \left(\begin{smallmatrix} 1 & 1 \\ 0 & 1 \end{smallmatrix}\right)$ and $S = \left(\begin{smallmatrix} 0 & -1 \\ 1 & 0 \end{smallmatrix}\right)$. As any weak modular form $f$ satisfies $f(z+1) = f(z)$, it is periodic with period $1$ as a function of the real part of $z$, so we obtain a Fourier series: we can write $f(z) = \sum_{n \in \mathbb{Z}} a_n q^n$ where $q = e^{2\pi i z}$. If $a_n = 0$ for all $n < 0$, then $f$ is said to be holomorphic at infinity.

Definition. A weak modular form $f$ of weight $k$ is a modular form of weight $k$ if it is holomorphic at infinity. If, furthermore, we have that $a_0 = 0$ in its $q$-expansion, then $f$ is said to be a cusp form.

Let $k$ be a nonnegative even integer. We write $M_k(\Gamma(1))$ to denote the vector space of modular forms of weight $k$ for $\Gamma(1)$, and we write $S_k(\Gamma(1))$ to denote the space of cusp forms of weight $k$ for $\Gamma(1)$.
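For example, the Eisenstein series $E_4(z) = 1 + 240\sum_{n \geq 1} \sigma_3(n) q^n$ (where $\sigma_3(n)$ is the sum of the cubes of the divisors of $n$) is a modular form of weight $4$, so $E_4 \in M_4(\Gamma(1))$, while the discriminant form $\Delta(z) = q\prod_{n \geq 1}(1 - q^n)^{24} = q - 24q^2 + 252q^3 - \cdots$ has $a_0 = 0$ in its $q$-expansion and is a cusp form of weight $12$, so $\Delta \in S_{12}(\Gamma(1))$.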

Lemma. Let $f$ be a modular form of weight $k$ and $g$ a modular form of weight $k'$, both for $\Gamma(1)$. Then $fg$ is a modular form of weight $k + k'$. Furthermore, $\bigoplus_{k=0}^{\infty} M_k(\Gamma(1))$ is a graded algebra.

Proof. Write $h = fg$. Then $h(\gamma z) = f(\gamma z)g(\gamma z) = (cz+d)^k (cz+d)^{k'} f(z) g(z) = (cz+d)^{k+k'} h(z)$. Now write $f(z) = \sum_{n \in \mathbb{Z}} a_n q^n$ and $g(z) = \sum_{n \in \mathbb{Z}} b_n q^n$. So $h(z) = \sum_{n \in \mathbb{Z}} c_n q^n$ with $c_n = \sum_{m \in \mathbb{Z}} a_m b_{n-m} = \sum_{m \geq 0} a_m b_{n-m} + \sum_{m < 0} a_m b_{n-m}$. If $n < 0$, then in the first sum $n - m < 0$, so $b_{n-m} = 0$ since $g$ is a modular form, while in the second sum $a_m = 0$ since $f$ is a modular form; hence $c_n = 0 + 0 = 0$. Therefore $h = fg$ is a modular form of weight $k + k'$. We have that $\bigoplus_{k=0}^{\infty} M_k(\Gamma(1))$ is a graded algebra as $f \in M_k(\Gamma(1))$ and $g \in M_{k'}(\Gamma(1))$ imply $fg \in M_{k+k'}(\Gamma(1))$, where addition is defined componentwise and $M_0(\Gamma(1)) \cong \mathbb{C}$ acts as the scalars. The last claim follows from the following argument: let $f \in M_0(\Gamma(1))$. Then $f(\gamma z) = f(z)$ for all $\gamma \in \Gamma(1)$, so in particular $f$ is periodic under $T : z \mapsto z+1$. Hence $f(z) = F(q)$ where $q = e^{2\pi i z}$, and holomorphy at the cusp implies that $F$ extends holomorphically to $q = 0$, so $F$ is holomorphic on the open unit disk. By $\Gamma(1)$-invariance, every value of $f$ is already attained on the closure of the standard fundamental domain, whose image under $z \mapsto q$ (together with $q = 0$) is a compact subset of the open disk; hence $|F|$ attains its maximum modulus at an interior point, and by the maximum modulus principle $F$ (and hence $f$) must be constant.

Definition. An n-dimensional


How do models learn?

Introduction

Interestingly, a significant portion of machine learning revolves around optimization, because a model needs to learn to improve its predictions of the target outputs; in machine-learning-adjacent job postings, you might see that they want someone with experience in convex optimization specifically. We'll take on this challenge of teaching a model to learn in a mathematical spirit, and this journal entry aims to explain it concisely but also, hopefully, clearly.

We measure how well our model's predictions match the target outputs with a loss function, and the goal of learning is to find the model parameters (e.g., weights, biases) which minimize the loss function. In practice, one of the most common strategies for optimization is the technique of gradient descent, the term "gradient" making some allusion to analysis.

Suppose we have a dataset $\mathcal{D}_N = \{(x_i, y_i)\}_{i=1}^{N}$ where each $x_i \in \mathbb{R}^d$ is an input/feature vector and $y_i$ is the corresponding target output (e.g. $y_i$ could be real-valued for regression, or a class label for classification). We propose a model function $f_\theta : \mathbb{R}^d \to \mathbb{R}^k$ governed by parameters $\theta \in \mathbb{R}^m$; this model function could be a linear model such as linear regression, or a neural network (NN). Now define a loss function $L(f_\theta(x), y)$ which measures the discrepancy between the prediction, $f_\theta(x)$, and the true label, $y$. For classification, a common choice is the cross-entropy loss; for regression, we typically use the squared error $L(f_\theta(x), y) = (f_\theta(x) - y)^2$, whose average over the dataset is the mean squared error (MSE); and for a linear model we have $f_\theta(x) = w^T x + b$ and $\theta = \{w, b\}$, where $w$ is a vector of coefficients and $b$ is the bias. In general, the overall cost (or objective) function is typically expressed as an average loss over all training samples:
$$C(\theta) = \frac{1}{N}\sum_{i=1}^{N} L(f_\theta(x_i), y_i),$$ and our aim is to find $\theta^* = \arg\min_{\theta} C(\theta)$.
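To make the notation concrete, here is a minimal NumPy sketch of the cost $C(\theta)$ for a linear model $f_\theta(x) = w^T x + b$ with MSE loss (the toy data and shapes below are purely illustrative):

```python
import numpy as np

def cost(w: np.ndarray, b: float, X: np.ndarray, y: np.ndarray) -> float:
    """Average MSE loss C(theta) over the dataset, with theta = {w, b}."""
    preds = X @ w + b                    # f_theta(x_i) for every row x_i of X
    return float(np.mean((preds - y) ** 2))

# Toy dataset: N = 4 samples, d = 2 features (values are illustrative).
X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5]])
y = np.array([5.0, 0.0, 7.0, -2.0])

print(cost(np.zeros(2), 0.0, X, y))      # cost at theta = 0
```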

Into gradient descent

The gradient descent technique is an iterative process for finding (local) minima of a real-valued function, and in our narrow case we want to minimize $C(\theta)$. Recall that the gradient of $C(\theta)$ is just $\nabla_\theta C(\theta) = \left(\frac{\partial C}{\partial \theta_1}, \frac{\partial C}{\partial \theta_2}, \ldots, \frac{\partial C}{\partial \theta_m}\right)$. Importantly, this gradient vector points in the direction of steepest increase of $C$, which is the essential property behind gradient descent: if you want to move $\theta$ in a direction that reduces $C(\theta)$, you should step against (i.e., in the negative direction of) the gradient. So, with this in mind, we set up the update rule $\theta_{\text{new}} = \theta_{\text{old}} - \eta \nabla_\theta C(\theta)$, where $\eta > 0$ is the learning rate (typically a small scalar). We perform this update (often called the "gradient descent step") many times until we (hopefully) converge to a minimum:

$$\theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta C(\theta^{(t)}).$$
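As a small sketch of the update rule by itself, here is gradient descent applied to the toy objective $C(\theta) = (\theta_1 - 3)^2 + (\theta_2 + 1)^2$ (the objective, learning rate, and iteration count are illustrative choices):

```python
import numpy as np

def C(theta: np.ndarray) -> float:
    return (theta[0] - 3.0) ** 2 + (theta[1] + 1.0) ** 2

def grad_C(theta: np.ndarray) -> np.ndarray:
    # Gradient of C: (2(theta_1 - 3), 2(theta_2 + 1)).
    return np.array([2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)])

theta = np.zeros(2)    # theta^(0)
eta = 0.1              # learning rate
for t in range(200):
    theta = theta - eta * grad_C(theta)   # theta^(t+1) = theta^(t) - eta * grad C(theta^(t))

print(theta)           # approaches the minimizer (3, -1)
```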

In an ideal world, to check that we have a minimum on our hands, we would want to check that we have a stationary point, $\nabla_\theta C(\theta) = 0$, but we would also need to check that the Hessian matrix of $C(\theta)$, $\nabla^2_\theta C(\theta)$, is positive semidefinite (positive definiteness would guarantee a strict local minimum). However, machine learning gives rise to a computationally difficult world in which this isn't usually the method/heuristic we use to verify that we have a minimum. In practical machine learning, especially in high-dimensional (lots and lots of parameters, i.e. $\theta \in \mathbb{R}^m$ where $m$ is big), non-convex landscapes, we instead rely on stopping criteria: useful heuristics which inform us that our iterative procedure is "good enough" or unlikely to produce something substantially better. Some common stopping criteria are (a code sketch follows the list):

  1. Stop when $\|\nabla_\theta C(\theta)\|$ drops below a small threshold (e.g. $10^{-5}$). The reasoning behind this is that a very small gradient suggests we are near a stationary point (although it could be a minimum, maximum, or saddle).

  2. Stop when $|C(\theta^{(t+1)}) - C(\theta^{(t)})|$ becomes negligibly small. If the objective is no longer decreasing significantly, further iterations may yield minimal improvement.

  3. Stop when $\|\theta^{(t+1)} - \theta^{(t)}\|$ is below a threshold. If the parameters hardly change with each step, it indicates that the algorithm is converging or stuck.

  4. In practice, due to time or resource limits, we might set a maximum number of epochs or a time budget.
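Here is the sketch promised above of how the first three criteria might be checked inside a training loop (the threshold values and the helper name are illustrative, not a standard API):

```python
import numpy as np

def should_stop(grad: np.ndarray,
                cost_prev: float, cost_curr: float,
                theta_prev: np.ndarray, theta_curr: np.ndarray,
                grad_tol: float = 1e-5,
                cost_tol: float = 1e-8,
                step_tol: float = 1e-8) -> bool:
    """Return True if any of stopping criteria 1-3 is met."""
    small_gradient = np.linalg.norm(grad) < grad_tol                  # criterion 1
    small_decrease = abs(cost_curr - cost_prev) < cost_tol            # criterion 2
    small_step = np.linalg.norm(theta_curr - theta_prev) < step_tol   # criterion 3
    return small_gradient or small_decrease or small_step
```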

What’s also important to note here is that your strategy/technique for gradient descent will vary depending on whether you’re in a convex or non-convex environment. In the non-convex world, functions like those that arise in deep neural networks have many stationary points, including local minima and saddle points. So a small gradient indicates that you’ve found a stationary point, but this is not guaranteed to be a global or even a local minimum. Formally proving you’ve reached the global minimum is usually intractable for non-convex problems, but the standard convergence heuristics above are sufficient in most machine learning contexts to declare that “training is done.”

Example. Implementing gradient descent for single variable linear regression:
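One way this might look, as a minimal NumPy sketch (the synthetic data, learning rate, and stopping threshold are illustrative choices):

```python
import numpy as np

# Synthetic single-variable data: y = 3x + 2 plus a little noise (illustrative values).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 3.0 * x + 2.0 + 0.1 * rng.normal(size=100)

# Model: f_theta(x) = w*x + b, cost C(theta) = MSE over the dataset.
w, b = 0.0, 0.0
eta = 0.1            # learning rate
tol = 1e-8           # stopping threshold on the change in cost (criterion 2)
max_epochs = 10_000  # criterion 4: hard cap on iterations

prev_cost = np.inf
for epoch in range(max_epochs):
    pred = w * x + b
    err = pred - y
    cost = np.mean(err ** 2)

    # Gradients of the MSE cost with respect to w and b.
    grad_w = 2.0 * np.mean(err * x)
    grad_b = 2.0 * np.mean(err)

    # Gradient descent step: theta <- theta - eta * grad C(theta).
    w -= eta * grad_w
    b -= eta * grad_b

    # Stop when the cost is no longer decreasing significantly.
    if abs(prev_cost - cost) < tol:
        break
    prev_cost = cost

print(f"w = {w:.3f}, b = {b:.3f}, cost = {cost:.5f}")   # w, b should approach 3 and 2
```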

Convex vs. non-convex

In optimization theory, the convex vs. non-convex distinction is crucial, fundamentally altering how easy or hard it is to find (as well as verify) a global minimum. To further dig into why this is crucial, recall that a function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if for all $x, y \in \mathbb{R}^n$ and all $\lambda \in [0,1]$, $f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$ holds. Geometrically, this property says that any line segment between two points on the graph of $f$ lies on or above the graph. Also, convex functions have the property that any local minimum is automatically a global minimum.
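For instance, $f(x) = x^2$ is convex: for any $x, y \in \mathbb{R}$ and $\lambda \in [0,1]$,
$$\lambda x^2 + (1-\lambda)y^2 - (\lambda x + (1-\lambda)y)^2 = \lambda(1-\lambda)(x-y)^2 \geq 0,$$
so the defining inequality holds.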

Lemma. If f is convex, then a local minimum is in fact a global minimum.

Proof. Let $f$ be a convex function, and let $x^*$ be a local minimum. By definition of a local minimum, there exists a neighborhood $\mathcal{N}$ around $x^*$ such that $f(x^*) \leq f(x)$ for all $x \in \mathcal{N}$. Suppose, for contradiction, that $x^*$ is not a global minimum. Then there exists some $y$ such that $f(y) < f(x^*)$. By convexity of $f$, for any $\lambda \in (0,1)$: $f(\lambda x^* + (1-\lambda)y) \leq \lambda f(x^*) + (1-\lambda)f(y) < \lambda f(x^*) + (1-\lambda)f(x^*) = f(x^*)$. For $\lambda$ sufficiently close to $1$, $\lambda x^* + (1-\lambda)y$ lies in the neighborhood $\mathcal{N}$, contradicting the fact that $x^*$ is a local minimum. Therefore, $x^*$ must be a global minimum.
