This lecture is part of a course on probability, estimation theory, and random signals. Hello, and welcome back to the lecture slide set on estimation theory. In the last topic, we looked at the least squares approach to parameter estimation, minimizing the squared difference between some observed data x(n) and an assumed or hidden signal model s(n). We developed this cost function here, which is the sum of squared errors, and the so-called least squares error estimator is found by minimizing this cost function with respect to the parameter of interest. This is a quadratic cost function, also called the L2 norm. We considered the least squares problem from different perspectives, but we also noticed that, in some situations, the cost function is very nonlinear in the parameter you are trying to estimate. We looked at the sinusoidal frequency estimation problem, where the signal model was a cosine with an unknown frequency f_0 and the cost function was highly nonlinear. That was solved using a method called nonlinear least squares, which is effectively an optimization problem tackled with grid searches or iterative minimization methods.

In today's video, we're going to look at a problem called linear least squares, and that is one where we can get a nice analytical solution. In linear least squares we have an observed signal, or set of observations, x(n) for n = 0 to N-1, and this is a perturbed version of an unknown signal, denoted s(n) for n = 0 to N-1. As we've done before, each of these processes can be written as a random vector, denoted simply by s and x here. In the linear least squares problem, we assume that the signal s(n) can be written as a linear combination of known functions h_k(n), for k = 1 to p, so there are p different functions. Typical examples involve polynomial terms such as h_k(n) = n^k: setting k = 0 gives a constant, k = 1 a linear term, k = 2 a quadratic, and so forth. But many other known basis functions can also be chosen. The weightings of this linear combination are denoted by the parameters theta_k, for k = 1 to p.

We've already seen problems earlier in this chapter on estimation theory where we were, for example, fitting straight lines. In that case we would choose the basis functions h_0(n) = 1, a constant, and h_1(n) = n, so that s(n) can be written as theta_0 times h_0(n) plus theta_1 times h_1(n), which is just the familiar straight-line form A + B n.

Now consider writing this equation out for different values of n. For n = 0 you still have the sum of theta_k times h_k(0), but you can write this as an inner product: define a row vector consisting of the basis functions evaluated at n = 0, for the index k from 1 to p, and multiply it by the column vector of parameters theta_1 down to theta_p. That row times a column is an inner product, and it becomes the first element of the vector s. Proceeding to write this out for the other values of n, you can stack together the rows of basis functions evaluated at the different time indices, and you end up with a set of simultaneous equations.
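To make the linear-in-the-parameters idea concrete, here is a minimal MATLAB sketch of the straight-line example. It is my own illustration rather than anything from the slides, and the parameter values A = 1 and B = 0.5 are assumed purely for demonstration.

```matlab
% Minimal sketch: straight-line model s(n) = A + B*n written as s = H*theta.
N = 8;                     % number of samples
n = (0:N-1)';              % time index as a column vector
H = [ones(N,1), n];        % each row is [h_0(n)  h_1(n)] = [1  n]
theta = [1; 0.5];          % assumed parameters, theta = [A; B]
s = H*theta;               % s(n) = A + B*n as a single matrix-vector product
disp([n, s])               % show the time index alongside the model values
```

Each row of H is exactly the row-times-column inner product described above, so stacking the rows turns the set of simultaneous equations into one matrix-vector product.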
Notice that the parameters, which are constant across the various equations, are the unknown parameters theta. So what we can see is that it is possible to write this linear combination of basis functions in matrix-vector form, and the expression simplifies to s = H theta. Remember that theta is unknown, and it is the quantity we're trying to estimate. The least squares error is again found by minimizing the squared error between the observed data and the signal model. So, defining the error vector e = x - H theta, and noting that the sum of the squares of its elements is equivalent to writing e^T e, you arrive at the equation at the bottom, J(theta) = (x - H theta)^T (x - H theta).

Now expand this quadratic form. We have x^T x, then the cross terms, and then the quadratic term in theta. The transpose of a product is the product of the transposes in reverse order, so (H theta)^T x = theta^T H^T x, and together with x^T H theta and the quadratic term theta^T H^T H theta that completes the expansion. Again, as we've seen in previous videos, J is a scalar. Because of that, consider the two cross terms: if you take theta^T H^T x and transpose all of it, which reverses the order and transposes the individual terms appropriately, you can see that it is equivalent to the second cross term, x^T H theta, so the two are identical. That gives us the cost function J(theta) = x^T x - 2 theta^T H^T x + theta^T H^T H theta.

Now we need to differentiate this with respect to the parameter theta. Because theta is a vector, the change in J is a vector differential: you end up with dJ/d theta_1 up to dJ/d theta_p. You could do this element by element, converting the matrix-vector products back into summation form so that the differentials are easy. However, there are a couple of standard results we can draw upon to make this simpler and to stay within the matrix-vector formulation.

The two things to note are the following. First, if you have an inner product of two vectors, b^T a, and you differentiate it with respect to a, you get a linear term: d(b^T a)/da = b. Let me explain where that comes from. Using standard notation for the elements of the vectors, say b_i and a_i, imagine differentiating the sum of b_i a_i with respect to one element, a_j. The differential of a sum is the sum of the differentials, and you end up with the term da_i/da_j, which we have seen in previous lectures: it is equal to one if the elements are the same and equal to zero otherwise, because the elements are independent of each other. So in the summation only one term survives, namely b_j; that is the derivative with respect to a_j. If you visualize doing that for all the different values of j, you end up forming the vector b. That makes sense because b^T a is a scalar, so differentiating with respect to a vector should give a column vector, and that is exactly what we get; the dimensionality works out. In the same way, what happens if you differentiate the quadratic term a^T B a with respect to a? I'm going to leave you to prove this yourself.
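If you want to convince yourself of these two vector derivative results numerically, the following MATLAB sketch (my own check, with arbitrary random test values) compares each identity against a finite-difference approximation.

```matlab
% Minimal sketch: numerically check d(b'*a)/da = b and d(a'*B*a)/da = (B + B')*a.
p = 4;  rng(0);                             % small test case with fixed random seed
a = randn(p,1);  b = randn(p,1);  B = randn(p,p);
eps_ = 1e-6;                                % finite-difference step size
g1 = zeros(p,1);  g2 = zeros(p,1);          % finite-difference gradients
for j = 1:p
    da = zeros(p,1);  da(j) = eps_;         % perturb only the j-th element of a
    g1(j) = (b'*(a+da) - b'*a) / eps_;      % approximate d(b'*a)/da_j
    g2(j) = ((a+da)'*B*(a+da) - a'*B*a) / eps_;  % approximate d(a'*B*a)/da_j
end
disp(norm(g1 - b))              % essentially zero
disp(norm(g2 - (B + B')*a))     % small, on the order of eps_
```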
It's simply a matter of writing this expression out as a double summation over the elements of the matrix B and the elements of a, so you have a summation of a_i B_ij a_j over i and j. If you differentiate this with respect to a, it is possible to prove that you end up with the matrix B plus its own transpose, multiplied by the vector a: d(a^T B a)/da = (B + B^T) a. Notice that when we differentiate this quadratic term we end up with a linear term, which very much makes sense.

So let's go back to the cost function, which is here, and make some comparisons. We are obviously setting a equal to theta. Comparing the quadratic terms, we can see that B corresponds to H^T H, which is symmetric, so B^T = B and (B + B^T) theta is just 2 H^T H theta. Comparing the linear terms, because a is equal to theta, we can see that b is equal to H^T x. So now we can find the derivative of J using these two formulas. The term x^T x does not depend on theta, so it differentiates to zero, and the derivative of the cost function is equal to minus two lots of b, where b was H^T x, plus two lots of H^T H theta: dJ/d theta = -2 H^T x + 2 H^T H theta. Setting this equal to zero, we get the result that H^T H theta = H^T x.

You will then need to invert this, and for the purposes of today I'm going to assume that the number of data points N is much, much greater than the number of unknown parameters, and in that case we can assume that the inverse of the matrix H^T H exists. Assuming that is the case, we can take the matrix inverse of this product on both sides, which gives theta_hat = (H^T H)^{-1} H^T x. The set of equations on the previous slide, H^T H theta = H^T x, is known as the normal equations. It is a very famous set of equations that occurs in a number of different situations, and one you should become very familiar with. Elsewhere you might see the pseudo-inverse, (H^T H)^{-1} H^T, used to represent this generic solution. I'm not going to focus on it too much in this lecture, because this isn't really supposed to be a linear algebra or optimization course; there are a whole variety of geometric interpretations of this matrix inverse, and they are incredibly important, but you will really cover those in a linear algebra course. As I've mentioned, if H is full rank, that guarantees the invertibility of the matrix H^T H.

Given the optimal parameters, we can then work out what the actual minimum least squares error will be in that particular situation. Substituting this expression into the quadratic error term which we had on the previous slide, the error vector times itself, you end up with this second expression here. After a little bit of rearrangement, for example taking an x^T out of each term, you end up with the form J_min = x^T (I_N - H (H^T H)^{-1} H^T) x. What is interesting is that the identity matrix here is of size N. Just to remind you, H is of size N by p, which means that H^T H is of size p by p, and therefore H (H^T H)^{-1} H^T is, if you follow the dimensions along, N by p times p by p times p by N, so the whole thing is N by N. That is why we have the identity matrix of size N. Now, it turns out that the matrix I_N minus H times the pseudo-inverse has a very particular property.
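As an illustration of the normal equations in practice, here is a minimal MATLAB sketch of a straight-line fit; the true parameter values and the noise level are my own assumptions, not values from the lecture. It also confirms numerically that the J_min expression equals the squared norm of the residual e = x - H theta_hat.

```matlab
% Minimal sketch: solve the normal equations for a straight-line fit to noisy data.
N = 100;  n = (0:N-1)';
H = [ones(N,1), n];                       % N-by-2 observation matrix
x = 1 + 0.5*n + randn(N,1);               % noisy observations of A + B*n (A = 1, B = 0.5 assumed)
theta_hat = (H'*H) \ (H'*x);              % least squares estimate from the normal equations
J_min = x'*(eye(N) - H*((H'*H)\H'))*x;    % minimum least squares error expression
e = x - H*theta_hat;                      % residual vector
disp(theta_hat')                          % should be close to [1  0.5]
disp([J_min, e'*e])                       % the two values should agree
```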
And that is called idempotent. That means that if you call this matrix A, then A times itself is equal to A. You may want to verify that yourself; it is a very interesting property that comes up a lot in linear algebra, and it also has a very interesting interpretation as a projection. If you substitute for A, you can see that the minimum error term simplifies to J_min = x^T x - x^T H (H^T H)^{-1} H^T x, which is quite a familiar expression that you will see a number of times. It obviously depends on the data values, but otherwise it does not depend on the estimated values of the parameters. So the key takeaway is that set of equations, the solution to the normal equations.

Let's have a look at how we would apply this to a particular problem. To do that, I'm going to look at Fourier series estimation as an application of the general linear model. Suppose that a signal is modeled as a sum of sinusoids, so here s(n) is a sum of p sinusoidal components. I've chosen a fundamental frequency, omega_0, and the frequency of each of these sinusoids is k times omega_0; this is just the standard Fourier series expression. The amplitudes of the sinusoidal and cosinusoidal terms are given by a_k and b_k, and these are the unknown amplitudes. For this problem we're going to consider the fundamental frequency omega_0 and the number of sine and cosine terms, p, to be known in advance. If you don't know the value of omega_0, you will need to use the nonlinear least squares that I referred to in a previous topic; in fact you would end up with a combination of linear least squares and nonlinear least squares. If you don't know the model order, things are a bit more complicated, but we can return to model order selection at a later date.

We assume that the signal x(n) is observed in noise. I haven't specified what the noise is; it's just that the observed signal is not equal to the true signal. The question here is simply to write down the least squares solution, and that effectively amounts to writing this expression in the form H times theta. Writing the observations out explicitly, we end up with this equation here, which is really just s(n) substituted in. The key step is to notice that this trigonometric Fourier series can be written in what is called linear-in-the-parameters form, in other words in the form H theta. You can see that H is going to consist of sinusoidal and cosinusoidal terms. What changes as you go down the rows is essentially the sample index n, and as you go across the columns you alternate between sines and cosines of increasing frequency. You can see that from the first row here: you have sin(omega_0 n) and cos(omega_0 n), then sin(2 omega_0 n) and cos(2 omega_0 n), and the frequency increases until you get to the p-th frequency component. So if we define a parameter vector consisting of the unknown amplitudes a_1, b_1, a_2, b_2, and so on, then we have a signal model that does indeed equal H times theta, and therefore the linear least squares estimator is given by the same form we had before. Note that theta is of dimension 2p, because there are p sine and p cosine amplitudes, and therefore H is of size N by 2p.
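Two things on this slide are easy to check numerically: the idempotent property of I_N - H (H^T H)^{-1} H^T, and the construction of the N-by-2p Fourier matrix H. Below is a small MATLAB sketch of both; the values of N, p and omega_0, and the sine-then-cosine column ordering, are my own assumptions for illustration.

```matlab
% Minimal sketch: build the N-by-2p Fourier matrix H and check idempotency.
N = 200;  p = 5;  w0 = 2*pi/N;       % assumed values; w0 gives one full period over the N samples
n = (0:N-1)';
H = zeros(N, 2*p);
for k = 1:p
    H(:, 2*k-1) = sin(k*w0*n);       % sine column for harmonic k
    H(:, 2*k)   = cos(k*w0*n);       % cosine column for harmonic k
end

A = eye(N) - H*((H'*H)\H');          % the matrix appearing in the J_min expression
disp(norm(A*A - A, 'fro'))           % essentially zero: A is idempotent, A*A = A
```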
Now it turns out that, using the orthogonality of these basis functions, if you were to actually try to calculate what H^T H is, you can get some additional simplifications, which I'm not going to cover in today's topic; but you would notice that the resulting equations are very important and actually lead to a number of other properties of this particular linear least squares solution.

One thing I'd encourage you to do is to go and try a linear least squares fit yourself, and I've put some MATLAB code onto Learn. In this diagram, what we have in the blue line is a sawtooth waveform. The waveform doesn't start at zero, so it is neither an even nor an odd function, which means that if we try to represent it using a Fourier series, we need both the sine and the cosine terms. The blue line is the true signal, and the orange line here (or brown, depending on your monitor) is the true signal observed in some noise. I've added some noise; in this case it is white Gaussian noise, but from a least squares perspective it doesn't matter what type of noise it is, because least squares is a non-probabilistic approach to solving the problem. We pretend we don't know what the true signal is, so I modeled it using the Fourier basis functions. I've chosen p = 20 sinusoids and 1000 data points, and I've assumed we know the fundamental frequency, because this particular waveform repeats with period T; in this case the fundamental frequency turns out to be equal to pi.

Given all of that, I'm literally going to apply the formula we had on the previous slide, the solution to the normal equations. I can create my matrix H by populating it with these sinusoidal terms, with omega_0 equal to pi. Once I've got the estimated coefficients from the solution of these equations, I can then create an estimate of the underlying signal from the signal model, s_hat = H theta_hat, and so recreate the waveform. That gives rise to this red curve, which you can see here, and you can see that it is an interesting approximation to the sawtooth function, with some of the familiar ringing artifacts that you might expect from the Gibbs phenomenon. Setting aside the differences from the pure continuous-time sinusoidal fits you might meet elsewhere, you can see that the red curve is not a terrible approximation to the data. It clearly isn't the best model to use, but this is a key point about least squares models, and indeed about many estimation methods: if you don't know what the underlying signal model really is and you use a less than ideal signal model, then the fit isn't going to be perfect, and inevitably you're going to get some approximation error. But within the constraints of the signal model we've actually chosen, this red line is not a terrible approximation.

I will leave it to you to have a bit more of a think about other signal models you might use to represent this underlying data. I would encourage you to think about how you would choose that model without necessarily relying on lots of additional insight that comes from human design; think about how you would create an algorithm to do that automatically. So, today's video was an introduction to the linear least squares estimation technique, building on the least squares estimation that we saw in previous videos.
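For reference, here is a minimal MATLAB sketch in the spirit of the demo just described; it is not the code on Learn, and the time axis, period, noise level, and sawtooth shape are my own assumptions. With an assumed period of T = 2 on a continuous time axis, the fundamental frequency 2*pi/T does come out as pi, matching the value mentioned above.

```matlab
% Minimal sketch: least squares Fourier fit to a noisy sawtooth (assumed setup, not the Learn demo).
N = 1000;  p = 20;                       % 1000 data points, 20 harmonics, so 2p = 40 unknown amplitudes
T = 2;  w0 = 2*pi/T;                     % assumed period T = 2, giving omega_0 = pi
t = linspace(0, 3*T, N)';                % time axis covering three periods

s_true = 2*mod(t, T)/T - 1;              % sawtooth rising from -1 to 1 over each period
x = s_true + 0.2*randn(N,1);             % observe the true signal in white Gaussian noise

H = zeros(N, 2*p);                       % N-by-2p matrix of sine/cosine basis functions
for k = 1:p
    H(:, 2*k-1) = sin(k*w0*t);
    H(:, 2*k)   = cos(k*w0*t);
end

theta_hat = (H'*H) \ (H'*x);             % solve the normal equations
s_hat = H*theta_hat;                     % reconstructed signal estimate

plot(t, x, '.', t, s_true, t, s_hat, 'LineWidth', 1.5)
legend('noisy data', 'true sawtooth', 'least squares fit')
```

You should see the familiar ringing near the discontinuities, as mentioned in the lecture, but otherwise a reasonable fit to the underlying sawtooth.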
But the point about linear least squares is that if you can represent your underlying signal with this linear-in-the-parameters model, then you can get some very elegant solutions for minimizing the squared error. The solutions are indeed found just by solving a set of linear equations, which is computationally very efficient and very easy. This video brings us towards the end of our chapter on estimation techniques and designing estimators. There is one big section left for you to read in your self-study time, and that's on Bayesian estimation theory. Whilst that section is predominantly self-study, at a later date I will add some videos which go into more depth on Bayesian estimation theory. But until then, thank you very much.