Applying the least squares method in Excel. Least Squares in Excel

The least squares method is a mathematical procedure for constructing a linear equation that most closely matches a set of two series of numbers. The purpose of this method is to minimize the total squared error. Excel has tools that can be used to apply this method in calculations. Let's see how it's done.

The method of least squares (LSM) is a mathematical description of the dependence of one variable on another. It can be used for forecasting.

Enable the Solver add-in

In order to use OLS in Excel, you need to enable the add-in "Search for a Solution", which is disabled by default.

Now the function Finding a solution in Excel is activated, and its tools appear on the ribbon.

Conditions of the problem

Let us describe the application of LSM on a specific example. We have two rows of numbers x and y , the sequence of which is shown in the image below.

This dependence can most accurately be described by the function:

At the same time, it is known that x=0 y also equal 0 . Therefore, this equation can be described by the dependence y=nx .

We have to find the minimum sum of squares of the difference.

Solution

Let us proceed to the description of the direct application of the method.

As you can see, the application of the least squares method is a rather complicated mathematical procedure. We have shown it in action with the simplest example, but there are much more complex cases. However, the Microsoft Excel toolkit is designed to simplify the calculations as much as possible.

The method of least squares (LSM) belongs to the field of regression analysis. It has many applications, as it allows an approximate representation of a given function by other simpler ones. LSM can be extremely useful in processing observations, and it is actively used to estimate some quantities from the results of measurements of others containing random errors. In this article, you will learn how to implement least squares calculations in Excel.

Statement of the problem on a specific example

Suppose there are two indicators X and Y. Moreover, Y depends on X. Since OLS is of interest to us from the point of view of regression analysis (in Excel, its methods are implemented using built-in functions), we should immediately proceed to consider a specific problem.

So, let X be the selling area of a grocery store, measured in square meters, and Y be the annual turnover, defined in millions of rubles.

It is required to make a forecast of what turnover (Y) the store will have if it has one or another retail space. Obviously, the function Y = f (X) is increasing, since the hypermarket sells more goods than the stall.

A few words about the correctness of the initial data used for prediction

Let's say we have a table built with data for n stores.

According to mathematical statistics, the results will be more or less correct if the data on at least 5-6 objects are examined. Also, "anomalous" results cannot be used. In particular, an elite small boutique can have a turnover many times greater than the turnover of large outlets of the “masmarket” class.

The essence of the method

The table data can be displayed on the Cartesian plane as points M 1 (x 1, y 1), ... M n (x n, y n). Now the solution of the problem will be reduced to the selection of an approximating function y = f (x), which has a graph passing as close as possible to the points M 1, M 2, .. M n .

Of course, you can use a high degree polynomial, but this option is not only difficult to implement, but simply incorrect, since it will not reflect the main trend that needs to be detected. The most reasonable solution is to search for a straight line y = ax + b, which best approximates the experimental data, and more precisely, the coefficients - a and b.

Accuracy score

For any approximation, the assessment of its accuracy is of particular importance. Denote by e i the difference (deviation) between the functional and experimental values for the point x i , i.e. e i = y i - f (x i).

Obviously, to assess the accuracy of the approximation, you can use the sum of deviations, i.e., when choosing a straight line for an approximate representation of the dependence of X on Y, preference should be given to the one that has the smallest value of the sum e i at all points under consideration. However, not everything is so simple, since along with positive deviations, there will practically be negative ones.

You can solve the problem using the deviation modules or their squares. The latter method is the most widely used. It is used in many areas, including regression analysis (in Excel, its implementation is carried out using two built-in functions), and has long been proven to be effective.

Least square method

In Excel, as you know, there is a built-in autosum function that allows you to calculate the values of all values located in the selected range. Thus, nothing will prevent us from calculating the value of the expression (e 1 2 + e 2 2 + e 3 2 + ... e n 2).

In mathematical notation, this looks like:

Since the decision was initially made to approximate using a straight line, we have:

Thus, the task of finding a straight line that best describes a specific relationship between X and Y amounts to calculating the minimum of a function of two variables:

This requires equating to zero partial derivatives with respect to new variables a and b, and solving a primitive system consisting of two equations with 2 unknowns of the form:

After simple transformations, including dividing by 2 and manipulating the sums, we get:

Solving it, for example, by Cramer's method, we obtain a stationary point with certain coefficients a * and b * . This is the minimum, i.e. to predict what turnover the store will have for a certain area, the straight line y = a * x + b * is suitable, which is a regression model for the example in question. Of course, it will not allow you to find the exact result, but it will help you get an idea of \u200b\u200bwhether buying a store on credit for a particular area will pay off.

How to implement the least squares method in Excel

Excel has a function for calculating the value of the least squares. It has the following form: TREND (known Y values; known X values; new X values; constant). Let's apply the formula for calculating the OLS in Excel to our table.

To do this, in the cell in which the result of the calculation using the least squares method in Excel should be displayed, enter the “=” sign and select the “TREND” function. In the window that opens, fill in the appropriate fields, highlighting:

range of known values for Y (in this case data for turnover);
range x 1 , …x n , i.e. the size of retail space;
and known and unknown values of x, for which you need to find out the size of the turnover (for information about their location on the worksheet, see below).

In addition, there is a logical variable "Const" in the formula. If you enter 1 in the field corresponding to it, then this will mean that calculations should be carried out, assuming that b \u003d 0.

If you need to know the forecast for more than one x value, then after entering the formula, you should not press "Enter", but you need to type the combination "Shift" + "Control" + "Enter" ("Enter") on the keyboard.

Some Features

Regression analysis can be accessible even to dummies. The Excel formula for predicting the value of an array of unknown variables - "TREND" - can be used even by those who have never heard of the least squares method. It is enough just to know some features of its work. In particular:

If you place the range of known values of the variable y in one row or column, then each row (column) with known values of x will be perceived by the program as a separate variable.
If the range with known x is not specified in the TREND window, then in case of using the function in Excel, the program will consider it as an array consisting of integers, the number of which corresponds to the range with the given values of the variable y.
To output an array of "predicted" values, the trend expression must be entered as an array formula.
If no new x values are specified, then the TREND function considers them equal to the known ones. If they are not specified, then array 1 is taken as an argument; 2; 3; 4;…, which is commensurate with the range with already given parameters y.
The range containing the new x values must have the same or more rows or columns as the range with the given y values. In other words, it must be proportionate to the independent variables.
An array with known x values can contain multiple variables. However, if we are talking about only one, then it is required that the ranges with the given values of x and y be commensurate. In the case of several variables, it is necessary that the range with the given y values fit in one column or one row.

FORECAST function

Regression analysis in Excel is implemented using several functions. One of them is called "PREDICTION". It is similar to TREND, i.e. it gives the result of calculations using the least squares method. However, only for one X, for which the value of Y is unknown.

Now you know the Excel formulas for dummies that allow you to predict the value of the future value of an indicator according to a linear trend.

4.1. Using built-in functions

calculation regression coefficients carried out using the function

LINEST(Values_y; Values_x; Konst; statistics),

Values_y- array of y values,

Values_x- optional array of values x if array X omitted, it is assumed that this is an array (1;2;3;...) of the same size as Values_y,

Konst- a boolean value that indicates whether the constant is required b was equal to 0. If Konst has the meaning TRUE or omitted, then b calculated in the usual way. If the argument Konst is FALSE, then b is assumed to be 0 and the values a are chosen so that the relation y=ax.

Statistics- a boolean value that indicates whether additional regression statistics are required to be returned. If the argument Statistics has the meaning TRUE, then the function LINEST returns additional regression statistics. If the argument Statistics has the meaning LYING or omitted, then the function LINEST returns only the coefficient a and permanent b.

It must be remembered that the result of the functions LINEST() is a set of values - an array.

For calculation correlation coefficient function is used

CORREL(Array1;Array2),

returning the values of the correlation coefficient, where Array1- array of values y, Array2- array of values x. Array1 and Array2 must be the same size.

EXAMPLE 1. Addiction y(x) is presented in the table. Build regression line and calculate correlation coefficient.

y		0.5		1.5		2.5		3.5
x		2.39	2.81	3.25	3.75	4.11	4.45	4.85	5.25

Let's enter a table of values into MS Excel sheet and build a scatter plot. The worksheet will take the form shown in Fig. 2.

In order to calculate the values of the regression coefficients a and b select cells A7:B7, let's turn to the function wizard and in the category Statistical choose a function LINEST. Fill in the dialog box that appears as shown in Fig. 3 and press OK.

As a result, the calculated value will appear only in the cell A6(Fig. 4). For a value to appear in a cell B6 you need to enter edit mode (key F2) and then press the key combination CTRL+SHIFT+ENTER.

To calculate the value of the correlation coefficient per cell C6 the following formula was introduced:

C7=CORREL(B3:J3;B2:J2).

Knowing the regression coefficients a and b calculate the values of the function y=ax+b for given x. To do this, we introduce the formula

B5=$A$7*B2+$B$7

and copy it to the range С5:J5(Fig. 5).

Let's plot the regression line on the diagram. Select the experimental points on the chart, right-click and select the command Initial data. In the dialog box that appears (Fig. 5), select the tab Row and click on the button Add. Fill in the input fields, as shown in Fig. 6 and press the button OK. A regression line will be added to the experimental data plot. By default, its graph will be displayed as dots not connected by smoothing lines.

Rice. 6

To change the appearance of the regression line, perform the following steps. Right-click on the points depicting the line graph, select the command Chart type and set the type of scatter plot, as shown in Fig. 7.

The line type, color and thickness can be changed as follows. Select the line on the diagram, press the right mouse button and select the command in the context menu Data Series Format… Next, make settings, for example, as shown in Fig. eight.

As a result of all the transformations, we get a graph of experimental data and a regression line in one graphic area (Fig. 9).

4.2. Using a trend line.

The construction of various approximating dependencies in MS Excel is implemented as a chart property - trend line.

EXAMPLE 2. As a result of the experiment, some tabular dependence was determined.

0.15	0.16	0.17	0.18	0.19	0.20
4.4817	4.4930	5.4739	6.0496	6.6859	7.3891

Select and build an approximating dependence. Build graphs of tabular and fitted analytical dependencies.

The solution of the problem can be divided into the following stages: input of initial data, construction of a scatter plot and addition of a trend line to this plot.

Let's consider this process in detail. Let's enter the initial data into the worksheet and plot the experimental data. Next, select the experimental points on the chart, right-click and use the command Add l trend line(Fig. 10).

The dialog box that appears allows you to build an approximating dependence.

The first tab (Fig. 11) of this window indicates the type of approximating dependence.

The second one (Fig. 12) defines the construction parameters:

the name of the approximating dependence;

Forecast forward (backward) on n units (this parameter determines how many units forward (backward) it is necessary to extend the trend line);

whether to show the point of intersection of the curve with the line y=const;

whether to show the approximating function on the diagram or not (show the equation on the diagram parameter);

Whether to place the value of the standard deviation on the diagram or not (the parameter put the value of the approximation reliability on the diagram).

Let us choose a polynomial of the second degree as an approximating dependence (Fig. 11) and derive an equation describing this polynomial on the graph (Fig. 12). The resulting diagram is shown in fig. thirteen.

Similarly, with trend lines you can choose the parameters of such dependencies as

linear y=a∙x+b,

logarithmic y=a ln(x)+b,

exponential y=a∙eb,

power y=a x b,

polynomial y=a∙x 2 +b∙x+c, y=a∙x 3 +b∙x 2 +c∙x+d and so on, up to and including the 6th degree polynomial,

Linear filtering.

4.3. Using the Decider

Of considerable interest is the implementation in MS Excel of the selection of parameters by the least squares method using a decision block. This technique allows you to choose the parameters of a function of any kind. Let's consider this possibility on the example of the following problem.

EXAMPLE 3. As a result of the experiment, the dependence z(t) presented in the table

0,66	0,9	1,17	1,47	1,7	1,74	2,08	2,63	3,12
38,9	68,8	64,4	66,5	64,95	59,36	82,6	90,63	113,5

Select dependency coefficients Z(t)=At 4 +Bt 3 +Ct 2 +Dt+K by the least squares method.

This problem is equivalent to the problem of finding the minimum of a function of five variables

Consider the process of solving the optimization problem (Fig. 14).

Let the values A, V, WITH, D and TO stored in cells A7:E7. Calculate the theoretical values of the function Z(t)=At4+Bt3+Ct2+Dt+K for given t(B2:J2). To do this, in the cell B4 enter the value of the function at the first point (cell B2):

B4=$A$7*B2^4+$B$7*B2^3+$C$7*B2^2+$D$7*B2+$E$7.

Copy this formula into the range С4:J4 and get the expected value of the function at points, the abscissas of which are stored in cells B2:J2.

To cell B5 we introduce a formula that calculates the square of the difference between the experimental and calculated points:

B5=(B4-B3)^2,

and copy it to the range С5:J5. In a cell F7 we will store the total quadratic error (10). To do this, we introduce the formula:

F7 = SUM(B5:J5).

Let's use the command Service®Search for a solution and solve the optimization problem without constraints. Fill in the appropriate input fields in the dialog box shown in Fig. 14 and press the button Run. If a solution is found, the window shown in Fig. 15.

The result of the decision block will be the output to the cells A7:E7parameter values functions Z(t)=At4+Bt3+Ct2+Dt+K. In cells B4:J4 we get expected function value at starting points. In a cell F7 will be kept total squared error.

You can display the experimental points and the fitted line in the same graphic area if you select the range B2:J4, call Chart Wizard, and then format the appearance of the resulting graphs.

Rice. 17 displays the MS Excel worksheet after the calculations have been made.

5. REFERENCES

1. Alekseev E.R., Chesnokova O.V., Solving problems of computational mathematics in the packages Mathcad12, MATLAB7, Maple9. – NT Press, 2006.–596s. :ill. – (Tutorial)

2. Alekseev E.R., Chesnokova O.V., E.A. Rudchenko, Scilab, solving engineering and mathematical problems. –M., BINOM, 2008.–260s.

3. I. S. Berezin and N. P. Zhidkov, Methods of Computation, Moscow: Nauka, 1966.

4. Garnaev A.Yu., The use of MS EXCEL and VBA in economics and finance. - St. Petersburg: BHV - Petersburg, 1999.-332p.

5. B. P. Demidovich, I. A. Maron, and V. Z. Shuvalova, Numerical Methods of Analysis.–M.: Nauka, 1967.–368p.

6. Korn G., Korn T., Handbook of mathematics for scientists and engineers.–M., 1970, 720p.

7. Alekseev E.R., Chesnokova O.V. Guidelines for performing laboratory work in MS EXCEL. For students of all specialties. Donetsk, DonNTU, 2004. 112 p.

Well, at work they reported to the inspection, the article was written at home for the conference - now you can write in the blog. While I was processing my data, I realized that I could not help but write about a very cool and necessary add-in in Excel, which is called . So the article will be devoted to this particular add-in, and I will tell you about it using an example of using least squares method(LSM) to search for unknown coefficients of the equation in the description of experimental data.

How to enable the add-on "search for a solution"

First, let's figure out how to enable this add-on.

1. Go to the "File" menu and select "Excel Options"

2. In the window that appears, select "Search for a solution" and click "go".

3. In the next window, put a checkmark in front of the "search for a solution" item and click "OK".

4. The add-in is activated - now it can be found in the "Data" menu item.

Least square method

Now briefly about least squares method (LSM) and where it can be applied.

Let's say we have a data set after we have performed some experiment where we studied the effects of the X value on the Y value.

We want to describe this influence mathematically, so that later we can use this formula and know that if we change the value of X by so much, we will get the value of Y such and such ...

Let's take a super-simple example (see picture).

No brainer that the points are located one after another as if in a straight line, and therefore we safely assume that our dependence is described by a linear function y=kx+b. At the same time, we are sure that when X is equal to zero, the value of Y is also equal to zero. This means that the function describing the dependence will be even simpler: y=kx (remember the school curriculum).

In general, we have to find the coefficient k. This is what we will do with MNC using the "search for a solution" add-on.

The method is to (here - attention: you need to think about it) the sum of the squared differences between the experimentally obtained and the corresponding calculated values was minimal. That is, when X1=1 the actual measured value Y1=4.6, and the calculated y1=f (x1) is 4, the square of the difference will be (y1-Y1)^2=(4-4.6)^2=0.36 . The following is the same with the following: when X2=2, the actual measured value Y2=8.1, and the calculated y2 is 8, the square of the difference will be (y2-Y2)^2=(8-8.1)^2=0.01. And the sum of all these squares should be as small as possible.

So, let's start training on the use of LSM and Excel add-ins "search for solution" .

Application of add-in find solution

1. If you didn’t enable the “search for a solution” add-on, then return to step How to enable the add-on "search for a solution" and enable 🙂

2. In cell A1, enter the value "1". This unit will be the first approximation to the real value of the coefficient (k) of our functional dependence y=kx.

3. In column B we have the values of the parameter X, in column C - the values of the parameter Y. In the cells of column D we enter the formula: “coefficient k multiplied by the value X”. For example, in cell D1, enter "=A1*B1", in cell D2, enter "=A1*B2", and so on.

4. We believe that the coefficient k is equal to one and the function f (x) \u003d y \u003d 1 * x is the first approximation to our solution. We can calculate the sum of squared differences between the measured values of Y and those calculated using the formula y=1*x. We can do all this manually by driving the appropriate cell references into the formula: "=(D2-C2)^2+(D3-C3)^2+(D4-C4)^2... etc. In the end we are mistaken and understand that we have lost a lot of time.In Excel, for calculating the sum of squared differences, there is a special formula, "SUMQDIFF", which will do everything for us.Let's enter it in cell A2 and set the initial data: the range of measured values Y (column C) and range of calculated Y values (column D).

4. The sum of the differences of the squares was calculated - now go to the "Data" tab and select "Search for a solution".

5. In the menu that appears, select cell A1 as the cell to be changed (the one with the coefficient k).

6. As the target, select cell A2 and set the condition "set equal to the minimum value." Remember that this is the cell where we calculate the sum of the squared differences between the calculated and measured values, and this amount should be minimal. We press "execute".

7. Coefficient k is selected. Now it can be seen that the calculated values are now very close to the measured ones.

P.S.

In general, of course, for the approximation of experimental data in Excel, there are special tools that allow you to describe the data using a linear, exponential, power and polynomial function, so you can often do without add-ons "Search for a solution". I talked about all these methods of approximation in my article, so if you are interested, take a look. But when it comes to some exotic function with one unknown coefficient or optimization problems, then here superstructure as well as possible.

Add-in "search for a solution" can be used for other tasks, the main thing is to understand the essence: there is a cell where we select a value, and there is a target cell in which a condition is set for selecting an unknown parameter.
That's all! In the next article I will tell a fairy tale about a vacation, so in order not to miss the release of the article,