Variation series and their characteristics

Grouping is the division of a population into groups that are homogeneous with respect to some characteristic.

Purpose of the service. Using the online calculator, you can:

  • build a variation series, build a histogram and a polygon;
  • find indicators of variation (mean, mode (including graphically), median, range of variation, quartiles, deciles, quartile coefficient of differentiation, coefficient of variation, and other indicators).

Instructions. To group a series, select the type of the resulting variation series (discrete or interval) and specify the amount of data (number of lines). The resulting solution is saved in a Word file (see an example of grouping statistics).

If the grouping has already been carried out and a discrete or interval variation series is given, use the online calculator Variation indicators instead. Hypotheses about the type of distribution are tested with the service Study of the form of distribution.

Types of statistical groupings

Variation series. When a discrete random variable is observed, the same value may occur several times. Each such value x_i is recorded together with n_i, the number of times it occurs in n observations; n_i is called the frequency of that value.
In the case of a continuous random variable, grouping is used in practice.
  1. A typological grouping is the division of a qualitatively heterogeneous population into classes, socio-economic types, or homogeneous groups of units. To construct this grouping, use the Discrete variation series parameter.
  2. A structural grouping is one in which a homogeneous population is divided into groups that characterize its structure according to some varying characteristic. To construct this grouping, use the Interval series parameter.
  3. An analytical grouping is one that identifies the relationships between the phenomena under study and their characteristics (see analytical grouping of a series).

Example # 1. According to table 2, construct distribution series for 40 commercial banks of the Russian Federation. Using the obtained distribution series, determine: the average profit per one commercial bank, loan investments on average per one commercial bank, the modal and median value of profit; quartiles, deciles, range of variation, mean linear deviation, standard deviation, coefficient of variation.

Solution:
In the section "View of the statistical series", choose Discrete series. Click Insert from Excel. Number of groups: determined by the Sturges formula.

Principles of building statistical groupings

A series of observations arranged in ascending order is called a variation series. The grouping attribute is the characteristic by which the population is divided into separate groups; it is also called the basis of the grouping. A grouping can be based on either quantitative or qualitative characteristics.
Once the basis of the grouping has been chosen, the next question is how many groups the population should be divided into.

When using personal computers to process statistical data, the grouping of object units is performed using standard procedures.
One of these procedures uses the Sturges formula to determine the optimal number of groups:

k = 1 + 3.322 * log10(N)

where k is the number of groups and N is the number of units in the population.
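As a minimal sketch, the formula can be evaluated as follows. The function name `sturges_groups` and the rounding to the nearest integer are assumptions made here for illustration; the logarithm is taken to base 10, which is what the coefficient 3.322 (approximately 1 / log10(2)) implies.

```python
import math

def sturges_groups(n: int) -> int:
    """Number of groups by the Sturges formula k = 1 + 3.322 * log10(N)."""
    if n < 1:
        raise ValueError("the population must contain at least one unit")
    # Rounding to the nearest integer is an assumption; some texts round up.
    return round(1 + 3.322 * math.log10(n))

k_40 = sturges_groups(40)    # e.g. a population of 40 units
k_100 = sturges_groups(100)  # e.g. a sample of 100 elements
print(k_40, k_100)
```

In practice the result is only a starting point: the number of groups may still be adjusted by hand, as the service allows.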

The length of the partial intervals is calculated as h = (x_max - x_min) / k

Then count the number of observations that fall into each interval; these counts are taken as the frequencies n_i. Intervals with small frequencies (n_i < 5) should be merged, and the corresponding interval boundaries merged with them.
The midpoints of the intervals, x_i = (c_{i-1} + c_i) / 2, are taken as the new values of the variant.
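The steps above (interval length, counting hits, merging small frequencies, taking midpoints) can be sketched as follows. The function names, the sample data, the half-open interval convention, and the rule of merging a small interval with its left neighbour (or with the right one for the first interval) are all assumptions chosen for illustration, not the service's actual procedure.

```python
def build_interval_series(data, k):
    """Split data into k equal intervals; return (boundaries, frequencies)."""
    x_min, x_max = min(data), max(data)
    if x_min == x_max:
        raise ValueError("all observations are equal; no intervals to build")
    h = (x_max - x_min) / k                       # interval length
    bounds = [x_min + i * h for i in range(k)] + [x_max]
    freqs = [0] * k
    for x in data:
        # Half-open intervals [c_{i-1}, c_i); x_max goes into the last one.
        freqs[min(int((x - x_min) / h), k - 1)] += 1
    return bounds, freqs

def merge_small(bounds, freqs, min_count=5):
    """Merge intervals whose frequency is below min_count with a neighbour."""
    bounds, freqs = list(bounds), list(freqs)
    i = 0
    while i < len(freqs) and len(freqs) > 1:
        if freqs[i] < min_count:
            j = i - 1 if i > 0 else i + 1         # neighbour to merge with
            lo = min(i, j)
            freqs[lo:lo + 2] = [freqs[lo] + freqs[lo + 1]]
            del bounds[lo + 1]                    # drop the shared boundary
            i = lo                                # re-check the merged interval
        else:
            i += 1
    return bounds, freqs

def midpoints(bounds):
    """Midpoints x_i = (c_{i-1} + c_i) / 2 of consecutive boundaries."""
    return [(a + b) / 2 for a, b in zip(bounds, bounds[1:])]

# Illustrative data only.
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 9]
bounds, freqs = build_interval_series(data, 4)
merged_bounds, merged_freqs = merge_small(bounds, freqs)
mids = midpoints(merged_bounds)
print(freqs, merged_freqs, mids)
```

Merging changes the interval widths, so any later formula that assumes equal intervals should be applied with care.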

Example No. 3. A 5% simple random sample produced the following distribution of products by moisture content. Calculate: 1) the average percentage of moisture; 2) the indicators characterizing the variation in moisture content.
The solution was obtained using a calculator: Example # 1

Construct a variation series. Based on the resulting series, construct a distribution polygon, a histogram, and a cumulative curve. Determine the mode and the median.

Example. Based on the results of a sample survey (sample A, appendix):
a) construct a variation series;
b) calculate the relative frequencies and the accumulated relative frequencies;
c) build a polygon;
d) compose an empirical distribution function;
e) plot the empirical distribution function;
f) calculate the numerical characteristics: arithmetic mean, variance, standard deviation.
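A minimal sketch of steps a) to f) for a small sample is given below. The data are invented for illustration, and the empirical distribution function uses the convention F*(x) = (number of observations <= x) / n; some textbooks use the strict inequality instead, so this choice is an assumption.

```python
from bisect import bisect_right
from collections import Counter

sample = [2, 4, 4, 5, 5, 5, 7, 7, 8, 10]   # illustrative data only
n = len(sample)

# a) variation series: distinct values in ascending order with frequencies
counts = Counter(sample)
values = sorted(counts)
freqs = [counts[v] for v in values]

# b) relative and accumulated relative frequencies
rel_freqs = [f / n for f in freqs]
cum_rel_freqs, running = [], 0
for f in freqs:
    running += f
    cum_rel_freqs.append(running / n)

# d) empirical distribution function (share of observations <= x)
ordered = sorted(sample)
def ecdf(x):
    return bisect_right(ordered, x) / n

# f) numerical characteristics computed from the variation series
mean = sum(v * f for v, f in zip(values, freqs)) / n
variance = sum(f * (v - mean) ** 2 for v, f in zip(values, freqs)) / n
std_dev = variance ** 0.5

for v, f, w, c in zip(values, freqs, rel_freqs, cum_rel_freqs):
    print(v, f, w, c)
```

The polygon in step c) joins the points (value, frequency), and the plot in step e) is the step function returned by `ecdf`.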

Based on the data given in Table 4 (Appendix 1) that correspond to your variant, do the following:

  1. On the basis of a structural grouping, construct the variation series of frequencies and the cumulative distribution series using equal closed intervals, taking the number of groups equal to 6. Present the results as a table and graphically.
  2. Analyze the variation series of the distribution by calculating:
    • the arithmetic mean of the characteristic;
    • mode, median, 1st quartile, 1st and 9th deciles;
    • standard deviation;
    • the coefficient of variation.
  3. Draw conclusions.
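For an interval series, the mode and the quantiles (median, quartiles, deciles) are usually estimated by interpolation inside the relevant interval. The sketch below uses the standard grouped-data formulas; the function names and the six-interval data set are assumptions for illustration, and the mode formula presumes equal intervals.

```python
def grouped_quantile(bounds, freqs, p):
    """Quantile of order p: lower + h * (p*N - S_prev) / f of the quantile interval."""
    target = p * sum(freqs)
    cum = 0
    for i, f in enumerate(freqs):
        if f > 0 and cum + f >= target:
            lower, width = bounds[i], bounds[i + 1] - bounds[i]
            return lower + width * (target - cum) / f
        cum += f
    return bounds[-1]

def grouped_mode(bounds, freqs):
    """Mode: lower + h * (f_m - f_prev) / ((f_m - f_prev) + (f_m - f_next))."""
    m = max(range(len(freqs)), key=freqs.__getitem__)   # first modal interval
    f_prev = freqs[m - 1] if m > 0 else 0
    f_next = freqs[m + 1] if m + 1 < len(freqs) else 0
    lower, width = bounds[m], bounds[m + 1] - bounds[m]
    return lower + width * (freqs[m] - f_prev) / ((freqs[m] - f_prev) + (freqs[m] - f_next))

# Illustrative interval series with 6 equal groups.
bounds = [10, 20, 30, 40, 50, 60, 70]
freqs = [4, 8, 14, 12, 8, 4]

mode = grouped_mode(bounds, freqs)
median = grouped_quantile(bounds, freqs, 0.5)
q1 = grouped_quantile(bounds, freqs, 0.25)
d1 = grouped_quantile(bounds, freqs, 0.1)
d9 = grouped_quantile(bounds, freqs, 0.9)
print(mode, median, q1, d1, d9)
```

All of these are estimates: they assume that observations are spread evenly inside each interval.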

Required: rank the series, build an interval distribution series, and calculate the mean, the variability of the mean, the mode, and the median for both the ranked and the interval series.

1) Based on the initial data, construct a discrete variation series; present it in the form of a statistical table and statistical graphs. 2) Based on the initial data, construct an interval variation series with equal intervals. Choose the number of intervals yourself and explain this choice. Present the resulting variation series in the form of a statistical table and statistical graphs. Indicate the types of tables and graphs used.

To determine the average duration of customer service in a pension fund with a very large number of clients, 100 clients were surveyed using random sampling without replacement. The survey results are presented in the table. Find:
a) the boundaries within which the average service time of all clients of the pension fund lies with a probability of 0.9946;
b) the probability that the share of all clients of the fund with a service duration of less than 6 minutes differs from the share of such clients in the sample by no more than 10% (in absolute value);
c) the size of a sample with replacement for which it can be stated with a probability of 0.9907 that the share of all clients of the fund with a service duration of less than 6 minutes differs from the share of such clients in the sample by no more than 10% (in absolute value).
2. Using the data of problem 1 and Pearson's χ² test at the significance level α = 0.05, test the hypothesis that the random variable X (customer service time) follows the normal law. Construct a histogram of the empirical distribution and the corresponding normal curve in one drawing.
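A hedged sketch of the χ² check in part 2 follows. The interval series is invented for illustration (it is not the survey data), the normal parameters are estimated from interval midpoints, the outer intervals are extended to infinity, and the short table of critical values for α = 0.05 is included only for the degrees of freedom that can occur here.

```python
from statistics import NormalDist

# Critical values of chi-square at significance level 0.05.
CRITICAL_005 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}

def pearson_chi2_normal(bounds, freqs):
    """Chi-square statistic of an interval series against a fitted normal law."""
    n = sum(freqs)
    mids = [(a + b) / 2 for a, b in zip(bounds, bounds[1:])]
    mean = sum(m * f for m, f in zip(mids, freqs)) / n
    var = sum(f * (m - mean) ** 2 for m, f in zip(mids, freqs)) / n
    dist = NormalDist(mean, var ** 0.5)
    chi2 = 0.0
    last = len(freqs) - 1
    for i, f in enumerate(freqs):
        # Outer intervals are opened to infinity so probabilities sum to 1.
        lo = 0.0 if i == 0 else dist.cdf(bounds[i])
        hi = 1.0 if i == last else dist.cdf(bounds[i + 1])
        expected = n * (hi - lo)
        chi2 += (f - expected) ** 2 / expected
    df = len(freqs) - 3   # k - 1, minus 2 estimated parameters
    return chi2, df

# Illustrative interval series, not the data of the problem.
bounds = [10, 20, 30, 40, 50, 60, 70]
freqs = [4, 8, 14, 12, 8, 4]
chi2, df = pearson_chi2_normal(bounds, freqs)
normal_not_rejected = chi2 < CRITICAL_005[df]
print(round(chi2, 3), df, normal_not_rejected)
```

Intervals with expected frequencies below 5 are normally merged before the test; that step is omitted here to keep the sketch short.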

A sample of 100 elements is given. Required:

  1. Build a ranked variation series;
  2. Find the maximum and minimum terms of the series;
  3. Find the range of variation and the optimal number of intervals for constructing an interval series. Find the length of an interval of the interval series;
  4. Construct an interval series. Find the sample frequencies in the resulting intervals. Find the midpoint of each interval;
  5. Construct a histogram and a frequency polygon. Compare them with the normal distribution (analytically and graphically);
  6. Plot the empirical distribution function;
  7. Calculate the sample numerical characteristics: sample mean and central sample moment;
  8. Calculate the approximate values of the standard deviation, skewness, and kurtosis (using the MS Excel analysis package). Compare the approximate values with the exact ones (calculated using MS Excel formulas);
  9. Compare the sample graphical characteristics with the corresponding theoretical ones.
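Skewness and kurtosis in item 8 can be obtained from the central sample moments of item 7. The sketch below uses the plain moment-based definitions on invented data; Excel's SKEW and KURT functions apply small-sample adjustments, so their values differ slightly from these.

```python
def central_moment(sample, order):
    """Central sample moment of the given order."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** order for x in sample) / n

def skewness(sample):
    """Moment-based skewness: m3 / m2^(3/2)."""
    return central_moment(sample, 3) / central_moment(sample, 2) ** 1.5

def excess_kurtosis(sample):
    """Moment-based excess kurtosis: m4 / m2^2 - 3."""
    return central_moment(sample, 4) / central_moment(sample, 2) ** 2 - 3

# Illustrative samples only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 10]

print(skewness(symmetric), skewness(right_skewed), excess_kurtosis(right_skewed))
```

For a normal distribution both quantities are zero, which is what makes them useful in the comparison asked for in item 5.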

There is the following sample data (10% sample, mechanical) on the output and the amount of profit, million rubles. According to the initial data:
Task 13.1.
13.1.1. Construct a statistical series of distribution of enterprises by the amount of profit, forming five groups at equal intervals. Plot the distribution series.
13.1.2. Calculate the numerical characteristics of a series of distribution of enterprises by the amount of profit: the arithmetic mean, standard deviation, variance, coefficient of variation V. Draw conclusions.
Task 13.2.
13.2.1. Determine the boundaries within which the amount of profit of one enterprise in the general population lies with a probability of 0.997.
13.2.2. Using Pearson's χ² test at the significance level α, test the hypothesis that the random variable X (the amount of profit) follows the normal law.
Task 13.3.
13.3.1. Determine the coefficients of the sample regression equation.
13.3.2. Establish the presence and nature of the correlation between the cost of goods produced (X) and the amount of profit per enterprise (Y). Plot a scatterplot and a regression line.
13.3.3. Calculate the linear correlation coefficient. Using Student's t-test, check the significance of the correlation coefficient. Draw a conclusion about the tightness of the relationship between factors X and Y, using the Chaddock scale.
Guidelines. Task 13.3 is performed using this service.
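The calculations in Task 13.3 can be sketched as follows: least-squares coefficients, the linear correlation coefficient, the Student t statistic for its significance, and a verbal grade on the Chaddock scale. The data and function names are assumptions for illustration, and the grade boundaries follow the commonly quoted version of the scale.

```python
import math

def linear_fit(xs, ys):
    """Return (a, b, r) for y = a + b*x and the linear correlation coefficient r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b, sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """Student t statistic for H0: no correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def chaddock(r):
    """Verbal grade of |r| on the Chaddock scale (commonly quoted boundaries)."""
    grades = [(0.9, "very high"), (0.7, "high"), (0.5, "noticeable"),
              (0.3, "moderate"), (0.1, "weak")]
    for threshold, label in grades:
        if abs(r) >= threshold:
            return label
    return "practically absent"

# Illustrative data: X is the cost of goods produced, Y is the profit.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, r = linear_fit(xs, ys)
t = t_statistic(r, len(xs))
print(a, b, round(r, 4), round(t, 4), chaddock(r))
```

The computed t is then compared with the tabulated critical value for n - 2 degrees of freedom at the chosen significance level.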

Task. The following figures represent the time customers spent concluding contracts. Construct an interval variation series and a histogram for these data, and find an unbiased estimate of the mathematical expectation and the biased and unbiased estimates of the variance.
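The three estimates differ only in the divisor, as the short sketch below shows. The sample values are invented for illustration.

```python
def point_estimates(sample):
    """Return (mean, biased variance, unbiased variance) of a sample."""
    n = len(sample)
    mean = sum(sample) / n                       # unbiased estimate of the expectation
    squares = sum((x - mean) ** 2 for x in sample)
    return mean, squares / n, squares / (n - 1)  # divisor n vs. n - 1

times = [3, 5, 7, 9]   # illustrative data only
mean, var_biased, var_unbiased = point_estimates(times)
print(mean, var_biased, var_unbiased)
```

The unbiased variance is the biased one multiplied by n / (n - 1), so the two converge as the sample grows.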

Example. According to Table 2:
1) Construct the distribution series for 40 commercial banks in the Russian Federation:
A) by the amount of profit;
B) by the amount of credit investments.
2) According to the obtained distribution series, determine:
A) profit on average for one commercial bank;
B) credit investments on average for one commercial bank;
C) modal and median profit values; quartiles, deciles;
D) modal and median value of credit investments.
3) For the distribution series obtained in item 1, calculate:
a) the range of variation;
b) the mean linear deviation;
c) the standard deviation;
d) the coefficient of variation.
Present the necessary calculations in tabular form. Analyze the results. Draw conclusions.
Plot the resulting distribution series. Determine the mode and the median graphically.
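The indicators in item 3 can be computed from an interval series by using interval midpoints weighted by frequencies. The sketch below is illustrative: the series is invented, and taking the range as the distance between the outermost boundaries is an assumption (with raw data it would be x_max - x_min).

```python
def variation_indicators(bounds, freqs):
    """Range, mean linear deviation, standard deviation and CV of an interval series."""
    n = sum(freqs)
    mids = [(a + b) / 2 for a, b in zip(bounds, bounds[1:])]
    mean = sum(m * f for m, f in zip(mids, freqs)) / n
    mean_linear_dev = sum(f * abs(m - mean) for m, f in zip(mids, freqs)) / n
    variance = sum(f * (m - mean) ** 2 for m, f in zip(mids, freqs)) / n
    std_dev = variance ** 0.5
    return {
        "range": bounds[-1] - bounds[0],
        "mean": mean,
        "mean_linear_deviation": mean_linear_dev,
        "variance": variance,
        "std_dev": std_dev,
        "cv_percent": 100 * std_dev / mean,
    }

# Illustrative interval series only.
bounds = [10, 20, 30, 40, 50, 60]
freqs = [2, 5, 8, 4, 1]
indicators = variation_indicators(bounds, freqs)
for name, value in indicators.items():
    print(name, round(value, 3))
```

A coefficient of variation above roughly a third is often read as a sign that the population is not homogeneous, though that threshold is a convention rather than a rule.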

Solution:
To build a grouping at equal intervals, we will use the Grouping statistical data service.

Figure 1 - Entering parameters

Description of parameters
Number of lines: the amount of raw data. If the series is short, enter the number of observations. If the sample is large, click the Insert from Excel button.
Number of groups: 0 means the number of groups will be determined by the Sturges formula.
If a specific number of groups is required, specify it (for example, 5).
Row view: Discrete series.
Significance level: for example, 0.954. Strictly speaking, this value is a confidence probability; it is used to construct the confidence interval for the mean.
Sample: the sampling percentage. For example, if a 10% mechanical sample was taken, enter 10. For our data, enter 100.
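How the last two parameters might enter the calculation can be sketched as follows. This is an assumption about the usual textbook formula for sampling without replacement, not a description of the service's internals: the margin is t * sqrt(s^2 / n * (1 - n/N)), where n/N is the sampling share and t is the normal quantile for the chosen confidence probability.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(mean, std_dev, n, confidence=0.954, sample_percent=100.0):
    """Confidence interval for the mean with a finite-population correction."""
    t = NormalDist().inv_cdf((1 + confidence) / 2)   # about 2 for 0.954
    not_sampled = 1 - sample_percent / 100           # the factor (1 - n/N)
    margin = t * math.sqrt(std_dev ** 2 / n * not_sampled)
    return mean - margin, mean + margin

# Illustrative values: sample mean 50, standard deviation 10, n = 100.
low, high = mean_confidence_interval(50.0, 10.0, 100, 0.954, 10)
census = mean_confidence_interval(50.0, 10.0, 100, 0.954, 100)
print(low, high, census)
```

With a sampling percentage of 100 the whole population has been observed, so the correction factor is zero and the interval collapses to the mean itself.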

The grouping method also makes it possible to measure the variation (variability) of characteristics. When the number of population units is relatively small, variation is measured on the basis of the ranked series of the units that make up the population. A series is called ranked if its units are arranged in ascending (or descending) order of the characteristic.

However, ranked series are not very informative when a comparative description of variation is required. In addition, one often has to deal with statistical populations consisting of a large number of units, which are difficult to present as an explicit series. For this reason, for a first general acquaintance with statistical data, and especially to make the variation of characteristics easier to study, the phenomena and processes under study are usually combined into groups, and the results of the grouping are presented in the form of group tables.

If a group table has only two columns, the groups by the selected characteristic (variants) and the size of each group (frequency or relative frequency), it is called a distribution series.

A distribution series is the simplest kind of structural grouping by one characteristic, displayed in a group table with two columns that contain the variants and the frequencies of the characteristic. In many cases the study of the initial statistical material begins with such a structural grouping, that is, with the compilation of distribution series.

A structural grouping in the form of a distribution series can be turned into a true structural grouping if the selected groups are characterized not only by frequencies, but also by other statistical indicators. The main purpose of the distribution series is to study the variation of features. The theory of distribution series is developed in detail by mathematical statistics.

Distribution series are divided into attributive (grouping by qualitative characteristics, for example, dividing the population by sex, nationality, marital status, etc.) and variational (grouping by quantitative characteristics).

A variation series is a group table that contains two columns: the grouping of units by one quantitative characteristic and the number of units in each group. The intervals in a variation series are usually equal and closed. An example of a variation series is the following grouping of the Russian population by average per capita money income (Table 3.10).

Table 3.10. Distribution of the population of Russia by average per capita income in 2004-2009

Columns: population groups by average per capita money income, rubles per month; population in the group, in % of the total.

Population groups:
  • 8,000.1-10,000.0
  • 10,000.1-15,000.0
  • 15,000.1-25,000.0
  • more than 25,000.0
  • all population

Variation series, in turn, are subdivided into discrete and interval series. Discrete variation series combine the variants of discrete characteristics that vary within narrow limits. An example of a discrete variation series is the distribution of Russian families by the number of children they have.

Interval variation series combine the variants of either continuous characteristics or discrete characteristics that vary over a wide range. An example of an interval variation series is the distribution of the population of Russia by average per capita money income.

Discrete variation series are not used very often in practice. Compiling them is not difficult, however, since the composition of the groups is determined by the specific variants that the grouping characteristic actually takes.

Interval variation series are more widespread. When compiling them, a difficult question arises about the number of groups, as well as the size of the intervals that must be established.

The principles for solving this issue are outlined in the chapter on the methodology for constructing statistical groupings (see paragraph 3.3).

Variation series are a means of condensing diverse information into a compact form. They make it possible to form a fairly clear judgment about the nature of the variation and to study the differences in the characteristics of the phenomena included in the population. But the most important value of variation series is that special generalizing characteristics of variation are calculated from them (see Chapter 7).

When processing large amounts of information, which is especially important in modern scientific work, the researcher faces the serious task of grouping the initial data correctly. If the data are discrete, then, as we have seen, no problems arise: one simply counts the frequency of each value. If the characteristic under study is continuous (which is more common in practice), then choosing the optimal number of intervals for grouping it is by no means a trivial task.

To group continuous random variables, the entire range of variation of the characteristic is divided into a certain number of intervals k.

A grouped interval (continuous) variation series is the set of intervals of the characteristic, ranked by its value, listed together with the corresponding frequencies m_i (the number of observations that fall into the i-th interval) or relative frequencies.

Table columns: intervals of the characteristic's values; frequency m_i.

The histogram and the cumulative curve (ogive), already discussed in detail, are excellent data visualization tools that give a first idea of the structure of the data. Such graphs (Fig. 1.15) are constructed for continuous data in the same way as for discrete data, bearing in mind that continuous data completely fill the range of their possible values and can take any value within it.

Fig. 1.15.

Therefore the bars of the histogram and of the cumulative curve must adjoin one another, leaving no gaps inside the range of possible values (that is, the histogram and the cumulative curve must not have "holes" along the abscissa that exclude values of the variable under study, as in Fig. 1.16). The height of a bar corresponds to the frequency (the number of observations in the given interval) or to the relative frequency (the proportion of observations). Intervals must not overlap and are usually of equal width.

Fig. 1.16.

The histogram and the polygon are approximations of the probability density curve (differential function) f(x) of the theoretical distribution, which is studied in probability theory. This is why constructing them is so important in the primary statistical processing of quantitative continuous data: their shape suggests the hypothetical distribution law.

The cumulative curve is the curve of accumulated frequencies (or relative frequencies) of the interval variation series. It is compared with the graph of the cumulative distribution function F(x), which is also studied in probability theory.

The concepts of the histogram and the cumulative curve are associated mainly with continuous data and their interval variation series, since their graphs are empirical estimates of the probability density function and of the distribution function, respectively.
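To make the link to f(x) and F(x) concrete, the sketch below rescales an interval series so that the histogram has unit area and computes the cumulative relative frequencies at the upper interval boundaries. The function name and data are assumptions for illustration.

```python
def density_and_cumulative(bounds, freqs):
    """Return density-scale bar heights and cumulative relative frequencies."""
    n = sum(freqs)
    # Height f_i / (n * h_i): the bar areas then sum to 1, like a density.
    heights = [f / (n * (b - a)) for f, a, b in zip(freqs, bounds, bounds[1:])]
    cumulative, running = [], 0
    for f in freqs:
        running += f
        cumulative.append(running / n)   # estimate of F at the upper boundary
    return heights, cumulative

# Illustrative interval series only.
bounds = [1, 3, 5, 7, 9]
freqs = [3, 7, 7, 3]
heights, cumulative = density_and_cumulative(bounds, freqs)
area = sum(h * (b - a) for h, a, b in zip(heights, bounds, bounds[1:]))
print(heights, cumulative, area)
```

On this scale the histogram can be drawn on the same axes as a theoretical density curve and compared directly.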

The construction of an interval variation series begins with determining the number of intervals k. And this task, perhaps, is the most difficult, important and controversial in the issue under study.

The number of intervals should not be too small, because the histogram then becomes oversmoothed and loses all the features of the variability of the initial data. Fig. 1.17 shows the same data as in Fig. 1.15 used to build a histogram with a smaller number of intervals (left graph).

At the same time, the number of intervals should not be too large, otherwise the distribution density of the data along the number axis cannot be estimated: the histogram turns out undersmoothed, uneven, and with empty intervals (see Fig. 1.17, right graph).

Fig. 1.17.
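The effect can be reproduced numerically. The sketch below groups one simulated sample into too few, a moderate number, and too many intervals and counts the empty ones; the simulated data, the seed, and the three values of k are assumptions chosen for illustration.

```python
import random

def interval_frequencies(data, k):
    """Frequencies of data in k equal intervals between min and max."""
    x_min, x_max = min(data), max(data)
    h = (x_max - x_min) / k
    freqs = [0] * k
    for x in data:
        freqs[min(int((x - x_min) / h), k - 1)] += 1
    return freqs

rng = random.Random(0)                           # fixed seed for repeatability
data = [rng.gauss(50, 10) for _ in range(50)]    # simulated sample of 50 values

empty_intervals = {}
for k in (3, 7, 60):                             # too few, moderate, too many
    freqs = interval_frequencies(data, k)
    empty_intervals[k] = freqs.count(0)
    print(k, freqs)
```

With 60 intervals and only 50 observations at least 10 intervals must be empty, which is exactly the ragged, "holey" picture described above.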

How do you determine the most preferred number of intervals?

Back in 1926, Herbert Sturges proposed a formula for calculating the number of intervals into which the original set of values of the characteristic under study should be split. The formula has become extremely popular: most statistics textbooks present it, and many statistical packages use it by default. Whether this is justified in all cases is a serious question.

So what is the Sturges formula based on?

Consider the binomial distribution.
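The text breaks off at this point. As the argument is commonly presented (an assumption about how it continues here), an ideal histogram with k intervals is taken to have frequencies equal to the binomial coefficients C(k-1, i), so that the total number of observations is N = 2^(k-1) and hence k = 1 + log2(N) = 1 + 3.322 * log10(N). The sketch below only checks this arithmetic.

```python
import math

def binomial_total(k):
    """Sum of the binomial coefficients C(k-1, i) for i = 0 .. k-1."""
    return sum(math.comb(k - 1, i) for i in range(k))

# N = 2^(k-1) for the idealized histogram, so k = 1 + log2(N).
totals = {k: binomial_total(k) for k in range(1, 9)}
coefficient = 1 / math.log10(2)   # converts log2 to log10; about 3.322

for k, n in totals.items():
    print(k, n, 1 + coefficient * math.log10(n))
```

The derivation rests on the data being roughly normal (the binomial approaches the normal law), which is the usual ground for questioning the formula on skewed or very large data sets.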
