Big data and big data analysis. Simplicity is the key to success

According to Research & Trends

"Big Data" has become the talk of the town in the IT and marketing press for several years now. And it is clear why: digital technologies have permeated the life of the modern person, and "everything gets recorded". The volume of data on various aspects of life keeps growing, and at the same time so do the possibilities for storing information.

Global technologies for information storage

Source: Hilbert and Lopez, "The world's technological capacity to store, communicate, and compute information," Science, 2011.

Most experts agree that the accelerating growth of data is an objective reality. Social networks, mobile devices, measuring devices, business information: these are just a few of the sources that can generate huge amounts of data. According to the IDC Digital Universe study published in 2012, over the next 8 years the amount of data in the world will reach 40 ZB (zettabytes), which is equivalent to 5,200 GB per inhabitant of the planet.

Growth of collected digital information in the USA


Source: IDC

A significant part of the information is not created by people, but by robots interacting both with each other and with other data networks, such as, for example, sensors and smart devices. At this rate of growth, the amount of data in the world, according to researchers, will double every year. The number of virtual and physical servers in the world will grow tenfold due to the expansion and creation of new data centers. In this regard, there is a growing need for the effective use and monetization of this data. Since the use of Big Data in business requires considerable investment, it is necessary to clearly understand the situation. And it is, in essence, simple: you can increase business efficiency by reducing costs and/or increasing sales.

What is Big Data for?

The Big Data paradigm defines three main types of tasks.

  • Storing and managing hundreds of terabytes or petabytes of data that conventional relational databases cannot handle efficiently.
  • Organizing unstructured information consisting of text, images, video and other types of data.
  • Analyzing Big Data, which raises the questions of how to work with unstructured information, generate analytical reports and implement predictive models.

The Big Data project market intersects with the business analytics (BA) market, whose worldwide volume, according to experts, amounted to about 100 billion dollars in 2012. It includes network technologies, servers, software and technical services.

The use of Big Data technologies is also relevant for revenue assurance (RA) class solutions designed to automate companies' activities. Modern revenue assurance systems include tools for detecting inconsistencies and for in-depth data analysis that allow the timely detection of possible losses or distortions of information that could lead to lower financial results. Against this background, Russian companies, confirming the demand for Big Data technologies in the domestic market, note that the factors stimulating the development of Big Data in Russia are the growth of data, the acceleration of managerial decision-making and the improvement of decision quality.

What prevents working with Big Data

Today, only 0.5% of the accumulated digital data is analyzed, even though there are, objectively, industry-wide tasks that could be solved with analytical solutions of the Big Data class. Developed IT markets already have results that can be used to evaluate the expectations associated with the accumulation and processing of big data.

One of the main factors slowing down the implementation of Big Data projects, apart from the high cost, is the problem of choosing the data to be processed: that is, determining which data should be extracted, stored and analyzed, and which should be disregarded.

Many business representatives note that the difficulties in implementing Big Data projects are associated with a shortage of specialists: marketers and analysts. The rate of return on investment in Big Data directly depends on the quality of work of the employees engaged in deep and predictive analytics. The huge potential of the data that already exists in an organization often cannot be used effectively by marketers themselves because of outdated business processes or internal regulations. Big Data projects are therefore often perceived by businesses as difficult not only to implement but also to evaluate, since the value of the collected data is hard to assess. The specifics of working with data require marketers and analysts to shift their attention from technology and reporting to solving specific business problems.

Due to the large volume and high speed of the data flow, the collection process involves real-time ETL procedures. For reference: ETL (from the English Extract, Transform, Load, literally "extraction, transformation, loading") is one of the main processes in data warehouse management. It includes extracting data from external sources and transforming and cleaning them to meet the warehouse's needs. ETL should be viewed not only as a process of transferring data from one application to another, but also as a tool for preparing data for analysis.
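
For illustration, here is a minimal sketch of that extract-transform-load pattern. The file name, field names and the SQLite target are assumptions made purely for the example; a real pipeline would read from production systems and load into a proper data warehouse.

```python
# A minimal sketch of the extract-transform-load pattern described above.
# The source file, field names and SQLite target are hypothetical.
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from an external source (here, a CSV export)."""
    with open(path, newline="", encoding="utf-8") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Transform: clean and normalise records so they meet the warehouse's needs."""
    for row in rows:
        amount = row.get("amount", "").strip()
        if not amount:                      # drop incomplete records
            continue
        yield {
            "customer_id": row["customer_id"].strip(),
            "amount": round(float(amount), 2),
            "date": row["date"][:10],       # keep only YYYY-MM-DD
        }

def load(records, db_path="warehouse.db"):
    """Load: write the prepared records into the analytical store."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, amount REAL, date TEXT)")
    con.executemany("INSERT INTO sales VALUES (:customer_id, :amount, :date)", records)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("raw_sales.csv")))
```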

In addition, issues of ensuring the security of data coming from external sources must be addressed with solutions that match the volume of information collected. Since Big Data analysis methods are so far developing only in the wake of the growth in data volumes, an analytical platform's ability to use new methods of preparing and aggregating data plays an important role. This suggests that, for example, data about potential buyers, or a massive data warehouse with the click history of online store visitors, can be interesting for solving a variety of problems.

Difficulties do not deter business

Despite all the difficulties of implementing Big Data, business intends to increase its investments in this area. According to Gartner, in 2013, 64% of the world's largest companies had already invested, or had plans to invest, in deploying Big Data technologies for their business, compared with 58% of such companies in 2012. According to the Gartner study, the industries leading in Big Data investment are media companies, telecoms, the banking sector and service companies. Successful results of Big Data implementation have already been achieved by many major players in retail, using data obtained with RFID tools, from logistics and replenishment systems (from the English "replenishment": accumulation, restocking - R&T), and from loyalty programs. Successful retail experience encourages other market sectors to find new, effective ways to monetize big data in order to turn its analysis into a resource that works for business development. Thanks to this, experts expect that in the period up to 2020, investments in the management and storage of each gigabyte of data will fall from $2 to $0.2, while investments in the study and analysis of the technological properties of Big Data will grow by only 40%.

The costs presented in various Big Data investment projects are of different kinds. Cost items depend on the types of products selected for particular solutions. According to experts, the largest part of the costs in investment projects falls on products related to data collection, structuring, cleaning and information management.

How it's done

There are many combinations of software and hardware that allow you to create effective Big Data solutions for various business disciplines: from social media and mobile applications, to business data mining and visualization. An important advantage of Big Data is the compatibility of new tools with databases widely used in business, which is especially important when working with cross-disciplinary projects, such as organizing multi-channel sales and customer support.

The sequence of working with Big Data consists of collecting the data, structuring the information received using reports and dashboards, generating insights and context, and formulating recommendations for action. Since working with Big Data implies high data collection costs, and the result of processing is not known in advance, the main task is to understand clearly what the data is for, not how much of it is available. In this case, data collection turns into a process of obtaining information that is strictly necessary for solving specific problems.

For example, telecommunications providers aggregate a huge amount of data, including geolocation data, which is constantly updated. This information may be of commercial interest to advertising agencies, which can use it to serve targeted and localized advertising, as well as to retailers and banks. Such data can play an important role in deciding whether to open a retail outlet in a particular location, based on evidence of a strong, targeted flow of people. There is an example of measuring the effectiveness of advertising on outdoor billboards in London. At present, the reach of such advertising can only be measured by stationing people with special counting devices near the advertising structures to count passers-by. Compared with this kind of measurement, a mobile operator has far more possibilities: it knows exactly where its subscribers are, and it knows their demographic characteristics, gender, age, marital status, and so on.

Based on such data, in the future it becomes possible to change the content of the advertising message according to the preferences of the particular person passing the billboard. If the data shows that the passer-by travels a lot, they can be shown an advertisement for a resort. The organizers of a football match can only estimate the number of fans when they arrive at the match. But if they could ask the mobile operator where visitors were an hour, a day or a month before the match, this would give the organizers the opportunity to plan advertising placements for upcoming matches.

Another example is how banks can use Big Data to prevent fraud. If a client reports the loss of a card, yet when a purchase is made with that card the bank sees in real time that the client's phone is in the area where the transaction is taking place, the bank can check whether the client is trying to deceive it. In the opposite situation, when a client makes a purchase in a store and the bank sees that the card used in the transaction and the client's phone are in the same place, the bank can conclude that the card is being used by its owner. Thanks to such advantages of Big Data, the boundaries of traditional data warehouses are being expanded; a simplified sketch of such a location check is shown below.
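
A minimal sketch of that location cross-check, assuming the bank can obtain a point-of-sale coordinate and an approximate phone coordinate; the coordinates and the distance threshold are invented for illustration.

```python
# If the card transaction and the cardholder's phone are reported far apart,
# flag the transaction for review. Threshold and coordinates are illustrative.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def is_suspicious(card_location, phone_location, threshold_km=5.0):
    """Flag the transaction if the phone is far from the point of sale."""
    return haversine_km(*card_location, *phone_location) > threshold_km

# Example: purchase in central Moscow while the phone is in St. Petersburg.
print(is_suspicious((55.7558, 37.6173), (59.9343, 30.3351)))  # True -> review
```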

For a successful decision to implement Big Data solutions, a company needs to calculate an investment case, and this causes great difficulties due to many unknown components. The paradox of analytics in such cases is to predict the future based on the past, information about which is often missing. In this case, an important factor is the clear planning of your initial actions:

  • First, it is necessary to identify one specific business problem for which Big Data technologies will be used; this task will become the core for checking whether the chosen concept is correct. You need to focus on collecting data related to this particular task, and during the proof of concept you will be able to use various tools, processes and management methods that will allow you to make more informed decisions in the future.
  • Second, a company without data analytics skills and experience is unlikely to implement a Big Data project successfully. The necessary knowledge always comes from previous analytics experience, which is the main factor affecting the quality of work with data. The culture of using data also plays an important role, since the analysis of information often reveals the harsh truth about a business, and accepting that truth and working with it requires mature methods of working with data.
  • Third, the value of Big Data technologies lies in providing insights. Good analysts remain in short supply on the market: these are specialists with a deep understanding of the commercial meaning of data who know how to apply it correctly. Data analysis is a means to achieve business goals, and to understand the value of Big Data you need an appropriate behavior model and an understanding of your actions. In that case, big data will provide a lot of useful information about consumers, on the basis of which you can make useful business decisions.

Despite the fact that the Russian Big Data market is just beginning to take shape, some projects in this area are already being implemented quite successfully. Some of them are successful in the field of data collection, such as projects for the Federal Tax Service and Tinkoff Credit Systems, others in terms of data analysis and practical application of its results: this is the Synqera project.

Tinkoff Credit Systems Bank deployed the EMC2 Greenplum platform, a tool for massively parallel computing. In recent years, the bank has raised its requirements for the speed of processing accumulated information and for real-time data analysis, driven by the rapid growth in the number of credit card users. The bank has announced plans to expand its use of Big Data technologies, in particular for processing unstructured data and working with corporate information obtained from various sources.

The Federal Tax Service of Russia is currently creating the analytical layer of its federal data warehouse. On this basis, a single information space and technology for accessing tax data for statistical and analytical processing is being created. During the implementation of the project, work is under way to centralize analytical information from more than 1,200 local-level sources of the Federal Tax Service.

Another interesting example of real-time big data analysis is the Russian startup Synqera, which developed the Simplate platform. The solution is based on processing large data arrays: the program analyzes information about customers, their purchase history, age, gender and even mood. Touch screens with sensors that recognize customers' emotions were installed at the checkouts of a chain of cosmetics stores. The program determines a person's mood, analyzes information about them, determines the time of day and scans the store's discount database, after which it sends the buyer targeted messages about promotions and special offers. This solution improves customer loyalty and increases the retailer's sales.

As for successful foreign cases, the experience of Dunkin' Donuts, which uses real-time data to sell products, is interesting. Digital displays in its stores show offers that change every minute, depending on the time of day and product availability. From cash receipts, the company learns which offers got the greatest response from buyers. This approach to data processing allowed the company to increase profits and the turnover of goods in its warehouses.

As the experience of implementing Big Data projects shows, this area is designed to successfully solve modern business problems. At the same time, an important factor in achieving commercial goals when working with big data is the choice of the right strategy, which includes analytics that identifies consumer needs, as well as the use of innovative technologies in the field of Big Data.

According to a global survey conducted annually since 2012 by Econsultancy and Adobe among company marketers, "big data" characterizing people's actions on the Internet can do a lot. It can optimize offline business processes, help understand how mobile device owners use their devices to search for information, or simply "make marketing better", i.e. more efficient. Moreover, the last function has been growing in popularity from year to year, as the diagram below shows.

The main areas of work of Internet marketers in terms of customer relations


Source: Econsultancy and Adobe, published by eMarketer.com

Note that the nationality of the respondents does not matter much. According to a survey conducted by KPMG in 2013, the share of "optimists", i.e. those who use Big Data when developing business strategy, is 56%, and the variation from region to region is small: from 63% in North America to 50% in EMEA.

Use of Big Data in various regions of the world


Source: KPMG, published by eMarketer.com

Meanwhile, the attitude of marketers to such “fashion trends” is somewhat reminiscent of a well-known anecdote:

Tell me, Vano, do you like tomatoes?
- To eat them, yes; otherwise, not so much.

Although marketers say they "love" Big Data and even seem to use it, in fact "it's complicated", as people write about their romantic attachments on social networks.

According to a survey conducted by Circle Research in January 2014 among European marketers, 4 out of 5 respondents do not use Big Data (even though they, of course, "love" it). The reasons vary. Inveterate skeptics are few (17%), and there are exactly as many of their opposites, i.e. those who confidently answer "Yes". The rest are hesitant and doubtful, the "swamp". They dodge a direct answer with plausible excuses like "not yet, but soon" or "we'll wait until the others start".

Use of Big Data by marketers, Europe, January 2014


Source: dnx, published by eMarketer.com

What holds them back? Trifles, really. Some (exactly half of them) simply do not believe the data. Others (quite a few of them: 55%) find it difficult to match the "data" sets with the "users". Some simply (to put it politically correctly) have an internal corporate mess: data wanders, ownerless, between marketing departments and IT structures. For others, the software cannot cope with the influx of work. And so on. Since the shares add up to well over 100%, it is clear that a situation of "multiple barriers" is not uncommon.

Barriers preventing the use of Big Data in marketing


Source: dnx, published by eMarketer.com

Thus, we have to acknowledge that so far "Big Data" is a great potential that has yet to be exploited. Incidentally, this may be why Big Data is losing its halo of a "fashion trend", as evidenced by the survey data from Econsultancy already mentioned above.

The most significant trends in digital marketing 2013-2014


Source: Econsultancy and Adobe

It is being replaced by a new king: content marketing. For how long?

It cannot be said that Big Data is some fundamentally new phenomenon. Big data sources have been around for years: databases of customer purchases, credit histories, lifestyles. And for years, scientists have used this data to help companies assess risk and predict future customer needs. However, today the situation has changed in two aspects:

  • More sophisticated tools and methods have emerged to analyze and combine different datasets;
  • These analytical tools are complemented by an avalanche of new data sources driven by the digitization of virtually every method of data collection and measurement.

The range of information available is both inspiring and intimidating for researchers who grew up in a structured research environment. Consumer sentiment is captured by websites and all sorts of social media. The fact of viewing ads is recorded not only by set-top boxes, but also with the help of digital tags and mobile devices that communicate with the TV.

Behavioral data (such as number of calls, shopping habits and purchases) is now available in real time. Thus, much of what could previously be learned through research can now be learned through big data sources. And all these information assets are constantly being generated, regardless of any research processes. These changes make us wonder if big data can replace classical market research.

It's not about the data, it's about questions and answers

Before sounding the death knell for classical research, we must remind ourselves that what is decisive is not the presence of this or that data asset, but something else. What exactly? Our ability to answer questions, that's what. The funny thing about the new world of big data is that results from new data assets lead to even more questions, and those questions tend to be best answered by traditional research. Thus, as big data grows, we see a parallel increase in the availability of, and demand for, "small data" that can provide answers to questions from the world of big data.

Let's consider a situation: a large advertiser constantly monitors traffic in stores and sales volumes in real time. Existing research methodologies (in which we ask participants in research panels about their buying motivations and behavior at the point of sale) help us better target specific customer segments. These methodologies can be expanded to include a wider range of big data assets, to the point where big data becomes a passive observation tool and research a method of ongoing, narrowly focused investigation of changes or events that need to be studied. This is how big data can free research from unnecessary routine. Primary research should no longer focus on what's going on (big data will). Instead, primary research can focus on explaining why we see certain trends or deviations from trends. The researcher will be able to think less about getting data and more about how to analyze and use it.

At the same time, we see that big data is solving one of our biggest problems, the problem of overly long studies. Examining the studies themselves has shown that overly bloated research tools have a negative impact on data quality. Although many experts acknowledged this problem for a long time, they invariably responded with the phrase: “But I need this information for senior management,” and long interviews continued.

In the world of big data, where quantitative indicators can be obtained through passive observation, this issue becomes moot. Again, let's think back to all of this consumption research. If big data gives us insights about consumption through passive observation, then primary research in the form of surveys no longer needs to collect this kind of information, and we can finally back up our vision of short surveys not only with good wishes, but also with something real.

Big Data needs your help

Finally, "big" is just one of the characteristics of big data. The characteristic "large" refers to the size and scale of the data. Of course, this is the main characteristic, since the volume of this data is beyond the scope of everything that we have worked with before. But other characteristics of these new data streams are also important: they are often poorly formatted, unstructured (or, at best, partially structured), and full of uncertainty. The emerging field of data management, aptly named "entity analytics", aims to solve the problem of overcoming noise in big data. Its task is to analyze these datasets and find out how many observations are for the same person, which observations are current, and which of them are usable.
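
For a concrete feel, here is a toy sketch of that entity-analytics idea, assuming observations are deduplicated by a normalized e-mail address and that the most recent record per entity is kept. The field names and matching rule are invented for the example; real entity resolution uses far richer matching logic.

```python
# Group noisy observations that refer to the same person and keep the latest one.
# The identity key (normalised e-mail) is a deliberately crude assumption.
from collections import defaultdict

observations = [
    {"email": "Anna.K@example.com ", "name": "Anna K.",  "ts": 1},
    {"email": "anna.k@example.com",  "name": "Anna Kim", "ts": 3},
    {"email": "b.lee@example.com",   "name": "B. Lee",   "ts": 2},
]

def resolve(records):
    groups = defaultdict(list)
    for rec in records:
        key = rec["email"].strip().lower()    # crude identity key
        groups[key].append(rec)
    # keep the most recent observation per resolved entity
    return {key: max(recs, key=lambda r: r["ts"]) for key, recs in groups.items()}

for key, latest in resolve(observations).items():
    print(key, "->", latest["name"])
```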

This kind of data cleansing is necessary to remove noise or erroneous data when working with big or small data assets, but it is not enough. We also need to create context around big data assets based on our previous experience, analytics and category knowledge. In fact, many analysts point to the ability to manage the uncertainty inherent in big data as a source of competitive advantage, as it enables better decision making.

And this is where primary research is not only freed from routine thanks to big data, but also contributes to content creation and analysis within big data.

A prime example of this is the application of our new brand equity framework to social media (we are talking about The Meaningfully Different Framework, a new approach to measuring brand value developed at Millward Brown - "the paradigm of meaningful difference" - R&T). This model is behavior-tested within specific markets, is implemented on a standard basis, and can easily be applied to other marketing disciplines and decision-support information systems. In other words, our brand equity model, based on (though not exclusively on) survey research, has all the properties needed to overcome the unstructured, incoherent and uncertain nature of big data.

Consider consumer sentiment data provided by social media. In its raw form, peaks and valleys in consumer sentiment are very often minimally correlated with offline measures of brand equity and behavior: there is simply too much noise in the data. But we can reduce this noise by applying our models of consumer meaning, brand differentiation, dynamics, and identity to raw consumer sentiment data, which is a way of processing and aggregating social media data along these dimensions.

Once the data is organized according to our framework model, the identified trends usually match the brand equity and behavior measurements obtained offline. In fact, social media data cannot speak for itself. To use them for this purpose requires our experience and models built around brands. When social media gives us unique information expressed in the language that consumers use to describe brands, we must use that language when creating our research to make primary research much more effective.
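
Purely as a schematic illustration of "aggregating social media data along model dimensions" (not Millward Brown's actual model), the sketch below maps noisy post-level sentiment onto a few assumed dimensions and averages it per period, which smooths out much of the post-to-post noise.

```python
# Post-level sentiment is mapped onto assumed model dimensions and averaged
# per week; the dimensions and scores are invented for illustration.
from collections import defaultdict
from statistics import mean

posts = [
    {"week": 1, "dimension": "meaningful", "sentiment": 0.9},
    {"week": 1, "dimension": "meaningful", "sentiment": -0.2},
    {"week": 1, "dimension": "different",  "sentiment": 0.4},
    {"week": 2, "dimension": "meaningful", "sentiment": 0.5},
    {"week": 2, "dimension": "different",  "sentiment": 0.7},
]

aggregated = defaultdict(list)
for post in posts:
    aggregated[(post["week"], post["dimension"])].append(post["sentiment"])

for (week, dim), scores in sorted(aggregated.items()):
    print(f"week {week}, {dim}: {mean(scores):+.2f}")
```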

The benefits of liberated research

This brings us back to the fact that big data is not so much replacing research as it is freeing it up. Researchers will be relieved of having to create a new study for each new case. The ever-growing assets of big data can be used for different research topics, allowing subsequent primary research to delve deeper into the topic and fill in the gaps. Researchers will be freed from having to rely on overly inflated surveys. Instead, they will be able to use short surveys and focus on the most important parameters, which improves the quality of the data.

With this release, researchers will be able to use their established principles and insights to add precision and meaning to big data assets, leading to new areas for survey research. This cycle should lead to a deeper understanding on a range of strategic issues and ultimately a move towards what should always be our main goal - to inform and improve the quality of brand and communications decisions.

The term "Big Data" may be recognizable today, but there is still quite a bit of confusion around what it actually means. In truth, the concept is constantly evolving and being revised, as it remains the driving force behind many ongoing waves of digital transformation, including artificial intelligence, data science and the Internet of Things. But what is Big Data technology and how is it changing our world? Let's try to understand the essence of Big Data technology and explain it in simple words.

It all started with an “explosion” in the amount of data we have created since the dawn of the digital age. This is largely due to the development of computers, the Internet and technologies that can "snatch" data from the world around us. Data by itself is not a new invention. Even before the era of computers and databases, we used paper transaction records, client records, and archive files, which are data. Computers, especially spreadsheets and databases, have made it easy for us to store and organize data on a large scale. All of a sudden, information is available at the click of a mouse.

However, we have come a long way from the first tables and databases. Today, every two days we create as much data as was created in total from the beginning of time up to the year 2000. That's right, every two days. And the amount of data we create continues to skyrocket; by 2020, the amount of available digital information will have grown from about 5 zettabytes to 20 zettabytes.

Nowadays, almost every action we take leaves its mark. We generate data whenever we access the Internet, when we carry our smartphones equipped with a search engine, when we talk with our acquaintances through social networks or chats, etc. In addition, the amount of machine-generated data is also growing rapidly. Data is generated and shared when our smart home devices communicate with each other or with their home servers. Industrial equipment in plants and factories is increasingly equipped with sensors that accumulate and transmit data.

The term "Big Data" refers to the collection of all this data and our ability to use it to our advantage in a wide range of areas, including business.

How does Big Data technology work?

Big Data works on the principle that the more you know about a particular subject or phenomenon, the more reliably you can reach a new understanding and predict what will happen in the future. By comparing more data points, relationships that were previously hidden emerge, and these relationships allow us to learn and make better decisions. This is most often done through a process that involves building models from the data we can collect and then running simulations that tweak the values of the data points each time and observe how they affect the results. This process is automated: modern analytics technologies will run millions of these simulations, tweaking every possible variable until they find a model, or idea, that helps solve the problem they are working on.
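
As a toy illustration of that "tweak the variables and re-run" loop, the sketch below sweeps candidate parameters of a simple demand model and keeps the combination that best explains the observed points; the data and the model form are invented purely for the example.

```python
# Sweep candidate parameters of a hypothetical demand model and keep the pair
# that best fits the observed (price, units sold) points.
observed = [(10, 95), (12, 88), (15, 76), (18, 63), (20, 55)]

def simulate(price, base, sensitivity):
    """Hypothetical linear demand model."""
    return base - sensitivity * price

def error(base, sensitivity):
    """Sum of squared differences between simulation and observation."""
    return sum((simulate(p, base, sensitivity) - units) ** 2 for p, units in observed)

# Exhaustively "tweak every possible variable" over a small grid.
best = min(
    ((base, sens) for base in range(100, 161) for sens in range(1, 11)),
    key=lambda params: error(*params),
)
print("best-fitting parameters:", best)
```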

Bill Gates towers over the printed contents of a single CD

Until recently, data was limited to spreadsheets or databases - and everything was very organized and tidy. Anything that could not be easily organized into rows and columns was considered too complex to work with and was ignored. However, progress in storage and analytics means that we can capture, store and process a large amount of data of various types. As a result, "data" today can mean anything from databases to photographs, videos, sound recordings, written texts, and sensor data.

To understand all this messy data, projects based on Big Data often use cutting-edge analytics, using artificial intelligence and machine learning. By teaching computers to identify what specific data is—for example, through pattern recognition or natural language processing—we can teach them to identify patterns much faster and more reliably than we can.

How is Big Data used?

This ever-increasing flow of sensor, text, voice, photo and video data means that we can now use data in ways that were unimaginable just a few years ago. This is bringing revolutionary changes to the business world in almost every industry. Companies today can predict with incredible accuracy which specific categories of customers will want to make a purchase, and when. Big Data also helps companies run their operations much more efficiently.

Even outside of business, Big Data projects are already helping to change our world in a variety of ways:

  • Improving healthcare - data-driven medicine can analyze vast amounts of medical information and images for patterns that can help detect disease at an early stage and develop new drugs.
  • Predicting and responding to natural and man-made disasters. Sensor data can be analyzed to predict where earthquakes might occur, and human behavior patterns provide clues that help organizations provide assistance to survivors. Big Data technology is also being used to track and protect the flow of refugees from war zones around the world.
  • Preventing crime. Police forces are increasingly using data-driven strategies that incorporate their own intelligence and public domain information to make better use of resources and take countermeasures where needed.

The best books about Big Data technology

  • Everybody lies. Search engines, Big Data and the Internet know everything about you.
  • BIG DATA. All technology in one book.
  • The Happiness Industry. How Big Data and new technologies help add emotion to goods and services.
  • A revolution in analytics. How to improve your business with operational analytics in the era of Big Data.

Problems with Big Data

Big Data gives us unprecedented insights and opportunities, but it also raises issues and questions that need to be addressed:

  • Data Privacy – The Big Data we generate today contains a lot of information about our personal lives that we have every right to keep private. More and more often, we are asked to strike a balance between the amount of personal data we disclose and the convenience that applications and services based on the use of Big Data offer.
  • Data Protection - Even if we think we're fine with someone having our data for a specific purpose, can we trust them to keep our data safe and secure?
  • Data discrimination - when all the information is known, will it be acceptable to discriminate against people based on data from their personal lives? We already use credit scores to decide who can borrow money, and insurance is heavily data-driven too. We should expect to be analyzed and assessed in more detail, but care should be taken that this does not complicate the lives of those who have fewer resources and limited access to information.

Addressing these issues is an important part of Big Data, and organizations that want to use such data need to deal with them. Failure to do so can leave a business vulnerable, not only in terms of its reputation, but also legally and financially.

Looking to the future

Data is changing our world and our lives at an unprecedented pace. If Big Data is capable of all this today, just imagine what it will be capable of tomorrow. The amount of data available to us will only increase, and analytics technology will become even more advanced.

For businesses, the ability to apply Big Data will become increasingly critical in the coming years. Only those companies that view data as a strategic asset will survive and thrive. Those who ignore this revolution risk being left behind.



At one time, I heard the term “Big Data” from German Gref (head of Sberbank). Like, they are now actively working on implementation, because this will help them reduce the time they work with each client.

The second time I came across this concept was while working on a client's online store, where we grew the assortment from a couple of thousand to a couple of tens of thousands of product items.

The third time I saw that Yandex needed a big data analyst. Then I decided to delve deeper into this topic and at the same time write an article that will tell you what kind of term it is that excites the minds of TOP managers and the Internet space.

VVV or VVVVV

I usually start any of my articles with an explanation of what kind of term it is. This article will be no exception.

However, this is not primarily due to the desire to show how smart I am, but because the topic is really complex and requires careful explanation.

For example, you can read what big data is on Wikipedia, do not understand anything, and then return to this article in order to understand the definition and applicability for business. So, let's start with a description, and then to business examples.

Big data is, literally, "big data". Amazing, right? That is simply how the English term translates. But this definition, one might say, is for dummies.

Important. Big data technology is an approach/method of processing large volumes of data to obtain new information, where those volumes are difficult to process by conventional means.

Data can be both processed (structured) and fragmented (i.e. unstructured).

The term itself appeared relatively recently. In 2008, a scientific journal predicted this approach as something necessary to deal with a large amount of information that is growing exponentially.

For example, every year the amount of information on the Internet that needs to be stored, and of course processed, increases by 40%. Once again: 40% more new information appears on the Internet every year.

If printed documents are well understood and the ways of processing them are also understood (convert them to electronic form, file them in a single folder, number them), then what should be done with information presented on completely different "media" and in completely different volumes:

  • Internet documents;
  • blogs and social networks;
  • audio/video sources;
  • measuring devices.

There are characteristics that make it possible to classify information and data as big data.

That is, not all data is suitable for analytics. These characteristics contain the key concept of big data. They all fit into three Vs.

  1. Volume. The data is measured in terms of the physical volume of the "documents" to be analyzed;
  2. Velocity. The data does not stand still but constantly grows, which is why it needs to be processed quickly to obtain results;
  3. Variety. The data may not be uniform; it can be fragmented, structured or partially structured.

However, from time to time a fourth V (veracity - the reliability/credibility of the data) and even a fifth V are added to the VVV (in some cases viability, in others value).

Somewhere I have even seen 7 Vs used to characterize data related to big data. But in my opinion this is like the marketing Ps, to which new Ps are periodically added even though the original 4 are enough for understanding.


Who needs it?

A logical question arises: how can this information be used? (Just to be clear, big data means hundreds and thousands of terabytes.) Or, to put it another way: the information is there, so why was big data invented at all? What is the use of big data in marketing and business?

  1. Conventional databases cannot store and process (I'm not even talking about analytics now, but simply storing and processing) a huge amount of information.

    Big data solves this main problem: it successfully stores and manages information of large volume;

  2. Structures information coming from various sources (video, images, audio and text documents) into one single, understandable and digestible form;
  3. Formation of analytics and creation of accurate forecasts based on structured and processed information.

It sounds complicated. But, simply put, any marketer understands that if you study a large amount of information (about yourself, your company, your competitors, your industry), you can get very decent results:

  • Fully understand your company and your business in terms of numbers;
  • Study your competitors, which in turn will make it possible to get ahead of them;
  • Learn new information about your customers.

And precisely because big data technology delivers these results, everyone is rushing to adopt it.

Companies are trying to bolt it onto their business in order to increase sales and reduce costs. To be specific:

  1. Increasing cross-sells and up-sells through better knowledge of customer preferences;
  2. Search for popular products and reasons why they are bought (and vice versa);
  3. Product or service improvement;
  4. Improvement in the level of service;
  5. Increasing loyalty and customer focus;
  6. Fraud prevention (more relevant for the banking sector);
  7. Reducing excess costs.

The most common example given in all sources is, of course, Apple, which collects data about its users (phone, watch, computer).

It is thanks to this ecosystem that the corporation knows so much about its users and can later use this for profit.

You can read these and other examples of use in any other article except this one.

Let's go to the future

I will tell you about another project. Or rather, about a person who builds the future using big data solutions.

This is Elon Musk and his company Tesla. His main dream is to make cars autonomous: you get behind the wheel, turn on the autopilot from Moscow to Vladivostok and ... fall asleep, because you don't need to drive the car at all - it will do everything itself.

Sounds like fantasy? But no! Elon simply acted much more wisely than Google, which controls its cars using dozens of satellites. He went a different way:

  1. Each car sold is equipped with a computer that collects all the information.

    All means everything. About the driver, his driving style, the roads around, the movement of other cars. The volume of such data reaches 20-30 GB per hour;

  2. Further, this information is transmitted via satellite to the central computer, which processes this data;
  3. Based on the big data that this computer processes, a model of an unmanned vehicle is built.

By the way, while Google is doing rather poorly and its cars keep getting into accidents, Musk, thanks to his work with big data, is doing much better: his test models show very good results.

But... it's all about the economy. Why do we keep going on about profit? Much of what big data can solve is completely unrelated to earnings and money.

Google's statistics, based precisely on big data, show an interesting thing.

Before doctors announce the beginning of an epidemic of a disease in a region, the number of search queries for the treatment of this disease increases significantly in this region.

Thus, the correct study of the data and their analysis can form forecasts and predict the onset of the epidemic (and, accordingly, its prevention) much faster than the opinion of the authorities and their actions.

Application in Russia

However, Russia, as always, lags a little behind. The very definition of big data appeared in Russia no more than 5 years ago (I am talking about ordinary companies now).

And this despite the fact that it is one of the fastest-growing markets in the world (the drug and arms markets are nervously smoking on the sidelines): every year the market for software for collecting and analyzing big data grows by 32%.

To characterize the big data market in Russia, I am reminded of an old joke: big data is like sex before 18.

Everyone is talking about it, there is a lot of hype around it and little real action, and everyone is ashamed to admit that they themselves are not doing it.

Meanwhile, the well-known research company Gartner announced back in 2015 that big data is no longer a rising trend (like artificial intelligence, by the way), but a set of fully independent tools for analysis and for developing advanced technologies.

The most active niches where big data is used in Russia are banks / insurance (not without reason I started the article with the head of Sberbank), telecommunications, retail, real estate and ... the public sector.

For example, I will tell you in more detail about a couple of sectors of the economy that use big data algorithms.

Banks

Let's start with banks and the information they collect about us and our actions. As an example, I took the top 5 Russian banks that are actively investing in big data:

  1. Sberbank;
  2. Gazprombank;
  3. VTB 24;
  4. Alfa Bank;
  5. Tinkoff bank.

It is especially pleasant to see Alfa Bank among the Russian leaders. At the very least, it is nice to know that a bank you officially partner with understands the need to introduce new marketing tools.

But I want to show examples of the use and successful implementation of big data at a bank that I like for the unconventional outlook and actions of its founder.

I'm talking about Tinkoff Bank. Its main task was to develop a system for analyzing big data in real time because of its rapidly growing customer base.

Results: the time taken by internal processes was reduced by a factor of at least 10, and for some processes by more than 100.

And now a small digression. Do you know why I mentioned the unconventional antics and actions of Oleg Tinkov?

It’s just that, in my opinion, it was they who helped him turn from a middle-class businessman, of which there are thousands in Russia, into one of the most famous and recognizable entrepreneurs. To prove it, watch this unusual and interesting video:

Real estate

In real estate, things are much more complicated. And this is exactly the example I want to give you to understand big data within an ordinary business. The initial data:

  1. Large volume of text documentation;
  2. Open sources (private satellites transmitting earth change data);
  3. The vast amount of uncontrolled information on the Internet;
  4. Constant changes in sources and data.

And on this basis it is necessary to prepare a valuation of a land plot, for example one near a village in the Urals. For a professional, this would take a week.

For the Russian Society of Appraisers & ROSEKO, which actually implemented big data analysis in software, it takes no more than 30 minutes of unhurried work. Compare: a week versus 30 minutes. A colossal difference.

And for dessert

Of course, huge amounts of information cannot be stored and processed on simple hard drives.

And the software that structures and analyzes the data is generally intellectual property, a proprietary development in each case. However, there are tools on the basis of which all this beauty is built:

  • Hadoop & MapReduce;
  • NoSQL databases;
  • Tools of the Data Discovery class.
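
To give at least a feel for the first item in the list above, here is a tiny in-memory sketch of the MapReduce idea behind Hadoop: map each record to key-value pairs, group them by key, then reduce each group. Hadoop does the same thing but distributes the map and reduce steps across many machines; this word-count example is the classic textbook illustration, not an actual Hadoop program.

```python
# Map: emit (word, 1) pairs; Shuffle: group by key; Reduce: combine each group.
from collections import defaultdict

documents = ["big data is big", "data about data"]

# Map: emit (word, 1) for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted pairs by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each group into a single result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # e.g. {'big': 2, 'data': 3, ...}
```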

To be honest, I can’t clearly explain to you how they differ from each other, since acquaintance and work with these things are taught in physical and mathematical institutes.

Why then did I bring this up if I cannot explain it? Remember how in all the movies the robbers break into a bank and see a huge number of all sorts of metal boxes connected by wires?

The same is true for big data. For example, here is a model that is currently one of the market leaders.

Big data tool

The cost in the maximum configuration reaches 27 million rubles per rack. This is, of course, the deluxe version. My point is that you should weigh the cost of building big data in your business in advance.

Briefly about the main

You may ask: why should you, a small or medium-sized business, work with big data?

To this I will answer with a quote from one person: "In the near future, the companies in demand among customers will be those that better understand their behavior and habits and match them as closely as possible."

But let's face it. To implement big data in a small business, you need not only large budgets for developing and implementing software, but also budgets for specialists, at the very least a big data analyst and a system administrator.

And I am not even mentioning that you need to have such data to process in the first place.

OK. For small businesses, the topic is almost not applicable. But this does not mean that you need to forget everything that you have read above.

Just study not your own data, but the results of data analytics from well-known foreign and Russian companies.

For example, the Target retail chain, using big data analytics, found out that pregnant women, before the second trimester of pregnancy (from the 1st to the 12th week), actively buy unscented products.

With this data, they send them discount coupons for unscented products with a limited expiration date.

And what if you are, say, just a very small cafe? It's simple: use a loyalty app.

And after some time, thanks to the accumulated information, you will be able not only to offer customers dishes relevant to their needs, but also to see the least-sold and highest-margin dishes with just a couple of mouse clicks; a sketch of such a report is shown below.
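
A minimal sketch, assuming the loyalty app exports a simple sales log; the dish names, prices and counts are made up for illustration.

```python
# From a sales log, find the least-sold dish and the highest-margin dish.
sales = [
    {"dish": "cappuccino",  "sold": 420, "price": 3.0, "cost": 0.8},
    {"dish": "cheesecake",  "sold": 95,  "price": 4.5, "cost": 1.5},
    {"dish": "soup of day", "sold": 12,  "price": 5.0, "cost": 2.5},
]

least_sold = min(sales, key=lambda d: d["sold"])
highest_margin = max(sales, key=lambda d: (d["price"] - d["cost"]) / d["price"])

print("least sold:", least_sold["dish"])
print("highest margin:", highest_margin["dish"])
```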

Hence the conclusion. It is hardly worth implementing big data for small businesses, but using the results and developments of other companies is a must.

Each industrial revolution has had its symbols: iron and steam, steel and mass production, polymers and electronics, and the next revolution will be marked by composite materials and data. Big Data - a false trail or the future of the industry?

12/20/2011 Leonid Chernyak

The symbols of the first industrial revolution were cast iron and steam, the second - steel and in-line production, the third - polymer materials, aluminum and electronics, and the next revolution will be under the sign of composite materials and data. Is Big Data a false trail or the future of the industry?

For more than three years, a lot has been said and written about big data (Big Data) in combination with the word "problem", reinforcing the mystique of this topic. During this time, the "problem" has become the focus of attention of the vast majority of large manufacturers, who count on finding a solution to it; many startups are being created; and all the leading industry analysts are trumpeting how important the ability to work with large amounts of data now is for ensuring competitiveness. Such mass enthusiasm, not always well reasoned, provokes dissent, and you can find plenty of skeptical statements on the same topic; sometimes the epithet red herring (literally "smoked herring": a false trail, a distracting maneuver) is even applied to Big Data.

So what is Big Data? The easiest way is to present Big Data as an avalanche of data that has spontaneously descended from nowhere, or to reduce the problem to new technologies that radically change the information environment; or perhaps, together with Big Data, we are living through another stage of the technological revolution? Most likely it is all of these, and something else still unknown. It is telling that of the more than four million pages on the Web containing the phrase Big Data, one million also contain the word definition: at least a quarter of those writing about Big Data are trying to give their own definition. Such mass interest suggests that, most likely, there is something qualitatively different in Big Data from what everyday intuition pushes us towards.

Background

The fact that the vast majority of references to Big Data are somehow related to business can be misleading. In fact, the term was by no means born in a corporate environment; it was borrowed by analysts from scientific publications. Big Data is one of the few terms with a quite reliable date of birth: September 3, 2008, when a special issue of the oldest British scientific journal Nature was published, dedicated to answering the question "How can technologies that open up the possibility of working with large volumes of data influence the future of science?" The special issue summarized previous discussions about the role of data in science in general and in e-science in particular.

The role of data in science has been the subject of discussion for a very long time. The Englishman Thomas Simpson was the first to write about data processing, back in the 18th century, in his work "On the Advantages of Using Numbers in Astronomical Observations", but only at the end of the last century did interest in the topic become noticeable, and data processing came to the fore when it was discovered that computer methods can be applied in almost all sciences, from archeology to nuclear physics. As a consequence, scientific methods themselves are changing noticeably. It is no coincidence that the neologism libratory appeared, formed from the words library and laboratory, reflecting a change in the idea of what can be considered the result of research. Until now, only the final results obtained, and not the raw experimental data, were submitted to the judgment of colleagues; but now, when a great variety of data can be converted into digital form and various digital media are available, the object of publication can be measured data of various kinds, and of particular importance is the possibility of re-processing previously accumulated data in the libratory. A positive feedback loop then arises, due to which the process of accumulating scientific data constantly accelerates. That is why, realizing the scale of the coming changes, Clifford Lynch, the editor of the Nature issue, proposed a special name for the new paradigm, Big Data, chosen by analogy with metaphors such as Big Oil, Big Ore, etc., reflecting not so much the quantity of something as the transition of quantity into quality.

Big Data and business

Less than a year later, the term Big Data appeared on the pages of the leading business publications, which, however, used completely different metaphors. Big Data is compared with mineral resources (the new oil, gold rush, data mining), emphasizing the role of data as a source of hidden information; with natural disasters (data tornado, data deluge, data tidal wave), seeing it as a threat; and with industrial production (data exhaust, firehose, Industrial Revolution), capturing its connection with industry. In business, as in science, big data is not something completely new either: the need to work with big data has long been discussed, for example in connection with the spread of radio frequency identification (RFID) and social networks, and, just as in science, all that was missing was a vivid metaphor for what was happening. That is why the first products claiming to fall into the Big Data category appeared in 2010: a suitable name had been found for things that already existed. It is telling that in the 2011 version of the Hype Cycle, which characterizes the state and prospects of new technologies, Gartner analysts introduced a new entry, Big Data and Extreme Information Processing and Management, with an estimate of two to five years before mass adoption of the corresponding solutions.

Why is Big Data a problem?

Three years have passed since the term Big Data appeared, but while in science things are more or less clear, the place of Big Data in business remains uncertain. It is no coincidence that people often speak of the "problem of Big Data", and not just of a problem, but of one that is, on top of everything else, poorly defined. The problem is often simplified, interpreted like Moore's law, with the only difference that here we are dealing with the phenomenon of the amount of data doubling every year; or it is exaggerated and presented almost as a natural disaster that urgently needs to be dealt with somehow. There is indeed more and more data, but all of this overlooks the fact that the problem is by no means external: it is caused not so much by the incredible amount of data that has descended on us as by the inability of old methods to cope with the new volumes, and, most importantly, it is a problem we have created ourselves. There is a strange imbalance: the ability to generate data has turned out to be stronger than the ability to process it. The reason for this bias is most likely that in the 65 years of computer history we have not understood what data is and how it relates to the results of processing. Strangely, mathematicians have for centuries dealt with the basic concepts of their science, such as number and number systems, involving philosophers in the process, while in our case data and information, by no means trivial things, have been left unattended and abandoned to intuitive perception. So it has turned out that throughout these 65 years data processing technologies themselves have developed at an incredible pace, while cybernetics and information theory have hardly developed at all, remaining at the level of the 1950s, when tube computers were used exclusively for calculations. Indeed, on close examination, the fuss currently surrounding Big Data provokes a skeptical smile.

Scaling and tiering

Clouds, big data, analytics - these three factors of modern IT are not only interconnected, but today they cannot exist without each other. Working with Big Data is impossible without cloud storage and cloud computing - the emergence of cloud technologies, not only in the form of an idea, but already in the form of completed and implemented projects, has become the trigger for launching a new spiral of increasing interest in Big Data analytics. If we talk about the impact on the industry as a whole, today the increased requirements for scaling storage systems have become apparent. This is indeed a necessary condition, because it is difficult to predict in advance which analytical processes will require certain data and how intensively the existing storage will be loaded. In addition, the requirements for both vertical and horizontal scaling become equally important.

In the new generation of its storage systems, Fujitsu has paid great attention to scaling and storage tiering. Practice shows that analytical workloads today place heavy loads on systems, while the business requires that all services, applications, and the data itself remain available at all times. Moreover, the requirements placed on the results of analytical work are very high: analytical processes carried out competently, correctly, and in a timely manner can significantly improve business results as a whole.

Alexander Yakovlev ([email protected]), Product Marketing Manager at Fujitsu (Moscow).

By ignoring the role of data and information as subjects of research in their own right, we laid the very mine that has now exploded - at a time when needs have changed, when the purely computational load on computers has become much smaller than other kinds of work performed on data, and when the purpose of that work is to obtain new information and new knowledge from existing data sets. That is why it is pointless to talk about solving the Big Data problem outside of restoring the links in the chain "data - information - knowledge." Data is processed to obtain information, and there should be just enough of it for a person to turn it into knowledge.

Over the past decades there has been no serious work on the relationship between raw data and useful information, and what we habitually call Claude Shannon's information theory is in fact a statistical theory of signal transmission that has nothing to do with information as perceived by a person. There are many individual publications reflecting particular points of view, but there is no full-fledged modern theory of information. As a result, the vast majority of specialists do not distinguish between data and information at all. Everyone merely states that there is a lot of data, or a great deal of it, but no one has a mature idea of what exactly there is a lot of or how the problem should be approached - all because the technical ability to work with data has clearly outstripped our ability to use it. Only one author, Dion Hinchcliffe, editor of the Web 2.0 Journal, has proposed a classification of Big Data that relates technologies to the result expected from processing, and even it is far from satisfactory.

Hinchcliffe divides approaches to Big Data into three groups: Fast Data, whose volume is measured in terabytes; Big Analytics, petabytes of data; and Deep Insight, exabytes and zettabytes. The groups differ from one another not only in the volume of data they operate on, but also in the quality of the solutions for processing it.

Processing in Fast Data does not imply obtaining new knowledge: its results are correlated with a priori knowledge and allow one to judge how particular processes are proceeding, to see what is happening in greater detail, and to confirm or reject hypotheses. Only a small fraction of existing technologies is suited to Fast Data tasks; the list includes some storage technologies (Greenplum, Netezza, Oracle Exadata, Teradata, and DBMSs such as Vertica and kdb). The speed of these technologies must grow in step with data volumes.

The tasks solved by Big Analytics tools are noticeably different, not only quantitatively but also qualitatively; the corresponding technologies should help obtain new knowledge - they serve to transform the information recorded in the data into new knowledge. However, this middle level does not assume artificial intelligence in the choice of decisions or any autonomous actions by the analytical system: it is built on the principle of "learning with a teacher," that is, supervised learning. In other words, all of its analytical potential is instilled in it during training. The most obvious example is a machine playing Jeopardy!. Classic representatives of such analytics are MATLAB, SAS, Revolution R, Apache Hive, SciPy, and Apache Mahout.
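
As a rough illustration of "learning with a teacher": a supervised model reproduces only what its labeled training data contain. The minimal sketch below, in plain Python (none of the products named above; the feature values and labels are invented), trains a 1-nearest-neighbor classifier on hand-made examples; all of its "analytical potential" is fixed at training time.

```python
# Minimal supervised learning sketch: 1-nearest-neighbor classification.
# Everything the model "knows" comes from the labeled training set.

def distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(train, x):
    """Return the label of the training example closest to x."""
    _, label = min(train, key=lambda pair: distance(pair[0], x))
    return label

# Labeled training data: (features, label) pairs - the "teacher".
train = [
    ((0.9, 120.0), "fraud"),
    ((0.2, 15.0), "normal"),
    ((0.8, 300.0), "fraud"),
    ((0.1, 40.0), "normal"),
]

if __name__ == "__main__":
    for x in [(0.85, 200.0), (0.15, 20.0)]:
        print(x, "->", predict(train, x))
```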

The highest level, Deep Insight, involves unsupervised learning and the use of modern analytics methods, as well as various visualization methods. At this level, it is possible to discover knowledge and patterns that are a priori unknown.

Big Data Analytics

Over time, computer applications are getting closer to the real world in all its diversity; hence the growth in the volume of input data, and hence the need to analyze it, in a mode as close to real time as possible. The convergence of these two trends has given rise to the field of Big Data Analytics.

The victory of the Watson computer was a brilliant demonstration of the capabilities of Big Data Analytics - we are entering an interesting era in which the computer is used not so much as a tool for speeding up calculations as an assistant that extends human abilities in selecting information and making decisions. The seemingly utopian plans of Vannevar Bush, Joseph Licklider, and Doug Engelbart are beginning to come true, though not quite as they were envisioned decades ago: the computer's power lies not in superiority over a person in logical capabilities, which scientists had particularly hoped for, but in a vastly greater ability to handle huge amounts of data. Something similar happened in Garry Kasparov's match with Deep Blue: the computer was not a more skillful player, it could simply evaluate more variants faster.

The gigantic volumes combined with high speed that distinguish Big Data Analytics from other applications require appropriate computers, and today almost all major manufacturers offer specialized hardware and software systems: SAP HANA, Oracle Big Data Appliance, Oracle Exadata Database Machine and Oracle Exalytics Business Intelligence Machine, Teradata Extreme Performance Appliance, NetApp E-Series Storage Technology, IBM Netezza Data Appliance, EMC Greenplum, Vertica Analytics Platform powered by HP Converged Infrastructure. In addition, many small and start-up companies have entered the game: Cloudera, DataStax, Northscale, Splunk, Palantir, Factual, Kognitio, Datameer, TellApart, Paraccel, Hortonworks.

Feedback

Qualitatively new Big Data Analytics applications require not only new technologies but also a qualitatively different level of systems thinking, and here there are difficulties: developers of Big Data Analytics solutions often rediscover truths known since the 1950s. As a result, analytics is frequently considered in isolation from the means of preparing the source data, visualization, and other technologies for delivering results to a person. Even such a respected organization as The Data Warehousing Institute treats analytics in isolation from everything else: according to its data, 38% of enterprises are already exploring the use of Advanced Analytics in their management practice, and another 50% intend to do so within the next three years. This interest is justified with many business arguments, although it can be put more simply: in the new conditions, enterprises need a more advanced management system, and building one has to start with establishing feedback - that is, with a system that helps in making decisions; later it may become possible to automate decision-making itself. Surprisingly, all of this fits into the methodology for building automated control systems for technological objects, known since the 1960s.

New analysis tools are needed because there is not simply more data than before: there are more external and internal sources of it, the data is now more complex and varied (structured, unstructured, and quasi-structured), and different indexing schemes are used (relational, multidimensional, NoSQL). It is no longer possible to handle data in the old ways - Big Data Analytics extends to large and complex arrays, which is why the terms Discovery Analytics and Exploratory Analytics are also used. Whatever it is called, the essence is the same: feedback that supplies decision-makers, in an acceptable form, with information about processes of various kinds.

Components

Appropriate hardware and software technologies are used to collect the raw data; which ones depends on the nature of the controlled object (RFID, information from social networks, text documents of various kinds, and so on). This data is fed to the input of the analytical engine (the regulator in the feedback loop, if we continue the analogy with cybernetics). The regulator is based on a hardware and software platform on which the analytical software itself runs; it cannot by itself generate control actions sufficient for automatic control, so data scientists or data engineers are included in the loop. Their function can be compared to the role played, for example, by specialists in electrical engineering, who apply knowledge from physics to the creation of electrical machines. The engineers' task is to manage the process of turning data into information used for decision-making - they close the feedback loop. Of the four components of Big Data Analytics, we are interested here in only one: the software and hardware platform (systems of this type are called Analytic Appliance or Data Warehouse Appliance).
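
The loop just described can be caricatured in a few lines. The sketch below is purely schematic (the function names, sample values, and threshold are invented, and no vendor product is implied): raw measurements flow into an analytical engine that flags candidate findings, and a human reviewer closes the loop by turning accepted findings into control actions.

```python
# Schematic sketch of the feedback loop: source -> analytical engine -> human -> action.
from statistics import mean, stdev

def collect_raw_data():
    """Stand-in for RFID readers, social feeds, log files, etc. (made-up values)."""
    return [12.1, 11.8, 12.3, 35.0, 12.0, 11.9, 40.2, 12.2]

def analytical_engine(samples, threshold=1.0):
    """The 'regulator': flag values far from the mean as candidate anomalies."""
    m, s = mean(samples), stdev(samples)
    return [x for x in samples if abs(x - m) > threshold * s]

def data_scientist_review(candidates):
    """A human closes the loop; here every candidate is simply accepted."""
    return [("investigate", x) for x in candidates]

if __name__ == "__main__":
    raw = collect_raw_data()
    candidates = analytical_engine(raw)
    print("control actions:", data_scientist_review(candidates))
```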

For a number of years Teradata was the only manufacturer of specialized analytical machines, but it was not the first: back in the late 1970s ICL, then the leader of the British computer industry, made a not very successful attempt to create a content-addressable data store based on the IDMS DBMS. Britton Lee was the first to build a "database engine," in 1983, on a multiprocessor configuration of Zilog Z80 family processors. Britton Lee was later bought by Teradata, which since 1984 had been producing MPP-architecture computers for decision support systems and data warehouses. Netezza was the first of a new generation of vendors of such systems: its Netezza Performance Server solution used standard blade servers together with specialized Snippet Processing Unit blades.

Analytics in DBMS

Predictive analytics (Predictive Analysis, PA) comes first. In most existing implementations, the source data for PA systems is data previously accumulated in data warehouses. For analysis, the data is first moved to intermediate data marts (Independent Data Mart, IDM), where the presentation of the data does not depend on the applications that use it, and then the same data is moved to specialized analytical data marts (Analytical Data Mart, ADM), where specialists work with it using various data mining tools. Such a multi-stage model is quite acceptable for relatively small volumes of data, but as volumes grow and demands for timeliness increase, it reveals a number of shortcomings. Besides the need to move data, the existence of many independent ADMs complicates the physical and logical infrastructure, the number of modeling tools in use grows, results obtained by different analysts turn out to be inconsistent, and computing power and channels are used suboptimally. Moreover, the separate existence of warehouses and ADMs makes near-real-time analytics practically impossible.

The way out may be an approach called In-Database Analytics, or No-Copy Analytics, which involves using the data directly in the database for analytics. Such DBMSs are sometimes called analytical or parallel. The approach has become especially attractive with the advent of MapReduce and Hadoop technologies. In new-generation In-Database Analytics applications, all data engineering and other data-intensive work is performed directly on the data in the warehouse. Obviously, this significantly speeds up processes and enables near-real-time applications such as pattern recognition, clustering, regression analysis, and various kinds of forecasting. The speedup comes not only from eliminating the movement of data from warehouses to data marts, but mainly from various parallelization techniques, including cluster systems with unlimited scaling. Solutions of the In-Database Analytics class also open up the possibility of using cloud technologies for analytics. A further step is SAP HANA (High Performance Analytic Appliance) technology, whose essence is to place the data being analyzed in RAM.
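
The core idea can be shown in miniature. The sketch below uses Python's built-in sqlite3 purely as a stand-in (it is not an analytical MPP DBMS, and the table and column names are invented): instead of copying every row out to an application-side mart and aggregating there, the aggregation is pushed into the database as SQL, and only the result crosses the boundary.

```python
# In-database vs. copy-out analytics, illustrated with sqlite3 as a stand-in.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("north", 80.0), ("south", 200.0), ("south", 40.0)],
)

# Copy-out approach: pull every row into the application, then aggregate.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
totals = {}
for region, amount in rows:
    totals[region] = totals.get(region, 0.0) + amount
print("copy-out:", totals)

# In-database approach: push the aggregation down; only results are returned.
in_db = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
).fetchall())
print("in-database:", in_db)
```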

Major suppliers...

By 2010, the main software vendors for In-Database Analytics were Aster Data (Aster nCluster), Greenplum (Greenplum Database), IBM (InfoSphere Warehouse; IBM DB2), Microsoft (SQL Server 2008), Netezza (Netezza Performance System, PostgreSQL), Oracle (Oracle Database 11g/10g, Oracle Exadata), SenSage (SenSage/columnar), Sybase (Sybase IQ), Teradata, and Vertica Systems (Vertica Analytic Database). These are all well-known companies, with the exception of the Silicon Valley startup SenSage. The products differ markedly in the types of data they can work with, in functionality, in interfaces, in the analytics software used, and in their ability to work in the clouds. Teradata leads in maturity of its solutions, Aster Data in avant-garde features. The list of analytical software vendors is shorter: products from KXEN, SAS, SPSS, and TIBCO can work in local configurations, while those from Amazon, Cascading, Google, Yahoo!, and Cloudera work in the clouds.

2010 was a pivotal year for predictive analytics, comparable to 2007, when IBM acquired Cognos, SAP acquired Business Objects, and Oracle acquired Hyperion. It began with EMC acquiring Greenplum, followed by IBM acquiring Netezza, HP acquiring Vertica, Teradata acquiring Aster Data, and SAP acquiring Sybase.

…and new opportunities

The analytical paradigm opens up fundamentally new possibilities, as two engineers from Cologne successfully proved by founding the company ParStream (officially empulse GmbH). They managed to build an analytical platform, based on both general-purpose and graphics processors, that is competitive with its predecessors. Four years ago, Michael Hümmepl and Jörg Bienert, formerly of Accenture, received an order from a German travel company that needed a tour-generation system able to select, within 100 milliseconds, a record with 20 parameters from a database of 6 billion records. None of the existing solutions could cope with such a task, although similar problems arise everywhere that rapid analysis of the contents of very large databases is required. ParStream was born from the idea of applying HPC technologies to Big Data Analytics. Hümmepl and Bienert began by writing their own database engine, designed to run on an x86 cluster and supporting data operations as parallel streams - hence the name ParStream. As an initial constraint they chose to work only with structured data, which opens up the possibility of relatively simple parallelization. By design, this database is closer to Google's newer Dremel project than to MapReduce or Hadoop, which are not suited to real-time queries. Starting on the x86/Linux platform, Hümmepl and Bienert soon found that their database could also exploit NVIDIA Fermi GPUs.

Big Data and Data Processing

To understand what to expect from what is called Big Data, one should step beyond the narrow "IT" worldview of today and try to see what is happening in a broader historical and technological perspective - for example, by looking for analogies with technologies that have a longer history. Having called the subject of our activity a technology, we should treat it as a technology. Practically all known material technologies come down to refining, processing, or assembling raw materials or components specific to them in order to obtain qualitatively new products: something enters the technological process, and something leaves it.

The peculiarity of intangible information technologies is that the technological chain here is less obvious: it is not immediately clear what the raw material is, what the product is, what is input and what is output. The easiest answer is that raw data is the input and useful information the output. On the whole this is nearly true, but the relationship between these two entities is extremely complex; if we stay at the level of sound pragmatism, we can confine ourselves to the following considerations. Data is raw facts, expressed in various forms, which in themselves carry no useful meaning until they are placed in context and properly organized and ordered in the course of processing. Information emerges when a person analyzes the processed data; this analysis gives the data meaning and consumer value. Data is unorganized facts that have to be turned into information. Until recently, the notion of data processing was limited to a fairly narrow range of algorithmic, logical, or statistical operations on relatively small amounts of data. However, as computer technologies move closer to the real world, the need to turn data about the real world into information about the real world grows, as do the volume of data being processed and the demands on processing speed.

Logically, information technologies differ little from material technologies: the input is raw data, the output is data structured in a form more convenient for human perception and for extracting information from it, which the power of intellect then turns into useful knowledge. Computers were called computers for their ability to compute; recall the first application of ENIAC - processing gun-firing data and turning it into artillery tables. The machine processed raw data, extracted useful information, and recorded it in a form suitable for use. Before us is nothing other than an ordinary technological process. Generally speaking, instead of the customary term Information Technology, the more accurate Data Processing ought to be used more often.

Information technologies should obey the same general laws by which all other technologies develop - above all, growth in the volume of raw material processed and improvement in the quality of processing. This happens everywhere, regardless of what the raw material and the product are: in metallurgy, petrochemistry, biotechnology, semiconductor technology, and so on. It is also common to all of them that no technological area develops monotonically; sooner or later there come moments of accelerated development, leaps. Such rapid transitions occur when a need arises outside and the technology has the ability to satisfy it: computers could not be built on vacuum tubes forever - and semiconductors appeared; cars needed a lot of gasoline - and the cracking process was discovered; there are many such examples. Thus, the name Big Data conceals an emerging qualitative transition in computer technology that may lead to serious changes - it is no coincidence that it is called a new industrial revolution. Big Data is another technological revolution, with all the ensuing consequences.

The first experience of data processing dates back to the fourth millennium BC, when pictographic writing appeared. Since then several main lines of work with data have developed. The most powerful has been and remains text, from the first clay tablets to SSDs, from the libraries of the middle of the first millennium BC to modern libraries. Then came various kinds of numerical mathematical methods, from papyri with proofs of the Pythagorean theorem and tabular techniques for simplifying calculations to modern computers. As society developed, tabular data of various kinds began to accumulate, the automation of which started with tabulators, and in the 19th and 20th centuries many new methods of creating and accumulating data were proposed. The need to work with large volumes of data had been understood for a long time, but the means were lacking - hence utopian projects such as Paul Otlet's Librarium or a fantastic weather forecasting system employing 60 thousand human calculators.

Today the computer has become a universal tool for working with data, although it was conceived only for automating calculations. The idea of using the computer for data processing arose at IBM ten years after the invention of digital programmable computers; before that, punched-card equipment of the Unit Record type, invented by Herman Hollerith, was used for data processing. It was called Unit Record - a single record - because each card carried the entire record relating to a single object. The first computers could not work with large volumes of data; only with the advent of tape and disk drives were they able to compete with the punched-card machine stations that existed until the end of the 1960s. Incidentally, the legacy of Unit Record can clearly be traced in relational databases.

Simplicity is the key to success

The growth in the volume of source data, along with the need to analyze it in real time, calls for the creation and adoption of tools that can effectively solve the so-called Big Data Analytics problem. Information Builders technologies allow working with data from any source in real time, thanks to a large set of adapters and an Enterprise Service Bus architecture. The WebFOCUS tool analyzes data on the fly and visualizes the results in the way best suited to the user.

Based on RSTAT technology, Information Builders has created a predictive analytics product that allows for scenario forecasting: “What will happen if” and “What is needed for”.

Business intelligence technologies have reached Russia as well, but only a few Russian companies use predictive analysis; the reasons are the low culture of business intelligence use at domestic enterprises and the difficulty business users have in understanding the available analysis methods. With this in mind, Information Builders now offers products that Gartner analysts rate as the easiest to use.

Mikhail Stroev ([email protected]), Director of Business Development in Russia and the CIS, InfoBuild CIS (Moscow).

Data is everywhere

As computers gradually evolved from computing devices into general-purpose machines for data processing, new terms began to appear after about 1970: data products (data product), tools for working with data (data tool), applications built on data (data application), data science, data scientists, and even data journalists, who convey the information contained in data to the general public.

Today, applications of the data application class are widespread; they do not merely perform operations on data but extract additional value from it and create products in the form of data. Among the first applications of this type was the CDDB database of audio CDs, which, unlike traditional databases, was created by extracting data from the discs and combining it with metadata (disc titles, track names, and so on). This database underlies Apple's iTunes service. One of the factors in Google's commercial success was also an awareness of the role of the data application: owning the data allows the company to "know" a great deal by using data that lies outside the page being searched for (the PageRank algorithm). At Google, the spelling-correction problem is solved quite simply: a database of errors and corrections is built, and a correction is offered to the user, who can accept or reject it. A similar approach is used for speech-input recognition - it rests on accumulated audio data.
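
The sketch below is only a schematic rendering of that idea, not Google's actual implementation: accumulated pairs of observed mistypings and the queries users eventually settled on play the role of the "database of errors and corrections," and a lookup proposes a candidate that the user may accept or reject.

```python
# Schematic correction-by-accumulated-data sketch (illustrative only):
# count (observed mistyping, query the user finally settled on) pairs and
# suggest the most frequent correction for a given input.
from collections import Counter, defaultdict

corrections = defaultdict(Counter)

def observe(typed, accepted):
    """Record that users who typed `typed` ended up searching for `accepted`."""
    corrections[typed][accepted] += 1

def suggest(typed):
    """Suggest the most frequently accepted correction, if any."""
    if typed in corrections:
        candidate, _ = corrections[typed].most_common(1)[0]
        return candidate
    return None

if __name__ == "__main__":
    observe("recieve", "receive")
    observe("recieve", "receive")
    observe("recieve", "recipe")
    print(suggest("recieve"))   # receive
    print(suggest("weather"))   # None - no accumulated evidence
```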

In 2009, during the swine flu outbreak, analysis of search engine queries made it possible to trace the spread of the epidemic. Many companies (Facebook, LinkedIn, Amazon, and others) have followed Google's path, not only providing services but also using the accumulated data for other purposes. The ability to process data of this kind has given impetus to the emergence of yet another discipline - citizen science. Results obtained through comprehensive analysis of data about the population allow much deeper knowledge about people and better-informed administrative and commercial decisions. The combination of data and the tools for working with it is now called infoware.

Big Data Machine

Data warehouses, online stores, billing systems, and any other platforms that can be classed as Big Data projects usually have their own specifics, and when designing them the main concerns are integration with production data and support for the processes of data accumulation, organization, and analytics.

To support the entire Big Data processing chain, Oracle offers the integrated Oracle Big Data Appliance: optimized hardware consisting of 18 Sun X4270 M2 servers with a full software stack. The interconnect is based on 40 Gbit/s InfiniBand and 10 Gigabit Ethernet. Oracle Big Data Appliance combines both open-source software and Oracle's proprietary software.

Key-value stores, or NoSQL DBMSs, are today recognized as the principal databases for the Big Data world and are optimized for fast data accumulation and access. In the Oracle Big Data Appliance, this role is played by a DBMS built on Oracle Berkeley DB, which keeps information about the topology of the storage system, distributes data, and knows where data can be placed with the least time cost.
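
A purely illustrative sketch of how a key-value store can decide where a record lives (not how Oracle's product actually does it; node names and keys are invented): the key is hashed to one of the storage nodes, so both writes and reads go straight to the right node without consulting a central index.

```python
# Illustrative key placement in a key-value store: hash the key to pick a node.
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def node_for(key):
    """Deterministically map a key to one of the storage nodes."""
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

store = {name: {} for name in NODES}

def put(key, value):
    store[node_for(key)][key] = value

def get(key):
    return store[node_for(key)].get(key)

if __name__ == "__main__":
    put("user:42", {"name": "Ada"})
    print(node_for("user:42"), get("user:42"))
```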

Oracle Loader for Hadoop allows you to use MapReduce technology to create optimized datasets for loading and analysis in Oracle 11g. The data is generated in the "native" format of the Oracle DBMS, which minimizes the use of system resources. Processing of the formatted data is performed on the cluster, and then the data can be accessed from the workstations of traditional RDBMS users using standard SQL commands or business intelligence tools. The integration of Hadoop data and Oracle DBMS is carried out using the Oracle Data Integrator solution.

Oracle Big Data Appliance comes with an open distribution of Apache Hadoop including HDFS and other components, an open distribution of the R statistical package for raw data analysis, and Oracle Enterprise Linux 5.6. Enterprises already using Hadoop can integrate data hosted on HDFS into Oracle DBMS using the external table functionality, and there is no need to immediately load the data into the DBMS - external data can be used in conjunction with internal Oracle database data using SQL commands.

Connectivity between Oracle Big Data Appliance and Oracle Exadata via InfiniBand provides high-speed data transfer for batch processing or SQL queries. Oracle Exadata delivers the performance needed for both data warehousing and online transaction processing applications.

The new Oracle Exalytics product can be used to solve business intelligence problems and is optimized for using Oracle Business Intelligence Enterprise Edition with in-memory processing.

Vladimir Demkin ([email protected]), Leading Consultant for Oracle Exadata at Oracle CIS (Moscow).

Science and specialists

The author of the report "What is Data Science?", published in the O'Reilly Radar series, Mike Loukides, wrote: "The future belongs to companies and people who can turn data into products." This statement involuntarily recalls Rothschild's famous words "He who owns the information owns the world," uttered when he learned of Napoleon's defeat at Waterloo before anyone else and pulled off a coup with securities. Today the aphorism should be rephrased: "The world belongs to whoever owns the data and the technologies for analyzing it." Karl Marx, who lived a little later, showed that the Industrial Revolution divided people into two groups: those who own the means of production and those who work for them. Broadly speaking, something similar is happening now, except that the subject of ownership and the division of functions is no longer the means of producing material goods but the means of producing data and information. And here problems arise: owning data turns out to be much harder than owning tangible assets - data is easily replicated, and the likelihood of its theft is far higher than that of material objects. Moreover, there are legal methods of intelligence: with a sufficient volume of data and suitable analytical methods, one can "compute" what is being hidden. That is why so much attention is now paid to Big Data Analytics (see sidebar) and to protection against it.

Various kinds of activity with data, above all the knowledge of methods for extracting information, are called data science - a term which, at least in Russian translation, is somewhat disorienting, since it refers not to a new academic science but to an interdisciplinary set of knowledge and skills needed to extract knowledge. The composition of this set depends largely on the field, but more or less general qualification requirements can be formulated for the specialists called data scientists. This was best done by Drew Conway, who in the past analyzed data on terrorist threats in one of the US intelligence agencies. The main theses of his dissertation were published in the quarterly IQT Quarterly, issued by In-Q-Tel, the organization that acts as an intermediary between the US CIA and scientific organizations.

Conway depicted his model as a Venn diagram (see figure) representing the three areas of knowledge and skills one must master to become a data scientist. Hacking skills here do not mean malicious actions but rather command of specific tools combined with a particular analytical cast of mind - something like Hercule Poirot's, or perhaps the deductive method of Sherlock Holmes. Unlike the great detectives, however, one must also be an expert in a number of mathematical areas and understand the subject domain. Machine learning arises at the intersection of the first two areas; at the intersection of the second and third lie traditional research methods. The third intersection zone is dangerous because of its speculativeness: without mathematical methods there can be no objective vision. Data science lies at the intersection of all three zones.

Conway's diagram gives a simplified picture: first, more than just machine learning lies at the intersection of the hacking and mathematics circles; second, the last circle is much larger and today includes many disciplines and technologies. Machine learning is only one area of artificial intelligence, concerned with building algorithms that can learn; it divides into two sub-areas: inductive (case-based) learning, which uncovers hidden patterns in data, and deductive learning, aimed at formalizing expert knowledge. Machine learning is also divided into supervised learning, in which classification is learned from pre-prepared training data sets, and unsupervised learning, in which internal patterns are sought by cluster analysis.
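
To complement the supervised sketch given earlier: in unsupervised learning no labels are provided, and structure has to be found in the data itself. The minimal k-means-style sketch below (plain Python, purely illustrative values) groups one-dimensional points into two clusters by repeatedly assigning each point to the nearest center and recomputing the centers.

```python
# Minimal unsupervised learning sketch: 1-D k-means with k=2 (illustrative only).
def kmeans_1d(points, iterations=10):
    centers = [min(points), max(points)]          # crude initialization
    for _ in range(iterations):
        clusters = [[], []]
        for p in points:
            # assign each point to the nearest center
            idx = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            clusters[idx].append(p)
        # recompute centers as cluster means
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

if __name__ == "__main__":
    data = [1.0, 1.2, 0.8, 9.7, 10.1, 10.3, 1.1, 9.9]   # no labels attached
    centers, clusters = kmeans_1d(data)
    print("centers:", centers)
    print("clusters:", clusters)
```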

So, Big Data is not speculative musing but a symbol of the approaching technological revolution. The need for analytical work with big data will noticeably change the face of the IT industry and stimulate the emergence of new software and hardware platforms. Already today, the most advanced methods are used to analyze large volumes of data: artificial neural networks (models built on the principles of the organization and functioning of biological neural networks); methods of predictive analytics and statistics; and Natural Language Processing (a branch of artificial intelligence and mathematical linguistics that studies the computer analysis and synthesis of natural languages). Methods involving human experts are also used: crowdsourcing, A/B testing, sentiment analysis, and so on. To visualize the results, both familiar techniques, such as tag clouds, and entirely new ones, such as Clustergram, History Flow, and Spatial Information Flow, are employed.
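
Of the methods just listed, A/B testing is the simplest to show in a few lines. The sketch below (plain Python, with made-up visitor and conversion counts) compares the conversion rates of two page variants with a standard two-proportion z-test; the numbers are purely illustrative.

```python
# A/B testing sketch (illustrative counts): two-proportion z-test.
import math

def ab_test(conv_a, n_a, conv_b, n_b):
    """Return conversion rates, z statistic, and two-sided p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_a, p_b, z, p_value

if __name__ == "__main__":
    # Variant A: 200 conversions out of 5000 visitors; variant B: 260 out of 5000.
    p_a, p_b, z, p = ab_test(200, 5000, 260, 5000)
    print(f"A: {p_a:.3f}  B: {p_b:.3f}  z = {z:.2f}  p-value = {p:.4f}")
```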

On the technology side, Big Data is supported by distributed file systems and data stores such as Google File System, Cassandra, HBase, Lustre, and ZFS, by the MapReduce and Hadoop software constructs, and by many other solutions. According to experts, such as the McKinsey Institute, the spheres that will be transformed most under the influence of Big Data are manufacturing, healthcare, trade, public administration, and the monitoring of individual movements.


