Organization of search in information systems. Classification of Internet information resources

Kocheganova Polina

Methods for finding educational information on the Internet

The most important condition and leading factor that determines the success of educational activities using computer technology is the readiness of students for productive activities in a didactic computer environment.

Mastering effective methods and means of searching, processing and using educational information makes it possible not only to intensify educational processes, but also to develop the cognitive interests of students, the desire for productive, creative activity.

Thanks to the ubiquitous development and application of computer technologies, information in every area of human activity (science, production, commerce, literature, entertainment, etc.) now exists in one form or another in electronic form. The Internet is compatible with various electronic networks and databases and provides convenient access to almost any kind of information.

The information resources available via the Internet are enormous: tens of millions of documents presented in various ways, and their number is constantly growing. The methods of access differ depending on the method of presentation and the type and nature of the information, so before turning to search methods we will first classify information resources.

According to the principle of organization and use, the search tools can be divided into catalogs (reference books, directories) and search engines.

    Catalogs

Catalogs (directories) are reference lists of Internet addresses grouped according to certain criteria. As a rule, they are grouped by topic (science, art, news, etc.), with each topic branching into several sublevels.

The peculiarity of these means of information retrieval is that the creation of a structure, a database and their constant updating is carried out "manually" by a team of editors and programmers, and the search process itself requires the direct participation of the user, independently moving from link to link.

    Search engines

Search engines work by constantly and sequentially examining every Internet site available to them, together with all its links and branches. To keep its information current, a search engine regularly returns after a certain period (about a month) to the nodes it has already examined in order to detect and register changes. All the collected information is indexed, that is, a specialized database is created in which every Internet page examined by the system is encoded.

Upon receipt of a request from the user, the search engine examines all the indexed information and produces a list of documents corresponding to the search task. The found documents are ranked based on the location of the keywords (in the heading, at the beginning of the text, in the first paragraphs) and the frequency of their occurrence in the text.
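The ranking rule just described (keyword location plus frequency) can be sketched as a toy scoring function. The weights, the formula and the sample documents below are illustrative assumptions, not the algorithm of any real search engine.

```python
def score(document: str, keywords: list[str]) -> float:
    """Toy relevance score: reward keyword frequency, with a boost
    when a keyword appears early (heading, first paragraphs)."""
    words = document.lower().split()
    total = 0.0
    for kw in keywords:
        kw = kw.lower()
        count = words.count(kw)
        if count == 0:
            continue
        first_pos = words.index(kw)
        # Frequency plus a positional boost: the earlier the first
        # occurrence, the larger the bonus (illustrative weights).
        total += count + 2.0 / (1 + first_pos)
    return total

# Hypothetical mini-collection of indexed documents.
docs = {
    "a": "internet search engines index web pages for fast search",
    "b": "cooking recipes and kitchen tips",
}
ranked = sorted(docs, key=lambda d: score(docs[d], ["search", "internet"]),
                reverse=True)
```

Document "a" outranks "b" because the query words occur in it, and occur near the beginning of the text.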

Despite the similar principle of operation, search engines differ in terms of query languages, search zones, search depth within a document, ranking methods and priorities, so the use of different search engines gives different results.

A more or less serious approach to any problem begins with an analysis of the possible methods for solving it. Information search on the Internet can be performed by several methods, which differ significantly both in search efficiency and quality and in the type of information retrieved. In some cases very laborious methods must be used, but the result is worth it.

The following main methods of searching for information on the Internet can be distinguished, which, depending on the goals and objectives of the seeker, are used individually or in combination with each other:

    Direct search using hypertext links

Since all sites in the WWW are actually linked to each other, information can be retrieved by sequentially viewing linked pages using a browser.

Although this completely manual search method looks like a complete anachronism on a Web of more than 60 million sites, "manual" Web browsing is often the only option in the final stages of information retrieval, when mechanical "digging" gives way to deeper analysis. The use of catalogs, classified and thematic lists and all kinds of small reference books also applies to this type of search.

    Use of search engines

Today this method is one of the main ones and, in practice, the only one for conducting a preliminary search, which may yield a list of Network resources to be examined in detail.

As a rule, using search engines means supplying keywords as search arguments: what to look for. Forming the list of keywords properly requires preliminary work on preparing a thesaurus.

    Search using special tools

This fully automated method can be very effective for conducting initial searches.

Spider is a key tool for searching the Web. As previously stated, a spider is a program that obtains some or all of the resources from a large number of sites, mainly for the purpose of creating inverted indexes that will later be used by search applications. Like other Web clients, the spider makes HTTP requests to access Web site resources and parses the responses. The main differences between a spider and a browser are the much larger number of sites that are accessed and sent requests, the lack of any display of responses, and the rather unusual use of responses.

In practice, however, only a fraction of the resources can be requested from sites. Many spiders, for example, do not request images or multimedia resources. This is done if the spider is used to build an index of text resources only.
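The link-gathering step of such a spider can be sketched with Python's standard html.parser. This is a minimal illustration of a text-only spider: the page fragment and the example.org URL are made up, and a real spider would additionally fetch pages over HTTP and honor robots.txt.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# A text-only spider skips images and multimedia resources.
SKIP_EXTENSIONS = (".jpg", ".png", ".gif", ".mp3", ".avi")

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags, resolving them
    against the URL of the page being parsed."""
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    url = urljoin(self.base_url, value)
                    if not url.lower().endswith(SKIP_EXTENSIONS):
                        self.links.append(url)

# Made-up page content; a real spider would get this from an HTTP response.
page = '<a href="/docs/a.html">A</a> <a href="pic.jpg">img</a>'
extractor = LinkExtractor("http://example.org/index.html")
extractor.feed(page)
```

The image link is dropped and the relative link is resolved to an absolute URL, ready to be queued for the next crawl step.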

    Analysis of new resources

Search for newly formed resources may be necessary when conducting repeated search cycles, searching for the most recent information, or for analyzing trends in the development of the research object in dynamics.

Another possible reason may be that most search engines update their indexes with a significant delay due to huge amounts of processed data, and this delay is usually the longer, the less popular the topic of interest. This consideration can be very important when conducting a search in a highly specialized subject area. This can include, for example, working with social networks, video content.

Really useful methods for finding educational information on the Internet:

    Drawing up a thesaurus

Effective use of search engines requires a list of keywords organized with regard to the semantic relations between them, i.e. a thesaurus. When compiling a thesaurus, it is necessary to provide for the handling of synonyms, homonyms and morphological variants of the keywords. The name of the topic itself need not be entered.
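A minimal sketch of such a thesaurus and of query expansion with it; the word groups below are invented for illustration, not taken from any real thesaurus.

```python
# A thesaurus maps each concept to its synonyms and morphological
# variants (illustrative entries, not a real thesaurus).
thesaurus = {
    "search": ["search", "searching", "retrieval", "lookup"],
    "education": ["education", "educational", "learning", "teaching"],
}

def expand_query(terms: list[str]) -> set[str]:
    """Replace each query term with its full group of variants,
    so synonyms and word forms are not missed by the engine."""
    expanded = set()
    for term in terms:
        # Terms absent from the thesaurus pass through unchanged.
        expanded.update(thesaurus.get(term, [term]))
    return expanded

query = expand_query(["search", "education"])
```

The expanded query now also matches documents that use "retrieval" or "learning" instead of the original wording.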

    Review the first 2-4 pages of search results

    Selection of search engines

Search engines should be used in order of decreasing expected search efficiency for each engine.

In total, about 180 search servers are known, differing by regions of coverage, search principles (and, consequently, by the input language and the nature of the perceived queries), the size of the index base, the speed of information updating, the ability to search for "non-standard" information, and the like. The main criteria for choosing search servers are the volume of the server's index base and the degree of development of the search engine itself, that is, the level of complexity of the queries it perceives.

    Use English-language resources, even if you don't know the language. Today, technical machine translation is no longer just a collection of words as it used to be. Good, useful foreign sites are translated more than adequately.

    Use specialized sources to find educational information: e-libraries, dissertation banks, cyber marketplaces, archive sites, etc.

In completing this work, one can conclude that a very large amount of educational information on various topics is stored on the Internet in the form of articles in electronic newspapers, reports, reference books, graphic images, audio and video files and much more. While surfing the Internet you can find almost any information; in other words, if any data was ever entered into a computer, then it can most likely be found somewhere in the vast expanses of the Internet.

There is no information that cannot be found on the Internet, you just need to know where and how to look.

Bibliography

    Garmashov M. Yu., Korotkov A. M. Preparing Students for Productive Activities in a Didactic Computer Environment. Volgograd, 2001.

    Norenkov I. P. Knowledge Management in the Information and Educational Environment. Moscow: MESI, 2000.

    Putilov G. P. The Concept of Building an Information and Educational Environment for a Technical University. Moscow: MGIEM, 1999.

    Information Search Tools on the Internet // Afanasy-Exchange (Tver). March 28, 1997.

    Uskov V. L. Distance Engineering Education on the Basis of the Internet // Library of the journal "Information Technologies", 2000, No. 3.

    Demin I. S. Search for Scientific and Educational Information on the Internet // Vestnik TSU. 2008. No. 9.

Searching for information on the global Internet: general information

According to the principle of organization and use, search tools can be divided into:

    Catalogs. Directories are reference lists of Internet addresses grouped according to certain criteria. As a rule, they are grouped by topic (science, art, news, etc.), with each topic branching into several sublevels. Some search directories:

Name                            Address
Au! ("Hey!")                    www.au.ru
Atrus (registration required)   www.atrus.ru
List.ru                         www.list.ru
Constellation                   www.stars.ru
Snail                           www.ulitka.ru
Ivan Susanin                    www.susanin.ru

    Search engines. For a detailed search of documents, specialized search engines are used. Upon receipt of a request from the user, the search engine produces a list of documents matching the search task. The found documents are ranked based on the location of the keywords (in the heading, at the beginning of the text, in the first paragraphs) and the frequency of their occurrence in the text. Using different search engines gives different results. The most common search engines are:

Name        Address
Yandex      www.yandex.ru
Aport       www.aport.ru
Rambler     www.rambler.ru
Google      www.google.ru
Mail.ru     www.mail.ru
Yahoo!      www.yahoo.com
AltaVista   www.altavista.com

A search query can consist of one or more words and may contain various punctuation marks. As for case, in general the case of search words and operators does not matter: the words "abstract", "Abstract" and "ABSTRACT" are all treated the same way. This fully applies to the Latin alphabet as well: "Yes", "yES", "yeS", "yes" and "YES" are all identical for search purposes.

Practical work "Information search in the global Internet"



Hiding the aroma in the buds,

Lilacs are blooming.

May is blooming, which means

Today is a holiday - May day!

    Save the poem:


  1. Search for holiday pictures:


  2. Look through the search results on page 1. To go to page 2, scroll the mouse wheel to the bottom of the browser window and left-click the link to page 2.

    Select the picture you like and left-click on it.

In the new window you will see the same picture at a larger size. To the right of it there will be information about the size of the picture and the sites on which it is located.

    Copy the picture:

    1. right-click on the picture;

       select the Copy picture command;

       close the browser window by clicking the Close button.

    Insert the picture into the document:

    1. go to the text editor window (it should contain the congratulatory poem);

       place the cursor with a left click after the last character of the poem (the final "!") and press the Enter key to move the cursor to a new line;

       right-click;

       in the context menu, select the Paste command.

    Save the document in your personal folder under the name Congratulations to *** from ***. Instead of the first ***, type the name of the person to whom the congratulation will be sent; instead of the second ***, type your own name. For example, Congratulations to Anastasia from Olga. Close the text editor program.

    Launch the Google Chrome browser.

    Go to your mailbox on the mail.ru portal.

    In the main mail menu (at the top of the window), select the Write a message command.

    Fill in the required fields:


  1. Click the Send button (it is located both at the top and at the bottom of the browser window).

    Close the browser window.

    Turn off your computer.

Exercise 1

Task: find the name of the largest freshwater lake in the world.


For optimal and fast work with search engines, there are certain rules for writing queries. A detailed list for a specific search server can, as a rule, be found on the server itself under the links Help, Hint, Rules for making a request, etc.

    Organize your search and fill in the table with the search results (number of pages found by each engine):

Question                                              yandex.ru   rambler.ru   google.ru   mail.ru   aport.ru
How to find a person on the Internet by photograph?
How to register on the VKontakte website?
How to remove red eye?

    Close the browser (exit the program).

Exercise 2

Task: find the biography of the Minister of Education of the Russian Federation A. A. Fursenko using the search engine google.ru.

Exercise 3

Searching for literary works on the Internet




Attention! To view books in FB2 format you need a special reader program, for example AlReader.

Course work

On the topic: "Organization of storage and retrieval of information on the Internet"


Introduction

The Internet as a medium of information in Russia cannot yet compete with traditional media, but it has great prospects in this respect and will be able to continue to act on a par with other information resources in the future.

Currently, over 500 million people use the Internet more or less regularly, and within two years their number, according to experts, will exceed 1 billion, in other words, more than 16% of the world's population. Naturally, such a colossal audience could not remain unclaimed: the Internet has long since turned into a huge information platform.

All over the world, and now in our country, a working website is becoming a sign of a company's stable, professional operation. The Internet has long since become not only a means of communication but also a field for serious commercial activity. Almost every foreign company has its own representative office on the Internet, a virtual office. The total turnover of companies trading on the Internet reaches billions of dollars. In Russia, too, a growing number of companies use the Internet to promote their products and services. This is easy to verify by looking through advertising publications: email and Web site addresses appear alongside the usual telephone and fax numbers more and more often. Soon the lack of an Internet address will be as inconvenient as the lack of a fax machine, and those who take their place now will benefit significantly in the future.

The Internet also offers efficiency and relevance. Traditional mass media, for all their familiarity, can no longer provide the level of timeliness a modern person requires, so more and more people turn to the Internet for the latest information: services and prices, weather, exchange rates, or simply news. Information on a Web site can be changed several times a day, whereas advertisements in print media must be ordered at least a week in advance, if not more. On the Internet everything is immediate: a new product or service, a new discount or a new supplier, and customers will find out about it tomorrow, with no need to wait until the next print advertisement is released. The information on a site is always current and fresh, and this is exactly what attracts millions of users to the Internet.


1. Data storage on the network Internet

1.1 Hypertext documents, types of files

A hypertext document is a document containing so-called links to other documents. Such documents are delivered over the HyperText Transfer Protocol (HTTP).

Information in Web documents can be found by keywords. Each Web document contains hyperlinks, and by following them millions of Internet users can search for information around the world.

Hypertext documents are based on HTML (HyperText Markup Language). This language is very simple; its control codes, which the browser interprets in order to display the document on the screen, consist of ASCII text. Links, lists, headings, pictures and forms are HTML elements; links let you open another document with a mouse click.

There are two ways to create hypertext documents. You can use one of the WYSIWYG HTML editors (for example, Netscape Composer, the basics of which are discussed in the section "Word Processing on a Computer", Microsoft FrontPage, HotDog, etc.), which do not require special knowledge about the internal structure of the created document. This method allows you to create documents for the WWW without knowing the HTML language. HTML editors automate the creation of hypertext documents, eliminate routine work. However, their capabilities are limited, they greatly increase the size of the resulting file and the result obtained with their help does not always meet the developer's expectations. But, of course, this method is indispensable for beginners in the preparation of hypertext documents.

An alternative is to create and mark up the document using an ordinary plain-text editor (such as emacs or NotePad), inserting the HTML commands into the text manually. By creating documents this way, you know exactly what you are doing.

As noted, an HTML document contains symbolic information. One part of it is the text itself, i.e. the data that make up the content of the document. The other consists of tags (markup tags, also called markup flags): special constructs of the HTML language used to mark up the document and control its display. It is the HTML tags that determine the form in which the text will be presented, which of its components will act as hypertext links, and which graphic or multimedia objects should be included in the document. Graphic and sound information included in an HTML document is stored in separate files. HTML document viewers (browsers) interpret the markup flags and arrange the text and graphics on the screen accordingly. Files containing HTML documents take the .htm or .html extension.

Uppercase and lowercase letters are not distinguished when writing tags. In most cases, tags are used in pairs: a start tag and an end tag. Opening tag syntax:

<tag_name [attributes]>

Brackets in syntax descriptions indicate that the element may be omitted. The name of the closing tag differs from the name of the opening tag only in that it is preceded by a forward slash:

</tag_name>

Tag attributes are written in the following format:

name [= "value"]

Quotation marks around an attribute value are optional and can be omitted. For some attributes, a value may not be specified at all. The closing tag has no attributes.

The action of any paired tag begins where the start tag is encountered and ends at the corresponding end tag. A pair of start and end tags is often called a container, and the part of the text enclosed between the opening and closing tags is called an element.
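The container/element structure can be made visible with Python's standard html.parser module, which fires separate callbacks for opening tags (with their attributes), the enclosed text, and closing tags; the markup fragment and file name here are made up for illustration.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Records opening tags (with attributes), closing tags, and the
    text enclosed between them, mirroring the container/element idea."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))

parser = TagLogger()
# A paired tag: the <a> container encloses the element text "a link".
parser.feed('<a href="page.html">a link</a>')
```

The recorded event sequence (open, text, close) is exactly the container structure described above, with the href attribute attached to the opening tag.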

The sequence of characters that make up the text can consist of spaces, tabs, line feeds, carriage returns, letters, punctuation marks, numbers and special characters (for example, +, #, $, @), with the exception of the following four characters that have special meanings in HTML: < (less than), > (greater than), & (ampersand) and " (double quote). If you need to include any of these characters in the text, you should encode it with a special character sequence.

Non-breaking spaces (the &nbsp; entity) can also be classified as special characters. Using this character is one way to increase the space between words in the text; ordinary spaces cannot be used for this purpose, since the browser interprets a group of consecutive spaces as a single space.
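Python's standard html module implements exactly this encoding, so the rule is easy to check:

```python
import html

# The four characters with special meaning in HTML are replaced
# by entity sequences before being placed in document text.
raw = 'if a < b & b > c then print "ok"'
encoded = html.escape(raw, quote=True)
# encoded == 'if a &lt; b &amp; b &gt; c then print &quot;ok&quot;'

# The non-breaking space is likewise written as an entity (&nbsp;)
# and decodes to the single character U+00A0.
nbsp = html.unescape("&nbsp;")
```

A browser rendering the encoded string displays the original characters, but never mistakes them for markup.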

1.2 Graphic files, their types and features

Nowadays, the use of full-color, high-quality graphics in realistic colors on PC-class computers looks completely commonplace. Not so long ago, though, this was the privilege of publishing systems, which were usually built on Macintosh platforms or Silicon Graphics workstations. At best, PC users made do with graphics at a maximum color depth of 8 bits/pixel (256 colors) at a rather modest resolution of 320x200, or 16 colors at 640x480.

Now, with the development of video adapter architectures and the falling cost of video memory, PC systems that successfully handle realistic (TrueColor) images with a depth of 24 bits/pixel (more than 16 million colors) are quite accessible to the average user.

Technical progress created the need either to port to the PC and adapt various formats for encoding and storing graphic information from other platforms (for example, the Macintosh, where similar developments have been under way for two decades), or to develop native PC-oriented graphic formats that fully take into account the architecture of PC video adapters.

Moreover, in the last five years, with the lightning-fast spread of the Internet and in particular of World Wide Web technologies, a problem of a different kind has arisen: developing image formats that are compact enough to be transmitted over the network with minimal delay and that are hardware independent, since computers of very different architectures are connected to the network.

In this regard, I would like to briefly consider several common graphic formats and describe their capabilities. This information is summarized in the following table:

Format   Max. color depth, bits   Max. number of colors        Max. image size              Compression             Multiple images per file
BMP      24                       16,777,216                   65535x65535                  RLE*                    -
GIF      8                        256                          65535x65535                  LZW                     +
JPEG     24                       16,777,216                   65535x65535                  JPEG                    -
PCX      24                       16,777,216                   65535x65535                  RLE                     -
PNG      48                       281,474,976,710,656                                       Deflate (LZ77)          -
TIFF     24                       16,777,216                   4,294,967,295 pixels total   LZW, RLE and others*    +

It should also be noted that the most compact formats are JPEG, GIF and PNG, which, moreover, are platform independent. The BMP format is the standard Windows format, but it is not widely used because of its exorbitant file sizes, especially when saving graphics with a color depth of 24 bits/pixel. As for TIFF, like JPEG and GIF it is partially platform independent, but it is too large for use on the web and, worse, too difficult to interpret. In addition, any software product, including graphic file viewers, that contains code for encoding or decoding data with the LZW algorithm must be distributed under a license agreement with Unisys Corp., the owner of the algorithm, which further increases the cost of such products.

For further consideration, I would like to turn to the cross-platform formats accepted on the Internet as de facto standards: JPEG, GIF and PNG.

I want to note right away that the PNG (Portable Network Graphic) format will not be given much attention, although, perhaps, it deserves it. This is a consequence of the fact that this format appeared not so long ago and, despite all its advantages, has not yet received universal recognition.

So, in fact, a person or company that intends to place a large number of images on disk and possibly make them available on the Internet faces a dilemma: whether to choose GIF or JPEG.

The GIF format, developed by CompuServe and originally proposed as a format for exchanging images on the web, offers a fairly high image compression ratio. In addition, GIF has extra features that make it attractive for use on the web.

The first is the ability to change the order in which the image lines are displayed on the screen, filling the gaps between them with temporary information. Visually, as the file downloads from the network (often at a painfully low speed), the image first appears on the screen in low quality and is then refined as additional information arrives and the missing lines are restored. The user can thus get an idea of the content of the image before the download finishes, and interrupt the download of an unneeded large file.

The second feature is the ability to store more than one image in a single file, which makes elementary frame-by-frame animation possible. Another distinctive feature of GIF is that one of the colors can be declared "transparent": when the image is displayed, the parts painted in this color are not drawn, and the background on which the image is superimposed shows through.

The biggest disadvantage of GIF is that it can store at most 256 colors, which has become less and less acceptable lately. GIF users are also haunted by the same nuisance as with TIFF: GIF likewise uses LZW compression, so software handling it can be distributed only under the corresponding license agreement.

The JPEG format is a TrueColor format, meaning that it can store images with a color depth of 24 bits/pixel. This depth is sufficient for virtually exact reproduction of images of any complexity; a deeper color representation (for example, 32 bits/pixel) is in practice indistinguishable from it when viewed on modern monitors or printed on most available printers, and is useful only in publishing.

JPEG generally achieves a higher compression ratio than GIF (this aspect is described in more detail in the chapter "Practices for using JPEG"), but it cannot store multiple images in one file. Recently a modification of the format has appeared, Progressive JPEG (roughly, "gradual JPEG"), which serves the same purpose as interlaced display of GIF images; this made JPEG even more attractive as a web standard.

However, JPEG has its drawbacks as well. Unlike GIF, which compresses images of almost any content efficiently, JPEG is aimed primarily at realistic, photographic images, and compression quality degrades noticeably on images with sharply defined lines and color boundaries.

Thus, it is still impossible to make a final choice in favor of one or the other format. However, the JPEG format seems to me more interesting from the point of view of the original compression algorithm and great opportunities for development in the future. Also, the JPEG format should be considered unambiguously more flexible: it allows you to choose between good image quality or a good compression ratio and find an acceptable compromise for each specific case. Therefore, all further research is devoted to this particular format.

1.3 Search engines and rules for finding information

The convenience of the Internet is that you can find almost any information in it, even when we do not know exactly where it is. If the address of the page with the material we are interested in is unknown and there is no page with suitable links either, we have to search for materials all over the Internet. To do this, use Internet search engines - special web sites that allow you to find the desired document.

There are two main methods of searching the Internet. In the first case, you look for web pages related to a specific topic; the search is carried out by choosing a thematic category and gradually narrowing it down. Such search engines are called search directories. They are convenient when you need to get acquainted with a new topic or reach the well-known "classic" resources on it. The second method is used when the topic is narrow and specific, or when you need rare, little-known resources. In this case you have to imagine which keywords would appear in a document on the topic of interest. These words must be chosen so that they are likely to occur in the relevant documents and unlikely to occur in documents unrelated to the chosen topic. Systems that allow this kind of search are called search indexes.

Search directories differ from search indexes not only in the search method but also in the way they are formed. Any search engine on the Internet consists of two parts: a specialized web page, accessible to everyone and allowing searches, which relies on a large, constantly updated and expanded database containing information about Internet resources.

How this database is replenished depends on the type of search engine. For a search directory, the most important thing is selection accuracy: every resource found should be useful, so the topic of each page is determined or checked manually. Because of this, search directories are relatively small in volume. When the volume approaches a million pages, the amount of manual labor becomes so great that further growth of the catalog stops.

Search indexes, by contrast, aim for breadth of coverage. Automation copes well with determining which words occur on a web page, so a search index can cover many millions of web pages. This also makes searching an index more difficult than searching a directory, because the same keywords can appear on web pages devoted to entirely different topics.
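A toy inverted index makes both points concrete; the page names and texts below are invented. Building the index is trivial to automate, and a single keyword really can point to pages on unrelated topics, which is why queries usually combine several words.

```python
from collections import defaultdict

# Made-up pages: "jaguar" appears in two unrelated topics.
pages = {
    "cars.html": "the jaguar is a fast car",
    "zoo.html": "the jaguar is a big cat",
    "python.html": "python is a programming language",
}

def build_index(pages: dict[str, str]) -> dict[str, set]:
    """Map every word to the set of pages it occurs on."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

index = build_index(pages)
# One keyword hits pages on different topics...
jaguar_pages = index["jaguar"]
# ...so combining keywords (set intersection = an AND query) narrows it down.
jaguar_cars = index["jaguar"] & index["car"]
```

The intersection of the posting sets for "jaguar" and "car" leaves only the automotive page, illustrating why multi-word queries are more precise than single keywords.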

Information retrieval systems are hosted on public servers on the Internet. At their core are so-called search engines, or automatic indexes. Special robot programs (also known as spiders) periodically scan the Internet according to certain algorithms and index the documents they find. The resulting index databases are used by the search engines to give users access to information posted on Web sites. The user formulates a query within the corresponding interface, the system processes it, and the results are displayed in the browser window. Query processing mechanisms are constantly being improved, and modern search engines do not simply sift through a huge number of documents: the search is based on original and highly complex algorithms, and its results are analyzed and sorted so that the information presented to the user matches his expectations as closely as possible.
Currently, in the development of search engines, there is a tendency to combine automatic index search engines and manually compiled catalogs of Internet resources. The resources of these systems successfully complement each other, and it is quite logical to combine their capabilities.

Nevertheless, studies of the capabilities of search engines, even the most powerful of them, such as AltaVista or HotBot, show that the real coverage of the World Wide Web resources by a separate such system does not exceed 30%. Therefore, you should not limit yourself to using any one of them. If you were unable to find the information you are interested in using one system, try using another.

Each search engine has its own characteristics, and the quality of the result depends on the subject of the search and the precision of the query. Therefore, before starting to search for information, you must first clearly understand what exactly you want to find and where. Foreign systems, for example, are striking in the number of indexed documents; for searching in the field of professional knowledge, especially for information in a foreign language, systems such as AltaVista, HotBot or Northern Light are best suited.

However, for searching information in Russian, especially in the Russian part of the Internet, Russian search engines are better suited. First, they are targeted specifically at the Russian-language resources of the Web and, as a rule, offer greater coverage and depth of indexing of these resources. Second, Russian systems take into account the morphology of the Russian language, that is, all forms of the desired words are included in the search. Third, they better handle such a historically established feature of Russian Internet resources as the coexistence of several Cyrillic encodings.

2. Review and characteristics of Internet search engines

2.1 Rambler

To search for Russian-language information on the Internet, it is better to use Russian search engines. In this exercise and the following ones, we will search for information using several systems designed for the Russian-speaking part of the Internet. As you will see, they are not fundamentally different from the world's search engines. Since we have already considered several systems and you know the general principles of searching for information on the Internet, in further exercises we will not dwell on all the details: because these systems communicate with you in Russian, you will be able to study them on your own, using the knowledge gained from the previous exercises.

Let's search using the Rambler system. As you will see, Rambler offers a convenient interface for entering queries and presenting the information found.

You can search the World Wide Web, newsgroups, the system's own catalog, and its product listings. In addition to a simple query, it is possible to work with detailed queries, but we will execute a simple query, just as for the other Russian search engines.

Enter the words "Internet search" in the query input field. We want to find documents that contain both the word "search" and the word "Internet".

Click the "Find!" button. We get a list of found pages.

The list of found pages is conveniently organized: links to the pages that best match the search criteria come first. The documents that satisfy the request most fully are those in which the search words occur frequently and close to one another. In addition, the detected keywords are highlighted in a short fragment of the text of each found document.

In the Rambler system, you can see the words that are most often used in user queries. In addition, Rambler maintains a list of the most popular Russian Internet sites. Since all information in the system is presented in Russian, we hope that you will be able to independently familiarize yourself with the capabilities of this search engine in the future.

2.2 Yandex

The Yandex search engine is located at www.yandex.ru. It was officially launched on September 23, 1997.

What is Yandex? This is how the creators of the system answer this question. Yandex is a full-text information retrieval system (IRS) that takes into account the morphology of the Russian and English languages. The system is designed to search for information in electronic texts of various structures and presentation methods (formats). The name Yandex stands for "language index" or, in English, Yet Another iNDEX. You can also read Yandex as a partial translation of the word Index into Russian (the Russian letter "Я" means "I").

At the heart of the Yandex.Ru search engine is a system kernel common to all products with the Yandex prefix (Yandex.Site, Yandex.Lib, Yandex.Dict, Yandex.CD). The first products of the Yandex series (Yandex.Site, Yandex.Dict) were presented to the general public on October 18, 1996 at the Netcom'96 exhibition. The search engine for the "Russian Internet" was a natural continuation of the Yandex line. As the saying goes, a good question contains half the answer. Finding what you need in the mass of texts on the Internet depends not only on the skill of the search engine but also on the user making the request. Yandex does not require the user to know special search commands: just type the question ("where to find cheap computers" or "we need telephones in Moscow and the Moscow region"), and you will get the result, a list of pages where these words are found. Regardless of the form in which you used a word in the query, the search takes into account all its forms according to the rules of the Russian language. For example, if the query contains "go", the search will find links to documents containing the words "go", "goes", "going", "went", etc.
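The effect of morphological search can be illustrated with a toy example. Real systems such as Yandex use full dictionaries of Russian morphology; here a tiny hand-made table of English word forms stands in for them.

```python
# Toy morphology table mapping word forms to a common lemma. This table is
# an invented illustration, not a real morphological dictionary.
LEMMAS = {"go": "go", "goes": "go", "going": "go", "went": "go", "gone": "go"}

def normalize(word):
    """Reduce a word form to its lemma (unknown words map to themselves)."""
    return LEMMAS.get(word.lower(), word.lower())

def matches(query_word, document_text):
    """True if any form of query_word occurs in the document text."""
    target = normalize(query_word)
    return any(normalize(w) == target for w in document_text.split())

print(matches("go", "she went home"))   # "went" is a form of "go"
```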

Yandex works not only with natural-language queries but also allows you to restrict the search to certain servers or to exclude obviously unnecessary servers from it. You can now search for images by captions and file names; objects such as scripts, applets and styles have also become searchable (by name). Convenient access to the new features is offered on the advanced search page, where the complex query language is reduced to filling in form fields. In addition to the standard sorting of results by relevance (that is, by the degree of compliance with the query), you can sort documents by update date. An interesting feature of the system is the ability to search Yandex from anywhere on the Internet: download and install the program called Yandex.Bar, and a new panel will appear in the browser window, designed for entering a search request (without having to open a Yandex page) and for a number of other functions.

Yandex looks like a typical portal: on its main page you can find links to materials of almost any topic. But this is not its only face. For "serious" users who do not want to waste time downloading information they do not need at the moment, there is another Yandex, whose page impresses with its modest design and loading speed. This version of the search engine lives at www.ya.ru.

2.3 Yahoo

Databases: a managed catalog of Internet resources, plus news, maps, advertising information, sports information, business, phone numbers, personal WWW pages, and email addresses (a separate database).

Search: all Yahoo! pages offer not only a simple search box but also options for that search, as well as Usenet or email searches. The search can be limited to a certain period of time. Boolean operators (AND, OR) and sequential search are also supported. Note: if a search on Yahoo! does not lead to a positive result, the search process automatically switches to AltaVista, which continues the search and, if it succeeds, automatically returns the found information to Yahoo!.
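Boolean operators of this kind reduce to set operations over the lists of documents containing each word. A minimal sketch (the document IDs and texts below are invented for illustration):

```python
# Toy document collection: doc id -> text.
DOCS = {
    1: "internet search engines",
    2: "search the library catalog",
    3: "internet directories and portals",
}

def find(word):
    """Return the set of doc ids whose text contains the word."""
    return {doc_id for doc_id, text in DOCS.items() if word in text.split()}

# AND is the intersection of the posting sets; OR is their union.
both = find("internet") & find("search")    # documents with both words
either = find("internet") | find("search")  # documents with at least one

print(both, either)
```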

If Yahoo! cannot connect quickly enough to AltaVista, Yahoo! will provide a page of links to a set of search tools. After one of these links is selected, the keywords are passed to the search engine of your choice.

A feature that simplifies searching is "tip search" (TS), search with a hint. Yahoo! is a subject directory, which means the system does not hold as many pages as search engines do; however, specifying the most general keywords will lead you to the relevant topic on a top-level page (the first page a user sees when visiting a site) of an organization or company.

Results: links are displayed in the order determined by the specified words and the search sequence, along with descriptive text and their place in the directory hierarchy.

Address: http://www.yahoo.com/

2.4 Altavista

AltaVista (www.altavista.com) is one of the oldest search engines on the Internet; the company introduced its first web index in 1995. The core of the search engine owes its birth to a curious habit of the research lab at Digital Equipment Corp.: for some reason, the employees of this laboratory kept all their electronic correspondence for the previous ten years. So that this heap of information would not simply take up disk space but bring at least some benefit, a program was created to index the documents and find the right words in that mass of aging correspondence. The system turned out to be so successful that it subsequently migrated to the vastness of the World Wide Web.

The AltaVista index contains documents in over 25 languages, and localized versions of the AltaVista website exist in the domains of 20 countries. The search scope can include documents in all supported languages or only in a specific language, and on a dedicated page you can select several languages to search in all of them at once.


Conclusions and proposals

Currently, the Internet uses almost all known communication lines, from low-speed telephone lines to high-speed digital satellite channels. The operating systems used on the Internet are also diverse: most computers on the Internet run Unix or VMS. Special network routers such as NetBlazer or Cisco, whose OS resembles Unix, are also widely represented.

In fact, the Internet consists of many local and global networks belonging to different companies and enterprises, connected by various communication lines. The Internet can be imagined as a mosaic of small networks of different sizes that actively interact with one another, sending files, messages, etc.

An example of the topology of the Internet is the X-Atom network, which consists of several subnets, and at the same time is a fragment of the worldwide Internet.

Today there are more than 130 million computers in the world, and more than 80% of them are united in various information and computer networks, from small local area networks in offices to global networks such as the Internet. The worldwide trend towards connecting computers in networks is driven by a number of important reasons: faster transmission of information, the ability to exchange information quickly between users, sending and receiving messages (faxes, e-mail, etc.) without leaving the workplace, the ability to instantly obtain information from anywhere in the world, and the exchange of information between computers of different manufacturers running different software.

The enormous potential that computer networks carry, the new momentum the information industry is experiencing, and the significant acceleration of production processes leave us no right to ignore these opportunities and fail to apply them in practice.

Therefore, it is necessary to develop a fundamental solution for organizing an information and computer network on the basis of the existing computer park and software, one that meets modern scientific and technical requirements, takes into account growing needs, and allows further gradual development of the network as new technical and software solutions emerge.

The Internet continues to evolve with unrelenting intensity, essentially erasing the restrictions on the distribution and receipt of information in the world. However, in this ocean of information it is not very easy to find the required document. It should also be borne in mind that along with the long-standing servers, new ones appear on the network.

In addition to "general"-purpose servers, there are specialized sites in one area or another, such as for high-energy physics - http://xxx.lanl.gov.

When downloading article files, keep in mind that they are often stored in PostScript format (with the extension .ps or .eps), which is intended for printing on a laser printer; in this case, to view them or print them on a dot-matrix or inkjet printer, you should use a dedicated program such as GhostView.

There is no doubt that using the Internet in scientific work lets you receive the very latest information and keep in touch with colleagues around the world.

There is speculation that the Internet will supplant and replace books. A number of factors currently hinder this. First, reading books from a computer monitor is uncomfortable; although portable e-text readers already exist, their screen resolution is clearly insufficient. Second, copyright for electronic publications is not fully worked out.

In the future, the Internet will significantly replace traditional media due to its flexibility, responsiveness and interactivity.

Today, many people unexpectedly discover the existence of global networks that unite computers all over the world into a single information space called the Internet. It is not easy to define what it is. From a technical point of view, the Internet is an amalgamation of transnational computer networks operating on various protocols, connecting all kinds of computers and physically transmitting data over all available types of lines, from twisted pair and telephone wires to fiber-optic and satellite channels. Most computers on the Internet are connected using TCP/IP. We can say that the Internet is a network of networks that spans the entire globe.



Searching for information is a task that humanity has been solving for many centuries. As the volume of information resources potentially available to a single person grew, increasingly sophisticated search tools and techniques were developed for finding the necessary document.

According to K. Manning's book "Introduction to Information Retrieval", the effective operation of any IRS rests on fast, multidimensional retrieval of the necessary data from a large array. This imposes certain requirements on the organization of search rules, the construction of the user and program interfaces, and the form in which information is provided.

The implementation of these requirements falls to a series of structural components, so-called blocks [Appendix 4].

Based on A.A. Varfolomeev's book "Fundamentals of Information Security", the choice of such a structure for an information retrieval system follows very simple logic: every block of the system must receive data, process it, and pass it on in a defined order, sustaining the logic of the process.

It is impossible to talk about information retrieval systems without mentioning the search engine. According to D.N. Kolisnichenko in the book "Search Engines and Website Promotion on the Internet", a search engine is a system whose database, containing information about information resources, is generated by a robot. This is the distinctive feature of search engines: the database describing Web pages is built by a robot program. When the results arrive, if a document's title and description meet your requirements, you can go straight to its original source by the link; it is more convenient to do this in a new window, so that you can continue to analyze the result list. Many search engines allow searching within the found documents, and the query can be refined by introducing additional terms. If the system is intelligent enough, it may also offer a search for similar documents; however, automatically determining similarity is a very non-trivial task, and this function does not always work correctly. Some search engines allow you to re-sort the results.

It is worth noting that different search engines cover different numbers of information sources on the Internet, so you should not limit yourself to searching in only one of them. There are also search tools that do not build their own index but can use the capabilities of other search engines. These, as N.A. Gaidamakin writes in the book "Automated Information Systems, Databases and Data Banks", are metasearch engines (search services): systems that send a user's query simultaneously to several search engines, then combine the results and present them to the user as a single document with links.

D.N. Kolisnichenko also writes that, to find the necessary information on the network as accurately and quickly as possible, an IRS uses indexing.

A search index is a data structure that contains information about documents and is used in search engines.

Indexing, performed by a search engine, is the process of collecting, sorting and storing data in order to provide fast and accurate information retrieval. Index creation draws on interdisciplinary concepts from linguistics, mathematics and computer science.

Popular search engines focus on full-text indexing of documents written in natural languages. Multimedia documents, such as video, audio and graphics, can also participate in the search.

A.Yu. Kelina writes in the book "Fundamentals of Information Security" that metasearch engines use the indexes of other search services and do not store a local index, while search engines based on cached pages store both the index and the text corpora for a long time. Unlike full-text indexes, partial-text services limit the indexing depth to reduce the size of the index.

Search engine architectures differ in how indexing is organized. Indexes are of the following types [Appendix 5]:

  • Direct index: stores, for each document, the list of words it contains.
  • Inverted index: stores, for each search term, the list of its occurrences (the documents that contain it).
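The difference between the two index types can be shown in a few lines (the document names and texts below are invented for illustration):

```python
# Toy document collection.
DOCS = {"d1": "web search basics", "d2": "web crawling"}

# Direct index: document -> list of the words it contains.
direct = {doc: text.split() for doc, text in DOCS.items()}

# Inverted index: word -> list of documents containing it,
# built by "inverting" the direct index.
inverted = {}
for doc, words in direct.items():
    for word in words:
        inverted.setdefault(word, []).append(doc)

print(direct["d1"])      # words of document d1
print(inverted["web"])   # documents containing "web"
```

The inverted form is what makes query evaluation fast: a word lookup immediately yields its posting list, without scanning every document.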

The index is only the part of the search engine hidden from the user. The second part of this apparatus is the information retrieval language (IPL), about which A.A. Varfolomeev writes in detail in the book "Fundamentals of Information Security". The IPL is a language that allows the user to formulate a request to the system in a simple and visual form. Even if the user is prompted to enter queries in natural language, this does not mean that the system will parse the query semantically. Usually the phrase is broken into words, forbidden and common words are removed from the list, sometimes the vocabulary is normalized, and then all the words are connected by logical AND or OR.
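The query-processing steps just listed (splitting the phrase into words, dropping common stopwords, normalizing case and punctuation) can be sketched as follows; the stopword list here is a tiny stand-in for a real one:

```python
# A toy stopword list; real systems use much larger, language-specific lists.
STOPWORDS = {"the", "a", "to", "where", "in", "and"}

def parse_query(phrase):
    """Split a natural-language phrase into normalized search terms,
    dropping stopwords; the resulting terms are then implicitly
    joined by AND (or OR) when the index is consulted."""
    cleaned = (w.lower().strip("?!.,") for w in phrase.split())
    return [w for w in cleaned if w and w not in STOPWORDS]

print(parse_query("Where to find cheap computers?"))
```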

Variants are also possible, as N.A. Chursin points out in the book "Popular Informatics". In most systems, some phrases are recognized as key phrases and are not split into separate words. Another approach is to calculate the proximity between the request and the document; about a dozen different proximity measures are known by now. It is these proximity scores, percentages of a document's compliance with the request, that are shown as reference information alongside the list of found documents.

According to K. Manning, AltaVista possesses the most advanced query language among modern information retrieval systems on the Internet. In addition to the usual set of AND, OR and NOT, this system also allows the use of NEAR. The latter operator makes contextual search possible. All documents in the system are divided into fields, so the request can specify in which part of the document the user wants to see the keyword (in a link, in the title, etc.).
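A NEAR-style contextual operator can be approximated by comparing word positions. This is not AltaVista's actual implementation, only a sketch of the idea, with an invented example sentence:

```python
def near(text, w1, w2, max_distance=3):
    """True if w1 and w2 occur within max_distance words of each other."""
    words = text.lower().split()
    pos1 = [i for i, w in enumerate(words) if w == w1]
    pos2 = [i for i, w in enumerate(words) if w == w2]
    return any(abs(i - j) <= max_distance for i in pos1 for j in pos2)

doc = "search engines rank documents by relevance to the query"
print(near(doc, "search", "documents"))  # 3 words apart: within range
print(near(doc, "search", "query"))      # 8 words apart: too far
```

Production systems evaluate such operators directly on positional posting lists rather than rescanning the text.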

(For more information on Internet retrieval languages, see the appendix)

From Yu.I. Kudinov's book "Fundamentals of Modern Informatics" you can learn that the most common models for representing documents in an information retrieval system are variations on representing a document as a set of terms. As mentioned earlier, this is not the entire text of the document but only a small set of terms that reflect its content. Based on this view of the document, various information retrieval languages can be considered.

The most common IPL is a traditional language that allows logical expressions to be constructed from a set of terms, using the Boolean operators AND, OR and NOT.

This scheme is quite simple, and therefore is most widely used in modern information retrieval systems. But even 20 years ago, its shortcomings were also well known.

Boolean searches don't scale well. The AND operator can dramatically reduce the number of documents per request. In this case, everything will very much depend on how typical search terms are for the database. The OR operator, on the other hand, can lead to an unreasonably wide query, in which useful information is lost behind information noise. For the successful application of this IPL, one should have a good knowledge of the vocabulary of the system and its thematic focus. As a rule, for a system with such an IPL, special documentary lexical databases with complex dictionaries are created, which are called thesauri and contain information about the relationship of the terms of the dictionary with each other.

K. Manning points out that a weighted boolean search is a modification of Boolean search. The idea behind this search is quite simple. The term is believed to describe the content of the document with some precision, and this precision is expressed in terms of the weight of the term. In this case, both the terms of the document and the terms of the request can be weighed. The request can be formulated in the IPL described above, but the issue of documents will be ranked depending on the degree of proximity between the request and the document. In this case, the proximity measurement is constructed in such a way that a normal Boolean search would be a special case of a weighted Boolean search.
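The idea of weighted search can be illustrated with term weights based on relative frequency. This is a deliberate simplification (real systems use more elaborate weighting such as TF-IDF), and the documents below are invented:

```python
def weights(text):
    """Weight each term of a document by its relative frequency."""
    words = text.lower().split()
    return {w: words.count(w) / len(words) for w in set(words)}

DOCS = {
    "d1": "search search engines index the web",
    "d2": "the web of documents",
}

def score(query, doc_weights):
    """Score a query against a document by summing the weights of its terms."""
    return sum(doc_weights.get(term, 0.0) for term in query.lower().split())

# Rank documents by their score for the query "search web".
ranked = sorted(DOCS, key=lambda d: score("search web", weights(DOCS[d])),
                reverse=True)
print(ranked)   # d1 first: it mentions both "search" and "web"
```

An ordinary Boolean AND is the special case where only documents with a nonzero weight for every query term are kept.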

However, unlike A.A. Varfolomeev, I.S. Ashmanov, in his book "Website Promotion in Search Engines", writes that although IPLs are still imperfect, special attention should be paid to the ranking algorithm (the ordered arrangement of the retrieved links), since it is no less important. The criteria most frequently used for ranking in an IRS are:

the presence of words from the query in the document, their number, their proximity to the beginning of the document, and their proximity to each other;

the presence of words from the query in the headings and subheadings of documents (headings must be specially formatted);

the number of links to the document from other documents, and the "respectability" of the referring documents.

Different search engines use different ranking algorithms, but the basic principles for determining relevance are as follows:

  • The number of query words in the text content of the document (i.e., in the HTML code).
  • The tags in which these words are located.
  • The location of the search words in the document.
  • The proportion of the words for which relevance is determined in the total number of words in the document.

These principles are applied by all search engines.
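A toy scoring function can combine three of the listed criteria: the presence of query words, their proximity to the beginning of the document, and their share of the document's words. The weights below are arbitrary illustrative choices, not any real engine's formula:

```python
def relevance(query_words, text):
    """Toy relevance score: presence of each query word, how early its
    first occurrence is, and its share of the document's words.
    Real engines combine many more signals."""
    words = text.lower().split()
    score = 0.0
    for qw in query_words:
        positions = [i for i, w in enumerate(words) if w == qw]
        if positions:
            score += 1.0                          # the word is present
            score += 1.0 / (1 + positions[0])     # earlier occurrence scores higher
            score += len(positions) / len(words)  # share of the document's words
    return score

a = relevance(["search"], "search engines on the internet")
b = relevance(["search"], "the internet has many engines for search")
print(a > b)   # "search" appears earlier in the first text
```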

The database outputs a similarly ranked list of HTML documents and returns it to the person making the request. Different search engines also choose different ways to display the resulting list: some show only links; others display links with the first few sentences of the document, or the document title along with the link. Ranking is an essential part of search engine information retrieval.

Aspects of this concept are well presented in the book by K. Manning "Introduction to Information Retrieval". Information search involves the use of certain strategies, methods, mechanisms and means. The behavior of the user who manages the search process is determined not only by information needs, but also by the instrumental variety of the system - technologies and means provided by the system.

Search strategy: the general plan (concept, preference, setting) of system or user behavior for expressing and satisfying the user's information need, determined both by the nature of the goal and the type of search, and by systemic "strategic" decisions: the database architecture and the search methods and tools of the specific IRS. In general, choosing a strategy is an optimization problem; in practice, it is largely determined by the art of reaching a compromise between practical needs and the capabilities of the available means.

Search method: a set of models and algorithms implementing the individual technological stages: building the search image of a query, selecting documents (comparing the search images of queries and documents), expanding the query, and localizing and evaluating the output.

Search image of a query: a text written in the IPL that expresses the semantic content of an information request and contains the instructions necessary for the most effective information retrieval.

The process of searching for information is a sequence of steps that leads, through the system, to a certain result and allows its completeness to be assessed. Since the user usually lacks comprehensive knowledge of the information content of the resource being searched, he can judge the adequacy of the query and the completeness of the result only from external assessments or from intermediate results and generalizations, comparing them, for example, with previous ones.

The search process can be represented in the form of the following main components:

  1) formulating a query in natural language, choosing a search engine and services, and formalizing the query in the appropriate IPL;
  2) conducting the search in one or more search engines;
  3) reviewing the results (references);
  4) preliminary processing of the results: viewing the content of links, extracting and storing relevant data;
  5) if necessary, modifying the request and conducting a repeated (refining) search, with subsequent processing of the results.

To reduce the volume of the selected materials, the search results are filtered by the type of sources (sites, portals), topics and other grounds.

According to the search technologies used, IRS can be divided into four categories:

  1. Thematic catalogs;
  2. Specialized catalogs (online directories);
  3. Search engines (full-text search);
  4. Metasearch tools.

Thematic catalogs provide for the processing of documents and their assignment to one of several predetermined categories; this is, in effect, classification-based indexing. Indexing can be done automatically or manually, with specialists browsing popular websites and composing short summary descriptions of documents (keywords, abstract).

Specialized catalogs or reference books are created by industry and topic, by news, by city, by email address, etc.

Search engines (the most advanced Internet search tools) implement full-text search technology: the texts located on the polled servers are indexed, and the index can contain information on several million documents.

When metasearch tools are used, the query is submitted simultaneously to several search engines, and the search results are combined into a common list sorted by relevance. Since each system processes only a part of the network's nodes, this makes it possible to expand the search base.
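Merging results in a metasearch tool amounts to combining per-engine ranked lists into one list re-sorted by relevance. A sketch with invented URLs and scores (real engines do not expose comparable numeric scores, so practical metasearch uses rank-based merging instead):

```python
# Hypothetical result lists from two engines: (url, relevance) pairs.
results_a = [("http://x.example", 0.9), ("http://y.example", 0.5)]
results_b = [("http://y.example", 0.8), ("http://z.example", 0.4)]

def metasearch(*result_lists):
    """Merge results from several engines, keeping the best relevance
    seen for each URL, and return one list sorted by relevance."""
    best = {}
    for results in result_lists:
        for url, rel in results:
            best[url] = max(best.get(url, 0.0), rel)
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

merged = metasearch(results_a, results_b)
print([url for url, _ in merged])
```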

The so-called "organization of search" and "implementation of search" are also very important; D.N. Kolisnichenko writes about them in the book "Search Engines and Website Promotion on the Internet".

Search organization

The procedure for finding the necessary information is divided into nine main stages:

  • Defining the area of knowledge;
  • Choosing the type and sources of data;
  • Collecting the materials needed to fill the information model;
  • Selecting the most useful information;
  • Choosing a method of information processing (classification, clustering, regression analysis, etc.);
  • Choosing an algorithm for finding patterns;
  • Searching for patterns, formal rules and structural links in the collected information;
  • Creative interpretation of the results obtained;
  • Integration of the extracted "knowledge".

To conduct a search, an interface for working with the corresponding database, local or remote, is first loaded on the user's computer. You should first decide on the type of search (simple, advanced, etc.), and then on the set of fields offered for searching. An IRS can offer one or several input fields; in the latter case these are usually fields such as author, title, time period, document type, keywords, headings, etc.

Search implementation

It is common to organize searches by the initial fragment of a word (search with right truncation); for example, instead of the word "library" you can enter the fragment "librar*". In this case, documents will be found that contain not only the word "library" but also "libraries", "librarian", "library science", etc. The user must understand exactly what he wants to find, since the truncated variant retrieves far more documents than specifying the word in full (without truncation). If needed, a refining search can then be carried out within the retrieved array of information to obtain more relevant data.
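Right truncation reduces to prefix matching against the index vocabulary. A minimal illustration (the word list is invented):

```python
# A toy index vocabulary.
WORDS = ["library", "librarian", "libraries", "liberty", "book"]

def truncated_search(pattern, vocabulary):
    """Right-truncation search: 'librar*' matches every word starting
    with 'librar'; without '*', only the exact word matches."""
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        return [w for w in vocabulary if w.startswith(prefix)]
    return [w for w in vocabulary if w == pattern]

print(truncated_search("librar*", WORDS))   # all words with the prefix
print(truncated_search("library", WORDS))   # exact match only
```

This is also why truncated queries retrieve more documents: every matched vocabulary word contributes its own posting list to the result.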

An IRS is also characterized by its search execution time, the interface provided to the user, and the form of the displayed results. When choosing an IRS, attention is paid to parameters such as coverage and depth. Coverage is the volume of the search engine's database, measured by three indicators: the total volume of indexed information, the number of unique servers, and the number of unique documents. Depth refers to whether there is a limit on the number of pages or on the nesting depth of directories on a single server.

Some aspects of information retrieval are also covered in V.A. Gvozdeva's book "Fundamentals of Building Automated Information Systems". As the book explains, each search engine has its own algorithms for sorting search results: the closer a required document is to the beginning of the result list, the higher its relevance and the better the search engine works. All search engines make it possible to quickly find on the Web, by keywords, thematic headings or even individual letters, all or almost all texts in which those words occur, and to report to the user the addresses of the sites where the results were found. However, none of them has an overwhelming advantage over the others, so for a reliable search on complex queries experts recommend using several different IRS, sequentially or in parallel (simultaneously).
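The idea of sorting results so that more relevant documents come first can be illustrated with a deliberately naive scoring scheme (count of query-word occurrences); real engines use far more elaborate algorithms, and the documents below are invented for the example:

```python
# Toy relevance ranking: documents containing the query words more often
# appear closer to the top of the result list. The scoring scheme and
# sample documents are illustrative assumptions.

docs = {
    "doc1": "search engines index pages and rank search results",
    "doc2": "the library holds many books",
    "doc3": "search results depend on the ranking",
}

def rank(query, docs):
    terms = query.lower().split()
    scored = []
    for name, text in docs.items():
        words = text.lower().split()
        score = sum(words.count(t) for t in terms)
        if score:                       # drop documents with no query words
            scored.append((score, name))
    scored.sort(reverse=True)           # most relevant first
    return [name for score, name in scored]

print(rank("search index", docs))
```

Here "doc1" contains the query words three times and "doc3" once, so "doc1" is listed first, mimicking the "closer to the beginning, higher the relevance" behavior described above.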

From D.N. Kadeeva's book "Information Technology and Electronic Communications" one can learn about the concept of a "full-text search engine". Such an engine indexes every word of the text that is visible to the user. Support for morphology makes it possible to find the desired words in all their declensions or conjugations, and some engines can search for phrases or for words within a given distance of one another, which is often important for obtaining a reasonable result. In addition, HTML contains tags that a search engine can also process (headings, links, captions to pictures, etc.). Note that the fewer distinct keywords are placed in these tags, the more often each of them can occur in the texts of the site's pages and, consequently, the higher their relevance; the optimal frequency of such words is no more than 5%. There should not be very many keywords, and they should mostly consist of one or two words forming the most commonly used terms. The more relevant the keywords, the more competitive the document is from the point of view of search engines.
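The "no more than 5%" guideline mentioned above amounts to a simple frequency check; the threshold comes from the text, while the sample page and function are illustrative:

```python
# Keyword-density check based on the "optimal frequency no more than 5%"
# guideline cited above. The sample text is an illustrative assumption.

def keyword_density(text, keyword):
    """Fraction of the words in `text` that equal `keyword` (case-insensitive)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)

page = ("search engines rank pages by relevance and search quality "
        "so each page should use its keywords sparingly")
d = keyword_density(page, "search")
print(f"{d:.1%}", "ok" if d <= 0.05 else "too high")
```

In this sample the word "search" occurs 2 times in 17 words, about 11.8%, so the check flags it as exceeding the 5% guideline.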

The completeness and accuracy of the answer the user receives depend on how accurately he formulates the query. As a result of a search he is usually given much more information than he needs, some of which may have no relevance at all to the query. Clearly, much depends not only on a well-formulated query but also on the capabilities of search engines, which differ greatly; at the same time, the main necessary information may well be missed in the data obtained. Simple queries consisting of separate, fairly common terms retrieve thousands (or even hundreds of thousands) of documents, the vast majority of which the user does not need (information noise).

An important aspect is also an IRS's support for multilingualism, that is, the ability to process queries in different languages. In addition, searching in full-text databases is usually carried out with the help of morphological analyzers (most often Russian and English), which automatically find the existing word forms for a word fragment, word or phrase, even if the query words contain some typos.
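Real morphological analyzers are dictionary-based and language-specific; as a hedged, deliberately crude illustration of matching different word forms, one can strip a few common English suffixes so that related forms compare equal:

```python
# Toy normalizer that strips a few common English suffixes so that
# different word forms reduce to a shared stem. Real morphological
# analyzers are dictionary-based and far more accurate; this sketch
# only illustrates the idea of matching word forms.

SUFFIXES = ["ing", "ies", "ed", "es", "s"]  # checked longest-ish first

def stem(word):
    word = word.lower()
    for suf in SUFFIXES:
        # keep at least 3 characters of stem to avoid over-stripping
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print(stem("libraries"), stem("indexed"), stem("searching"))
```

With such normalization applied to both the query and the index, "searching" and "searched" would match the same entries, which is the effect morphology support gives a full-text IRS.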

One cannot fail to mention such a feature of IRS as their search and structuring tools, which are themselves sometimes called search engines. According to I.S. Ashmanov's book "Website Promotion in Search Engines", search engines are used to help people find the information they need. To collect information about documents on the Internet, search tools such as agents, spiders, crawlers and robots are used. These are special programs that look for pages on the Web, extract the hypertext links on those pages and automatically index the information they find in order to build a database. Each search engine has its own set of rules governing how documents are found and processed: some follow every link on every page they find and then, in turn, explore every link on every new page, and so on; some ignore links that lead to graphics, sound and animation files; others are instructed to browse the most popular pages first. The classification of these search tools is well presented in A.A. Varfolomeev's book "Fundamentals of Information Security":

  • Agents are the most "intelligent" of the search tools. They can do more than just search: they can even execute transactions on your behalf. Already they can search for sites on a specific topic and return lists of sites sorted by attendance. Agents can process the content of documents and can find and index other types of resources, not just pages; they can also be programmed to retrieve information from pre-existing databases. Whatever information agents index, they pass it back to the search engine's database.
  • A general search for information on the Web is carried out by programs known as spiders. Spiders report the content of a found document, index it and extract summary information. They also look at headers and some links, and send the indexed information to the search engine's database.
  • Crawlers look through the headers and return only the first link.
  • Robots can be programmed to follow links of varying nesting depth, index documents, and even check the links within a document. Because of their nature they can get stuck in loops, so they consume significant Web resources when following links; however, there are methods that prevent robots from searching sites whose owners do not want them indexed.
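The link-following behavior described above, including the depth limit and the loop protection a robot needs, can be sketched over an in-memory link graph standing in for real Web pages (the page names and links are invented for the example):

```python
# Sketch of a robot that follows links breadth-first down to a limited
# nesting depth, keeping a visited set so link loops cannot trap it.
# The link graph is an in-memory stand-in for real Web pages.

from collections import deque

links = {
    "home": ["about", "news"],
    "about": ["home", "team"],      # loops back to "home"
    "news": ["archive"],
    "team": [],
    "archive": ["old"],
    "old": [],
}

def crawl(start, max_depth):
    visited = set()
    queue = deque([(start, 0)])
    order = []
    while queue:
        page, depth = queue.popleft()
        if page in visited or depth > max_depth:
            continue                 # loop protection and depth limit
        visited.add(page)
        order.append(page)
        for target in links.get(page, []):
            queue.append((target, depth + 1))
    return order

print(crawl("home", 2))
```

Without the visited set, the "about" to "home" link would send the robot around the same cycle forever, which is exactly the failure mode the paragraph above warns about.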

In conclusion, it can be said that, for all their outward diversity, network IRS fall into a small number of classes, according to the classification described in L.G. Gagarina's book "Automated Information Systems":

Classification information retrieval systems

Classification IRS use a hierarchical (tree-like) organization of information called a CLASSIFIER, whose sections are called HEADINGS. The library analogue of a classification IRS is the systematic catalog. The classifier is developed and improved by one team of authors and is then applied by another team of specialists, the SYSTEMATIZERS. The systematizers, knowing the classifier, read documents and assign them classification indices indicating which sections of the classifier the documents correspond to.
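The hierarchical classifier and the indices assigned by systematizers can be sketched as follows; the headings and index values are illustrative, not taken from any real classifier:

```python
# Sketch of a classification IRS: a hierarchical classifier whose
# headings carry dotted indices, and documents tagged with those
# indices by systematizers. Headings and indices are illustrative.

classifier = {
    "5": "Science",
    "5.1": "Mathematics",
    "5.2": "Computer science",
    "5.2.1": "Information retrieval",
}

documents = {
    "Intro to search engines": ["5.2.1"],
    "Set theory basics": ["5.1"],
}

def under_heading(index):
    """All documents whose classification index falls under the heading."""
    return [title for title, idxs in documents.items()
            if any(i == index or i.startswith(index + ".") for i in idxs)]

print(under_heading("5.2"))   # everything under "Computer science"
```

Because the indices are hierarchical, a query at a broad heading such as "5" automatically covers all documents filed under its subheadings, just as browsing a systematic library catalog does.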

Subject IRS (Web-rings)

From the user's point of view, a subject IRS is organized in the simplest way: you look up the name of the subject of interest (the subject may even be something intangible, for example, Indian music), and lists of the corresponding Internet resources are associated with that name. This would be especially convenient if the complete list of subjects were small.

Vocabulary IRS

The problems inherent in the use of classification IRS led to the creation of dictionary-type IRS, known generically in English as search engines. The main idea of a dictionary IRS is to build, from the words found in Internet documents, a dictionary in which each word is stored together with a list of the documents it was taken from.
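This word-to-documents dictionary is what is usually called an inverted index, and it can be built in a few lines; the sample documents are invented for the illustration:

```python
# Minimal inverted index, the core of a dictionary IRS: for every word,
# store the set of documents containing it. Sample texts are illustrative.

docs = {
    "d1": "information retrieval on the internet",
    "d2": "internet search engines",
    "d3": "library information systems",
}

index = {}
for name, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(name)

print(sorted(index["internet"]))      # documents containing "internet"
print(sorted(index["information"]))
```

Answering a query then reduces to looking each query word up in the dictionary and combining the resulting document lists, instead of scanning every document.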

From A.Yu. Kelina's book "Fundamentals of Information Security" one can learn that dictionary IRS operate by two main algorithms: by keywords and by descriptors (a descriptor is a lexical unit, a word or phrase, of an information retrieval language that serves to describe the main semantic content of a document or to formulate a query when searching for a document in an information retrieval system). In the first case, only the words that occur in a document are used to evaluate its content; given a query, the IRS compares the words of the query with the words of the document, determining relevance by the number, location and weight of the query words in the document. For historical reasons, IRS use this algorithm in various modifications.

