Friday, January 15, 2021

The mechanism used by search engines to search content in the web, based on keywords entered by the user


Many studies have focused on the mechanism of how a search engine works. All the common engines (e.g. Google, Yahoo, Bing, Ask, DuckDuckGo) share the same main parts: web crawling (or spidering), web indexing, and searching.

Figure 1: Simple mechanism representation of how a search engine works.
Web mining is made of three branches: web content mining (WCM), web structure mining (WSM), and web usage mining (WUM). WCM extracts proper and relevant information from the contents of web pages. WSM finds the relations between different web pages by processing the structure of the web. WUM records user profiles and user behaviour in the web's log files. As information scales up over the web, search engines must accommodate this scaling, which is why large-scale search engine architectures include many distributed servers. Many papers discuss how to improve the performance of web search engines with respect to the user interface and query input, filtering the output results, and improving the algorithms for web page crawling and collecting, indexing, and output.

Methods and structure of a search engine, and how it works: there are differences in the ways various search engines work, but they all perform the following activities:
Figure 2: Characteristics of Search Engines.
1) Web Crawling

The first step for a search engine is to browse the world wide web in an automated manner and see what is there, based on important words. This is done by a program called a web crawler or spider. The crawler follows links through each visited page and indexes everything it encounters. Web crawlers are used mainly to create a copy of all the visited pages for later processing by the search engine, which indexes the downloaded pages to provide fast searches. In general, a crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies. Because of the large number of pages on the web (more than 20 billion), a web crawler cannot visit all these pages daily to check whether new pages have appeared or existing pages have been modified. Web crawlers are an essential part of search engines, and the details of their architectures and algorithms are kept as secrets. The behaviour of a web crawler is defined by a combination of the following policies:
• Selection policy, which states which pages are downloaded to a database;
• Re-visit policy, which states when to check for changes to pages;
• Politeness policy, which states how to avoid overloading web sites; and
• Parallelization policy, which states how to coordinate distributed web crawlers.

2) Indexing

After a page is crawled, the search engine parses the document to generate an index that points to the corresponding result. The indexed pages are stored in a huge database from which they can be retrieved later. Indexing is the process of identifying the words and phrases that best describe a page and assigning the page to particular keywords. The purpose of storing an index is to improve the speed and performance of finding the documents relevant to a search query.
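A minimal sketch of the indexing idea, assuming toy documents rather than real crawled pages: an inverted index maps each word to the set of docIDs containing it, and an AND query is answered by intersecting those sets.

```python
from collections import defaultdict

# Toy "crawled" documents, keyed by docID (hypothetical content).
docs = {
    1: "search engines crawl the web",
    2: "the web is indexed by search engines",
    3: "users send queries to the engine",
}

def build_index(documents):
    """Inverted index: word -> set of docIDs containing that word."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def and_query(index, *words):
    """Return docs containing every query word (implicit AND)."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

index = build_index(docs)
print(and_query(index, "search", "web"))   # docIDs mentioning both words
```

Looking up words in this index is a dictionary access rather than a scan over every document, which is exactly the speed benefit described above.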
Without an index, the search engine would have to scan every document in the collection, which requires a long time and much computing power.

3) Searching and Processing

When a user submits a search request for words found in the index, the search engine processes it by comparing the search string from the request with the indexed pages in the database. Since there are millions of pages that may match the search string, the search engine calculates the relevancy of each page in its index to the search string. There are different algorithms for calculating relevancy, and each of these algorithms assigns different relative weights to common factors like keyword density, links, or meta tags.

4) Result Matching

A matching method is used by the search engine to match the user's query with similar web pages in the database. Many different matching techniques are used to surface strongly relevant results. However, there can be challenges during result matching, such as:
• Parsing: Parsing algorithms may run into difficulties if they encounter complex Hyper Text Markup Language (HTML) in some web pages. Such difficulties create instances where useful results may fail to be extracted for display to the user.
• Filtering: A search engine needs to perform effective filtering of URLs in order to show the most relevant results to searchers; it is important to show each user unique results by minimizing the chance of repetition.

5) Result Ranking

Ranking defines the order in which search results are displayed to the user. There could be thousands of results to show, so the order of importance needs to be taken care of. Search engines follow a sorting algorithm to rank the results. This algorithm is based on two factors:
• Location: It is important for the search engine to find the keywords near the top of the web page.
For example, the keyword may appear in the title of the web page.
• Frequency: The algorithm looks at how often the keywords are repeated in the context of the search results. Frequency alone is not considered an ideal factor, as it is biased towards content-rich pages.

6) Retrieving

Retrieving the results simply means displaying them in the browser, sorted from the most relevant to the least relevant sites.

Search Engine Architectures: Google, Yahoo, Bing

Google: An Overview

Google was founded by Larry Page and Sergey Brin while they were PhD students at Stanford University in 1998, and was officially launched in the fall of 1999. It is a straightforward engine that does not support advanced search syntax, making it very easy to use, and it retrieves pages ranked on the basis of the number of sites linking to them and how often they are visited, indicating their popularity (ibid). It claims that 97% of users find what they are looking for. Google's brand has become so universally recognizable that nowadays people use it as a verb: if someone asks you a question you don't know the answer to, the answer is "Ask Google" or "Google it". What made Google the most popular and trusted search engine is the quality of its search results. Google uses sophisticated algorithms to present the most accurate results to its users. Google's founders Larry Page and Sergey Brin came up with the idea that websites referenced by other websites are more important than others and thus deserve a higher ranking in the search results. Over the years the Google ranking algorithm has been enriched with hundreds of other factors (including the help of machine learning), and it still remains the most reliable way to find exactly what you are looking for on the Internet.
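The ranking idea just described — pages referenced by other pages deserve a higher rank — is the core of PageRank, which can be sketched with a few iterations of rank redistribution. The four-page link graph is a toy example and 0.85 is the conventional damping factor; this is an illustration, not Google's production algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively redistribute each page's rank along its outgoing links."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}       # start with uniform rank
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            for target in outgoing:
                # A page splits its rank evenly among the pages it links to.
                new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

# Toy web: an arrow means "links to".
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))   # "C" is linked to the most, so it ranks highest
```

Page C ends up highest because three of the four pages link to it — being referenced often is rewarded, exactly as described above.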
Features

Google includes the following important features:
1) Cached page archives
2) Results clustered by indentation
3) Results display options, from 10 to 100 per page

Google Search supports:
1) Implied Boolean (+) and (-) signs
2) Double quotes ("") for phrases
3) Stop words

The components:
• Crawler: There are several distributed crawlers; they parse the pages and extract links and keywords.
• URL Server: Provides the crawlers with a list of URLs to scan.
• Store Server: The crawlers send collected data to the store server. It compresses the pages and places them in the repository. Each page is stored with an identifier, a docID.
• Repository: Contains a copy of the pages and images, allowing comparisons and caching.
• Indexer: It indexes pages. It decompresses documents and converts them into sets of word occurrences called "hits". It distributes the hits among a set of "barrels", which provides a partially sorted index. It also creates a list of the URLs on each page. A hit contains the following information: the word, its position in the document, font size, and capitalization.
• Barrels: These "barrels" are databases that classify documents by docID. They are created by the indexer and used by the sorter.
• Anchors: The bank of anchors created by the indexer contains internal links and the text associated with each link.
Figure 3: Google architecture
• URL Resolver: It takes the contents of the anchors, converts relative URLs into absolute addresses, and finds or creates a docID. It builds an index of documents and a database of links.
• Doc Index: Contains the text relative to each URL.
• Links: The database of links associates each link with a docID (and so with a real document on the web).
• PageRank: The software uses the database of links to compute the PageRank of each page.
• Sorter: It interacts with the barrels. It takes the documents classified by docID and creates an inverted list sorted by wordID.
• Lexicon: A program called DumpLexicon takes the list provided by the sorter (classified by wordID), together with the lexicon created by the indexer (the sets of keywords in each page), and produces a new lexicon for the searcher.
• Searcher: It runs on a web server in a datacenter, uses the lexicon built by DumpLexicon in combination with the index classified by wordID, takes the PageRank into account, and produces a results page.

Yahoo: An Overview

Yahoo is the oldest and also the largest directory on the Internet, begun in mid-1994. It is one of the most frequently accessed tools, and despite the fact that most people consider it a search engine, it is classified as a directory. Yahoo is one of the most popular email providers, and its web search engine holds third place in search with an average 2% market share. From October 2011 to October 2015, Yahoo search was powered exclusively by Bing. In October 2015 Yahoo agreed with Google to provide search-related services, and until October 2018 Yahoo's results were powered by both Google and Bing. As of October 2019, Yahoo! Search is once again provided exclusively by Bing. Yahoo has also been the default search engine for the Firefox browser in the United States since 2014. Yahoo's web portal is very popular and ranks as the 11th most visited website on the Internet (according to Alexa).
Structure
• Yahoo is hierarchically organized, with a subject catalogue or directory of the web that is browsable and searchable.
• Yahoo indexes web pages, UseNet, and e-mail addresses.

Features
• Topic- and region-specific "Yahoos!"
• Automatic truncation
• No case sensitivity and no stop words

The syntax that Yahoo follows for searching is fairly standard among search engines.
Figure 4: Yahoo architecture
The components:

Data Acquisition -- Web Crawling
• follows hyperlinks to download pages
• spam detection
• (near-)duplicate detection
• link analysis -- e.g., PageRank
• prepares input for crawling and query processing

Index Construction and Updates
• builds the inverted index structure in bulk, similar to mining, but updates are trickier

Query Processing
Boolean queries:
• compute unions/intersections of lists
Ranked queries:
• give scores to all docs in the union

BING SEARCH ENGINE (www.bing.com)

Bing (known previously as Live Search, Windows Live Search, and MSN Search) is a web search engine (advertised as a "decision engine") from Microsoft. MSN was launched with Windows 95 as the default page set, but was officially launched as a search engine in 1998. It is used by most Windows-based computers as the default search engine in their Internet browsers. MSN search made improvements in search technology and was relaunched as Windows Live Search in September 2006. Microsoft later announced it as Live Search in March 2007, and that name continued until May 2009.
Figure 5: Live, MSN and Bing Search Engines
The MSN search engine is now known as Microsoft's "Bing" search engine. Bing was unveiled by Microsoft CEO Steve Ballmer on May 28, 2009 in San Diego, and went fully online on June 3, 2009. It is advertised as a decision engine. In October 2011, Microsoft stated that they were working on a new back-end search infrastructure with the goal of delivering faster and slightly more relevant search results for users. Known as "Tiger", the new index-serving technology has been incorporated into Bing globally since August 2011. In May 2012, Microsoft announced another redesign of its search engine that includes "Sidebar", a social feature that searches users' social networks for information relevant to the search query. In September 2013, a new-look Bing was released to tie in with Microsoft's "Metro" design language.
Figure 6: News search in Bing
Semantic technology is now used in Bing search. Notable changes include the listing of search suggestions in real time as queries are entered, and a list of related searches, known as the explorer pane. Bing also includes the ability to save and share search histories.

Important features of Bing:
• It provides a simple as well as an advanced search facility.
• Bing provides access to the user's session history in the explorer pane, including related searches and prior searches.
• An extended preview of each web page on the right side gives URLs to links inside the page.
• On certain sites, Bing allows searching within the search results.
• For Boolean search, the operators are used in capitals: AND, NOT, and AND NOT. Besides this, a + sign can be used to indicate that a search term must appear, whereas a - sign indicates that a search term must not appear.
• Truncation: An asterisk '*' may be used as a wildcard for truncation.
• Dictionary features: When 'define', 'definition', or 'what is' followed by a word is entered in the search box, Bing shows a direct answer from the Encarta dictionary.
• In advanced search, users can select the options: any of the words, all of the words, words in the title, the exact phrase, Boolean phrase, or links to the URL.

Bing search follows these rules:
1. Search words for basic searches are not case sensitive.
2. All searches are "AND" searches, so there is no need to type the word "and" between your search terms.
3. Stop words such as 'a', 'and', 'or', and 'the' are ignored, unless they are surrounded by quote marks.

Display of search results:
1. Web search displays the total number of search hits.
2. Suggestions of related searches appear on the left side.
3. Title with hyperlink; "More on this page" displays on the right side.
4. Brief description of metadata.
5. Web site address.
6. An option for cached pages is available.

Bing employs an advanced set of rules or instructions that each search goes through in order to narrow down and filter the best results.
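The three query rules above (case folding, implicit AND, stop words dropped unless quoted) can be sketched as a small query normalizer. The stop-word list is abbreviated for illustration; Bing's actual query processing is far more involved.

```python
import re

STOP_WORDS = {"a", "and", "or", "the"}   # abbreviated, illustrative list

def normalize_query(query):
    """Apply the rules: lower-case everything, keep quoted phrases intact,
    and drop bare stop words; remaining terms are implicitly AND-ed."""
    terms = []
    # Each match is either a quoted phrase or a single bare word.
    for phrase, word in re.findall(r'"([^"]+)"|(\S+)', query.lower()):
        if phrase:
            terms.append(phrase)          # quoted: stop words survive
        elif word not in STOP_WORDS:
            terms.append(word)            # bare stop words are ignored
    return terms

print(normalize_query('The Search "the engine"'))   # ['search', 'the engine']
```

Note how 'The' is dropped as a bare stop word while 'the' inside the quoted phrase survives, matching rule 3.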
Bing uses Natural Language Processing: Bing's natural language processing pipelines for developers leverage patterns found in training data from developer queries collected over the years, containing commonly used terms and text structures typical of coding queries.

Bing uses click signals to improve accuracy: This is likely a reference to using click patterns to determine what a user means when they type a particular search query. If a pattern is consistent, a search engine can know with confidence that the pattern means a specific thing. If the click pattern is less consistent, that is a signal that the search query is ambiguous; that is when you see search engine results pages (SERPs) with different kinds of sites.

Bing uses upvotes for ranking forum answers: The system extracts the best-matched code samples from popular, authoritative, and well-moderated sites like Stackoverflow, Github, W3Schools, MSDN, Tutorialspoint, etc., taking into account such aspects as fidelity of the API and programming language match, counts of up/down-votes, completeness of the solution, and more.

Integration with Apple: On June 10, 2013, Apple announced that it would be dropping Google as its web search engine and including Microsoft's Bing.
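The click-signal idea above can be sketched as blending a text-relevance score with the observed click-through rate. All names, numbers, and weights here are hypothetical, not Bing's actual model.

```python
def blended_score(relevance, clicks, impressions, click_weight=0.3):
    """Blend text relevance (0..1) with the historical click-through rate."""
    ctr = clicks / impressions if impressions else 0.0
    return (1 - click_weight) * relevance + click_weight * ctr

# Hypothetical results: (name, text relevance, clicks, impressions).
results = [
    ("page-a", 0.90, 10, 1000),    # relevant text, but rarely clicked
    ("page-b", 0.80, 500, 1000),   # slightly less relevant, clicked often
]
ranked = sorted(results, key=lambda r: blended_score(r[1], r[2], r[3]),
                reverse=True)
print([name for name, *_ in ranked])   # ['page-b', 'page-a']
```

Here the heavily clicked page overtakes the one with the higher text score, illustrating how consistent click patterns can refine a ranking.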
