A search engine is the mechanism used to locate content on the web based on keywords entered by the user. Most studies of how search engines work observe that the major engines (e.g., Google, Yahoo, Bing, Ask, DuckDuckGo) all share the same main parts: web crawling (or spidering), web indexing, and searching.
Figure 1: Simple Mechanism Representation of How a Search Engine Works.
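As a rough illustration of these three parts working together, the following toy pipeline crawls a tiny in-memory "web", builds an index, and answers a keyword query. All page names, text, and links here are invented for illustration; this is a sketch of the idea, not any engine's actual implementation.

```python
from collections import defaultdict

# Toy stand-in for the web: "URL" -> (page text, outgoing links).
PAGES = {
    "a.html": ("search engines index the web", ["b.html"]),
    "b.html": ("crawlers follow links across the web", ["a.html", "c.html"]),
    "c.html": ("an index speeds up keyword search", []),
}

def crawl(seed):
    """Web crawling: visit pages from a seed URL, following links."""
    frontier, visited = [seed], set()
    while frontier:
        url = frontier.pop(0)
        if url in visited or url not in PAGES:
            continue
        visited.add(url)
        frontier.extend(PAGES[url][1])
    return visited

def build_index(urls):
    """Web indexing: map each word to the set of pages containing it."""
    index = defaultdict(set)
    for url in urls:
        for word in PAGES[url][0].split():
            index[word].add(url)
    return index

def search(index, keyword):
    """Searching: look the keyword up in the index instead of scanning pages."""
    return sorted(index.get(keyword, set()))

index = build_index(crawl("a.html"))
print(search(index, "web"))  # ['a.html', 'b.html']
```

Answering the query touches only the index entry for "web", which is the whole point of building the index up front.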
Web mining has three branches: web content mining (WCM), web structure mining (WSM), and web usage mining (WUM). WCM extracts the relevant information from the contents of web pages; WSM finds the relations between different web pages by processing the structure of the web; WUM records user profiles and user behaviour from the web's log files. As information on the web keeps scaling up, search engines must accommodate this growth, which is why large-scale search engine architectures include many distributed servers. Many papers discuss how to improve the performance of web search engines with respect to the user interface and query input, filtering of the output results, and better algorithms for page spidering and collecting, indexing, and output.
Methods and Structure of a Search Engine and How It Works:
There are differences in the ways various search engines work, but they all perform the following activities:
Figure 2: Characteristics of Search Engines.
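The first of these activities, crawling, can be sketched as a loop over a frontier of URLs. The sketch below also shows a simple per-host politeness delay; the link data, URLs, and delay value are all invented for illustration.

```python
import time
from urllib.parse import urlparse

# Invented stand-in for fetching a page: URL -> outgoing links.
LINKS = {
    "http://example.org/":  ["http://example.org/a", "http://other.org/"],
    "http://example.org/a": ["http://example.org/"],
    "http://other.org/":    [],
}

def crawl(seeds, delay=0.0):
    """Frontier-based crawl: a seen-set enforces the selection policy
    (each URL is downloaded once) and a per-host delay enforces a
    simple politeness policy (do not overload one site)."""
    frontier, seen, order = list(seeds), set(seeds), []
    last_hit = {}  # host -> time of the previous request to it
    while frontier:
        url = frontier.pop(0)
        host = urlparse(url).netloc
        wait = delay - (time.monotonic() - last_hit.get(host, float("-inf")))
        if wait > 0:
            time.sleep(wait)  # politeness: pause between hits to one host
        last_hit[host] = time.monotonic()
        order.append(url)  # "download" the page
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl(["http://example.org/"]))
```

A real crawler would fetch pages over the network, obey robots.txt, and schedule re-visits; this sketch only shows the frontier loop itself.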
1) Web Crawling
The first step for a search engine is to browse the world wide web in an automated manner and see what is there, based on important words; this is done by a program called a web crawler or spider. The crawler follows the links on each visited page and indexes everything it encounters. Web crawlers are mainly used to create a copy of all visited pages for later processing by the search engine, which indexes the downloaded pages to provide fast searches. In general, a crawler starts with a list of URLs to visit, called the seeds. As it visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies. Because of the large number of pages on the web (more than 20 billion), a web crawler cannot visit all of them daily to check whether new pages have appeared or existing pages have been modified. Web crawlers are an essential part of search engines, and the details of their architectures and algorithms are kept as secrets. The behaviour of a web crawler is defined by a combination of the following policies:
• Selection policy, which states which pages are downloaded to the database;
• Re-visit policy, which states when to check for changes to pages;
• Politeness policy, which states how to avoid overloading web sites; and
• Parallelization policy, which states how to coordinate distributed web crawlers.
2) Indexing
After a page is crawled,
search engines parse the document to generate an index that points to the corresponding result. The indexed pages are stored in a huge database from which they can be retrieved later. Indexing is the process of identifying the words and phrases that best describe a page and assigning the page to particular keywords. The purpose of storing an index is to improve the speed of finding the documents relevant to a search query; without an index, the search engine would have to scan every document in the collection, which requires a long time and much computing power.
3) Searching and Processing
When a user issues a search request for words found in the index, the search engine processes it by comparing the search string from the request with the indexed pages in the database. Since there are millions of pages that may match the search string, the search engine calculates the relevancy of each page in its index with respect to the search string. There are different algorithms for calculating relevancy, and each assigns different relative weights to common factors such as keyword density, links, or meta tags.
4) Result Matching
A matching
method used by the search engine to match the user's query with similar web pages in the database. Many different matching techniques are used by the various search engines to surface strongly relevant results. However, there can be challenges during result matching, some of which are described below.
Parsing: Parsing algorithms may run into difficulties when they encounter the complex Hyper Text Markup Language (HTML) used in some web pages. Such difficulties create instances where useful results may fail to be extracted for display to the user.
Filtering: A search engine needs to perform effective filtering of URLs in order to show the most relevant ones to searchers; it is important to show each user unique results by minimizing the chances of repetition.
5) Result Ranking
It defines the order in which search results
are displayed to the user. There could be thousands of results that could be shown, so the order of importance in which results appear needs to be taken care of. Search engines follow a sorting algorithm to rank the results. This algorithm is based on two factors:
• Location: It is important for the search engine to look for the keywords near the top of the web page, for example in the title of the web page.
• Frequency: The algorithm looks at how often the keywords are repeated in the context of the search results. Frequency alone is not considered an ideal factor, as it is biased towards content-rich pages.
6) Retrieving
Retrieving the results simply means displaying them in the browser, sorted from the most relevant to the least relevant sites.
Search Engine Architectures: Google,
Yahoo, Bing
Google: An Overview
Google was founded by Larry Page and Sergey Brin while they were PhD students at Stanford University in 1998, and it officially launched in the fall of 1999. It is a straightforward engine that does not support advanced search syntax, making it very easy to use; it retrieves pages ranked on the basis of the number of sites linking to them and how often they are visited, indicating their popularity (ibid). It claims that 97% of users find what they are looking for. Google's brand has become so universally recognizable that nowadays people use it as a verb: if someone asks you a question you cannot answer, the reply is “Ask Google” or “Google it”.
What made Google the most popular and trusted search engine is the quality of its search results. Google uses sophisticated algorithms to present the most accurate results to users. Google's founders, Larry Page and Sergey Brin, came up with the idea that websites referenced by other websites are more important than others and thus deserve a higher ranking in the search results. Over the years the Google ranking algorithm has been enriched with hundreds of other factors (including the help of machine learning), and it remains the most reliable way to find exactly what you are looking for on the Internet.
Features
Google includes the following most important features:
1) Cached page archives
2) Results clustered by indentation
3) Displayed-results option, from 10 to 100
“Google Search” supports:
1) Implied Boolean (+) and (-) signs
2) Double quotes (“”) for phrases and stop words
The components:
• Crawler: There are several distributed
crawlers; they parse the pages and extract links and keywords.
• URL Server: Provides the crawlers with a list of URLs to scan.
• Store Server: The crawlers send the collected data to a store server, which compresses the pages and places them in the repository. Each page is stored with an identifier, a docID.
• Repository: Contains a copy of the pages and images, allowing comparisons and caching.
• Indexer: Indexes the pages. It decompresses documents and converts them into sets of words called "hits", and distributes the hits among a set of "barrels", which provides a partially sorted index. It also creates a list of the URLs on each page. A hit contains the following information: the word, its position in the document, font size, and capitalization.
• Barrels: These "barrels" are databases that classify documents by docID. They are created by the indexer and used by the sorter.
• Anchors: The bank of anchors created by the indexer contains internal links and the text associated with each link.
Figure 3: Google architecture.
• URL Resolver: Takes the contents of anchors, converts relative URLs into absolute addresses, and finds or creates a docID. It builds an index of documents and a database of links.
• Doc Index: Contains the text relative to each URL.
• Links: The database of links associates each link with a docID (and so with a real document on the web).
• PageRank: The software uses the database of links to compute the PageRank of each page.
• Sorter: Interacts with the barrels. It takes documents classified by docID and creates an inverted list sorted by wordID.
• Lexicon: A program called DumpLexicon takes the list provided by the sorter (classified by wordID) together with the lexicon created by the indexer (the sets of keywords on each page) and produces a new lexicon for the searcher.
• Searcher: Runs on a web server in a datacenter; it uses the lexicon built by DumpLexicon in combination with the index classified by wordID, takes PageRank into account, and produces a results page.
Yahoo: An Overview
Yahoo
is the oldest and also the largest directory on the Internet, begun in mid-1994. It is one of the most frequently accessed tools, and although most people consider it a search engine, it is classified as a directory. Yahoo is one of the most popular email providers, and its web search engine holds third place in search with an average market share of 2%. From October 2011 to October 2015, Yahoo search was powered exclusively by Bing. In October 2015 Yahoo agreed with Google to provide search-related services, and until October 2018 Yahoo's results were powered by both Google and Bing. As of October 2019, Yahoo! Search is once again provided exclusively by Bing. Yahoo has also been the default search engine for the Firefox browser in the United States (since 2014). Yahoo's web portal is very popular and ranks as the 11th most visited website on the Internet (according to Alexa).
Structure
• Yahoo is hierarchically organized as a subject catalogue, or directory, of the web which is browsable and searchable.
• Yahoo indexes web pages, UseNet, and e-mail addresses.
Features
• Topic- and region-specific “yahoos!”
• Automatic truncation
• No case sensitivity, and stop words are ignored
The syntax that Yahoo follows for searching is fairly standard among search engines.
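The features above (no case sensitivity, stop-word removal, automatic truncation) can be sketched as a small query-normalization step. The stop-word list and sample queries below are invented for illustration; real engines maintain their own lists.

```python
# Invented stop-word list; real engines maintain their own.
STOP_WORDS = {"a", "an", "and", "or", "the"}

def normalize(query):
    """Lowercase the query (no case sensitivity) and drop stop words."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]

def matches(term, word):
    """Automatic truncation: treat the query term as a prefix,
    so 'engine' also matches 'engines'."""
    return word.startswith(term)

print(normalize("The Search ENGINES"))  # ['search', 'engines']
print(matches("engine", "engines"))     # True
```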
Figure 4: Yahoo architecture.
The components:
Data Acquisition -- Web Crawling
• Follow hyperlinks to download pages
• Spam detection
• (Near-)duplicate detection
• Link analysis -- e.g., PageRank
• Prepares input for crawling and query processing
Index Construction and Updates
• Build the inverted index structure in bulk, similar to mining, but updates are trickier
Query Processing
Boolean queries:
• Compute unions/intersections of posting lists
Ranked queries:
• Give scores to all documents in the union
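The two query-processing modes above can be sketched over hypothetical posting lists. The data and the one-term-one-point scoring rule are invented for illustration; real engines use far richer relevancy signals.

```python
# Hypothetical posting lists: term -> sorted IDs of documents containing it.
POSTINGS = {
    "web":    [1, 2, 4],
    "search": [2, 3, 4],
    "mail":   [3],
}

def boolean_and(terms):
    """Boolean query: intersect the posting lists of all query terms."""
    result = set(POSTINGS.get(terms[0], []))
    for term in terms[1:]:
        result &= set(POSTINGS.get(term, []))
    return sorted(result)

def ranked(terms):
    """Ranked query: score every document in the union of the lists.
    Here the score is simply how many query terms the document contains."""
    scores = {}
    for term in terms:
        for doc in POSTINGS.get(term, []):
            scores[doc] = scores.get(doc, 0) + 1
    return sorted(scores, key=lambda d: (-scores[d], d))

print(boolean_and(["web", "search"]))  # [2, 4]
print(ranked(["web", "search"]))       # [2, 4, 1, 3]
```

Note that the ranked query returns every document in the union, with the documents matching both terms scored ahead of those matching only one.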
BING SEARCH ENGINE (www.bing.com)
Bing (known previously as Live Search, Windows Live Search, and MSN Search) is a web search engine (advertised as a "decision engine") from Microsoft. MSN was launched with Windows 95 as the default page but officially launched as a search engine in 1998. It is used by most Windows-based computers as the default search engine in their Internet browsers. MSN Search made improvements in search technology and was relaunched as Windows Live Search in September 2006. Microsoft later rebranded it as Live Search in March 2007, a name that continued until May 2009.
Figure 5: Live, MSN and Bing Search Engines.
The MSN search engine is now known as Microsoft's "Bing" search engine. Bing was unveiled by Microsoft CEO Steve Ballmer on May 28, 2009 in San Diego, and it went fully online on June 3, 2009. It is advertised as a decision engine. In October 2011, Microsoft stated that they were working on a new back-end search infrastructure with the goal of delivering faster and slightly more relevant search results for users. Known as "Tiger", the new index-serving technology has been incorporated into Bing globally since August 2011. In May 2012, Microsoft announced another redesign of its search engine that includes "Sidebar", a social feature that searches users' social networks for information relevant to the search query. In September 2013, a new-look Bing was released to tie in with Microsoft's "Metro" design language.
Figure 6: News search in Bing.
Semantic technology is now used in Bing search. Notable changes include the listing of search suggestions in real time as queries are entered, and a list of related searches, known as the explorer pane. Bing also includes the ability to save and share search histories.
Important features of Bing:
• It provides a simple as well as an advanced search facility.
• Bing provides access to the user's session history in the explorer pane, including related searches and prior searches.
• A right-side extended preview of a web page gives URLs to links inside the page.
• On certain sites, Bing allows searching within search results.
• For Boolean search, the operators are used in capitals: AND, NOT and AND NOT; besides this, a + sign can be used to indicate that a search term must appear, whereas a - sign indicates that a search term must not appear.
• Truncation: An asterisk ‘*’ may be used as a wild card for truncation.
• Dictionary features: When ‘define’, ‘definition’ or ‘what is’ followed by a word is entered in the search box, Bing will show a direct answer from the Encarta dictionary.
• In advanced search, users can select the option: any of the words, all of the words, words in the title, the exact phrase, Boolean phrase, or links to the URL.
Bing search follows these rules:
1. Search words for basic searches are not case sensitive.
2. All searches are “AND” searches, so there is no need to type the word AND between search terms.
3. Stop words such as ‘a’, ‘and’, ‘or’, and ‘the’ are ignored, unless they are surrounded by quote marks.
Display of Search Results
1. Web search displays the total number of search hits.
2. Suggestions of related searches appear on the left side.
3. Title with hyperlink; "More on this page" displays on the right side.
4. Brief description from metadata.
5. Web site address.
6. An option for cached pages is available.
Bing employs an advanced set of
rules or instructions that each search goes through in order to narrow down and filter the best results.
Bing uses Natural Language Processing: Bing's natural language processing pipelines for developers leverage patterns found in training data from developer queries collected over the years, containing commonly used terms and text structures typical of coding queries.
Bing uses Click Signals to Improve Accuracy: This is likely a reference to using click patterns to determine what a user means when they type a particular search query. If a pattern is consistent, a search engine can know with confidence that the pattern means a specific thing. If the click pattern is less consistent, that is a signal that the search query is ambiguous; that is when you see search engine results pages (SERPs) with different kinds of sites.
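One simple way to quantify how consistent a click pattern is would be the entropy of the click distribution for a query. The click counts and URLs below are invented for illustration; this is a sketch of the idea, not Bing's actual signal.

```python
from math import log2

def click_entropy(clicks):
    """Shannon entropy of a query's click distribution: low entropy means
    users consistently click the same result (unambiguous query); high
    entropy suggests the query is ambiguous."""
    total = sum(clicks.values())
    return -sum(n / total * log2(n / total) for n in clicks.values() if n)

# Invented click counts per result for two hypothetical queries.
consistent = {"python.org": 98, "en.wikipedia.org": 2}
ambiguous = {"apple.com": 34, "en.wikipedia.org/wiki/Apple": 33, "allrecipes.com": 33}

print(click_entropy(consistent) < click_entropy(ambiguous))  # True
```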
Bing Uses Upvotes for Ranking Forum Answers: The system extracts the best-matched code samples from popular, authoritative and well-moderated sites like Stack Overflow, GitHub, W3Schools, MSDN, Tutorialspoint, etc., taking into account such aspects as fidelity of the API and programming-language match, counts of up/down-votes, completeness of the solution, and more.
Integration with Apple: On June 10, 2013, Apple announced that it would drop Google as its web search engine and include Microsoft's Bing.