Understanding How Search Engines Work
The following is based on a chapter from my eBook Ever Seeking: A History and Future of Search
It’s easy to take for granted that you can type a few words into a text field and instantly walk away with thousands, even millions, of relevant pieces of information, ranging from web pages and documents to images and videos. It has taken decades to create and record the information we have today, and what makes things even more interesting is how quickly we are creating more of it.
According to IBM, back in 2012 some 2.5 exabytes (or 2.5 billion gigabytes) of data were being generated every day[i]. As of this writing, that figure is six years old. The rate at which new information is created has surely increased since then, and simply multiplying 2.5 exabytes by 365 days a year by six years is enough to make your head spin.
So clearly, having enough data on a topic isn’t the issue. Matching queries with relevant data, however, has always been and always will be the primary challenge search engines are meant to solve. Let’s explore a bit.
Bots and Spiders and Algorithms, Oh My!
Unless a search engine is human-powered or a hybrid, it uses a crawler to index content into its database, performing four critical steps:
Crawling
Indexing
Calculating Relevancy
Retrieving Results
These four steps comprise the full cycle of how a Web page or piece of content goes from being posted online to being found by a user. Let’s take a closer look.
Crawling
In order to truly understand search engines, one needs to understand crawlers (or “bots” or “spiders” as they’re sometimes called). These are just names for the applications which scour the Web for information. The way they work in one sense is quite simple. They branch out through the World Wide Web and “crawl” the pages and content throughout the Internet. They do this by following links and moving from page to page.
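To make this concrete, here’s a minimal, illustrative sketch (in Python, using only the standard library) of a breadth-first crawler: it fetches a page, extracts the links it finds, and follows them to new pages. This is a toy under simple assumptions, not how any production spider works; real crawlers respect robots.txt, throttle their requests, deduplicate aggressively, and operate at a vastly larger scale.

```python
# A minimal, illustrative crawler sketch: fetch a page, extract its links,
# and follow them breadth-first. Not how any production spider works.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    seen, queue = set(), deque([seed_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue  # real crawlers retry, respect robots.txt, throttle, etc.
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))  # resolve relative links
    return seen

# Example usage: crawl("https://example.com")
```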
Some of the things that these spiders initially kept track of include the following:
Content on a page, including its text, images and links
Meta information about a page, such as page titles, “Meta” descriptions and other tags that may be invisible to the average user (most of this is generated by the page author, or sometimes auto-generated by a website’s content management system)
The number of links pointing to a page and the reputations of the sites that contain those links
How frequently content is updated on the page
An interface to the spiders
The crawling process itself is so important to website owners that both Google and Bing provide interfaces to their crawling tools that give a window (though an incomplete one at best) into how the search engines see and crawl your website.
There are countless other third-party tools available that provide recommendations and assessments of websites’ search engine optimization and rankings.
Indexing
In the early days, search engines had submission forms so that people could add their sites to the crawling process. This quickly became impractical and simply didn’t scale as fast as the Web was growing, so from the late 1990s onward search engines moved to a primarily crawl-based method of indexing websites.
This relies on building upon existing search indexes, user-submitted site maps, and websites’ linking structures to find new sites, much as a human user might click from hyperlink to hyperlink while browsing the Web.
Evolving from metadata to page content
In the early days of search engines, page authors were relied upon to provide information about their page content (known as Meta information). That information was sometimes accurate, but the arrangement also led to some bad SEO practices, which we’ll discuss in a bit. As time has gone by, spiders have become more sophisticated. Instead of relying solely on author-provided information that can be easily manipulated, they also focus on:
How often and how recently the page is updated
Domain authority
Geographic location
Website quality indicators such as mobile-friendliness and page load times
And many other signals that make it possible to deliver contextually relevant results to searchers
In addition, Bing and Google have provided new tools (Bing Webmaster Tools and Google Search Console, respectively) that allow webmasters to submit accurate site maps and page listings, which help the search engines discover new pages and accurate information more quickly.
As you can imagine, with all of the information now available, these search indexes are now quite large. Google’s, for instance, is now over 100,000,000[ii] gigabytes in size and growing. Remember, this isn’t the size of the content on the pages, but simply the information that is indexed about the content on those pages, or about the videos, images, or other types of content. This is truly meta information, and the sheer size of it is overwhelming.
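To give a rough sense of what “information about the content” can look like, here’s a minimal inverted-index sketch in Python. It simply maps each word to the pages containing it, which is the basic structure that lets an engine answer keyword queries without rescanning every page; real indexes store far richer data (positions, metadata, link information and much more). The sample pages and terms here are invented for illustration.

```python
# Minimal inverted-index sketch: map each term to the set of page IDs
# containing it. Real search indexes store far richer data per term.
from collections import defaultdict

pages = {
    "page1": "history of web search engines",
    "page2": "how search engines crawl and index the web",
    "page3": "gardening tips for beginners",
}

index = defaultdict(set)
for page_id, text in pages.items():
    for term in text.lower().split():
        index[term].add(page_id)

def lookup(query):
    """Return page IDs containing every term in the query (boolean AND)."""
    terms = query.lower().split()
    results = [index.get(t, set()) for t in terms]
    return set.intersection(*results) if results else set()

print(lookup("search engines"))  # -> {'page1', 'page2'} (order may vary)
```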
Calculating Relevancy
Search engines such as Google, Bing and any others use unique algorithms to calculate relevancy and perform matching against a search query that is submitted.
A search algorithm can be described as a procedure that takes a problem (a question or query) as its input and returns a solution, most likely evaluating several possible solutions along the way. A search engine’s algorithm uses keywords as the input problem and returns relevant search results as the solution, matching those keywords against the results stored in its database, or search index. Keyword relevancy is based on a mathematical formula that varies from search engine to search engine.
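As a hedged illustration only (every engine’s actual formula is proprietary and far more sophisticated), the sketch below scores documents against a query with a simple TF-IDF-style formula: terms that appear often in a document but rarely across the collection contribute the most. The documents and query are made up for the example.

```python
# Toy relevance scoring: a TF-IDF-style formula standing in for the
# proprietary math each real search engine uses.
import math
from collections import Counter

docs = {
    "doc1": "search engines rank pages by relevance",
    "doc2": "relevance is computed by the search algorithm",
    "doc3": "a guide to baking bread at home",
}

def score(query, doc_text, all_docs):
    words = doc_text.lower().split()
    counts = Counter(words)
    total = 0.0
    for term in query.lower().split():
        tf = counts[term] / len(words)                            # term frequency
        containing = sum(term in d.lower().split() for d in all_docs.values())
        idf = math.log((1 + len(all_docs)) / (1 + containing))    # rarity bonus
        total += tf * idf
    return total

ranked = sorted(docs, key=lambda d: score("search relevance", docs[d], docs), reverse=True)
print(ranked)  # documents mentioning both query terms rank ahead of doc3
```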
This means that the search engine with the best algorithm wins, right? Arguably, the answer to that question is yes, which partially explains Google’s rise to the top (see the previous chapter for more information about this), and also explains the constant need to make algorithms better.
What is neither simple nor transparent is how the formulas within these algorithms use this data to compile and display search results, or how all of the above (along with many other factors) is weighted and ranked to produce the results you receive. You may have heard news stories in the digital marketing world about how the latest Google algorithm update has wreaked havoc on businesses’ online presences.
One of the most famous updates was the original Google “Penguin” algorithm, released in April 2012. This change to Google’s ranking platform was primarily meant to penalize sites that were engaging in the following “black hat” SEO tactics:
Keyword stuffing
Link schemes, including link farms and hidden links on pages
Cloaking, redirects or “doorway” pages
Purposeful duplicate content
Since Penguin, there have been many updates (some less controversial than others). According to the Moz “Google Algorithm Change History” page, there have been roughly 16-18 modifications per year for the last several years. Some have gone by other animal names, like “Panda” and “Hummingbird,” and there have even been subsequent Penguin releases.
Assessing Quality
Google has several ways of assessing the quality of a website’s content. One of these is the “Your Money or Your Life” (YMYL) guidelines, made available in Google’s Quality Ratings Guidelines, a 160+ page document first published by the search giant in 2015, and updated several times subsequently.
YMYL is a subset of Google’s Quality Ratings Guidelines that their human website reviewers use to assess the quality of a website. By using a combination of many factors, it helps these reviewers determine how reputable a website is. This includes everything from the frequency of updates, to the content of key sections, to the overall look of the site.
In particular, YMYL deals with website content that discusses or affects the visitor’s “financial stability,” to use Google’s exact words, in addition to their health. In the most recent update, “financial stability” replaced the somewhat vaguer word “wealth.” In addition to banks, credit unions and other financial institutions, this also covers medical and legal advice pages.
How is quality defined by Google with YMYL?
There are 6 factors used in the YMYL rankings, according to Google:
Is the content from a reputable source?
Are contact details present?
Does the site have a good reputation?
Is the website kept up to date?
Is there consistency within the site?
Is the design of the site of high quality?
Even if your SEO efforts aren’t related to financial or health content, the above still gives you some insight into how a modern search engine is “thinking” about your content.
Contextual Relevance
The chances of success with a search increase drastically the more context you have about the person searching: the more context, the better you can execute the search and the better the results you will receive.
So what do we mean when we talk about relevance, or contextual relevance? Context gives search engines a way to narrow down the billions of web pages out there on the Internet.
While others break context down into many more categories[iii], we’re going to talk about four main ones:
Location
    Geography
    Language
    Weather. Searches for local destinations may be impacted by current weather, such as pubs with beer gardens being more prominent on sunny days.
Date/Time
Personalization
    History
    Demographics
    Social (your followers)
    Interest Profile
    Mood. Positivity, negativity, excitement, hunger and far more, as reflected in status or other updates, could impact the content presented.
Platform
    Computer, Tablet, Smartphone or other type of device
User-provided context
All of the above are in addition to the obvious: context provided explicitly by the user. For instance, some types of user-provided context are:
The word or words typed into the text field of a search engine
Turning location-awareness on for the device on which they are searching
Allowing their search history to be saved
Choosing an “Image” or “Maps” search in the search engine instead of a standard search
Logging into their account on an e-commerce store
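One way to picture this is as a query object: the user’s words plus whatever context they have allowed the engine to use, with each extra field narrowing or reordering the candidate results. The sketch below is purely illustrative, and its field names are hypothetical rather than anything a real engine exposes.

```python
# Illustrative only: a query plus optional context fields that a search
# engine might use to filter or reorder results. Field names are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SearchRequest:
    query: str                          # the words typed into the search box
    vertical: str = "web"               # e.g. "web", "images", "maps"
    location: Optional[str] = None      # only if location-awareness is enabled
    language: Optional[str] = None
    device: Optional[str] = None        # "desktop", "tablet", "smartphone"

request = SearchRequest(query="coffee shops", vertical="maps",
                        location="Austin, TX", device="smartphone")
print(request)
```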
Imagine trying to help someone find what they are looking for without any type of context at all. Even the early Web directories like Yahoo! at least allowed users to self-select from predefined (and often rigid) categories.
Retrieving Results
When someone searches for a keyword or phrase, search engines will run that search through an algorithm that uses up to hundreds of different criteria to help provide a ranking of web pages, documents, images, map results, products available for sale, and other types of results.
These criteria can be further divided into what are often referred to as “graphs”: sets of signals that search engines use to help clarify a potential result’s importance or relevance.
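As a hedged sketch of the general idea, each candidate result can be given a score per signal, and a weighted combination decides the final ordering. The signal names and weights below are invented for illustration; the real weights, and the way they interact, are proprietary.

```python
# Illustrative only: combine per-signal scores into one ranking score
# using hypothetical weights. Real engines use far more signals and
# far more sophisticated (and secret) combination logic.
WEIGHTS = {"link_graph": 0.5, "semantic": 0.3, "freshness": 0.2}

candidates = {
    "pageA": {"link_graph": 0.9, "semantic": 0.6, "freshness": 0.2},
    "pageB": {"link_graph": 0.4, "semantic": 0.9, "freshness": 0.8},
}

def combined_score(signals):
    return sum(WEIGHTS[name] * value for name, value in signals.items())

ranking = sorted(candidates, key=lambda p: combined_score(candidates[p]), reverse=True)
print(ranking)  # pageA edges ahead because link-graph signals weigh most here
```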
Link graph
The link graph is a record of every single time a page or document is linked to from other pages and documents. As you can imagine, that’s a pretty long list. To add to its complexity, links from different sources carry different weights or values, depending on how reputable the source page is. For instance, Google’s PageRank uses factors like the age of a domain, the number of reputable pages linking to that site, and more to assign a score to a page or domain.
Then, in turn, when that page or domain links out to another page, the value of that link is based in part on the PageRank of the linking page. Note that PageRank is specific to Google, but each search engine has its equivalent.
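To make the idea concrete, here’s a simplified sketch of a PageRank-style iteration on a tiny, made-up link graph: each page passes its score along its outbound links, so pages that attract links from well-scored pages end up with higher scores. This follows the spirit of the published formula, not the full system Google actually runs.

```python
# Simplified PageRank-style iteration on a tiny link graph.
# Each page spreads its current score evenly across its outbound links;
# the damping factor models a surfer who sometimes jumps to a random page.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

def pagerank(links, damping=0.85, iterations=50):
    n = len(links)
    rank = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / n for page in links}
        for page, outlinks in links.items():
            share = rank[page] / len(outlinks)  # every toy page has outlinks
            for target in outlinks:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

print(pagerank(links))  # 'c' scores highest: it is linked from both 'a' and 'b'
```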
Social graph
Much like search engines use link graphs to determine how valuable a piece of content is based on what other websites link to it, the social graph determines that value based on social media engagement. This means that the more something is liked, shared, commented on, and so forth, the more valuable that piece of content is according to the social graph.
There is also a timing component to this, particularly in the more intelligent search engines tied to a social graph. For instance, if a video has a lot of views but is 10 years old, something more recent with a similar amount of engagement would show first. Social graphs factor in not only popularity, as a sum of engagement measures, but also the content’s recency.
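As an illustration of how engagement and recency might be combined (the weights and decay rate here are invented, not any network’s real formula), engagement counts can be discounted by the content’s age so that a newer item with similar engagement outranks a much older one.

```python
# Illustrative engagement-plus-recency scoring: older content's engagement
# is discounted, so newer items with similar engagement score higher.
def social_score(likes, shares, comments, age_in_days, half_life_days=180):
    engagement = likes + 2 * shares + 3 * comments   # hypothetical weights
    decay = 0.5 ** (age_in_days / half_life_days)    # halves every ~6 months
    return engagement * decay

old_video = social_score(likes=10_000, shares=500, comments=300, age_in_days=3650)
new_video = social_score(likes=9_000, shares=450, comments=250, age_in_days=30)
print(new_video > old_video)  # True: similar engagement, but far more recent
```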
Also note that as of the writing of this book, Google does not tap into the social graph for its results, but many other search engines (particularly those tied to social networks themselves) utilize this.
Semantic graph
If you’ve ever made a mistake in your spelling when typing a search query into Google, you’ve undoubtedly seen the “Did you mean:” prompt, which gently corrects your error and suggests one or more words that are spelled correctly.
In addition to providing relatively simple spell checking, the semantic graph also ties your search terms to other related terms, or even concepts. Some of this gets fairly abstract fairly quickly, but a great deal of time and effort goes into continually improving how search engines interpret human language.
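For a rough sense of the spelling side of this, a “Did you mean” suggestion can be approximated by picking the known term closest to what was typed. The sketch below uses Python’s standard difflib module as a stand-in; real semantic systems go far beyond spelling into related terms and concepts, and the vocabulary here is invented for the example.

```python
# Toy "Did you mean" suggestion using difflib's similarity matching.
# Real semantic graphs also map terms to related terms and concepts.
import difflib

known_terms = ["restaurant", "reservation", "recreation", "resolution"]

def did_you_mean(query_term, vocabulary):
    matches = difflib.get_close_matches(query_term, vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(did_you_mean("restarant", known_terms))  # -> restaurant
```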
Finally, the semantic graph also has other methods (beyond link popularity) to determine authority on a topic, which is used to help those websites, pages, and documents rank higher in categories of information for which they appear to be a domain expert.
Other factors
In addition to the above, there are many other factors that go into search algorithms. Some of these are contextual, such as location-based (is this restaurant nearby?), language-based (is this content in the language this browser is set to?), or personalized (did this user recently search for something similar and visit a similar page?), and others are proprietary and can only be guessed at.
Understanding how search works helps marketers to become better search engine marketers. While you may not be responsible for the algorithms that enable people to find content, having insights into how they work helps you create more effective, searchable content.
Learn more about my book Ever Seeking: A History and Future of Search
[i] Wall, Matthew. “Big Data: Are you ready for a blast-off?” BBC News. March 4, 2014.
[ii] Google. “How Search Organizes Information.” Retrieved July 4, 2017.
[iii] Dawson, Ross. “The 9 Kinds of Context That Will Define Contextual Search.” RossDawson.com, 2011.