Scientific Article by Assistant Lecturer Ridhab Sami Abd Ali: How Search Engines Work: Crawling, Indexing, and Ranking


Search engines are answer machines. They exist to discover, understand, and organize the internet's content in order to offer the most relevant results to the questions searchers are asking. To show up in search results, your content first needs to be visible to search engines. This is arguably the most important piece of the SEO puzzle: if your site can't be found, there is no way it will ever appear in the SERPs (Search Engine Results Pages).

How Search Engines Work: The Basics

A "search engine" is a set of interlinked mechanisms that work together to identify pieces of web content (images, videos, website pages, etc.) based on the words you type into a search bar. Site owners use Search Engine Optimization (SEO) to improve the chances that content on their site will show up in search results.

Search engines rely on three basic mechanisms:

• Web crawlers: Bots that continually browse the web for new pages. Crawlers collect the information needed to index a page correctly and follow hyperlinks to hop to other pages and index them too.
• Search index: A record of all the web pages the engine has found, organized in a way that allows associations between keyword terms and page content. Search engines also have ways of grading the quality of content in their indexes.
• Search algorithms: Calculations that grade the quality of web pages, determine how relevant each page is to a search term, and decide how the results are ranked based on quality and popularity.

Search engines try to deliver the most useful results for each user so that large numbers of users keep coming back. This makes business sense, as most search engines make money through advertising. Google, for example, made an impressive $116B in 2018.

How Search Engines Crawl, Index, and Rank Content

Search engines look simple from the outside. You type in a keyword, and you get a list of relevant pages. But that deceptively easy exchange requires a lot of computational heavy lifting backstage. The hard work starts long before you make a search. Search engines work around the clock, gathering information from the world's websites and organizing it so that it is easy to find. This is a three-step process: first crawling web pages, then indexing them, then ranking them with search algorithms.

How Does Web Crawling Work?

Search engines use their own web crawlers to discover and access web pages. All commercial search engine crawlers begin crawling a website by downloading its robots.txt file, which contains rules about which pages search engines should or should not crawl on that site. The robots.txt file may also point to sitemaps, which list the URLs the site wants a search engine crawler to visit.

Search engine crawlers use a number of algorithms and rules to determine how frequently a page should be re-crawled and how many pages on a site should be indexed. For example, a page that changes on a regular basis may be crawled more frequently than one that is rarely modified.

What is a search crawler?

A search crawler is a bot that scans web pages and adds them to a search index. "Scanning" means fetching a copy of the HTML on each page and then using it to determine relevance for a search query. When a site is indexed for the first time, the crawler visits a nominated domain and sitemap. It first scans the domain's homepage and adds it to the search index. Then the crawler visits every link on that page and adds each of those pages too, and it keeps scanning pages and following links until all accessible pages are in the index. In a few seconds, a crawler can visit hundreds of pages and add them to the index. This method of following links to discover new pages is exactly how web search engines like Google work.
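To make that crawl loop concrete, here is a minimal sketch of a link-following crawler in Python. It uses only the standard library, checks robots.txt with urllib.robotparser, and stores each fetched page's HTML keyed by URL. The start URL and the max_pages limit are placeholder assumptions, and a real search engine crawler is vastly more sophisticated (politeness delays, rendering, scheduling, deduplication, and so on); treat this purely as an illustration of the crawl-follow-store cycle described above.

    # Minimal sketch of a link-following crawler (illustrative only).
    # Assumptions: the start URL is reachable, pages are plain HTML, and
    # "max_pages" stands in for a real crawl-budget policy.
    from collections import deque
    from html.parser import HTMLParser
    from urllib import robotparser
    from urllib.parse import urljoin, urldefrag
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href value of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=100):
        # Step 1: read the site's robots.txt to learn which paths may be crawled.
        robots = robotparser.RobotFileParser(urljoin(start_url, "/robots.txt"))
        robots.read()

        index = {}                  # url -> raw HTML, a stand-in for a search index
        queue = deque([start_url])  # frontier of URLs waiting to be crawled
        seen = {start_url}

        while queue and len(index) < max_pages:
            url = queue.popleft()
            if not robots.can_fetch("*", url):
                continue            # obey the robots.txt rules for this path
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
            except OSError:
                continue            # skip pages that fail to download
            index[url] = html       # Step 2: store the page so it can be indexed

            # Step 3: follow every hyperlink on the page to discover new URLs.
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                absolute, _fragment = urldefrag(urljoin(url, href))
                if absolute.startswith(start_url) and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return index

Calling crawl("https://example.com/") on a small site would return a dictionary of every reachable page under that domain, which is the raw material the indexing step works on next.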
Indexing

After finding a page, a bot fetches (or renders) it much as your browser does. That means the bot should "see" what you see, including images, videos, and other types of dynamic page content. The bot organizes this content into categories, including images, CSS and HTML, text and keywords, and so on. This process allows the crawler to "understand" what is on the page, a necessary precursor to deciding which keyword searches the page is relevant for.

Search engines then store this information in an index, a giant database with a catalog entry for every word seen on every indexed web page. Google's index takes up around 100,000,000 gigabytes and fills "server farms," thousands of computers around the globe that never get turned off.

Make sure crawlers "see" your site the way you want them to, and control which parts of the site you allow them to index:

• URL Inspection Tool: If you want to know what crawlers see when they land on your site, use the URL Inspection Tool. You can also use the tool to find out why crawlers aren't indexing a page, or to request that Google crawl it.
• Robots.txt: You won't want crawlers to show every page of your site in the SERPs; author pages or pagination pages, for example, can be excluded from indexes. Use a robots.txt file to control access by telling bots which pages they can crawl.

Blocking crawlers from routine areas of your site won't hurt your search rankings. Rather, it helps crawlers focus their crawl budget on the most important pages.

Ranking

In the final step, search engines sort through the indexed information and return the right results for each query. They do that with search algorithms: rules that analyze what a searcher is looking for and which results best answer the query. Algorithms use numerous factors to judge the quality of the pages in their index. Google leverages a whole series of algorithms to rank relevant results. Many of the ranking factors used in these algorithms analyze the general popularity of a piece of content and even the qualitative experience users have when they land on the page. These factors include:

• Backlink quality
• Mobile-friendliness
• "Freshness," or how recently content was updated
• Engagement
• Page speed

To make sure the algorithms are doing their job properly, Google uses human Search Quality Raters to test and refine them. This is one of the few times when humans, not programs, are involved in how search engines work.

Search engines want to show the most relevant, usable results. This keeps searchers happy and ad revenue rolling in. That's why most search engines' ranking factors are the same factors that human searchers judge content by, such as page speed, freshness, and links to other helpful content. When designing and refreshing websites, optimize page speed, readability, and keyword density to send positive ranking signals to search engines. Working to improve engagement metrics such as time-on-page and bounce rate can also help boost rankings.
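To tie the indexing and ranking steps together, here is a toy sketch in Python of an inverted index and a simple scoring pass. The tokenizer, the term-frequency scoring, and the sample pages are simplifications invented for illustration; real search engines combine far richer signals, such as the backlink quality, freshness, engagement, and page speed factors listed above, over enormously larger data structures.

    # Toy sketch of indexing and ranking (illustrative only).
    # Assumptions: "pages" maps URL -> plain text already extracted by a crawler,
    # and relevance is approximated by simple term-frequency counts.
    import re
    from collections import defaultdict

    def tokenize(text):
        """Lowercase a page's text and split it into word tokens."""
        return re.findall(r"[a-z0-9]+", text.lower())

    def build_index(pages):
        """Indexing: map every word to the pages it appears on, with a count."""
        index = defaultdict(dict)          # word -> {url: occurrences}
        for url, text in pages.items():
            for word in tokenize(text):
                index[word][url] = index[word].get(url, 0) + 1
        return index

    def rank(index, query, top_n=10):
        """Ranking: score each page by how often the query terms appear on it."""
        scores = defaultdict(int)          # url -> score
        for word in tokenize(query):
            for url, count in index.get(word, {}).items():
                scores[url] += count
        # Return the highest-scoring pages first, like a miniature results page.
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]

    # Example usage with made-up pages:
    pages = {
        "https://example.com/": "Search engines crawl, index, and rank web pages.",
        "https://example.com/seo": "SEO helps pages rank higher in search results.",
    }
    index = build_index(pages)
    print(rank(index, "rank search pages"))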