Google API Leak: Comprehensive Review and Guidance

The SEO World is Rocked by a Massive Google Leak

First, a note on length: this article is long because there is a great deal of ground to cover, and all of it is interesting enough that a full TL;DR proved difficult. If you prefer to skim, you can find the relevant action items under each related section, highlighted in yellow.


Intro:

Just when you thought the SEO world couldn't get any crazier following a series of site-shattering updates, a bombshell document leak has turned everything on its head. In May 2024, internal documentation from Google's Content Warehouse API was exposed, offering an unprecedented look under the hood.

The leak originated from Google's own GitHub repository and was publicly accessible for about 6 weeks before being removed on May 7th.  During that time, the documentation was widely shared within the SEO community.

First off, is this whole thing legit?

To verify the authenticity of the leaked information, multiple former Google employees reviewed the documents and confirmed that they appeared to be legitimate. Industry expert Rand Fishkin obtained the leaked files from an anonymous source, who later revealed himself as SEO practitioner Erfan Azimi. Azimi said he had spoken with former Google employees to confirm the documents' legitimacy. Fishkin then enlisted technical SEO guru Mike King to analyze the vast amount of data.

After an extensive review, King concluded that the documentation is legitimate and lines up with details from sources with deeper knowledge of Google's internal operations.

What Was Leaked?

We are talking about a staggering 2,500 pages of technical documentation detailing over 14,000 attributes and features that Google's search algorithm potentially uses to rank websites. This provides unprecedented insight into how Google evaluates and weights various elements such as links, user engagement, site authority, and page content for ranking purposes.

Key revelations from the leaked files included confirmation that Google uses data from Chrome to influence rankings, employs whitelists for sensitive search topics, and considers factors like author expertise and brand mentions. Additionally, the documents highlight Google's efforts in spam detection and quality signals through features such as SpamBrain and Quality Rater feedback.

The leaked files reveal details about systems like:

  • User Engagement Metrics: Contrary to Google’s previous statements, the documents suggest that user engagement metrics such as clicks, impressions, and Chrome data play a significant role in rankings.
  • NavBoost and Glue: Systems like NavBoost and Glue utilize clickstream data to influence search rankings, using user behavior to boost or demote site visibility.
  • PageRank Variants: The documents reveal multiple types of PageRank, including deprecated versions, indicating the evolution of Google's ranking strategies.
  • Spam Detection and Quality Signals: Features like SpamBrain and Quality Rater feedback are integrated into the ranking process, highlighting Google's efforts to maintain search result quality.
  • Vertical Optimization: Methods for identifying different site business models, such as news, ecommerce, and personal blogs.

Detailed Descriptions of Modules and Features:

The leaked documents provide descriptions of various specific modules and features that are integral to Google's ranking systems. It's worth noting these down, as they will help you understand both the algorithm and some specific points mentioned throughout this article.

 Here are some notable ones:

  • Craps: This module is related to click and impression signals. It includes metrics like bad clicks, good clicks, last longest clicks, unsquashed clicks, and unsquashed last longest clicks. These metrics measure the success of search results based on user click behavior.
  • PerDocData: This module mentions the hostAge attribute, which is used to sandbox fresh spam at serving time. This confirms that Google sandboxes new content based on the host's age, affecting how visible new content is in search results.
  • RealTime Boost: This system uses data from the Chrome browser to influence search rankings. Metrics like total Chrome views for a site and Chrome transition clicks (chrome_trans_clicks) are considered, emphasizing the importance of optimizing for Chrome user behavior.
  • NavBoost: This key system utilizes click-driven metrics to boost, demote, or adjust the ranking of web search results. It has been updated to use a 13-month data window and focuses on web search results, while a related system called "Glue" handles ranking for other universal search verticals.
  • Mustang: Identified as the primary scoring, ranking, and serving system. It encompasses various scoring algorithms and serves as the backbone for ranking processes.
  • Twiddlers: Re-ranking functions that operate after the primary search algorithm. They adjust information retrieval scores or change the ranking of documents just before presenting them to the user. Examples include FreshnessTwiddler for document freshness and QualityBoost for enhancing quality signals (a minimal re-ranking sketch follows this list).
  • Trawler: The web crawling system that maintains crawl rates and understands how often pages change. It plays a crucial role in keeping the index updated with the latest web content.
  • HtmlrenderWebkitHeadless: A rendering system for JavaScript pages. It originally used Webkit but later transitioned to Headless Chrome, highlighting the importance of rendering JavaScript for search indexing.
  • Alexandria and TeraGoogle: Core indexing systems where Alexandria handles primary indexing, and TeraGoogle manages long-term document storage on disk.
  • SuperRoot: The central system that coordinates queries and manages the post-processing system for re-ranking and presenting search results.
  • SnippetBrain: The system responsible for generating snippets for search results, ensuring relevant and concise information is displayed in the search results page.
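
Twiddlers in particular are easier to picture with a tiny example. Below is a minimal Python sketch of a freshness-style re-ranker that nudges scores after the primary pass; the class, the 90-day window, and the 1.1x boost are all illustrative assumptions, not anything taken from the leaked documentation.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ScoredDoc:
    url: str
    ir_score: float       # score coming out of the primary ranking pass (e.g. Mustang)
    last_updated: date

def freshness_twiddler(docs: list[ScoredDoc], today: date, boost: float = 1.1) -> list[ScoredDoc]:
    """Toy re-ranker in the spirit of a FreshnessTwiddler.

    The 90-day window and 1.1x boost are invented for illustration.
    """
    for doc in docs:
        if (today - doc.last_updated).days <= 90:
            doc.ir_score *= boost
    # Twiddlers adjust scores just before results are presented, so re-sort here.
    return sorted(docs, key=lambda d: d.ir_score, reverse=True)
```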

Additionally, the leak confirmed some of the biggest "lies" that Google has been accused of spreading for years regarding things they allegedly don't use for rankings:

  • Click data and dwell time metrics
  • Chrome user data
  • Whitelists for preferred sites in verticals like travel and health
  • Quality rater feedback from human evaluators
  • A "sandbox" for new websites

While the leak lacked full context on how the signals are weighted and combined, it appeared to contradict many of Google's past public statements on its ranking systems. The SEO community continues to analyze the 2,500+ pages of documentation to extract insights and identify potential changes to SEO best practices.

Let’s dig into the contents of the Google Search API leak, and examine what it reveals about Google's ranking factors, how it aligns with or contradicts Google's previous statements, and what it means for SEO moving forward.

Google’s Response:

Following the leak, many in the SEO community waited for an official response from Google. On May 29, 2024, Google finally broke its silence:

"We would caution against making inaccurate assumptions about Search based on out-of-context, outdated, or incomplete information. We've shared extensive information about how Search works and the types of factors that our systems weigh, while also working to protect the integrity of our results from manipulation."

Google's statement emphasizes a few key points:

  1. The leaked information may be outdated or incomplete and lacks full context.
  2. Drawing conclusions based solely on the leak could lead to inaccurate assumptions.
  3. Google has already shared extensive information about its search systems through official channels.
  4. Specific details are often withheld to prevent manipulation and protect the integrity of results.

Notably, Google did not outright deny the authenticity of the leaked documentation. Instead, they focused on downplaying its significance and completeness. When pressed for comment on specific factors mentioned in the leak that contradict past statements, Google declined to address them.

However, some in the SEO community have criticized this stance, arguing that the leak exposed a pattern of contradictions and that Google should be more transparent. Many feel that while specifics may need to be guarded, a general acknowledgment and explanation of the discrepancies is warranted.

For now, it appears that Google's official stance is to minimize discussion of the leak and redirect focus to its existing public documentation and communications. While unsatisfying to some, this approach is consistent with Google's priorities of protecting its algorithms and preventing manipulation.

As the SEO community continues to analyze and debate the leaked information, pressure may mount for Google to provide further clarification. However, a sudden shift to full transparency is likely unrealistic. The most probable outcome is a gradual increase in confirmations of general concepts and practices, without delving into the specific details laid bare by the leak.

πŸ“°
Update!

Google has responded to some of the details, such as NavBoost and clicks; you can read more here.

What the leak revealed: 

There is a great deal of information in the leaked documents, so let's break it up into relevant categories:

The Relationship to Panda

The leaked documents shed light on the relationship between Google's current ranking systems and its historical Panda algorithm. Key points include:

Panda's Legacy

  • Quality Focus: Panda, launched in 2011, aimed to reduce the rankings of low-quality sites and promote high-quality content.
  • Content Quality: The leaked documents show that many principles from Panda still influence current algorithms, particularly in assessing content quality and user engagement.

Similar Systems and Metrics

  • Baby Panda References: The documents mention "Baby Panda," which refers to updates or variations of the original Panda system, indicating its ongoing relevance.
  • Quality Signals: Systems like SpamBrain and Quality Rater feedback continue Panda's mission of prioritizing quality content. Metrics such as content originality, user engagement, and site authority reflect Panda's foundational concepts.

Integration with Modern Algorithms

  • Embeddings and Contextual Understanding: Panda's emphasis on content relevance is now integrated with advanced techniques like site and page embeddings, which analyze topical focus and context.

The leaked documents highlight that while Google's ranking algorithms have evolved significantly since Panda's introduction, many of its core principles remain integral. The focus on content quality, user engagement, and combating spam are enduring legacies of the Panda algorithm, now enhanced by sophisticated data analysis and machine learning techniques.

User Engagement Signals and Click Data

The leaked API documentation confirms that Google indeed uses click data and various user engagement metrics as signals in its ranking systems. This contradicts years of public statements from Google downplaying or denying the use of such signals.

The NavBoost System

The documents reference a key system called "NavBoost," which utilizes click-driven metrics to boost, demote, or otherwise adjust the ranking of web search results.

Google VP Pandu Nayak confirmed the existence of NavBoost in the DOJ antitrust case, stating that the system has used a rolling 18-month window of click data since around 2005. Recently, it was updated to use a 13-month data window and focuses specifically on ranking web search results, while a related system called "Glue" handles ranking for other universal search verticals.
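
To picture what a rolling click-data window could look like in practice, here is a minimal Python sketch that counts clicks per (query, URL) pair while discarding events older than roughly 13 months. The data layout and function name are assumptions for illustration; the leak does not describe how NavBoost is actually implemented.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(days=13 * 30)  # roughly the 13-month window Nayak described

def aggregate_click_window(click_log, now: datetime) -> Counter:
    """click_log: iterable of (timestamp, query, url) tuples.

    Counts clicks per (query, url), ignoring anything older than the window.
    """
    cutoff = now - WINDOW
    counts = Counter()
    for ts, query, url in click_log:
        if ts >= cutoff:
            counts[(query, url)] += 1
    return counts
```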

Sandboxing and HostAge: The leaked documents reveal that Google employs a sandboxing mechanism for new content based on the host's age. This means that fresh content may be temporarily sandboxed, affecting its visibility in search results.

  • HostAge Attribute: Mentioned in the PerDocData module, the hostAge attribute is used to sandbox fresh spam during serving time. This helps Google manage the quality of new content and prevent spam from ranking prematurely.
  • Impact on New Content: New websites or content may be placed in a sandbox period where their visibility is limited until they establish credibility and trust signals.
πŸ‘€
SEO TAKE AWAYS: Be aware of the sandboxing effect on new content. To minimize the impact, focus on building strong initial credibility through high-quality content, authoritative backlinks, and user engagement. Monitor the performance of new content and continue optimizing it to gain trust signals and exit the sandbox period more quickly. It's good to have a launch plan in place for any important content pieces you want higher visibility on.

The leaked documents support Nayak's statements, containing multiple references to features like:

  • goodClicks
  • badClicks
  • lastLongestClicks
  • Impressions
  • "Unicorn" clicks

These appear to be various ways Google measures the success or failure of a search result based on user click behavior. For example, a "good" click likely represents a user clicking through and spending significant time on the page, while a "bad" click is a quick bounce back to the search results.
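
Here is a deliberately simple sketch of how clicks could be bucketed along those lines. The 30-second cutoff and the helper names are invented for illustration; the leak names the signals (goodClicks, badClicks, lastLongestClicks) but not how they are computed.

```python
def classify_click(dwell_seconds: float, returned_to_serp: bool) -> str:
    """Toy heuristic: a quick bounce back to the results page counts as a 'bad' click,
    anything else as a 'good' click. The 30-second cutoff is an invented threshold."""
    if returned_to_serp and dwell_seconds < 30:
        return "badClick"
    return "goodClick"

def last_longest_click(session_clicks: list[tuple[str, float]]):
    """Return the (url, dwell_seconds) pair with the longest dwell in a session,
    roughly the idea behind 'lastLongestClicks'."""
    return max(session_clicks, key=lambda c: c[1]) if session_clicks else None
```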

πŸ‘€
SEO Take Aways:
πŸ‘‰ Craft compelling titles and meta descriptions to boost CTR.

πŸ‘‰ Enhance site speed and usability to reduce bounce rates.

πŸ‘‰ Create engaging content that encourages users to stay longer and explore more pages.

Leveraging Chrome Data

The documents also reveal that Google calculates metrics related to individual pages and entire websites using data from the Chrome web browser. This aligns with claims that one of the key motivations for creating Chrome was to obtain broad user clickstream data to improve search rankings.

Chrome Data Metrics: The documents indicate that Google calculates metrics such as total Chrome views for a site (chromeInTotal), most visited URLs based on Chrome click data (topUrl), and Chrome transition clicks (chrome_trans_clicks). These metrics help Google understand user interactions and engagement with websites.

Impact on Rankings: The integration of Chrome data into Google's ranking algorithms highlights the importance of optimizing for user behavior on Chrome. Sites that perform well in terms of user engagement on Chrome are likely to benefit in search rankings.

This suggests Google can leverage clickstream data from Chrome users to identify important pages, measure engagement, and factor that into its ranking systems.
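
To illustrate the kind of aggregation being described, the small sketch below derives a site-level view total and a "top URLs" list from per-URL visit counts. The input format and function name are hypothetical; only the metric names (chromeInTotal, topUrl) come from the leak.

```python
def site_chrome_metrics(page_views: dict[str, int], top_n: int = 5) -> dict:
    """page_views: {url: chrome_view_count} for the pages of a single site."""
    chrome_in_total = sum(page_views.values())                                # ~ chromeInTotal
    top_urls = sorted(page_views, key=page_views.get, reverse=True)[:top_n]   # ~ topUrl
    return {"chromeInTotal": chrome_in_total, "topUrl": top_urls}
```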

With the revelation that systems like NavBoost use click data to influence rankings, SEOs should:

πŸ‘€
SEO Take Aways

πŸ‘‰ Monitor and analyze click data more rigorously.

πŸ‘‰ Implement strategies to increase genuine clicks, such as improving the visual appeal of search snippets and providing clear, enticing calls to action.

πŸ‘‰ Leverage tools that provide real-time analytics to track user behavior patterns.

πŸ‘‰ Optimize sites to perform well on Chrome, considering it might impact rankings more than previously thought.

πŸ‘‰ Focus on improving site speed, usability, and overall user experience.

πŸ‘‰ Use analytics tools to monitor and respond to user interactions, ensuring your site performs well for Chrome users.

Contradicting Public Statements

For years, Google has downplayed the role of clicks and user data in rankings through statements like:

"Using clicks directly in rankings would be a mistake." - Gary Illyes
"Dwell time, CTR, whatever Fishkin's new theory is, those are generally made up crap." - Gary Illyes on Rand Fishkin's research
"We don't use Chrome browsing data for ranking purposes." - John Mueller

The leaked documentation, combined with Nayak's DOJ testimony, clearly contradicts these types of statements from official Google representatives.

The implications for SEOs

The implications for SEOs are significant. Optimizing for user engagement signals like click-through rates from search results and minimizing "pogo-sticking" back to Google after a click may need to become a bigger focus area. Driving qualified traffic to your site through channels like social media and email can also help reinforce positive engagement patterns.

However, it's important not to lose sight of creating a quality user experience and valuable content. As the documentation shows, Google still heavily weighs relevance and quality signals as well. The best approach is to align engagement optimization as a complement to your overall SEO strategy, not a replacement for quality content.

Whitelists for Sensitive Topics

The leaked API documentation suggests that Google uses whitelists to control which websites are allowed to rank prominently for certain types of sensitive or controversial queries. Google appears to prioritize categories like health, news, civics, and other "Your Money or Your Life" (YMYL) topics.

A few specific examples are referenced in the documents:

Travel Whitelist

The documents mention a "Good Quality Travel Sites" module, indicating the existence of a whitelist for the travel vertical. This suggests that Google may prioritize authoritative sources like major booking sites, travel guides, and review platforms for travel-related queries.

This could help explain why some smaller travel blogs struggle to rank well for broad travel terms, despite having quality content. Google may be explicitly prioritizing established brands in this space.

Covid-19 Whitelist

During the Covid-19 pandemic, the documents indicate Google used an "isCovidLocalAuthority" flag to identify trusted local health authorities to display prominently for Covid-related searches. This was likely done to elevate factual sources from official organizations like the CDC, WHO, and local government health departments to combat misinformation.

Election Authority Whitelist

The documents mention an "isElectionAuthority" attribute, which was likely used to control what websites were allowed to rank for queries around elections and political issues. This selective approach aimed to promote authoritative sources for civic-related searches and avoid amplifying misinformation.

The existence of these whitelists shows Google's awareness that some topics are too important or sensitive to allow just any website to rank prominently. By prioritizing authoritative and trustworthy sources, they aim to provide high-quality information and facts from established entities.

For sectors like travel, health, news, and civics, this underscores the importance of building a strong, respected brand and earning Google's trust as an authoritative voice in your field. Smaller, less established sites may face an uphill battle ranking for broad terms in these categories if Google is explicitly whitelisting major brands.

However, the leaked documents don't provide insight into how granular these whitelists are applied. They could be limited only to the most broad, high-profile queries in these verticals. More niche topics may still allow for better ranking opportunities from quality, niche-relevant sources.

Human Quality Rater Data

In addition to whitelists, the leaked documents mention Google potentially using data from its human quality raters to influence rankings. This includes specific mentions of:

  • Relevance ratings from quality rater evaluations
  • Human ratings from Google's quality rater platform called "EWOK"

The role and impact of these human quality ratings is unclear from the leaked information. However, it suggests that in addition to machine learning systems, Google may incorporate manual curation and human judgment into its ranking systems, at least for certain types of queries or websites.

The approach makes sense, as Google would likely want an additional layer of human oversight for its most critical search verticals like health, news, finance, etc. Allowing algorithms alone to fully determine rankings for high-stakes topics could be seen as too risky.

For website owners and SEOs, this is another indicator of the importance Google places on brand reputation, authority, and trust signals. If human quality raters are indeed evaluating sites for these factors, it's critical to establish a strong brand presence and focus on expertise, authoritativeness and trustworthiness, especially for sites operating in sensitive verticals.

Overall, the leaked information around whitelists and human raters reveals Google's multi-layered approach to controlling its search results, especially for categories where low-quality information could be problematic or even dangerous. While the existence of whitelists may be frustrating for smaller brands trying to break into these spaces, Google's motivation is to provide high-quality, authoritative results on important topics. However, this may perpetuate an echo chamber where newer voices in the space are never heard or granted authority.

The documents reaffirm the significance of content quality, with systems like SpamBrain and Quality Rater feedback playing crucial roles.

πŸ‘€
SEO Take aways
πŸ‘‰ Focus on producing original, high-quality content that meets the needs of users. If utilizing AI, make sure to have a strong human-written component and consider how to make your content unique.

πŸ‘‰ Regularly update and refresh content to maintain its relevance and accuracy.

πŸ‘‰ Ensure the content is topically relevant and includes multiple types of resources (images, videos, etc.).

Link Signals

The leaked documentation provides a significant amount of new information about how Google evaluates and weights different types of links for ranking purposes. This contradicts the recent narrative that links are becoming less important as a ranking signal, but it does show how this signal is evolving to better account for the relevance and engagement of a placed link.

Link Indexing Tiers

One significant revelation is the existence of different "indexing tiers" at Google, which categorize websites and pages based on their importance and authority. The higher the tier a page is indexed in, the more valuable and heavily weighted its outgoing links become.

The documentation makes multiple references to a metric called "sourceType" which shows the relationship between where a page is indexed and its link value. For example:

"sourceType: 1 means the source is in the primary index (e.g. fresh, high quality content)"

This suggests that links from frequently updated, high-quality pages in Google's primary index carry more ranking weight than links from less frequently updated or lower-quality sources.

Link Scoring Based on Traffic

Additionally, Google scores links higher if they originate from pages with higher overall traffic and user engagement. The document discusses "topUrl" which represents "A list of top urls with highest two_level_score, i.e., chrome_trans_clicks."

This suggests that Google uses data from Chrome browsers to identify the most visited pages on a website and weights links from those popular pages more heavily than links from less-visited parts of the site. This may also help explain the decisions behind sitelinks in SERPs and why webmasters do not have control over those features.

So not only does the overall authority of the linking website matter, but the specific page-level popularity and engagement metrics factor into how much value Google assigns to a link.
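
Putting those two ideas together, here is a hedged sketch of how a single link's value might be weighted by its source page's indexing tier and traffic. The tier labels, multipliers, and log-scaled click factor are invented for illustration; the leak only indicates that sourceType and Chrome click data matter, not how they are combined.

```python
import math

# Hypothetical multipliers per indexing tier; the real weights are unknown.
TIER_WEIGHT = {"primary_flash": 1.0, "secondary_ssd": 0.6, "archive_hdd": 0.3}

def link_value(source_tier: str, source_page_clicks: int, base_value: float = 1.0) -> float:
    """Weight a single link by the indexing tier of its source page and that page's traffic."""
    tier = TIER_WEIGHT.get(source_tier, 0.3)
    popularity = math.log1p(source_page_clicks)   # diminishing returns on raw click counts
    return base_value * tier * (1 + popularity)
```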

Relevance Is Critical

The documentation also indicates that Google evaluates the topical relevance between the content of the linking page and the target website receiving the link. The "anchorMismatchDemotion" suggests that Google may ignore or downgrade links when there is a disconnect between the topics of the source content and the destination site.

Earning links from high-quality, popular pages is great, but those links need to be coming from contextually relevant content to pass maximum value.
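
As a rough illustration of what an anchor/topic mismatch check might involve, the sketch below compares the words on the linking page with the words on the destination page and demotes the link when overlap is low. The Jaccard measure, 0.1 threshold, and 0.2 demotion factor are arbitrary illustrative choices, not anything documented in the leak.

```python
def topical_overlap(source_text: str, target_text: str) -> float:
    """Very crude topical similarity: Jaccard overlap of lowercase word sets."""
    a, b = set(source_text.lower().split()), set(target_text.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def adjusted_link_value(value: float, source_text: str, target_text: str,
                        min_overlap: float = 0.1, demotion: float = 0.2) -> float:
    """Demote a link's value when its source and target look topically unrelated."""
    if topical_overlap(source_text, target_text) < min_overlap:
        return value * demotion
    return value
```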

Link Analysis and Indexing Tiers: The leaked documents provide insights into how Google stratifies its index into different tiers. Links from higher-tier content, which are frequently updated and of higher quality, are considered more valuable.

  • Indexing Tiers: Google categorizes websites and pages based on their importance and authority into different indexing tiers. Higher-tier pages, often stored in flash memory due to their freshness and quality, provide more valuable links.
  • Link Quality Metrics: Metrics like "sourceType" show the relationship between the indexing tier of a page and its link value. For instance, links from frequently updated, high-quality pages (higher-tier) carry more ranking weight.
  • Spam Detection: Metrics such as "phraseAnchorSpamDays" measure link velocity and spam, helping Google identify and nullify potential spam attacks.

Links from the Same Country/Region

Another interesting note is that the documentation references "localCountryCodes" which tracks the countries that a linking page is most relevant for. This suggests that Google may weigh links more heavily when they come from a website that is geolocated and targeted to the same country/region as the destination website.

For example, a U.S. business may see more ranking benefit from a link on a U.S. news website than from an equally popular site based in another country. This has been a suspicion for some time, as it may be easier for link spammers to build up a large network of foreign sites.

Anchor Text Still Matters

While some believed anchor text had become an obsolete ranking factor, the leaked documents show Google is still analyzing anchor text, at least for penalties if not for keyword identification.

There are references to scoring "phraseAnchorSpam" and an "anchorSpamPenalizer" which suggests Google has ways to demote links with over-optimized anchor text.

The documentation also states that Google tracks the "average weighted font size" of anchor text on pages. This implies that giving more visual emphasis and styling to links could be a way to indicate their importance.
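
Here is a small, hypothetical sketch of the kind of anchor-profile check a phraseAnchorSpam-style signal might perform: flag a page whose inbound anchors are dominated by one exact-match phrase, or whose anchors arrived in an unnaturally short burst. The thresholds and return values are assumptions.

```python
from collections import Counter
from datetime import date

def anchor_profile_flags(anchors: list[tuple[str, date]],
                         max_exact_share: float = 0.5,
                         max_per_day: int = 50) -> list[str]:
    """anchors: list of (anchor_text, first_seen_date) pointing at one target URL."""
    flags = []
    texts = [text.lower() for text, _ in anchors]
    if texts:
        top_share = Counter(texts).most_common(1)[0][1] / len(texts)
        if top_share > max_exact_share:
            flags.append("over-optimized anchor text")   # the pattern an anchorSpamPenalizer might target
    per_day = Counter(seen for _, seen in anchors)
    if per_day and max(per_day.values()) > max_per_day:
        flags.append("suspicious link velocity")          # the burst behavior phraseAnchorSpamDays might track
    return flags
```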

Overall, the leaked data portrays links as continuing to be a critical piece of Google's ranking systems, with many nuanced factors going into how each link is scored and weighted. While high authority, frequently updated, and relevant pages are ideal for link placements, the documents show Google has sophisticated ways to evaluate all the minute details around links to determine precise value.

For SEOs focused on link building, these revelations emphasize the continued importance of relevance, diversity, and avoiding any perception of manipulation in link acquisition strategies. The fundamentals of high-quality content marketing and PR-driven link building are reinforced as the path to sustainable link equity based on Google's internal processes.

The information about different quality tiers for links and the potential influence of click data on link value suggests SEOs should:

πŸ‘€
Take-Aways
πŸ‘‰ Prioritize obtaining links from high-traffic, authoritative sites.

πŸ‘‰ Place links that are likely to get clicks and engagement.

πŸ‘‰ Gain mentions as well as links.

πŸ‘‰ Focus on the relevance and contextual fit of backlinks rather than sheer volume.

πŸ‘‰ Focus on acquiring high-quality links from top-tier indexed pages.

πŸ‘‰ Ensure that your links are placed on pages that are frequently updated and considered high-quality by Google.

πŸ‘‰ Monitor the rate of link acquisition to maintain a natural link velocity and avoid triggering spam signals.

On-Page Signals

The leaked documents provide insight into many of the on-page factors that Google appears to consider when ranking web pages. While we don't know precisely how these are weighted or combined, the sheer number of potential on-page signals is illuminating.

Page Titles

Despite previous statements from Google downplaying the importance of page titles, the documents reference a "titlematchScore" which suggests titles that closely match the query are given more weight. Optimizing titles for relevant keywords could remain an important factor.

However, there's no evidence of Google using a specific character limit for titles, despite that being a common SEO best practice. The documentation makes no mention of any length constraints. 

When it comes to title tag length, the best practice I go by is keeping titles under 65 characters. This is mainly to make sure they don't get cut off, which supports a higher CTR (and we now know CTR can aid your SEO). However, given the rate at which Google rewrites title tags, it is worthwhile to test rankings, CTR, and traffic when adjusting title length.

Content Quality

The leaked documents reveal multiple factors that Google considers when assessing the quality and depth of a page's main content:

  • textConfidence - Likely a measure of how relevant and trustworthy the text content is for the query
  • ugcDiscussionEffortScore - A score estimating the effort involved in user-generated content like comments
  • OriginalContentScore - Suggests Google can identify and score true original content higher
  • keywordStuffingScore - A spam signal that likely demotes pages for keyword over-optimization

The documents also mention an "effortScore" that may use AI language models to estimate the effort involved in creating an article's content based on factors like multimedia use, depth of research, etc.
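
To make the idea of a stuffing signal concrete, here is a toy Python heuristic that measures how much of a page's body text is taken up by its single most repeated word. The leak names keywordStuffingScore but says nothing about how it is calculated, so everything below is an illustrative assumption.

```python
import re
from collections import Counter

def keyword_stuffing_score(body_text: str) -> float:
    """Share of the page taken up by its single most repeated word.

    A real system would at least strip stop words; this is only a toy illustration.
    """
    words = re.findall(r"[a-z']+", body_text.lower())
    if not words:
        return 0.0
    return Counter(words).most_common(1)[0][1] / len(words)

# keyword_stuffing_score("cheap flights cheap flights book cheap flights") -> ~0.43,
# far above the (invented) 5% level a stuffing check might tolerate.
```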

Font Size Impact: The leaked documents reveal that Google tracks the average weighted font size of terms in documents. This suggests that visual emphasis on text, such as using larger font sizes, could impact rankings.

Topic Relevance

Google uses various metrics to gauge how relevant a page's content is to its main topic, and penalizes pages that deviate too far:

  • siteFocusScore - How focused a site is on a particular topic
  • siteEmbeddings - Compressed vector embeddings representing a site's main topics
  • pageEmbeddings - Similar embeddings, but at the page level
  • siteRadius - Measures how far a page's embeddings deviate from the core site embeddings

This reflects Google's emphasis on entities, topics, and providing comprehensive coverage of a subject rather than keyword-matching.
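
Here is a minimal NumPy sketch of how a site-level "radius" and focus score could be computed from page embeddings: embed every page, average the embeddings into a site vector, and measure how far each page drifts from that centre. The cosine-distance formulation is an assumption; the leak names the attributes but not the math behind them.

```python
import numpy as np

def site_focus_metrics(page_embeddings: np.ndarray) -> dict:
    """page_embeddings: array of shape (n_pages, dim), one embedding per page."""
    site_embedding = page_embeddings.mean(axis=0)                      # ~ siteEmbeddings
    norms = np.linalg.norm(page_embeddings, axis=1) * np.linalg.norm(site_embedding)
    cos_sim = page_embeddings @ site_embedding / np.clip(norms, 1e-9, None)
    return {
        "siteRadius": float((1 - cos_sim).max()),      # worst drift from the site's core topic
        "siteFocusScore": float(cos_sim.mean()),       # higher = pages cluster tightly around one topic
    }
```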

Dates and Freshness

The documentation shows Google tracks multiple date signals:

  • bylineDate - The date explicitly specified on the page
  • syntacticDate - Dates extracted from URLs, titles, etc.
  • semanticDate - Dates derived from analyzing the page content
  • lastGoodClick - Date of the last time a user clicked through from search and stayed on the page

This highlights the importance of clearly specifying accurate published/updated dates and regularly refreshing content, especially for time-sensitive queries.
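
As an illustration of how these date signals can differ, the sketch below pulls a "syntactic" date out of a URL and checks it against an explicit byline date, flagging pages where the two disagree. The regex and the consistency rule are illustrative assumptions.

```python
import re
from datetime import date
from typing import Optional

def syntactic_date_from_url(url: str) -> Optional[date]:
    """Pull a /YYYY/MM/DD/ style date out of a URL path, if one is present."""
    m = re.search(r"/(\d{4})/(\d{2})/(\d{2})/", url)
    return date(int(m.group(1)), int(m.group(2)), int(m.group(3))) if m else None

def dates_consistent(byline: Optional[date], url: str) -> bool:
    """True when the byline date and the URL-derived date agree (or one is absent)."""
    url_date = syntactic_date_from_url(url)
    if byline is None or url_date is None:
        return True
    return byline == url_date
```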

Overall, the leaked data reinforces established on-page best practices, while shedding a light on newer areas like Google's apparent ability to estimate content creation effort and use of advanced language models.

While these on-page factors are revealing, they represent just one piece of Google's ranking systems. The documents also provide insight into off-page signals like links, user data, and overall site authority calculations that are covered in other sections.

πŸ‘€
Take Aways:
πŸ‘‰ Optimize page titles to include target keywords, as the "titlematchScore" signal measures how well titles match user queries.

πŸ‘‰ Put important keywords towards the front of titles.

πŸ‘‰ Create high-quality, in-depth content to score well on the "EffortScore" which measures the effort put into content creation.

πŸ‘‰ Include multimedia like images, videos, data visualizations etc.

πŸ‘‰ Focus on creating content closely aligned with your site's core topics to maximize the "SiteFocus" score which measures topical relevance.

πŸ‘‰ Avoid deviating too far from your main subject areas.

πŸ‘‰ Pay close attention to entity salience and ensure your target entities are prominently mentioned and described well, as entity identification appears to be a factor.

πŸ‘‰ Ensure consistent use of dates across structured data, URLs, page content etc. to provide clear freshness signals.

πŸ‘‰ Optimize for originality over simply creating long-form content, as there are signals measuring content uniqueness and duplication.

πŸ‘‰ For sites with user-generated content, implement robust moderation and prompting to maximize the "ugcDiscussionEffortScore" signal.

πŸ‘‰ Leverage author expertise signals by having content authored by credible subject matter experts who publish across multiple authoritative sites.

Site Signals

In addition to on-page factors, the leaked documents provide insight into many site-wide metrics that Google appears to consider when ranking websites. These suggest Google goes well beyond just individual page analysis, which is unsurprising to those practicing SEO but good to have validation of best practices.

Site Authority

One key finding is that Google calculates and stores a "siteAuthority" metric for websites, contradicting past statements denying the existence of such a measurement. While the documentation doesn't specify how siteAuthority is calculated, it is listed as one of the "Compressed Quality Signals" used in Google's main scoring systems.

This indicates that Google has a site-wide authority metric similar to third-party authority scores like Moz's Domain Authority or Ahrefs' Domain Rating, but likely more sophisticated.

Site Embeddings and Topic Focus

The documents reference techniques Google uses to understand the main topics a website covers through embedding models:

  • siteEmbeddings - Compressed vector embeddings representing a site's core topics
  • siteFocusScore - A score of how focused a site is on particular topics
  • siteRadius - Measures how far any given page's content deviates from the core site embeddings

This aligns with Google's increasing emphasis on entities and comprehensive topic coverage rather than just keyword matching. The siteRadius metric suggests Google may demote pages that stray too far from a site's established topics.

Site Traffic and Engagement

Several features point to Google using site-wide traffic and engagement metrics from sources like Chrome browsers:

  • chromeInTotal - Total views/visits to the site from Chrome
  • siteImpressions - Number of search impressions across the entire site
  • siteClicks - Number of clicks from search results to the site

This indicates that in addition to specific page signals, Google may factor in the overall popularity and user engagement with an entire website when calculating rankings.

Site Freshness and Update Recency

While the leaked data doesn't reveal much about how Google treats updated content, it does mention a "lastGoodClick" metric which appears to be the date of the last time a user landed on a page from search and stayed. This suggests Google tracks site-wide freshness.

There are also references to Google identifying "significant" updates to pages versus just minor updates, hinting that sites making substantive updates may get a freshness boost.

Overall, the site-level metrics revealed reinforce that Google takes a holistic view that goes beyond just individual pages. A website's authority, topical focus, overall traffic levels, and freshness of content all seem to be factored into Google's systems.

This highlights the importance of publishing websites with a well-defined topical area of focus, maintaining a steady stream of quality content, and building a strong brand that attracts direct traffic and engagement signals.

πŸ‘€
Take Aways
πŸ‘‰ Focus on building a well-defined topical area of focus for your website, as Google appears to measure how focused a site is on particular topics through metrics like "siteFocusScore" and "siteEmbeddings".

πŸ‘‰ Maintain a steady stream of fresh, high-quality content to signal ongoing updates and relevance, as Google tracks signals like "lastGoodClick" which is the date of the last time a user clicked through from search and stayed on the page.

πŸ‘‰ Invest in brand-building activities to establish a strong brand presence, as there are indications Google considers brand signals like "siteNavBrandingScore" and "siteNavBrandQualityScore".

πŸ‘‰ Ensure your content stays within the core topic focus of your site, as there are metrics like "siteRadius" that measure how far a page's content deviates from the main site topics.

πŸ‘‰ Monitor and optimize for overall site traffic and engagement metrics like "chromeInTotal" (total Chrome browser views to the site) and "siteImpressions/siteClicks" (search impressions and clicks to the site), as these appear to be factored in.

πŸ‘‰ Utilize comprehensive analytics tools to track and respond to user behavior on your site.

πŸ‘‰ Ensure a seamless user journey across all touchpoints, from search to site navigation.

Other Key Factors

In addition to the major areas already covered, the leaked documents provide insight into several other interesting factors that appear to influence Google's ranking systems.

Brand Signals

There are multiple references suggesting Google considers brand-related signals when ranking websites:

  • The "GoogleApi.ContentWarehouse.V1.Model.QualityNsrNsrData" module mentions attributes like "siteNavBrandingScore" and "siteNavBrandQualityScore" which seem to be measures of how well a site conveys its brand through its navigation and overall quality.
  • The "GoogleApi.ContentWarehouse.V1.Model.NavBoostDocData" module has a "navBrandWeight" attribute, implying click data is weighted differently for navigational (brand) queries.
  • There are references to Google tracking searches for brand names versus non-branded queries about topics/products.

This aligns with Google's increasing prioritization of brands and brand entities in recent years. Having a strong, authoritative brand presence that users are familiar with and search for by name appears to provide ranking benefits.

Author Signals

While not as prominent as brand signals, there are some indications that author authority and reputation is a consideration:

  • The "WebrefMentionRatings" module suggests Google can identify authors of documents and has ways to score the prominence/authority of those authors.
  • It references an "authorReputationScore" which seems to be a measure of the author's reputation/authority.

This is particularly interesting given Google's past statements downplaying the importance of author expertise as part of its E-E-A-T guidelines. The documentation implies author authority could play more of a direct role than Google previously let on.

Vertical-Specific Algorithms

Several modules point to Google potentially using different ranking systems, factors, or adjustments for specific verticals like:

  • Travel ("GoogleApi.ContentWarehouse.V1.Model.QualityTravelSitesData")
  • News ("GoogleApi.ContentWarehouse.V1.Model.NewsPublisherScores")
  • Video ("GoogleApi.ContentWarehouse.V1.Model.VideoContentSearchScoringSignals")
  • Shopping/E-commerce ("GoogleApi.ContentWarehouse.V1.Model.ShoppingAnnotations")

This aligns with the idea that Google's algorithm is highly tuned and customized based on query intent and content type. Ranking factors are likely weighted differently across verticals.

πŸ‘€
Take Aways
πŸ‘‰ Tailor optimization strategies to the unique requirements of each vertical. For example, local search optimization may focus on Google My Business listings and local reviews, while travel SEO might emphasize reviews and booking information.

πŸ‘‰ Invest in brand-building activities, such as securing mentions in authoritative media and engaging in traditional PR.

πŸ‘‰ Encourage branded searches and build strong social media communities around the brand.

πŸ‘‰ Favor a small, consistent set of authors rather than many one-off freelancers.

πŸ‘‰ The authors on your blog should be authoritative and reputable in the topics they write about.

New SEO Strategies or Emphasis Based on the Leak:

Let's dive into specifics about how your existing SEO strategies may change based on what the leak revealed. Some best practices were validated, while other new tactics could boost your rankings.
