How Google Works: Website Evaluation: Crawling, Indexing & Ranking
Ever wondered how Google ranks websites? Well, you are not alone. Thanks to a great question from RobertvH in Munich, Matt Cutts, Head of Web Spam at Google, took it upon himself to explain exactly how Google works – and in a respectable 8 minutes! This video was every marketing manager's, webmaster's and SEO's dream! Online marketers owe a lot to RobertvH in Munich!
So... Matt talks about the 3 things you need to do well in order to be the world's best search engine:
- Crawl the web comprehensively and deeply
- Index pages ('documents' as he commonly refers to them) and...
- Rank/serve and return the most relevant pages first
'Crawling the Web'
Crawling is more difficult than we might think, which is understandable given the volume of content online nowadays. Essentially, PageRank (PR) is the primary determinant of what gets crawled first. This is the rank given to your website, and it is based upon:
- The number of people linking to your site (backlinks)
- The reputation of those people who link to your website (their PR and authority)
Put simply, a good PR helps Google find your website more quickly and easily. This is common sense – the more links pointing to your site (and the more links you have from very high authority sites like news sites; the CNN and New York Times of this world), the sooner and more often the Googlebot is likely to crawl your site. It basically spends its time crawling around the web, starting with the most authoritative sites and following links as it comes across them. Hence if you have a long-standing website, the Googlebot will usually arrive at yours from another site in order to crawl it – the more it is encouraged to come, the more you are likely to benefit.
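To make the "most authoritative sites first" idea concrete, here is a minimal sketch of a crawler that always visits the highest-authority page it knows about next. This is an illustration only, not Google's actual crawler: the site names, the link graph and the authority scores are all made up, and crawl_order is a hypothetical helper.

```python
import heapq

def crawl_order(links, authority, seeds):
    """Return the order in which pages are visited, highest authority first.

    links:     dict mapping a page to the pages it links to (toy data)
    authority: dict mapping a page to a made-up authority score
    seeds:     pages the crawler already knows about
    """
    # heapq is a min-heap, so scores are negated to pop the highest first.
    frontier = [(-authority.get(page, 0.0), page) for page in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    order = []
    while frontier:
        _, page = heapq.heappop(frontier)
        order.append(page)
        # Follow links as they are discovered, queueing unseen pages.
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-authority.get(target, 0.0), target))
    return order

# Hypothetical link graph: a news site links to a blog and a shop.
links = {
    "news.example": ["shop.example", "blog.example"],
    "blog.example": ["shop.example", "tiny.example"],
    "shop.example": [],
    "tiny.example": [],
}
authority = {
    "news.example": 0.9,
    "blog.example": 0.5,
    "shop.example": 0.3,
    "tiny.example": 0.1,
}
order = crawl_order(links, authority, ["news.example"])
print(order)
```

Notice that tiny.example is only ever found because blog.example links to it, which is exactly why a link from a well-crawled site helps a small site get discovered.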
A bit of history
Historically Google crawled the web for several weeks at a time, indexed it for a week and then pushed out the ranked pages it found over another week period. This was called the 'Google Dance'.
However, this long-winded 'Google Dance' meant Google was forever out of date with what was actually happening online. News sites in particular, as we all know, change on an hourly basis.
Back in 2003, as part of Update Fritz, Google started to crawl a significant amount of the web every day. By breaking up the web into segments (determined by PageRank), Google crawled each segment more frequently and ensured that (at the very least) the rankings of the most commonly updated websites were refreshed daily. Google also had a 'supplemental' index, or 'supplemental results', which was not crawled and refreshed quite as often. This index included more websites by number and was where most small, less well-known websites sat in the indexing process.
Incremental indexing, however, has improved over time, and today Google can see updates made to sites very quickly and easily. The process gets ever quicker and more efficient. These are the basic principles of 'comprehensive crawling' of the Internet. Next, Google has to index the web...
'Indexing' and 'Document Selection'
Rather than looking for search terms in word order amongst the millions of online documents and webpages, Google breaks down search terms into individual words and checks which online pages they appear in – this is the 'index'.
To give Cutts' example – if searching for 'Katy Perry', Google looks for all the documents containing the word 'Katy' and all the documents containing 'Perry'. It looks up each word separately and then cross-references the two lists to find the documents that appear in both. Only if both keywords appear in the same document does Google consider including that page in its search results. Google looks at both on-page content AND off-page anchor text to see if those keywords appear. This is the process of 'Document Selection'. The index is the reverse of what one might expect: rather than storing each document with its words in the order they were written, Google stores each word alongside the list of documents it appears in, kept in document order.
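The 'Katy Perry' example can be sketched in a few lines of code. This is a toy version under stated assumptions: the four documents are invented, build_index and select_documents are hypothetical helper names, and only on-page text is indexed (anchor text is left out to keep it short).

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def select_documents(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # One list of documents per query word, then keep only the overlap.
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)

# Made-up documents for illustration.
documents = {
    1: "Katy Perry announces tour",
    2: "Katy writes a blog",
    3: "Perry the platypus",
    4: "Interview with Katy Perry",
}
index = build_index(documents)
print(sorted(index["katy"]))                          # [1, 2, 4]
print(sorted(index["perry"]))                         # [1, 3, 4]
print(sorted(select_documents(index, "Katy Perry")))  # [1, 4]
```

Documents 2 and 3 each contain one of the words but not both, so they never make it into the candidate list.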
Now, unsurprisingly, this is where Google becomes more muted – Google uses PR and over 200 other factors to help determine the authority of the document. Matt openly admits that even the most authoritative page, with a high PR, might not have many mentions of the keyword searched. Common sense says such a page cannot be as relevant despite its authority, so Google looks at links pointing to the page, the proximity of the keywords to each other within the content and the repetition of those searched terms. It balances the reputation of the document with its relevance to the searched term and presents results according to its latest indexed data.
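Google does not publish how it weighs these factors, so the following is purely a sketch of the balancing act described above. The formula, the 50/50 weight and the reputation numbers are my own assumptions, and score and min_distance are hypothetical helpers. It blends a reputation figure with a relevance figure built from keyword repetition and proximity.

```python
def min_distance(positions_a, positions_b):
    """Smallest gap, in words, between any occurrence of two terms."""
    return min(abs(a - b) for a in positions_a for b in positions_b)

def score(text, query, reputation, weight=0.5):
    """Blend reputation with relevance. The formula is an assumption."""
    words = text.lower().split()
    terms = query.lower().split()
    if not words or not terms:
        return 0.0
    positions = {t: [i for i, w in enumerate(words) if w == t] for t in terms}
    # Document selection: every term must appear at least once.
    if any(not found for found in positions.values()):
        return 0.0
    # Repetition: share of the page's words that are query terms.
    repetition = sum(len(found) for found in positions.values()) / len(words)
    # Proximity: 1.0 when consecutive query terms sit next to each other.
    if len(terms) > 1:
        gaps = [min_distance(positions[a], positions[b])
                for a, b in zip(terms, terms[1:])]
        proximity = 1.0 / max(1.0, sum(gaps) / len(gaps))
    else:
        proximity = 1.0
    relevance = repetition * proximity
    return weight * reputation + (1.0 - weight) * relevance

# Made-up pages: one authoritative but barely on topic, one modest but focused.
authoritative = score(
    "big news site mentions katy once and somewhere later perry",
    "katy perry", reputation=0.9)
focused = score(
    "katy perry tour dates katy perry tickets",
    "katy perry", reputation=0.5)
print(focused > authoritative)
```

With these invented numbers the focused page edges ahead of the more reputable one, which is the trade-off Matt describes: authority alone does not win if the page barely mentions the search terms.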
So what actually happens when someone presses the 'Google Search' button?
So let's say a searcher comes to Google and types their query. Google has to send that search query to the closest data centre, and from there it is sent out to hundreds of different machines, all at the same time. Each of those machines looks through the fraction of Google's latest index that it holds and returns the pages it thinks are the most relevant matches for the search query.
Google looks at these returned results and tries to determine what Cutts describes as the "crème de la crème" of pages. Once it has decided which pages are the very best in terms of relevance and authority, Google displays those pages to the searcher on the SERP. Pages are passed back with a snippet that shows the keyword in its context on the page, and the search is completed. This all takes place in under half a second, which frankly is incredible!
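The "send it to lots of machines, then pick the best" step is a pattern programmers call scatter-gather. Here is a rough sketch of it, with two in-memory dictionaries standing in for the machines and threads standing in for the network. The pages, the reputation numbers and the scoring rule are all invented for illustration, and search_shard and search are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

def search_shard(shard, terms, k):
    """Return this shard's best k hits as (score, doc_id) pairs."""
    hits = []
    for doc_id, (text, reputation) in shard.items():
        words = text.lower().split()
        # Only pages containing every query term are candidates.
        if words and all(term in words for term in terms):
            matches = sum(words.count(term) for term in terms)
            # Toy score (an assumption): reputation plus keyword density.
            hits.append((reputation + matches / len(words), doc_id))
    return heapq.nlargest(k, hits)

def search(shards, query, k=3):
    """Query every shard at the same time, then merge the partial results."""
    terms = query.lower().split()
    if not terms or not shards:
        return []
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, terms, k), shards)
        merged = [hit for partial in partials for hit in partial]
    # Keep only the overall best k across all shards.
    return [doc_id for _, doc_id in heapq.nlargest(k, merged)]

# Two made-up shards, each holding a fraction of the index.
shards = [
    {"a": ("katy perry tour dates", 0.2),
     "b": ("perry mason reruns", 0.9)},
    {"c": ("katy perry katy perry tickets", 0.1),
     "d": ("interview with katy perry", 0.6)},
]
top = search(shards, "katy perry", k=2)
print(top)
```

Each shard only ever returns its own best few results, so the final merge has very little to sort through. That is a big part of why the whole thing can finish so quickly.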
So what can marketers take away from Matt's video?
While this might not be groundbreaking news to some tech geeks, this eloquent explanation of Google's indexing and ranking process is very interesting and provides a great basis for understanding SEO.
I believe many people simply didn't understand this process, and the video bridges some gaps and helps educate people about the reasons why black hat SEO techniques do not work. Many felt Google were purposely evasive about how they choose what results to display, but this video could be the start of a more transparent ranking process – or so one might hope?
What Matt doesn't discuss is how Google over the years has come to understand, seek out and penalise websites that try to trick Google's very important Googlebots. It's quite plain that the ranking algorithm looks at PageRank, backlinks and content, and these things cannot be faked (well, if they are, they are just as quickly caught and penalised).
The fact is that PageRank is determined by authority, which flows down from the pages that link to you. It's very hard to get a link from the BBC or CNN, for example, without having relevant and newsworthy content, as I am sure you may have already found. The best sites, the ones that can benefit your SEO most, do not give away links lightly. They are the Holy Grail for SEOs. Fundamentally, however, everything comes back to the need for genuinely good content. You can obtain links from these sites, but only if you have highly creative, newsworthy stories and content to share. Sadly this is not something you can cheat at. One link from a news site like the BBC is worth far more to your SEO and your website's profitability than hundreds and hundreds of spammy, low-quality links that you can buy quickly and easily. It's a fact, I am afraid.
PageRank flows down the chain from a high authority web page, through the links on that page, and out into other third party websites (we call it 'link juice'). Imagine a champagne tower if you are struggling with this concept! This flow is what helps determine your PageRank and helps you be noticed by Google in the first instance.
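If the champagne tower doesn't do it for you, the flow can also be shown in code. Below is a minimal sketch of the simplified, publicly described PageRank calculation (repeatedly passing each page's rank along its outgoing links, with the commonly quoted damping factor of 0.85). It is not Google's production algorithm, and the link graph is entirely made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated passes over a toy link graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each pass with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Pour this page's rank equally down each outgoing link.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical graph: four sites link to "bbc", which links to "you";
# three spam pages that nobody links to all link to "rival".
links = {
    "a": ["bbc"], "b": ["bbc"], "c": ["bbc"], "d": ["bbc"],
    "bbc": ["you"],
    "spam1": ["rival"], "spam2": ["rival"], "spam3": ["rival"],
    "you": [], "rival": [],
}
ranks = pagerank(links)
print(ranks["you"] > ranks["rival"])
```

In this toy graph, "you" ends up ranked above "rival" even though "rival" has three inbound links to your one, because your single link comes from a page that has plenty of authority of its own to pass on.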
Good quality, consumer-centric content (and lots of it) is crucial. The more 'documents' or web pages that talk in a user-friendly way about what you do, offer or sell, the more chance you have of Google ranking your page in its 'crème de la crème' results. Simple.
Gone are the days of tricking Google – it's just not worth the time or money. So my takeaway is: do it once, do it properly and you will see the long-term benefits.
Watch Matt's video for yourself:
I'd love to hear your thoughts about Matt's video – what did you learn from it, did you like it and what if anything would you ask Google?