Create a list of Pages & URL’s Indexed in Google for a Domain

Categorised in: , ,

January 12, 2015 9:59 am | Published by Matt

It is very important to stay on top of the URL’s that Google and other search engines have in their indexes to ensure that removed pages are correctly redirected to appropriate, relevant content.

To maintain external links and social shares to removed content, along with reducing 404 errors and potentially loosing visitors, the implementation of 301 redirects is a must for every web site. This is why we need to employ the methods detailed in this post to identify, monitor and maintain the indexed pages within search engines.

How to Identify Indexed Pages

The simplest method to identify the pages and URL’s that a search engine such as Google or Bing has indexed for your domain is to use the following search modifier.

site:yourdomain.com

Please replace ‘yourdomain.com’ with the domain that you wish to return results for.

Indexed URL's listed in Google

You may wish to append your sub domain extension to return specific sub domains results. The example below would return results for the specified sub domain.

site:subdomain.yourdomain.com

This search modifier queries the search engines index and returns all of the specified domain’s URL’s in the search results.

List & Export Indexed URL’s to a Google Drive Spread Sheet

Now that we have an idea of what URL’s are indexed for your specified domain it would be useful to export this data in to a document so that the data can be utilised.

The simplest method is to use the xpath syntax to pull the data in to a Google Drive (formally Google Docs) spread sheet, which can be done by following the steps below.

Firstly, sign in to Google Drive and create a new spread sheet document.

Secondly, in the first cell within the document (A1), input the following query, replacing ‘yourdomain.com’ with your specific domain name.

=importXml(“https://www.google.com/search?q=site:yourdomain.com&num=100&start=1”, “//h3/a/@href”)

This will return the first 100 results within Google index for your specified domain name. If you have more than 100 URL’s indexed, then in the first cell on line 101 (A101) input the amended query, replacing ‘start=1‘ with ‘start=101‘.

=importXml(“https://www.google.com/search?q=site:yourdomain.com&num=100&start=101”, “//h3/a/@href”)

Repeat this step as many times are required to pull out all of your sites indexed URL’s. Please be aware that there is a limit of 1000 URL’s that can be imported.

List indexed pages from Google

As you may have noticed, the returned URL’s have various query strings appended to them, so our final step is to clean these up so that we are left with a clean list of pages.

The final step is to paste the following code in to the first cell in our second column (B1).

=mid(A1,search(“?q=”,A1)+3,search(“&sa=”,A1)-(search(“?q=”,A1)+3))

Clean up Google indexed pages list for a domain

As you can see, we are now left with the pages full URL which is indexed. To extrapolate this data for all of the other rows in our spread sheet, simply click on the small blue square of the highlighted cell (B1) and drag in down to your last row that contains data.

Google Docs / Drive SpreadSheet of Indexd pages for a Domain

You should now have a spread sheet containing all of your chosen domains URL’s that Google has in its index.

With this data in a spread sheet it makes life easier to test URL’s in a browser and discover broken links that can then be redirected with relevant 301 redirects.

share this article:

  • This saves endless hours of compiling data, I may need to extend the tutorial to then write 301 redirects in another column.