
Get all the URLs of the indexed pages on Google for a website through Python

Last time I checked, this task was presented as overly complicated in some Python-for-SEO tutorial.

So here I am, humbly giving my contribution with the hope of being of some help to a fellow (and honest) SEO professional and colleague.

If this has helped you => don’t forget to buy me at least a coffee (see below), drop a line in a comment, and say thank you!

Also, if you are happy to connect, feel free to add me on the social network of your choice.

Well, that said, without further ado…

here’s the code:

First, let’s import all the modules we need:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from urllib.parse import urlparse
from urllib.parse import parse_qs
import time

There’s also a fancier way in Selenium to do the same kind of waiting we do with time (explicit waits), but frankly, I don’t care! 😀
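For the curious, here’s a minimal sketch of that fancier approach (an explicit wait). It assumes Google’s results container keeps its “rso” id, which is their markup and may change at any time:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_results(driver, timeout=20):
    # polls until the element with id "rso" (Google's results container) is present,
    # or raises TimeoutException after `timeout` seconds
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.ID, "rso"))
    )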

VERY IMPORTANT: you also need to download a webdriver (see for example Chromedriver) and put it in your working directory (that’s the path we reference in the code).
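Small aside: if you’re on a recent Selenium release (4.6 or newer), Selenium Manager can download a matching driver for you automatically, so the line below may be all you need (no manual download, no path):

from selenium import webdriver

# On Selenium 4.6+, Selenium Manager fetches a matching chromedriver automatically
driver = webdriver.Chrome()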

ALSO VERY IMPORTANT: the selenium script will launch the webdriver, which is an actual web browser that you can see and interact with. Since we are using a web browser to surf the Google index, Google will ask you to accept its conditions… Just click accept and let the script work its magic.
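If you’d rather have the script click that consent button for you, here’s a rough sketch; the XPath below is purely a guess (Google’s consent page markup changes and varies by language and region), so treat it as a starting point and keep the manual click as a fallback:

from selenium.webdriver.common.by import By

# Hypothetical helper: tries to click a consent button whose label contains "Accept".
# The XPath is an assumption -- the consent page differs by region/language and changes often.
def try_accept_consent(driver):
    try:
        driver.find_element(By.XPATH, '//button[contains(., "Accept")]').click()
    except Exception:
        pass  # no consent screen (or a different layout) -- just keep going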

Now… Let’s write our function, shall we?

def scan_that_index(website):
    
    url = "https://www.google.com/search?q=site:" + website
    
    url_list = []
    pages_to_be_scanned = []
    pages_scanned = []

    driver = webdriver.Chrome(service=Service("./chromedriver"))
    
    pages_to_be_scanned.append(url)
    
    while len(pages_to_be_scanned) > 0:
        
        url = pages_to_be_scanned[0]
    
        driver.get(url)

        xPath = '//*[@id="rso"]/div/div/div/div[1]/div//h3/parent::a'
        elems = driver.find_elements(By.XPATH, xPath)

        for elem in elems:

            cur_url = elem.get_attribute('href')

            if cur_url not in url_list:

                url_list.append(cur_url)
                
        pages_to_be_scanned.remove(url)
        pages_scanned.append(url)
        
        #finding pagination
        xPath = '//*[@id="botstuff"]/div/div[2]/table/tbody/tr/td/a'

        for elem in driver.find_elements(By.XPATH, xPath):

            cur_url = elem.get_attribute('href')
            
            # keep only the "q" and "start" parameters so each pagination URL
            # gets one canonical form (and drop start=0, which is just page 1 again)
            u = urlparse(cur_url)
            the_query = u.query
            for key, value in parse_qs(u.query).items():
                if (key != 'q' and key != 'start') or (key == 'start' and value[0] == '0'):
                    replace_str = '&' + key + '=' + value[0]
                    the_query = the_query.replace(replace_str, '')

            clean_url = cur_url.replace(u.query, the_query)
            
            if clean_url not in pages_scanned and clean_url not in pages_to_be_scanned:
                pages_to_be_scanned.append(clean_url)
        
        print('\npages_to_be_scanned:', len(pages_to_be_scanned))
        print('pages_scanned:', len(pages_scanned))
        
        time.sleep(20)
    
    driver.quit()
    
    return url_list

A few things to know about the code above: the two XPaths are tied to Google’s current results markup, so expect to update them when the layout changes; the 20-second pause keeps the requests slow enough not to get you blocked; and the pagination URLs are stripped down to their “q” and “start” parameters, so the same results page never gets queued twice.
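To make that last point concrete, here’s the same clean-up run on a made-up pagination URL (the parameter values are invented for the example):

from urllib.parse import urlparse, parse_qs

cur_url = "https://www.google.com/search?q=site:example.com&num=10&ei=abc123&start=10&sa=N"
u = urlparse(cur_url)
the_query = u.query
for key, value in parse_qs(u.query).items():
    if (key != 'q' and key != 'start') or (key == 'start' and value[0] == '0'):
        the_query = the_query.replace('&' + key + '=' + value[0], '')

print(cur_url.replace(u.query, the_query))
# prints: https://www.google.com/search?q=site:example.com&start=10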

So, here we are: the libraries are loaded and the function is written.
Let’s run it!

website = "https://jonathanseo.com" # of course, change this value to whatever you want (as long as it's a proper website URL)
url_list = scan_that_index(website)

Once you run this code, this will happen:

  1. the webdriver will be launched
  2. it will go to Google (remember to accept its conditions as explained above)
  3. it will go to the 1st page of results for the query “site:” + the website you put in the variable “website”, as explained in the code above
  4. it will gather all the urls of the indexed pages
  5. it will wait 20 seconds!!!
  6. it will go to the 2nd page of the results…
  7. it will gather all the urls of the indexed pages
  8. it will wait 20 seconds!!!
  9. it will go to the 3rd page of the results…

All the URLs gathered are stored in a Python list called “url_list”.

You can access the Python list in a number of ways, and the best one for you depends on how you are running Python (through a Jupyter notebook, through an IDE, etc.). For example:

for url in url_list:
    print(url)

Or maybe you want to write it to a CSV; in that case, here’s one way:

import csv
with open('jonathanseo.com.url_list.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for url in url_list:
        writer.writerow([url])

So, that’s pretty much it!

I hope you enjoyed my script and explanation; feel free to drop a line to say thank you or to leave a comment…

And don’t forget to buy me a coffee! 😁

