Get all the URLs of the indexed pages on Google for a website through Python
Last time I checked, this task was presented as overly complicated in some Python for SEO tutorials.
So here I am, humbly giving my contribution in the hope of being of some help to a fellow and honest SEO professional and colleague.
If this has helped you => don’t forget to buy me at least a coffee (see below), drop a line in a comment and say thank you!
Also, if you are happy to connect, feel free to add me on the social network of your choice.
Well, that said, without further ado…
here’s the code:
First, let’s import all the modules we need:
from selenium import webdriver
from selenium.webdriver.common.by import By
from urllib.parse import urlparse
from urllib.parse import parse_qs
import time
There’s also a fancier way in Selenium to do the same waiting we do with time, but frankly, I don’t care! 😀
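If you do care: here’s a minimal sketch of Selenium’s explicit wait, which waits for a specific element instead of sleeping a fixed number of seconds (the “rso” id is the results container also used in the XPath further down; you’d drop something like this right after driver.get(url) inside the function below):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 20 seconds for the results container to show up, then move on
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, "rso")))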
VERY IMPORTANT: you also need to download a webdriver (see for example Chromedriver) and put it in your working directory (that’s where the code below will look for it).
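A small note: newer Selenium versions (4.x) may complain about, or outright drop, the executable_path argument you’ll see below; if that happens to you, here’s a quick sketch of the Service-based way to point at the same ./chromedriver file:

from selenium.webdriver.chrome.service import Service

# same idea as executable_path, just wrapped in a Service object (Selenium 4 style)
driver = webdriver.Chrome(service=Service("./chromedriver"))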
ALSO VERY IMPORTANT: the selenium script will launch the webdriver, which is an actual web browser that you can see and interact with. Since we are using a web browser to surf the Google index, Google will ask you to accept its conditions… Just click accept and let the script work its magic.
Now… Let’s write our function, shall we?
def scan_that_index(website):
    # query Google for all the indexed pages of the website
    url = "https://www.google.com/search?q=site:" + website
    url_list = []             # indexed URLs found so far
    pages_to_be_scanned = []  # result pages still to visit
    pages_scanned = []        # result pages already visited
    driver = webdriver.Chrome(executable_path="./chromedriver")
    pages_to_be_scanned.append(url)
    while len(pages_to_be_scanned) > 0:
        url = pages_to_be_scanned[0]
        driver.get(url)
        # grab the result links (the <a> wrapping each <h3> title)
        xPath = '//*[@id="rso"]/div/div/div/div[1]/div//h3/parent::a'
        elems = driver.find_elements(By.XPATH, xPath)
        for elem in elems:
            cur_url = elem.get_attribute('href')
            if cur_url not in url_list:
                url_list.append(cur_url)
        pages_to_be_scanned.remove(url)
        pages_scanned.append(url)
        # finding pagination
        xPath = '//*[@id="botstuff"]/div/div[2]/table/tbody/tr/td/a'
        for elem in driver.find_elements(By.XPATH, xPath):
            cur_url = elem.get_attribute('href')
            u = urlparse(cur_url)
            the_query = u.query
            # keep only "q" and a non-zero "start", so the same result page
            # is not queued twice under slightly different URLs
            for key, value in parse_qs(u.query).items():
                if (key != 'q' and key != 'start') or (key == 'start' and value[0] == '0'):
                    replace_str = '&' + key + '=' + value[0]
                    the_query = the_query.replace(replace_str, '')
            clean_url = cur_url.replace(u.query, the_query)
            if clean_url not in pages_scanned and clean_url not in pages_to_be_scanned:
                pages_to_be_scanned.append(clean_url)
        print('\npages_to_be_scanned:', len(pages_to_be_scanned))
        print('pages_scanned:', len(pages_scanned))
        time.sleep(20)
    driver.close()
    return url_list
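Just to show what the query clean-up inside the pagination loop actually does, here’s a tiny standalone example (the URL and its extra parameters are made up for illustration):

example = "https://www.google.com/search?q=site:jonathanseo.com&sxsrf=ABC123&start=10&sa=N"
u = urlparse(example)
the_query = u.query
for key, value in parse_qs(u.query).items():
    if (key != 'q' and key != 'start') or (key == 'start' and value[0] == '0'):
        the_query = the_query.replace('&' + key + '=' + value[0], '')
print(example.replace(u.query, the_query))
# prints: https://www.google.com/search?q=site:jonathanseo.com&start=10

In other words: only “q” and a non-zero “start” survive, so every paginated result page gets queued under one canonical URL.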
A few things to know about the code above…
- the input is not fool-proof
The script is designed for people who know what they are doing and are willing to do it properly.
Well, “designed” is a big word: let’s simply say I haven’t had the time to think about all the mistakes humans can make and design a more complex code!
Example: the variable “website” should be filled with a proper website (simple, right?)
- it assumes that the webdriver you downloaded is called “chromedriver” and can be found in the working directory, but you can change the code if you wish
- things change
The script works now (December 2022) but Google can change its code and maybe some tweaking might become necessary (you can contact me and ask for help, but I might ask you a fee for that 😀 )
- I made the script pause for 20 seconds for each page
You can easily change that – “time.sleep(20)” – but you are scraping Google for God’s sake! Show a little respect! (There’s a small sketch after this list if you want a gentler option.)
- it works only with X number of indexed pages
I haven’t checked the script with big websites with thousands of indexed pages…
But even if you need a comprehensive list of all the indexed pages of a website with a few hundred pages… and you have to do it (semi)manually, the task will be daunting! So I hope I’ll save you a lot of time and you’ll be grateful! 🙂
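About that pause: if you do want to change it, here’s a minimal sketch of a randomized delay instead of the fixed 20 seconds (the 15 to 30 second range is just my suggestion, pick your own):

import random

# sleep a random time between 15 and 30 seconds instead of a fixed 20
time.sleep(random.uniform(15, 30))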
So, here we are: the libraries are loaded and the function is written.
Let’s run it!
website = "https://jonathanseo.com" #of course here you change the value and put whatever you want (as long it's a proper website url)
url_list = scan_that_index(website)
Once you run this code, this will happen:
- the webdriver will be launched
- it will go to Google (remember to accept its conditions as explained above)
- it will go to the 1st page of results for the query “site:” + the website you put in the variable “website”, as explained in the code above
- it will gather all the urls of the indexed pages
- it will wait 20 seconds!!!
- it will go to the 2nd page of the results…
- it will gather all the urls of the indexed pages
- it will wait 20 seconds!!!
- it will go to the 3rd page of the results…
- …
All the urls gathered are stored in a python list called “url_list”.
You can access the Python list in a number of ways, and the best way for you depends on how you are running Python (through a Jupyter notebook, through an IDE, etc.). For example:
for url in url_list:
    print(url)
Or maybe you want to write it to a CSV; in this case, here’s one of the ways:
import csv

# newline='' avoids blank rows between lines on Windows
with open('jonathanseo.com.url_list.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for url in url_list:
        writer.writerow([url])
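Or, if you happen to have pandas installed, a quick alternative sketch (same filename as above, just for the example):

import pandas as pd

# one-column dataframe, written without the index column
pd.DataFrame({'url': url_list}).to_csv('jonathanseo.com.url_list.csv', index=False)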
So, that’s pretty much it!
I hope you enjoyed my script and explanation; feel free to drop a line to say thank you or if you have any comments…
And don’t forget to buy me a coffee! 😁