Retrieving SERP results at Scale using a Custom Search Engine


Henry Mills

If you want to take your SEO strategy to the next level, you need to know not only which search queries you want to rank for, but also the current rankings on the search engine results page (SERP) for those queries. By knowing which results appear for specific terms and which landing pages those results point to, we can better understand the competition and build a more targeted keyword strategy. Performed manually (i.e. searching on Google and recording the results by hand), this analysis is very time consuming and often not worth the effort. However, as with all repetitive tasks, it is preferable to make a machine do the hard work. So, in the name of automation, we built a mini-solution using a Custom Search Engine from Google. In this guide, we will show you how to do the same. You can find the complete code for this post here. Feel free to make a copy and try it out yourself.

Getting our environment ready

Before we start on the fun data retrieval process, we need to build the environment we will interact with. Firstly, we need to set up a Custom Search Engine (CSE), which will allow us to interact with the SERP via the Custom Search JSON API. Secondly, we need a place to store the search terms we want to retrieve results for, as well as a place to output the data; for this we chose Google Sheets. Lastly, we need an environment to run our solution. Google Colaboratory is really useful in this regard, and we will later show how to easily run our Python 3 script directly from the browser.

All of the environments and interfaces in this example are provided and hosted by Google, so we need to be logged in with a G Suite account to access them. Additionally, we need to set up a basic Google Cloud Platform (GCP) project; a guide for that can be found here.

Custom Search Engine 

The first step is to set up a Custom Search Engine (CSE). The main purpose of the Google CSE platform is to let developers create basic customized search engines based on Google Search. These can be used to search specific webpages as a rudimentary search engine, or to “emulate” fetching results from Google, which is the part we are interested in. Setting up a CSE for our purpose requires a few specific steps, but can be done quite quickly.

Firstly, we need to create a CSE by following this link. Here, you will need to “Add” a new search engine; for the URL you can use any site for now, since we will specify “Search the entire web” in the settings. Language can be set to any, though you may wish to specify one if your Custom Search Engine is only for a specific region (the region itself is determined in the settings).


Once created, you will need to edit the settings via “Modify Search Engine”. Here it is important to set “Region” to the region you wish to gather results for, switch “Search the entire web” to ON, and set the Language if necessary. In the settings you will also need to make a note of the Search Engine ID, which is used to interact with the Custom Search JSON API.

Lastly, we need to generate an API key via this page. Go to “Get a Key” and pick a GCP project using the dropdown menu. Additionally, although not strictly necessary, we recommend setting up billing for the GCP project used to generate the API key, since otherwise you are limited to 100 queries per day. With billing enabled, the limit is 10,000 queries per day at a cost of $5 per 1,000 queries.

Google Sheet

Once you have set up your CSE and got an API key, it is time to set up the Google Sheet. We have already prepared a template, and the later examples rely on the same names for headers and tabs. Make a copy of the provided template and note down the sheet ID (found in the URL of your copy), as we will need it later. In the template sheet, go to the tab named “search_terms”. Here you will find a small list of words, but they are there only as an example. This is where you put in your own search terms, keeping in mind that if you have not enabled billing for your GCP project, you can only pull results for 100 queries per day.

Google Colaboratory Notebook

With a CSE set up and a Google Sheet full of search terms, it is time to set up our coding environment. For this example, we will use a Google Colaboratory notebook to run our Python 3 script. As previously mentioned, we need to be logged into G Suite to use this service. Colaboratory is handy since it comes with the packages “google.colab.auth” and “oauth2client.client.GoogleCredentials”, which allow us to interact with Google Sheets easily. Besides the packages for handling authentication, we need a few libraries both for interacting with our CSE via HTTP requests and for processing the data. Luckily, the ones we need are pre-installed in Colaboratory, so we simply need to import them. For reference, here are the libraries and versions we used:


pandas==0.25.3
requests==2.21.0
json==2.0.9
gspread==3.0.1
re==2.2.1
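
Putting the authentication helpers and libraries together, the imports for the rest of this guide look roughly like this (a sketch; the alias pd for pandas is our own convention):


# Authentication helpers bundled with Colaboratory
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Data handling, HTTP requests, JSON parsing and regular expressions
import pandas as pd
import requests
import json
import gspread
import re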

Another great feature of Colaboratory is form inputs. Using a simple syntax, we can create a form, which makes it easy to change user inputs:


#@markdown Custom Search Engine Inputs
api_key = '' #@param {type:"string"}
cse_id = '' #@param {type:"string"}

#@markdown ---
#@markdown Google Sheet Inputs
spreadsheet_id = '' #@param {type:"string"}
search_term_tabname = 'search_terms' #@param {type:"string"}
results_tabname = 'search_results' #@param {type:"string"}

(For more information about forms in Colaboratory, read here) 

Before we can start retrieving SERP results, we need to add the API key, the ID of the CSE and the ID of the spreadsheet to the input form. If the tab names have been altered from the template, these also need to be changed. Once we have added all the inputs to the form, we are ready to start retrieving the search results.

Retrieving SERP Results

The process for pulling SERP results in our example works in three steps: 1) retrieving the search terms we want to process from the Google Sheet, 2) fetching the results using RESTful requests to the Custom Search JSON API, and 3) printing the results back to the Google Sheet. It is important to remember that the SERP results depend on your CSE settings, so make sure to double-check them.

Retrieving Search Terms from Google Sheet

Before we can start interacting with our Google Sheet, we need to set up authentication, and as mentioned, this is quite easy in Colaboratory. Once the “auth” package has been imported, we can retrieve a temporary credentials file:


auth.authenticate_user()


A prompt with a link and a text input field will be shown below the code cell. Following the link will take you through an authorization flow, which provides the verification code you need to paste into the text input field. By accepting, you grant temporary access to all of your Google Drive (incl. Google Sheets) and GCP services, and a JSON file named adc.json will be stored locally to function as your temporary credentials. Using these credentials, we can set up a “gspread” object:


credentials = GoogleCredentials.get_application_default()
gspread_service = gspread.authorize(credentials)


We can now both pull data from and push data to any Google Sheet we have access to. However, data pulled using the gspread library is returned as a 2-dimensional list, which can be difficult to work with. To solve this, we use a simple function to convert the response into a pandas DataFrame, which is much easier to manipulate:


def get_data_from_sheet(spreadsheet_id, sheet_name, gspread_service):
    spreadsheet = gspread_service.open_by_key(spreadsheet_id)
    sheet = spreadsheet.worksheet(sheet_name)
    data = sheet.get_all_values()
    headers = data.pop(0)  # the first row holds the column headers
    return pd.DataFrame(data, columns=headers)


Now that we have set up authentication and can pull data from a Google Sheet, we can easily get the entire list of search terms from our sheet. 
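
For example, we can load all the terms using the form inputs from earlier (a short usage sketch, assuming the unmodified template tab names):


df_search_terms = get_data_from_sheet(spreadsheet_id, search_term_tabname, gspread_service)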

In the next step, we will show how to loop through the search terms and fetch the top 10 results from the SERP.  


Working with the Custom Search JSON API

With the Custom Search JSON API, you can use RESTful requests to get either web search or image search results in JSON format, and it works like most other HTTP requests. To retrieve the SERP results, we make a GET request for each search term, the results of which we can later print into the Google Sheet. There is no need to pass any payload with the request, as the API key, CSE ID and search term are passed along with the path/URL:


response = requests.get(f"https://www.googleapis.com/customsearch/v1?key={api_key}&cx={cse_id}&q={term_alphanumeric_only}")


However, this limits which characters we can use, since URL strings are limited to the US-ASCII character set, i.e. only alphanumeric characters and a few special characters (“-”, “.”, “_”, “~”). For more information about URL encoding, read here. We could build a URL encoder; however, for this example we have opted to remove all non-alphanumeric characters from the search terms using a simple regular expression:

term_alphanumeric_only = re.sub('[^A-Za-z0-9 ]+', '', search_term)
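
Alternatively, the requests library can URL-encode the query for us if we pass the parameters via its params argument instead of building the URL by hand (a sketch of the alternative, not what we used above):


# requests URL-encodes each value in params, so special characters are safe
response = requests.get(
    "https://www.googleapis.com/customsearch/v1",
    params={"key": api_key, "cx": cse_id, "q": search_term},
)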


If our GET request is successful, we will receive a response with a stringified JSON object containing the SERP results, as well as a lot of metadata and details about the request. For the purpose of this guide, we are mainly interested in the SERP results themselves. However, we also get the total number of results for the query, which allows us to gauge the amount of competition for a particular search term:


{
  'kind': 'customsearch#search',
  'url': {
    ...
  },
  'queries': {
    'request': [
      {
        ...
        'totalResults': '3300000000',
        'searchTerms': 'example',
        ...
      },
      ...
    ]
  }
}


Additionally, we need somewhere to store the information we retrieve from the CSE, and again a pandas DataFrame is very useful here, since it is easy to append new data to and we do not have to specify the columns immediately:


df_search_results = pd.DataFrame()


With somewhere to store the SERP results, we need to extract them from the response. A query can potentially match billions of results, so Google uses pagination and limits the first call to the top 10 results. The results can be found in the ‘items’ field of the JSON response, stored as an array, so we loop through them to retrieve the information we need and push it into the DataFrame:


# Parse the JSON body and grab the array of SERP results
items = json.loads(resp.content)['items']

for i in items:
    search_result = {
        'title': i['title'],
        'snippet': i['snippet'],
        'displayLink': i['displayLink'],
        'link': i['link'],
    }
    df_search_results = df_search_results.append(search_result, ignore_index=True)
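
As an aside, if you need more than the top 10 results per term, the API supports pagination via the start parameter (the 1-based index of the first result to return). A minimal sketch fetching the second page of results:


# Ask the API to start at result 11, i.e. the second page (a sketch)
resp_page_2 = requests.get(f"https://www.googleapis.com/customsearch/v1?key={api_key}&cx={cse_id}&q={term_alphanumeric_only}&start=11")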


There are plenty of fields and metadata that we could potentially retrieve, but we ultimately limited the pull to: title, snippet, display link, link, total number of results, and position in the SERP. To streamline the process, we built a simple function that processes multiple search terms in one go and implements some simple error handling:


def fetch_cse_results(search_term, api_key, search_engine):

    df_search_results = pd.DataFrame()

    for term in search_term:
        try:
            # Strip non-alphanumeric characters so the term is URL-safe
            term_alphanumeric_only = re.sub('[^A-Za-z0-9 ]+', '', term)
            resp = requests.get(f"https://www.googleapis.com/customsearch/v1?key={api_key}&cx={search_engine}&q={term_alphanumeric_only}")
            total_results = float(json.loads(resp.content)['queries']['request'][0]['totalResults'])
            if total_results > 0:

                items = json.loads(resp.content)['items']
                position = 1

                for i in items:
                    search_result = {
                        'total_results': total_results,
                        'position': position,
                        'search_term': term,
                        'title': i['title'],
                        'snippet': i['snippet'],
                        'displayLink': i['displayLink'],
                        'link': i['link'],
                    }
                    position += 1
                    df_search_results = df_search_results.append(search_result, ignore_index=True)
        except Exception:
            # Surface the error reported by the API and return what we have so far
            error_resp = json.loads(resp.content)['error']['errors'][0]
            print(f'[{error_resp["reason"]}] {error_resp["message"]}')
            return df_search_results

    return df_search_results


Using the DataFrame containing the search terms, we pass the queries as a list to the function, along with the API key and CSE ID:


df_results = fetch_cse_results(df_search_terms["search_terms"].values, api_key, cse_id)


We now have a DataFrame containing all the information we decided to extract from each of the queries, ready to be pasted into the Google Sheet.

Pushing results to a Google Sheet

We will be using the same credentials we created in a previous step for pulling data from the Google Sheet to push data to it. However, since the sheet might contain data from previous runs, we first clear the sheet where we want to print the results. Again, we have made a function for this purpose:


def clear_worksheet(spreadsheet_id, sheet_name, arange, gspread_service):
    ss = gspread_service.open_by_key(spreadsheet_id)
    sheet = ss.worksheet(sheet_name)
    # Extend the open-ended range (e.g. "A2:G") down to the last populated row
    range_of_cells = sheet.range(arange + str(len(sheet.get_all_values())))

    for cell in range_of_cells:
        cell.value = ''
    sheet.update_cells(range_of_cells)

clear_worksheet(spreadsheet_id, results_tabname, "A2:G", gspread_service)


With a clean slate, we can print the data into the sheet:


def print_df_to_sheet(dataframe, spreadsheet_id, sheet_name, start_column, gspread_service):

    spreadsheet = gspread_service.open_by_key(spreadsheet_id)
    sheet = spreadsheet.worksheet(sheet_name)
    spreadsheet.values_update(
        sheet_name+'!'+start_column+str(2), 
        params={'valueInputOption': 'RAW'}, 
        body={'values': dataframe.values.tolist()}
    )

print_df_to_sheet(df_results, spreadsheet_id, results_tabname, "A", gspread_service)


Et voilà, you have now created a solution for pulling thousands of search results in one go. The next step is analyzing the results.
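
As a small example of where that analysis might start, here is a sketch that counts which domains appear most often across the fetched SERPs, using the df_results DataFrame from above:


# Tally how often each domain appears across all fetched SERPs (a sketch)
top_competitors = df_results['displayLink'].value_counts().head(10)
print(top_competitors)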



Summary

By following these steps, your end result should be a robust and scalable solution for fetching search results rapidly. Using this alongside a thorough SEO strategy should help you efficiently uncover insights to take your organic performance to the next level.

