Build a Web-scraping API in Under 30 Minutes with ChatGPT and Azure Functions

ZhongTr0n
8 min read · Dec 13, 2022
Image source: Pexels.com

Introduction

Anyone who dabbles in data science, data engineering, or data analysis has probably built a web scraper at some point. The fastest approach is to build something on your local system, as it is free, fast, and hassle-free. Unfortunately, this also comes with limitations, the two most prominent being that it is entirely dependent on your machine and that it is not scalable.

Thankfully, in these days of public cloud services, this can easily be overcome at very low cost by using cloud technology from Microsoft, Amazon, Google, Alibaba Cloud, or others.

Most of these companies offer almost identical products so a lot of the knowledge and principles are transferable. In this tutorial, we will be using Azure by Microsoft.

We will use Azure to host our scraper and expose it as an API, making it callable from anywhere. Alright, let’s dive into it.

Requirements

In order to replicate this project you will need the following:

  • an Azure account with an active subscription
  • Visual Studio Code with the Azure Tools and Azure Functions extensions
  • Python installed locally (3.9 in this tutorial)

Setting up the Function App

Ok now that we have everything we need, let’s start setting up the infrastructure.

Go to your Azure Portal, and at the top select ‘Function App’.

Image source: azure.microsoft.com

Then select “+ Create” again at the top left, which prompts the following menu.

Image source: azure.microsoft.com

You can create the function app with the following parameters;

Subscription: Whatever subscription you want to use from your list, most likely “Pay-As-You-Go”
Resource Group: You can create a new one for this scraper, but this is not essential for this project.
Function App Name: a globally unique name of your choosing that will be used to trigger the function via THENAMEYOUCHOSE.azurewebsites.net
Publish: Code
Runtime stack: Python
Version: 3.9
Region: as close to your current location as possible
Operating System: Linux
Plan: Consumption (Serverless)

You can now click “Review + create” at the bottom. There are more settings in different tabs, but for this project, you can leave them all in the default setting. One thing these default settings will do is automatically create a new storage account that will be used for this project.

Now you have a function app with a storage account linked to it. However, the function app does not contain any functions or code yet. Let’s head over to VS Code to write some code and deploy it to the function.

Creating a Function in VS Code

There are multiple ways to deploy code on a function app, but the most used and obvious choice is Visual Studio Code. Before you continue, make sure you have all the necessary extensions (Azure Tools and Azure Functions) installed and log in to your Azure subscription in VS Code.

Once logged in, you can navigate to the Azure tab by clicking the Azure logo on the left side. Here, you can add a new function in your workspace by selecting the lightning bolt.

Image source: azure.microsoft.com

This will prompt some menus, where you can select the following;

Select the folder for your function project; here you can either use an existing directory or create a new one.
Select a Language: Python
Template: HTTP Trigger
Provide Name: a name of your choosing
Authorization Level: Anonymous (attention: this means anyone with the link can trigger your function; it is recommended to secure this later).

In your workspace (bottom left) you can now see the function. Next to the create-function (lightning bolt) icon you used earlier, you can now see a “Deploy” icon. Clicking this icon will prompt a menu where you can choose one of your function apps. Select the one we set up earlier (most likely this is your first function app, so only one will be shown anyway).

Image source: azure.microsoft.com

The deployment might take a minute to run. During this deployment, VS Code takes the template (a .json file with the settings we chose earlier) and uploads it to your cloud-based Azure Function App. This, in turn, creates all the necessary files for the function, like __init__.py and requirements.txt, on Azure, completing the function template.

When the deployment is completed, you now have a complete function template running on your Azure cloud service. Navigating to YOURAPPNAME.azurewebsites.net/api/YOURFUNCTIONNAME?name=SOMEPARAMETER should trigger your function.

Now that the app is up and running, we can modify it and add a web scraper to it.

Building The Scraper

Let’s go back to VS Code to build the scraper. As mentioned earlier, the deployment automatically generated all the necessary files for the function app.

The core of the app runs from the __init__.py file, which you can find in VS Code by going to the Azure tab → Function App → YOUR FUNCTION APP → Files.

Here you can create a new file named scraper.py.

As of writing, it is December 2022, so I am not going to sit here and write the code myself like a caveman. Instead, I asked ChatGPT the following: “write a python webscraper that grabs all email addresses of a page”, which returned;

Image source: chat.openai.com

import re
import requests
from bs4 import BeautifulSoup

def get_email_addresses(url):
    # Make an HTTP request to the given URL to retrieve the page's HTML
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    # Create a regular expression pattern to match email addresses
    pattern = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"

    # Find all the text on the page that matches the pattern
    emails = re.findall(pattern, soup.text)

    # Return the list of email addresses
    return emails
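Before wiring this into the function, you can sanity-check the email regex locally on a static snippet, without any network request (the sample addresses below are made up):

```python
import re

# The same pattern ChatGPT generated for scraper.py
pattern = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"

sample_text = "Contact sales@example.com or support@example.org for help."
emails = re.findall(pattern, sample_text)
print(emails)  # ['sales@example.com', 'support@example.org']
```

If this returns the two sample addresses, the extraction logic itself works and any later empty result points at the HTTP request or the page content instead.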

Paste this code in the scraper.py file and save it. Now we will go back to the __init__.py file to use this function.

In the __init__.py file we will add the following;

  • add a second parameter
  • rename both parameters to url_domain & url_ext
  • construct a URL from both parameters
  • import the scraper function
  • run the scraper function

The __init__.py code now looks like this;

import logging
import azure.functions as func
from . import scraper

def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    url_domain = req.params.get('domain')
    url_ext = req.params.get('extension')

    if not url_domain:
        try:
            req_body = req.get_json()
        except ValueError:
            pass
        else:
            url_domain = req_body.get('domain')
            url_ext = req_body.get('extension')

    if url_domain and url_ext:
        # Only build the URL and run the scraper once both parameters are present
        url_full = "http://www." + url_domain + "." + url_ext
        email_result = scraper.get_email_addresses(url_full)
        return func.HttpResponse(f"Email result: {email_result}.")
    else:
        return func.HttpResponse(
            "This HTTP triggered function executed successfully. Pass a domain and an extension in the query string or in the request body.",
            status_code=200
        )

All of this new code is still on our local machine at this moment. Before we deploy it to Azure we need to modify one more file: requirements.txt

This file lists all the dependencies the code needs to run. Since we added the scraper, it now needs the requests and Beautiful Soup libraries. Azure uses this file as a guide to know which libraries to install for the function app.

Updating the requirements can be done manually but the easiest way is to open a terminal in Visual Studio Code and run the following command;

pip freeze > requirements.txt

This will take all the dependencies of your current venv (virtual environment) and write them to the file. Careful: for this to work, you will first need to pip install requests and beautifulsoup4 in your venv if you haven’t already done so.

Your requirements.txt file should look like this

beautifulsoup4==4.11.1
bs4==0.0.1
certifi @ file:///private/var/folders/sy/f16zz6x50xz3113nwtb9bvq00000gp/T/abs_0ek9yztvu3/croot/certifi_1665076692562/work/certifi
charset-normalizer==2.1.1
idna==3.4
requests==2.28.1
soupsieve==2.3.2.post1
urllib3==1.26.13

Deployment and Testing

As I mentioned earlier, all of the code is still on the local machine. In order to run it as a cloud service we need to deploy it to Azure. Simply right-click your function in the VS Code Explorer and select “Deploy to Function App…”

Image source: azure.microsoft.com

Once this process is completed, we can test the function. Just like we did with the template, we can run the function by making an HTTP request in your browser. However, we will no longer use the ‘name’ parameter. This time, the following two parameters will be used;

  • domain
  • extension (e.g. com)

The reason I split them up is that it is a bit more complex to pass a URL as a parameter within a parent URL. As this is just a demo script, a simple split bypasses this issue.
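An alternative to splitting would be to URL-encode the full target URL before passing it as a single query parameter. A quick illustration with Python’s standard library:

```python
from urllib.parse import quote, unquote

full_url = "http://www.example.com"

# Percent-encode everything, including ':' and '/'
encoded = quote(full_url, safe="")
print(encoded)  # http%3A%2F%2Fwww.example.com

# The function would decode it again on its side
assert unquote(encoded) == full_url
```

For this demo, though, splitting domain and extension keeps both the client and the function code simpler.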

This was the URL we used with the template;
YOURAPPNAME.azurewebsites.net/api/YOURFUNCTIONNAME?name=SOMEPARAMETER

Replace ‘name’ with ‘domain’ and add a second parameter, ‘extension’, like this:

YOURAPPNAME.azurewebsites.net/api/YOURFUNCTIONNAME?domain=SOMEDOMAINNAME&extension=SOMEEXTENSIONWITHOUTTHEDOT

If all went well, the function should now return a list of all email addresses it found on the page. If the result is empty, either something went wrong or the page does not contain any email addresses.
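Instead of the browser, you could also call the endpoint from a small Python client. A minimal sketch, with hypothetical app and function names; the helper only assembles the URL, and the commented lines would perform the actual request with the requests library:

```python
def build_scraper_url(app_name, function_name, domain, extension):
    # Assemble the function endpoint with the two query parameters
    return (f"https://{app_name}.azurewebsites.net"
            f"/api/{function_name}?domain={domain}&extension={extension}")

url = build_scraper_url("myscraperapp", "emailscraper", "example", "com")
print(url)
# https://myscraperapp.azurewebsites.net/api/emailscraper?domain=example&extension=com

# import requests
# print(requests.get(url).text)  # e.g. "Email result: [...]"
```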

For debugging, you can go to ‘Monitor’ under your function in the Azure Portal. Here you can see the logs and find out what is causing problems.

Summary

There you have it: a fully functioning cloud-based API with AI-generated code, set up in no time. You can use this as a starting point for your custom web scrapers and APIs.

But be careful: the API we set up here is public and can thus be used by anyone with the link. And even though the Azure costs are extremely low, you probably want to add some sort of authentication to protect the usage of your API.

Lastly, you should know that Azure Functions can be used for many other applications, such as a Twitter bot. I recently built a Twitter bot myself that checks NBA results and tries to estimate when LeBron James will break the NBA all-time scoring record. The bot can be found here: https://twitter.com/LebronPtsRecord.

Azure Functions takes away a lot of the time, maintenance, and security work of setting up your own infrastructure, at an extremely low cost. Even though I mainly work with Azure myself (for professional reasons), you can find similar solutions from competitors like AWS, Google Cloud, and the Chinese Alibaba Cloud (Aliyun).

In another article I might break down how I deployed the Twitter bot on Azure, so don’t forget to subscribe to stay updated.

About Me: My name is Bruno and I work as a data consultant. If you want to see the other stuff I built, like a mumble rap detector, make sure to take a look at my profile. Or connect with me via my website: https://www.zhongtron.me
