In Part 1, we identified the limitations of User Defined Functions (UDFs) in BigQuery and explored a solution called Remote Functions that allowed us to overcome these limitations.
In short, UDFs allow us to define our own custom functions, written in either SQL or JavaScript, to process data in BigQuery. While they are a powerful tool for performing complex data transformations (see an example here) or operations, they have some limitations: they cannot be used to make API calls to external services, and they can only be written in SQL or JavaScript. To overcome these limitations, Google introduced Remote Functions, which let us use Cloud Functions from within BigQuery.
As a practical example, in Part 1 of this tutorial, we built a web app that used the Azure Translator API to take text input and translate it to the desired language.
In Part 2, we will build a more complex web app that uses MaxMind's free offline database to take an IP address as input and retrieve geolocation information such as the country, city, and postal code.
This can be very useful for a variety of purposes, such as analyzing the geographic distribution of our users.
You may be curious about the cost of building this IP address lookup functionality, as similar services that perform this task in batch can be expensive. However, in this case, we will not break the bank.
For the web app, we will be using Google's Cloud Functions, which are more cost-effective than renting a server since they are charged only when used. We will stay within the free tier, which currently offers 2 million invocations per month at no cost. Read more here.
For the IP address lookup, we will be using MaxMind's free offline database. Since it is free, certain information, such as latitude and longitude, will be less accurate than in the paid version of the service. Also, since we are using an offline database, we should update it frequently.
The full code can be found here.
To continue, we need to make sure we have a Google Cloud project (with the gcloud CLI installed and authenticated) and a free MaxMind account with a licence key.
Moreover, in order to run the code snippets that follow we need to replace the following variables with our own:
*An easy way to follow the tutorial is to copy Readme.md and search-replace the values above with your own using an editor.*
git clone https://github.com/justdataplease/bigquery-iplookup.git
cd bigquery-iplookup
The repo directory has the following structure:
.env_sample: Includes the environment variables with some sample values. Later on, we will rename it from .env_sample to .env and replace the sample values with our own.
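After renaming, our .env will contain values like the following (placeholders shown; these are the two variables referenced later in this tutorial):

MAXMIND_LICENCE_KEY=<your-maxmind-licence-key>
CLOUD_FUNCTION_URL=<your-cloud-function-url>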
db folder: Includes the GeoLite2-City database, which is the free (less accurate) version of MaxMind's GeoIP2-City database.
iplookup.py: Includes the GeoLocation class, which is designed to determine the geographical location (country, state, city, postal code, latitude, and longitude) of an IP address.
The update_db() method downloads and installs the MaxMind GeoLite2 database if it is not already present on the system. It does this by sending a GET request to the MaxMind download URL, with the appropriate parameters to download the GeoLite2-City database. It then extracts the contents of the downloaded .tar.gz file and renames the extracted folder to "db". To update the database using this method, we need to set MAXMIND_LICENCE_KEY in our .env file and delete the existing db folder located inside the repo.
The lookup_ip() method takes an IP address as an argument and returns a dictionary containing the location information for that IP address (country, state, city, postal_code, latitude, and longitude). A condensed sketch of the class follows.
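Conceptually, iplookup.py looks something like this. It is a simplified sketch based on the method descriptions above, using the geoip2 and requests packages; exact paths, file names, and error handling may differ in the repo version.

# iplookup.py (simplified sketch; details may differ from the repo version)
import glob
import os
import tarfile

import geoip2.database
import requests

DB_PATH = 'db/GeoLite2-City.mmdb'  # assumed location of the database file
DOWNLOAD_URL = 'https://download.maxmind.com/app/geoip_download'


class GeoLocation:
    def __init__(self):
        # Download the database on first use, then open a reader
        if not os.path.exists(DB_PATH):
            self.update_db()
        self.reader = geoip2.database.Reader(DB_PATH)

    def update_db(self):
        # Download the GeoLite2-City archive using our MaxMind licence key
        params = {
            'edition_id': 'GeoLite2-City',
            'license_key': os.environ['MAXMIND_LICENCE_KEY'],
            'suffix': 'tar.gz',
        }
        rs = requests.get(DOWNLOAD_URL, params=params)
        with open('GeoLite2-City.tar.gz', 'wb') as f:
            f.write(rs.content)
        # Extract the .tar.gz and rename the dated folder to "db"
        with tarfile.open('GeoLite2-City.tar.gz') as tar:
            tar.extractall()
        os.rename(glob.glob('GeoLite2-City_*')[0], 'db')

    def lookup_ip(self, ip_address: str) -> dict:
        # Return the location details of an IP address as a dictionary
        r = self.reader.city(ip_address)
        return {
            'country': r.country.name,
            'state': r.subdivisions.most_specific.name,
            'city': r.city.name,
            'postal_code': r.postal.code,
            'latitude': r.location.latitude,
            'longitude': r.location.longitude,
        }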
main.py: Includes the code for our cloud function.
import json

import functions_framework

from iplookup import GeoLocation


@functions_framework.http
def iplookup(request):
    """
    Defines the iplookup Google Cloud Function
    :param request: HTTP request sent by BigQuery, containing the rows to process
    :return: JSON string with one reply per input row
    """
    request_json = request.get_json()
    calls = request_json['calls']
    replies = []
    lookup = GeoLocation()
    for call in calls:
        ip_address = call[0]
        rs = lookup.lookup_ip(ip_address=ip_address)
        # each reply is a STRING (JSON not currently supported)
        replies.append(json.dumps(rs, ensure_ascii=False))
    return json.dumps({'replies': replies})
This function has a special form in order to work with BigQuery. It must accept multiple rows (calls) as input and iterate through each one (call) to perform the lookup. For each row (call), it pulls one column (call[0]), which is treated as the IP address we want to retrieve information for. Finally, we gather each response into a list named replies and convert this list into a string.
test_cloud_function.py: Includes the code to test our cloud function. We will use it after the Cloud Function is deployed.
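Roughly, test_cloud_function.py does something like the following (a sketch; it assumes CLOUD_FUNCTION_URL is read from .env via the python-dotenv package):

# test_cloud_function.py (sketch; assumes CLOUD_FUNCTION_URL is set in .env)
import json
import os

import requests
from dotenv import load_dotenv

load_dotenv()

# Payload in the shape BigQuery sends to a Remote Function: one inner list per row
data = {'calls': [['190.61.88.147'], ['139.99.237.62'],
                  ['20.111.54.16'], ['185.143.146.171'], ['121.126.20.41']]}

rs = requests.post(os.environ['CLOUD_FUNCTION_URL'], json=data)
print(json.dumps(rs.json(), indent=2, ensure_ascii=False))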
First of all, to deploy our Cloud Function we need to make sure that the required APIs (such as Cloud Functions, Cloud Build, Artifact Registry, and Cloud Run) are enabled in our project.
Then we should run the following command to deploy our directory as a Cloud Function. To learn more about the parameters used, read here.
gcloud functions deploy bigquery-iplookup --gen2 --runtime python39 --trigger-http --project=<your-project-id> --entry-point=iplookup --source . --region=europe-west3 --memory=128Mi --max-instances=3 --allow-unauthenticated
--allow-unauthenticated: For simplicity, we added this parameter, which makes our function public. If you reproduce this tutorial, it is better to avoid using this parameter.
From the output of the last command, we should note the URI (e.g. https://bigquery-iplookup-xxxxxx.a.run.app), or visit the Google Cloud Console Functions page.
Now to test the Cloud Function, we can update CLOUD_FUNCTION_URL in our .env file and run the following code.
python test_cloud_function.py
# Input
{'calls': [['190.61.88.147'],['139.99.237.62'],
['20.111.54.16'],['185.143.146.171'],['121.126.20.41']]}
# Output
{'replies': [
'{"country": "Guatemala", "state": "Departamento de Guatemala", "city": "Guatemala City", "postal_code": "01010", "latitude": 14.6343, "longitude": -90.5155}',
'{"country": "Australia", "state": "New South Wales", "city": "Sydney", "postal_code": "2000", "latitude": -33.8715, "longitude": 151.2006}',
'{"country": "France", "state": "Paris", "city": "Paris", "postal_code": "75001", "latitude": 48.8323, "longitude": 2.4075}',
'{"country": "Ukraine", "state": "Kyiv City", "city": "Kyiv", "postal_code": "03187", "latitude": 50.458, "longitude": 30.5303}',
'{"country": "South Korea", "state": "Seoul", "city": "Guro-gu", "postal_code": "083", "latitude": 37.4975, "longitude": 126.8501}']}
From the output, we can see that our Cloud Function is working!
Keep in mind that we can also test our function locally, before deployment, by running the following command (target is the name of the function defined in our main.py file):
functions_framework --target=iplookup
From the output we get, we update CLOUD_FUNCTION_URL in our .env file with the appropriate localhost address (e.g. http://192.168.2.7:8080).
python test_cloud_function.py
Finally, we will get the same output as above.
We need to connect BigQuery and Cloud Functions in order to use a Cloud Function as a Remote Function. First, we update the gcloud components, then we create the connection and display its details:
gcloud components update
bq mk --connection --display_name='my_gcf_conn' --connection_type=CLOUD_RESOURCE --project_id=<your-project-id> --location=EU gcf-conn
bq show --project_id=<your-project-id> --location=EU --connection gcf-conn
From the output of the last command (bq show), we should note the connection name (e.g. xxxxxx.eu.gcf-conn) because we will need it later.
To test our function we will create a toy dataset. This dataset will include our Remote Function and a table with some test data.
bq mk --dataset_id=<your-project-id>:iplookup --location=EU
To test our Remote Function we will create a table with test data.
CREATE OR REPLACE TABLE `<your-project-id>.iplookup.example_dataset` (
ip_address STRING);
INSERT INTO `<your-project-id>.iplookup.example_dataset`(ip_address)
VALUES ('190.61.88.147'),
('139.99.237.62'),
('20.111.54.16'),
('185.143.146.171'),
('121.126.20.41');
Finally, we will create our Remote Function.
CREATE OR REPLACE FUNCTION `<your-project-id>.iplookup.lookup`(ip_address STRING)
RETURNS STRING
REMOTE WITH CONNECTION `<gcf-conn-name>`
OPTIONS (
-- change this to reflect the Trigger URL of your cloud function (look for the TRIGGER tab)
endpoint = '<gcf-endpoint>'
);
To test our Remote Function, we will run it on the test data. The output of the function is a string, so we need to parse it to extract the information into columns.
WITH A AS (
  SELECT `<your-project-id>.iplookup.lookup`(ip_address) ip_address_location, ip_address
  FROM `<your-project-id>.iplookup.example_dataset`
)
SELECT
ip_address,
json_value(ip_address_location, '$.country') country,
json_value(ip_address_location, '$.state') state,
json_value(ip_address_location, '$.city') city,
json_value(ip_address_location, '$.postal_code') postal_code,
json_value(ip_address_location, '$.latitude') latitude,
json_value(ip_address_location, '$.longitude') longitude
FROM A;
We did it! Our function is working.
To remove the Cloud Function, the Remote Function, and the toy dataset, we need to run the following commands:
# Remove Cloud Function (gcf)
gcloud functions delete bigquery-iplookup --region=europe-west3 --project=<your-project-id> --gen2
# Remove DATASET
bq rm -r -f -d <your-project-id>:iplookup
# Remove connection between BigQuery and Cloud Functions (gcf-conn)
bq rm --connection --location=EU <gcf-conn-name>
In the second part of this tutorial, we showed you how to use Remote Functions to perform IP Address Lookup and retrieve geolocation information. Although this is the final part of our tutorial on Microservices as Functions in BigQuery, we will also be exploring other advanced features in the near future.
Stay tuned, since the BigQuery Quest has just started!
If you enjoy reading stories like this and want to support me as a writer, please subscribe!