Understanding IP addresses for Data Analysis – A practical example using BigQuery

understanding-ip-addresses-for-data-analysis-using-bigquery
databases devopedia

By analyzing IP addresses, we can gain a better understanding of the geographical distribution of users, identify trends in their behavior, and detect fraudulent activity. But what exactly are these IP addresses?

IP addresses make the Internet work

It’s most likely that you have seen an IP address, but did you know that IP addresses make the Internet work?

Essentially, an IP address is a numerical label assigned to each device connected to a computer network that uses a set of rules that governs Internet communication. This set of rules stands for Internet Protocol and the IP address for Internet Protocol address.

These addresses serve as unique identifiers for devices and allow them to communicate with each other by sending and receiving data over the internet. This is why everything in our life, including our laptops, smartphones, smartwatches, smart home devices, and even our cars have an IP address. Without them, it would be impossible for these devices to access the internet.

The postal service analogy

Let’s imagine we have a friend who lives in a different country than ours. If we want to send him a letter, we need to write his address on the envelope so the postman knows where to deliver it.

Just like a house needs a unique address, a device in a computer network —like the Internet— also needs a unique address called IP address. So when we send data, like an email or a message, to a specific device, the network knows exactly where to deliver it.

1_postal_service_analogy

In order for a postal service or a network to work effectively, it’s important that every house and every device has a unique address. If there were two houses with the same address, the postman wouldn’t know which house to deliver the letter to. Similarly, if two devices had the same IP address, the computer network wouldn’t know which one to send the data to.

What is my IP address?

To find the IP address of our computer, we can use one of the following methods:

  • On Windows, open the Command Prompt (CMD) and type ipconfig.
  • On Mac or Linux, open the terminal and type ifconfig.

2_ipconfig

We can see that our IP address and more specifically IP version 4 (IPv4 address) is 192.168.1.204.

NOTE: There is a chance that we have the same IP address but how can this happen since we said that IP addresses are unique? The short answer is that this is our private IP address and private addresses are not unique on a global scale, but rather are unique within the local network. We will explain what that means in "Private and Public Networks" section.

IPv4 address

An IPv4 address is a 32-bit length, which means it contains 32 binary digits. A 32-bit address space limits the number of unique addresses to 232 (~4.3 billion IPv4 addresses).

3_ipv4

So in this example, 192.168.1.204 can be written as 11000000.10101000.00000001.11001100 . The most common format, known as dotted decimal, is x.x.x.x. It contains 4 sections which are called octets (8 bits) and these octets are separated by dots. Each octet in theory can contain any number between 0 and 255, since 255 is the largest number that can be composed of 8 bits — 255 -> 11111111.

NOTE: A binary digit, or bit, is the smallest unit of information in a computer. It is used for storing information and has a value of true/false. An individual bit has a value of either 0 or 1.

An IP address can be divided into 2 parts. The first part represents the network portion and the second part represents the host portion. The network part specifies the unique number that is assigned to our network and the host portion uniquely identifies this device on the network. Returning to the postal service analogy a network can be the street address and the host is what makes our address unique the street number.

To know which part of the IP Address represents the network we use something called subnet mask. A subnet mask is always paired with an IP Address and is used to identify the host portion and the network portion of the address. It is also 32-bit length and in this example, 255.255.255.0 can be written as 11111111.11111111.11111111.00000000. In its simplest form whenever we see 255 this is the network portion of the address and when we see 0 this is the host portion. In this example, the network portion is the 192.168.1. and the host portion is the 204.

NOTE: A subnet mask, a network mask, a netmask, or simply a mask are in most cases the same thing.

Finally, we will also briefly discuss two other concepts: CIDR notation and network prefix.

CIDR notation is a compact way of representing an IP address and its corresponding network mask. It consists of an IP address, a slash character (‘/’), and a decimal number (bit mask). This decimal number represents the number of consecutive 1-bits in the network mask, starting from the leftmost bit. In this example, the CIDR notation for the IP address 192.168.1.204 with a network mask of 255.255.255.0 would be written as 192.168.1.204/24.

A network prefix is used to identify the network to which a device belongs. When a network mask is applied to the IP address using a bitwise AND operation we get the network prefix. In this example, the network prefix of the IP address 192.168.1.204 with a network mask of 255.255.255.0 would be written as 192.168.1.0. What we did is to find the network portion and then add zeros at the end of this prefix so that it is a valid 32-bit address again.

NOTE: A network prefix, routing prefix or simply network bin are in most cases the same thing.

Classes of IP addresses

In the early days of IP addresses, it was decided to divide all available addresses into groups called classes where each class has a range of IP addresses. This was done to make address allocation scalable and to control the number of hosts available to each network. There were three main classes: Class A, Class B, and Class C. There was also a class called Class D, which was reserved for multicast addresses, and Class E, which was reserved for experimental use.

Class A has 3 octets available for host allocation, so a single Class A network can have 16,777,214 hosts. Class B has 2 octets available, so a single Class B network can have 65,534 hosts. Class C has 1 octet available, so a single Class C network can have 254 hosts.

4_classes_ipv4

In practice, the allocation of IPv4 addresses based on Class A Class B, and Class C included inefficiencies that caused the pool of unassigned IPv4 addresses to drain faster than it should be. The way it used to work was that IPv4 address blocks of Class A Class B and Class C were assigned directly to organizations by IANA (Internet Assigned Numbers Authority). So, for example, if an organization needed more than 254 host machines, it would need to switch to Class B. However, this could waste over 65k hosts if the business did not need to use them. For this reason, the Internet Engineering Task Force introduced CIDR design in 1993 to fix this problem. CIDR involves assigning blocks of addresses in a hierarchical manner, beginning with large blocks that are provided to large organizations. These organizations then could divide the blocks and assign them to smaller groups. CIDR design is also used today for the new version of IP addresses IPv6.

Public and Private Networks

No one could have predicted the massive explosion of computers and the internet, but this has led to a problem. There are no more unallocated IPv4 addresses left (2011). To solve this problem, a new version of IP address, IPv6, has been designed to provide more than enough IP addresses. However, there is also a solution to extend the life of IPv4. This solution involves carving out small sections from each of the three classes and calling them private IP addresses. The remaining addresses are called public IP addresses. Private and public IP addresses both use the same subnet mask for their respective class and can still have the same number of hosts. Public IP addresses must be unique, but private IP addresses can be reused, saving millions of public IP addresses.

5_private_public_networds

As an example, we will use the 192.168.1.0 Class C network. The network address is 192.168.1 and the last number is the host address, so all devices can communicate with each other without any problems. Private IP addresses can be used by anyone – for example, our neighbors might be using the same addresses. Private IP addresses only need to be unique within our own network. Private IP addresses cannot be used over the internet, otherwise we would have duplicate IP addresses.

For that reason, only public IP addresses can be used over the internet. When someone signs up with an ISP (Internet Service Provider), they will issue a public IP address.

NOTE: To find your public IP address you can visit whatsmyip.org, whatismyip.com, or any similar website.

This helps to extend the life of IPv4. Our devices communicate with the internet using the public IP address through a process called Network address Translation (NAT). NAT allows multiple devices on a private network to share a single public IP address when accessing the internet. This helps to conserve public IP addresses and allows private networks to use private IP addresses internally.

NOTE: To better understand the importance of Public IP addresses, every web server is assigned a public IP address by its Internet Service Provider (ISP). Websites live on web servers. However, we don’t typically access a website using IP addresses. Instead, we use domain names, like www.google.com. The Domain Name System (DNS) is a server that translates these domain names into IP addresses. Some of the most common DNS servers are Google Public DNS, Cloudflare DNS, and OpenDNS. .

Toward IPv6

When the IPv4 protocol was first deployed in 1983 by the Advanced Research Projects Agency Networks (ARPANET), it was not expected that the 3.2 billion available IP addresses provided by its 32-bit architecture would eventually become insufficient. However, the pool of available IPv4 addresses started to drain due to the rapid expansion of the internet and the allocation inefficiencies, leading to the development of IPv6, which was standardized in 1998. Although today the pool of available IPv4 addresses has been drained, these addresses are still in use and will continue to be used for the foreseeable future.

An IPv6 address is a 128-bit length, which means it contains 128 binary digits. A 128-bit address space allows for a significantly larger number of unique addresses, specifically 2^128 (~340 undecillion or 340 trillion trillion trillion) IPv6 addresses.

6_ipv6

IPv6 is written in hexadecimal notation, separated into 8 groups of 16 bits by the colons, thus (8 x 16 = 128) bits in total.

A valid example of IPv6 is 2001:1234:5678:1234:5678:0000:EF12:1234/64. The above address could also be written as 2001:1234:5678:1234:5678::EF12:1234/64 where consecutive zeros are eliminated and replaced by a double colon sign (::). If an address consists of multiple all-zero fields and those zeros occur in different parts of the IP, then only the leftmost zeros are the ones that are compressed. It is important to note that when an IP address contains multiple fields of all zeros, only the zeros on the left side of the address can be compressed.

In IPv6, subnet masks are not used. Instead, a Classless Inter-Domain Routing (CIDR) notation is

used to divide the address into a network portion and a host portion. To determine which part of the IPv6 address represents the network the bit mask of CIDR notation is used. For example in 2001:1234:5678:1234:5678:0000:EF12:1234/64, 64 bits represent the network portion so 2001:1234:5678:1234 and the rest is the host 5678:0000:EF12:1234. To find the network prefix we need to add zeros at the end of this prefix so that it is a valid 128-bit address again. In this case, the network prefix would be 2001:1234:5678:1234:0000:0000:0000:0000 or 2001:1234:5678:1234::.

Practical Example

We will write a program to gather a collection of random IP addresses from the internet and treat each one as a unique customer. Then, we will simulate some fake orders to analyze the geographic distribution and behavioral patterns of our customers.

The full code can be found here.

If you want to just test or use this functionality in BigQuery, without implementing it, go to — 8. IP address geographic distribution Public Demo.

0. Prerequisites

To continue we need to make sure we have done the following:

  • Create a free account at MaxMind and find MaxMind’s Licence Key in your profile.
  • Copy and rename .env_sample to .env
  • Update MAXMIND_LICENCE_KEY variable with your own MaxMind’s License Key.

Moreover, in order to run the code snippets that follow we need to replace the following variables with our own:

An easy way to follow the tutorial is to copy Readme.md and Search-Replace with your own using an editor.

1. Clone the repository (CLI)

We will start by cloning the repo.

git clone https://github.com/justdataplease/bigquery-ipaddresses.git

2. Download MaxMind’s GeoLite 2 database

We will download the latest MaxMind's GeoLite 2 City database in zip format.

pip install -r requirements.txt
python update_db.py

As a result, a folder named GeoLite2-City-CSV_YYYYMMDD will be created (where YYYYMMDD is the current date).

3. Upload GeoLite 2 City database to BigQuery

Next, we need to navigate to the folder named GeoLite2-City-CSV_YYYYMMDD.

cd .\GeoLite2-City-CSV_YYYYMMDD

The following files will exist inside this folder :

GeoLite2-City-Blocks-IPv4.csv
GeoLite2-City-Blocks-IPv6.csv
GeoLite2-City-Locations-en.csv

We now need to create a new dataset in BigQuery.

bq mk ip_address

Finally, we should run the following commands to upload the CSV files to the new dataset. For this tutorial, we will be using the Europe (eu) location.

bq load --replace --location=eu --project_id=<your-project-id> --skip_leading_rows=1 ip_address.geolite2_city_blocks_ipv4 GeoLite2-City-Blocks-IPv4.csv "network:STRING,geoname_id:INTEGER,registered_country_geoname_id:INTEGER,represented_country_geoname_id:INTEGER,is_anonymous_proxy:BOOL,is_satellite_provider:BOOL,postal_code:STRING,latitude:FLOAT64,longitude:FLOAT64,accuracy_radius:FLOAT64"
bq load --replace --location=eu --project_id=<your-project-id> --skip_leading_rows=1 ip_address.geolite2_city_blocks_ipv6 GeoLite2-City-Blocks-IPv6.csv "network:STRING,geoname_id:INTEGER,registered_country_geoname_id:INTEGER,represented_country_geoname_id:INTEGER,is_anonymous_proxy:BOOL,is_satellite_provider:BOOL,postal_code:STRING,latitude:FLOAT64,longitude:FLOAT64,accuracy_radius:FLOAT64"
bq load --replace --location=eu --project_id=<your-project_id> --skip_leading_rows=1 ip_address.geolite2_city_locations GeoLite2-City-Locations-en.csv "geoname_id:INTEGER,locale_code:STRING,continent_code:STRING,continent_name:STRING,country_iso_code:STRING,country_name:STRING,subdivision_1_iso_code:STRING,subdivision_1_name:STRING,subdivision_2_iso_code:STRING,subdivision_2_name:STRING,city_name:STRING,metro_code:STRING,time_zone:STRING,is_in_european_union:BOOL"

4. Create new tables

IP addresses found in MaxMind’s Geolite2 Database are in CIDR notation. For us to be able to use them to perform IP address Lookup we need to split them into the network bin and the network mask.

To better understand how we can split them, we can run the following query:

-- Input
SELECT '79.166.18.62' ip_address,NET.IP_TO_STRING((NET.IP_FROM_STRING('79.166.18.62') & NET.IP_NET_MASK(4, 16))) network_bin, NET.IP_TO_STRING(NET.IP_NET_MASK(4, 16)) net_mask

-- Output
ip_address    network_bin     net_mask
79.166.18.62   79.166.0.0      255.255.0.0

Finally, to create the new tables we should run the following:

CREATE OR REPLACE TABLE `<your-project_id>.ip_address.geolite2_city_ipv4`
AS
SELECT
NET.IP_FROM_STRING(REGEXP_EXTRACT(network, r'(.*)/' )) network_bin,
CAST(REGEXP_EXTRACT(network, r'/(.*)' ) AS INT64) mask,
l.city_name, l.country_iso_code, l.country_name, b.latitude, b.longitude
FROM `ip_address.geolite2_city_blocks_ipv4` b
JOIN `ip_address.geolite2_city_locations` l
USING(geoname_id);
CREATE OR REPLACE TABLE `<your-project_id>.ip_address.geolite2_city_ipv6`
AS
SELECT
-- Split CIDR notation to the network bin and the netmask.
NET.IP_FROM_STRING(REGEXP_EXTRACT(network, r'(.*)/' )) network_bin,
CAST(REGEXP_EXTRACT(network, r'/(.*)' ) AS INT64) mask,
l.city_name, l.country_iso_code, l.country_name, b.latitude, b.longitude
FROM `ip_address.geolite2_city_blocks_ipv6` b
JOIN `ip_address.geolite2_city_locations` l
USING(geoname_id);

5. Create a sample dataset

For educational purposes, we will use a list of free HTTP proxy addresses as IPv4 addresses. These addresses are collected daily in this GitHub repository. In this example, we will only use IPv4 addresses but if you want to also test IPv6 addresses check out section 8. IP address geographic distribution Public Demo. We will also assume that these IP addresses identify customers — an ip_address is a customer_id — and we will simulate some fake orders in order to inspect behavioral patterns.

Inside the file collect_free_proxies.py, we can find 2 functions:

The save_ipaddresses() function retrieves and returns a list of HTTP proxy addresses. The create_fake_orders() function generates 100,000 fake orders for the years 2021-2023, with random amounts and timestamps chosen uniformly from a range. To create diverse customer behavior patterns for each order, a customer is randomly selected with a probability from the Beta(2,2) distribution.

We can initiate both functions by running the following:

python collect_free_proxies.py

As a result, we will get a CSV file called free-proxies.csv and a CSV called fake-orders.csv which we will upload to BigQuery.

6. Upload IP addresses to BigQuery

To upload free-proxies.csv and fake-orders.csv to BigQuery we should run the following:

bq load --replace --location=eu --project_id=<your-project-id> --skip_leading_rows=1 ip_address.example_ip_addresses free-proxies.csv "ip_address:STRING"
bq load --replace --location=eu --project_id=<your-project-id> --skip_leading_rows=1 ip_address.example_orders fake-orders.csv "customer_id:STRING,amount:FLOAT,datetime:DATETIME"

7. IP address geographic distribution

In this step, we will retrieve geolocation information (i.e Country and City) for the IP addresses we have collected in Step 5. We will also materialize the results into a table in order to use it for further analysis.

CREATE OR REPLACE TABLE ip_address.example_ip_addresses_with_location AS

WITH sample_dataset AS (
SELECT ip_address FROM ip_address.example_ip_addresses
),
-- Find IP address version and convert string IP address to binary format (bytes).
sample_dataset_1 AS (
SELECT
ip_address,
NET.SAFE_IP_FROM_STRING(ip_address) ip_address_in_bytes,
CASE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip_address))
WHEN 4 THEN 'IPv4'
WHEN 16 THEN 'IPv16'
ELSE 'other' END ip_address_version
FROM sample_dataset
),
-----PROCESS IPV4 Addresses
-- Select only IPv4 IP Addresses
ipv4_addresses AS (
SELECT DISTINCT * FROM sample_dataset_1 WHERE ip_address_version='IPv4'
),
-- Create all possible netmasks from 255.0.0.0 to 255.255.255.255
ipv4_netmasks AS (
SELECT mask FROM UNNEST(GENERATE_ARRAY(8,32)) mask
),
-- Lookup Addresses on MaxMind's Geolite2 Database
ipv4d_addresses AS (
SELECT * FROM
(
-- Find the network bin that identifies the network
-- to which ip address belongs
SELECT ip_address, ip_address_in_bytes & NET.IP_NET_MASK(4, mask) network_bin, mask
FROM ipv4_addresses
CROSS JOIN ipv4_netmasks
)
-- Keep what matches with MaxMind's Geolite2 Database
JOIN `eu.geolite2_city_ipv4` USING (network_bin, mask)
),
-----PROCESS IPV6 Addresses
-- Select only IPv6 IP Addresses
ipv6_addresses AS (
SELECT DISTINCT * FROM sample_dataset_1 WHERE ip_address_version='IPv16'
),
-- Create all possible netmasks from ffff:e000:: to ffff:ffff:ffff:ffff::
ipv6_netmasks AS (
SELECT mask FROM UNNEST(GENERATE_ARRAY(19,64)) mask
),
-- Lookup Addresses on MaxMind's Geolite2 Database
ipv6d_addresses AS (
SELECT * FROM
(
-- Find the network bin that identifies the network
-- to which ip address belongs
SELECT ip_address, ip_address_in_bytes & NET.IP_NET_MASK(16, mask) network_bin, mask
FROM ipv6_addresses
CROSS JOIN ipv6_netmasks
)
-- Keep what matches with MaxMind's Geolite2 Database
JOIN `eu.geolite2_city_ipv6` USING (network_bin, mask)
)
-- Combine results
SELECT * FROM ipv4d_addresses
UNION ALL
SELECT * FROM ipv6d_addresses

We will also merge the above table with the table that includes the fake orders that we have generated in Step 5. As we mentioned earlier we assumed that an IP address (ip_address) refers to a Customer ID (customer_id).

CREATE OR REPLACE TABLE ip_address.example_orders_with_location AS
SELECT * FROM ip_address.example_orders o
JOIN ip_address.example_ip_addresses_with_location l ON l.ip_address=o.customer_id

8. IP address geographic distribution Public Demo

We can also test or use this functionality in BigQuery without implementing it, by using the publicly available tables (eu.geolite2_city_ipv4 , eu.geolite2_city_ipv6). Keep in mind that this dataset will not get updated.

The following example is for Europe-based (eu) data sets.

WITH sample_dataset AS (
SELECT ip_address FROM
UNNEST(['2604:bc80:8001:1064::2','2a02:587:b213:7db4:462e:5d32:93ea:76e8','62.38.6.90','69.162.81.155']) ip_address
),
-- Find IP address version and convert string IP address to binary format (bytes).
sample_dataset_1 AS (
SELECT
ip_address,
NET.SAFE_IP_FROM_STRING(ip_address) ip_address_in_bytes,
CASE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip_address))
WHEN 4 THEN 'IPv4'
WHEN 16 THEN 'IPv16'
ELSE 'other' END ip_address_version
FROM sample_dataset
),
-----PROCESS IPV4 addresses
-- Select only IPv4 IP addresses
ipv4_addresses AS (
SELECT DISTINCT * FROM sample_dataset_1 WHERE ip_address_version='IPv4'
),
-- Create all possible netmasks from 255.0.0.0 to 255.255.255.255
ipv4_netmasks AS (
SELECT mask FROM UNNEST(GENERATE_ARRAY(8,32)) mask
),
-- Lookup addresses on MaxMind's Geolite2 Database
ipv4d_addresses AS (
SELECT * FROM
(
-- Find the network bin that identifies the network
-- to which ip address belongs
SELECT ip_address, ip_address_in_bytes & NET.IP_NET_MASK(4, mask) network_bin, mask
FROM ipv4_addresses
CROSS JOIN ipv4_netmasks
)
-- Keep what matches with MaxMind's Geolite2 Database
JOIN `justfunctions.eu.geolite2_city_ipv4` USING (network_bin, mask)
),
-----PROCESS IPV6 addresses
-- Select only IPv6 IP addresses
ipv6_addresses AS (
SELECT DISTINCT * FROM sample_dataset_1 WHERE ip_address_version='IPv16'
),
-- Create all possible netmasks from ffff:e000:: to ffff:ffff:ffff:ffff::
ipv6_netmasks AS (
SELECT mask FROM UNNEST(GENERATE_ARRAY(19,64)) mask
),
-- Lookup addresses on MaxMind's Geolite2 Database
ipv6d_addresses AS (
SELECT * FROM
(
-- Find the network bin that identifies the network
-- to which ip address belongs
SELECT ip_address, ip_address_in_bytes & NET.IP_NET_MASK(16, mask) network_bin, mask
FROM ipv6_addresses
CROSS JOIN ipv6_netmasks
)
-- Keep what matches with MaxMind's Geolite2 Database
JOIN `justfunctions.eu.geolite2_city_ipv6` USING (network_bin, mask)
)
-- Combine results
SELECT * FROM ipv4d_addresses
UNION ALL
SELECT * FROM ipv6d_addresses

or by using the following public table function (created by Author — JustFunctions)

SELECT * FROM `justfunctions.eu.ip_lookup`("2a02:587:b213:7db4:462e:5d32:93ea:76e8,62.38.6.90,69.162.81.155")
We can also directly use the final table that we have created in Step 7 for analysis:

We can also directly use the final table that we have created in Step 7 for analysis:

SELECT * FROM `justfunctions.eu.example_orders_with_location`

9. The analysis

By visualizing the results we can examine the geographical distribution and the behavioral patterns of our customers. Our analysis shows that China and the United States are our top-performing countries, contributing more than 20% to our total revenue. Additionally, four of our most valuable customers are from China. Interestingly, China and the United States have contributed almost equally to our total revenue, but our most valuable customers are all from China. This suggests that we may have fewer customers in China who are making larger purchases, while we have more customers in the United States making smaller purchases. Examining our time series data, we can see that although China generally generates more revenue, there are certain months such as January and February 2022, where the United States outperforms China in terms of profitability.

7_fake_orders_dashboard

References

[1] Geolocation with Bigquery, Google Cloud

[2] What you need to know about IPv6, Redhat.com

[3] IPv4 address exhaustion, Wikipedia.com

[4] Classless Inter-Domain Routing, Wikipedia.com

[5] Internet Protocol version 4, Wikipedia.com

[6] CIDR, TechTarget.com

[7] IPv6, Wikipedia.com