Extend BigQuery NLP armory with Stemmers

extend-bigquery-nlp-armory-with-stemmers-implement-english-spanish-and-greek-stemmers-using-js-udfs
stemmers nlp

Motivation

Why extend the functionality of a modern database instead of writing a short Python function?

In the era of cloud computing and serverless architectures, many organizations are shifting away from traditional, on-premises systems and moving their data and applications to the cloud. This allows them to take advantage of the many benefits of cloud computing, such as scalability, flexibility, and cost-efficiency.

One example is the extensive use of modern databases like Google Bigquery, Amazon Redshift, and Snowflake. A modern database cannot only store and process efficiently large volumes of data but also support a wide range of additional functionalities that go beyond traditional database capabilities. These functionalities include features such as data integration, data processing, data visualization, and even machine learning. Since all these different tasks can be streamlined, users can reduce the amount of time and effort they need to spend and instead focus on most important matters.

For example, we have gathered a large dataset of news articles from a variety of sources and we want to train an NLP model. This requires knowledge in several areas, including programming, machine learning frameworks, databases, and server administration. It is essential to know how to write scripts in a programming language such as Python to prepare the data, how to use a machine learning framework like TensorFlow or PyTorch to train the model, and how to use a NoSQL database like MongoDB to store and retrieve the data. Additionally, training a machine learning model on a large dataset requires a significant amount of computing power, and managing the infrastructure to support this can be a challenging task that requires expertise in server administration. On the other hand, performing the same steps in a modern database like BigQuery only requires knowledge of SQL, a widely-used language for working with data in relational databases.

In that way, the use of a modern database can greatly simplify the process of working with data, making it more accessible to a wider range of users.

Stemming

1_stemming

Stemming is a normalization word technique and a very common preprocessing step in natural language processing (NLP) that is used to reduce words to their base form or stem. Stem refers to the morphological variant of a word's root form, which does not always have to be a valid root form. Stemming allows words with the same meaning to be treated as a single item, such as "connect", "connected", "connecting", "connection," and "connects" which all derive from the same base form "connect". This process reduces the number of unique words in the dataset which can significantly improve the efficiency and accuracy of text analysis algorithms.

Google Search adopted word stemming in 2003. Previously a search for “fish” would not have returned “fishing” or “fishes” [1]. 

Stemming algorithms typically use heuristics or rules-based approaches to identify and remove affixes (such as prefixes and suffixes) from words, resulting in a simplified word form. Although this approach can be very fast, it can also be error-prone so should be preferred when speed and efficiency are important considerations.

An error can be characterized as over-stemming or under-stemming. Over-stemming occurs when the algorithm reduces words to the same stems when it should reduce them to different stems. For example, based on Porter stemmer, "university" and "universe" are reduced to "univers" when should be reduced to two different stems since they do not have the same meaning.

Under-stemming occurs when the algorithm reduces words to different stems when it should reduce them to the same stem. For example, based on Porter stemmer, the words “data” and “datum.” are reduced to  "data" and "datum" respectively when should be reduced to the same stem.

Stemming algorithms broadly fall into one of two categories:

2_types_of_stemming

Rule-based

A brute force approach is when words are mapped to stems, using table lookup.  Another commonly known approach is to reduce words to their stem by applying a series of rules to strip affixes. Other processes take into account word morphology, including part of speech. These approaches require an extensive set of linguistic rules and thus are language-dependent. For that reason, most known rule-based stemmers are available for widely studied languages(English). Among the rule-based stemmers are Lovins, Dawson (Lovins based), Porter (recently Porter2), Paice/Husk (aka Lancaster) stemmers.

Statistical

A statistical approach requires training a model from a lexicon, a corpus, or character-based n-grams of the language's words. Using that approach, affix-stripping rules are derived. Statistical stemmers are language-independent since they do not require any language-specific information. Among the statistical stemmers are N-Gram, Krovetz, and Xerox stemmers. Statistical approaches are also an open field of research since there is an interest in text searches over languages with few linguistic resources.

In this tutorial, we will be implementing some rule-based stemmers using parts of pre-existing implementations in Javascript:

  • Lovins (English) [3] - GitHub repo (MIT Licence)
  • Porter (English) [3] - GitHub repo repo (MIT Licence)
  • Paice-Husk / Lancaster (English) [3] - GitHub repo repo (MIT Licence)
  • UniNE (Spanish) [3] - GitHub repo repo (MIT Licence)
  • Greek Stemmer (Greek) [4] - GitHub repo (MIT Licence)

BigQuery User-Defined Functions (UDFs)

A user-defined function (UDF) lets you create a function by using a SQL expression or JavaScript code. A UDF accepts columns of input, performs actions on the input, and returns the result of those actions as a value. Google Documentation

As we discussed earlier there are several reasons why we choose to use BigQuery UDFs (User-Defined Functions) instead of writing a Python program for a particular task.

  • Simplicity and ease of use: Allows to implement custom logic in Bigquery using SQL. This has many benefits, especially for users who are not familiar with Python or other programming languages.
  • Performance and scalability: BigQuery UDFs can be executed in parallel across multiple virtual CPUs (slots) in BigQuery distributed system, allowing for efficient and scalable processing of large datasets. In contrast, a Python program would typically run on a single server and may not be able to scale as well.
  • Cost savings: BigQuery UDFs allow users to leverage the existing infrastructure and resources of BigQuery, without the need to manage additional computational resources for running Python programs

Bigquery supports two types of UDFs, SQL-based and Javascript-based (Also supports remote functions UDFs but we will refer to this functionality in another article).

SQL-based

When we want to create a function that uses BigQuery's existing functionality we should use SQL UDFs. For example, we want to create a function that deduplicates letters from a word.

CREATE OR REPLACE FUNCTION `my-project.my-dataset.dedup_chars`(s STRING) AS (
(SELECT
  STRING_AGG(
    IF
      (c = SPLIT(s, '')[SAFE_OFFSET(off - 1)],
        NULL,
        c), '' ORDER BY off)
FROM
  UNNEST(SPLIT(s, '')) AS c
  WITH OFFSET off)
);

3_sql_based_udf

Javascript-based

On the other hand, when we want to create a function that cannot be found on BigQuery's functions but can be found in an external Javascript library, we should use Javascript-based UDFs.

In this tutorial, we will combine different Javascript libraries into a custom Javascript library and use it to create some Javascript-based UDFs.

Stemmers Implementation

Check out the complete code here.

If you want to just test or use stemming functionality in BigQuery, without implementing it, skip this section and go to the next - Stemmers Demo.

To implement stemmers to Bigquery, we will execute the following steps:

  1. Clone the repository
git clone https://github.com/justdataplease/bigquery-stemmers

Our repository has the following structure:

4_project_structure

  • package.json: Holds various metadata relevant to the project and handles the project's dependencies.
  • src folder: Include our javascript functions. Inside there is an index.js file where we define the functions (as well as their names), that we want to implement as UDF functions.
  • webpack.config.js: Bundles all our functions (defined in index.js) into a single file stemmers.js, using webpack package.
  • dist folder: Includes the webpack output, stemmers.js.

  • Install necessary packages defined in package.json.

npm install
  1. Run webpack.config.js to create our package using webpack.
npm run-script build
  1. Create a bucket in GCS. Make sure to change "yourproject" with your GCP project id and "yourbucket" with your desired bucket name (has to be universally unique).
gsutil mb -c nearline -l europe-west3 -p yourproject gs://yourbucket
  1. Copy webpack output or stemmers.js to GCS.
gsutil cp dist/stemmers.js gs://yourbucket
  1. Implement stemmers.

As an example, we will translate the following English sentence into Greek and Spanish.

**"Natural language processing is a subfield of linguistics computer science and artificial intelligence concerned with the interactions between computers and human language" **

-- Define corpus
DECLARE grString,enString,esString STRING;

SET grString = "Η επεξεργασία φυσικής γλώσσας είναι ένα υποπεδίο της γλωσσολογίας της επιστήμης των υπολογιστών και της τεχνητής νοημοσύνης που ασχολείται με τις αλληλεπιδράσεις μεταξύ των υπολογιστών και της ανθρώπινης γλώσσας";

SET enString = "Natural language processing is a subfield of linguistics computer science and artificial intelligence concerned with the interactions between computers and human language";

SET esString = "El procesamiento del lenguaje natural es un subcampo de la lingüística la informática y la inteligencia artificial que se ocupa de las interacciones entre las computadoras y el lenguaje humano";

-- Create greek stemmer
CREATE TEMP FUNCTION grStemmer(word STRING)
RETURNS STRING
LANGUAGE js
OPTIONS (
library=["gs://yourbucket/stemmers.js"]
)
AS r"""
return utils.greekStemmer(word);
""";

-- Create english stemmer
CREATE TEMP FUNCTION poStemmer(word STRING)
RETURNS STRING
LANGUAGE js
OPTIONS (
library=["gs://yourbucket/stemmers.js"]
)
AS r"""
return utils.porterStemmer(word);
""";

-- Create Spanish stemmer
CREATE TEMP FUNCTION esStemmer(word STRING)
RETURNS STRING
LANGUAGE js
OPTIONS (
library=["gs://yourbucket/stemmers.js"]
)
AS r"""
return utils.unineStemmer(word);
""";

-- Remove accents and convers to upper case - greek stemmer requirement
CREATE TEMP FUNCTION fix_word(word STRING) AS ((
SELECT UPPER(regexp_replace(normalize(word, NFD), r"\pM", ''))
));

-- Union all sentenses
WITH corpus AS (
  SELECT word, 'gr' lang FROM UNNEST(SPLIT(grString," ")) word 
  UNION ALL
  SELECT word, 'en' lang FROM UNNEST(SPLIT(enString," ")) word 
  UNION ALL
  SELECT word,'es' lang FROM UNNEST(SPLIT(esString," ")) word 
)

-- Run stemmers
SELECT 
lang,
string_agg(
CASE lang 
WHEN 'gr' THEN grStemmer(fix_word(word))
WHEN 'en' THEN poStemmer(word)
WHEN 'es' THEN esStemmer(word) 
ELSE 'missing' END," ") stemmed, 
string_agg(word," ") original
FROM corpus
GROUP BY 1

5_stemming_output

Stemmers Demo

You can use the publicly available functions to test or use stemmers' functionality in BigQuery.

The following example is for Europe-based data sets. To change location to United States-based datasets change “eu” to “us” (i.e justfunctions.eu.porterStemmer->justfunctions.us.porterStemmer)


-- Define corpus
DECLARE grString,enString,esString STRING;
SET grString = "Η επεξεργασία φυσικής γλώσσας είναι ένα υποπεδίο της γλωσσολογίας της επιστήμης των υπολογιστών και της τεχνητής νοημοσύνης που ασχολείται με τις αλληλεπιδράσεις μεταξύ των υπολογιστών και της ανθρώπινης γλώσσας";
SET enString = "Natural language processing is a subfield of linguistics computer science and artificial intelligence concerned with the interactions between computers and human language";
SET esString = "El procesamiento del lenguaje natural es un subcampo de la lingüística la informática y la inteligencia artificial que se ocupa de las interacciones entre las computadoras y el lenguaje humano";


-- Remove accents and convers to upper case - greek stemmer requirement
CREATE TEMP FUNCTION fix_word(word STRING) AS ((
SELECT UPPER(regexp_replace(normalize(word, NFD), r"\pM", ''))
));

-- Union all sentenses
WITH corpus AS (
  SELECT word, 'gr' lang FROM UNNEST(SPLIT(grString," ")) word 
  UNION ALL
  SELECT word, 'en' lang FROM UNNEST(SPLIT(enString," ")) word 
  UNION ALL
  SELECT word,'es' lang FROM UNNEST(SPLIT(esString," ")) word 
)

-- Run stemmers
SELECT 
lang,
string_agg(
CASE lang 
WHEN 'gr' THEN justfunctions.eu.grStemmer(fix_word(word))
WHEN 'en' THEN justfunctions.eu.porterStemmer(word)
WHEN 'es' THEN justfunctions.eu.esStemmer(word) 
ELSE 'missing' END," ") stemmed, 
string_agg(word," ") original
FROM corpus
GROUP BY 1

Conclusion

This article highlights the importance of modern databases and how BigQuery user-defined functions (UDFs) can be used to perform complex pre-processing tasks, such as stemming. We also demonstrated this by implementing some rule-based stemmers in three different languages.

References

[1] Stemming, Wikipedia.

[2] Singh, J., Gupta, V., A systematic review of text stemming techniques. Artif Intell Rev 48, 157–217 (2017), Springer.

[3] Plique, G., Talisman (2020), Journal Of Open Source Software (JOSS).

[4] Ntais, G., Development of a Stemmer for the Greek Language (2006),  Master Thesis - the Department of Computer and Systems Sciences at Stockholm University