Why extend the functionality of a modern database instead of writing a short Python function?
In the era of cloud computing and serverless architectures, many organizations are shifting away from traditional, on-premises systems and moving their data and applications to the cloud. This allows them to take advantage of the many benefits of cloud computing, such as scalability, flexibility, and cost-efficiency.
One example is the extensive use of modern databases like Google Bigquery, Amazon Redshift, and Snowflake. A modern database cannot only store and process efficiently large volumes of data but also support a wide range of additional functionalities that go beyond traditional database capabilities. These functionalities include features such as data integration, data processing, data visualization, and even machine learning. Since all these different tasks can be streamlined, users can reduce the amount of time and effort they need to spend and instead focus on most important matters.
For example, we have gathered a large dataset of news articles from a variety of sources and we want to train an NLP model. This requires knowledge in several areas, including programming, machine learning frameworks, databases, and server administration. It is essential to know how to write scripts in a programming language such as Python to prepare the data, how to use a machine learning framework like TensorFlow or PyTorch to train the model, and how to use a NoSQL database like MongoDB to store and retrieve the data. Additionally, training a machine learning model on a large dataset requires a significant amount of computing power, and managing the infrastructure to support this can be a challenging task that requires expertise in server administration. On the other hand, performing the same steps in a modern database like BigQuery only requires knowledge of SQL, a widely-used language for working with data in relational databases.
In that way, the use of a modern database can greatly simplify the process of working with data, making it more accessible to a wider range of users.
Stemming is a normalization word technique and a very common preprocessing step in natural language processing (NLP) that is used to reduce words to their base form or stem. Stem refers to the morphological variant of a word's root form, which does not always have to be a valid root form. Stemming allows words with the same meaning to be treated as a single item, such as "connect", "connected", "connecting", "connection," and "connects" which all derive from the same base form "connect". This process reduces the number of unique words in the dataset which can significantly improve the efficiency and accuracy of text analysis algorithms.
Google Search adopted word stemming in 2003. Previously a search for “fish” would not have returned “fishing” or “fishes” [1].
Stemming algorithms typically use heuristics or rules-based approaches to identify and remove affixes (such as prefixes and suffixes) from words, resulting in a simplified word form. Although this approach can be very fast, it can also be error-prone so should be preferred when speed and efficiency are important considerations.
An error can be characterized as over-stemming or under-stemming. Over-stemming occurs when the algorithm reduces words to the same stems when it should reduce them to different stems. For example, based on Porter stemmer, "university" and "universe" are reduced to "univers" when should be reduced to two different stems since they do not have the same meaning.
Under-stemming occurs when the algorithm reduces words to different stems when it should reduce them to the same stem. For example, based on Porter stemmer, the words “data” and “datum.” are reduced to "data" and "datum" respectively when should be reduced to the same stem.
Stemming algorithms broadly fall into one of two categories:
A brute force approach is when words are mapped to stems, using table lookup. Another commonly known approach is to reduce words to their stem by applying a series of rules to strip affixes. Other processes take into account word morphology, including part of speech. These approaches require an extensive set of linguistic rules and thus are language-dependent. For that reason, most known rule-based stemmers are available for widely studied languages(English). Among the rule-based stemmers are Lovins, Dawson (Lovins based), Porter (recently Porter2), Paice/Husk (aka Lancaster) stemmers.
A statistical approach requires training a model from a lexicon, a corpus, or character-based n-grams of the language's words. Using that approach, affix-stripping rules are derived. Statistical stemmers are language-independent since they do not require any language-specific information. Among the statistical stemmers are N-Gram, Krovetz, and Xerox stemmers. Statistical approaches are also an open field of research since there is an interest in text searches over languages with few linguistic resources.
In this tutorial, we will be implementing some rule-based stemmers using parts of pre-existing implementations in Javascript:
A user-defined function (UDF) lets you create a function by using a SQL expression or JavaScript code. A UDF accepts columns of input, performs actions on the input, and returns the result of those actions as a value. Google Documentation
As we discussed earlier there are several reasons why we choose to use BigQuery UDFs (User-Defined Functions) instead of writing a Python program for a particular task.
Bigquery supports two types of UDFs, SQL-based and Javascript-based (Also supports remote functions UDFs but we will refer to this functionality in another article).
When we want to create a function that uses BigQuery's existing functionality we should use SQL UDFs. For example, we want to create a function that deduplicates letters from a word.
CREATE OR REPLACE FUNCTION `my-project.my-dataset.dedup_chars`(s STRING) AS (
(SELECT
STRING_AGG(
IF
(c = SPLIT(s, '')[SAFE_OFFSET(off - 1)],
NULL,
c), '' ORDER BY off)
FROM
UNNEST(SPLIT(s, '')) AS c
WITH OFFSET off)
);
On the other hand, when we want to create a function that cannot be found on BigQuery's functions but can be found in an external Javascript library, we should use Javascript-based UDFs.
In this tutorial, we will combine different Javascript libraries into a custom Javascript library and use it to create some Javascript-based UDFs.
Check out the complete code here.
If you want to just test or use stemming functionality in BigQuery, without implementing it, skip this section and go to the next - Stemmers Demo.
To implement stemmers to Bigquery, we will execute the following steps:
git clone https://github.com/justdataplease/bigquery-stemmers
Our repository has the following structure:
dist folder: Includes the webpack output, stemmers.js.
Install necessary packages defined in package.json.
npm install
npm run-script build
gsutil mb -c nearline -l europe-west3 -p yourproject gs://yourbucket
gsutil cp dist/stemmers.js gs://yourbucket
As an example, we will translate the following English sentence into Greek and Spanish.
**"Natural language processing is a subfield of linguistics computer science and artificial intelligence concerned with the interactions between computers and human language" **
-- Define corpus
DECLARE grString,enString,esString STRING;
SET grString = "Η επεξεργασία φυσικής γλώσσας είναι ένα υποπεδίο της γλωσσολογίας της επιστήμης των υπολογιστών και της τεχνητής νοημοσύνης που ασχολείται με τις αλληλεπιδράσεις μεταξύ των υπολογιστών και της ανθρώπινης γλώσσας";
SET enString = "Natural language processing is a subfield of linguistics computer science and artificial intelligence concerned with the interactions between computers and human language";
SET esString = "El procesamiento del lenguaje natural es un subcampo de la lingüística la informática y la inteligencia artificial que se ocupa de las interacciones entre las computadoras y el lenguaje humano";
-- Create greek stemmer
CREATE TEMP FUNCTION grStemmer(word STRING)
RETURNS STRING
LANGUAGE js
OPTIONS (
library=["gs://yourbucket/stemmers.js"]
)
AS r"""
return utils.greekStemmer(word);
""";
-- Create english stemmer
CREATE TEMP FUNCTION poStemmer(word STRING)
RETURNS STRING
LANGUAGE js
OPTIONS (
library=["gs://yourbucket/stemmers.js"]
)
AS r"""
return utils.porterStemmer(word);
""";
-- Create Spanish stemmer
CREATE TEMP FUNCTION esStemmer(word STRING)
RETURNS STRING
LANGUAGE js
OPTIONS (
library=["gs://yourbucket/stemmers.js"]
)
AS r"""
return utils.unineStemmer(word);
""";
-- Remove accents and convers to upper case - greek stemmer requirement
CREATE TEMP FUNCTION fix_word(word STRING) AS ((
SELECT UPPER(regexp_replace(normalize(word, NFD), r"\pM", ''))
));
-- Union all sentenses
WITH corpus AS (
SELECT word, 'gr' lang FROM UNNEST(SPLIT(grString," ")) word
UNION ALL
SELECT word, 'en' lang FROM UNNEST(SPLIT(enString," ")) word
UNION ALL
SELECT word,'es' lang FROM UNNEST(SPLIT(esString," ")) word
)
-- Run stemmers
SELECT
lang,
string_agg(
CASE lang
WHEN 'gr' THEN grStemmer(fix_word(word))
WHEN 'en' THEN poStemmer(word)
WHEN 'es' THEN esStemmer(word)
ELSE 'missing' END," ") stemmed,
string_agg(word," ") original
FROM corpus
GROUP BY 1
You can use the publicly available functions to test or use stemmers' functionality in BigQuery.
The following example is for Europe-based data sets. To change location to United States-based datasets change “eu” to “us” (i.e justfunctions.eu.porterStemmer->justfunctions.us.porterStemmer)
-- Define corpus
DECLARE grString,enString,esString STRING;
SET grString = "Η επεξεργασία φυσικής γλώσσας είναι ένα υποπεδίο της γλωσσολογίας της επιστήμης των υπολογιστών και της τεχνητής νοημοσύνης που ασχολείται με τις αλληλεπιδράσεις μεταξύ των υπολογιστών και της ανθρώπινης γλώσσας";
SET enString = "Natural language processing is a subfield of linguistics computer science and artificial intelligence concerned with the interactions between computers and human language";
SET esString = "El procesamiento del lenguaje natural es un subcampo de la lingüística la informática y la inteligencia artificial que se ocupa de las interacciones entre las computadoras y el lenguaje humano";
-- Remove accents and convers to upper case - greek stemmer requirement
CREATE TEMP FUNCTION fix_word(word STRING) AS ((
SELECT UPPER(regexp_replace(normalize(word, NFD), r"\pM", ''))
));
-- Union all sentenses
WITH corpus AS (
SELECT word, 'gr' lang FROM UNNEST(SPLIT(grString," ")) word
UNION ALL
SELECT word, 'en' lang FROM UNNEST(SPLIT(enString," ")) word
UNION ALL
SELECT word,'es' lang FROM UNNEST(SPLIT(esString," ")) word
)
-- Run stemmers
SELECT
lang,
string_agg(
CASE lang
WHEN 'gr' THEN justfunctions.eu.grStemmer(fix_word(word))
WHEN 'en' THEN justfunctions.eu.porterStemmer(word)
WHEN 'es' THEN justfunctions.eu.esStemmer(word)
ELSE 'missing' END," ") stemmed,
string_agg(word," ") original
FROM corpus
GROUP BY 1
This article highlights the importance of modern databases and how BigQuery user-defined functions (UDFs) can be used to perform complex pre-processing tasks, such as stemming. We also demonstrated this by implementing some rule-based stemmers in three different languages.
[1] Stemming, Wikipedia.
[2] Singh, J., Gupta, V., A systematic review of text stemming techniques. Artif Intell Rev 48, 157–217 (2017), Springer.
[3] Plique, G., Talisman (2020), Journal Of Open Source Software (JOSS).