bigquery

Word Tokenization

Remove spaces

Replace HTML tags

Replace URLs

Replace English Contractions

Remove English Stopwords

Remove Accents

Deduplicate word characters

Replace special characters

Perform ASCII Transliteration

Perform Stemming

Perform Language Translation

BigQuery World News

Conclusion

Natural Language Processing (NLP) is a subfield of computer science and AI that aims to enable computers to understand and process human language like text and speech, as people do.

NLP preprocessing is an essential step in NLP pipelines because raw text data is usually messy and unstructured. NLP preprocessing helps to clean, normalize, and structure the text data into a format that is suitable for NLP algorithms to process. This step is crucial for improving the performance of NLP models accuracy and ensuring the reliability of their results.

But why perform NLP preprocessing in BigQuery?

BigQuery’s serverless architecture makes it well-suited for NLP preprocessing tasks due to its ability to handle large amounts of text data and perform preprocessing at scale. This eliminates the need to worry about the performance of the processing pipeline, as BigQuery is designed to manage this automatically. Additionally, BigQuery provides User Defined Functions (UDFs) in Javascript or SQL, which can be used to easily implement custom processing logic.

In this tutorial we will go through the following preprocessing steps:

Word Tokenization
Remove spaces
Replace HTML tags
Replace URLs
Replace English contractions
Remove English stopwords
Remove Accents
Deduplicate word characters
Replace Special Characters
Perform ASCII Transliteration
Perform Stemming
(BONUS) Perform Language Translation

You can find all the above BigQuery UDFs in airtable or in github, documented and available for public use (justfunctions.eu. or justfunctions.us. based on your location of choice). You can also use the source code to create your own unique functions.

1_functions_documentation_airtable

So let’s start!

Word Tokenization

Word tokenization is the process of dividing text into individual words or tokens. We will make use of this function latter is some of the preprocessing steps.

For the following example our word tokenization symbol will be the whitespace “\s+”.

DECLARE txt STRING;

-- Example
SET txt = "this is a sentence";

-- Query
SELECT justfunctions.eu.word_tokens(txt,r"\s+")

-- Result
[ "this","is","a","sentence" ]

Remove spaces

remove_extra_spaces

Removing unnecessary spaces is important to improve text’s readability, consistency, and make it easier to parse for text processing purposes.

This function removes whitespaces as well as the following characters \n\t\r, if these are hardcoded.

remove_extra_whitespaces

This function removes only whitespaces.

DECLARE txt STRING;

-- Example
SET txt = "Hi     there\\n.";

-- Query
SELECT justfunctions.eu.remove_extra_whitespaces(txt)

-- Result
Hi there\\n.

Replace HTML tags

replace_html_tags

HTML tags are used to format and structure web content, but when the same content is displayed as text, these tags can interfere with the intended format and cause issues like incorrect line breaks, unreadable characters, and unexpected formatting.

This function removes html tags in the form of anything between.

DECLARE txt STRING;

-- Example
SET txt = "<div class=\'test\'>hello world<a href=\'#\'>hello world<\a><\\div>";

-- Query
SELECT justfunctions.eu.replace_html_tags(txt ," ")

-- Result
hello world hello world

Replace URLs

replace_urls

Removing URLs during NLP preprocessing is important because URLs do not contribute much to the meaning of the text.

DECLARE txt STRING;

-- Example
SET txt = "Google it https://www.google.com/ !";

-- Query
SELECT justfunctions.eu.replace_urls(txt,"")

-- Result
Google it !

Replace English Contractions

replace_en_contractions

Contractions are a combination of two words, where an apostrophe is used to represent the omitted letters. For example, “don’t” is a contraction of “do not”. Contractions are commonly used in written English for informal and conversational contexts. In NLP using this function can also reduce the number of features used to train our model, making it faster and more accurate. This function allows to replace the contraction with the full form. It supports basic contractions as “don’t”->”do not” as well as some slag contractions “thx”->”thank you”.

For this function to work we use a static dictionary and word tokenization. For the following example our word tokenization symbol will be whitespace “\s+”. Keep in mind that you can create your own function here to enrich the static dictionary.

DECLARE txt STRING;

-- Example
SET txt = "I'll be great tmr thx";

-- Query
SELECT justfunctions.eu.replace_en_contractions(txt,r"\s+")

-- Result
i will be great tomorrow thanks

Remove English Stopwords

remove_en_stopwords

Stopwords are common words in a language that are typically filtered out during NLP preprocessing because they do not carry much semantic meaning. Some examples of stopwords are “the”, “and”, “of”, “in”, “a”, “to”, etc. By removing stopwords, the dimensionality of the feature space is decreased, which helps in faster processing and storage.

For this function to work we use a static dictionary and RegExp expression (we could also use word tokenization as in the case of replace_en_contractions). For the following example use used the RegExp expression of word boundaries “\bword\b” to detect stopwords inside text. Keep in mind that you can create your own function here to enrich the static dictionary or change word tokenization method.

DECLARE txt STRING;

-- Example
SET txt = "The query in the database is returning the rows with the specified column values.";

-- Query
SELECT justfunctions.eu.remove_en_stopwords(txt)

-- Result
query   database  returning  rows   specified column values.

Remove Accents

remove_accents

Accented text refers to written words that contain diacritical marks, such as accents, tildes, or umlauts, to indicate pronunciation or stress in a language. These marks are used in many languages, such as French, Spanish, German or Greek, to differentiate between words that may be spelled similarly but have different meanings and pronunciations. Although accented text is important for communication, it can worsen the accuracy of our models. The process or removing accents is a method of text normalization.

DECLARE txt STRING;

-- Example
SET txt = "¿Dóndé Éstá Mí Ágúá?";

-- Query
SELECT justfunctions.eu.remove_accents(txt,r"\s+")

-- Result
¿Donde Esta Mi Agua?

dedup_chars

Deduplicate word characters

Another method of text normalization is character deduplication in words. This process is used to eliminate duplicate characters that may have been introduced due to typing errors, such as when individuals type quickly and accidentally repeat characters in a word. The presence of duplicated characters can lead to misspellings thus to incorrect representation of words causing resulting in words being treated as different features by NLP systems.

We should note this process can sometimes lead to incorrect spelling of words as the system may not be able to differentiate between intentional duplicated characters (such as in words with double letters) and unintentional duplications (such as typos). For example if we have the words hello, helllooo, heloo, helllo after applying the function we will get helo, helo, helo, helo.

DECLARE txt STRING;

-- Example
SET txt = "Helloooo!";

-- Query
SELECT justfunctions.eu.dedup_chars(txt)

-- Result
Helo!

Replace special characters

replace_special_chars

There can be many special characters in written text, such as punctuation marks, emoticons, mathematical symbols, currency symbols, etc. These special characters can appear frequently in texts, especially in social media posts, online comments, and chat conversations. They can have a significant impact on the outcome of NLP tasks, and therefore need to be handled properly during preprocessing.

Let us suppose that in our case special characters are not important. The following function tries to replace them using a static dictionary so it works for any Language. Keep in mind that you can create your own function here to enrich the static dictionary.

replace_special_chars_latin

This function performs the same task as replace_special_chars but without using a static dictionary. It uses a basic RegExp expression “[^a-zA-Z0-9\s]+” which means keep everything that is not a latin character or number. Using this function has the advantage of working well with Latin characters, but it does not perform well with languages that use Unicode characters, such as Greek, Arabic, or Chinese.

DECLARE txt STRING;

-- Example
SET txt = "爱=Love♥ & 幸福=Happiness▲!";

-- Query
SELECT justfunctions.eu.replace_special_chars(txt,"")

-- Result
爱Love 幸Happiness

replace_special_chars_lang

If we include some more RegExp rules in the above function, we can support Latin, Greek, Arabic and Cyrillic Languages:

- Latin [a-zA-Z]
- Greek [\u0370-\u03ff\u1f00-\u1fff]
- Chinese [\\p{Han}]
- Arabic [\\p{Arabic}]
- Cyrillic [\\p{Cyrillic}] (if we want Russian only we should use [\u0401\u0451\u0410-\u044f]

DECLARE txt STRING;

-- Example
SET txt = "$幸福=Happiness▲ Ζωή and Радость الحب!!!";

-- Query
SELECT justfunctions.eu.replace_special_chars_lang(txt,"")

-- Result
幸福Happiness Ζωή and Радость الحب

Perform ASCII Transliteration

ASCII transliteration is the conversion of text written in non-Latin characters into Latin characters. It is important in NLP because it helps ensure consistency in the representation of text written as it can reduce misspellings or accented characters.

Some examples of ASCII transliteration in NLP are:

Иван -> Ivan (Russian name)
Καλημέρα -> Kalimera (Greek accented word)
Καλιμερα -> Kalimera (Greek misspelled word)

DECLARE txt STRING;

-- Example
SET txt = "Καλημέρα καλιμερα καλειμερα καλως όρισες καλος ορησες καλος ορησαις";

-- Query
SELECT string_agg(DISTINCT(justfunctions.eu.transliterate_anyascii(word))," ") FROM unnest(justfunctions.eu.word_tokens(txt,r"\s+")) word

-- Result
kalimera kaleimera kalos orises orisais

We got 5 distinct tokens out of 9, of our original tokens.

Perform Stemming

Stemming is a normalization word technique and a very common preprocessing step in natural language processing (NLP) that is used to reduce words to their base form or stem. Stem refers to the morphological variant of a word’s root form, which does not always have to be a valid root form. Stemming allows words with the same meaning to be treated as a single item, such as “connect”, “connected”, “connecting”, “connection,” and “connects” which all derive from the same base form “connect”. This process reduces the number of unique words in the dataset which can significantly improve the efficiency and accuracy of text analysis algorithms.

For this example we will use Porter stemmer (also available stemmer_lancaster, stemmer_greek) and also make use of our word tokenizer.

For more details on the subject you can check the article I wrote here.

Perform Language Translation

Translation as a preprocessing step is important in NLP because it allows NLP models to processes multilingual data. Unfortunately, Translation cannot be done directly in NLP without the use of an external API like Google Translate or Azure Translator. However, Google BigQuery enables the implementation of external APIs as SQL functions using Remote Functions. If you want to find out how you can implement Azure Translator API and perform language translation using SQL functions, check the article I have written here.

BigQuery World News

Recently I discovered a very promising open source library called BigFunctions that provide access to several open source functions as well as a framework to test, deploy, document and monitor to your own BigQuery UDFs. It helped me a lot so if you find it useful too do not forget to give it a star!

Conclusion

This tutorial covered several BigQuery functions designed specifically for NLP preprocessing. As more functions related to text similarity, statistics, and geospatial analysis become available, they will be included in this list. Stay curious and keep an eye out for more updates on BigQuery functions in future articles.

2023-12-20

10 Advanced public BigQuery functions to elevate your game in NLP Preprocessing

bigquery

Table of Contents

Word Tokenization

Remove spaces

remove_extra_spaces

remove_extra_whitespaces

Replace HTML tags

replace_html_tags

Replace URLs

replace_urls

Replace English Contractions

replace_en_contractions

Remove English Stopwords

remove_en_stopwords

Remove Accents

remove_accents

dedup_chars

Deduplicate word characters

Replace special characters

replace_special_chars

replace_special_chars_latin

replace_special_chars_lang

Perform ASCII Transliteration

Perform Stemming

Perform Language Translation

BigQuery World News

Conclusion

Related

How to retrieve Project ID, Dataset ID, and Location ID in Google BigQuery

JustSQL (BigQuery Copilot)

JustFunctions for BigQuery

Extend BigQuery NLP armory with Stemmers

Contact

Company

Useful Links

Newsletter