Using Regular Expressions in SQL for Text Normalization

Shu-Yu Huang
3 min read6 days ago

--

Text normalization is a crucial step in data preprocessing, ensuring that text data is consistent, standardized, and ready for analysis. Regular expressions (regex) are a powerful tool for pattern matching and text manipulation, and they can be particularly useful when working with SQL databases. In this article, we’ll explore how to leverage regular expressions in SQL for effective text normalization.

Applying Regular Expressions in SQL

To use regular expressions in SQL, you need to have a database management system (DBMS) that supports regex functions. Many popular DBMS, such as Oracle, PostgreSQL, and SQL Server, provide built-in functions for working with regular expressions.Here are some common regex functions used in SQL:

  1. REGEXP_REPLACE: Replaces a substring that matches a regular expression pattern with a specified replacement string.
  • Example:
SELECT REGEXP_REPLACE(address, '\s+', ' ', 'g') AS normalized_address 
FROM customers;
  • This replaces all occurrences of one or more whitespace characters with a single space.

2. REGEXP_INSTR: Returns the position of a substring that matches a regular expression pattern.

  • Example:
SELECT 
REGEXP_INSTR(
phone_number,
'^\+\d{1,2}\s?$$\d{3}$$\s?\d{3}[-.]?\d{4}$'
) AS is_valid_phone
FROM contacts;
  • This checks if the phone number matches a specific pattern for international phone numbers.

3. REGEXP_SUBSTR: Returns a substring that matches a regular expression pattern.

  • Example:
SELECT REGEXP_SUBSTR(name, '(\w+)\s(\w+)') AS full_name
FROM employees;
  • This extracts the first and last name from a full name column.

Text Normalization Use Cases

Regular expressions can be used for various text normalization tasks, such as:

  1. Removing special characters or punctuation:
  • Example:
SELECT
REGEXP_REPLACE(
description,
'[^a-zA-Z0-9\s]',
'',
'g'
) AS clean_description
FROM products;
  • This removes all non-alphanumeric and non-whitespace characters from the product description.

2. Standardizing capitalization:

  • Example:
SELECT
UPPER(REGEXP_REPLACE(
title,
'(\w+)',
'\u\1',
'g'
)) AS title_case
FROM articles;
  • This converts the first letter of each word in the title to uppercase.

3. Extracting specific information:

  • Example:
SELECT
REGEXP_SUBSTR(
email,
'(\w+)@(\w+)\.(\w+)'
) AS email_parts
FROM users;
  • This extracts the username, domain, and top-level domain from email addresses.

4. Replacing abbreviations or misspellings:

  • Example:
SELECT
REGEXP_REPLACE(
description, 'dept|department', 'Department', 'i'
) AS normalized_description
FROM employees;
  • This replaces ‘dept’ and ‘department’ (case-insensitive) with ‘Department’ in the employee description.

Example: Normalizing Email Addresses

Let’s consider an example of using regular expressions in SQL to normalize email addresses. Suppose we have a table named users with a column named email that contains various email formats.

SELECT
REGEXP_REPLACE(
email,
'^(\w+)@(\w+)\.(\w+).*$',
'\1@\2.\3'
) AS normalized_email
FROM users;

In this example, the REGEXP_REPLACE function uses a regular expression pattern to match the email address and extract the username, domain, and top-level domain. The replacement string '\1@\2.\3' constructs the normalized email address using the extracted parts.The regular expression pattern '^(\w+)@(\w+)\.(\w+).*$' breaks down as follows:

  • ^ matches the start of the string
  • (\w+) matches and captures one or more word characters (letters, digits, or underscores) as the username
  • @ matches the literal @ symbol
  • (\w+) matches and captures the domain name
  • \. matches a literal period
  • (\w+) matches and captures the top-level domain
  • .* matches any remaining characters
  • $ matches the end of the string

By applying this regular expression, the function extracts the relevant parts of the email address and constructs a normalized version using the captured groups.

Conclusion

Regular expressions are a powerful tool for text normalization in SQL. By leveraging regex functions like REGEXP_REPLACE, REGEXP_INSTR, and REGEXP_SUBSTR, you can perform various text manipulation tasks to ensure data consistency and quality. Regular expressions provide a flexible and efficient way to handle complex text patterns and streamline your data preprocessing workflows.

--

--

Shu-Yu Huang

AI engineer in Taiwan AI Academy| Former process engineer in TSMC| Former Research Assistance in NYMU| Studying few-shot-learning and GAN