How to Use RegEx for Data Extraction

Regex, short for REGular EXpression, is a very powerful tool. Though often overlooked, it is essential if you want to match patterns or extract useful information from large texts efficiently.

Regex tools are used across many different industries. They are used heavily for searching for or extracting characters (or words), lists of characters (or words), or complex patterns that are hard to express compactly without regular expressions.

A short history of Regex
What are regular expressions?
Practical use cases of Regex

Regular expressions are used for different tasks such as search engine optimisation, web analytics, digital marketing, or even human resource management. We will cover some popular use cases of Regex in the latter half, but first we will look at what exactly regular expressions are and how they are written.

A short history of Regex

Regular expressions are pretty old, to be honest. The concept was introduced by Stephen Cole Kleene in 1951. In 1968, Ken Thompson used it for pattern matching in the 'QED' text editor. He would later port this to the 'ed' text editor as well. The warm welcome from the user base and the UNIX community led to the integration of Regex into many other UNIX tools such as grep, lex, AWK, and sed.

Today, many programming languages support Regex out of the box. The standard libraries of Java and Python, for example, both include it.

What are regular expressions?

A regular expression is essentially a pattern matching system. As mentioned, it lets you search for and match character combinations, usually in long pieces of text. Regex (often also written as Regexp) is more powerful than another common pattern matching tool, wildcards.

Wildcards and Regex are very common in operating systems like Linux. Most Linux users would be familiar with the notation '*.txt' (without the quotation marks). It simply means 'all files that end with .txt' (in other words, have the '.txt' extension). So, we can delete all the '.txt' files (or do something else with them) without even breaking a sweat. The best part is that you do not even need to know the file names to do something with them. How cool is that?
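To make the comparison concrete, here is a minimal Python sketch that applies the '*.txt' wildcard and a roughly equivalent regular expression to the same list of names. The file names are made up for illustration, and the regular expression shown is just one possible way of writing the same rule.

```python
import fnmatch
import re

# Hypothetical file names, used only for illustration.
filenames = ["notes.txt", "report.pdf", "todo.txt", "txt_backup.zip"]

# Wildcard match: '*' stands for any run of characters.
wildcard_matches = fnmatch.filter(filenames, "*.txt")

# A roughly equivalent regular expression: any characters,
# then a literal dot (escaped as '\.'), then 'txt' at the very end.
txt_regex = re.compile(r".*\.txt")
regex_matches = [name for name in filenames if txt_regex.fullmatch(name)]

print(wildcard_matches)
print(regex_matches)
```

Both lists contain only 'notes.txt' and 'todo.txt'. The regular expression is longer here, but unlike the wildcard it can be extended with character classes, repetition counts, and alternatives.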

But that just scratches the surface. That was possibly the most basic usage of a wildcard. Regular expressions can be used to search for very complex patterns as well. Consider the following example.

Say you want to make a list of all the email addresses present in a piece of text. One easy solution is obviously to search for the '@' symbol, but that only tells you where an address might be. The Regex way of doing it would be to use the expression:

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b

With case-insensitive matching enabled (the character classes list only upper-case letters), this Regex captures most commonly formatted email addresses. Simply test it out.

These days Regex is part of most general-purpose programming languages by default. Even where a language does not support it out of the box, you can find a library without much effort. That being said, many professionals are still either hesitant about it or unaware of it, and end up missing out on this immensely powerful tool.
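As a rough sketch of how the email pattern could be tried out in Python, the snippet below compiles it with the re.IGNORECASE flag, since the pattern itself only lists upper-case letters. The function name extract_emails and the sample text are made up for this example.

```python
import re

# The pattern from the text. re.IGNORECASE is needed because the
# character classes only contain upper-case letters.
EMAIL_PATTERN = re.compile(
    r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b",
    re.IGNORECASE,
)

def extract_emails(text):
    """Return every substring of text that matches EMAIL_PATTERN, in order."""
    return EMAIL_PATTERN.findall(text)

# Made-up sample text; the addresses are placeholders.
sample = "Write to alice@example.com or Bob.Smith@mail.example.org, not to bob@localhost."
print(extract_emails(sample))
```

The last address is skipped because the pattern requires a dot followed by at least two letters after the '@' part. Real-world address rules are broader than this pattern, so treat it as a practical filter rather than a full validator.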

Practical use cases of Regex

Regular expressions find lots of practical applications, including pattern recognition, metadata extraction, and spell checking. Here we elaborate on some of these use cases and show the power of regular expressions.

Detect typos

Checking for spelling mistakes is an integral part of being a web publisher. Misspellings often mean a lower conversion rate, as you lose traffic because of wrongly typed keywords.

But with the help of regular expressions, you can check for commonly misspelt words, and in many ways automate the detection and replacement process. That being said, you need to know these misspelt words beforehand. But the good thing is that the list of commonly misspelt words is much smaller than the entire dictionary.

The set of words (or keywords, to be even more specific) becomes even smaller when we are talking about niche sectors. Add to that the language-specific constraints and you are left with only a handful of words.

Say you deal in digital gadgets, mostly phones. Some of the misspelt words that you should aim to include are 'celphone', 'lolipop', 'androide' etc. The following regular expression would detect these misspelt words. Note that there are no spaces around the '|', since a space inside the pattern would be matched literally:

(?i)celphone|lolipop|androide

The bigger the collection of these words, the higher the chance that you will detect the typos.
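The detect-and-replace idea can be sketched in Python as follows. The CORRECTIONS table, the function names, and the sample sentence are all hypothetical; in practice you would fill the table with the misspellings that matter for your own niche.

```python
import re

# Hypothetical table of misspelling -> correction for a phone-focused site.
CORRECTIONS = {"celphone": "cellphone", "androide": "android"}

# One case-insensitive alternation built from the table keys.
# \b restricts matches to whole words.
TYPO_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, CORRECTIONS)) + r")\b",
    re.IGNORECASE,
)

def find_typos(text):
    """Return the misspelt words found in text, as they were written."""
    return [m.group(0) for m in TYPO_PATTERN.finditer(text)]

def fix_typos(text):
    """Replace each known misspelling with its lower-case correction."""
    return TYPO_PATTERN.sub(lambda m: CORRECTIONS[m.group(0).lower()], text)

sample = "The best Androide celphone of the year"
print(find_typos(sample))
print(fix_typos(sample))
```

This sketch always inserts the lower-case correction, so the original capitalisation of a word such as 'Androide' is not preserved. That is a simplification you may want to revisit for headings or product names.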

Detect incorrect capitalisation

Strictly speaking, wrong capitalisation does not affect the traffic of a website as much as the earlier typos. That being said, wrong capitalisation leads to a poor viewing and reading experience. So, it indirectly affects the reputation of your online presence.

Another reason to keep an eye on proper capitalisation is to target specific product names that are case sensitive. For example, you might notice the usage 'IBM DB2' even today while the 'proper' spelling is 'IBM Db2' (notice the lowercase 'b').

In any case, as the examples suggest, care often needs to be taken to have proper capitalisation in your text. Regular expressions come to the rescue here. Say you want to have the term 'AB+' in your text, not 'ab+' or some other wrong variation. You can use the following Regex to detect the wrongly capitalised variants. The '+' is escaped as '\+' because an unescaped '+' is a quantifier meaning 'one or more of the preceding item', not a literal plus sign:

(ab|aB|Ab)\+
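A slightly different way to get the same result in Python is sketched below: instead of listing every wrong variant, it matches the term in any letter case and then discards the correctly capitalised form. The '+' is written as '\+' so that it is treated as a literal character. The function name and the sample sentence are assumptions made for this example.

```python
import re

# Match 'ab+' in any letter case; the '+' is escaped so it is literal.
ANY_CASE = re.compile(r"ab\+", re.IGNORECASE)
CORRECT = "AB+"

def find_wrong_capitalisation(text):
    """Return every occurrence of the term that is not written as 'AB+'."""
    return [m.group(0) for m in ANY_CASE.finditer(text) if m.group(0) != CORRECT]

sample = "Donors with ab+, Ab+ or AB+ blood"
print(find_wrong_capitalisation(sample))
```

The advantage of this variant is that it scales to longer terms, where writing out every wrong combination of upper and lower case by hand quickly becomes impractical.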

Detect missing trademark/copyright mark

Using the trademark/copyright symbol in the right places is very important. The trademark symbol is used when you use some trademarked word or phrase. Similarly, you should make sure that you are using the copyright symbol for an intellectual property for which the original author has the copyright.

Often you will see texts omitting these symbols. In most cases, the omission is harmless. But if things get serious, you might be accused of using, reproducing, or stealing someone else's work without permission. In the worst-case scenario, the owner can even file a lawsuit against you.

So, the best bet is to try and include the trademark/copyright information and credit the owner/author whenever needed. Using Regex, it becomes as simple as using the following patterns:

ABC®

ABC(?!®)

The first regular expression simply makes sure that the word 'ABC' is followed by the symbol '®' (without the quotes). The second one, on the other hand, detects the occurrences where the term is not immediately followed by the '®' symbol.
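The second pattern can be tried out in Python as in the sketch below. 'ABC' is the placeholder term from the text, while the function names and the sample sentence are made up for illustration.

```python
import re

# (?!®) is a negative lookahead: it succeeds only when the next
# character is NOT '®', and it consumes no characters itself.
MISSING_MARK = re.compile(r"ABC(?!®)")

def find_missing_marks(text):
    """Return the start index of every 'ABC' not followed by '®'."""
    return [m.start() for m in MISSING_MARK.finditer(text)]

def add_missing_marks(text):
    """Append '®' to every 'ABC' that lacks it."""
    return MISSING_MARK.sub("ABC®", text)

sample = "ABC® makes phones. ABC also makes tablets."
print(find_missing_marks(sample))
print(add_missing_marks(sample))
```

Because the lookahead does not consume anything, the replacement only has to re-insert the term together with the symbol. Note that this simple pattern would also match 'ABC' inside a longer word such as 'ABCD'; adding word boundaries (\b) is one way to tighten it.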

Extract metadata

Web analytics is an important part of understanding and optimising web usage. You need lots of data for this task though. Crawling a website is a good option for generating the data. You simply need to fire up an automated crawler bot and let it do the grunt work of collecting the bulk information.

Say you want to crawl a website and make a list of all the articles that fulfil a certain condition. You can get a list of related metadata by going to the page source. More often than not, the source would be very obfuscated and make next to no sense.

So your task is to extract the interesting information from this heap. Regex is exactly what you need at this point. Say you are interested in making a list of all the posts on ByteScout that are doing a comparative analysis of different tools or platforms. The following regular expression run on the page source would highlight the 'articleSection' part along with the associated values (with the double quotes):

"articleSection":"(.*?)"

This way you can easily make a list of related and important articles, which comes in handy in web analytics.
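Here is a minimal Python sketch of that extraction step. The page_source string is a made-up fragment, not real markup from any site, and the function name is hypothetical. The pattern also assumes there is no whitespace between the colon and the value.

```python
import re

# The pattern from the text. (.*?) is a lazy group, so it stops at the
# first closing double quote instead of running to the last one.
SECTION_PATTERN = re.compile(r'"articleSection":"(.*?)"')

def extract_sections(source):
    """Return the value of every "articleSection" entry found in source."""
    return SECTION_PATTERN.findall(source)

# Made-up page source fragment, used only for illustration.
page_source = (
    '{"@type":"Article","headline":"Tool A vs Tool B",'
    '"articleSection":"Comparisons","wordCount":"1200"}'
)
print(extract_sections(page_source))
```

Since the pattern contains one capturing group, findall returns only the captured values, without the surrounding key and quotes. If the metadata turns out to be well-formed JSON, a JSON parser is likely to be more robust than a regular expression.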
