How to Find Word Frequencies using Regex
In this article, you will see how to find the frequencies of different words within a C# string using regular expressions. You will first see how to count the number of words in a string, and then how to find the frequency of each word within the input string. So, let’s begin without further ado.
Counting All Words
To count all the words in a string, you can use the Regex.Matches() method from the System.Text.RegularExpressions namespace.
The Matches() method accepts the input string and a regular expression pattern as the first and second parameters, respectively. The pattern “\w+” matches every run of one or more word characters (letters, digits, and underscores), so it returns all the words in a string, including numbers such as “32”.
The Matches() method returns a collection of Match objects where each object contains information about one of the matched words. To find the total number of words, you can use the Count property of the collection returned by the Matches() method.
Furthermore, you can use the Value property of each Match object to get the text of the matched word. Similarly, the Index property gives the zero-based position in the input string where the word starts.
The following script prints the total count of all the words in the input string along with the text and index for each word.
    using System;
    using System.Text.RegularExpressions;

    namespace RegexCodes
    {
        class Program
        {
            static void Main(string[] args)
            {
                string input = "Your name is [John] and your age is (32). You are {married}";
                string regex = @"\w+"; // one or more word characters

                // Find every word in the input string.
                var result = Regex.Matches(input, regex);

                if (result.Count > 0)
                {
                    Console.WriteLine("Total Number of Words found: " + result.Count);

                    // Print the text and the zero-based start index of each word.
                    foreach (Match m in result)
                    {
                        Console.WriteLine("Word '" + m.Value + "' found at index: " + m.Index);
                    }
                }

                Console.ReadLine();
            }
        }
    }
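Output:

    Total Number of Words found: 12
    Word 'Your' found at index: 0
    Word 'name' found at index: 5
    Word 'is' found at index: 10
    Word 'John' found at index: 14
    Word 'and' found at index: 20
    Word 'your' found at index: 24
    Word 'age' found at index: 29
    Word 'is' found at index: 33
    Word '32' found at index: 37
    Word 'You' found at index: 42
    Word 'are' found at index: 46
    Word 'married' found at index: 51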
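Because \w also matches digits and underscores, the script above counts “32” as a word. If you only want words made of letters, one option is the Unicode letter class \p{L}. The following is a minimal sketch of that idea; the class name LetterWords and the helper Extract are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace RegexCodes
{
    public static class LetterWords
    {
        // Hypothetical helper: returns every run of letters found in the input.
        public static List<string> Extract(string input)
        {
            var words = new List<string>();

            // \p{L}+ matches one or more Unicode letters, so "32" is skipped.
            foreach (Match m in Regex.Matches(input, @"\p{L}+"))
            {
                words.Add(m.Value);
            }

            return words;
        }

        public static void Main()
        {
            string input = "Your name is [John] and your age is (32). You are {married}";
            List<string> words = Extract(input);

            Console.WriteLine("Words made of letters only: " + words.Count);
            Console.WriteLine(string.Join(", ", words));
        }
    }
}
```

Note that \p{L}+ matches runs of letters, so a mixed token such as “abc123” would yield “abc” rather than being skipped entirely.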
Finding Frequency of Each Word
You can also find the frequency of occurrence of each word. To do so, you can again use the Matches() method, which matches all the words in the input string. The regex pattern remains the same, i.e. “\w+”.
Next, you can create a C# Dictionary collection with key as string type and value as integer type. The keys of this dictionary will store word texts while values will correspond to the frequency of occurrences of words.
After that, you can iterate through the collection of Match objects returned by the Matches() method. Convert each word to lowercase first so that “Two” and “two” are counted as the same word. If the word doesn’t already exist in the dictionary, add it as a key with a value of 1. Otherwise, increment the value stored for that key by 1.
The following script returns frequencies of occurrences of all the words in an input string.
    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    namespace RegexCodes
    {
        class Program
        {
            static void Main(string[] args)
            {
                string input = "Two bananas and two apples. Three bananas, and two mangoes.";
                string regex = @"\w+";

                var result = Regex.Matches(input, regex);

                // Maps each lowercased word to the number of times it occurs.
                Dictionary<string, int> words = new Dictionary<string, int>();

                if (result.Count > 0)
                {
                    Console.WriteLine("Total Number of Words found: " + result.Count);

                    foreach (Match m in result)
                    {
                        // Lowercase once so "Two" and "two" share one key.
                        string word = m.Value.ToLower();

                        if (!words.ContainsKey(word))
                            words.Add(word, 1);
                        else
                            words[word]++;
                    }

                    foreach (var item in words)
                    {
                        Console.WriteLine(item.Key + " " + item.Value);
                    }
                }

                Console.ReadLine();
            }
        }
    }
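Output (the enumeration order of a Dictionary is not guaranteed, so the word lines may appear in a different order):

    Total Number of Words found: 10
    two 3
    bananas 2
    and 2
    apples 1
    three 1
    mangoes 1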
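The check-then-add logic described in this section can also be written with a single lookup per word using Dictionary.TryGetValue. The following is a minimal sketch of that variation, not a replacement for the script above; the class name WordCounter and the helper Count are hypothetical, and the inline out variable assumes C# 7 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

namespace RegexCodes
{
    public static class WordCounter
    {
        // Hypothetical helper: returns a word-to-frequency map for the input.
        public static Dictionary<string, int> Count(string input)
        {
            var counts = new Dictionary<string, int>();

            foreach (Match m in Regex.Matches(input, @"\w+"))
            {
                string key = m.Value.ToLowerInvariant();

                // TryGetValue leaves current at 0 when the key is missing,
                // so new and existing words are handled by the same two lines.
                counts.TryGetValue(key, out int current);
                counts[key] = current + 1;
            }

            return counts;
        }

        public static void Main()
        {
            string input = "Two bananas and two apples. Three bananas, and two mangoes.";

            foreach (var item in Count(input))
            {
                Console.WriteLine(item.Key + " " + item.Value);
            }
        }
    }
}
```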
Finding Frequency of Specific Words
The regex expression “\w+” returns all the words from the input string. You can also find frequencies of specific words within a string by filtering the words using regex expressions.
For instance, in the script below, the regex expression used is “\w*s\b” which matches all the words that end with an “s”.
The rest of the process is similar to what you saw in the previous section. You can create a dictionary where keys correspond to word texts, while values correspond to frequencies of occurrences for words.
    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    namespace RegexCodes
    {
        class Program
        {
            static void Main(string[] args)
            {
                string input = "Two bananas and two apples. Three bananas, and two mangoes.";
                string regex = @"\w*s\b"; // words that end with "s"

                var result = Regex.Matches(input, regex);

                // Maps each lowercased word to the number of times it occurs.
                Dictionary<string, int> words = new Dictionary<string, int>();

                if (result.Count > 0)
                {
                    Console.WriteLine("Total Number of Words found: " + result.Count);

                    foreach (Match m in result)
                    {
                        string word = m.Value.ToLower();

                        if (!words.ContainsKey(word))
                            words.Add(word, 1);
                        else
                            words[word]++;
                    }

                    foreach (var item in words)
                    {
                        Console.WriteLine(item.Key + " " + item.Value);
                    }
                }

                Console.ReadLine();
            }
        }
    }
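Output (the enumeration order of a Dictionary is not guaranteed, so the word lines may appear in a different order):

    Total Number of Words found: 4
    bananas 2
    apples 1
    mangoes 1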
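The same filtering idea extends to a fixed list of words rather than a suffix. The following is a minimal sketch, assuming the targets are plain words and the list is non-empty; the class name SpecificWordCounter and the helper Count are hypothetical. It builds an alternation such as \b(?:two|bananas)\b and matches it case-insensitively.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

namespace RegexCodes
{
    public static class SpecificWordCounter
    {
        // Hypothetical helper: counts only the words listed in targets.
        // Assumes targets is non-empty and contains plain words.
        public static Dictionary<string, int> Count(string input, string[] targets)
        {
            // Escape each target so any regex metacharacters are matched literally.
            string alternation = string.Join("|", targets.Select(t => Regex.Escape(t)));

            // \b on both sides keeps "two" from matching inside a longer word.
            string pattern = @"\b(?:" + alternation + @")\b";

            var counts = new Dictionary<string, int>();

            foreach (Match m in Regex.Matches(input, pattern, RegexOptions.IgnoreCase))
            {
                string key = m.Value.ToLowerInvariant();
                counts.TryGetValue(key, out int current);
                counts[key] = current + 1;
            }

            return counts;
        }

        public static void Main()
        {
            string input = "Two bananas and two apples. Three bananas, and two mangoes.";
            var counts = Count(input, new[] { "two", "bananas" });

            foreach (var item in counts)
            {
                Console.WriteLine(item.Key + " " + item.Value);
            }
        }
    }
}
```

For the sample sentence, this counts “two” three times and “bananas” twice, and ignores every other word.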
Finding Frequencies using Split Method
In addition to using the Matches() method, you can also use the Regex.Split() method to count the frequencies of all the words in a string. The Split() method splits a string at every match of a regular expression and returns the resulting substrings as an array.
To count frequencies of all the words using the Split() approach, you can use the pattern “\s+”, which splits a string on runs of whitespace. Since punctuation would otherwise stay attached to the words (for example “apples.”), the script below first removes every character that is neither a word character nor whitespace, using the pattern “[^\w\s]+”.
Next, using the returned array of strings, you can create a dictionary which contains words and their corresponding frequencies, as you saw in the previous section.
Here is an example of how to use the Split() approach to find word frequencies within an input string.
    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    namespace RegexCodes
    {
        class Program
        {
            static void Main(string[] args)
            {
                string input = "Two bananas and two apples. Three bananas, and two mangoes.";
                string regex_clean = @"[^\w\s]+"; // punctuation and other symbols
                string regex = @"\s+";            // runs of whitespace

                // Strip punctuation first, then split on whitespace.
                string clean = Regex.Replace(input, regex_clean, "");
                var result = Regex.Split(clean, regex);

                // Maps each lowercased word to the number of times it occurs.
                Dictionary<string, int> words = new Dictionary<string, int>();

                if (result.Length > 0)
                {
                    Console.WriteLine("Total Number of Words found: " + result.Length);

                    foreach (string s in result)
                    {
                        string word = s.ToLower();

                        if (!words.ContainsKey(word))
                            words.Add(word, 1);
                        else
                            words[word]++;
                    }

                    foreach (var item in words)
                    {
                        Console.WriteLine(item.Key + " " + item.Value);
                    }
                }

                Console.ReadLine();
            }
        }
    }
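Output (the enumeration order of a Dictionary is not guaranteed, so the word lines may appear in a different order):

    Total Number of Words found: 10
    two 3
    bananas 2
    and 2
    apples 1
    three 1
    mangoes 1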
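One thing to watch for with the Split() approach: when the input starts or ends with whitespace, Regex.Split() returns empty strings at the corresponding end of the array, and these would be counted as words. The following is a minimal sketch of one way to filter them out; the class name SplitWords and the helper Split are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

namespace RegexCodes
{
    public static class SplitWords
    {
        // Hypothetical helper: strips punctuation, splits on whitespace,
        // and drops the empty entries produced by leading or trailing whitespace.
        public static string[] Split(string input)
        {
            string clean = Regex.Replace(input, @"[^\w\s]+", "");

            return Regex.Split(clean, @"\s+")
                        .Where(s => s.Length > 0)
                        .ToArray();
        }

        public static void Main()
        {
            // Note the leading and trailing spaces in the input.
            string input = "  Two bananas, and two apples.  ";
            string[] words = Split(input);

            Console.WriteLine("Total Number of Words found: " + words.Length);
            Console.WriteLine(string.Join(", ", words));
        }
    }
}
```

Keep in mind that removing punctuation before splitting also merges words such as “don't” into “dont”, so the Matches() approach may be a better fit when the text contains apostrophes or hyphens.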