How to Find HTML Tags using Regex
This article shows how to extract HTML tags and the content within them using regular expressions (regex) in C#.
Extracting HTML tags from strings is useful when parsing web pages. With regex, you can find the HTML tags, the content within the tags, or both. This article explains these three use cases.
Finding HTML Tags Only
You can use the Matches() method of the Regex class to find all the HTML tags within a string, with the regular expression "<.*?>". This regular expression matches a less-than sign (<), followed by as few characters as possible, up to the next greater-than sign (>), so each match is a single opening or closing tag.
Matches() returns a MatchCollection. If the string contains at least one tag, the Count property of this collection is greater than zero. You can then iterate through the Match objects in the collection and read each matched string from its Value property.
Here is an example. In the script below, the Matches() method finds the opening and closing bold <b> and paragraph <p> tags.
Note: The scripts in this article need a using directive for the System.Text.RegularExpressions namespace.
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string input = "This written in <b>bold fonts</b>. This is simple font <b>again bold fonts</b>. " +
                       "This is <p>paragraph</p>";

        // Lazily match everything from a '<' up to the next '>'.
        string regex = @"<.*?>";
        var matches = Regex.Matches(input, regex);

        if (matches.Count > 0)
        {
            Console.WriteLine("Match found:");
            foreach (Match m in matches)
            {
                Console.WriteLine(m.Value);
            }
        }

        Console.ReadLine();
    }
}
Output:
Match found:
<b>
</b>
<b>
</b>
<p>
</p>
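The question mark in "<.*?>" is what keeps each match down to a single tag. The sketch below compares that lazy pattern with its greedy counterpart "<.*>" on a short string. The QuantifierDemo class and its FindAll method are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only for this illustration.
static class QuantifierDemo
{
    // Returns every substring of input that matches pattern.
    public static string[] FindAll(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}

class Program
{
    static void Main(string[] args)
    {
        string input = "This is <b>bold</b> text";

        // Lazy: stops at the first '>' after each '<'.
        foreach (string tag in QuantifierDemo.FindAll(input, @"<.*?>"))
        {
            Console.WriteLine("Lazy:   " + tag);
        }

        // Greedy: runs on to the last '>' in the string.
        foreach (string tag in QuantifierDemo.FindAll(input, @"<.*>"))
        {
            Console.WriteLine("Greedy: " + tag);
        }
    }
}
```

The lazy pattern prints <b> and </b> as two separate matches, while the greedy pattern prints the single match <b>bold</b>, because ".*" consumes as much as it can before the final ">".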
Finding HTML Tags Including Content
You can also find HTML tags together with the content within them using the Match() and Matches() methods. The Match() method returns only the first occurrence.
Let's see an example. If you want to find the bold <b> tag and the content within this tag, you can use the regular expression "<b>\s*(.+?)\s*</b>". This regular expression matches the opening bold <b> tag, everything up to the nearest closing </b> tag, and the closing tag itself.
If a match is found, the Success property of the returned Match object is true. In that case, you can access the matched string via the Value property. Here is a sample script:
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string input = "This written in <b>bold fonts</b>. This is simple font";

        // Opening <b>, lazily captured content, closing </b>.
        string regex = @"<b>\s*(.+?)\s*</b>";
        var match = Regex.Match(input, regex);

        if (match.Success)
        {
            Console.WriteLine("Match found");
            Console.WriteLine(match.Value);
        }

        Console.ReadLine();
    }
}
Output:
Match found
<b>bold fonts</b>
If you want to find multiple tags within a string, you can use the Matches() method, which returns a collection of Match objects. You can then read each matched string from the Value property of each Match object.
The script below searches for all the bold <b> tags within the input string.
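using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string input = "This written in <b>bold fonts</b>. This is simple font <b>again bold fonts</b>";

        // No leading space in the pattern, so a tag at the very start of the string also matches.
        string regex = @"<b>\s*(.+?)\s*</b>";
        var matches = Regex.Matches(input, regex);

        if (matches.Count > 0)
        {
            Console.WriteLine("Match found:");
            foreach (Match m in matches)
            {
                Console.WriteLine(m.Value);
            }
        }

        Console.ReadLine();
    }
}
Output:
Match found:
<b>bold fonts</b>
<b>again bold fonts</b>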
In the output above, you can see that the tags along with the content are found.
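The pattern used so far only matches a bare, lowercase <b> tag on a single line. Markup found on real pages often carries attributes, uppercase tag names, or line breaks inside the tag. The following is a minimal sketch of one way to loosen the pattern, assuming those three variations are the ones that matter. The BoldFinder class and FindBoldElements method are hypothetical names, not part of the original scripts.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only for this illustration.
static class BoldFinder
{
    // <b\b      : "<b" as a whole tag name, so <br> and <body> are skipped
    // [^>]*>    : optional attributes, then the end of the opening tag
    // </b\s*>   : closing tag, tolerating whitespace before '>'
    private const string Pattern = @"<b\b[^>]*>\s*(.+?)\s*</b\s*>";

    public static List<string> FindBoldElements(string html)
    {
        var results = new List<string>();
        // IgnoreCase accepts <B>; Singleline lets '.' match line breaks.
        var options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        foreach (Match m in Regex.Matches(html, Pattern, options))
        {
            results.Add(m.Value);
        }
        return results;
    }
}

class Program
{
    static void Main(string[] args)
    {
        string input = "<B class=\"note\">first</B>, a line break <br> and <b>\nsecond\n</b>";
        foreach (string element in BoldFinder.FindBoldElements(input))
        {
            Console.WriteLine(element);
        }
    }
}
```

This sketch finds the two bold elements and ignores the <br> tag. It still does not handle nested <b> tags or a ">" character inside an attribute value, which is where a dedicated HTML parser becomes the safer choice.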
Finding Content within HTML Tags
You can also find only the content within HTML tags. To do so, you can use the Match() method with the same regular expression as before, "<b>\s*(.+?)\s*</b>". The parentheses in this expression form a capturing group that holds whatever occurs between the opening and closing bold tags.
The complete match, tags included, is stored at the first index (index 0) of the Groups collection, which is a property of the Match object. The content alone is stored at the second index (index 1).
Look at the script below for an example:
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string input = "This written in <b>bold fonts</b>. This is simple font";
        string regex = @"<b>\s*(.+?)\s*</b>";
        var match = Regex.Match(input, regex);

        if (match.Success)
        {
            Console.WriteLine("Match found");
            // Groups[1] is the first capturing group: the content without the tags.
            Console.WriteLine(match.Groups[1].Value);
        }

        Console.ReadLine();
    }
}
Output:
Match found
bold fonts
In the output of the above script, you can see only the content from the HTML tag printed on the console.
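To make the difference between the group indexes concrete, the sketch below returns both values side by side. It also gives the capturing group a name, which is an optional variation on the pattern above rather than something the original script does. GroupDemo and Split are hypothetical names introduced for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only for this illustration.
static class GroupDemo
{
    // Returns { whole match, captured content }, or null when nothing matches.
    public static string[] Split(string input)
    {
        // (?<content>...) is a named capturing group; it is also reachable as Groups[1].
        Match match = Regex.Match(input, @"<b>\s*(?<content>.+?)\s*</b>");
        if (!match.Success)
        {
            return null;
        }
        return new[] { match.Groups[0].Value, match.Groups["content"].Value };
    }
}

class Program
{
    static void Main(string[] args)
    {
        string[] parts = GroupDemo.Split("This written in <b>bold fonts</b>. This is simple font");
        if (parts != null)
        {
            Console.WriteLine("Groups[0]:         " + parts[0]);
            Console.WriteLine("Groups[\"content\"]: " + parts[1]);
        }
    }
}
```

Groups[0] yields <b>bold fonts</b> and the named group yields bold fonts. A name can be easier to read than a numeric index once a pattern contains several groups.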
Finally, you can find content from multiple HTML tags. To do so, you can use the Matches() method with the same regular expression that you saw in the previous script. Here is an example of how to do that.
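using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string input = "This written in <b>bold fonts</b>. This is simple font <b>again bold fonts</b>";

        // No leading space in the pattern, so a tag at the very start of the string also matches.
        string regex = @"<b>\s*(.+?)\s*</b>";
        var matches = Regex.Matches(input, regex);

        if (matches.Count > 0)
        {
            Console.WriteLine("Match found:");
            foreach (Match m in matches)
            {
                // Print only the captured content of each bold element.
                Console.WriteLine(m.Groups[1].Value);
            }
        }

        Console.ReadLine();
    }
}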
Output:
Match found:
bold fonts
again bold fonts
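The scripts above hard-code the <b> tag. As a rough sketch of how the same idea could extend to any tag name, the pattern below captures the name in a group and uses a backreference to require the same name in the closing tag. This is an assumption-laden illustration, not a general HTML parser: ElementFinder and FindElements are hypothetical names, and nested tags of the same name are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only for this illustration.
static class ElementFinder
{
    // (?<tag>\w+)\b : captures the whole tag name
    // [^>]*>        : optional attributes, then the end of the opening tag
    // \k<tag>       : backreference, so the closing tag must use the same name
    private const string Pattern = @"<(?<tag>\w+)\b[^>]*>\s*(?<content>.*?)\s*</\k<tag>>";

    // Returns (tag name, content) pairs in the order they appear.
    public static List<KeyValuePair<string, string>> FindElements(string html)
    {
        var results = new List<KeyValuePair<string, string>>();
        foreach (Match m in Regex.Matches(html, Pattern, RegexOptions.Singleline))
        {
            results.Add(new KeyValuePair<string, string>(
                m.Groups["tag"].Value, m.Groups["content"].Value));
        }
        return results;
    }
}

class Program
{
    static void Main(string[] args)
    {
        string input = "This written in <b>bold fonts</b>. This is <p>paragraph</p>";
        foreach (var element in ElementFinder.FindElements(input))
        {
            Console.WriteLine(element.Key + ": " + element.Value);
        }
    }
}
```

For the sample input, this prints "b: bold fonts" followed by "p: paragraph".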