How to Find HTML Tags using Regex

This article shows how to extract HTML tags, and the content within them, using C# regular expressions (regex).

Extracting HTML tags from strings is useful when parsing web pages. With regex, you can extract the HTML tags alone, the tags together with their content, or the content alone. This article covers all three use cases.

Finding HTML Tags Only

You can use the static Matches() method of the Regex class to find all the HTML tags within a string. The regular expression "<.*?>" does this: it matches a less-than sign (<), then as few characters as possible, up to the next greater-than sign (>), which is the text of a single tag.

If the string contains at least one such tag, the Count property of the MatchCollection returned by Matches() is greater than zero. You can then iterate through the Match objects in the collection and read each matched string from its Value property.

Here is an example. In the script below the Matches() method matches opening and closing bold and paragraph tags.

Note: Each script in this article needs a using directive for the System.Text.RegularExpressions namespace, which contains the Regex class.

    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            string input = "This is written in <b>bold fonts</b>. This is simple font <b>again bold fonts</b>. " +
                "This is <p>paragraph</p>";

            // Lazy pattern: matches each tag from '<' up to the nearest '>'.
            string regex = @"<.*?>";
            var matches = Regex.Matches(input, regex);

            if (matches.Count > 0)
            {
                Console.WriteLine("Match found:");
                foreach (Match m in matches)
                {
                    Console.WriteLine(m.Value);
                }
            }
            Console.ReadLine(); // Keep the console window open.
        }
    }

Output:

    Match found:
    <b>
    </b>
    <b>
    </b>
    <p>
    </p>
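The question mark in "<.*?>" is what keeps each match to a single tag. The sketch below compares that lazy pattern with the greedy "<.*>" on a short string. The class name GreedyVsLazy and the helper FindAll are illustrative assumptions, not part of the original scripts.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class GreedyVsLazy
{
    // Hypothetical helper: returns every match of the pattern as a list of strings.
    public static List<string> FindAll(string input, string pattern)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            results.Add(m.Value);
        }
        return results;
    }

    static void Main(string[] args)
    {
        string input = "This is <b>bold</b> text";

        // Lazy: each match stops at the first '>' that follows a '<'.
        Console.WriteLine(string.Join(" | ", FindAll(input, @"<.*?>")));
        // prints: <b> | </b>

        // Greedy: a single match runs from the first '<' to the last '>'.
        Console.WriteLine(string.Join(" | ", FindAll(input, @"<.*>")));
        // prints: <b>bold</b>
    }
}
```

With the greedy pattern, the two tags and the text between them come back as one match, which is usually not what you want when listing tags.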

Finding HTML Tags Including Content

You can also find HTML tags together with the content between them by using the Match() and Matches() methods. The Match() method returns only the first occurrence.

Let's see an example. To find a bold tag and the content within it, you can use the regular expression "<b>\s*(.+?)\s*</b>". It matches an opening bold tag, the text that follows, and the closing bold tag; the \s* parts skip any whitespace directly inside the tags.

If a match is found, the Success property of the returned Match object is true. In that case, you can read the matched text from the Value property. Here is a sample script:

    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            string input = "This is written in <b>bold fonts</b>. This is simple font";
            string regex = @"<b>\s*(.+?)\s*</b>";

            // Match() returns only the first occurrence of the pattern.
            var match = Regex.Match(input, regex);
            if (match.Success)
            {
                Console.WriteLine("Match found");
                Console.WriteLine(match.Value);
            }
            Console.ReadLine(); // Keep the console window open.
        }
    }

Output:

    Match found
    <b>bold fonts</b>

If you want to find every occurrence of a tag within a string, use the Matches() method, which returns a collection of Match objects. You can then read each matched string from the Value property of the corresponding Match object.

The script below searches for all the bold tags within the input string.

    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            string input = "This is written in <b>bold fonts</b>. This is simple font <b>again bold fonts</b>";

            // No leading space in the pattern, so a tag at the very start of the string also matches.
            string regex = @"<b>\s*(.+?)\s*</b>";
            var matches = Regex.Matches(input, regex);
            if (matches.Count > 0)
            {
                Console.WriteLine("Match found:");
                foreach (Match m in matches)
                {
                    Console.WriteLine(m.Value);
                }
            }
            Console.ReadLine(); // Keep the console window open.
        }
    }

Output:

    Match found:
    <b>bold fonts</b>
    <b>again bold fonts</b>

In the output above, you can see that the tags along with the content are found.
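The same pattern can be built for any tag name. The sketch below is one way to do that, assuming tags without attributes. The class TagFinder and the helper FindElements are hypothetical names, and the RegexOptions flags are an optional addition that the scripts above do not use.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class TagFinder
{
    // Hypothetical helper: finds every <tagName>...</tagName> element (no attributes).
    public static List<string> FindElements(string input, string tagName)
    {
        // Regex.Escape guards against special characters in the tag name.
        string name = Regex.Escape(tagName);
        string pattern = "<" + name + @">\s*(.+?)\s*</" + name + ">";

        // IgnoreCase also matches <B>; Singleline lets '.' match line breaks.
        RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern, options))
        {
            results.Add(m.Value);
        }
        return results;
    }

    static void Main(string[] args)
    {
        string input = "This is <b>bold</b>, <B>also bold</B> and <p>a paragraph</p>";
        foreach (string element in FindElements(input, "b"))
        {
            Console.WriteLine(element);
        }
        // prints:
        // <b>bold</b>
        // <B>also bold</B>
    }
}
```

Calling FindElements(input, "p") on the same string would return only the paragraph element.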

Finding Content within HTML Tags

You can also extract only the content within HTML tags. To do so, use the Match() method with the same regular expression, "<b>\s*(.+?)\s*</b>". The parentheses form a capturing group that holds whatever occurs between the opening and closing bold tags.

The captured groups are available through the Groups property of the Match object. The item at index 0 holds the entire match, including the tags. The item at index 1 holds the text captured by the first pair of parentheses, which is the content alone.

Look at the script below for an example:

    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            string input = "This is written in <b>bold fonts</b>. This is simple font";
            string regex = @"<b>\s*(.+?)\s*</b>";
            var match = Regex.Match(input, regex);
            if (match.Success)
            {
                Console.WriteLine("Match found");
                // Groups[1] is the text captured by the parentheses, without the tags.
                Console.WriteLine(match.Groups[1].Value);
            }
            Console.ReadLine(); // Keep the console window open.
        }
    }

Output:

    Match found
    bold fonts

In the output of the above script, you can see only the content from the HTML tag printed on the console.
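To make the difference between the two Groups entries concrete, the sketch below prints both for the same match. The class GroupsDemo and the helper WholeAndContent are illustrative names, not part of the original script.

```csharp
using System;
using System.Text.RegularExpressions;

class GroupsDemo
{
    // Hypothetical helper: returns { whole match, captured content } for the
    // first <b> element, or an empty array when the input has none.
    public static string[] WholeAndContent(string input)
    {
        Match match = Regex.Match(input, @"<b>\s*(.+?)\s*</b>");
        if (!match.Success)
        {
            return new string[0];
        }
        return new[] { match.Groups[0].Value, match.Groups[1].Value };
    }

    static void Main(string[] args)
    {
        string input = "This is written in <b>bold fonts</b>. This is simple font";
        string[] parts = WholeAndContent(input);
        if (parts.Length == 2)
        {
            Console.WriteLine(parts[0]); // prints: <b>bold fonts</b>
            Console.WriteLine(parts[1]); // prints: bold fonts
        }
    }
}
```

Groups[0].Value is always the same string as match.Value, so Groups[1] is the entry to use when you want the content without the tags.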

You can also extract the content from multiple HTML tags. To do so, use the Matches() method with the same regular expression that you saw in the previous script. Here is an example:

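    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            string input = "This is written in <b>bold fonts</b>. This is simple font <b>again bold fonts</b>";

            // No leading space in the pattern, so a tag at the very start of the string also matches.
            string regex = @"<b>\s*(.+?)\s*</b>";
            var matches = Regex.Matches(input, regex);
            if (matches.Count > 0)
            {
                Console.WriteLine("Match found:");
                foreach (Match m in matches)
                {
                    // Groups[1] is the content between the tags.
                    Console.WriteLine(m.Groups[1].Value);
                }
            }
            Console.ReadLine(); // Keep the console window open.
        }
    }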

Output:

    Match found:
    bold fonts
    again bold fonts
