How to Find Website Links Using Regular Expressions in C#

Text documents contain many kinds of information, such as letters, numbers, special characters, and website links. A common task is to extract all the website links that appear in a document. In this article, you will see how to use regular expressions in C# to find website links in text. Note that this approach finds only links written out explicitly in the text (those starting with http, https, or www), not links hidden behind other text. Let's begin.

Finding A Single Website Link

To find a single website link, use the regular expression “\b(?:https?://|www\.)\S+\b”. The parts of the regular expression are explained below:

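\b: Matches a word boundary, so the link must begin at the start of a word, for example after a space or at the start of the text.
(?:https?://|www\.): A non-capturing group that matches http://, https://, or www. followed by a literal period. The s? makes the s optional, and \. matches the period itself rather than any character.
\S+: Matches one or more non-whitespace characters, which form the rest of the link.
\b: Matches a word boundary at the end of the link, so trailing punctuation such as a period or comma is left out of the match.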
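
The behavior of each part described above can be checked with a short program. The following is a minimal sketch rather than part of the article's scripts; the class name PatternCheck and the example.com sample strings are assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternCheck
{
    // Same pattern as in the article; IgnoreCase also accepts WWW. and HTTP://.
    public static readonly Regex LinkRegex =
        new Regex(@"\b(?:https?://|www\.)\S+\b", RegexOptions.IgnoreCase);

    public static void Main()
    {
        // Illustrative inputs (assumed, not taken from the article).
        string[] samples =
        {
            "http://example.com",
            "https://example.com",
            "WWW.EXAMPLE.COM",
            "ftp://example.com",   // no http, https, or www prefix
            "example.com"          // no prefix at all
        };

        foreach (string sample in samples)
        {
            // IsMatch reports whether the pattern is found anywhere in the input.
            Console.WriteLine($"{sample} -> {LinkRegex.IsMatch(sample)}");
        }

        // The closing \b leaves the trailing period out of the match.
        Console.WriteLine(LinkRegex.Match("Visit www.example.com.").Value);
    }
}
```

The first three samples should be reported as matches and the last two should not, because the pattern requires one of the three prefixes. A bare domain such as example.com is therefore never picked up by this expression.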

To return a single result, create a Regex object from the regular expression and pass the text to its Match() method. The following script returns the first website link encountered in the text.

using System;
using System.Text.RegularExpressions;

namespace RegexCodes
{
    class Program
    {
        static void Main(string[] args)
        {

            // Sample text containing a single website link.
            string textFile = "Hello, search the items on www.google.com ";

            Console.WriteLine("====================");
            var myRegex = new Regex(@"\b(?:https?://|www\.)\S+\b", RegexOptions.IgnoreCase);

            // Match() returns only the first match found in the text.
            string result = myRegex.Match(textFile).ToString();
            Console.WriteLine(result);

        }
    }
}

Here is the output of the above script. You can see that the website link has successfully been retrieved.

Output

====================
www.google.com
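
When the text contains no link, Match() still returns a Match object, but its Success property is false and its value is an empty string, so the script above would print a blank line. The sketch below shows one way to handle that case. The class name FirstLinkFinder, the helper TryFindFirstLink, and the second sample string are hypothetical names and data introduced only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstLinkFinder
{
    private static readonly Regex LinkRegex =
        new Regex(@"\b(?:https?://|www\.)\S+\b", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns true and sets link when a link is found;
    // otherwise returns false and sets link to an empty string.
    public static bool TryFindFirstLink(string text, out string link)
    {
        Match match = LinkRegex.Match(text);
        link = match.Success ? match.Value : string.Empty;
        return match.Success;
    }

    public static void Main()
    {
        string[] inputs =
        {
            "Hello, search the items on www.google.com ",
            "This text has no links."   // assumed sample without a link
        };

        foreach (string input in inputs)
        {
            if (TryFindFirstLink(input, out string link))
            {
                Console.WriteLine("Found: " + link);
            }
            else
            {
                Console.WriteLine("No link found.");
            }
        }
    }
}
```

Checking Success before using the value makes the "no link" case explicit instead of relying on an empty string.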

Finding Multiple Links

A text document can contain multiple website links starting with https, http, or www. To fetch all of them, we can again use the “\b(?:https?://|www\.)\S+\b” regular expression. However, this time we need to pass the text to the “Matches()” method instead of the “Match()” method.

In the following script, the sample text contains three website links: www.google.com, https://bing.com, and http://yahoo.com. The “Matches()” method returns all the links as a collection. The script iterates over that collection with a foreach loop and prints the value of each match on the console.

using System;
using System.Text.RegularExpressions;

namespace RegexCodes
{
    class Program
    {
        static void Main(string[] args)
        {

            // Sample text containing three links, one of them followed by a period.
            string textFile = "Hello, search the items on www.google.com " +
                "and https://bing.com. If you don't find any answer, you can search it on " +
                "http://yahoo.com as well";

            Console.WriteLine("====================");
            var myRegex = new Regex(@"\b(?:https?://|www\.)\S+\b", RegexOptions.IgnoreCase);

            // Matches() returns every non-overlapping match in the text.
            var results = myRegex.Matches(textFile);

            foreach (Match result in results)
            {
                Console.WriteLine(result.Value);
            }

        }
    }
}

The output of the above script is as follows. You can see that all the website links in the text have successfully been retrieved.

Output

====================
www.google.com
https://bing.com
http://yahoo.com
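
The scripts above search a hard-coded string. In practice, the text often comes from a file, and the same link may appear more than once. The following sketch, under those assumptions, reads the text with File.ReadAllText and removes repeated links with LINQ's Distinct. The class name FileLinkExtractor, the helper ExtractLinks, and the temporary file standing in for a real document are all hypothetical and used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class FileLinkExtractor
{
    private static readonly Regex LinkRegex =
        new Regex(@"\b(?:https?://|www\.)\S+\b", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns each link once, ignoring case differences.
    public static List<string> ExtractLinks(string text)
    {
        return LinkRegex.Matches(text)
            .Cast<Match>()                              // MatchCollection -> IEnumerable<Match>
            .Select(m => m.Value)                       // keep only the matched text
            .Distinct(StringComparer.OrdinalIgnoreCase) // drop repeated links
            .ToList();
    }

    public static void Main()
    {
        // A temporary file stands in for a real document (assumption for this sketch).
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path,
                "Search on www.google.com or https://bing.com.\n" +
                "Then try www.google.com again.");

            string text = File.ReadAllText(path);
            foreach (string link in ExtractLinks(text))
            {
                Console.WriteLine(link);
            }
        }
        finally
        {
            // Clean up the temporary file whether or not extraction succeeded.
            File.Delete(path);
        }
    }
}
```

To use this with a real document, replace the temporary file with the path to your own text file. For very large files, reading line by line may be preferable to loading the whole file at once.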
