One thing I keep having to do again and again (why???) is extract links from a web page. I recently wrote a tiny application that lists all the links on a given page, and I thought I'd share it with everyone.

Given the URL of the page, the first thing to do is get the page's source code so that we can scan it for links. There are a couple of ways to do this, but the easiest way, as far as I'm aware, is to use a WebClient object:

WebClient webClient = new WebClient();

//Get the HTML from the given page
byte[] response_html = webClient.DownloadData(url);

UTF8Encoding utf8 = new UTF8Encoding();

string html = utf8.GetString(response_html);

Read more about the WebClient class at MSDN.
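As one of the other ways to get the source, here is a minimal sketch that lets WebClient do the decoding through its DownloadString method. It assumes the page is UTF-8 encoded, and the names PageFetcher and FetchHtml are made up for illustration.

```csharp
using System;
using System.Net;
using System.Text;

static class PageFetcher
{
    // Hypothetical helper: downloads a page and returns its source as a string.
    public static string FetchHtml(string url)
    {
        // WebClient is disposable, so release it as soon as the download is done.
        using (var webClient = new WebClient())
        {
            // Assumption: the page is UTF-8. Other encodings need a different value here.
            webClient.Encoding = Encoding.UTF8;
            return webClient.DownloadString(url);
        }
    }

    static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("Usage: PageFetcher <url>");
            return;
        }

        Console.WriteLine(FetchHtml(args[0]));
    }
}
```

This saves the separate byte[] and UTF8Encoding steps, at the cost of fixing the encoding before the download starts.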

Now that we have the source code, the next thing to do is search for all the links. You can do this the hard way, i.e., use the String class's IndexOf method and hack your way out of the predicaments that come your way (and trust me, there are quite a few of them).
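To make those predicaments concrete, here is a rough sketch of what a naive IndexOf scan might look like. It is not the code from my application; NaiveLinkScanner and FindHrefs are hypothetical names, and the scan deliberately handles only the simplest form, href="..." with double quotes and no spaces around the equals sign.

```csharp
using System;
using System.Collections.Generic;

static class NaiveLinkScanner
{
    // Hypothetical helper: finds only href="..." written exactly like that.
    public static List<string> FindHrefs(string html)
    {
        var links = new List<string>();
        const string marker = "href=\"";
        int pos = 0;

        while (true)
        {
            // Look for the next occurrence of the marker, ignoring case.
            int start = html.IndexOf(marker, pos, StringComparison.OrdinalIgnoreCase);
            if (start < 0) break;

            start += marker.Length;

            // The value runs up to the closing double quote.
            int end = html.IndexOf('"', start);
            if (end < 0) break; // unterminated attribute, give up

            links.Add(html.Substring(start, end - start));
            pos = end + 1;
        }

        return links;
    }

    static void Main()
    {
        string html = "<a href=\"a.html\">A</a> <a HREF = \"b.html\">B</a> <a href=c.html>C</a>";

        // Prints only a.html: the spaced and the unquoted links are missed.
        foreach (string link in FindHrefs(html))
            Console.WriteLine(link);
    }
}
```

Every extra form (spaces around the equals sign, unquoted values, and so on) needs its own special case, which is where the hacking starts.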

The easy way is to use regular expressions.

The pattern for matching the href = "wherever.whatever" (i.e., the URL) part of a link is: href\s*=\s*(?:"(?<1>[^"]*)"|(?<1>\S+)). It looks a little ugly, but the point is that it works in about 90% of cases. What that regular expression actually means is the subject of another article, but suffice it to say, it works.
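Without going into a full explanation, a small sketch may help show what the pattern picks up. The sample HTML and the names HrefPatternDemo and ExtractHrefs are made up for illustration; the comments give a rough reading of each part of the pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HrefPatternDemo
{
    // href            the attribute name (case-insensitive because of IgnoreCase)
    // \s*=\s*         an equals sign with optional whitespace on either side
    // "(?<1>[^"]*)"   a double-quoted value, captured into group 1
    // |(?<1>\S+)      or an unquoted value: a run of non-whitespace, also group 1
    static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the captured URL of every match.
    public static List<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
            urls.Add(m.Groups[1].Value);
        return urls;
    }

    static void Main()
    {
        string html = "<a href=\"a.html\">A</a> <a HREF = \"b.html\">B</a> <a href=c.html >C</a>";

        // Prints a.html, b.html and c.html, one per line.
        foreach (string url in ExtractHrefs(html))
            Console.WriteLine(url);
    }
}
```

Note the space after c.html in the sample. Because the unquoted branch grabs everything up to the next whitespace, an unquoted value followed directly by > would be captured together with the markup after it, which is one of the cases where the pattern falls short.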

So, the code then:

private void GetLinks(string url)
{
     //using System.Net
     WebClient webClient = new WebClient();

     //Get the HTML from the given page
     byte[] response_html = webClient.DownloadData(url);

     //using System.Text
     UTF8Encoding utf8 = new UTF8Encoding();

     string html = utf8.GetString(response_html);

     //using System.Text.RegularExpressions
     Regex r = new Regex(@"href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", RegexOptions.IgnoreCase | RegexOptions.Compiled);

     //Get all the matches
     MatchCollection mcl = r.Matches(html);

     //using System.Collections
     ArrayList a = new ArrayList();

     //Group 1 holds the URL itself; m.Value would be the whole href="..." match
     foreach (Match m in mcl)
         a.Add(m.Groups[1].Value);

     //A gridview object
     grdLinks.DataSource = a;
     grdLinks.DataBind();
}