HTML Agility Pack

Introduction

Many people already know HTML Agility Pack (HAP), a cool free component developed by Simon Mourier (former Microsoft architect consultant and now CTO of SoftFluent) that really simplifies your life when it comes to manipulating HTML code.

What exactly is HAP?

HAP is an HTML parser that builds a read/write Document Object Model (DOM) and supports plain XPath or XSLT. It is an assembly that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to the one System.Xml offers, but for HTML documents (or streams).
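
As a small illustration of that tolerance, the sketch below parses a deliberately malformed HTML string and queries it with XPath. The class and method names (MalformedHtmlDemo, ExtractImageSources) are hypothetical and exist only for this example; the HAP calls used are HtmlDocument.LoadHtml, SelectNodes and GetAttributeValue.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

static class MalformedHtmlDemo
{
    // Hypothetical helper for illustration; not part of HAP itself.
    public static List<string> ExtractImageSources(string html)
    {
        List<string> sources = new List<string>();

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a file or a URL

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images == null)
            return sources;

        foreach (HtmlNode img in images)
            sources.Add(img.GetAttributeValue("src", ""));

        return sources;
    }
}
```

It can be called with markup whose p and b elements are never closed, for example ExtractImageSources("<p>First<p>Second <b>bold<img src='a.gif'><img src='b.jpg'>"), without any clean-up step beforehand.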

This article will demonstrate how easy it is to download pictures from any public web site and store them in a local folder. You can also find three other samples in the downloadable version of HAP.

The scenario

Problem

Mido, a graphic designer with software development skills, often needs to download a bunch of pictures from web sites. Fed up with clicking on each picture and saving it locally, he decides to develop his own picture grabber. But he soon faces a major problem: how can he process the (often malformed) HTML stream received from web sites in order to locate the strings containing links to the pictures he wishes to download? He could write some funny (or weird, depending on how you look at it :)) RegEx code, but there is a much easier solution: HTML Agility Pack.

Solution

After some research on the net, Mido finds the HTML Agility Pack component. He soon decides to base his development on HAP. Here is what the code looks like (full source code is downloadable):

  • Instantiate the HtmlWeb class (a utility class that gets an HTML document over HTTP):

    HtmlWeb hw = new HtmlWeb();

  • Connect to the remote URL with the Load method, which returns an HtmlDocument object:

    HtmlDocument doc = hw.Load(remoteUrl);

  • Extract the links to the pictures to be downloaded: for the sake of clarity, we created a GetLinks method, which uses an XPath query to select all the a elements whose href attribute starts with "/" and returns a StringCollection object containing the full URLs of the remote pictures to be downloaded:

    HtmlNodeCollection atts = doc.DocumentNode.SelectNodes("//a[starts-with(@href, '/')]");

    Important: this is where HAP is very powerful; it allows selection of any part of the HTML document with an XPath query.

  • Download each file using the System.Net.WebClient class:

    // Requires: System, System.Collections.Specialized, System.IO, System.Net, System.Web
    static void DownloadPictures(StringCollection sc, string destination)
    {
        // WebClient is disposable, so release it when the downloads are done.
        using (WebClient myWebClient = new WebClient())
        {
            myWebClient.Credentials = CredentialCache.DefaultCredentials;
            foreach (string filePath in sc)
            {
                // The last URI segment is the (URL-encoded) file name.
                Uri uri = new Uri(filePath);
                string fileName = HttpUtility.UrlDecode(uri.Segments[uri.Segments.Length - 1]);

                // Use the same decoded name for the existence check and for the saved file.
                string localPath = Path.Combine(destination, fileName);

                if (!File.Exists(localPath))
                {
                    myWebClient.DownloadFile(HttpUtility.UrlPathEncode(filePath), localPath);
                    Console.WriteLine("File {0} has been downloaded.", fileName);
                }
            }
        }
    }
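
The GetLinks method mentioned in the steps above is only available in the downloadable source, so the following is a minimal sketch of what such a method might look like, not the authors' actual code. The signature, the baseUri parameter and the LinkExtractor class name are assumptions made for this example. It runs the same XPath query and resolves each site-relative href against the page URL; filtering by file extension, which the sample applies to keep only pictures, is left out here.

```csharp
using System;
using System.Collections.Specialized;
using HtmlAgilityPack;

static class LinkExtractor
{
    // Hypothetical signature: the real GetLinks is not shown in the article.
    public static StringCollection GetLinks(HtmlDocument doc, Uri baseUri)
    {
        StringCollection links = new StringCollection();

        // Same query as in the article: <a> elements whose href starts with "/".
        HtmlNodeCollection anchors =
            doc.DocumentNode.SelectNodes("//a[starts-with(@href, '/')]");
        if (anchors == null) // SelectNodes returns null when nothing matches
            return links;

        foreach (HtmlNode anchor in anchors)
        {
            string href = anchor.GetAttributeValue("href", "");

            // Turn "/img/a.gif" into "http://host/img/a.gif" using the page URL.
            Uri absolute;
            if (!Uri.TryCreate(baseUri, href, out absolute))
                continue;

            // Avoid downloading the same target twice.
            if (!links.Contains(absolute.AbsoluteUri))
                links.Add(absolute.AbsoluteUri);
        }
        return links;
    }
}
```

With the objects from the earlier steps, a call could look like LinkExtractor.GetLinks(doc, new Uri(remoteUrl)), and the resulting collection could then be handed to DownloadPictures.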
                            

Another application example

If you wish to download all the PDC files automatically, read the HTML Agility Pack articles and use the DownloadImage sample after changing the remote URL to http://commnet.microsoftpdc.com/content/downloads.aspx and the file extensions to "ppt" and "doc" instead of "gif" and "jpg". As you will see, it's very easy :)
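
How the DownloadImage sample matches file extensions is not shown in this article, so the following is only a sketch of one way to do it. ExtensionFilter and FilterByExtension are hypothetical names, not part of HAP or of the sample. The extension is compared against the path of each URL, so a trailing query string does not hide it, and the comparison ignores case.

```csharp
using System;
using System.Collections.Generic;

static class ExtensionFilter
{
    // Hypothetical helper: keeps only the absolute URLs whose path ends
    // with one of the given extensions (passed without the leading dot).
    public static List<string> FilterByExtension(IEnumerable<string> urls, params string[] extensions)
    {
        List<string> matches = new List<string>();
        foreach (string url in urls)
        {
            Uri uri;
            if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
                continue; // skip anything that is not an absolute URL

            foreach (string ext in extensions)
            {
                // AbsolutePath excludes the query string and the fragment.
                if (uri.AbsolutePath.EndsWith("." + ext, StringComparison.OrdinalIgnoreCase))
                {
                    matches.Add(url);
                    break;
                }
            }
        }
        return matches;
    }
}
```

Switching from pictures to presentations and documents then comes down to passing "ppt" and "doc" instead of "gif" and "jpg" as the extensions argument.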

Conclusion

HTML Agility Pack is a very powerful tool that will help you navigate inside any HTML stream exactly the same way you do with XML files. It allows you to build new applications (web service, web site, etc.) by wrapping the source HTML application. For instance, you can create an RSS proxy for any website, allowing RSS readers to retrieve syndicated content from a website that is not RSS-enabled. You will find the RSS sample inside the zip package.

Download sample code

Omid Bayani & Simon Mourier
