How do you read HTML? Parser or RegEx?

When reading HTML with C#, do you use a parser or do you use Regular Expressions? This post will discuss how parsing is better than regular expressions.

Written by Jonathan "JD" Danylko • Last Updated: December 19th, 2014 • Develop

Over the past weekend, I've been working on my CMS (Content Management System) and came across some old code that uses regular expressions to read HTML.

WHAT!?!?? OMG...what have I done!? I'm succumbing to the dark side.

My fears were laid to rest when I saw Html Agility Pack code sitting right next to the regex code. Phew! I simply hadn't converted this old RegEx code to use the parser yet.

I immediately replaced the regular expressions with parser code. Then, I had to take a shower because I felt so dirty for what I had done with regular expressions.

Once I saw the final code, then, and only then, did unicorns and Obama play together as one.

[Image: Obama riding a unicorn]

Now, I don't know where I got this picture, but we all know it's not real, right? (The tip-off is that there's no way Obama can shoot rainbows from his hands!)

Kind of like parsing HTML with Regular Expressions...it's not real!

What is your reasoning?

Still think regular expressions are better for reading HTML, huh?

Ok, let's go through a couple of common questions as to why parsing is better!

Question: Why do I need to use a parser? I can write regular expression code in fewer lines.

I'm sure you could, but honestly, it's almost the same number of lines.

Let's compare the two ways of reading an HTML title tag:

Html Agility Pack

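// Requires the HtmlAgilityPack NuGet package and "using HtmlAgilityPack;"
var doc = new HtmlDocument { OptionOutputAsXml = true };
doc.LoadHtml(htmlContent);
// SelectSingleNode returns null when nothing matches the XPath
var titleNode = doc.DocumentNode.SelectSingleNode("//head/title");
Console.Write("Title: {0}", titleNode?.InnerText);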

Regular Expression

var ex = new Regex(@"(?<=<title.*>)([\s\S]*)(?=</title>)", RegexOptions.IgnoreCase);
return ex.Match(htmlPage).Groups[1].Value.Trim();

Yes, the parser version may be a couple of lines longer, but the document is already loaded. The next time you need to examine another part of the HTML, you need only one line with an XPath expression to locate the node and one line to display or assign its value or attribute.

With the regular expression, you have to create a new RegEx instance to start the whole process over again to find another tag.
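
As a rough sketch of that reuse, the snippet below loads the markup once and then runs several XPath queries against the same document. It assumes the HtmlAgilityPack NuGet package is installed; the PageReader class, its helper methods, and the sample markup are hypothetical names made up for illustration, not part of the library.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class PageReader
{
    // Hypothetical helper: parse the markup once so every query can share the result.
    public static HtmlDocument Load(string htmlContent)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlContent);
        return doc;
    }

    // Trimmed inner text of the first node matching the XPath, or "" when nothing matches.
    public static string TextAt(HtmlDocument doc, string xpath)
    {
        var node = doc.DocumentNode.SelectSingleNode(xpath);
        return node?.InnerText.Trim() ?? string.Empty;
    }

    // Attribute value of the first node matching the XPath, or "" when the node or attribute is missing.
    public static string AttributeAt(HtmlDocument doc, string xpath, string attribute)
    {
        var node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? string.Empty : node.GetAttributeValue(attribute, string.Empty);
    }

    public static void Main()
    {
        // Made-up sample markup for the demonstration.
        const string html =
            "<html><head><title>My Page</title>" +
            "<meta name=\"description\" content=\"A sample page\"></head>" +
            "<body><h1>Welcome</h1></body></html>";

        var doc = Load(html);

        // Each additional lookup is one XPath call against the already-loaded document.
        Console.WriteLine(TextAt(doc, "//head/title"));
        Console.WriteLine(AttributeAt(doc, "//meta[@name='description']", "content"));
        Console.WriteLine(TextAt(doc, "//body/h1"));
    }
}
```

Returning an empty string for a missing node is just one design choice for the sketch; returning null or throwing would work equally well depending on how the caller wants to handle a page without the expected tag.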

Winner: Html Parser

Question: Why can't I use Regular Expressions? Everyone is using it to parse HTML!

I remember using regular expressions a long time ago. I kept saying there had to be a better way to read the HTML.

I'm a lazy programmer. Most programmers are lazy folks by nature. We find better ways of doing things and automate the crap out of it. Heck, that's our job.

Html Parsers provide better interfaces and do all of the heavy lifting, so you don't have to write code to find specific tags. They make things easier on us.

If you are using the Html Agility Pack, you see the same parsing code all over the Internet. Maybe it's not exactly the same code, but it's consistent. It's usually just a slight variation on the original parsing code.

Now...regular expressions...Yes, everyone is using regular expressions to parse HTML, but do you know how many versions of regular expressions I find on web sites that read a particular tag?

At least 5. Every developer who wants to "read HTML" writes a different one, and each is convinced theirs is the most awesome way of doing it...until it doesn't work with a specific web page.

I would rather stick to one tried-and-true piece of parser code to read HTML instead of 5 ways of reading it, each waiting to fail on the improperly formatted "Bob's Fabulous Bank" web site.
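
To make that failure concrete, here is a small sketch that runs the title regex from the comparison earlier in this post against a made-up page. The page markup and the TitleRegexDemo class are hypothetical; the only assumption about the page is that it contains a second title element (here inside an inline SVG), which is perfectly legal HTML. The page doesn't even have to be malformed to trip the pattern up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleRegexDemo
{
    // Same pattern as the regular expression example earlier in this post.
    private static readonly Regex TitlePattern =
        new Regex(@"(?<=<title.*>)([\s\S]*)(?=</title>)", RegexOptions.IgnoreCase);

    public static string ExtractTitle(string htmlPage)
    {
        return TitlePattern.Match(htmlPage).Groups[1].Value.Trim();
    }

    public static void Main()
    {
        // Hypothetical page: a normal <title> plus an SVG icon that carries its own <title>.
        const string page =
            "<html><head><title>Bob's Fabulous Bank</title></head>" +
            "<body><svg><title>Logo</title></svg></body></html>";

        // The greedy [\s\S]* runs all the way to the LAST </title> on the page.
        Console.WriteLine(ExtractTitle(page));
        // prints: Bob's Fabulous Bank</title></head><body><svg><title>Logo
    }
}
```

The pattern could be patched with a lazy quantifier, but that is exactly how the "5 versions" come about: each patch handles the last page that broke it. An XPath such as //head/title asks the parser for the title inside the head and leaves the SVG one alone.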

Also, this one Stack Overflow question (at last check) has 4,428 developers who agree that you should NOT use Regular Expressions to parse your HTML.

NOT everyone is doing it!

Winner: Html Parser

Conclusion

While I can appreciate developers "giving their all" at trying to find a tag and extract the inner text from it, I feel that there are companies that sometimes encourage this type of thinking (umm...Regular Expression Examples at Microsoft.com).

Regular expressions do have their place (matching on string patterns, splitting text, string validation, etc.), but using them on HTML is not a good idea. Too many unknowns.

Let the Html Parsers parse the data, let the regular expressions perform their string duties, and keep the RegEx out of the HTML arena.

Do you have any reasons why parsing (or regular expressions) is better? What are your thoughts on this topic? Post your comments below!
