Monday, May 21, 2012

Remove HTML tags from a text string in ASP.NET using regular expressions

This article explains a very simple way to remove HTML tags from text using regular expressions.

The method removes all HTML tags from the text, decodes character references (such as &nbsp; and &amp;), and returns plain text.

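    // Requires a reference to System.Web (ASP.NET).
    public static string RemoveHtmlTags(string htmlText, bool preserveNewLine)
    {
        // The server-side <div> is never rendered; it is used only for its InnerHtml/InnerText properties.
        System.Web.UI.HtmlControls.HtmlGenericControl divNew = new System.Web.UI.HtmlControls.HtmlGenericControl("div");
        divNew.InnerHtml = htmlText;
        if (preserveNewLine)
        {
            // Convert the common spellings of the HTML line break to "\n" before the tags are stripped.
            // Note: these replacements are case-sensitive, so "<BR>" is not converted.
            divNew.InnerHtml = divNew.InnerHtml.Replace("<br>", "\n");
            divNew.InnerHtml = divNew.InnerHtml.Replace("<br/>", "\n");
            divNew.InnerHtml = divNew.InnerHtml.Replace("<br />", "\n");
        }
        // InnerText returns the HTML-decoded content; the regex then removes everything that looks like a tag.
        return System.Text.RegularExpressions.Regex.Replace(divNew.InnerText, "<[^>]*>", "");
    }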

If new lines need to be preserved, the HTML line breaks must be converted to the newline character (which is "\n" in C#) before the tags are removed.

Example output using the above method:
Input: &nbsp;Take one <strong>notebook</strong>, pen, <u>pencil</u> and <em>eraser</em> with you.
Result:  Take one notebook, pen, pencil and eraser with you.
(In the above example, the leading &nbsp; is decoded to a non-breaking space character.)
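
For code that cannot reference the Web Forms control classes, the same idea can be sketched without HtmlGenericControl. This is only a minimal sketch and not the method shown above: the class name HtmlTextSketch and the method name RemoveHtmlTagsNoControl are made up for illustration, System.Net.WebUtility.HtmlDecode is swapped in for the decoding that InnerText performs, and the tags are stripped before decoding (the reverse of the order above) so that escaped text such as &lt;b&gt; survives as literal text.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextSketch
{
    // Hypothetical variant of RemoveHtmlTags that does not need System.Web.
    public static string RemoveHtmlTagsNoControl(string htmlText, bool preserveNewLine)
    {
        if (preserveNewLine)
        {
            // Matches <br>, <br/>, <br /> and upper-case forms such as <BR>.
            htmlText = Regex.Replace(htmlText, @"<br\s*/?>", "\n", RegexOptions.IgnoreCase);
        }

        // Strip the tags first, then decode the character references.
        string withoutTags = Regex.Replace(htmlText, "<[^>]*>", "");
        return WebUtility.HtmlDecode(withoutTags);
    }

    public static void Main()
    {
        string input = "&nbsp;Take one <strong>notebook</strong>, pen, <u>pencil</u> and <em>eraser</em> with you.";

        // Tags are removed and &nbsp; is decoded to a non-breaking space.
        Console.WriteLine(RemoveHtmlTagsNoControl(input, false));

        // With preserveNewLine set, the <br /> becomes a line break in the output.
        Console.WriteLine(RemoveHtmlTagsNoControl("Line one<br />Line two", true));
    }
}
```

Because it is a plain static method with no dependency on a page or control, a helper like this can live in any utility class of the project and be called wherever the plain text is needed.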

2 comments:

  1. Hi, I am wondering where in mvc project would I place this code? Also what would I replace in this code with my code?

Thanks for visiting my blog.
If this helped you in any way, please take a moment to write a comment.

Thanks
Nirman