Special Characters in Custom CMSs and Forms

by Terri Ann on January 30, 2008

It’s been a while since I wrote about Encoding & and HTML Entities in PHP, and that was just one part of solving the problem of special characters (such as ™, ©, ») and saving/retrieving them successfully through a database/CMS interface.

So now, 3 months and a week later, I bring you the solution to those tricky copy-and-paste-from-Word characters for your custom forms and CMSs!

I call it: Nothing more than a well thought out, tested and executed function, complete with required content-type.

Sounds less fancy when I call it that but that’s all there is to it!

Again to describe the problem we had:

  1. We’d enter the HTML from our text editor, where our « looked like &laquo;, and paste that into our simple text area/text field
  2. The special character would be stored into the database and would display on the web page as well as the CMS editor (when the page was open to edit)
  3. We’d pull the information from the database to the simple HTML text area/text field where the special characters would display as the special character, not the HTML code.
  4. When the data was saved back into the database it would save as the special character, not the intended HTML character code.
  5. Now, depending on the character type of the document or sometimes the browser, this symbol would not display and often be replaced by a box or question mark.

We also needed to account for cases where the first two steps were omitted because of the dreaded “copy and paste from Word”, where characters would be entered into the text area directly as special characters.

Updates to the Form Add/Edit Page

So here’s how to fix that problem: on the pages that display your forms (be they add or edit), make sure your content-type charset encoding is utf-8.

HTML

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
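If your server also sends its own charset in the HTTP Content-Type header, that header takes precedence over the meta tag, so it can be worth sending a matching header from PHP too. Here is a minimal sketch of that idea. The helper name sendUtf8ContentType() is made up for illustration (it is not part of the original fix), and it only works if it runs before any output has been sent.

```php
<?php
// Hypothetical helper, not from the original article: send a UTF-8
// Content-Type header so the HTTP header agrees with the meta tag.
function sendUtf8ContentType(): string
{
    $header = 'Content-Type: text/html; charset=utf-8';

    // header() is only effective before any output has been sent.
    if (!headers_sent()) {
        header($header);
    }

    return $header;
}

// Call this at the very top of the add/edit form page.
$sent = sendUtf8ContentType();
```

Returning the header string is just a convenience for logging or checking; the important part is the header() call.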

Update to The Server Side – Saving to The Database Page

Add the following function to your included custom functions file. (The bottom of the function should look familiar from the old Encoding & and HTML Entities in PHP article.)

PHP

    function convertSpecialCharacters($text){

       $badwordchars=array(
           "\xe2\x80\x98", // left single quote
           "\xe2\x80\x99", // right single quote
           "\xe2\x80\x9c", // left double quote
           "\xe2\x80\x9d", // right double quote
           "\xe2\x80\x94", // em dash
           "\xe2\x80\x93", // en dash
           "\xc2\xbb",     // right arrow quote
           "\xc2\xab",     // left arrow quote
           "\xc2\xa9",     // copyright
           "\xc2\xae",     // registered
           "\xe2\x84\xa2", // trademark
           "\xe2\x82\xac", // euro  
           "\xe2\x80\xa2", // bullet    
           "\xe2\x80\xa6"  // ellipsis
       );
       $fixedwordchars=array(
           "&#8216;",
           "&#8217;",
           '&#8220;',
           '&#8221;',
           '&mdash;',
           '&ndash;',
           '&raquo;',
           '&laquo;',
           '&copy;',
           '&reg;', 
           '&trade;',
           '&euro;',
           '&bull;',
           '&#8230;'
       );
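       // Swap each UTF-8 byte sequence for its matching HTML entity.
       $text = str_replace($badwordchars,$fixedwordchars,$text);
       // Encode any bare & that is not already the start of an entity.
       $text = preg_replace('/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w+);)/', '&amp;', $text);
       return $text;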
    }
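To see what the preg_replace line does by itself, here is a small sketch that runs the same pattern on a sample string. The sample text and variable names are made up for illustration. The negative lookahead skips any & that already starts an entity (named like &copy; or numeric like &#8217;) and encodes every other & as &amp;.

```php
<?php
// Same pattern as in convertSpecialCharacters(), run in isolation.
$pattern = '/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w+);)/';

// Illustrative input: two bare ampersands and two existing entities.
$input  = 'Fish & Chips &copy; 2008 &#8217; AT&T';
$output = preg_replace($pattern, '&amp;', $input);

echo $output, "\n";
// Fish &amp; Chips &copy; 2008 &#8217; AT&amp;T

// Running it a second time changes nothing, because &amp; is itself an entity.
$again = preg_replace($pattern, '&amp;', $output);
```

Because existing entities are left alone, text that has already been converted can pass through the function again without being double-encoded.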

And then, of course, call the function and pass it the variable or text that needs to be encoded. The function returns the encoded text, so keep the result.

PHP

    $title = convertSpecialCharacters($_POST['title']);
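Most forms have more than one field, so one way to apply this is to loop over the field names you expect. The following is only a sketch under some assumptions: convertFields() and the field names are hypothetical, and a one-character stand-in converter is used so the example runs on its own. On a real page you would pass 'convertSpecialCharacters' as the converter and $_POST as the input.

```php
<?php
// Hypothetical helper (not from the article): run a converter over
// a chosen list of form fields and return the converted values.
function convertFields(array $input, array $fieldNames, callable $converter): array
{
    $clean = [];
    foreach ($fieldNames as $name) {
        // Missing fields become empty strings instead of raising a notice.
        $value = isset($input[$name]) ? (string) $input[$name] : '';
        $clean[$name] = $converter($value);
    }
    return $clean;
}

// Stand-in converter so this sketch is self-contained; it handles only
// the right single quote. Use 'convertSpecialCharacters' in practice.
$standIn = function (string $text): string {
    return str_replace("\xe2\x80\x99", '&#8217;', $text);
};

// Simulated form submission in place of $_POST.
$post  = ['title' => "It\xe2\x80\x99s new", 'body' => 'Plain text'];
$clean = convertFields($post, ['title', 'body', 'summary'], $standIn);
```

Listing the field names explicitly means unexpected form fields are ignored rather than converted and saved.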

It really is a very easy solution to a problem that appeared so complicated the first time we encountered it at work.


1 awter January 16, 2009 at 12:51 pm

Great stuff, your replacement function seems more complete than others I have found on the Internet. Thanks!

But:

  - please try to optimise this page for SE (the phrase “microsoft smart quotes” for instance would help to find this page)
  - why the preg_replace? What does it do?

Thanks again!
