This isn't about "Word formatted" characters but about all non-ASCII UTF-8 ones. The system has to process them as UTF-8 at every step in a long chain:
- Blue writes a story in Frontpage, which is configured to UTF-8
- copies it into the Blammo admin script, which runs with Perl settings for UTF-8 processing
- it gets stored in MySQL, which, along with its tables and string columns, is set to UTF-8 (utf8_unicode_ci)
- the front-end script (with the same settings*) retrieves two days of news and writes the static .shtml files with UTF-8 characters
- the switch script serves the .shtml page with or w/o ads**
- alternatively, the board script (with the same settings*) does the same with news stories dynamically, for Share and Comment pages; ditto for the articles, logos, and other scripts
- all generated HTML5 starts with meta charset="utf-8"
- Apache sends all HTML over HTTPS to the browser with Content-Type: text/html; charset=utf-8
- the browser should render the received bytes as UTF-8 characters, but doesn't always
* Those I fixed in early October, which I thought covered all bases, until the problem re-emerged three weeks ago.
** In an oversight, this script did not yet fully use Perl UTF-8 settings until this morning.
So I believe the frontpage now correctly shows UTF-8 characters again. At least, in dozens of refreshes I haven't seen malformed "Word" quotes, ellipses, mdashes, etc. anymore. Please let me know if you still catch any.
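Refreshing the page only shows what the browser made of the bytes. A stricter check is to test the raw bytes themselves at any step of the chain (the database value, the .shtml file, the HTTP response body). A minimal sketch in Python, used here only for illustration since the site scripts are Perl; the helper name first_invalid_utf8 and the sample headline are hypothetical:

```python
def first_invalid_utf8(data: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")  # strict decoding raises on any malformed sequence
    except UnicodeDecodeError as err:
        return err.start
    return None


# Hypothetical sample: the same headline written out two different ways.
good = "Exposé".encode("utf-8")    # b'Expos\xc3\xa9'
bad = "Exposé".encode("latin-1")   # b'Expos\xe9', a lone E9 is not valid UTF-8

print(first_invalid_utf8(good))
print(first_invalid_utf8(bad))
```

If the served bytes pass this check and the browser still shows �, the problem is on the browser side; if they fail, some step upstream wrote non-UTF-8 bytes.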
Separately, we still have a problem with UTF-8 characters in single stories. In fact, the two Cyberpunk Exposé stories showing up in the popular threads box mid-January is how I first noticed it. And one of them is still erratic on the Share/Comments links:
For Blue and me, on story 1 the headline and browser title bar always show the accented e (eacute): é.
But on #2, they erratically flip between the é and the question mark on a diamond background, i.e. the replacement character �
What I noticed as a possibly relevant difference between the two stories is that é is a 2-byte character (C3 A9), and while in story 1 multiple 3-byte characters occur (e.g. mdash —, E2 80 94), story 2 has none. Adding an mdash character to story 2 too (in testing) causes it to no longer show the replacement character in dozens of refreshes.
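The byte counts above can be verified, and a story scanned for which UTF-8 lengths it contains, with a short sketch like this (Python for illustration only; the function name and the two sample headlines are made up, not the actual stories):

```python
def utf8_length_histogram(text: str) -> dict:
    """Count how many characters in text take 1, 2, 3 or 4 bytes in UTF-8."""
    counts = {}
    for ch in text:
        n = len(ch.encode("utf-8"))
        counts[n] = counts.get(n, 0) + 1
    return counts


print("é".encode("utf-8").hex())       # c3a9
print("\u2014".encode("utf-8").hex())  # e28094 (the mdash)

# Hypothetical headlines: one with an mdash, one with only the accented e.
story_1_like = "Cyberpunk Exposé \u2014 part one"
story_2_like = "Cyberpunk Exposé part two"

print(utf8_length_histogram(story_1_like))
print(utf8_length_histogram(story_2_like))
```

A story whose histogram has no key 3 (or 4) is in the same situation as story 2.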
But I don't know the reason or the remedy for this yet; it's really weird that the browser only renders UTF-8 characters correctly and consistently if the page includes 3-byte ones (and then the 2-byte ones render too), but not if the page includes only 2-byte characters, given that both the HTML and the HTTPS response header instruct it to use that character set.
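One possible explanation worth testing (a hypothesis, not something confirmed for these scripts): the bytes may already differ before they reach the browser. Perl can hold a string whose characters are all below code point 256, such as é (U+00E9), in a one-byte-per-character form, and a script that prints it without an explicit output encoding layer then sends the single byte E9, which is not valid UTF-8. A string that also contains a wider character such as the mdash (U+2014) is written out as UTF-8 instead. The sketch below only models that assumed behaviour in Python; it is not Perl's actual internals, and both function names are hypothetical:

```python
def emit_without_encoding_layer(text: str) -> bytes:
    """Model of the suspected behaviour (an assumption, not real Perl internals).

    Strings whose characters all fit in one byte are written as Latin-1;
    strings containing any wider character are written as UTF-8.
    """
    if all(ord(ch) < 256 for ch in text):
        return text.encode("latin-1")
    return text.encode("utf-8")


def as_browser_sees(data: bytes) -> str:
    """Decode as UTF-8 the way a browser would, substituting U+FFFD on errors."""
    return data.decode("utf-8", errors="replace")


only_2_byte = "Exposé story"              # like story 2
with_3_byte = "Exposé \u2014 story"       # like story 1 (contains an mdash)

print(as_browser_sees(emit_without_encoding_layer(only_2_byte)))
print(as_browser_sees(emit_without_encoding_layer(with_3_byte)))
```

This would account for the 2-byte versus 3-byte difference, though not by itself for the erratic flipping between refreshes. If it does turn out to be the cause, the usual remedy is an explicit output layer in every script that prints the page, for example Perl's binmode(STDOUT, ':encoding(UTF-8)'), rather than relying on a 3-byte character being present.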
Yes, a workaround may be to ensure that the page always includes a 3-byte character somewhere, but I'd still like to understand the underlying problem if possible.
Btw, I hardly spent any time on this until yesterday, because I initially hit a
trying to figure out what happened when, and because I was immersed in another programming project (non-BN; in fact, I haven't done any real BN development since the smilies modal in early November). But at least today there's some