General Discussion is the text corruption on the front page being investigated?

Re: is the text corruption on the front page being investigated?
Feb 5, 2021, 14:50
This isn't about "Word formatted" characters but about all UTF-8 (non-ASCII) ones. The system has to process them as UTF-8 at every step in a long chain:

  • Blue writes a story in Frontpage, which is configured to UTF-8
  • copies it into the Blammo admin script, which runs with Perl settings for UTF-8 processing
  • it gets stored in MySQL, where the database, its tables, and their string columns are all set to UTF-8 (utf8_unicode_ci)
  • the front-end script (with the same settings*) retrieves two days of news and writes the static .shtml files with UTF-8 characters
  • the switch script serves the .shtml page with or w/o ads**
  • alternatively, the board script (with the same settings*) does the same with news stories dynamically, for Share and Comment pages; ditto for the articles, logos, and other scripts
  • all generated HTML5 starts with meta charset="utf-8"
  • Apache sends all HTML over HTTPS to the browser with Content-Type: text/html; charset=utf-8
  • the browser should render the received bytes as UTF-8 characters, but doesn't always

* Those I fixed early October, which I thought covered all bases, until the problem re-emerged three weeks ago.

** Through an oversight, this script did not fully use the Perl UTF-8 settings until this morning. So I believe the front page now shows UTF-8 characters correctly again. At least, in dozens of refreshes I haven't seen malformed "Word" quotes, ellipses, mdashes, etc. anymore. Please let me know if you still catch any.
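
To illustrate why a single step without UTF-8 settings can garble everything downstream, here is a minimal sketch. It is in Python rather than the site's Perl, purely for illustration, and `pass_through` is a hypothetical stand-in for one script in the chain. It assumes the faulty step reads the bytes as Latin-1 (one character per byte), which is one common failure mode and not necessarily what the switch script did.

```python
def pass_through(data: bytes, assumed_encoding: str) -> bytes:
    """Simulate one step: decode with the encoding the step assumes, write UTF-8."""
    return data.decode(assumed_encoding).encode("utf-8")


# Sample text with a 2-byte character (e acute) and 3-byte ones (mdash, curly quotes).
story = "Expos\u00e9 \u2014 \u201cquoted\u201d"
original = story.encode("utf-8")

# A step with correct settings leaves the bytes unchanged.
correct = pass_through(original, "utf-8")

# A step that assumes Latin-1 re-encodes every byte >= 0x80 on its own,
# so each non-ASCII character ends up double-encoded.
broken = pass_through(original, "latin-1")

print(correct == original)        # the correct step is a no-op
print(len(original), len(broken)) # the broken output is longer
print(broken[:12].hex(" "))       # "Expos" followed by the double-encoded e acute
```

Because every other step decodes the result as UTF-8 without error, the damage only becomes visible in the browser, which is part of what makes a single misconfigured script hard to spot.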

Separately, we still have a problem with UTF-8 characters in single stories. In fact, the two Cyberpunk Exposé stories showing up in the popular threads box mid-January is how I first noticed it. And one of them is still erratic on the Share/Comments links:

For Blue and me, on story 1 the headline and browser title bar always show the accented e (eacute): é.
But on #2, they erratically flip between the é and the question mark on a diamond background, i.e. the replacement character �.

What I noticed as a possibly relevant difference between the two stories is that é is a 2-byte character (C3 A9), and while in story 1 multiple 3-byte characters occur (e.g. mdash —, E2 80 94), story 2 has none. Adding an mdash character to story 2 too (in testing) causes it to no longer show the replacement character in dozens of refreshes.
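
The byte values above are easy to check. A small sketch, again in Python for illustration only; `utf8_hex` and `utf8_widths` are hypothetical helper names, and the sample strings are made up rather than taken from the actual stories.

```python
import unicodedata


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


def utf8_widths(text: str) -> list:
    """Return the sorted set of UTF-8 byte lengths of the non-ASCII characters."""
    return sorted({len(c.encode("utf-8")) for c in text if ord(c) > 0x7F})


# e acute, mdash, ellipsis, left curly quote, replacement character
for ch in ("\u00e9", "\u2014", "\u2026", "\u201c", "\ufffd"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {utf8_hex(ch)}")

# A headline with only the accented e has 2-byte characters only;
# adding an mdash brings in a 3-byte one.
print(utf8_widths("Cyberpunk Expos\u00e9"))           # [2]
print(utf8_widths("Cyberpunk Expos\u00e9 \u2014 2"))  # [2, 3]
```

Running `utf8_widths` over a story's text is a quick way to tell which stories fall into the "2-byte only" group.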

But I don't know the reason or the remedy for this yet. It's really weird that the browser only renders UTF-8 characters correctly and consistently if the page contains 3-byte ones (and then the 2-byte ones too), but not if a page includes only 2-byte characters, given that both the HTML and the HTTPS headers instruct it to use that character set.
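
One way to narrow this down is to look at the raw bytes of a response instead of at what the browser renders. The sketch below (Python, for illustration; `diagnose` is a hypothetical helper, and it assumes you have saved the response body as bytes by some other means) separates three cases: the replacement character is already in the bytes, the bytes are not valid UTF-8, or the bytes are fine and the problem is in how the client decodes them.

```python
def diagnose(body: bytes) -> str:
    """Classify a raw response body with respect to UTF-8 validity."""
    # EF BF BD is U+FFFD encoded as UTF-8: if it is in the bytes,
    # the substitution happened before the page reached the browser.
    if b"\xef\xbf\xbd" in body:
        return "U+FFFD already present in the bytes"
    try:
        body.decode("utf-8")
    except UnicodeDecodeError as err:
        return f"invalid UTF-8 at byte offset {err.start}"
    return "valid UTF-8"


print(diagnose("Expos\u00e9".encode("utf-8")))   # valid UTF-8
print(diagnose(b"Expos\xe9"))                    # invalid UTF-8 at byte offset 5
print(diagnose("Expos\ufffd".encode("utf-8")))   # U+FFFD already present in the bytes
```

If a page that renders with the diamond question mark still comes back as "valid UTF-8" here, the server side of the chain is probably fine for that request, and the client's decoding is the place to look.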

Yes, a workaround may be to ensure that the page always includes a 3-byte character somewhere, but I'd still like to understand the underlying problem if possible.
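
A rough sketch of what that workaround could look like, in Python for illustration only. Both function names are hypothetical, and hiding the character in a trailing HTML comment is an assumption: in testing the mdash was added to the story itself, and a comment may or may not have the same effect.

```python
def has_three_byte_char(page: str) -> bool:
    """True if any character in the page needs 3 or more bytes in UTF-8."""
    return any(len(c.encode("utf-8")) >= 3 for c in page)


def ensure_three_byte_char(page: str) -> str:
    """Append an HTML comment holding an mdash if the page has no 3-byte character."""
    if has_three_byte_char(page):
        return page
    return page + "<!-- \u2014 -->\n"


page = "<title>Cyberpunk Expos\u00e9</title>\n"
patched = ensure_three_byte_char(page)
print(has_three_byte_char(page))     # False
print(has_three_byte_char(patched))  # True
```

Pages that already contain a 3-byte character are returned unchanged, so the function is safe to apply to every generated page.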

Btw, I hardly spent any time on this until yesterday, because I initially hit a wall trying to figure out what happened when, and because I was immersed in another programming project (non-BN; in fact I haven't done any real BN development since the smilies modal in early November). But at least today there's some progress.
-- Frans