Everything I know about UTF-8 is wrong.

Cecil · Mar 25, 2015

I was importing a plain text file into a LONGCHAR variable. The file contained an unexpected extended character, which caused the error "Error -26 in file to longchar." My stream and internal code pages are both set to UTF-8.

I've written a little test program, and under a UTF-8 session it seems to output only 2 bytes when I was expecting 3. However, if the OE session is ISO8859-1 I get the expected 3 bytes.

Just for my sanity: why won't it write the ASCII character '150' when the OE session is UTF-8?

Code:

DEFINE STREAM SOUTPUT.
OUTPUT STREAM SOUTPUT TO 'asciitest.txt'.
/* Expecting 3 bytes in total: character 150, then CR and LF. */
PUT STREAM SOUTPUT CONTROL CHR(150).
PUT STREAM SOUTPUT CONTROL CHR(13) + CHR(10).
OUTPUT STREAM SOUTPUT CLOSE.

FILE-INFO:FILE-NAME = 'asciitest.txt'.

MESSAGE Session:cpstream "File size: " FILE-INFO:FILE-SIZE.

Also I fixed my original code:

Code:

COPY-LOB FROM FILE pchFullFilename TO OBJECT chMessageBody NO-CONVERT.

to this:

Code:

COPY-LOB FROM FILE pchFullFilename TO OBJECT chMessageBody CONVERT SOURCE CODEPAGE 'ISO8859-1' TARGET CODEPAGE 'UTF-8'.

However the erroneous extended character is now stripped away as part of the COPY-LOB statement.

Info:
OE: 11.5 32bit
OS: Windows 7 64bit Pro.

Cecil · Mar 25, 2015

Cecil said:
However the erroneous extended character is now stripped away as part of the COPY-LOB statement.

Whoops. I've made a mistake: the character is not being stripped away, it just becomes a non-printable character.
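
A quick check outside ABL suggests why the character survives the conversion but stops being printable. This is a Python sketch, not OpenEdge code. It assumes the offending byte is 0x96 (decimal 150), as in the test program above, and the Windows-1252 reading at the end is only a guess at what the file's author may have intended.

```python
# Illustrative sketch only: Python, not ABL. Assumes the stray byte is 0x96 (150).
raw = bytes([150])

# Read as ISO8859-1, byte 0x96 maps to code point U+0096, a C1 control
# character with no glyph, so viewers show nothing or a placeholder.
as_latin1 = raw.decode("iso8859-1")
latin1_codepoint = ord(as_latin1)
latin1_printable = as_latin1.isprintable()

# Converting that code point to UTF-8 yields two bytes, not one.
utf8_bytes = as_latin1.encode("utf-8")

# Assumption: if the file really came from a Windows-1252 source,
# the same byte would have meant an en dash (U+2013).
as_cp1252 = raw.decode("cp1252")
cp1252_codepoint = ord(as_cp1252)

print(hex(latin1_codepoint), latin1_printable, utf8_bytes, hex(cp1252_codepoint))
```

If the source file was in fact written on Windows, trying 'WINDOWS-1252' (or '1252') as the source code page might be worth a test, but that is a guess about the file, not something the thread confirms.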

Rob Fitzpatrick · Mar 25, 2015

Cecil said:
Whoops. I've made a mistake: the character is not being stripped away, it just becomes a non-printable character.

And 150 is outside of the ASCII range, whose graphical code points end at 126...

Cecil · Mar 26, 2015

Rob Fitzpatrick said:
And 150 is outside of the ASCII range, whose graphical code points end at 126...

Okay..... So does this explain why I can't write any ASCII value greater than 126 to the file when my session is UTF-8?

Rob Fitzpatrick · Mar 26, 2015

Cecil said:
Okay..... So does this explain why I can't write any ASCII value greater than 126 to the file when my session is UTF-8?

How do you determine that you can't write the data? Does the AVM throw an error? Or does the program you're using to view the resulting file (e.g. Notepad, vim, whatever) not display, or appear not to display, the data? Remember that not all code points are graphical (meaning that they have a corresponding glyph or character). Some are control code points like ESC, EOT, DEL, etc., dating back to the days of dumb terminals. Not all programs that visualize text agree on what to do with those.

I think of text as being nothing more than a shared hallucination; a convenient abstraction that we (mostly) agree on. Files, buffers, etc. don't contain text; they contain data. From this perspective there is no meaningful distinction between "text files" and "binary files", although we often speak colloquially in these terms.

So there isn't an "A" in your file; but there might be a byte that contains 0x41. Or, if you're using an application that interprets the data as Unicode, there might be some other byte or sequence of bytes that corresponds to a graphical code point that looks like an "A". It's worth noting though that nearly all code pages you're likely to meet agree on the ASCII range of code points (EBCDIC being the notable exception). ASCII is a subset of all of them.

The upshot is that a text editor isn't showing you the data in the file (unless it has a hex-dump mode); it's showing you a particular interpretation or mapping of the data. And there could be other valid interpretations of the same data that differ visually. When I really want to know what's in a file I look at a hex dump. In Unix you can use tools like xxd or od. In Windows, unfortunately there isn't a decent text editor built in, so get a good third-party text editor. I like UltraEdit (which has a hex mode), but there are lots of others.
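
For anyone without xxd, od or a hex-capable editor to hand, a minimal hex-dump sketch in Python (not ABL) is shown here. The helper name `hex_dump` and the sample bytes are made up for illustration; it is only meant to show the kind of view described above, not to replace a real tool.

```python
# Illustrative sketch only; hex_dump is a hypothetical helper, not a standard tool.
def hex_dump(data: bytes, width: int = 16) -> str:
    """Return an offset / hex / printable-ASCII view of data."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show only graphical ASCII (32..126); everything else becomes '.'
        text_part = "".join(chr(b) if 32 <= b <= 126 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part.ljust(width * 3 - 1)}  {text_part}")
    return "\n".join(lines)

sample = bytes([0x41, 0x96, 0x0D, 0x0A])   # 'A', byte 150, CR, LF
print(hex_dump(sample))
```

To inspect a real file, read it in binary mode first, for example `hex_dump(open('asciitest.txt', 'rb').read())`, so that no code page conversion happens before you look at the bytes.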

Cecil · Mar 26, 2015

Thanks, Rob, for your input. I understand where you are coming from; it's just that I can't understand why the bytes aren't being written from the ABL code.

OK, changing tack, here is some new test code:

Code:

DEFINE VARIABLE inLoop AS INTEGER     NO-UNDO.
DEFINE STREAM SOUTPUT.
OUTPUT STREAM SOUTPUT TO 'asciitest.txt'.
DO inLoop = 1 TO 255:
    PUT STREAM SOUTPUT UNFORMATTED chr(inLoop).
END.
OUTPUT  STREAM SOUTPUT CLOSE.

It loops from 1 to 255, producing every character in that range. However, when you look at the text file in Notepad++ in hex mode, it stops after character 127. This only happens when the session is UTF-8. (See the attached screenshot.)

The same code using the Latin 1 (ISO8859-1) code page produces the extended character set:

Stefan · Mar 27, 2015

Code:

def var ii as int.

output to 'asciitest.utf.txt'.

do ii = 0 to 257:
   put unformatted length( chr( ii ) ) " " chr( ii ) skip.     
end.

output close.

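A single byte with the value 150 is not a valid character in UTF-8 (http://en.wikipedia.org/wiki/UTF-8). Byte values from 128 to 255 never stand alone: 128 to 191 are continuation bytes, 194 to 244 are lead bytes that announce a two-, three- or four-byte sequence, and 192, 193 and 245 to 255 are never used at all. So in a UTF-8 session CHR(150) has no character to return, which matches the missing bytes in your file.

For example, the A umlaut is at code point 196 in ISO8859-1, but in a UTF-8 session ASC returns 50052 (0xC384) for it, i.e. the two bytes 0xC3 0x84 of its UTF-8 encoding read as a single number.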

Code:

message asc("Ä").
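
The two points above (that 150 cannot stand alone in UTF-8, and that Ä corresponds to 50052) can be checked outside ABL. This is a Python sketch, not OpenEdge code; `utf8_byte_role` is a hypothetical helper written for this illustration, and the byte ranges it uses come from the UTF-8 definition rather than from anything ABL-specific.

```python
# Illustrative sketch only: Python, not ABL.
def utf8_byte_role(value: int) -> str:
    """Classify a single byte value (0..255) by its role in UTF-8."""
    if value <= 0x7F:
        return "ascii"          # complete one-byte character
    if value <= 0xBF:
        return "continuation"   # only valid after a lead byte
    if value in (0xC0, 0xC1) or value >= 0xF5:
        return "never used"
    return "lead"               # starts a 2-, 3- or 4-byte sequence

# Byte 150 on its own cannot be decoded as UTF-8.
try:
    bytes([150]).decode("utf-8")
    lone_150_is_valid = True
except UnicodeDecodeError:
    lone_150_is_valid = False

# A umlaut: code point 196, encoded in UTF-8 as the two bytes C3 84.
encoded = "\u00c4".encode("utf-8")
as_number = int.from_bytes(encoded, "big")   # the two bytes read as one integer

print(utf8_byte_role(150), lone_150_is_valid, encoded.hex(), as_number)
```

The last value printed is the two encoded bytes read as one big-endian integer, which is where the figure 50052 comes from.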

Cecil · Mar 29, 2015

Okay, that's the answer I was looking for. Now it makes sense.
Just when you think you know and understand something, life throws you a curve ball and makes you realise that everything you know is wrong, hence the title of this post.

Thanks, Stefan, for explaining it to me, even if you did have to go back to basics.
