Unicode on Linux (oh, and Mac)

ShaunR · September 11, 2016

I'm adding support for NTLMv2 authentication and Hashes to the Encryption Compendium for LabVIEW. I have implemented the NTLM SSP protocol in native LabVIEW (apart from one thing) and it's all working great under Windows. Now I am looking to make sure it still works on the other platforms.

The "one thing" is the ASCII to UTF-16 conversion, since the protocol uses Unicode. With the older protocols (NTLMv1 etc.) that is not an issue, since the negotiation can tell the server to use ASCII strings instead of Unicode. However, NTLMv2 requires Unicode strings to create the hash in the specifications, so there is no negotiating it away.

(Note: There is no need to display anything, so the bytes just need converting for calculation from a LabVIEW string to a u8 byte array representing the Unicode equivalent. No changes to LabVIEW indicators or ini-files are required.)

Under Windows, the conversion from ASCII to Unicode is via calls to the OS (kernel32.dll). So am I looking at iconv, mbsrtowcs or something else to achieve the same on Linux and Mac?

JKSH · September 11, 2016

ICU is a widely-used library that runs on many different platforms, including Windows, Linux and macOS: http://site.icu-project.org/

Its biggest downside is that its data libraries can be quite chunky (20+ MB for the 32-bit version on Windows). However, I don't see this as a big issue for modern desktop computers.

ShaunR · September 11, 2016

3 hours ago, JKSH said:

ICU is a widely-used library that runs on many different platforms, including Windows, Linux and macOS: http://site.icu-project.org/

Its biggest downside is that its data libraries can be quite chunky (20+ MB for the 32-bit version on Windows). However, I don't see this as a big issue for modern desktop computers.

Is there a good reason to use a third party library like that rather than btowc which is in glibc?

JKSH · September 12, 2016

5 hours ago, ShaunR said:

Is there a good reason to use a third party library like that rather than btowc which is in glibc?

Ah, I misread your question, sorry. I thought you wanted codepage conversion, but you just wanted to widen ASCII characters.

No extra library required, then.
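For the pure-ASCII case, the widening can be sketched without any library call. This is only an illustration: the helper name `ascii_to_utf16le` is hypothetical, little-endian output is assumed (as used on Windows), and any byte above 0x7F is rejected because it would need a real charset conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: widen 7-bit ASCII bytes to UTF-16LE bytes.
 * Returns the number of bytes written, or SIZE_MAX if the input contains
 * a non-ASCII byte or the output buffer is too small. */
size_t ascii_to_utf16le(const unsigned char *src, size_t src_len,
                        uint8_t *dst, size_t dst_cap)
{
    assert(src != NULL && dst != NULL);
    if (dst_cap / 2 < src_len)
        return SIZE_MAX;             /* output buffer too small */
    for (size_t i = 0; i < src_len; i++) {
        if (src[i] > 0x7F)
            return SIZE_MAX;         /* not ASCII: needs a charset conversion */
        dst[2 * i]     = src[i];     /* low byte first (little-endian) */
        dst[2 * i + 1] = 0x00;       /* high byte is always zero for ASCII */
    }
    return 2 * src_len;
}
```

Rejecting non-ASCII input rather than passing it through is a deliberate choice here: a byte such as 0xE9 means different things in different codepages, so silently widening it could produce the wrong hash input.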

ShaunR · September 12, 2016

6 hours ago, JKSH said:

Ah, I misread your question, sorry. I thought you wanted codepage conversion, but you just wanted to widen ASCII characters.

No extra library required, then.

Even so, isn't that what iconv does? I'm just trying to get a feel for the equivalencies and trying to avoid bear traps because Linux isn't my native habitat.

Edited September 12, 2016 by ShaunR

Rolf Kalbermatter · September 13, 2016

On 12-9-2016 at 8:37 AM, ShaunR said:

Even so, isn't that what iconv does? I'm just trying to get a feel for the equivalencies and trying to avoid bear traps because Linux isn't my native habitat.

I haven't looked into the Linux side of this very much yet. You should however consider that most Linux systems nowadays already use UTF-8 as default codepage, so basically the strings are already in Unicode (but 8-bit UTF encoded, rather than the Windows 16-bit Unicode standard). That might make things pretty simple for the situation where the system is already using UTF-8, but could pose problems when a user runs his terminal session in one of the previous MBCS encodings, for whatever strange reason.

On Mac you have similar standard OS functions as on Windows, with potentially small variations in the translation tables, since Windows tends to not use the official Unicode collation tables but has slightly different ones. There is a standard VI library on the Mac in vi.lib somewhere, that actually helps in calling these functions, by creating MacOS X CFStringRef's that you can then use with functions like CFStringCreateWithBytes() (similar to MultiByteToWideChar()) and CFStringGetBytes() (similar to WideCharToMultiByte()).

All in all using Unicode in a MBCS environment is a pretty nasty mess and the platform differences make it even more troublesome. It's the main reason that Unicode support in LabVIEW is still just experimental. Making sure that everything MBCS based keeps working as is when upgrading to a full Unicode version of LabVIEW is a nightmare. The only way to go about that with reasonable effort is to start over again with a completely Unicode based LabVIEW version and provide some MBCS support for communicating with ASCII based devices, and accepting that there is no clean upgrade path for existing projects without some serious refactoring work, when dealing with MBCS (or ASCII) strings.

Edited September 13, 2016 by rolfk

ShaunR · September 13, 2016

@rolfk

Indeed. I've managed to avoid most of the shenanigans and dealing with various possibilities since most of NTLMv2 is taking what the server gives you and signing it. In that respect I only need to convert from LabVIEW strings that the user enters via the FP (computer name, user name and password) rather than having to deal with what the OS gives me via calls. I haven't tried yet, but it looks like the btowc will do what I need unless you can see a bear trap in there somewhere.

With Mac support. Well. No-one has bought a Mac licence so I only care for completeness and will drop it if it looks like too much hassle. The toolkit used to support Mac but I quietly removed it because of the lack of being able to apply the NI licence. So it still works, but I don't offer it :ph34r:

With Linux, however, it looks straightforward and is a must for embedded platforms. I think I might just need to update my ASCIIToUTF16 conversion primitive with a conditional disable to use btowc. Can you confirm that is a good way forward?

Rolf Kalbermatter · September 13, 2016

3 hours ago, ShaunR said:

@rolfk

Indeed. I've managed to avoid most of the shenanigans and dealing with various possibilities since most of NTLMv2 is taking what the server gives you and signing it. In that respect I only need to convert from LabVIEW strings that the user enters via the FP (computer name, user name and password) rather than having to deal with what the OS gives me via calls. I haven't tried yet, but it looks like the btowc will do what I need unless you can see a bear trap in there somewhere.

With Mac support. Well. No-one has bought a Mac licence so I only care for completeness and will drop it if it looks like too much hassle. The toolkit used to support Mac but I quietly removed it because of the lack of being able to apply the NI licence. So it still works, but I don't offer it

With Linux, however, it looks straightforward and is a must for embedded platforms. I think I might just need to update my ASCIIToUTF16 conversion primitive with a conditional disable to use btowc. Can you confirm that is a good way forward?

Well, straightforward is a little bit oversimplified. wchar_t on Unix systems is typically a 32-bit integer, so a 32-bit Unicode (UTF-32) character, which is technically absolutely not the same as UTF-16. The conversion between the two is however a pretty trivial bit shifting and masking for all but some very obscure characters (from generally dead or artificial languages like Klingon).

Also btowc() is only valid for conversion from the current MBCS (which could be UTF-8) set by the C runtime library LC_CTYPE setting. Personally for string conversion I think mbsrtowcs() is probably more useful, but it has the same limit about the C runtime library setting, which is process global so a nasty thing to change.

Edited September 13, 2016 by rolfk

ShaunR · September 13, 2016

35 minutes ago, rolfk said:

Also btowc() is only valid for conversion from the current MBCS (which could be UTF-8) set by the C runtime library LC_CTYPE setting. Personally for string conversion I think mbsrtowcs() is probably more useful, but it has the same limit about the C runtime library setting, which is process global so a nasty thing to change.

The LabVIEW strings are ASCII byte arrays so that would mean converting them to a multibyte string first, yes? I was under the impression that btowc was byte to widechar, which seems just what I'm after. Are you saying that I would use btowc to get the multibyte string and then use mbsrtowcs to get that into Unicode UTF-16?

Edited September 13, 2016 by ShaunR

Rolf Kalbermatter · September 13, 2016

7 hours ago, ShaunR said:

The LabVIEW strings are ASCII byte arrays so that would mean converting them to a multibyte string first, yes? I was under the impression that btowc was byte to widechar, which seems just what I'm after. Are you saying that I would use btowc to get the multibyte string and then use mbsrtowcs to get that into Unicode UTF-16?

Well, LabVIEW is in fact MBCS aware and as such uses whatever MBCS standard is set on the system. That includes UTF-8 on Linux for instance. For most things that is pretty similar to ASCII, but not always. I don't believe it to be possible to set Windows to UTF-8 though as default MBCS.

And no, you would not use btowc and mbsrtowcs together. Rather mbsrtowcs does for a string what btowc does for a single character (well, more really mbtowc). btowc only works for single byte characters, which a LabVIEW string doesn't necessarily have to be (definitely the Asian language versions are MBCS for sure).
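The single-byte limitation of btowc() can be seen in a small sketch. The helper name `widen_bytes` is hypothetical; it widens one byte at a time and gives up as soon as btowc() returns WEOF, which is what happens for a byte that is not a complete single-byte character in the current locale.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>   /* btowc, WEOF, wint_t */

/* Hypothetical helper: widen a byte string one byte at a time with btowc().
 * Returns the number of wide characters written, or (size_t)-1 when a byte
 * is not a valid single-byte character or dst is too small. */
size_t widen_bytes(const char *src, size_t len, wchar_t *dst, size_t dst_cap)
{
    assert(src != NULL && dst != NULL);
    if (len > dst_cap)
        return (size_t)-1;
    for (size_t i = 0; i < len; i++) {
        wint_t wc = btowc((unsigned char)src[i]);
        if (wc == WEOF)
            return (size_t)-1;   /* e.g. a lead byte of a multibyte sequence */
        dst[i] = (wchar_t)wc;
    }
    return len;
}
```

This works for plain ASCII input in any locale, but in a UTF-8 locale a string containing accented characters would be rejected, which is the reason the string-level functions are the better fit.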

ShaunR · September 14, 2016

10 hours ago, rolfk said:

Well, LabVIEW is in fact MBCS aware and as such uses whatever MBCS standard is set on the system. That includes UTF-8 on Linux for instance. For most things that is pretty similar to ASCII, but not always. I don't believe it to be possible to set Windows to UTF-8 though as default MBCS.

And no, you would not use btowc and mbsrtowcs together. Rather mbsrtowcs does for a string what btowc does for a single character (well, more really mbtowc). btowc only works for single byte characters, which a LabVIEW string doesn't necessarily have to be (definitely the Asian language versions are MBCS for sure).

OK. Thanks. Once I've managed to get NTLM working in Apache on a CentOS VPS, I'll have a play (2 days and counting... :frusty:). There are obviously some trivial commands that do what I need and that's the important bit.

JKSH · September 14, 2016

On 9/12/2016 at 2:37 PM, ShaunR said:

Even so, isn't that what iconv does? I'm just trying to get a feel for the equivalencies and trying to avoid bear traps because Linux isn't my native habitat.

Yes, iconv() is designed for charset conversion.

Possible bear trap (I haven't used it myself): A quick Google session turned up a thread which suggests that there are multiple implementations of iconv out there, and they don't all behave the same.

At the same time, I guess ICU would've been overkill for simple charset conversion -- it's more of an internationalization library, which also takes care of timezones, formatting of dates (month first or day first?) and numbers (comma or period for separator?), and locale-aware string comparisons, among others.

On 9/13/2016 at 10:29 PM, rolfk said:

wchar_t on Unix systems is typically a 32-bit integer, so a 32-bit Unicode (UTF-32) character, which is technically absolutely not the same as UTF-16. The conversion between the two is however a pretty trivial bit shifting and masking for all but some very obscure characters (from generally dead or artificial languages like Klingon).

Also btowc() is only valid for conversion from the current MBCS (which could be UTF-8) set by the C runtime library LC_CTYPE setting. Personally for string conversion I think mbsrtowcs() is probably more useful, but it has the same limit about the C runtime library setting, which is process global so a nasty thing to change.

Thinking about it some more, I believe @ShaunR does want charset conversion after all. This thread has identified 2 ways to do that on Linux:

System encoding -> UTF-32 (via mbsrtowcs()) -> UTF-16 (via manual bit shifting)
System encoding -> UTF-16 (via iconv())
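The second route can be sketched with the POSIX iconv API. The wrapper name `to_utf16le` and its parameters are hypothetical. The sketch takes the source charset name as an argument and assumes "UTF-16LE" is the wanted target, which avoids the byte order mark that a plain "UTF-16" target may add on some implementations.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert src (encoded as from_charset) to UTF-16LE.
 * Returns the number of bytes written to dst, or (size_t)-1 on failure. */
size_t to_utf16le(const char *from_charset, const char *src, size_t src_len,
                  char *dst, size_t dst_cap)
{
    assert(from_charset != NULL && src != NULL && dst != NULL);
    iconv_t cd = iconv_open("UTF-16LE", from_charset);  /* order: (to, from) */
    if (cd == (iconv_t)-1)
        return (size_t)-1;           /* this charset pair is not supported */

    char *in = (char *)src;          /* iconv() takes a non-const pointer */
    char *out = dst;
    size_t in_left = src_len, out_left = dst_cap;
    size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1 || in_left != 0)
        return (size_t)-1;           /* invalid input or dst too small */
    return dst_cap - out_left;
}
```

On Linux the name of the system encoding could be obtained with nl_langinfo(CODESET) from <langinfo.h> after calling setlocale(LC_ALL, ""), and passed in as from_charset. Whether that matches the encoding LabVIEW actually uses for its strings is an assumption that would need checking.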

On 9/13/2016 at 5:09 PM, rolfk said:

All in all using Unicode in a MBCS environment is a pretty nasty mess and the platform differences make it even more troublesome.

Hence the rise of cross-platform libraries that behave the same on all supported platforms.

On 9/13/2016 at 6:43 PM, ShaunR said:

With Mac support. Well. No-one has bought a Mac licence so I only care for completeness and will drop it if it looks like too much hassle. The toolkit used to support Mac but I quietly removed it because of the lack of being able to apply the NI licence. So it still works, but I don't offer it

Do you have the NI Developer Suite? My company does, and we serendipitously found out that LabVIEW for OS X (or macOS, as it's called nowadays) is part of the bundle. We simply wrote to enquire about getting a license, and NI kindly mailed us the installer disc just like that

ShaunR · September 14, 2016

26 minutes ago, JKSH said:

Thinking about it some more, I believe @ShaunR does want charset conversion after all. This thread has identified 2 ways to do that on Linux:

System encoding -> UTF-32 (via mbsrtowcs()) -> UTF-16 (via manual bit shifting)

System encoding -> UTF-16 (via iconv())

I think we are up to 4. There is also btowc, which I could do char by char, and LabVIEW has a Text->UTF8 primitive (and the reciprocal, but I don't care). IIRC it is in Linux (not sure about Mac).

Don't forget I don't care about the system encoding. Only from LabVIEW strings to UTF16 regardless of how the OS sees it.

40 minutes ago, JKSH said:

Hence the rise of cross-platform libraries that behave the same on all supported platforms.

Oooh. That's a point. Free Pascal is cross-platform and supports Unicode. I can look to see what it maps the Windows conversions to.

45 minutes ago, JKSH said:

Do you have the NI Developer Suite? My company does, and we serendipitously found out that LabVIEW for OS X (or macOS, as it's called nowadays) is part of the bundle. We simply wrote to enquire about getting a license, and NI kindly mailed us the installer disc just like that

I was actually referring to the "Third Party Licencing Toolkit" for applying my own licence. It only works on Windows so it is not possible to supply 30 day trials on other platforms.

Rolf Kalbermatter · September 14, 2016

6 hours ago, ShaunR said:

Don't forget I don't care about the system encoding. Only from LabVIEW strings to UTF16 regardless of how the OS sees it.

But LabVIEW strings are in the system encoding (codepage on Windows)!

JKSH · September 15, 2016

11 hours ago, ShaunR said:

There is also the btowc which I could do char by char

From earlier posts, it sounds to me like btowc() is simply a single-char version of mbsrtowcs(). Good to check, though.

11 hours ago, ShaunR said:

and LabVIEW has a Text->UTF8 primitive

You'd then still need to convert UTF-8 -> UTF-16 somehow (using one of the other techniques mentioned?)

11 hours ago, ShaunR said:

Don't forget I don't care about the system encoding. Only from LabVIEW strings to UTF16 regardless of how the OS sees it.

As @rolfk said, LabVIEW simply uses whatever the OS sees.

Rolf Kalbermatter · September 15, 2016

27 minutes ago, JKSH said:

From earlier posts, it sounds to me like btowc() is simply a single-char version of mbsrtowcs(). Good to check, though.

Actually btowc() is the single byte version of mbtowc(). Both are single char, but the first only works for single byte chars while the second will use as many bytes from a multi byte character sequence (MBCS) as are needed (and return an error if the byte stream in the MBCS input starts with an invalid byte code or is not long enough to describe a complete MBCS character for the current locale).

mbstowcs() then works on whole MBCS strings while mbtowc() only processes a single character at a time. Please note that a character is generally not a single byte, although here in the western hemisphere you get quite far assuming that it is; it's just not quite safe to rely on. Definitely on *nix systems, which nowadays often use UTF-8 as default locale, you automatically end up with multibyte characters for the umlauts, accents and other characters many European languages use.
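A whole-string conversion along these lines might look like the sketch below, using the restartable variant mbsrtowcs(). The helper name `locale_to_wide` is hypothetical. It converts from whatever encoding the current LC_CTYPE locale specifies, and the result is in wchar_t units (32-bit on typical Linux systems), so a further step to UTF-16 would still be needed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a NUL-terminated string in the current
 * locale's encoding to a newly allocated wide string. The caller frees the
 * result. Returns NULL on an invalid sequence or allocation failure. */
wchar_t *locale_to_wide(const char *src, size_t *out_len)
{
    assert(src != NULL && out_len != NULL);
    mbstate_t st;
    const char *p = src;

    memset(&st, 0, sizeof st);                /* start in the initial state */
    size_t n = mbsrtowcs(NULL, &p, 0, &st);   /* first pass: count only */
    if (n == (size_t)-1)
        return NULL;                          /* invalid multibyte sequence */

    wchar_t *dst = malloc((n + 1) * sizeof *dst);
    if (dst == NULL)
        return NULL;

    memset(&st, 0, sizeof st);
    p = src;
    mbsrtowcs(dst, &p, n + 1, &st);           /* second pass: convert plus NUL */
    *out_len = n;
    return dst;
}
```

A C program starts in the "C" locale until setlocale() is called, so the result depends on what the hosting process has already set. That is the process-global setting mentioned earlier in the thread.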

Windows solves it differently by using codepages for the non-Unicode environment, which for western locales simply means that for extended characters the same byte means something different depending on the codepage you have configured. But even here you do need MBCS encoding for most non-western languages anyhow.

UTF-8 to UTF-16 conversion is a fairly straightforward conversion, although the simple approach of some bitshifting only, could end up with invalid UTF-16 characters. A fully compliant conversion is somewhat tricky to get right for yourself as there are some corner cases that need to be taken care of.
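As an illustration of those corner cases, here is a sketch of a strict hand-written converter. The function name `utf8_to_utf16` is hypothetical. It rejects truncated sequences, stray continuation bytes, overlong encodings, encoded surrogates and values above U+10FFFF, which are the cases a naive bit-shifting loop would let through.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: strict UTF-8 -> UTF-16 conversion.
 * Returns the number of 16-bit units written, or (size_t)-1 on malformed
 * input or when dst is too small. */
size_t utf8_to_utf16(const uint8_t *src, size_t len, uint16_t *dst, size_t cap)
{
    assert(src != NULL && dst != NULL);
    size_t i = 0, o = 0;
    while (i < len) {
        uint32_t cp, min;
        size_t extra;
        uint8_t b = src[i++];

        /* Lead byte decides the sequence length and the smallest legal value. */
        if (b < 0x80)                { cp = b;        extra = 0; min = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; min = 0x10000; }
        else return (size_t)-1;          /* stray continuation or bad lead byte */

        if (len - i < extra)
            return (size_t)-1;           /* truncated sequence */
        for (size_t k = 0; k < extra; k++) {
            uint8_t c = src[i++];
            if ((c & 0xC0) != 0x80)
                return (size_t)-1;       /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;           /* overlong, out of range or surrogate */

        if (cp < 0x10000) {
            if (cap - o < 1) return (size_t)-1;
            dst[o++] = (uint16_t)cp;
        } else {
            if (cap - o < 2) return (size_t)-1;
            cp -= 0x10000;
            dst[o++] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
            dst[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
        }
    }
    return o;
}
```

The output is in host-order 16-bit units; writing it out as a little-endian byte array would be a separate step.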

Edited September 15, 2016 by rolfk

ShaunR · September 15, 2016

15 hours ago, rolfk said:

But LabVIEW strings are in the system encoding (codepage on Windows)!

DOH!

I forgot that an input to the conversion requires the codepage :oops:
