
Unicode on Linux (oh, and Mac)


ShaunR


I'm adding support for NTLMv2 authentication and hashes to the Encryption Compendium for LabVIEW. I have implemented the NTLM SSP protocol in native LabVIEW (apart from one thing) and it's all working great under Windows. Now I want to make sure it still works on the other platforms.

The "one thing" is the ASCII to UTF-16 conversion, since the protocol uses Unicode. With the older protocols (NTLMv1 etc.) that is not an issue, since the negotiation can tell the server to use ASCII strings instead of Unicode. However, the specification for NTLMv2 requires Unicode strings to create the hash, so there is no negotiating it away.

(Note: There is no need to display anything, so the bytes just need converting for calculation from a LabVIEW string to a u8 byte array representing the Unicode equivalent. No changes to LabVIEW indicators or ini-files are required.)

Under Windows, the conversion from ASCII to Unicode is via calls to the OS (kernel32.dll). So am I looking at iconv, mbsrtowcs or something else to achieve the same on Linux and Mac?
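For the narrow case described here (pure 7-bit ASCII in, UTF-16 bytes out), the conversion amounts to padding each byte with a zero. A minimal Python sketch of that idea, assuming little-endian byte order (UTF-16LE, which is what Windows produces); the function name `ascii_to_utf16le` is made up for illustration and is not part of the toolkit:

```python
def ascii_to_utf16le(text: bytes) -> bytes:
    """Widen 7-bit ASCII bytes to UTF-16LE by appending a zero high byte to each."""
    if any(b > 0x7F for b in text):
        # Bytes above 0x7F depend on the system code page, so refuse to guess.
        raise ValueError("input is not pure 7-bit ASCII")
    out = bytearray()
    for b in text:
        out += bytes((b, 0))  # low byte first, then the zero high byte
    return bytes(out)


# Cross-check against Python's own codec for the same string.
assert ascii_to_utf16le(b"User") == "User".encode("utf-16-le")
```

Anything outside 7-bit ASCII is rejected rather than widened, because the meaning of those bytes depends on the system encoding.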


Link to comment
3 hours ago, JKSH said:

ICU is a widely-used library that runs on many different platforms, including Windows, Linux and macOS: http://site.icu-project.org/

Its biggest downside is that its data libraries can be quite chunky (20+ MB for the 32-bit version on Windows). However, I don't see this as a big issue for modern desktop computers.

Is there a good reason to use a third party library like that rather than btowc which is in glibc?

Link to comment
5 hours ago, ShaunR said:

Is there a good reason to use a third party library like that rather than btowc which is in glibc?

Ah, I misread your question, sorry. I thought you wanted codepage conversion, but you just wanted to widen ASCII characters.

No extra library required, then.

Link to comment
6 hours ago, JKSH said:

Ah, I misread your question, sorry. I thought you wanted codepage conversion, but you just wanted to widen ASCII characters.

No extra library required, then.

Even so, isn't that what iconv does? I'm just trying to get a feel for the equivalencies and trying to avoid bear traps because Linux isn't my native habitat.

Edited by ShaunR
Link to comment
On 12-9-2016 at 8:37 AM, ShaunR said:

Even so, isn't that what iconv does? I'm just trying to get a feel for the equivalencies and trying to avoid bear traps because Linux isn't my native habitat.

I haven't looked into the Linux side of this very much yet. You should however consider that most Linux systems nowadays already use UTF-8 as default codepage, so basically the strings are already in Unicode (but 8-bit UTF encoded, rather than the Windows 16-bit Unicode standard). That might make things pretty simple for the situation where the system is already using UTF-8, but could pose problems when a user runs his terminal session in one of the previous MBCS encodings, for whatever strange reason.

On Mac you have similar standard OS functions as on Windows, with potentially small variations in the translation tables, since Windows tends to not use the official Unicode collation tables but has slightly different ones. There is a standard VI library on the Mac in vi.lib somewhere,  that actually helps in calling these functions, by creating MacOS X CFStringRef's that you can then use with functions like CFStringCreateWithBytes() (similar to MultiByteToWideChar()) and CFStringGetBytes() (similar to WideCharToMultiByte()).

All in all using Unicode in a MBCS environment is a pretty nasty mess and the platform differences make it even more troublesome. It's the main reason that Unicode support in LabVIEW is still just experimental. Making sure that everything MBCS based keeps working as is when upgrading to a full Unicode version of LabVIEW is a nightmare. The only way to go about that with reasonable effort is to start over again with a completely Unicode based LabVIEW version and provide some MBCS support for communicating with ASCII based devices, and accepting that there is no clean upgrade path for existing projects without some serious refactoring work, when dealing with MBCS (or ASCII) strings.

Edited by rolfk
Link to comment

@rolfk

Indeed. I've managed to avoid most of the shenanigans and dealing with various possibilities since most of NTLMv2 is taking what the server gives you and signing it. In that respect I only need to convert from LabVIEW strings that the user enters via the FP (computer name, user name and password) rather than having to deal with what the OS gives me via calls. I haven't tried yet, but it looks like the btowc will do what I need unless you can see a bear trap in there somewhere.

With Mac support. Well. No-one has bought a Mac licence so I only care for completeness and will drop it if it looks like too much hassle. ;)  The toolkit used to support Mac but I quietly removed it because of the lack of being able to apply the NI licence. So it still works, but I don't offer it :ph34r:

With Linux, however, it looks straightforward and is a must for embedded platforms. I think I might just need to update my ASCIIToUTF16 conversion primitive with a conditional disable to use btowc. Can you confirm that is a good way forward?

Link to comment
3 hours ago, ShaunR said:

@rolfk

Indeed. I've managed to avoid most of the shenanigans and dealing with various possibilities since most of NTLMv2 is taking what the server gives you and signing it. In that respect I only need to convert from LabVIEW strings that the user enters via the FP (computer name, user name and password) rather than having to deal with what the OS gives me via calls. I haven't tried yet, but it looks like the btowc will do what I need unless you can see a bear trap in there somewhere.

With Mac support. Well. No-one has bought a Mac licence so I only care for completeness and will drop it if it looks like too much hassle. ;)  The toolkit used to support Mac but I quietly removed it because of the lack of being able to apply the NI licence. So it still works, but I don't offer it :ph34r:

With Linux, however, it looks straightforward and is a must for embedded platforms. I think I might just need to update my ASCIIToUTF16 conversion primitive with a conditional disable to use btowc. Can you confirm that is a good way forward?

Well, straightforward is a little bit oversimplified. wchar_t on Unix systems is typically a 32-bit integer, so a 32-bit Unicode (UTF-32) character, which is technically absolutely not the same as UTF-16. The conversion between the two is however a pretty trivial bit shifting and masking for all but some very obscure characters (from generally dead or artificial languages like Klingon :P ).
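The bit shifting and masking mentioned above can be sketched as follows. This is an illustrative Python version of the standard UTF-32 to UTF-16 mapping, not code from the toolkit; `utf32_to_utf16_units` is a hypothetical helper name. Code points up to U+FFFF pass through unchanged, and everything above is split into a surrogate pair:

```python
def utf32_to_utf16_units(code_point: int) -> list:
    """Return the UTF-16 code unit(s) for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not valid characters")
    if code_point < 0x10000:
        # Basic Multilingual Plane: the 16-bit value equals the code point.
        return [code_point]
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 | (offset >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 | (offset & 0x3FF)        # bottom 10 bits -> low surrogate
    return [high, low]


assert utf32_to_utf16_units(0x41) == [0x41]               # 'A'
assert utf32_to_utf16_units(0x1F600) == [0xD83D, 0xDE00]  # U+1F600
```

For the computer names, user names and passwords in question, almost everything will fall in the first branch; the surrogate branch only matters for characters beyond U+FFFF.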

Also btowc() is only valid for conversion from the current MBCS (which could be UTF-8) set by the C runtime library LC_CTYPE setting. Personally for string conversion I think mbsrtowcs() is probably more useful, but it has the same limit about the C runtime library setting, which is process global so a nasty thing to change.

Edited by rolfk
Link to comment
35 minutes ago, rolfk said:

Also btowc() is only valid for conversion from the current MBCS (which could be UTF-8) set by the C runtime library LC_CTYPE setting. Personally for string conversion I think mbsrtowcs() is probably more useful, but it has the same limit about the C runtime library setting, which is process global so a nasty thing to change.

The LabVIEW strings are ASCII byte arrays so that would mean converting them to a multibyte string first, yes? I was under the impression that btowc was byte to widechar, which seems just what I'm after. Are you saying that I would use btowc to get the multibyte string, then use mbsrtowcs to get that into Unicode UTF-16?

Edited by ShaunR
Link to comment
7 hours ago, ShaunR said:

The LabVIEW strings are ASCII byte arrays so that would mean converting them to a multibyte string first, yes? I was under the impression that btowc was byte to widechar, which seems just what I'm after. Are you saying that I would use btowc to get the multibyte string, then use mbsrtowcs to get that into Unicode UTF-16?

Well LabVIEW is in fact MBCS aware and as such using whatever MBCS standard is set on the system. That includes UTF-8 on Linux for instance. For most things that is pretty similar to ASCII, but not always. I don't believe it to be possible to set Windows to UTF-8 though as default MBCS.

And no, you would not use btowc and mbsrtowcs together. Rather, mbsrtowcs does for a string what btowc does for a single character (well, more really mbtowc). btowc only works for single byte characters, which a LabVIEW string doesn't necessarily have to be (the Asian language versions are definitely MBCS).

Link to comment
10 hours ago, rolfk said:

Well LabVIEW is in fact MBCS aware and as such using whatever MBCS standard is set on the system. That includes UTF-8 on Linux for instance. For most things that is pretty similar to ASCII, but not always. I don't believe it to be possible to set Windows to UTF-8 though as default MBCS.

And no, you would not use btowc and mbsrtowcs together. Rather, mbsrtowcs does for a string what btowc does for a single character (well, more really mbtowc). btowc only works for single byte characters, which a LabVIEW string doesn't necessarily have to be (the Asian language versions are definitely MBCS).

OK. Thanks. Once I've managed to get NTLM working in Apache on a CentOS VPS, I'll have a play (2 days and counting... :frusty: ). There are obviously some trivial commands that do what I need and that's the important bit.

Link to comment
On 9/12/2016 at 2:37 PM, ShaunR said:

Even so, isn't that what iconv does? I'm just trying to get a feel for the equivalencies and trying to avoid bear traps because Linux isn't my native habitat.

Yes, iconv() is designed for charset conversion.

Possible bear trap (I haven't used it myself): A quick Google session turned up a thread which suggests that there are multiple implementations of iconv out there, and they don't all behave the same.

At the same time, I guess ICU would've been overkill for simple charset conversion -- it's more of an internationalization library, which also takes care of timezones, formatting of dates (month first or day first?) and numbers (comma or period for separator?), and locale-aware string comparisons, among others.

 

On 9/13/2016 at 10:29 PM, rolfk said:

wchar_t on Unix systems is typically a 32-bit integer, so a 32-bit Unicode (UTF-32) character, which is technically absolutely not the same as UTF-16. The conversion between the two is however a pretty trivial bit shifting and masking for all but some very obscure characters (from generally dead or artificial languages like Klingon :P ).

Also btowc() is only valid for conversion from the current MBCS (which could be UTF-8) set by the C runtime library LC_CTYPE setting. Personally for string conversion I think mbsrtowcs() is probably more useful, but it has the same limit about the C runtime library setting, which is process global so a nasty thing to change.

Thinking about it some more, I believe @ShaunR does want charset conversion after all. This thread has identified 2 ways to do that on Linux:

  • System encoding -> UTF-32 (via mbsrtowcs()) -> UTF-16 (via manual bit shifting)
  • System encoding -> UTF-16 (via iconv())
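The data flow of the second route can be sketched in Python, whose built-in codecs stand in for iconv() here. This illustrates the conversion only, not the iconv API itself; `system_bytes_to_utf16le` is a hypothetical name, and falling back to `locale.getpreferredencoding(False)` assumes the bytes really are in the system's preferred encoding:

```python
import locale


def system_bytes_to_utf16le(raw: bytes, encoding: str = "") -> bytes:
    """Decode bytes from a named (or the system's preferred) encoding
    and re-encode them as UTF-16LE without a byte order mark."""
    source = encoding or locale.getpreferredencoding(False)
    return raw.decode(source).encode("utf-16-le")


# The same text reaches the same UTF-16 bytes from two source encodings.
from_utf8 = system_bytes_to_utf16le(b"caf\xc3\xa9", "utf-8")
from_latin1 = system_bytes_to_utf16le(b"caf\xe9", "latin-1")
assert from_utf8 == from_latin1 == b"c\x00a\x00f\x00\xe9\x00"
```

Passing the source encoding explicitly, as in the two calls above, avoids depending on a process-global locale setting.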

 

On 9/13/2016 at 5:09 PM, rolfk said:

All in all using Unicode in a MBCS environment is a pretty nasty mess and the platform differences make it even more troublesome.

Hence the rise of cross-platform libraries that behave the same on all supported platforms.

 

On 9/13/2016 at 6:43 PM, ShaunR said:

With Mac support. Well. No-one has bought a Mac licence so I only care for completeness and will drop it if it looks like too much hassle. ;)  The toolkit used to support Mac but I quietly removed it because of the lack of being able to apply the NI licence. So it still works, but I don't offer it :ph34r:

Do you have the NI Developer Suite? My company does, and we serendipitously found out that LabVIEW for OS X (or macOS, as it's called nowadays) is part of the bundle. We simply wrote to enquire about getting a license, and NI kindly mailed us the installer disc just like that :D

Link to comment
26 minutes ago, JKSH said:

Thinking about it some more, I believe @ShaunR does want charset conversion after all. This thread has identified 2 ways to do that on Linux:

  • System encoding -> UTF-32 (via mbsrtowcs()) -> UTF-16 (via manual bit shifting)
  • System encoding -> UTF-16 (via iconv())

I think we are up to 4. There is also the btowc which I could do char by char and LabVIEW has a Text->UTF8 primitive (and the reciprocal, but I don't care). IIRC it is in Linux (not sure about Mac).

Don't forget I don't care about the system encoding. Only from LabVIEW strings to UTF16 regardless of how the OS sees it.

40 minutes ago, JKSH said:

Hence the rise of cross-platform libraries that behave the same on all supported platforms.

Oooh. That's a point. Free Pascal is cross-platform and supports Unicode. I can look to see what it maps the Windows conversions to.

 

45 minutes ago, JKSH said:

Do you have the NI Developer Suite? My company does, and we serendipitously found out that LabVIEW for OS X (or macOS, as it's called nowadays) is part of the bundle. We simply wrote to enquire about getting a license, and NI kindly mailed us the installer disc just like that :D

I was actually referring to the "Third Party Licencing Toolkit" for applying my own licence. It only works on Windows so it is not possible to supply 30 day trials on other platforms.

Link to comment
11 hours ago, ShaunR said:

There is also the btowc which I could do char by char

From earlier posts, it sounds to me like btowc() is simply a single-char version of mbsrtowcs(). Good to check, though.

 

 

11 hours ago, ShaunR said:

and LabVIEW has a Text->UTF8 primitive

You'd then still need to convert UTF-8 -> UTF-16 somehow (using one of the other techniques mentioned?)

 

 

11 hours ago, ShaunR said:

Don't forget I don't care about the system encoding. Only from LabVIEW strings to UTF16 regardless of how the OS sees it.

As @rolfk said, LabVIEW simply uses whatever the OS sees.

Link to comment
27 minutes ago, JKSH said:

From earlier posts, it sounds to me like btowc() is simply a single-char version of mbsrtowcs(). Good to check, though.

Actually btowc() is the single byte version of mbtowc(). Both are single char, but the first only works for single byte chars, while the second will use as many bytes from a multibyte character sequence (MBCS) as are needed (and return an error if the byte stream in the MBCS input starts with an invalid byte code or is not long enough to describe a complete MBCS character for the current locale).

mbstowcs() then works on whole MBCS strings, while mbtowc() only processes a single character at a time. Please note that a character is generally not a single byte, although here in the western hemisphere you get quite far with assuming that that is the case. It's not quite safe to work from, though. Definitely on *nix systems, which nowadays often use UTF-8 as default locale, you automatically end up with multibyte characters for those umlaut, accent and other characters that many European languages use.
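To make the multibyte point concrete, here is a small Python illustration of what goes wrong when UTF-8 text is widened one byte at a time, as a byte-wise conversion would do. The name "Müller" is just sample data:

```python
text = "M\u00fcller"                 # "Müller", 6 characters
utf8 = text.encode("utf-8")          # the umlaut takes 2 bytes, so 7 bytes total

# Naive approach: treat every byte as one character and pad it with a zero.
naive = b"".join(bytes((b, 0)) for b in utf8)

# Correct approach: decode the multibyte sequence first, then encode as UTF-16LE.
correct = utf8.decode("utf-8").encode("utf-16-le")

assert len(utf8) == 7
assert len(naive) == 14 and len(correct) == 12
assert naive != correct              # the umlaut became two separate characters
```

The naive result is still valid UTF-16, just for the wrong text, which is why this kind of mistake tends to show up only as a hash mismatch for users with non-ASCII names or passwords.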

Windows solves it differently by using codepages for the non-Unicode environment, which for Western locales simply means that for extended characters the same byte means something different depending on the codepage you have configured. But even here you do need MBCS encoding for most non-Western languages anyhow.
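A short Python illustration of the codepage effect, using the byte 0xE4 and two Windows codepages picked purely as examples:

```python
raw = b"\xe4"

as_western = raw.decode("cp1252")    # Western European: U+00E4, a with diaeresis
as_cyrillic = raw.decode("cp1251")   # Cyrillic: U+0434, the letter de

# One input byte, two different characters, two different UTF-16 results.
assert as_western.encode("utf-16-le") == b"\xe4\x00"
assert as_cyrillic.encode("utf-16-le") == b"\x34\x04"
```

So a conversion to UTF-16 is only well defined once the source codepage is known.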

UTF-8 to UTF-16 conversion is a fairly straightforward conversion, although the simple approach of some bitshifting only, could end up with invalid UTF-16 characters. A fully compliant conversion is somewhat tricky to get right for yourself as there are some corner cases that need to be taken care of.
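A few of those corner cases, shown with Python's strict UTF-8 decoder doing the validation. This is a sketch of what a compliant converter has to reject, not a hand-rolled implementation; the helper names and the sample byte sequences are chosen for illustration:

```python
def utf8_to_utf16le(data: bytes) -> bytes:
    """Strictly decode UTF-8, then re-encode as UTF-16LE without a BOM."""
    return data.decode("utf-8", errors="strict").encode("utf-16-le")


def is_rejected(data: bytes) -> bool:
    """True if the strict decoder refuses the input."""
    try:
        utf8_to_utf16le(data)
    except UnicodeDecodeError:
        return True
    return False


MALFORMED = {
    "overlong encoding of '/'": b"\xc0\xaf",
    "UTF-16 surrogate encoded as UTF-8": b"\xed\xa0\x80",
    "truncated three-byte sequence": b"\xe2\x82",
}

assert all(is_rejected(sample) for sample in MALFORMED.values())
# A well-formed sequence (the euro sign, U+20AC) converts cleanly.
assert utf8_to_utf16le(b"\xe2\x82\xac") == b"\xac\x20"
```

A converter built from shifts and masks alone would happily produce output for all three malformed inputs, which is the risk described above.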

Edited by rolfk
Link to comment
