Wow, ASCII you clever thing

Neil Pate · October 9, 2021

Today I Learned... you can convert from upper case to lower case like this! Clever design hidden in plain sight.

See for yourself
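
A minimal sketch in C of the trick being discussed, assuming it is the usual one: in ASCII, an upper-case letter and its lower-case counterpart differ only in bit 5 (0x20), so 'A' (0x41) becomes 'a' (0x61) by setting that bit. The helper names are made up for illustration, and the range checks are an addition so that digits and punctuation are left alone.

```c
#include <assert.h>
#include <stddef.h>

/* The only difference between 'A' (0x41) and 'a' (0x61) is bit 5. */
#define ASCII_CASE_BIT 0x20

/* Hypothetical helpers: ASCII letters only, anything else is returned unchanged. */
char ascii_to_lower(char c)
{
    return (c >= 'A' && c <= 'Z') ? (char)(c | ASCII_CASE_BIT) : c;
}

char ascii_to_upper(char c)
{
    return (c >= 'a' && c <= 'z') ? (char)(c & ~ASCII_CASE_BIT) : c;
}

/* Convert a NUL-terminated ASCII string to lower case in place. */
void ascii_lower_in_place(char *s)
{
    assert(s != NULL);
    for (; *s; ++s)
        *s = ascii_to_lower(*s);
}
```

Because every ASCII character is exactly one byte in both cases, the conversion can safely overwrite the buffer it reads from.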

Rolf Kalbermatter · October 9, 2021

4 hours ago, Neil Pate said:

Today I Learned... you can convert from upper case to lower case like this! Clever design hidden in plain sight.

See for yourself

Yes but!!!!!

That only works for the 7-bit ASCII characters! If you use another encoding it's nowhere near as simple. Almost all codepages, including all Unicode encodings, use the same numeric values for the first 128 characters as ASCII does (and that is in principle sufficient to write English text), but things get a lot more complicated for language-specific characters. The German umlauts also have uppercase and lowercase variants; the French accents are often not written on uppercase characters but are very important on lowercase ones. Almost every language has some special characters, some with uppercase and lowercase variants, and none of them is as simple as just setting a single bit to make it lowercase. So if you are sure to only use English text without any other characters, your trick usually works with most codepages including Unicode, but otherwise you need a smart C library like ICU (and maybe some C runtimes nowadays) which uses the correct case-mapping tables to find out which lowercase and uppercase characters correspond to each other.

With many languages it is also not always as easy as simply replacing a character with its counterpart. There are characters that in UTF-8, for instance, take 2 bytes in one case and 3 bytes in the other. That's a nightmare for a C function that has to work on a buffer. It can be implemented of course, but it makes calling that function a lot more complicated as the function can't work in place at all.
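
To make the "2 bytes in one case and 3 bytes in the other" point concrete, here is a small sketch. The pair chosen is my own example, not one named in the post: U+023A (Ⱥ, LATIN CAPITAL LETTER A WITH STROKE) has the lowercase form U+2C65 (ⱥ). The `utf8_encode` helper is a hypothetical name written for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

Encoding U+023A yields the two bytes C8 BA, while U+2C65 yields the three bytes E2 B1 A5, so lowercasing that one character grows the string by a byte and the result cannot be written back over the input.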

And things can get even more complicated since Unicode, for instance, has two or more ways to write many letters with diacritics. There are precomposed characters that are the entire letter including the diacritic, and there are sequences where such a letter is constructed from multiple characters: first the base character, followed by a non-advancing (combining) diacritic.
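
A short illustration of the two spellings, using "é" as an example of my own choosing. The variable and function names are hypothetical. The point is only that a byte-wise comparison sees two different strings, which is why libraries such as ICU normalize text before comparing or case-mapping it.

```c
#include <assert.h>
#include <string.h>

/* "é" as one precomposed code point: U+00E9, UTF-8 bytes C3 A9 */
const char e_acute_precomposed[] = "\xC3\xA9";

/* "é" as base letter plus combining acute accent: U+0065 U+0301,
   UTF-8 bytes 65 CC 81 */
const char e_acute_decomposed[] = "e\xCC\x81";

/* Naive byte-wise equality, which is all strcmp can offer. */
int bytes_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

Both strings render as the same letter, yet they have different lengths (2 and 3 bytes) and `bytes_equal` reports them as different.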

Edited October 9, 2021 by Rolf Kalbermatter

Neil Pate · October 9, 2021

Yeah...

But now pretend it's 1963. This is clever.

Rolf Kalbermatter · October 9, 2021

5 hours ago, Neil Pate said:

Yeah...

But now pretend it's 1963. This is clever.

It's useful and not a bad idea. Clever I would consider something else 🙂

And it's 7 bits since that was the actual preferred data size back then! Some computers had 9 bits, which made octal numbers quite popular too. 8 bits only got popular in the 70s with the first microprocessors (we leave the 4004 out of this for the moment), which were all using an 8-bit architecture, and that made the hexadecimal or more correctly written sedecimal notation popular. Hexa is actually 6 and not 16!

Edited October 9, 2021 by Rolf Kalbermatter

ShaunR · October 9, 2021

47 minutes ago, Rolf Kalbermatter said:

more correctly written sedecimal notation popular. Hexa is actually 6 and not 16!

I find it mildly amusing that a non-native English speaker knows more about my language than I do. Maybe if I had gone to Grammar School and learnt Greek and Latin, I would be more equipped to argue. However. I got an F in French so it probably wouldn't have gone well

Francois Normandin · October 10, 2021

It makes sense if one thinks of it as "6 letters added to the decimal notation".

Rolf Kalbermatter · October 10, 2021

8 hours ago, Francois Normandin said:

It makes sense if one thinks of it as "6 letters added to the decimal notation".

Yes it's not completely illogical, just a bit strange since hexa is Greek and decem is Latin. So it's a combination of two words from two different languages. By now it is so commonly used that it is in fact a moot point to debate. But sedecim is fully Latin and means 16. If you wanted to go fully Greek it would be hexadecadic.

Edited October 10, 2021 by Rolf Kalbermatter

drjdpowell · October 10, 2021

Since Latin for six is "sex", we could have gone for "sexidecimal".

ensegre · October 10, 2021

Rolf Kalbermatter · October 11, 2021

12 hours ago, drjdpowell said:

Since Latin for six is "sex", we could have gone for "sexidecimal".

Not a chance in a million years. 🙂

thols · October 11, 2021

But we could have stayed with sexadecimal: HAL 9000 and Sexadecimal: Old School Math

hooovahh · October 11, 2021

On 10/10/2021 at 6:19 AM, Rolf Kalbermatter said:

Yes it's not completely illogical, just a bit strange since hexa is Greek and decem is Latin.

That reminds me of a joke.

"Polygamy is wrong! It is either multiamory, or polyphilia, but mixing Greek and Latin roots is wrong."

It sure sounds like several complications, likely adding to the work NI needs to do if LabVIEW is ever going to have official proper unicode.

JKSH · October 12, 2021

8 hours ago, hooovahh said:

the work NI needs to do if LabVIEW is ever going to have official proper unicode.

The current pink wire datatype needs to remain as a "byte string" plus "locally encoded text string" to maintain backwards compatibility.

A new type of wire (Purple? Darker than pink, paler than DAQmx wires?) should be introduced as a "universal text string" datatype. Explicit conversion (with user-selectable encoding) should be done to change pink wires to purple wires and vice-versa.

Binary nodes like "TCP Read"/"TCP Write" should only use pink wires. Purple wires not accepted.
Text nodes like "Read from Text File"/"Write to Text File" should ideally only use purple wires, but they need to allow pink wires too for backwards compatibility.
- VI Analyzer should flag the use of pink wires with these nodes.
- Perhaps a project-wide setting could be introduced to disable the use of pink wires for text nodes.

Rolf Kalbermatter · October 12, 2021

6 hours ago, JKSH said:

Binary nodes like "TCP Read"/"TCP Write" should only use pink wires. Purple wires not accepted.

Please also allow for byte arrays, as a preferred default data type. The same applies to Flatten and Unflatten. The pink wire for binary strings should be only a backwards compatibility feature and not the default anymore.

As to strings, the TCP and other network nodes should allow passing in at least UTF-8 strings. This is already the universal encoding for pretty much anything that needs to be in text and go over the wire.

Edited October 12, 2021 by Rolf Kalbermatter

ShaunR · October 12, 2021

3 hours ago, Rolf Kalbermatter said:

Please also allow for byte arrays, as a preferred default data type. The same applies to Flatten and Unflatten. The pink wire for binary strings should be only a backwards compatibility feature and not the default anymore.

As to strings, the TCP and other network nodes should allow passing in at least UTF-8 strings. This is already the universal encoding for pretty much anything that needs to be in text and go over the wire.

We already have the UTF8 primitives that cater for 99% of cases. The real problems are indicators and controls (I'm looking at you, file control). I would be happy if we could display and use UTF8 in controls and indicators. I don't really care about other encodings as I could always convert them to UTF8 for the edge cases. Also. Wouldn't need a special wire.

Rolf Kalbermatter · October 12, 2021

25 minutes ago, ShaunR said:

Also. Wouldn't need a special wire

I think it should. The old string is in whatever locale the system is configured for and is at the same time also using the existing Byte Array === String analogy. That is usually also UTF-8 on most modern Unix systems, and could be UTF-8 on Windows systems that have the UTF-8 Beta hack applied.

There should be another string which is whatever the preferred Unicode type for the current platform is. For Windows I would tend to use UTF-16 internally; for Unix it is debatable whether it should be UTF-32, the wchar_t on pretty much anything that is not Windows, or UTF-8, which is pretty much the predominant Unicode encoding nowadays on non-Windows systems and in network protocols. I would even go as far as NOT exposing the inner datatype of this new string, very much like with Paths in LabVIEW for as long as they have existed, but instead provide clear conversion methods to and from explicit String encodings and ByteArrays. The string datatype in the Call Library Node would also get such extra encoding options. Flattened strings for LabVIEW internal purposes would ALWAYS be UTF-8. Same for flattened paths. Internally, Paths would use whatever the native string format is, which in my opinion would be UTF-16 on Windows and UTF-8 on other systems.

Basically the main reason Windows is using UTF-16 is that Microsoft was an early adopter of Unicode. At the time when they implemented Unicode support for Windows, the Unicode space fit easily within the 2^16 code points that a 16-bit character provided. When the Unicode consortium finalized the Unicode standard with additional exotic characters and languages, it no longer fit, but the Microsoft implementation was already out in the field and changing it was not really a good option anymore. Non-Windows systems only started a bit later and went for UTF-32 as widechar. But that wastes 3 bytes for 99% of the characters used in text outside of Asian languages, and 20 years ago that was still a concern. So UTF-8 was adopted by most implementations instead, except on Windows, where UTF-16 was both fully implemented and also a moderate compromise between wasting memory for most text representations and being a better standard than the multi-codepage mess that was the alternative.
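
What "did not fit anymore" means in practice: any code point above U+FFFF has to be split into a pair of 16-bit units (a surrogate pair) in UTF-16. A minimal sketch, with `utf16_encode` as a hypothetical helper name and the smiley used earlier in this thread (U+1F642) as the example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16.
   Returns the number of 16-bit units written (1 or 2), or 0 if invalid. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* reserved for surrogates */
        out[0] = (uint16_t)cp;             /* fits in a single unit */
        return 1;
    }
    if (cp <= 0x10FFFF) {
        cp -= 0x10000;                     /* 20 bits remain */
        out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
        return 2;
    }
    return 0;                              /* beyond the Unicode range */
}
```

U+00E4 (ä) still takes one unit, but U+1F642 becomes the pair D83D DE42, so UTF-16 is a variable-length encoding just like UTF-8.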

Edited October 12, 2021 by Rolf Kalbermatter

hooovahh · October 12, 2021

Also does anyone else think there aren't enough color choices for the number of data types there are? Maybe it is my eyes getting worse, or maybe it is the higher resolution monitors and higher DPI, but I think at a glance the difference between some wire colors is too small. I'm sure another color or visual style would be needed, but a purple (or basically any other color) is going to look pretty similar to others. Someone pointed me to this video on Rebar from GDevCon, and while this is the now defunct NXG, I couldn't help but notice how similar the style is for a string. Even more similar would be having this next to an array of string, a DVR of a string, or a set and map of a string. I can tell the difference but maybe not very quickly. Maybe I should try turning on the Alternative Block Diagram Data Type Colors from the config and see if it is any better.

ShaunR · October 12, 2021

8 hours ago, Rolf Kalbermatter said:

There should be another string which is whatever the preferred Unicode type for the current platform is.

That sounds like I would be shooting all the toes off of my feet as well as the feet.

8 hours ago, Rolf Kalbermatter said:

I would even go as far as NOT exposing the inner datatype of this new string, very much like with Paths in LabVIEW for as long as they have existed, but instead provide clear conversion methods to and from explicit String encodings and ByteArrays.

We already have this for UTF8. What we can't do is display it (ignoring the file control for now).

If you roll your trouser-leg up and poke your tongue out at the right angle; you might get the ini switch and the associated property to sort of work for some encodings. But I ditched that a long time ago in favour of HTML and I don't remember it working for UTF8.

Rolf Kalbermatter · October 12, 2021

12 hours ago, ShaunR said:

We already have this for UTF8. What we can't do is display it (ignoring the file control for now).

Not really. There are two hidden nodes that convert between whatever is the current locale and UTF-8 and vice versa. Nothing more. This works and can be done with various Unicode VI libraries too. Under Windows what it essentially does is something analogous to this:

MgErr ConvertANSIStrToUTF8Str(LStrHandle ansiString, LStrHandle *utf8String)
{
    MgErr err = mgArgErr;
    /* First pass: ask how many UTF-16 code units the ANSI text needs */
    int32_t wLen = MultiByteToWideChar(CP_ACP, 0, (LPCSTR)LStrBuf(*ansiString), LStrLen(*ansiString), NULL, 0);
    if (wLen)
    {
        WCHAR *wStr = (WCHAR*)DSNewPtr(wLen * sizeof(WCHAR));
        if (!wStr)
            return mFullErr;

        /* Second pass: current ANSI codepage -> UTF-16 */
        wLen = MultiByteToWideChar(CP_ACP, 0, (LPCSTR)LStrBuf(*ansiString), LStrLen(*ansiString), wStr, wLen);
        if (wLen)
        {
            /* Ask how many bytes the UTF-8 result needs */
            int32_t uLen = WideCharToMultiByte(CP_UTF8, 0, wStr, wLen, NULL, 0, NULL, NULL);
            if (uLen)
            {
                err = NumericArrayResize(uB, 1, (UHandle*)utf8String, uLen);
                if (!err)
                    /* UTF-16 -> UTF-8; the returned byte count becomes the LabVIEW string length */
                    LStrLen(**utf8String) = WideCharToMultiByte(CP_UTF8, 0, wStr, wLen, (LPSTR)LStrBuf(**utf8String), uLen, NULL, NULL);
            }
        }
        /* The temporary UTF-16 buffer is no longer needed */
        DSDisposePtr(wStr);
    }
    return err;
}

The opposite is done exactly the same, just swap the CP_UTF8 and CP_ACP.

Windows does not have a direct UTF-8 to/from anything conversion. Everything has to go through UTF-16 as the common standard. And while UTF-8 to/from UTF-16 is a fairly simple algorithm, since they map directly to each other, it is still one extra memory copy every time. That is why I would personally use native strings in LabVIEW without exposing the actually used internal format and only convert them to whatever is explicitly required, when it is required. Otherwise you have to convert the strings continuously back and forth as Windows APIs either want ANSI or UTF-16, nothing else (and all ANSI functions convert all strings to and from UTF-16 before and after calling the real function, which is always operating on UTF-16). By keeping the internal strings in whatever is the native format, you would avoid a lot of memory copies over and over again, and make a lot of things in the LabVIEW kernel easier.

Yes, LabVIEW needs to be careful whenever interfacing to other things, be it VISA, File IO, TCP and also external formats such as VI file formats when they store strings or paths. You do not want them to be platform specific. For nodes such as VISA, TCP, File Read and Write, and of course the Call Library Node parameter configuration for strings, it would have to provide a way to explicitly let the user choose the external encoding. It would of course be nice if there were many different encodings selectable, including all possible codepages in the world, but that is an impossible task to make platform independent. But it should at least allow Current Locale, UTF-8 and WideChar, which would be UTF-16-LE on Windows and UTF-32 on other platforms. Only UTF-8 will be universally platform independent and should be the format used to transport information across systems and between different LabVIEW platforms. The rest is mainly to interface to local applications, APIs and systems.

UTF-8 would be the default for at least TCP, and probably also VISA and other instrument communication interfaces. Current Locale would be the default for Call Library Node string parameters and similar. Other nodes like Flatten and Unflatten would implicitly always use UTF-8 format on the external side. But the whole string handling internally is done in one string type, which is whatever is the most convenient for the current platform.
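
For readers who want to see why UTF-8 to UTF-16 is called "a fairly simple algorithm", here is a portable sketch that does not use the Windows API at all. It is an illustration only: `utf8_to_utf16` is a hypothetical name, overlong encodings are not rejected, and the "pass NULL to get the required length" convention is borrowed from the Windows calls in the listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert len bytes of UTF-8 into UTF-16 code units.
   If dst is NULL only the required number of units is counted.
   Returns the number of units, or (size_t)-1 on malformed input
   or when dst (capacity cap) is too small.
   Simplification: overlong encodings are not rejected. */
size_t utf8_to_utf16(const unsigned char *src, size_t len, uint16_t *dst, size_t cap)
{
    size_t i = 0, n = 0;
    assert(src != NULL || len == 0);
    while (i < len) {
        uint32_t cp;
        size_t extra, units;
        unsigned char b = src[i++];

        /* The lead byte tells how many continuation bytes follow */
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;              /* invalid lead byte */

        if (extra > len - i)
            return (size_t)-1;               /* truncated sequence */
        while (extra--) {
            unsigned char c = src[i++];
            if ((c & 0xC0) != 0x80)
                return (size_t)-1;           /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);     /* append 6 payload bits */
        }
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        units = (cp >= 0x10000) ? 2 : 1;
        if (dst) {
            if (n + units > cap)
                return (size_t)-1;           /* output buffer too small */
            if (units == 1) {
                dst[n] = (uint16_t)cp;
            } else {
                cp -= 0x10000;               /* split into a surrogate pair */
                dst[n]     = (uint16_t)(0xD800 | (cp >> 10));
                dst[n + 1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
            }
        }
        n += units;
    }
    return n;
}
```

A caller would typically call it once with `dst` set to NULL to size the buffer and then a second time to fill it. That second buffer is exactly the extra memory copy the post is talking about.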

In my own C library I called it NStr for native string. (Which is indeed somewhat close to the Mac Foundation type NSString, but different enough to not cause name clashing). On Windows it is basically a WCHAR string, on other platforms it is a char string but with the implicit rule to be always in UTF-8 no matter what, except on platforms that would not support UTF-8 which was an issue for at least VxWorks and Pharlap systems, but luckily we don't have to worry about them from next year on since NI will have them definitely sacked by then.😀
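
The library itself is not shown in the thread, so the following is only a guess at what such a native string type could look like. Apart from the NStr idea described above, every name here (`NChar`, `NSTR`, `nstr_len`) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
typedef wchar_t NChar;          /* UTF-16 code units on Windows */
#define NSTR(s) L##s            /* wide string literal */
#else
typedef char NChar;             /* bytes, by convention always UTF-8 */
#define NSTR(s) s
#endif

/* Length in native code units (not characters) of a NUL-terminated string. */
size_t nstr_len(const NChar *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != 0)
        ++n;
    return n;
}
```

Code written against `NChar` never needs to know which encoding is underneath. Only the boundary functions that talk to files, sockets or other APIs would convert explicitly.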

Edited October 13, 2021 by Rolf Kalbermatter

ShaunR · October 13, 2021

I think you are missing the point.

We can handle *most* things in UTF8 due to the primitives we have but what we can't do is display any of them on a FP.

I don't really care if that is a windows limitation or not. If NI have to write a UTF8 renderer for their controls, then so be it. While they are at it, they could make the nonsense we have to go through to colour text much easier too.

Rolf Kalbermatter · October 13, 2021

1 hour ago, ShaunR said:

I think you are missing the point.

We can handle *most* things in UTF8 due to the primitives we have but what we can't do is display any of them on a FP.

I don't really care if that is a windows limitation or not. If NI have to write a UTF8 renderer for their controls, then so be it. While they are at it, they could make the nonsense we have to go through to colour text much easier too.

So you propose for NI to more or less rewrite the Windows GDI text renderer simply to allow it to use UTF-8 for its strings throughout? And that that beast would even be close to the Windows native one in terms of features, speed and bugs? That sounds like a perfect recipe for disaster. If the whole thing were required to be in UTF-8, the only feasible way would be to always convert it to UTF-16 before passing it to Windows APIs. That is not really worse than now, in fact, as the Windows ANSI APIs do nothing else, but it means quite a lot of repeated conversions back and forth.

ShaunR · October 13, 2021

1 hour ago, Rolf Kalbermatter said:

So you propose for NI to more or less rewrite the Windows GDI text renderer simply to allow it to use UTF-8 for its strings throughout? And that that beast would even be close to the Windows native one in terms of features, speed and bugs? That sounds like a perfect recipe for disaster. If the whole thing were required to be in UTF-8, the only feasible way would be to always convert it to UTF-16 before passing it to Windows APIs. That is not really worse than now, in fact, as the Windows ANSI APIs do nothing else, but it means quite a lot of repeated conversions back and forth.

It's not my problem, so to speak, about how they accomplish it. Personally, I would go with an off-the-shelf HTML rendering engine for FP's - which is, in effect, what I have already done.
