Data should be represented as an array of bytes (U8s) and text should be represented as strings


Mr Mike

Posted

I think this would be better if your tools took in an array of U8s. As it stands, you can't be sure whether you're encrypting data or text. Data should be represented as an array of bytes (U8s) and text should be represented as strings.

In LabVIEW we've traditionally (read "incorrectly") treated an array of U8s the same as a string because, for our strings, the underlying data types are essentially the same. We want to change this in the future and move away from treating strings as data, since you can only represent a small subset of characters as U8s. (For more information, see Text Encoding and Windows-1252. Windows-1252 is the character set / encoding that LabVIEW supports in English, and in French and German, IIRC. Japanese, Chinese, and Korean are obviously different.)
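To make the "small subset of characters" point concrete, here is a minimal sketch in Python (not LabVIEW, purely for illustration). It assumes Python's `cp1252` codec is a fair stand-in for the Windows-1252 behaviour described above; the helper name is made up for the example.

```python
# Windows-1252 maps each character to exactly one byte, so it can only
# cover a couple of hundred characters.

def fits_in_windows_1252(text: str) -> bool:
    """Return True if every character in `text` has a Windows-1252 byte."""
    try:
        text.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False

western = "caf\u00e9"             # "cafe" with e-acute, which Windows-1252 has
japanese = "\u65e5\u672c\u8a9e"   # three kanji, none of which it has

print(fits_in_windows_1252(western))
print(fits_in_windows_1252(japanese))
print(len(western.encode("cp1252")))  # one byte per character
```

Text that fits can be shuffled around as U8s without anyone noticing; text that doesn't fit simply has no single-byte representation.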

At a very high level, the guidelines we came up with can be boiled down to three rules:

  1. Only use the string datatype for things that are actually sequences of characters. Use arrays of U8 for things that are arbitrary binary data.
  2. APIs that operate on strings should do what makes sense for a sequence of characters.
  3. APIs that can support either string or [u8] should provide separate entry points.
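As a rough illustration of rule 3, here is a Python sketch of an API with separate entry points for bytes and for text. The function names `digest_bytes` and `digest_text` are hypothetical and hashing is only a placeholder operation; the point is that the text entry point makes the caller state an encoding, while the bytes entry point never guesses one.

```python
import hashlib

def digest_bytes(data: bytes) -> bytes:
    """Entry point for arbitrary binary data (the "array of U8" case)."""
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError("digest_bytes expects bytes; use digest_text for str")
    return hashlib.sha256(data).digest()

def digest_text(text: str, encoding: str = "utf-8") -> bytes:
    """Entry point for text: the caller states how characters become bytes."""
    if not isinstance(text, str):
        raise TypeError("digest_text expects str; use digest_bytes for bytes")
    return digest_bytes(text.encode(encoding))

# The same text gives different bytes, and so different digests,
# depending on the encoding chosen.
same = digest_text("\u00e9", "utf-8") == digest_text("\u00e9", "cp1252")
print(same)
```

With a single entry point that accepts "a string", that encoding decision is made silently, which is the ambiguity the rules are trying to remove.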

Posted

So does that mean that comms primitives (TCPIP, Serial, Bluetooth etc) are all going to be changed to arrays of U8?

Strings/data... it's semantics. If it's not broke... don't fix it.

Posted

I would argue it is broke. Not being able to natively handle some of the more modern encodings at best leaves LabVIEW's string processing capabilities as a quaint relic of its history. Multi-byte encodings such as UTF-8 are becoming very common. I know I've processed plenty of XML "blindly" hoping there are no multi-byte characters in there for my LabVIEW code to mangle...
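A small Python sketch of the kind of mangling meant here, assuming code that treats every byte as one character (with Python's `cp1252` codec standing in for that behaviour):

```python
text = "na\u00efve caf\u00e9"       # 10 characters, two of them non-ASCII
utf8_bytes = text.encode("utf-8")   # each non-ASCII character takes 2 bytes

print(len(text), len(utf8_bytes))   # character count vs. byte count

# Byte-per-character code sees every UTF-8 byte as its own character.
misread = utf8_bytes.decode("cp1252")
print(len(misread))

# Cutting at a byte offset can land inside a multi-byte character.
truncated = utf8_bytes[:3]          # "na" plus half of the third character
try:
    truncated.decode("utf-8")
    cut_ok = True
except UnicodeDecodeError:
    cut_ok = False
print(cut_ok)
```

Lengths, substring offsets and byte-wise truncation all stop lining up with characters as soon as one multi-byte character appears in the data.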

Strings are not simple byte arrays. In their simplest form, they might be bytes, but still have an encoding attached to them.

I completely agree with making the distinction between arbitrary byte arrays and strings. In the case of communication primitives, I'd say they should be polymorphic, allowing either.

While I don't expect LabVIEW to overhaul its string capabilities overnight, I do expect it will happen eventually, and a prerequisite for this would be making sure the distinction between string and byte data is consistent in the existing language.

Posted

I made a list recently of nodes that ought to accept/output an array of bytes instead of strings: Flatten to String (should be to bytes) and unflatten, FTP, TCP, Bluetooth, IrDA, VISA, String to Byte Array, and Byte Array to String
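To illustrate why flattened data is bytes rather than text, here is a Python sketch using the standard `struct` module as a stand-in for a flatten/unflatten pair. The big-endian double format is just an assumption for the example.

```python
import struct

# "Flatten" a number: the result is 8 raw bytes, not characters.
flattened = struct.pack(">d", 3.14)   # big-endian IEEE 754 double
print(flattened.hex())

# Those bytes are not valid text in, for example, UTF-8.
try:
    flattened.decode("utf-8")
    is_text = True
except UnicodeDecodeError:
    is_text = False
print(is_text)

# "Unflatten" works on the bytes and gives the value back.
(restored,) = struct.unpack(">d", flattened)
print(restored)
```

Nothing about the flattened value is a sequence of characters, so a byte array is the honest type for it.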

There are a lot of other nodes that need to be changed to accept an encoding input, or to handle encodings internally (like the HTTP VIs).

But the important thing, as mje pointed out, is that it is broken, or at least less than ideal. Want to write an app to help you learn Kanji? What about one that supports Hebrew? And Arabic? What about an app that loads Greek data files? All of these can be done, but they require an intense amount of research and development into encodings that you shouldn't need to know about, some unsupported INI tokens, and potentially changing the locale on your machine.
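For the Greek data file case, a short Python sketch shows why the encoding has to be known: the same three bytes are different text under different code pages. The byte values and the choice of ISO 8859-7 are assumptions for the example, not anything a particular file is known to use.

```python
raw = bytes([0xE1, 0xE2, 0xE3])   # three bytes read from a hypothetical file

as_greek = raw.decode("iso8859_7")   # Greek code page: alpha, beta, gamma
as_western = raw.decode("cp1252")    # Western code page: accented a's

print(as_greek == "\u03b1\u03b2\u03b3")
print(as_western == "\u00e1\u00e2\u00e3")
print(as_greek == as_western)
```

The bytes alone don't say which reading is right; that information has to come from an encoding input, a file header, or the machine's locale.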

I completely agree with making the distinction between arbitrary byte arrays and strings. In the case of communication primitives, I'd say they should be polymorphic, allowing either.

While I don't expect LabVIEW to overhaul its string capabilities overnight, I do expect it will happen eventually, and a prerequisite for this would be making sure the distinction between string and byte data is consistent in the existing language.

Right now it's not a high priority, but it's something that we're mindful of. If we help our customers stop interchanging strings and data, that will help the LabVIEW community as a whole understand the differences between text and data.

Posted

Want to write an app ... that supports Hebrew? And Arabic?

Don't get me started on right-to-left languages, as LV is currently woefully short of the mark in that department and just expanding support to multiple encodings will not be enough in that case, as you need to correctly handle cases where you have mixed R2L and L2R content.
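As a rough sketch of why encodings alone don't solve this, Python's `unicodedata` module exposes the Unicode bidirectional class of each character. The helper names below are made up for the example, and this only detects mixed content; actually laying it out requires the full Unicode Bidirectional Algorithm.

```python
import unicodedata

def bidi_classes(text: str) -> list[str]:
    """Unicode bidirectional class of each character in `text`."""
    return [unicodedata.bidirectional(ch) for ch in text]

def has_mixed_direction(text: str) -> bool:
    """True if `text` has both strong left-to-right and right-to-left characters."""
    classes = set(bidi_classes(text))
    return "L" in classes and bool(classes & {"R", "AL"})

# Latin letters, three Hebrew letters (alef, bet, gimel), then digits.
mixed = "abc \u05d0\u05d1\u05d2 123"

print(bidi_classes(mixed))
print(has_mixed_direction(mixed))
print(has_mixed_direction("abc 123"))
```

Even with every character correctly decoded, the display order of a string like `mixed` differs from its storage order, which is a separate problem from encoding support.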

  • 1 month later...
Posted

Ton,

I took another look at this and I saw that the default mode for encryption is Electronic Code Book (ECB). I think that, following secure-by-default guidelines, you should set the default mode to Cipher Block Chaining (CBC). I'd also suggest you change the default value of the enum to be CBC instead of ECB. I recently wondered if there's ever a good use for ECB. Sometimes there is, but using it is almost always a bad idea.
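To show the usual objection to ECB, here is a self-contained Python sketch. The block cipher is a toy Feistel construction built on SHA-256, used only so the example runs without third-party libraries; it is NOT secure and is not the cipher in the toolkit under discussion. Every name here is hypothetical.

```python
import hashlib

BLOCK = 16  # block size in bytes

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def toy_encrypt_block(block: bytes, key: bytes) -> bytes:
    """Toy 4-round Feistel permutation. NOT secure; a stand-in for a real cipher."""
    left, right = block[:8], block[8:]
    for rnd in range(4):
        f = hashlib.sha256(key + bytes([rnd]) + right).digest()[:8]
        left, right = right, xor(left, f)
    return left + right

def blocks(data: bytes) -> list[bytes]:
    assert len(data) % BLOCK == 0, "sketch assumes already-padded input"
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

def encrypt_ecb(plaintext: bytes, key: bytes) -> list[bytes]:
    # ECB: every block is encrypted independently.
    return [toy_encrypt_block(b, key) for b in blocks(plaintext)]

def encrypt_cbc(plaintext: bytes, key: bytes, iv: bytes) -> list[bytes]:
    # CBC: each block is XORed with the previous ciphertext block first.
    out, prev = [], iv
    for b in blocks(plaintext):
        prev = toy_encrypt_block(xor(b, prev), key)
        out.append(prev)
    return out

key = b"k" * 16
iv = bytes(range(16))
plaintext = b"SAME 16B BLOCK!!" * 2   # two identical plaintext blocks

ecb = encrypt_ecb(plaintext, key)
cbc = encrypt_cbc(plaintext, key, iv)
print(ecb[0] == ecb[1])   # identical plaintext blocks stay identical in ECB
print(cbc[0] == cbc[1])   # chaining hides the repetition
```

Repeated plaintext blocks producing repeated ciphertext blocks is what leaks structure in ECB, and it is the reason a chained mode is the safer default.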

I played around with it a bit and I was very confused about how to enter the key and IV. My comments about data as strings apply less here, because I learned that you're using a string to represent a hex value, which is then converted to an array of U8s. To me, it would be much simpler if the plaintext, key, and IV were all arrays of U8s. I think the IV should be required.
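The confusion between a hex string and the bytes it stands for can be sketched in Python as follows. The key value, the `check_inputs` helper, and the 16-byte IV length are all assumptions made for the example.

```python
key_hex = "000102030405060708090a0b0c0d0e0f"   # hypothetical key typed as hex text

key_bytes = bytes.fromhex(key_hex)      # the 16 bytes the hex text describes
key_as_text = key_hex.encode("ascii")   # the 32 characters of the text itself

print(len(key_bytes), len(key_as_text))

def check_inputs(plaintext: bytes, key: bytes, iv: bytes) -> None:
    """Reject anything that is not raw bytes; `iv` has no default, so it is required."""
    for name, value in (("plaintext", plaintext), ("key", key), ("iv", iv)):
        if not isinstance(value, (bytes, bytearray)):
            raise TypeError(f"{name} must be bytes, got {type(value).__name__}")
    if len(iv) != 16:
        raise ValueError("iv must be 16 bytes in this sketch")

check_inputs(b"some data", key_bytes, bytes(16))   # accepted
```

If everything is a byte array, there is only one way to read the key; with a string input, the caller has to know whether it means the hex digits or the characters.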
