
I'm hoping to get a better answer than the one the NI forum gave at: http://forums.ni.com/t5/LabVIEW/yEnc/td-p/2187886.

----

I'm trying to implement a yEnc decoder in LabVIEW. The attached solution seems to work most of the time with small txt files, but when I try to decode larger files, e.g. files with a .rar extension from Usenet groups, the decoder doesn't quite work. It seems to add extra, seemingly random bytes, and I don't know where they are coming from.

I'm not 100% sure I'm decoding the binary file correctly, but yEnc encoding is pretty simple from what I've seen.

I've been at this for 3-4 days now, but I still can't solve it.

Attached are yEnc Decoder VIs and two files, one where it works and one where it doesn't.

Any help appreciated.

Thanks

Kas

----

----

Hah... I think I just figured out the problem. Apparently, if a line starts with a "." (i.e. any of the 128-byte lines in the encoded binary), then the NNTP server, and apparently POP3 as well, prepends an extra "." to it. So a line that starts with ".$£%^&" becomes "..$£%^&" when downloaded. In that case, I just delete one of the "."s.

This introduces a new wrinkle in the code. I don't really want to place each line (with the Pick Line function) into an array and examine the first two characters for "..", mainly because I'll be dealing with large files. Is there a way of optimising the above code with this extra functionality? (This may turn into a trade-off between processor speed and memory.)

Thanks

Kas

----

You should change your profile to say LabVIEW 2012 if that's what you're using now. I can't view your code; I only have LabVIEW 2011 SP1.

----

Well, your algorithm doesn't really make sense to me: you're pulling off a one-character string, turning it into a byte array, and then operating on the array... not sure if that's supposed to be an optimization or if there's functional value I'm missing. But I get that you're offsetting the ASCII value based on two different constants. If you convert the whole string to a byte array (and then index off that), you will see a speed improvement.

I don't see what defines a "line", but I would handle the ".." a bit like you've handled the escape character. You'll need to add lookahead logic and then wrap everything in the loop in the case structure to add a "no-op" for the second period (or the first period, if you want to make it more complex). Unless, of course, ".." is never a legal substring, in which case I would just do a search-and-replace on "..".

----

Well, I've cleaned it up a little. The incoming stream of data is a binary string, but I can save the values just as they are without converting them back to a string.

As for the definition of a "line" in this case, it is based on how the yEnc encoder works. Basically, it adds 42 to every byte, except where this would produce what the spec considers "critical" characters: "=", CR and LF. If one of these characters is encountered during encoding, an escape character "=" is added in front first, and then 42 AND 64 are added to that byte.

When finished, the encoded binary consists of multiple lines of 128 bytes/characters each, apart from the last line, of course.
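For reference, the per-byte rule described above can be sketched in a text language. This is a minimal Python illustration of the yEnc byte transform (not the attached LabVIEW VI; the function names are mine):

```python
CRITICAL = {0x00, 0x0A, 0x0D, 0x3D}   # NUL, LF, CR, '=' after the +42 offset

def yenc_encode_byte(b: int) -> bytes:
    """Encode one byte: +42, and escape the result if it is critical."""
    e = (b + 42) % 256
    if e in CRITICAL:
        return bytes([0x3D, (e + 64) % 256])   # '=' escape, then +64
    return bytes([e])

def yenc_decode(data: bytes) -> bytes:
    """Decode an escaped stream, ignoring CR/LF line breaks."""
    out = bytearray()
    escaped = False
    for b in data:
        if b in (0x0D, 0x0A):          # line breaks are not data
            continue
        if escaped:
            out.append((b - 64 - 42) % 256)
            escaped = False
        elif b == 0x3D:                # '=' introduces an escape pair
            escaped = True
        else:
            out.append((b - 42) % 256)
    return bytes(out)

# round trip: every byte value survives encode -> decode
payload = bytes(range(256))
encoded = b"".join(yenc_encode_byte(b) for b in payload)
assert yenc_decode(encoded) == payload
```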

If you examine the attached encoded file, you'll see that as you go through each line in the binary string, its size is 128 bytes.

As for the ".." feature, it's not part of the yEnc encode/decode mechanism. It's something that only happens when sending and retrieving data via NNTP servers and POP3 procedures. This seems to be how NNTP is set up: when sending a message to the server over TCP/IP, the message has to end with a "." followed by CR LF on a line of its own. So, in order for the server to understand that a line beginning with "." does not represent the end of the message, the ".." comes in.

So, to clarify, this ".." rule only applies if the ".." appears at the beginning of a line; it's perfectly OK if it appears anywhere else within the 128 bytes (i.e. the line).
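The dot-stuffing convention described above can be sketched in Python (an illustration of the NNTP rule, not the attached VI):

```python
def dot_stuff(msg: bytes) -> bytes:
    """Sender side: prefix an extra '.' on any line that starts with '.'."""
    return b"\r\n".join(
        b"." + line if line.startswith(b".") else line
        for line in msg.split(b"\r\n")
    )

def dot_unstuff(msg: bytes) -> bytes:
    """Receiver side: strip one '.' from any line that starts with '..'."""
    return b"\r\n".join(
        line[1:] if line.startswith(b"..") else line
        for line in msg.split(b"\r\n")
    )

# a line starting with '.' gains a dot in transit and loses it on receipt
assert dot_stuff(b".abc\r\nxyz") == b"..abc\r\nxyz"
assert dot_unstuff(dot_stuff(b".a\r\n..b")) == b".a\r\n..b"
```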

Sorry to make this long, I just wanted to explain the situation a little bit.

As I mentioned, attached is the cleaned up version but it still doesn't have the ".." function included.

Your solution, however, just looks for ANY ".." and removes one; the trouble is, I need to look for ".." ONLY at the beginning of a line.

One quick solution would be to cut the whole byte stream into chunks of 128 bytes, then index the first two bytes of each line and search for the ".." characters (where "." is decimal 46).

I guess I'm just hoping there is a cleaner/better way of doing this that reduces RAM and CPU usage and increases speed at the same time (maybe wishful thinking here :) ) when dealing with large files of 200-300 MB.

Thanks

Kas

Edited by Kas

----

I would recommend reading the file in 128-byte chunks and processing that chunk right away. That way, you also know exactly where to look for your ".." and have one chunk of string that's easy to manipulate/subset, if necessary. To improve disk/computation parallelism, you could implement producer/consumer loops.
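The producer/consumer pattern suggested here can be sketched in Python as a stand-in for the LabVIEW queue-based loops (names and the toy per-chunk work are mine; the 128-byte chunk size matches the yEnc line length discussed above):

```python
import io
import queue
import threading

def producer(stream, q, chunk_size=128):
    """Read fixed-size chunks from the stream and hand them off."""
    while True:
        block = stream.read(chunk_size)
        if not block:
            break
        q.put(block)
    q.put(None)  # sentinel: no more data

def consumer(q, results):
    """Process chunks as they arrive, overlapping with the reads."""
    while True:
        block = q.get()
        if block is None:
            break
        # placeholder for per-chunk work (e.g. un-stuffing, yEnc decoding)
        results.append(block)

data = io.BytesIO(b"A" * 300)   # stand-in for a file on disk
q = queue.Queue(maxsize=8)      # bounded queue caps memory use
results = []
t1 = threading.Thread(target=producer, args=(data, q))
t2 = threading.Thread(target=consumer, args=(q, results))
t1.start(); t2.start()
t1.join(); t2.join()
assert b"".join(results) == b"A" * 300
```

The bounded queue is the key design point: it lets the reader run ahead of the decoder without ever holding more than a few chunks in memory.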

----

Well, I added the ".." functionality and the decoder itself seems to be working now.

As a result, I ended up implementing loops within loops :( , dearly costing me speed and probably memory as well.

A 250 KB file takes about 1 second. I dare not think how long a 200-300 MB file would take.

Attached is the latest 2011 version. Any improvements would go a long way.

I left comments on each section (it's a pretty simple operation).

Thanks

Kas.

P.S. The "NOT WORKING" example now actually works as well. This can be confirmed by checking the header of the binary file in the string indicator: the size displayed in the indicator is the same as the decoded size of the array at the end.

Edited by Kas

----

I would recommend reading the file in 128-byte chunks and processing that chunk right away. That way, you also know exactly where to look for your ".." and have one chunk of string that's easy to manipulate/subset, if necessary. To improve disk/computation parallelism, you could implement producer/consumer loops.

Well, this decoding section will already be part of a bigger project (producer/consumer style). See, this is part of a client-side program for talking to an NNTP server using TCP/IP. The majority of the incoming data comes in 15 MB chunks, but this varies, and it can go up to 300 MB per part. The yEnc decoding needs to happen once the whole part is downloaded, mainly because a yEnc-encoded part has a header and a footer. The downloaded part is written to disk at certain intervals in order to free up memory (particularly when a single part is large).

I'm not sure it's a good idea to read 128 bytes at a time from disk until the whole 300 MB (maybe the worst-case scenario) is finished.

So I thought I'd read the whole lot in one go, but then have a clever yEnc decoder that goes through the whole part as fast as possible.

----

Your use of Pick Line is absolutely killing you. I get a throughput of about 400 KBps - maybe a little more than you, but not awesome. Your 300 MB file will scale out to 12 minutes if the algorithm is O(n) (which it isn't, because Pick Line won't "remember" where you are in the string).

I didn't look closely at it before, but think about what you're doing conceptually: first, you parse the whole file for lines just to get the count. Then you iterate n times, seeking within the string to pull out just one line. Messy. So save yourself some trouble and just build an array of lines while you're counting, then index off that array in your next loop:

post-13461-0-82942400-1350532639.png

This just about doubles the throughput, to 800 KBps. Not bad; now you can process that 300 MB file in 6 minutes instead.

But you know what? We can do better. Pick Line is still a klutzy approach to this problem. So, instead, we're going to split up the data beforehand and not care how many elements there are, because LabVIEW will handle it all for us. I'm using an OpenG VI here, String to 1D Array (but the concept is very easy to implement with Scan String for Tokens if you can't use OpenG):

post-13461-0-81541500-1350533249.png

That little guy lands at 50 ms, a 6x speedup from the last stage and a 12x speedup overall; about 5 MBps, or 300 MB in about a minute.

By removing the disk read and header/footer time from the benchmark, I got around 30 ms of data processing time, which equates to a bit over 8 MBps and 40 seconds of processing time for 300 MB. On-the-fly handling of the header and footer would not be difficult to implement: you know your markers for each, and once you have the header you know exactly where the footer is going to be. If you really need throughput, give it some thought. However, you're probably not on an 8 MBps link, so your bottleneck is now probably the network.

----

...But you know what? We can do better....

And you can do even better still...

Since the decoding is about removing escape characters, it is possible to reuse the input buffer and use a single FOR loop to do the decoding. The separate search for newlines is unnecessary, since we can deal with those characters on the fly.
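For illustration, here is what that single-pass idea looks like in a text language: one loop that drops line breaks, un-stuffs a leading "..", and undoes the yEnc offsets as it goes. This is a Python sketch of the approach, not the attached VI:

```python
def decode_stream(data: bytes) -> bytes:
    """Single pass: un-stuff leading dots, drop CR/LF, undo yEnc offsets."""
    out = bytearray()
    escaped = False
    line_start = True
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b in (0x0D, 0x0A):          # CR/LF: line boundary, not data
            line_start = True
            i += 1
            continue
        # a line starting with ".." had a dot stuffed in transit: drop one
        if line_start and b == 0x2E and i + 1 < n and data[i + 1] == 0x2E:
            i += 1
        line_start = False
        b = data[i]
        if escaped:
            out.append((b - 64 - 42) % 256)
            escaped = False
        elif b == 0x3D:                # '=' escape character
            escaped = True
        else:
            out.append((b - 42) % 256)
        i += 1
    return bytes(out)
```

The `line_start` flag plays the role of the boolean shift register discussed later in the thread: it remembers whether the previous character was a line break, so the dot removal only fires at the start of a line.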

Attached is a VI that I tested on the "NOT WORKING.rar".

The original code (excluding header and footer removal) takes about 650 ms; the buffer-reuse version takes about 4 ms.

yEnc Decoder.vi

I also think it is possible to speed up the footer search by performing the search backwards. To do this, replace the footer search string "=yend " with "=yend .*$".
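In Python terms, the same effect comes from searching from the end of the buffer rather than the front (a sketch with made-up sample data):

```python
data = b"=ybegin line=128 size=5 name=t.bin\r\nkmnop\r\n=yend size=5\r\n"

# rfind scans from the end of the string, so a trailing footer is found
# without walking a multi-hundred-MB payload from the front first
idx = data.rfind(b"=yend ")
footer = data[idx:] if idx != -1 else b""
assert footer == b"=yend size=5\r\n"
```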

/J

Edited by Mellroth

----

Keep in mind that the implementations so far all read the complete file into memory and then decode.

That's not an issue for a few hundred KB or even MB, but if you try to do this with really big files and then combine them in RAM for writing to disk as a single file, you may run into problems.


----

Mellroth, you hit the nail on the head.

I have now just placed it together as the final solution. The attached is in LV 2011.

The second FOR loop, the one that deals with either 46 or 106, is boolean-initiated. Is there a reason for that, or would it be the same if we just wired the 46 shift register directly to the FOR loop instead of checking whether 46 equals the previous run's value?

Edited by Kas

----

I have now just placed it together as the final solution. The attached is in LV 2011.

Mellroth: The second FOR loop, the one that deals with either 46 or 106, is boolean-initiated. Is there a reason for that, or would it be the same if we just wired the 46 shift register directly to the FOR loop instead of checking whether 46 equals the previous run's value?

The boolean keeps track of the LF/CR characters because the original code removed one of two starting '.' characters.

Only checking for 46 would remove all occurrences of a double '.', not only those at the start of a new line.
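A small example of why that distinction matters (an illustration in Python, with made-up data):

```python
line = b"ab..cd"                     # ".." in the middle is real payload data

naive = line.replace(b"..", b".")    # collapses every ".." anywhere
correct = line[1:] if line.startswith(b"..") else line

assert naive == b"ab.cd"             # a payload byte is silently lost
assert correct == b"ab..cd"          # untouched: line didn't start with ".."

stuffed = b"..ab"                    # line that was dot-stuffed in transit
assert (stuffed[1:] if stuffed.startswith(b"..") else stuffed) == b".ab"
```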

/J

----

Not an issue for a few hundred KB or even MB, but If you try to do this with really big files and then combine these in RAM for writing to disk as a single file then you may run into problems.

Very true, but since this is part of a bigger piece, I've placed the "Remove Header" and "Remove Footer" in series in order to make this a single standalone example. What I have in mind is to place "Get File Size" at the beginning in order to decide whether I should read the file as a whole or in parts (i.e. replace the "-1" count in "Read from Binary File"). Then have "Remove Footer" act as the STOP condition for the main yEnc decoder if that becomes the case.

The boolean keeps track of the LF/CR characters because the original code removed one of two starting '.' characters.

Only checking for equal 46 would remove all occurrences of double '.', not only at the start of a new line.

Actually, I meant the FOR loop that deals with adding the period ".". Can we just wire the "previous character" shift register directly to this loop instead of using a boolean to determine this?

post-15024-0-21756100-1350568414.jpg

Edited by Kas

----

...Actually, I meant the FOR loop that deals with adding the period ".". Can we just wire the "previous character" shift register directly to this loop instead of using a boolean to determine this?...

OK, I guess you mean the Case structure, not the FOR loop?

In that case: yes, you could use the numeric as the selector, but in my experience you actually gain performance by using booleans instead of numerics in Case structures.

/J

----

OK, I guess you mean the Case structure, not the for loop?

Sorry, that's what I meant.

To Everyone Involved:

Thanks for helping me sort this out.

Regards

Kas

----

To anyone interested.

Attached is the final solution, which also includes the encoder. For the sake of speed, the yEnc encoder is implemented using the same idea as the yEnc decoder. Both have been tested as much as I could, and apart from some small initial bugs found in the decoder (now resolved), they both work.

Regards

Kas

----

Hello. Apologies for bringing this thread back again, but I think this is relevant here.

 

Attached are some VIs that encode the message and prepare it for the NNTP protocol. This includes the header and footer, and encodes the message or file.

The attached example, however, seems to fail, and the failure seems random.

When the whole message is prepared, it is sent through TCP/IP, conforming to the NNTP command structure. Below are the steps I use to send the data to the NNTP server.

 

NNTP Communication structure:

1. A server Address and Port is established.

2. User Authentication is carried out.

3. NNTP Server capabilities are checked.

4. Prepare the NNTP server to receive the data through the "POST" command.

5. First the main Header information is sent (i.e. information like "From", "Group Names", "Message ID" etc.).

6. Main data is then sent.

7. Check if the data was sent successfully.

 

All the above steps come back as successful. Basically, steps 1 through 7 are all OK when the data is sent. The whole file (around 500 MB) is sent through by repeating the above steps, and I also save the unique Message IDs so that the same file can be downloaded later on.

During this process, as soon as the file upload is finished, I go back and re-check the upload using the "STAT" command, and even though it's only a few minutes later, the file doesn't seem to have been transmitted properly. There seem to be some pieces missing; it's as though those pieces no longer exist on the NNTP server. This process is also shown in the attached image.

So far, I have traced the problem to how the encoder works and how the message is prepared in general.

 

For those that are not too familiar with NNTP protocols, the link below provides an introduction.

 

http://www.javvin.com/protocolNNTP.html

 

For those that are not too familiar with yEnc Encoding, the link below provides an introduction.

 

http://www.yenc.org/develop.htm

 

 

Sorry for providing reading material for this. I know the chances of getting help are greatly reduced if a person needs to read and learn before helping, but I'm hoping someone may already be familiar with the two concepts (yEnc and NNTP).

 

One of the things that may contribute to this is the yEnc encoder. All the encoded lines should be constant in size (i.e. 128 bytes, excluding the end-of-line characters). Looking at the attached encoder, it doesn't seem to guarantee that. The last line of the encoded message will obviously be less than 128, but based on the coding, the previous lines seem to be either 128 or 129 characters long. They should always be 128 characters plus the end of line. If anyone has a quick fix for this, it would be great. I can then test and see if the situation improves.
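One way to think about the wrapping step, sketched in Python (my own helper, not the attached VI): break the already-escaped stream at 128 output characters, with the caveat that an "=x" escape pair must never be split across lines. As I read the yEnc spec, this is exactly why an occasional line one character over the nominal length is tolerated.

```python
def wrap_encoded(enc: bytes, line_len: int = 128) -> bytes:
    """Split an already yEnc-escaped stream into lines of line_len
    characters, never splitting an '=x' escape pair across lines."""
    lines, i = [], 0
    while i < len(enc):
        end = min(i + line_len, len(enc))
        # if the cut would fall between '=' and its escaped byte,
        # extend this line by one so the pair stays together
        if end < len(enc) and enc[end - 1] == 0x3D:
            end += 1
        lines.append(enc[i:end])
        i = end
    return b"\r\n".join(lines) + b"\r\n"

assert wrap_encoded(b"abcdef", 3) == b"abc\r\ndef\r\n"
assert wrap_encoded(b"ab=Xcd", 3) == b"ab=X\r\ncd\r\n"   # pair kept intact
```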

 

I apologize for making this a long post; if anything more is needed, please let me know.

 

Kas

post-15024-0-39982800-1375438998.png

----
