Reds Posted September 4, 2023

Let's say you have a really big binary file, a file so big that it won't fit into your PC's RAM. Now let's say you want to delete the *first* 100 kB of that file and leave the rest of the file alone. How would you do that? Can it be done quickly? Can it be done without creating a whole new file?
mcduff Posted September 4, 2023

Read and copy the file in chunks; there's no need to open the whole file at once. To increase speed, write in multiples of the disk sector size.
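In C, a minimal sketch of that chunked approach might look like the following (POSIX calls for brevity; the 100 kB cut, the chunk size, and the minimal error handling are illustrative, and on Windows the same idea maps to ReadFile/WriteFile plus SetEndOfFile). Because the read cursor always stays ahead of the write cursor, the copy can be done in place, and the file is truncated at the end:

/* Collapse a file "in place": copy everything after CUT bytes down to
   offset 0 in fixed-size chunks, then truncate the tail. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CUT   (100 * 1024)        /* bytes to drop from the front        */
#define CHUNK (4 * 1024 * 1024)   /* keep this a multiple of sector size */

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    char   *buf = malloc(CHUNK);
    off_t   rd  = CUT;            /* read cursor  */
    off_t   wr  = 0;              /* write cursor */
    ssize_t n;

    while ((n = pread(fd, buf, CHUNK, rd)) > 0) {
        if (pwrite(fd, buf, (size_t)n, wr) != n) { perror("pwrite"); return 1; }
        rd += n;
        wr += n;
    }

    if (ftruncate(fd, wr) < 0) { perror("ftruncate"); return 1; }

    free(buf);
    close(fd);
    return 0;
}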
X___ Posted September 4, 2023

Here is ChatGPT's answer:

You can delete part of a file without loading it entirely into RAM by using the dd command. This command can be used to move the contents of the file up by a specified number of bytes, effectively deleting the specified number of bytes from the beginning of the file [1]. Here is an example script that you can use to delete a specified number of bytes from the beginning of a file:

#!/bin/bash
size=$(stat -c %s "$2")
dd bs=$1 if="$2" skip=1 seek=0 conv=notrunc of="$2"
dd bs=$((size - $1)) if="$2" skip=1 seek=1 count=0 of="$2"

You can call this script as ./truncstart.sh 2 file.dat, where 2 is the number of bytes to delete from the beginning of file.dat [1]. However, please note that this solution is not robust in case of an unplanned outage, which could occur part-way through dd's processing, in which case the file would be corrupted [1]. Is there anything else you would like to know? 😊

Learn more: 1. unix.stackexchange.com  2. superuser.com  3. digitalcitizen.life
ShaunR Posted September 5, 2023

8 hours ago, X___ said: Here is ChatGPT's answer: [...]

Linux only huh? No mention of fallocate? Why do people keep posting junk from ChatGPT? At this point I consider it spam.
dadreamer Posted September 5, 2023

I would suggest Memory-Mapped Files, but I'm a bit unsure whether ready-made tools exist for such a task. There's @Rolf Kalbermatter's adaptation: https://forums.ni.com/t5/LabVIEW/Problem-Creating-File-Mapping-Object-in-Memory-Mapped-FIles/m-p/3753032#M1056761 but it seems to need some tweaks to work with common files instead of file mapping objects. Not that hard to do, though. A quick-n-dirty sample (reading 10 bytes only). Yes, I know I should use CreateFileA instead of Open/Create/Replace VI + FRefNumToFD; I was just lazy and short on time.
ShaunR Posted September 5, 2023

15 minutes ago, dadreamer said: I would suggest Memory-Mapped Files [...]

There is a limit to how much you can map into memory. BTW, here is a LabVIEW mmap wrapper for working with files on Windows.
dadreamer Posted September 5, 2023

28 minutes ago, ShaunR said: There is a limit to how much you can map into memory.

Not an issue for "100kB" views, I think. The files themselves may be big enough; a 7.40 GB one opened fine (just checked).
Reds Posted September 5, 2023

Thanks for the ideas, fellas; I'll report back on my progress. I guess I was hoping for some Win32 API that could tweak the NTFS tables to change the starting sector of a file (but I guess that would be too easy).
X___ Posted September 5, 2023

14 hours ago, ShaunR said: Linux only huh? No mention of fallocate? Why do people keep posting junk from ChatGPT? At this point I consider it spam.

Well, isn't Linux part of Windows nowadays?
ShaunR Posted September 6, 2023

On 9/5/2023 at 11:19 AM, dadreamer said: Not an issue for "100kB" views, I think. [...]

A 100kB view will not help you truncate from the front. You can use it to copy chunks like mcduff suggested, but

On 9/4/2023 at 7:41 PM, Reds said: A file so big that it won't fit into your PC RAM

The issue with what the OP is asking is getting the OS to recognise a different start of the file. Truncating from the end is easy (just tell the file system the length has changed); truncating from the front is not, unless you have specific file system operations. On Windows you would have to use Sparse Files to achieve the same as fallocate.
dadreamer Posted September 6, 2023

7 minutes ago, ShaunR said: You can use it to copy chunks like mcduff suggested

That is what I was thinking of, just that with Memory-Mapped Files it should be way more productive than with normal file operations. There's no need to load the entire file into RAM: I have a machine with 8 GB of RAM and 8 GB files are mmap'ed just fine. So, the general sequence is: open the file (with CreateFileA or as shown above) -> map it into memory -> move the data in chunks with read-write operations -> unmap the file -> SetFilePointer(Ex) -> SetEndOfFile -> close the file.
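In raw Win32 C that sequence would look roughly like the sketch below. This is a sketch only: the 100 kB cut, the chunk size, and the minimal error handling are illustrative, and a 64-bit build is assumed so the whole view fits in the address space (address space, not RAM):

/* Map the file, slide everything after CUT down to offset 0, then shrink. */
#include <windows.h>
#include <string.h>

#define CUT   (100 * 1024)         /* bytes to drop from the front  */
#define CHUNK (64 * 1024 * 1024)   /* keeps the working set bounded */

int collapse_front(const char *path)
{
    HANDLE f = CreateFileA(path, GENERIC_READ | GENERIC_WRITE, 0, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (f == INVALID_HANDLE_VALUE) return -1;

    LARGE_INTEGER size;
    GetFileSizeEx(f, &size);

    HANDLE map  = CreateFileMappingA(f, NULL, PAGE_READWRITE, 0, 0, NULL);
    BYTE  *view = (BYTE *)MapViewOfFile(map, FILE_MAP_WRITE, 0, 0, 0);

    /* Move everything after CUT down to offset 0, one chunk at a time. */
    LONGLONG remaining = size.QuadPart - CUT;
    for (LONGLONG off = 0; off < remaining; off += CHUNK) {
        SIZE_T n = (SIZE_T)((remaining - off) < CHUNK ? (remaining - off) : CHUNK);
        memmove(view + off, view + off + CUT, n);
    }

    FlushViewOfFile(view, 0);
    UnmapViewOfFile(view);
    CloseHandle(map);

    /* Shrink the file by CUT bytes: SetFilePointerEx + SetEndOfFile. */
    LARGE_INTEGER newEnd;
    newEnd.QuadPart = size.QuadPart - CUT;
    SetFilePointerEx(f, newEnd, NULL, FILE_BEGIN);
    SetEndOfFile(f);
    CloseHandle(f);
    return 0;
}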
ShaunR Posted September 6, 2023

1 hour ago, dadreamer said: That is what I was thinking of [...]

Indeed. However, hole punching is much, much faster. If you are talking terabytes, it's really the only way. Set the file to be Sparse. Write 100k of zeros to the beginning. Job done (sort of).
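For reference, "set sparse + zero the front" boils down to two DeviceIoControl calls. A minimal sketch, assuming NTFS and a 100 kB header (the constants are illustrative); note that this deallocates the first 100 kB but does not shift anything, so the file keeps its original size with zeros at the front:

/* Mark the file sparse, then punch a hole over the first 100 kB. */
#include <windows.h>
#include <winioctl.h>

int punch_front_hole(const char *path)
{
    HANDLE f = CreateFileA(path, GENERIC_READ | GENERIC_WRITE, 0, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (f == INVALID_HANDLE_VALUE) return -1;

    DWORD bytes = 0;

    /* 1. Flag the file as sparse so zeroed ranges are deallocated. */
    if (!DeviceIoControl(f, FSCTL_SET_SPARSE, NULL, 0, NULL, 0, &bytes, NULL)) {
        CloseHandle(f);
        return -1;
    }

    /* 2. Zero (and deallocate) the range [0, 100 kB). */
    FILE_ZERO_DATA_INFORMATION zero;
    zero.FileOffset.QuadPart      = 0;
    zero.BeyondFinalZero.QuadPart = 100 * 1024;
    if (!DeviceIoControl(f, FSCTL_SET_ZERO_DATA, &zero, sizeof zero,
                         NULL, 0, &bytes, NULL)) {
        CloseHandle(f);
        return -1;
    }

    CloseHandle(f);
    return 0;
}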
Reds Posted September 6, 2023

9 hours ago, ShaunR said: Indeed. However, hole punching is much, much faster. [...]

Yes, we are indeed talking terabytes. Reading the original file and writing a new one will take many minutes, and it will also require the storage medium to have terabytes of free space available to perform the operation; maybe even a whole separate partition would need to be set aside as free space. "Copy only the parts you want to save" is certainly the obvious solution, but it's not a good one for really big files. Thanks for the Microsoft link to Sparse files. I'll dig into that and learn more.
ShaunR Posted September 7, 2023

9 hours ago, Reds said: Thanks for the Microsoft link to Sparse files. I'll dig into that and learn more.

You can play with fsutil, but Windows (a.k.a. NTFS/ReFS) doesn't have "FALLOC_FL_COLLAPSE_RANGE" like fallocate does (which helps with programs that aren't sparse-aware).
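For contrast, this is what the missing operation looks like on Linux, where FALLOC_FL_COLLAPSE_RANGE removes a block-aligned range and shifts the rest of the file down, changing the file's size and offsets in one metadata operation. A sketch only, assuming ext4 or XFS and a 4 KiB block size (100 kB happens to be exactly 25 such blocks):

/* Drop the first 100 kB of a file in place via collapse-range. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Offset and length must be multiples of the filesystem block size,
       otherwise the call fails with EINVAL. */
    if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, 100 * 1024) < 0) {
        perror("fallocate");
        return 1;
    }

    close(fd);
    return 0;
}

From the shell the equivalent should be fallocate --collapse-range --offset 0 --length 102400 file.dat (util-linux); the point, as above, is that NTFS/ReFS has no counterpart.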
Reds Posted September 7, 2023

Yeah, I dug into the Microsoft docs on sparse files, and I don't think that technology is going to solve my problem after all. Cool stuff, and good to know, but it doesn't seem like it's going to solve my immediate pain. I guess what's really needed is a way to modify the NTFS Master File Table (MFT) to change the starting offset of a given file, but I didn't actually see any Win32 APIs that could do that. I'm sure it must be possible with some bit banging, but I'd probably be getting in way over my head if I tried to modify the MFT using a method that was not Microsoft-endorsed.
GregSands Posted September 7, 2023

I'm guessing there must be more to your question, but based on your specs, I'd be asking whether it was worth spending time and effort deleting a relatively tiny part of a file. 100 kB out of tens of GB? I'd just leave it there and work around it!
ShaunR Posted September 8, 2023

9 hours ago, GregSands said: [...] I'd just leave it there and work around it!

It's a common requirement for video editing.
dadreamer Posted September 8, 2023

A technically related question: Insert bytes into middle of a file (in windows filesystem) without reading entire file (using File Allocation Table)? (Or closer, but not that informative.) The takeaway is that it's theoretically possible, but so low-level and hacky that it's easy to mess something up and render the whole system inoperable. If this doesn't stop you, then you may try contacting Joakim Schicht, as he has made a bunch of NTFS tools, incl. PowerMft for low-level modifications, and maybe he will give you some tips on how to proceed (or give it up and switch to traditional ways/workarounds).
ShaunR Posted September 8, 2023

Well, what is your immediate pain? Can you elaborate?

Here is an existing file with the first 0x40000000 (decimal 1073741824) bytes nulled. You can see it only has about 600 MB on disk. If I query it, I see that the data starts at 0x40000000. Now I can do a seek to that location and read the ~600 MB. However, I'm guessing you have further restrictions.
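The query step is presumably FSCTL_QUERY_ALLOCATED_RANGES (the post doesn't say which tool was used, so treat that as an assumption). A minimal sketch that finds where the real data of a sparse file starts and seeks past the hole; the file name is a placeholder, and a real implementation would loop while DeviceIoControl reports ERROR_MORE_DATA:

/* Ask NTFS for the allocated (non-hole) ranges of a sparse file. */
#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int main(void)
{
    HANDLE f = CreateFileA("big.dat", GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (f == INVALID_HANDLE_VALUE) return 1;

    LARGE_INTEGER size;
    GetFileSizeEx(f, &size);

    FILE_ALLOCATED_RANGE_BUFFER query;          /* range to search: whole file */
    query.FileOffset.QuadPart = 0;
    query.Length = size;

    FILE_ALLOCATED_RANGE_BUFFER ranges[16];     /* first few allocated ranges  */
    DWORD bytes = 0;

    DeviceIoControl(f, FSCTL_QUERY_ALLOCATED_RANGES,
                    &query, sizeof query, ranges, sizeof ranges, &bytes, NULL);

    if (bytes >= sizeof ranges[0]) {
        printf("data starts at 0x%llx, length 0x%llx\n",
               (unsigned long long)ranges[0].FileOffset.QuadPart,
               (unsigned long long)ranges[0].Length.QuadPart);

        /* Seek past the hole before reading. */
        SetFilePointerEx(f, ranges[0].FileOffset, NULL, FILE_BEGIN);
    }

    CloseHandle(f);
    return 0;
}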
Reds Posted September 8, 2023

Quote: Well, what is your immediate pain? Can you elaborate?

The jumbo file is recorded with a bunch of header data starting at file offset zero. This header data is not actually useful, and it actually causes a third-party analysis application to think that the recorded data is corrupt. If I can manage to delete only the header data at the beginning of the file, then the third-party analysis application can open and analyze the file without throwing any errors.
Reds Posted September 8, 2023

21 hours ago, GregSands said: [...] I'd just leave it there and work around it!

Yeah, I wish that were possible. The problem is that a third-party analysis application can't understand the first 100 kB of the file, and so that software incorrectly concludes that the entire remainder of the file must be corrupt.
Dan Bookwalter N8DCJ Posted September 8, 2023

I was looking for something else and ran across this thread. "This header data is not actually useful" ... my question is, why the header then?

Dan
ShaunR Posted September 9, 2023

10 hours ago, Reds said: If I can manage to delete only the header data at the beginning of the file, then the third-party analysis application can open and analyze the file without throwing any errors.

Indeed.

On 9/7/2023 at 9:01 AM, ShaunR said: but Windows (a.k.a. NTFS/ReFS) doesn't have "FALLOC_FL_COLLAPSE_RANGE"

Which is what you need.
Neil Pate Posted September 9, 2023

@ShaunR and @dadreamer (and @Rolf Kalbermatter), how do you know so much about low-level Windows stuff? Please never leave our community, you are not replaceable!
ShaunR Posted September 10, 2023

18 hours ago, Neil Pate said: how do you know so much about low-level Windows stuff? Please never leave our community, you are not replaceable!

Oh, I am easily replaceable. The other two know how things work in a "white-box", "under-the-hood" manner. I know how stuff works in a "black-box" manner, after decades of finding workarounds and sheer bloody-mindedness.