
Editing a Very Large File


Reds

Recommended Posts

Let's say you have a really big binary file. A file so big that it won't fit into your PC RAM. Now let's say you wanted to delete the *first* 100kB in that file, and leave the rest of the file alone.

How would you do that? Can it be done quickly? Can it be done without creating a whole new file?

Edited by Reds
Link to comment

Here is ChatGPT's answer:

 

You can delete part of a file without loading it entirely into RAM by using the dd command. This command can be used to move the contents of the file up by a specified number of bytes, effectively deleting the specified number of bytes from the beginning of the file. Here is an example script that you can use to delete a specified number of bytes from the beginning of a file:

#!/bin/bash
# Usage: ./truncstart.sh <bytes_to_cut> <file>
size=$(stat -c %s "$2")                                      # current file size in bytes
# Shift everything after the first $1 bytes down to offset 0, in place.
dd bs=$1 if="$2" skip=1 seek=0 conv=notrunc of="$2"
# Truncate the file to its new length (size - $1); count=0 writes nothing.
dd bs=$((size - $1)) if="$2" skip=1 seek=1 count=0 of="$2"

You can call this script as ./truncstart.sh 2 file.dat, where 2 is the number of bytes to delete from the beginning of file.dat. However, please note that this solution is not robust in case of an unplanned outage, which could occur part-way through dd's processing, in which case the file would be corrupted. Is there anything else you would like to know? 😊

Link to comment
8 hours ago, X___ said:

Here is ChatGPT's answer:

You can delete part of a file without loading it entirely into RAM by using the dd command. [...]

Linux only huh? No mention of fallocate?

Why do people keep posting junk from ChatGPT? At this point I consider it spam.

  • Thanks 2
Link to comment

I would suggest Memory-Mapped Files, but I'm a bit unsure whether ready-made tools exist for such a task. There's @Rolf Kalbermatter's adaptation: https://forums.ni.com/t5/LabVIEW/Problem-Creating-File-Mapping-Object-in-Memory-Mapped-FIles/m-p/3753032#M1056761 but it seems to need some tweaks to work with regular files instead of file mapping objects. Not that hard to do, though.

A quick-n-dirty sample (reading 10 bytes only).

[Attached screenshots: 2023-09-05_14-54-51.jpg, 2023-09-05_14-55-59.jpg]

Yes, I know I should use CreateFileA instead of Open/Create/Replace VI + FRefNumToFD; I was just lazy and short on time. :)
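For anyone reading without the screenshots, here is roughly what that boils down to in plain Win32 C — a minimal read-only sketch, not the actual VI: "test.bin", the 64 KB view size, and the assumption that the file is at least that big are all just for illustration.

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Open the file (the LabVIEW snippet uses Open/Create/Replace + FRefNumToFD instead). */
    HANDLE hFile = CreateFileA("test.bin", GENERIC_READ, FILE_SHARE_READ,
                               NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE) return 1;

    /* Create a read-only mapping object over the whole file. */
    HANDLE hMap = CreateFileMappingA(hFile, NULL, PAGE_READONLY, 0, 0, NULL);
    if (hMap == NULL) { CloseHandle(hFile); return 1; }

    /* Map a 64 KB view at offset 0 (view offsets must be multiples of the
       allocation granularity, usually 64 KB). */
    const unsigned char *view = MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 65536);
    if (view != NULL) {
        for (int i = 0; i < 10; i++)   /* read 10 bytes, as in the screenshots */
            printf("%02X ", view[i]);
        printf("\n");
        UnmapViewOfFile(view);
    }

    CloseHandle(hMap);
    CloseHandle(hFile);
    return 0;
}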

Edited by dadreamer
Link to comment
15 minutes ago, dadreamer said:

I would suggest Memory-Mapped Files, but I'm a bit unsure whether ready-made tools exist for such a task. [...]

There is a limit to how much you can map into memory.

BTW, here is a LabVIEW mmap wrapper for working with files on Windows.

Edited by ShaunR
Link to comment
On 9/5/2023 at 11:19 AM, dadreamer said:

Not an issue for "100kB" views, I think. The files themselves may be big enough; a 7.40 GB file opened fine (just checked).

A 100kB view will not help you truncate from the front. You can use it to copy chunks like mcduff suggested, but...

On 9/4/2023 at 7:41 PM, Reds said:

A file so big that it won't fit into your PC RAM

The issue with what the OP is asking is to get the OS to recognise a different start of a file. Truncating from the end is easy (just tell the file system the length has changed). Truncating from the front is not, unless you have specific file system operations. On Windows you would have to use Sparse Files to achieve the same as fallocate.
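For the easy direction, "tell the file system the length has changed" really is just a seek plus SetEndOfFile; a minimal sketch (the file name and new length are made up):

#include <windows.h>

int main(void)
{
    HANDLE h = CreateFileA("big.dat", GENERIC_WRITE, 0, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    /* Move the file pointer to the desired new length... */
    LARGE_INTEGER newLen;
    newLen.QuadPart = 1234567890LL;   /* hypothetical new size in bytes */
    SetFilePointerEx(h, newLen, NULL, FILE_BEGIN);

    /* ...and cut the file off there. No data is read or rewritten. */
    SetEndOfFile(h);

    CloseHandle(h);
    return 0;
}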

 

  • Like 1
Link to comment
7 minutes ago, ShaunR said:

You can use it to copy chunks like mcduff suggested

That's what I was thinking of; with Memory-Mapped Files it should be way more productive than with normal file operations, and there's no need to load the entire file into RAM. I have a machine with 8 GB of RAM and 8 GB files are mmap'ed just fine. So the general sequence is: Open the file (with CreateFileA or as shown above) -> Map it into memory -> Move the data in chunks with read/write operations -> Unmap the file -> SetFilePointer(Ex) -> SetEndOfFile -> Close the file.
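A rough C sketch of that call order, purely to illustrate the sequence (it assumes a 64-bit process, a made-up "big.dat", a 100 kB cut, and next to no error handling; the single big view is only address space — the OS pages the file through RAM in chunks):

#include <windows.h>
#include <string.h>

int main(void)
{
    const ULONGLONG cut = 100 * 1024;   /* bytes to drop from the front */

    /* 1. Open the file for read/write. */
    HANDLE hFile = CreateFileA("big.dat", GENERIC_READ | GENERIC_WRITE, 0,
                               NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE) return 1;

    LARGE_INTEGER size;
    GetFileSizeEx(hFile, &size);

    /* 2. Map it into memory (whole file; pages are only brought in as they are touched). */
    HANDLE hMap = CreateFileMappingA(hFile, NULL, PAGE_READWRITE, 0, 0, NULL);
    unsigned char *p = MapViewOfFile(hMap, FILE_MAP_WRITE, 0, 0, 0);

    /* 3. Move the data: shift everything after the unwanted header down to offset 0. */
    memmove(p, p + cut, (size_t)(size.QuadPart - cut));

    /* 4. Unmap the file. */
    FlushViewOfFile(p, 0);
    UnmapViewOfFile(p);
    CloseHandle(hMap);

    /* 5. SetFilePointerEx + SetEndOfFile to chop the now-duplicated tail off the end. */
    LARGE_INTEGER newEnd;
    newEnd.QuadPart = size.QuadPart - (LONGLONG)cut;
    SetFilePointerEx(hFile, newEnd, NULL, FILE_BEGIN);
    SetEndOfFile(hFile);

    /* 6. Close the file. */
    CloseHandle(hFile);
    return 0;
}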

Link to comment
1 hour ago, dadreamer said:

That's what I was thinking of; with Memory-Mapped Files it should be way more productive than with normal file operations. [...]

Indeed. However, hole punching is much, much faster. If you are talking terabytes, it's the only way, really.

Set the file to be sparse and write 100k of zeros to the beginning. Job done (sort of).
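In Win32 terms the hole punch is usually done with two DeviceIoControl calls rather than literally writing zeros — a sketch of the idea, not necessarily exactly what ShaunR does ("big.dat" is a placeholder, error handling omitted):

#include <windows.h>
#include <winioctl.h>

int main(void)
{
    HANDLE h = CreateFileA("big.dat", GENERIC_READ | GENERIC_WRITE, 0,
                           NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    DWORD bytes = 0;

    /* 1. Flag the file as sparse so NTFS is allowed to deallocate zeroed regions. */
    DeviceIoControl(h, FSCTL_SET_SPARSE, NULL, 0, NULL, 0, &bytes, NULL);

    /* 2. Punch a hole over the first 100 kB: the range reads back as zeros and its
          clusters are released, but note that the file offsets do NOT shift down. */
    FILE_ZERO_DATA_INFORMATION zero;
    zero.FileOffset.QuadPart      = 0;
    zero.BeyondFinalZero.QuadPart = 100 * 1024;
    DeviceIoControl(h, FSCTL_SET_ZERO_DATA, &zero, sizeof zero, NULL, 0, &bytes, NULL);

    CloseHandle(h);
    return 0;
}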

Edited by ShaunR
Link to comment
9 hours ago, ShaunR said:

Indeed. However, hole punching is much, much faster. If you are talking terabytes, it's the only way, really. [...]

Yes, we are indeed talking terabytes.

Reading the original file and writing a new one would take many minutes. It would also require the storage medium to have terabytes of free space available to perform the operation; maybe even a whole separate partition would need to be set aside. "Copy only the parts you want to save" is certainly the obvious solution, but it's not a good one for really big files.

Thanks for the Microsoft link to sparse files. I'll dig into that and learn more.

Link to comment

Yeah, I dug into the Microsoft docs on sparse files, and I don't think that technology is going to solve my problem after all. Cool stuff. Good to know. But doesn't seem like it's going to solve my immediate pain.

I guess what's really needed is a way to modify the NTFS Master File Table (MFT) to change the starting offset of a given file. But I didn't see any Win32 APIs that could do that. I'm sure it must be possible with some bit banging, but I'd probably be getting in way over my head if I tried to modify the MFT using a method not endorsed by Microsoft.

 

Link to comment
9 hours ago, GregSands said:

I'm guessing there must be more to your question, but based on your specs, I'd be asking whether it was worth spending time and effort deleting a relatively tiny part of a file. 100k out of tens of GB? I'd just leave it there and work around it!

It's a common requirement for video editing.

Link to comment

A technically related question: Insert bytes into middle of a file (in windows filesystem) without reading entire file (using File Allocation Table)? (Or a closer, but not that informative, one.) The gist is: theoretically possible, but so low-level and hacky that it's easy to mess something up and render the whole system inoperable. If that doesn't stop you, you may try contacting Joakim Schicht, as he has made a bunch of NTFS tools, incl. PowerMft for low-level modifications, and maybe he will give you some tips about how to proceed (or give it up and switch to traditional ways/workarounds).

Edited by dadreamer
  • Like 1
Link to comment

Well. What is your immediate pain? Can you elaborate?

Here is an existing file with the first 0x40000000 (1,073,741,824 decimal) bytes nulled.

[Screenshot: file properties, showing the size on disk]

You can see it only has about 600MB on disk.

If I query it I see that data starts at 0x40000000

[Screenshot: query output showing the first allocated range]


Now I can do a seek to that location and read ~600 bytes. However, I'm guessing you have further restrictions.
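The query being described maps onto FSCTL_QUERY_ALLOCATED_RANGES; a hedged C sketch of "find where the data starts, seek there, read" ("sparse.dat" and the 600-byte read are placeholders, error handling omitted):

#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int main(void)
{
    HANDLE h = CreateFileA("sparse.dat", GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    LARGE_INTEGER size;
    GetFileSizeEx(h, &size);

    /* Ask NTFS for the allocated (non-hole) ranges covering the whole file. */
    FILE_ALLOCATED_RANGE_BUFFER query;
    query.FileOffset.QuadPart = 0;
    query.Length = size;

    FILE_ALLOCATED_RANGE_BUFFER ranges[16];
    DWORD bytes = 0;
    DeviceIoControl(h, FSCTL_QUERY_ALLOCATED_RANGES, &query, sizeof query,
                    ranges, sizeof ranges, &bytes, NULL);

    DWORD count = bytes / sizeof(FILE_ALLOCATED_RANGE_BUFFER);
    if (count > 0) {
        printf("data starts at 0x%llX\n",
               (unsigned long long)ranges[0].FileOffset.QuadPart);

        /* Seek to the first allocated range and read from there. */
        SetFilePointerEx(h, ranges[0].FileOffset, NULL, FILE_BEGIN);
        char buf[600];
        DWORD read = 0;
        ReadFile(h, buf, sizeof buf, &read, NULL);
    }

    CloseHandle(h);
    return 0;
}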

Link to comment
Quote

Well. What is your immediate pain? Can you elaborate?

The jumbo file is recorded with a bunch of header data starting at file offset zero. This header data is not actually useful, and it causes a third-party analysis application to think that the recorded data is corrupt. If I can manage to delete only the header data at the beginning of the file, then the third-party analysis application can open and analyze the file without throwing any errors.

Edited by Reds
Link to comment
21 hours ago, GregSands said:

100k out of tens of GB? I'd just leave it there and work around it! [...]

Yeah, I wish that was possible.

The problem is that a third party analysis application can't understand the first 100kB of the file, and so that software incorrectly concludes that the entire remainder of the file must be corrupt.

Link to comment
10 hours ago, Reds said:

If I can manage to delete only the header data at the beginning of the file, then the third party analysis application can open and analyze the file without throwing any errors.

Indeed.

On 9/7/2023 at 9:01 AM, ShaunR said:

but Windows (A.K.A NTFS/ReFS) doesn't have the "FALLOC_FL_COLLAPSE_RANGE"

Which is what you need.
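For comparison, this is what that Linux-only operation looks like; a small sketch ("big.dat" is a placeholder, it only works on filesystems such as ext4/XFS, and the collapsed length must be a multiple of the filesystem block size — 100 kB is 25 × 4 KiB blocks, so it qualifies on a 4 KiB-block filesystem):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("big.dat", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Remove the first 100 kB and shift the rest of the file down in place;
       no data is copied by user code and no second copy of the file is needed. */
    if (fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, 100 * 1024) != 0)
        perror("fallocate");

    close(fd);
    return 0;
}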

Link to comment
18 hours ago, Neil Pate said:

@ShaunR and @dadreamer (and @Rolf Kalbermatter) how do you know so much about low-level Windows stuff?

Please never leave our community, you are not replaceable!

 

:beer_mug:

Oh, I am easily replaceable.

The other two know how things work in a "white-box", "under-the-hood" manner. I know how stuff works in a "black-box" manner after decades of finding work-arounds and sheer bloody-mindedness. :D

Link to comment
