How to extract large files on Windows - 7zip

I'm trying to extract the ImageNet .tar archive on Windows. It contains 150 GB of files, and extraction is estimated to take 95 hours using an SSD and the 7zip file manager. Is there a reasonable way to speed this up, such as using a Python library or command-line options instead?

If you are using an external SSD then it is slowing you down. Copy the tar to a hard drive (or internal SSD) and try again. I bet it goes 10x faster.
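
Since the question asks about a Python library, here is a minimal sketch using the standard-library tarfile module. The function name extract_tar and the example paths are hypothetical. It is not guaranteed to be faster than 7zip: with very many small files, the time is usually dominated by per-file filesystem work rather than by the extraction tool. It does avoid the GUI and reads the archive as one forward-only stream.

```python
import tarfile
from pathlib import Path


def extract_tar(archive_path, dest_dir):
    """Extract every member of a tar archive into dest_dir.

    Returns the number of members extracted. (Hypothetical helper.)
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Newer Python versions accept an extraction filter; "data" rejects
    # members that would land outside dest_dir. Older versions lack it.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}

    count = 0
    # "r|*" reads the archive as a forward-only stream, so members must be
    # extracted in the order they are iterated, which is what happens here.
    with tarfile.open(str(archive_path), mode="r|*") as archive:
        for member in archive:
            archive.extract(member, path=dest, **extra)
            count += 1
    return count
```

A call might look like extract_tar(r"D:\data\imagenet.tar", r"D:\data\imagenet"), with both paths on the internal drive as suggested above.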

Related

Anaconda zip file

Is the Anaconda installer available as a zip file? The download links I usually find are about the same size as the original installer (within +/- 10 MB). I have a hard daily usage limit, and a single download will exhaust it.

get name of memory mapped file

I have a Windows host where, according to RAMMap, almost all memory is in mapped files. I am trying to find out which file causes this leak. All the guides I found suggest using the File Summary tab to link mapped-file memory to specific files, but no file listed there accounts for that much mapped-file memory.
Is there a way to find out which file is to blame? I assume Sysinternals tools like RAMMap already use Windows API functions, so I won't learn more by calling functions like GetMappedFileNameA myself.

I have 24 GB of mapped files on my 96 GB machine. It seems to me that this is simply the Windows file cache ("Smartdrv", if you know that from DOS times).
This is roughly the same amount that Task Manager displays as "Cached". Its tooltip reads:
"Memory that contains cached data and code that is not actively in use."
So this is nothing to worry about. In fact it's great, because Windows can read files from memory instead of disk, which makes things much faster.

C drive free space drops very fast with no obvious reason

My operating system is Windows 10 and I have a problem with the free space dropping for no reason.
A couple of days ago I ran some Python code in a Jupyter notebook, and in the middle of execution my C drive ran out of space (there had been ~50 GB free). Since then the free space on C changes significantly (even shrinking to a few MB) for no obvious reason.
I then found some huge files in a PyCharm temporary directory and freed 47 GB, but after a short time the drive ran out of space again (I am not even running any code anymore)!
When I restart, the free space gradually starts to increase, and then after some time it shrinks again to a few GB or even MB.
PS. I installed WinDirStat to show me disk usage, and it shows 93 GB under this path: C:\ProgramData\Microsoft\Search\Data\Applications\Windows\Files\Windows.edb, but I can't open the Data folder in File Explorer, and it shows 0 bytes when I open the folder properties.

Windows.edb is the index database of the Windows Search function. It holds the file index that speeds up searching the file system. There are several guides on the internet about reducing its size. The radical way would be to delete it, but I do not recommend this. You would have to stop Windows Search first:
net stop "Windows Search"
del %PROGRAMDATA%\Microsoft\Search\Data\Applications\Windows\Windows.edb
net start "Windows Search"
You wrote in your question that the file suddenly grew while your program was running. Perhaps your program creates files in an indexed location. Such files should be excluded from indexing; do that for the folder where the files are created. If all of this fails, you could as a last resort turn indexing off, which slows down Windows Search.

Move/copy millions of images from Macos to external drive to ubuntu server

I have created a dataset of millions (>15M, so far) of images for a machine-learning project, taking up over 500GB of storage. I created them on my Macbook Pro but want to get them to our DGX1 (GPU cluster) somehow. I thought it would be faster to copy to a fast external SSD (2x nvme in raid0) and then plug that drive directly into local terminal and copy it to the network scratch disk. I'm not so sure anymore, as I've been cp-ing to the external drive for over 24 hrs now.
I tried using the Finder GUI to copy at first (bad idea!). For a smaller dataset (2M images), I used 7zip to create a few archives. I'm now using the terminal in macOS to copy the files using cp.
I tried "cp /path/to/dataset /path/to/external-ssd"
Finder was definitely not the best approach, as it took forever at the "preparing to copy" stage.
Using 7zip to archive the dataset increased the "file" transfer speed, but it took over 4 days(!) to extract the files, and that was for a dataset an order of magnitude smaller.
Using cp on the command line started off quickly but seems to have slowed down. Activity Monitor says I'm getting 6-8k IOs on the disk. It's been 24 hours and it isn't quite halfway done.
Is there a better way to do this?

rsync is well suited to this kind of workload. It is used for both local and network copies.
Main benefits are (excerpt from manpage):
delta-transfer algorithm, which reduces the amount of data sent
if it is interrupted for any reason, then you can restart it easily with very little cost. It can even restart part way through a large file
options that control every aspect of its behavior and permit very flexible specification of the set of files to be copied.
Rsync is widely used for backups and mirroring and as an improved copy command for everyday use.
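Regarding command usage and syntax, for local transfers it is almost the same as cp:
rsync -az /path/to/dataset /path/to/external-ssd
Note that -z compresses data in transit, which mainly helps over slow network links; for a local copy, -a alone is usually enough.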
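
The restart behaviour mentioned above can be illustrated in Python. This is not rsync and does not implement its delta-transfer algorithm; it is a simplified sketch of the "skip what is already there" idea. It assumes a file counts as done when the destination has the same size and modification time (roughly what rsync's default quick check compares). The function name resumable_copy is hypothetical.

```python
import os
import shutil
from pathlib import Path


def resumable_copy(src_dir, dst_dir):
    """Copy a directory tree, skipping files that already look complete.

    Returns a (copied, skipped) tuple. (Hypothetical helper.)
    """
    src_root = Path(src_dir)
    dst_root = Path(dst_dir)
    copied = 0
    skipped = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        target_dir = dst_root / Path(dirpath).relative_to(src_root)
        target_dir.mkdir(parents=True, exist_ok=True)
        for name in filenames:
            src = Path(dirpath) / name
            dst = target_dir / name
            src_stat = src.stat()
            if dst.exists():
                dst_stat = dst.stat()
                same_size = dst_stat.st_size == src_stat.st_size
                # Compare whole seconds to tolerate coarse timestamp
                # resolution on some filesystems.
                same_time = int(dst_stat.st_mtime) == int(src_stat.st_mtime)
                if same_size and same_time:
                    skipped += 1
                    continue
            # copy2 also copies the modification time, which is what lets
            # a later run recognise the file as already transferred.
            shutil.copy2(src, dst)
            copied += 1
    return copied, skipped
```

Running it a second time after an interruption only copies the files that are missing or differ in size or timestamp, so a partially written file from the interrupted run is copied again.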

jRuby Zip out of Memory

We have a small utility, written in jRuby, that finds unused items on our server, zips them up, and then moves them. When we run it on the actual servers that need cleaning up, they run out of memory before the cleanup completes. The Java heap is as large as we can get it to run stably on 32-bit (around 1800m max heap size), and we can't move to 64-bit at this time. Our main application is running as well, and we would like to avoid shutting it down. The zips the system creates are 800 MB or more. Is there any way to do this without having the entire zip file open in memory?

Can you execute zip via the command line?
You may also want to look at pbzip2, though you will still need tar to archive multiple files.
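
If shelling out is not an option, the general idea is to stream each file into the archive in fixed-size chunks, so that neither the input files nor the zip is ever held in memory as a whole. Below is a minimal sketch of that idea in Python's standard zipfile module; it is not the jRuby utility's code. A jRuby version would presumably use a streaming API such as java.util.zip.ZipOutputStream in the same way, which is an assumption about what that utility can use. The function name zip_files_streaming is hypothetical.

```python
import shutil
import zipfile
from pathlib import Path


def zip_files_streaming(file_paths, zip_path, chunk_size=1024 * 1024):
    """Write files into a zip archive one chunk at a time.

    Returns the number of files added. (Hypothetical helper.)
    Entries are stored under their base file name only, and original
    timestamps are not preserved in this sketch.
    """
    added = 0
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for file_path in file_paths:
            file_path = Path(file_path)
            # force_zip64 allows entries larger than 2 GiB even though the
            # final size is not known when the entry is opened.
            with file_path.open("rb") as src, \
                    archive.open(file_path.name, "w", force_zip64=True) as dst:
                # Only chunk_size bytes are held in memory at a time.
                shutil.copyfileobj(src, dst, chunk_size)
            added += 1
    return added
```

Peak memory use then depends on chunk_size rather than on the size of the archive, which is the property the question is after.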
