Listing the contents of a LZMA compressed file? - 7zip

Is it possible to list the contents of an LZMA-compressed archive (.7z) without decompressing the whole file? Also, can I extract a single file from the archive?
My problem: I have a 30GB .7z file that uncompresses to >5TB. I would like to manipulate the original .7z file without needing to do a full uncompress.

Yes. Start with XZ Utils. There are Perl and Python APIs.
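You can locate the file you want from the archive headers. In a non-solid archive each file is compressed separately, so you can extract just the one you want; in a solid archive, files that share a block are compressed together, so extracting one of them may require decompressing the data that precedes it in that block.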

Download lzma922.tar.bz2 from the LZMA SDK files page on Sourceforge, then extract the files and open C/Util/7z/7zMain.c. There you will find routines to extract a specific file from a .7z archive. You don't need to extract all the data from all the entries; the example code shows how to extract just the one you are interested in. The same code also has logic to list the entries without extracting all the compressed data.

I solved this problem by installing 7zip (https://www.7-zip.org/) and using the parameter l. For example:
7z l file.7z
The output has some descriptive information followed by the list of files in the archive. I then call this from Python using the subprocess library:
import subprocess

# Run "7z l" and capture its listing as text; check=True raises if 7z fails.
result = subprocess.run(["7z", "l", "file.7z"], capture_output=True, encoding="utf-8", check=True)
output = result.stdout
Make sure the 7z program is accessible through your PATH variable. I had to set this manually on Windows.
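The same subprocess approach can cover the second half of the question, extracting a single file. What follows is only a minimal sketch, assuming 7z is on PATH. The helper names build_extract_command and extract_member and the example paths are made up for illustration. It uses 7-Zip's x command (extract with full paths), the -o switch for the output directory, and -y to answer prompts automatically.

```python
import subprocess

def build_extract_command(archive, member, out_dir, exe="7z"):
    """Return the argument list that extracts one member with its full path."""
    # "x" extracts with full paths; "-o<dir>" (no space) sets the output
    # directory; "-y" answers yes to any prompt.
    return [exe, "x", archive, f"-o{out_dir}", "-y", member]

def extract_member(archive, member, out_dir):
    """Run 7z and raise CalledProcessError if it reports a failure."""
    cmd = build_extract_command(archive, member, out_dir)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

if __name__ == "__main__":
    # Hypothetical names; replace with a path shown by "7z l file.7z".
    print(build_extract_command("file.7z", "folder/wanted.txt", "out"))
```

The member name should be written exactly as it appears in the listing produced by 7z l.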

Related

Extracting images from multiple PDF files using hexapdf - hexapdf no such file or directory @ rb_sysopen

I'm working on my master's thesis and I have to extract images from about 500 PDF files; some people recommended hexapdf to me for this. I was able to install Ruby and hexapdf, but now I'm stuck getting the images out of the PDFs since I don't have a coding background. Any tips?
Thanks in advance.
I tried the basic command on a single PDF to see what would happen, running 'hexapdf images' followed by the PDF name, but the result was 'no such file or directory @ rb_sysopen'.
If you're getting no such file or directory @ rb_sysopen, that signals that the file you are trying to open does not exist. Most likely this is the PDF that you are trying to extract images from.
I would check that you are following the help provided by the hexapdf documentation and that the path to your PDF is correct. If the file with your code and the PDF are in the same directory and you are running your code from that directory, then you would do something like:
require 'hexapdf'
doc = HexaPDF::Document.open('my_pdf_document_filename.pdf')
If the file is somewhere else on the machine, it may be easiest to use a full file path instead of a relative file path which will depend on your system and such (e.g. /Users/username/thesis/image_processing/files/my_pdf_document_filename.pdf).
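Since the error comes down to a path that does not resolve, one way to sidestep it for a whole folder of PDFs is to build absolute paths up front. Here is a minimal Python sketch, assuming all the PDFs sit directly in one folder. find_pdfs and build_hexapdf_commands are hypothetical helper names, and the command is the plain 'hexapdf images' call from the question, followed by the file path.

```python
from pathlib import Path

def find_pdfs(folder):
    """Return absolute paths of the PDF files directly inside `folder`."""
    root = Path(folder).expanduser().resolve()
    if not root.is_dir():
        # Fail early with the resolved path so a typo is easy to spot.
        raise FileNotFoundError(f"No such directory: {root}")
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

def build_hexapdf_commands(folder):
    """One 'hexapdf images <pdf>' argument list per PDF, using absolute paths."""
    return [["hexapdf", "images", str(pdf)] for pdf in find_pdfs(folder)]
```

Each returned list can be handed to subprocess.run, and printing the lists first is a quick way to confirm that every path points at a real file.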

Moving files inside a tar archive

I have a script that archives a mongo collection:
archive.tar.gz contains:
folder/file.bson
and I need to add an additional top-level folder to that structure, for example:
top-folder/folder/file.bson
It seems that one way is to unpack and re-pack everything, but is there any other solution to this?
The problem is that there is a third-party script that unpacks the archive and fetches the files from top-folder/folder/file.bson, and in the current format the path is wrong.
.tar.gz is actually what the name suggests - first tar converts a directory structure to a byte stream (i.e. a single file), and this byte stream is then compressed by gzip.
Which means that changing the file path inside the archive is equal to byte-editing a compressed data stream - an unnecessarily difficult thing to do without decompressing the stream.
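The stream does have to be decompressed and recompressed, but nothing needs to be unpacked to disk along the way. A minimal sketch with Python's standard tarfile module follows, assuming a gzip-compressed archive. The function name add_top_level_folder is made up for illustration. It copies the archive member by member into a new archive, prefixing every path.

```python
import tarfile

def add_top_level_folder(src_path, dst_path, prefix):
    """Copy src_path (.tar.gz) to dst_path with every member moved under `prefix`."""
    with tarfile.open(src_path, "r:gz") as src, tarfile.open(dst_path, "w:gz") as dst:
        for member in src:
            member.name = f"{prefix}/{member.name}"
            # A stored pax "path" record would override the new name on write.
            member.pax_headers.pop("path", None)
            if member.islnk():
                # Hard links point at another member by its archive path.
                member.linkname = f"{prefix}/{member.linkname}"
                member.pax_headers.pop("linkpath", None)
            # Only regular files carry data; directories and links do not.
            data = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, data)
```

For the example in the question, add_top_level_folder("archive.tar.gz", "new.tar.gz", "top-folder") would turn folder/file.bson into top-folder/folder/file.bson in the new archive.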

Extracting specific files with file extension from a .tar.xz archive using MacOS terminal

I have a number of compressed archives with the extension .tar.xz. I am advised that, when decompressed, the total size required is around 2TB.
Within the archives are a number of images, which are all I am after.
Is there a way to extract only files with, for example, the extensions .jpeg and .gif from the compressed archives without having to extract every file?
Thanks
It's trivial to extract just one of the file types; for example:
tar -xJf archive.tar.xz '*.jpeg'
will extract all files with the .jpeg extension. It's important to quote the *; otherwise the shell would try to expand it, and tar would only be asked for files whose names match ones already in the current directory (or, depending on the shell, the command would fail because nothing matched).
You can similarly use other patterns like '*.gif', or both together:
tar -xJf archive.tar.xz '*.jpeg' '*.gif'
Because you tagged the question as OS X, I'll skip the --wildcards option, which GNU tar on Linux needs when extracting by pattern.
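The same selection can be scripted. Below is a minimal sketch using Python's standard tarfile module; the function name extract_by_extension and its default extensions are assumptions for illustration. It walks the archive once and writes only the matching regular files. The whole compressed stream still has to be read, since a .tar.xz has no index, but nothing else is written to disk.

```python
import os
import shutil
import tarfile

def extract_by_extension(archive_path, out_dir, extensions=(".jpeg", ".gif")):
    """Extract only regular files whose names end with one of `extensions`.

    `extensions` are expected in lower case; matching ignores case.
    Returns the archive names of the files that were written.
    """
    extracted = []
    out_root = os.path.realpath(out_dir)
    # "r:*" lets tarfile detect the compression (xz, gz, bz2) itself.
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            if not member.name.lower().endswith(tuple(extensions)):
                continue
            target = os.path.realpath(os.path.join(out_root, member.name))
            # Skip entries that would land outside out_dir (e.g. "../x.gif").
            if os.path.commonpath([out_root, target]) != out_root:
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with tar.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(member.name)
    return extracted
```

Copying the file data by hand keeps the sketch independent of the extraction filters added to newer Python versions, at the cost of not restoring permissions or timestamps.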

How to extract a specific folder using IZARC (IZARCe)

I want to extract a specific directory from a huge zip file (>5GB) that is somewhat corrupted because of a badly maintained build system, which I cannot avoid, that creates the zip.
GUI tools such as WinRAR and 7-Zip have no issues extracting the files, but some command-line tools such as mks unzip and 7za fail to extract from the corrupted archive.
After a lot of digging around and trying out many such command line utilities I found out that IZARC successfully extracts files from the archive.
I am running the following command:
IZARCe.exe -e -d -o D:\aHugeZipFile.zip -pD:\temp @"source.txt"
The listing file source.txt contains just one entry:
source/lib/*
which is the only directory in the archive, from where the contents are to be extracted.
But, it is resulting in:
IZArc Command Line Extraction Add-On Version 1.1 (Build: 130)
Copyright(c) 2007 Ivan Zahariev, All Rights Reserved.
http://www.izarc.org contact@izarc.org
Archive File: aHugeZipFile.zip
WARNING: Nothing to do!
I have tried specifying:
/source/lib/*
source/lib/*
source/lib/
source/lib
*source/lib/*
in the listing file, all to no avail! :(
Any pointers on where the error is occurring, and how to fix the issue will be of great help. Thank you in advance!
Using relative or absolute paths in listfiles doesn't appear to work with IZArc. Try using wildcards such as *.*, *.doc, etc. instead of paths in the listfile. Be aware that IZArc appears to have a limit on the folder depth it will extract to, as well as a tendency to generate CRC errors when files with the same name are present in the same archive, even if they are in different directories.
I would suggest using 7-Zip command-line instead. It can recurse deeply through a file structure without error and can use relative directories and wildcards in its listfiles.
The following 7-Zip command was tested and worked perfectly.
7za x somearchive.zip -o"C:\Documents and Settings\me\desktop\temp_folder\test2" -ir@source.txt -aoa -scsWIN
The source.txt file may contain a combination of relative paths and/or wildcards on separate lines such as:
Output/, Folder2/, *, or *.doc.
In the command above: x (extract with full paths), -ir (include filenames, recurse subdirectories), -aoa (overwrite existing files without prompting), -scsWIN (set the charset for list files). You may need to adjust these switches for your situation.

What is the fastest way to unzip textfiles in Matlab during a function?

I would like to scan text of textfiles in Matlab with the textscan function. Before I can open the textfile with fid = fopen('C:\path'), I need to unzip the files first. The files have the extension: *.gz
There are thousands of files which I need to analyze and high performance is important.
I have two ideas:
(1) Use an external program and call it from the command line in Matlab
(2) Use a Matlab 'zip' toolbox. I have heard of gunzip, but don't know about its performance.
Does anyone know a way to unzip these files as quickly as possible from within Matlab?
Thanks!
You could always try the Matlab unzip() function:
unzip
Extract contents of zip file
Syntax
unzip(zipfilename)
unzip(zipfilename, outputdir)
unzip(url, ...)
filenames = unzip(...)
Description
unzip(zipfilename) extracts the archived contents of zipfilename into the current folder and sets the files' attributes, preserving the timestamps. It overwrites any existing files with the same names as those in the archive if the existing files' attributes and ownerships permit it. For example, files from rerunning unzip on the same zip filename do not overwrite any of those files that have a read-only attribute; instead, unzip issues a warning for such files.
Internally, this uses Java's zip library org.apache.tools.zip. If your zip archives each contain many text files, it might be faster to drop down into Java and extract them entry by entry, without explicitly writing the unzipped files to disk. Look at the source of unzip.m to get some ideas, and also at the Java documentation.
I've found 7zip-commandline(Windows) / p7zip(Unix) to be somewhat speedier for this.
[edit]From some quick testing, it seems making a system call to gunzip is faster than using MATLAB's native gunzip. You could give that a try as well.
Just write a new function that imitates basic MATLAB gunzip functionality:
function [] = sunzip(fullfilename,output_dir)
if ~exist('output_dir','var'), output_dir = fileparts(fullfilename); end
app_path = '/usr/bin/7za';
switches = ' e'; %extract files ignoring directory structure
options = [' -o' output_dir];
system([app_path switches options '_' fullfilename]);
Then use it as you would use gunzip:
sunzip('/data/time_1000.out.gz',tmp_dir);
With MATLAB's toc timer, I get the following extraction times with 6 uncompressed 114MB ASCII files:
gunzip: 10.15s
sunzip: 7.84s
This worked well; it just needed a minor change to Max's syntax for calling the executable:
system([app_path switches ' ' fullfilename options ]);
