Extracting a specific file from a tar.bz2 containing a matching pattern - bash

So I have this one big tarball:
du -sh file.tar.bz2
871M file.tar.bz2
This tarball contains hundreds of files:
tar -jtvf file.tar.bz2 | head -3
./file-140556-001_045.txt
./file-121720-001_012.txt
./file-171008-001_036.txt
And I can do a bzgrep no problem:
bzgrep '0316629989093' file.tar.bz2
Binary file (standard input) matches
And using bzgrep -a I can extract the line containing the search pattern. What I'm trying to accomplish, though, is getting the name of the file inside the tarball that matches the search pattern, so I can extract it without decompressing the whole tarball.
For example: ./file-171008-001_036.txt
Is there any way to do this from the bzgrep command?

I tried all the options bzgrep offers, and it doesn't seem possible to extract the names of the files matching the pattern. That's too bad.
What you can do as a workaround is extract the files one by one and delete each one after searching it.
Something like this :
#!/bin/bash
ARCHIVE="file.tar.bz2"
PATTERN="0316629989093"
tar -jtf "$ARCHIVE" | while IFS= read -r file; do
    # Extract one member, search it, then remove it again
    tar -jxf "$ARCHIVE" "$file"
    grep -q "$PATTERN" "$file" && echo "$file matches"
    rm "$file"
done
Output:
file-171008-001_036.txt matches
Pros: not all the files are extracted at once, so disk usage stays limited.
Cons: the whole archive is still decompressed, so the execution time is pretty bad.
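With GNU tar you can avoid writing anything to disk at all: --to-command pipes each member's contents to a shell command and exposes the member name in $TAR_FILENAME. A minimal sketch, assuming GNU tar:
#!/bin/bash
# Requires GNU tar: --to-command runs the given command once per member,
# with the member contents on stdin and its name in $TAR_FILENAME.
ARCHIVE="file.tar.bz2"
export PATTERN="0316629989093"    # exported so the per-member shell sees it
tar -jxf "$ARCHIVE" \
    --to-command='grep -q -e "$PATTERN" && printf "%s matches\n" "$TAR_FILENAME"; true'
The trailing "; true" keeps tar from treating a non-matching member (grep exit status 1) as an error. The archive is still decompressed once, but nothing touches the disk.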

Related

Unpack .tar.gz and modify result files

I wanted to write a bash script that will unpack .tar.gz archives and, for each resulting file, set an additional attribute with the name of the original archive, just to know the origin of each unpacked file.
I tried to store the contained file names in an array and then loop over them with a for loop.
for archive in "$1"*.tar.gz; do
    if [ -f "${archive}" ]
    then
        readarray -t fileNames < <(tar tzf "$archive")
        for file in "${fileNames}"; do
            echo "${file}"
            tar xvzf "${archive}" -C "$1" --no-wildcards "${file}" &&
                attr -s package -V "${archive}" "${file}"
        done
    fi
done
The result is that only one file is extracted and no extra attribute is set.
#!/bin/bash
for archive in "$1"*.tar.gz; do
    if [ -f "${archive}" ]; then
        # Unpack the archive into subfolder $1
        tar xvf "$archive" -C "$1"
        # Assign attributes
        tar tf "$archive" | (cd "$1" && xargs -t -L1 attr -s package -V "$archive")
    fi
done
Notes:
The loop in the question runs only once because "${fileNames}" expands to just the first array element; iterating over every entry requires "${fileNames[@]}".
The script unpacks each archive with a single 'tar' invocation. This is more efficient than unpacking one file at a time, and it avoids issues with unpacking folders, which would lead to unnecessary repeated work.
The script uses 'attr'. It would be better to use 'setfattr', if the target file system supports it, so that attributes can be set on multiple files with a few calls (using xargs with multiple files per command); see the sketch after these notes.
It is not clear what the structure of the output folder should be. From the question, it looks as if all archives will be unpacked into the same folder "$1". The solution above assumes this is the intended behavior and that the archives contain distinct file names. If each archive is to be placed into a different subfolder, this will be easier and more efficient to implement.
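A minimal sketch of that setfattr variant, assuming the target file system supports user extended attributes and GNU xargs is available (for -d); the attribute name user.package matches what attr -s package actually stores:
tar tf "$archive" | grep -v '/$' | (
    cd "$1" &&
    # setfattr accepts many files per call, so xargs can batch them;
    # -d '\n' keeps names with spaces intact
    xargs -d '\n' setfattr -n user.package -v "$archive"
)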

Shell: Copy list of files with full folder structure stripping N leading components from file names

Consider a list of files (e.g. files.txt) similar (but not limited) to
/root/
/root/lib/
/root/lib/dir1/
/root/lib/dir1/file1
/root/lib/dir1/file2
/root/lib/dir2/
...
How can I copy the specified files (not any other content from the folders which are also specified) to a location of my choice (e.g. ~/destination) with a) intact folder structure but b) N folder components (in the example just /root/) stripped from the path?
I already managed to use
cp --parents `cat files.txt` ~/destination
to copy the files with an intact folder structure, however this results in all files ending up in ~/destination/root/... when I'd like to have them in ~/destination/...
I think I found a really nice and concise solution using GNU tar:
tar cf - -T files.txt | tar xf - -C ~/destination --strip-components=1
Note the --strip-components option, which removes an arbitrary number of path components from the beginning of the file name.
One minor problem though: tar always archives the whole content of folders mentioned in files.txt (at least I couldn't find an option to ignore folders), but that is most easily solved by filtering the directory entries out with grep:
grep -v '/$' files.txt > files2.txt
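The two steps also combine into a single pipeline (a sketch; GNU tar reads the file list from stdin when given -T -):
grep -v '/$' files.txt | tar cf - -T - | tar xf - -C ~/destination --strip-components=1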
This might not be the most graceful solution - but it works:
while IFS= read -r file; do
    echo "checking for $file"
    if [[ -f "$file" ]]; then
        file_folder=$(dirname "$file")
        # Strip the leading /root/ and rebuild the path under the destination
        destination_folder=~/destination/${file_folder#/root/}
        echo "copying file $file to $destination_folder"
        mkdir -p "$destination_folder"
        cp "$file" "$destination_folder"
    fi
done < files.txt
I had a look at cp and rsync, but it looks like they would work better if you cd into /root first.
However, if you do cd into the right directory beforehand, you can always run the whole thing in a subshell, so that you are returned to your original location once the subshell finishes.
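For example, a sketch of that subshell idea using rsync (assumes everything in files.txt sits under /root/; with --files-from=- rsync reads the path list from stdin, relative to the source directory):
sed 's|^/root/||' files.txt | grep -v '/$' | (
    cd /root &&
    # The subshell keeps the cd local; rsync recreates the listed
    # paths under the destination
    rsync -a --files-from=- . ~/destination
)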

Extract file using bash script

I created a script which will extract all *.tar.gz files. This file decompresses five times into another .tar.gz file, but the problem is that only the first *.tar.gz file is being extracted.
for file in *.tar.gz; do
    gunzip -c "$file" | tar xf -
done
rm -vf "$file"
What should I do? Answers are greatly appreciated.
If your problem is that the tar.gz file contains another tar.gz file which should be extracted as well, you need a different sort of loop. The wildcard at the top of the for loop is only evaluated when the loop starts, so it doesn't include anything extracted from inside the tar.gz.
You could try something like
while true; do
    for f in *.tar.gz; do
        # With default globbing, an unmatched wildcard stays literal:
        # once no *.tar.gz files remain, $f is the pattern itself, so stop
        case $f in '*.tar.gz') exit 0;; esac
        tar zxf "$f"
        rm -v "$f"
    done
done
The case depends on the fact that, by default, a wildcard that matches no files remains unexpanded. You may have to change your shell's globbing options if they differ from the default.
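If your shell uses nullglob instead (an unmatched wildcard expands to nothing), a sketch of the equivalent stopping test:
shopt -s nullglob
while true; do
    files=(*.tar.gz)
    # With nullglob the array is simply empty once nothing matches
    (( ${#files[@]} == 0 )) && break
    for f in "${files[@]}"; do
        tar zxf "$f"
        rm -v "$f"
    done
done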
If you really mean that it is compressed (not decompressed) five times, despite the single .gz extension, perhaps you need instead
for i in 1 2 3 4; do
    # Peel off one gzip layer, then restore the .gz name for the next round
    gunzip file.tar.gz
    mv file.tar file.tar.gz
done
# tar's z flag handles the fifth and final layer
tar zxf file.tar.gz

How to extract only one kind of file from the archive?

Given a .zip or .rar archive containing 10 files, each with a different extension.
Given I only want the .jpg file in it.
How do I extract the *.jpg without having to extract the 9 other files?
Try this:
unzip test.zip '*.jpg'
The argument after the filename is the file to be extracted. See the Arguments section of man unzip:
[file(s)]
    An optional list of archive members to be processed, separated by
    spaces. (VMS versions compiled with VMSCLI defined must delimit
    files with commas instead. See -v in OPTIONS below.) Regular
    expressions (wildcards) may be used to match multiple members; see
    above. Again, be sure to quote expressions that would otherwise be
    expanded or modified by the operating system.
For a tar archive, GNU tar can do the same:
tar -xf test.tar --wildcards --no-anchored '*.jpg'
You can also loop over the member list and extract matches one at a time:
while IFS= read -r f; do
    tar -xf "$archive" "$f"
done < <(tar tf "$archive" | grep '\.jpg$')

bash search inside ZIP files with keyword?

I am looking for a way to search inside ZIP files. My sysadmin gave me access to a mass storage device that contains approximately 1.5 million ZIPs.
Each ZIP may contain up to 1,000 (ASCII) files. Typically a file name has a part number in it, like so: supplier_code~part_number~yyyymmdd~hhmmss.txt
My boss asked me to search all the ZIPs for a specific part number. If I find a file matching a part number, I need to unzip that specific file. I have tried this so far on a handful of ZIPs:
for i in `find . -name "*zip*"`; do unzip $i tmp/ ; done
Problem is that it unzips everything, which is not correct. I tried to specify the part number like so (after reading the unzip man page):
for i in `find . -name "*zip*"`; do unzip $i -c *part_number* tmp/ ; done
but it did not work (nothing found), and I know I have the correct part number.
Is what I am trying to do possible?
You need to use the -l option of unzip. From the man page:
-l     list archive files (short format). The names, uncompressed file
       sizes and modification dates and times of the specified files are
       printed, along with totals for all files specified. If UnZip was
       compiled with OS2_EAS defined, the -l option also lists columns
       for the sizes of stored OS/2 extended attributes (EAs) and OS/2
       access control lists (ACLs). In addition, the zipfile comment and
       individual file comments (if any) are displayed. If a file was
       archived from a single-case file system (for example, the old
       MS-DOS FAT file system) and the -L option was given, the filename
       is converted to lowercase and is prefixed with a caret (^).
So try something like this (where "ixia" stands in for the part number you are searching for):
for i in *.zip; do
    echo "scanning $i"
    # grep the listing of member names (unzip -l), not the compressed data
    grep -oP "ixia" <(unzip -l "$i") && echo "Found in $i" || echo "Not Found in $i"
done
Since you mentioned you have millions of zip files, you probably don't need all the logging. This is just for example.
I figured out the answer to my question. It's actually quite simple:
for i in `find . -name "*zip"`; do unzip -o "$i" "*partnumber*" -d /tmp/ ; done
For example, this command
for i in `find . -name "*zip"`; do unzip -o "$i" "*3460*" -d /tmp/ ; done
will actually look at the ZIPs on my device but only unzip the file(s) that match a part number.
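Since the part number is part of the file name, you can also skip archives with no match entirely by checking the listing first (a sketch combining the two answers above; 3460 is the example part number, and -print0 with read -d '' keeps unusual file names safe):
find . -name "*zip" -print0 | while IFS= read -r -d '' i; do
    # Only extract from archives whose listing mentions the part number
    if unzip -l "$i" | grep -q "3460"; then
        unzip -o "$i" "*3460*" -d /tmp/
    fi
done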
