I am looking for a way to search inside ZIP files. My sysadmin gave me access to a mass storage device that contains approximately 1.5 million ZIPs.
Each ZIP may contain up to 1,000 (ASCII) files. Typically a file's name has a part number in it, like so: supplier_code~part_number~yyyymmdd~hhmmss.txt
My boss asked me to search all the ZIPs for a specific part number. If I find a file matching the part number, I need to unzip that specific file. I have tried this so far on a handful of ZIPs:
for i in `find . -name "*zip*"`; do unzip $i tmp/ ; done
The problem is that it unzips everything, which is not what I want. I read the unzip man page and tried to specify the part number like so:
for i in `find . -name "*zip*"`; do unzip $i -c *part_number* tmp/ ; done
but it did not work (nothing was found), even though the part number was correct.
Is what I am trying to do possible?
You need to use the -l option of unzip. From the man page:
-l   list archive files (short format). The names, uncompressed file sizes and modification dates and times of the specified files are printed, along with totals for all files specified. If UnZip was compiled with OS2_EAS defined, the -l option also lists columns for the sizes of stored OS/2 extended attributes (EAs) and OS/2 access control lists (ACLs). In addition, the zipfile comment and individual file comments (if any) are displayed. If a file was archived from a single-case file system (for example, the old MS-DOS FAT file system) and the -L option was given, the filename is converted to lowercase and is prefixed with a caret (^).
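For reference, the short-format listing looks roughly like this (illustrative output; the archive and member names here are made up):

Archive:  test.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
     1024  2014-03-09 04:00   acme~3460~20140309~040000.txt
---------                     -------
     1024                     1 file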
So try something like this:
for i in *.zip; do
    echo "scanning $i"
    # "ixia" is just an example search string; substitute your part number
    grep -oP "ixia" <(unzip -l "$i") && echo "Found in $i" || echo "Not Found in $i"
done
Since you mentioned you have millions of zip files, you probably don't need all the logging; it is here just for the example.
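A leaner variant that only prints the names of archives containing a match might look like this (a sketch; "ixia" again stands in for the part number):

for i in *.zip; do
    unzip -l "$i" | grep -q "ixia" && echo "$i"
done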
I figured out the answer to my question. It's actually quite simple:
for i in `find . -name "*zip"`; do unzip -o $i "*partnumber*" -d /tmp/ ; done
For example, this code
for i in `find . -name "*zip"`; do unzip -o $i "*3460*" -d /tmp/ ; done
will actually look at the zips on my device but only unzip the file(s) that match a part number.
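One caveat: the backtick loop word-splits paths, so it breaks on ZIP names containing spaces. A variant of the same idea using find -exec avoids that (3460 again standing in for the part number):

find . -name '*.zip' -exec unzip -o {} "*3460*" -d /tmp/ \;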
I have the bash script below that searches a specific directory and returns the earliest folder in that directory. The script works great if there are no subfolders within each folder. If there are, those are returned instead of the main folder. I am not sure why this is happening or how to fix it. Thank you :).
For example,
/home/cmccabe/Desktop/NGS/test is the directory searched and in it there are two folders, R_1 and R_2
output
The earliest folder is: R_1
However, if /home/cmccabe/Desktop/NGS/test has R_1 with testfolder within it, and R_2 with testfolder2 within it,
output
The earliest folder is: testfolder
Bash
cd /home/cmccabe/Desktop/NGS/test
folder=$(ls -u *"R_"* | head -n 1) # earliest folder
echo "The earliest folder is: $folder"
ls is the wrong tool for this job: Its output is built for humans, not scripts, and is often surprising when nonprintable characters are present.
Assuming you have GNU find and sort, the following works with all possible filenames, including those with literal newlines:
dir=/home/cmccabe/Desktop/NGS/test   # for my testing, it's "."
{
    read -r -d $'\t' time && read -r -d '' filename
} < <(find "$dir" -maxdepth 1 -mindepth 1 -printf '%T+\t%P\0' | sort -z)
...thereafter:
echo "The oldest file is $filename, with an mtime of $time"
For a larger discussion of portably finding the newest or oldest file in a directory, including options that don't require GNU tools, see BashFAQ #99.
You should read the ls man page; the -u option doesn't do what you think it does. The following are the relevant options:
-u - Tells ls to use last access time instead of last modification time when sorting by time. By itself it does nothing, so it should be combined with -t
-t - Sorts by time (modification time by default, or access time with -u), with newest first
-r - Reverses the order of the output
-d - Lists directories themselves rather than their contents
So what you actually need is:
$ ls -trd
Or:
$ ls -utrd
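Put together with the original script, that would look something like this (assuming the R_* directories sit directly under the search directory and have tame names; see the other answer for the caveats about parsing ls):

cd /home/cmccabe/Desktop/NGS/test
folder=$(ls -trd *"R_"* | head -n 1)   # oldest first, take the first one
echo "The earliest folder is: $folder"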
I have a few files with names in the format ReportsBackup-20140309-04-00 and I would like to move each file matching this pattern into a folder named after its year and month, e.g. 201403 for the example above.
I can already create the folders based on the filenames; I would just like to move the files into their correct folders based on their names.
I use this to create the directories
old="directory where are the files" &&
year_month=`ls ${old} | cut -c 15-20`&&
for i in ${year_month}; do
if [ ! -d ${old}/$i ]
then
mkdir ${old}/$i
fi
done
You can use find:
find /path/to/files -name "*201403*" -exec mv {} /path/to/destination/ \;
Here’s how I’d do it. It’s a little verbose, but hopefully it’s clear what the program is doing:
#!/bin/bash
SRCDIR=~/tmp
DSTDIR=~/backups
for bkfile in "$SRCDIR"/ReportsBackup*; do
    # Get just the filename, and read the year/month variable
    filename=$(basename "$bkfile")
    yearmonth=${filename:14:6}

    # Create the folder for storing this year/month combination. The '-p' flag
    # means that:
    #   1) We create $DSTDIR if it doesn't already exist (this flag actually
    #      creates all intermediate directories).
    #   2) If the folder already exists, continue silently.
    mkdir -p "$DSTDIR/$yearmonth"

    # Then we move the report backup to the directory. The '.' at the end of
    # the mv command means that we keep the original filename.
    mv "$bkfile" "$DSTDIR/$yearmonth/."
done
A few changes I’ve made to your original script:
I’m not trying to parse the output of ls. This is generally not a good idea. Parsing ls will make it difficult to get the individual files, which you need for copying them to their new directory.
I’ve simplified your if ... mkdir line: the -p flag is useful for “create this folder if it doesn’t exist, or carry on”.
I’ve slightly changed the slicing command which gets the year/month string from the filename.
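As a quick illustration of that slice, using the filename from the question:

filename="ReportsBackup-20140309-04-00"
echo "${filename:14:6}"   # prints 201403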
Given a .zip or .rar archive containing 10 files, each with a different extension.
Given I only want the .jpg file in it.
How do I extract the *.jpg from it without having to extract the 9 other files?
Try this:
unzip test.zip '*.jpg'
The argument after the filename is the file to be extracted. See man unzip, ARGUMENTS section:
[file(s)]
An optional list of archive members to be processed, separated by spaces. (VMS versions compiled with VMSCLI defined must delimit files with commas instead. See -v in OPTIONS below.) Regular expressions (wildcards) may be used to match multiple members; see above. Again, **be sure to quote expressions that would otherwise be expanded** or modified by the operating system.
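That warning about quoting is worth a concrete example, since it determines whether unzip or your shell does the matching:

# Quoted: the wildcard is passed through to unzip, which matches
# member names inside the archive
unzip test.zip '*.jpg'

# Unquoted: the shell may expand *.jpg against the current directory
# before unzip ever sees it
unzip test.zip *.jpg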
For a tar archive, the equivalent is:
tar -xf test.tar --wildcards --no-anchored '*.jpg'
You can use:
while read -r f; do
    tar -xf "$zipFile" "$f"
done < <(tar tf "$zipFile" | grep '\.jpg$')
I have a large repository of media files that follow torrent naming conventions, something unpleasant to read. At one point, I had properly named the folders that contain said files, and now I want to dump all the .avi, .mkv, etc. files into my main media directory using a bash script.
Overview:
Current directory tree:
Proper Movie Title/
->Proper.Movie.Title.2013.avi
->Proper.Movie.Title.2013.srt
Title 2/
->Title2[proper].mkv
Movie- Epilogue/
->MOVIE EPILOGUE .AVI
Media Movie/
->MEDIAMOVIE.CD1.mkv
->MEDIAMOVIE.CD2.mkv
.
.
.
Desired directory tree:
Proper Movie Title/
->Proper Movie Title.avi
->Proper Movie Title.srt
Title 2.mkv
Movie- Epilogue.avi
Media Movie/
->Media Movie.cd1.mkv
->Media Movie.cd2.mkv
Though this would be the ideal, my main wish is for the directories that contain only a single movie file to have that file renamed and moved into the parent directory.
My current approach is to use a double for loop in a .sh file, but I'm currently having a hard time keeping new bash knowledge in my head.
Help would be appreciated.
My current code (Just to get access to the internal movie files):
#!/bin/bash
FILES=./*
for f in $FILES
do
    if [[ -d $f ]]; then
        INFILES=$f/*
        for file in $INFILES
        do
            echo "Processing >$file< folder..."
        done
        #cat $f
    fi
done
Here's something simple:
find * -maxdepth 1 -type f | while read -r file
do
    dirname="$(dirname "$file")"
    new_name="${dirname##*/}"
    file_ext=${file##*.}
    if [ -n "$file_ext" -a -n "$dirname" -a -n "$new_name" ]
    then
        echo "mv '$file' '$dirname/$new_name.$file_ext'"
    fi
done
The find * says to run find on all items in the current directory. The -type f says you only are interested in files, and -maxdepth 1 limits the depth of the search to the immediate directory.
The ${file##*.} is using a pattern match. The ## deletes the longest left-hand match of *., which strips everything up to and including the last dot, leaving just the file extension.
The dirname="$(dirname "$file")" gets the directory name.
Note the quotes everywhere! You have to be careful about whitespace.
By the way, I echo instead of doing the actual move. I can pipe the output to a file, examine that file and make sure everything looks okay, then run that file as a shell script.
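For example (the script and file names here are hypothetical):

./gen_moves.sh > moves.sh   # generate the mv commands
less moves.sh               # eyeball them
sh moves.sh                 # then run them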
My users will be zipping up files which will look like this:
TEMPLATE1.ZIP
|--------- UnknownName
|------- index.html
|------- images
|------- image1.jpg
I want to extract this zip file as follows:
/mysite/user_uploaded_templates/myrandomname/index.html
/mysite/user_uploaded_templates/myrandomname/images/image1.jpg
My trouble is with UnknownName - I do not know what it is beforehand and extracting everything to the "base" level breaks all the relative paths in index.html
How can I extract from this ZIP file the contents of UnknownName?
Is there anything better than:
1. Extract everything
2. Detect which "new subdirectory" got created
3. mv newsubdir/* .
4. rmdir newsubdir/
If there is more than one subdirectory at UnknownName level, I can reject that user's zip file.
I think your approach is a good one. Step 2 could be improved by extracting to a newly created directory (later deleted) so that "detection" is trivial.
# Bash (minimally tested)
tempdest=$(mktemp -d)
unzip -d "$tempdest" TEMPLATE1.ZIP
dir=("$tempdest"/*)
if (( ${#dir[@]} == 1 )) && [[ -d $dir ]]
then
    # in Bash, etc., scalar $var is the same as ${var[0]}
    mv "$dir"/* /mysite/user_uploaded_templates/myrandomname
else
    echo "rejected"
fi
rm -rf "$tempdest"
The other option I can see, other than the one you suggested, is to use the unzip -j flag, which junks all paths and puts all files directly into the extraction directory. If you know for certain that each of your TEMPLATE1.ZIP files includes an index.html and *.jpg files, then you can just do something like:
destdir=/mysite/user_uploaded_templates/myrandomname
unzip -j TEMPLATE1.ZIP -d "$destdir"
mkdir "${destdir}/images"
mv "${destdir}"/*.jpg "${destdir}/images"
It's not exactly the cleanest solution but at least you don't have to do any parsing like you do in your example. I can't seem to find any option similar to patch -p# that lets you specify the path level.
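As an aside, bsdtar (if it is installed) can read zip archives and does have an analogue of patch -p#, namely --strip-components. A sketch, assuming the single-top-level-directory layout:

bsdtar -xf TEMPLATE1.ZIP --strip-components 1 -C "$destdir"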
Each zip and unzip command differs, but there's usually a way to list the file contents. From there, you can parse the output to determine the unknown directory name.
On Windows, with the 1996 Wales/Gaily/van der Linden/Rommel version, it is unzip -l.
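For example, with Info-ZIP's unzip, -Z1 prints bare member names, so the unknown top-level name can be pulled out of the listing like this (a sketch, assuming every member lives under a single top-level directory):

topdir=$(unzip -Z1 TEMPLATE1.ZIP | head -n 1 | cut -d/ -f1)
echo "Top-level directory: $topdir"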
Of course, you could simply allow unzip to extract the files to whatever directory it wants, then use mv to rename that directory to what it should be.
tempDir=temp.$$
mkdir "$tempDir"
mv "$zipFile" "$tempDir"
cd "$tempDir"
unzip "$(basename "$zipFile")"
dirs=(*/)                     # should be the only directory here
unknownDir=${dirs[0]}
mv "$unknownDir" "$whereItShouldBe"
cd ..
rm -rf "$tempDir"
It's always a good idea to create a temporary directory for these types of operations in case you end up running two instances of this command.