My bash script creates an associative array with files as keys.
declare -A fileInFolder
for f in "${zipFolder}/"*".zip"; do
read -r fileInFolder[$f] _ < <(md5sum "$f")
done
... code for removing some entry in fileInFolder ...
unzip -qqc "${!fileInFolder[#]}"
And unzip complains caution: filename not matched: for every file except the first.
The following command works without any problem:
unzip -qqc "${zipFolder}/"\*".zip"
I tried using 7z, but I didn't find a way to give it more than one zip file as input (the -ai option needs a list of files separated by newlines, if my understanding is correct...)
Ignoring the reason why the associative array stores the MD5 sums of the zip files:
As @Enrico Maria De Angelis pointed out, unzip only accepts one zip file argument per invocation, so you cannot expand the associative array's keys (the file names) into the arguments of a single unzip call.
I propose this solution:
#!/usr/bin/env bash
# You don't want to unzip the pattern name if none match
shopt -s nullglob
declare -A fileInFolder
for f in "${zipFolder}/"*".zip"; do
# Store MD5 of zip file into assoc array fileInFolder
# key: zip file name
# value: md5sum of zip file
read -r "fileInFolder[$f]" _ < <(md5sum "$f")
# Unzip file content to stdout
unzip -qqc "$f"
done | {
# Stream the for loop's stdout to the awk script
awk -f script.awk
}
Alternative implementation calling md5sum only once for all zip files
shopt -s nullglob
# Iterate the null delimited entries output from md5sum
# Reading first IFS=' ' space delimited field as sum
# and remaining of entry until null as zipname
while IFS=' ' read -r -d '' sum zipname; do
# In case md5sum file patterns has no match
# It will return the md5sum of stdin with file name -
# If so, break out of the while
[ "$zipname" = '-' ] && break
fileInFolder["$zipname"]="$sum"
# Unzip file to stdout
unzip -qqc -- "$zipname"
done < <(md5sum --zero -- "$zipFolder/"*'.zip' </dev/null) | awk -f script.awk
Foreword
This answer, which basically boils down to "you can't do this with a single unzip command", assumes you are aware that you could put unzip -qqc "$f" in the for loop you wrote in your question, and that you don't want to do that for some reason.
My answer
You are not getting an error for all files; rather, you are getting an error for all files from the second one on.
Simply try the following
unzip -qqc file1.zip file2.zip
and you will get the error
caution: filename not matched: file2.zip
which is just the one you are getting.
From unzip's man page:
SYNOPSIS
unzip [-Z] [-cflptTuvz[abjnoqsCDKLMUVWX$/:^]] file[.zip] [file(s) ...] [-x xfile(s) ...] [-d exdir]
it looks like you are only allowed to provide one zip file on the command line.
Well, actually not quite: you can specify more zip files on the command line, but to do so you have to rely on unzip's own way of interpreting its command line. This partly mimics the shell, but everything it can do is listed in the man page:
ARGUMENTS
file[.zip]
Path of the ZIP archive(s). If the file specification is a wildcard, each matching file is processed in an order determined by the
operating system (or file system). Only the filename can be a wildcard; the path itself cannot. Wildcard expressions are similar to
those supported in commonly used Unix shells (sh, ksh, csh) and may contain:
* matches a sequence of 0 or more characters
? matches exactly 1 character
[...] matches any single character found inside the brackets; ranges are specified by a beginning character, a hyphen, and an ending
character. If an exclamation point or a caret (`!' or `^') follows the left bracket, then the range of characters within the
brackets is complemented (that is, anything except the characters inside the brackets is considered a match). To specify a
verbatim left bracket, the three-character sequence ``[[]'' has to be used.
(Be sure to quote any character that might otherwise be interpreted or modified by the operating system, particularly under Unix and
VMS.) If no matches are found, the specification is assumed to be a literal filename; and if that also fails, the suffix .zip is appended. Note that self-extracting ZIP files are supported, as with any other ZIP archive; just specify the .exe suffix (if any) explicitly.
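In practice, this means you can hand unzip a single quoted pattern and let it expand the wildcard itself, which is exactly why the command in your question works. A minimal sketch (the bracket pattern is a hypothetical example):
unzip -qqc "$zipFolder/"'*.zip'          # the shell expands $zipFolder, unzip expands *
unzip -qqc "$zipFolder/"'file[12].zip'   # unzip's [...] matches file1.zip and file2.zip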
So you are technically facing the same issue you've found with 7z.
Related
I've got some files in a directory with a standard naming format. I'm looking to use a txt file containing parts of the filenames, expand each part to a full filename with *, and finally add a .gz tag for the output.
For example, for a file called 1.SNV.111-T.vcf in my directory, I have 111-T in my txt file.
#!/bin/bash
while getopts f: flag
do
case "${flag}" in
f) file=${OPTARG};;
esac
done
while IFS="" read -r p || [ -n "$p" ]
do
vcf="*${p}.vcf"
bgzip -c ${vcf} > ${vcf}.gz
done < $file
This successfully runs bgzip, but actually saves the output as:
'*111-T.vcf.gz'
So adding .gz at the end has "deactivated" the * character. As pointed out by Barmar, this is because there isn't a file in my directory called 1.SNV.111-T.vcf.gz, so the wildcard is not expanded. Please can anyone help?
I'm new to bash scripting, but I assume there must be some way to save the "absolute" value of my vcf variable, so that once it has found a match it becomes a plain string that can be used downstream? I really can't find anything online.
The problem is that wildcards are only expanded when they match an existing file. You can't use a wildcard in the filename you're trying to create.
You need to get the expanded filename into the vcf variable. You can do it this way:
vcf=$(echo *"$p.vcf")
bgzip -c "$vcf" > "$vcf.gz"
This may be a very specific case, but I know very little about bash and I need to remove "duplicate" files. I've been downloading totally legal videogame roms these past few days, and I noticed that a lot of packs have many different versions of the same game, like this:
Awesome Golf (1991).lnx
Awesome Golf (1991) [b1].lnx
Baseball Heroes (1991).lnx
Baseball Heroes (1991) [b1].lnx
Basketbrawl (1992).lnx
Basketbrawl (1992) [a1].lnx
Basketbrawl (1992) [b1].lnx
Batman Returns (1992).lnx
Batman Returns (1992) [b1].lnx
How can I make a bash script that removes the duplicates? A duplicate would be any file that has the same name, and the name would be the string before the first parenthesis. The script should parse all the files and grab their names, see which names match to detect duplicates, and remove all files except the first one (first being the first that comes up in alphabetical order).
Would you please try the following:
#!/bin/bash
dir="dir" # the directory where the rom files are located
declare -A seen # associative array to detect the duplicates
while IFS= read -r -d "" f; do # loop over filenames by assigning "f" to it
name=${f%(*} # extract the "name" by removing left paren and following characters
name=${name%.*} # remove the extension considering the case the filename doesn't have parens
name=${name%[*} # remove the left square bracket and following characters considering the case as above
name=${name%% } # remove the trailing space, if any
if (( seen[$name]++ )); then # if the name duplicates...
# remove "echo" if the output looks good
echo rm -- "$f" # then remove the file
fi
done < <(find "$dir" -type f -name "*.lnx" -print0 | sort -z -t "." -k1,1)
# sort the list of filenames in alphabetical order
Please modify the first dir= line to point at the directory which holds the rom files.
The echo command just prints the filenames to be removed as a rehearsal. If the output looks good, then remove echo and execute the real one.
[Explanation]
An associative array seen associates the extracted "name" with a
counter of appearance. If the counter is non-zero, the file is a duplicated
one and can be removed (as long as the files are properly sorted).
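As a quick standalone illustration of the post-increment test (hypothetical key):
declare -A seen
name="Batman Returns (1992)"
(( seen[$name]++ )) && echo "dup" || echo "first"   # prints "first": the counter was 0
(( seen[$name]++ )) && echo "dup" || echo "first"   # prints "dup": the counter was 1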
The -print0 option to find, the -z option to sort, and the -d "" option to read use a null character as the delimiter between filenames, so that filenames containing special characters such as whitespace, tabs, or newlines are handled correctly.
I want to take a group of files with names like 123456_1_2.mpg and turn them into 123456.mpg. How can I do this using terminal commands?
To loop over all the available files you can use a for loop over the file names of the form ??????_?_?.mpg.
To rename the files you can remove the longest suffix matching a pattern using ${MYVAR%%pattern}, without using any external command.
This said, your code should look like:
#!/bin/bash
shopt -s nullglob # do nothing if no matches found
for file in ??????_?_?.mpg; do
[[ -f $file ]] || continue # skip if not a regular file
new_file="${file%%_*}.mpg" # compose the new file name
echo mv "$file" "$new_file" # remove echo after testing
done
rename 's/_.*/.mpg/' *mpg
This replaces everything from the first underscore through the end of the name with .mpg, for all files ending in mpg.
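If your rename is the Perl-based one (which the s/// syntax implies), you can rehearse with -n first; it prints what would be renamed without touching anything:
rename -n 's/_.*/.mpg/' *mpg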
We can use grep to strip out everything but the first sequence of digits. The --interactive flag to mv will ask you if you're sure for each move, so you can make sure it's not doing anything you don't expect.
for file in *.mpg; do
mv --interactive "$file" "$(grep -o '^[0-9]\+' <<< "$file")".mpg
done
The regex ^[0-9]\+ matches one or more digits at the start of the string.
All,
I am running Bash on Solaris 10.
I have the following shell script that loops over the CSV files in a directory.
The problem with this piece of code is that it still runs the loop once even when there are no CSV files in that directory, and then calls SQLLoader.
SQLLoader then produces a log file because there is no file to process, and this is beginning to mess up my directory, filling it with log files.
for file in *.csv ;
do
echo "SQLLoader is reading : " $file
sqlldr <User>/<Password>@<DBURL>:<PORT>/<SID> control=sqlloader.ctl log=$inbox/$file.log data=$inbox/$file
done
How do I stop it from entering the loop if there are no CSV files in the $inbox directory?
Say:
shopt -s nullglob
before your for loop.
This is not the default: saying for file in *.csv when you don't have any matching files expands the pattern to the literal string *.csv.
Quoting from the documentation:
nullglob
If set, Bash allows filename patterns which match no files to expand to a null
string, rather than themselves.
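Putting it together with the loop from the question (a sketch; the connection placeholders are kept as-is):
shopt -s nullglob   # with no matching CSV files, the loop body never runs
for file in *.csv; do
    echo "SQLLoader is reading : $file"
    sqlldr <User>/<Password>@<DBURL>:<PORT>/<SID> control=sqlloader.ctl log="$inbox/$file.log" data="$inbox/$file"
done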
Use find to search for files:
for file in $(find . -name "*.csv"); do
Note that this form word-splits find's output, so it misbehaves on filenames containing whitespace.
First off, using nullglob is the correct answer if it is available. There is, however, also a POSIX-compliant option.
The pattern will be treated as literal text if there are no matches. You can catch this with a small hack:
for file in *.csv; do
[ -f "$file" ] || break
...
done
When there are no matches, file will be set to the literal string *.csv, which is not the name of a file, so -f "$file" will fail. Otherwise, file will be set in turn to the name of each file matching the pattern, and -f "$file" will succeed every time. Note this will work even if there is a file named *.csv. The drawback is that you have to make a redundant test for each existing file.
I'm after a little help with some Bash scripting (on OSX). I want to create a script that takes two parameters - source folder and target folder - and checks all files in the source hierarchy to see whether or not they exist in the target hierarchy. i.e. Given a data DVD check whether the files contained on it are already on the internal drive.
What I've come up with so far is
#!/bin/bash
if [ $# -ne 2 ]
then
echo "Usage is command sourcedir targetdir"
exit 0
fi
source="$1"
target="$2"
for f in "$( find $source -type f -name '*' -print )"
do
I'm now not sure how best to obtain the filename without its path and then check whether it exists. I am really a beginner at scripting.
Edit: The answers given so far are all very efficient in terms of compact code. However, I need to be able to look for files found anywhere within the source hierarchy, anywhere within the target hierarchy. If a file is found I would like to compare checksums and last-modified dates and report on them; if it is not found, I would like to note this. The purpose is to check whether files on external media have been uploaded to a file server.
This should give you some ideas:
#!/bin/bash
DIR1="tmpa"
DIR2="tmpb"
function sorted_contents
{
cd "$1"
find . -type f | sort
}
DIR1_CONTENTS=$(sorted_contents "$DIR1")
DIR2_CONTENTS=$(sorted_contents "$DIR2")
diff -y <(echo "$DIR1_CONTENTS") <(echo "$DIR2_CONTENTS")
In my test directories, the output was:
[user@host so]$ ./dirdiff.sh
./address-book.dat ./address-book.dat
./passwords.txt ./passwords.txt
./some-song.mp3 <
./the-holy-grail.info ./the-holy-grail.info
> ./victory.wav
./zzz.wad ./zzz.wad
If it's not clear, "some-song.mp3" was only in the first directory, while "victory.wav" was only in the second. The rest of the files were common.
Note that this only compares the file names, not the contents. If you like where this is headed, you could play with the diff options (maybe --suppress-common-lines if you want cleaner output).
But this is probably how I'd approach it -- offload a lot of the work onto diff.
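For instance, with GNU diff, adding --suppress-common-lines to the side-by-side view leaves only the files unique to either directory:
diff -y --suppress-common-lines <(echo "$DIR1_CONTENTS") <(echo "$DIR2_CONTENTS")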
EDIT: I should also point out that something as simple as:
[user@host so]$ diff tmpa tmpb
would also work:
Only in tmpa: some-song.mp3
Only in tmpb: victory.wav
... but not feel as satisfying as writing a script yourself. :-)
To list only files in $source_dir that do not exist in $target_dir:
comm -23 <(cd "$source_dir" && find .|sort) <(cd "$target_dir" && find .|sort)
You can limit it to just regular files with -type f on the find commands, etc.
The comm command (short for "common") finds lines in common between two text files and outputs three columns: lines only in the first file, lines only in the second file, and lines common to both. The numbers suppress the corresponding column, so the output of comm -23 is only the lines from the first file that don't appear in the second.
The process substitution syntax <(command) is replaced by the pathname to a named pipe connected to the output of the given command, which lets you use a "pipe" anywhere you could put a filename, instead of only stdin and stdout.
The commands in this case generate lists of files under the two directories - the cd makes the output relative to the directories being compared, so that corresponding files come out as identical strings, and the sort ensures that comm won't be confused by the same files listed in different order in the two folders.
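For example, restricted to regular files as mentioned above:
comm -23 <(cd "$source_dir" && find . -type f | sort) <(cd "$target_dir" && find . -type f | sort)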
A few remarks about the line for f in "$( find $source -type f -name '*' -print )":
Make that "$source". Always use double quotes around variable substitutions. Otherwise the result is split into words that are treated as wildcard patterns (a historical oddity in the shell parsing rules); in particular, this would fail if the value of the variable contain spaces.
You can't iterate over the output of find that way. Because of the double quotes, there would be a single iteration through the loop, with $f containing the complete output from find. Without double quotes, file names containing spaces and other special characters would trip the script.
-name '*' is a no-op, it matches everything.
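A more robust skeleton for that loop (a sketch using NUL-delimited find output, which copes with spaces, newlines, and other special characters; ${f##*/} strips the leading path to get the bare filename):
while IFS= read -r -d '' f; do
    name=${f##*/}    # filename without its path
    # ... look for "$name" under "$target", compare checksums, etc. ...
done < <(find "$source" -type f -print0)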
As far as I understand, you want to look for files by name independently of their location, i.e. you consider /dvd/path/to/somefile to be a match for /internal-drive/different/path-to/somefile. So make a list of files on each side indexed by name. You can do this by massaging the output of find a little. The code below can cope with any character in file names except newlines.
list_files () {
find . -type f -print |
sed 's:^\(.*\)/\(.*\)$:\2/\1/\2:' |
sort
}
source_files="$(cd "$1" && list_files)"
dest_files="$(cd "$2" && list_files)"
join -t / -v 1 <(echo "$source_files") <(echo "$dest_files") |
sed 's:^[^/]*/::'
The list_files function generates a list of file names with their paths and prepends the bare file name in front of each entry, so e.g. /mnt/dvd/some/dir/filename.txt (which find lists as ./some/dir/filename.txt) will appear as filename.txt/./some/dir/filename.txt. It then sorts the list.
The join command prints out lines like filename.txt/./some/dir/filename.txt when there is a file called filename.txt in the source hierarchy but not in the destination hierarchy. We finally massage its output a little since we no longer need the filename at the beginning of the line.