I'd like to write a script that will execute a command that finds every set of files that have the same last four characters.
So for example, given a directory with these files,
$ ls -1
GH010119.MP4
GH010120.MP4
GH010126.MP4
GH010127.MP4
GH020119.MP4
GH020126.MP4
GH020127.MP4
GH030119.MP4
GH030126.MP4
I'd like my script to make out these groups:
GH010119.MP4
GH020119.MP4
GH030119.MP4
GH010126.MP4
GH020126.MP4
GH030126.MP4
GH010127.MP4
GH020127.MP4
GH010120.MP4
My current solution is to make out each group manually using: find . -name "*0119*", so I'd also like to know if the script I'd have to come up with won't be overly complex in comparison....
With perl
perl -e 'for (glob("*")){$f{$1}.="$&\n" if /.*(.{4}).MP4/}print "$_\n" for (values %f)'
GH010126.MP4
GH020126.MP4
GH030126.MP4
GH010120.MP4
GH010119.MP4
GH020119.MP4
GH030119.MP4
GH010127.MP4
GH020127.MP4
I'm assuming the filenames without extension are all 8 characters and contain no newlines:
printf "%s\n" * |
sort -k1.5,1.8n |
awk '{key = substr($0,5,4)} NR==1{prev=key} prev != key {print ""} {print; prev=key}'
If the filename is not strictly 8 chars, then
for f in *; do
root=${f%%.*}
echo "${root: -4:4} $f"
done |
sort -k1,1n |
awk 'NR==1 {prev=$1} $1 != prev {print ""} {print $2; prev=$1}'
You can extract the groupings with something like
printf '%s\n' *.MP4 | sed 's/.*\(........\)$/\1/' | sort -u
With the extension .MP4 factored in, which is part of the file name no matter how you look at it, this extracts the last eight characters, and removes any duplicates.
Doing it in Awk might be a bit more efficient.
awk 'FNR == 1 { n = substr(FILENAME, length(FILENAME)-7);
if (seen[n]++ == 0) print n; nextfile }' *.MP4
Related
Using Bash, I'm wanting to get a list of email addresses from a CSV file to do a recursive grep search on it for a bunch of directories looking for a match in specific metadata XML files, and then also tallying up how many results I find for each address throughout the directory tree (i.e. updating the tally field in the same CSV file).
accounts.csv looks something like this:
updated to more accurately reflect real-world data
email,date,bar,URL,"something else",tally
address#somewhere.com,21/04/2015,1.2.3.4,https://blah.com/,"blah blah",5
something#that.com,17/06/2015,5.6.7.8,https://blah.com/,"lah yah",0
another#here.com,7/08/2017,9.10.11.12,https://blah.com/,"wah wah",1
For example, if we put address#somewhere.com in $email from the list, run
grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l
on it and then add that result to the tally column.
At the moment I can get the first column of that CSV file (minus the heading/first line) using
awk -F"," '{print $1}' accounts.csv | tail -n +2
but I'm lost how to do the looping and also the writing of the result back to the CSV file...
So for instance, with another#here.com if we run
grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l
and the result is say 17, how can I update that line to become:
another#here.com,7/08/2017,9.10.11.12,https://blah.com/,"wah wah",17
Is this possible with maybe awk or sed?
This is where I'm up to:
#!/bin/bash
# make temporary list of email addresses
awk -F"," '{print $1}' accounts.csv | tail -n +2 > emails.tmp
# loop over each
while read email; do
# count how many uploads for current email address
grep -rl "${email}" --include=\*_meta.xml --only-matching | wc -l
done < emails.tmp
XML Metadata looks something like this:
<?xml version="1.0" encoding="UTF-8"?>
<metadata>
<identifier>SomeTitleNameGoesHere</identifier>
<mediatype>audio</mediatype>
<collection>opensource_movies</collection>
<description>example <br /></description>
<subject>testing</subject>
<title>Some Title Name Goes Here</title>
<uploader>another#here.com</uploader>
<addeddate>2017-05-28 06:20:54</addeddate>
<publicdate>2017-05-28 06:21:15</publicdate>
<curation>[curator]email#address.com[/curator][date]20170528062151[/date][comment]checked for malware[/comment]</curation>
</metadata>
how to do the looping and also the writing of the result back to the CSV file
awk does the looping automatically. You can change any field by assigning to it. So to change a tally field (the 6th in each line) you would do $6 = ....
awk is a great tool for many scenarios. You probably can safe a lot of time in the future by investing some minutes in a short tutorial now.
The only non-trivial part is getting the output of grep into awk.
The following script increments each tally by the count of *_meta.xml files containing the given email address:
awk -F, -v OFS=, -v q=\' 'NR>1 {
cmd = "grep -rlFw " q $1 q " --include=\\*_meta.xml | wc -l";
cmd | getline c;
close(cmd);
$6 = c
} 1' accounts.csv
For simplicity we assume that filenames are free of linebreaks and email addresses are free of '.
To reduce possible false positives, I also added the -F and -w option to your grep command.
-F searches literal strings; without it, searching for a.b#c would give false positives for things like axb#c and a-b#c.
-w matches only whole words; without it, searching for b#c would give a false positive for ab#c. This isn't 100% safe, as a-b#c would still give a false positive, but without knowing more about the structure of your xml files we cannot fix this.
A pipeline to reduce the number of greps:
grep -rHo --include=\*_meta.xml -f <(awk -F, 'NR > 1 {print $1}' accounts.csv) \
| gawk -F, -v OFS=',' '
NR == FNR {
# store the filenames for each email
if (match($0, /^([^:]+):(.+)/, m)) tally[m[2]][m[1]]
next
}
FNR > 1 {$4 = length(tally[$1])}
1
' - accounts.csv
Here is a solution using single awk command to achieve this. This solution will be highly performant as compared to other solutions because it is scanning each XML file only once for all the email addresses found in first column of the CSV file. Also it is not invoking any external command or spawning a sub0shell anywhere.
This should work in any version of awk.
cat srch.awk
# function to escape regex meta characters
function esc(s, tmp) {
tmp = s
gsub(/[&+.]/, "\\\\&", tmp)
return tmp
}
BEGIN {FS=OFS=","}
# while processing csv file
NR == FNR {
# save escaped email address in array em skipping header row
if (FNR > 1)
em[esc($1)] = 0
# save each row in rec array
rec[++n] = $0
next
}
# this block will execute for eaxh XML file
{
# loop each email and save count of matched email in array em
# PS: gsub return no of substitutionx
for (i in em)
em[i] += gsub(i, "&")
}
END {
# print header row
print rec[1]
# from 2nd row onwards split row into columns using comma
for (i=2; i<=n; ++i) {
split(rec[i], a, FS)
# 6th column is the count of occurrence from array em
print a[1], a[2], a[3], a[4], a[5], em[esc(a[1])]
}
}
Use it as:
awk -f srch.awk accounts.csv $(find . -name '*_meta.xml') > tmp && mv tmp accounts.csv
A script that handles accounts.csv line by line and replaces the data in accounts.new.csv for comparison.
#! /bin/bash
file_old=accounts.csv
file_new=${file_old/csv/new.csv}
delimiter=","
x=1
# Copy file
cp ${file_old} ${file_new}
while read -r line; do
# Skip first line
if [[ $x -gt 1 ]]; then
# Read data into variables
IFS=${delimiter} read -r address foo bar tally somethingelse <<< ${line}
cnt=$(find . -name '*_meta.xml' -exec grep -lo "${address}" {} \; | wc -l)
# Reset tally
tally=$cnt
# Change line number $x in new file
sed "${x}s/.*/${address} ${foo} ${bar} ${tally} ${somethingelse}/; ${x}s/ /${delimiter}/g" \
-i ${file_new}
fi
((x++))
done < ${file_old}
The input and ouput:
# Input
$ find . -name '*_meta.xml' -exec cat {} \; | sort | uniq -c
2 address#somewhere.com
1 something#that.com
$ cat accounts.csv
email,foo,bar,tally,somethingelse
address#somewhere.com,bar1,foo2,-1,blah
something#that.com,bar2,foo3,-1,blah
another#here.com,bar4,foo5,-1,blah
# output
$ ./test.sh
$ cat accounts.new.csv
email,foo,bar,tally,somethingelse
address#somewhere.com,bar1,foo2,2,blah
something#that.com,bar2,foo3,1,blah
another#here.com,bar4,foo5,0,blah
I am wrote a simple script to extract text from a bunch of files (*.out) and add two lines at the beginning and a line at the end. Then I add the extracted text with another file to create a new file. The script is here.
#!/usr/bin/env bash
#A simple bash script to extract text from *.out and create another file
for f in *.out; do
#In the following line, n is a number which is extracted from the file name
n=$(echo $f | cut -d_ -f6)
t=$((2 * $n ))
#To extract the necessary text/data
grep " B " $f | tail -${t} | awk 'BEGIN {OFS=" ";} {print $1, $4, $5, $6}' | rev | column -t | rev > xyz.xyz
#To add some text as the first, second and last lines.
sed -i '1i -1 2' xyz.xyz
sed -i '1i $molecule' xyz.xyz
echo '$end' >> xyz.xyz
#To combine the extracted info with another file (ea_input.in)
cat xyz.xyz ./input_ea.in > "${f/abc.out/pqr.in}"
done
./script.sh: line 4: (ls file*.out | cut -d_ -f6: syntax error: invalid arithmetic operator (error token is ".out) | cut -d_ -f6")
How I can correct this error?
In bash, when you use:
$(( ... ))
it treats the contents of the brackets as an arithmetic expression, returning the result of the calculation, and when you use:
$( ... )
it executed the contents of the brackets and returns the output.
So, to fix your issue, it should be as simple as to replace line 4 with:
n=$(ls $f | cut -d_ -f6)
This replaces the outer double brackets with single, and removes the additional brackets around ls $f which should be unnecessary.
The arithmetic error can be avoided by adding spaces between parentheses. You are already using var=$((arithmetic expression)) correctly elsewhere in your script, so it should be easy to see why $( ((ls "$f") | cut -d_ -f6)) needs a space. But the subshells are completely superfluous too; you want $(ls "$f" | cut -d_ -f6). Except ls isn't doing anything useful here, either; use $(echo "$f" | cut -d_ -f6). Except the shell can easily, albeit somewhat clumsily, extract a substring with parameter substitution; "${f#*_*_*_*_*_}". Except if you're using Awk in your script anyway, it makes more sense to do this - and much more - in Awk as well.
Here is an at empt at refactoring most of the processing into Awk.
for f in *.out; do
awk 'BEGIN {OFS=" " }
# Extract 6th _-separated field from input filename
FNR==1 { split(FILENAME, f, "_"); t=2*f[6] }
# If input matches regex, add to array b
/ B / { b[++i] = $1 OFS $4 OFS $5 OFS $6 }
# If array size reaches t, start overwriting old values
i==t { i=0; m=t }
END {
# Print two prefix lines
print "$molecule"; print -1, 2;
# Handle array smaller than t
if (!m) m=i
# Print starting from oldest values (index i + 1)
for(j=1; j<=m; j++) {
# Wrap to beginning of array at end
if(i+j > t) i-=t
print b[i+j]; }
print "$end" }' "$f" |
rev | column -t | rev |
cat - ./input_ea.in > "${f/foo.out/bar.in}"
done
Notice also how we avoid using a temporary file (this would certainly have been avoidable without the Awk refactoring, too) and how we take care to quote all filename variables in double quotes.
The array b contains (up to) the latest t values from matching lines; we collect these into an array which is constrained to never contain more than t values by wrapping the index i back to the beginning of the array when we reach index t. This "circular array" avoids keeping too many values in memory, which would make the script slow if the input file contains many matches.
Suppose I have more than 3000 files file.gz with many lines like below. The fields are separated by commas. I want to count only the line in which the 21st field has today's date (ex:20171101).
I tried this:
awk -F',' '{if { $21 ~ "TZ=GMT+30 date '+%d-%m-%y'" } { ++count; } END { print count; }}' file.txt
but it's not working.
Using awk, something like below
awk -F"," -v toSearch="$(date '+%Y%m%d')" '$21 ~ toSearch{count++}END{print count}' file
The date '+%Y%m%d' produces the date in the format as you requested, e.g. 20170111. Then matching that pattern on the 21st field and counting the occurrence and printing it in the END clause.
Am not sure the Solaris version of grep supports the -c flag for counting the number of pattern matches, if so you can do it as
grep -c "$(date '+%Y%m%d')" file
Another solution using gnu-grep
grep -Ec "([^,]*,){20}$(date '+%Y%m%d')" file
explanation: ([^,]*,){20} means 20 fields before the date to be checked
Using awk and process substitution to uncompress a bunch of gzs and feed them to awk for analyzing and counting:
$ awk -F\, 'substr($21,1,8)==strftime("%Y%m%d"){i++}; END{print i}' * <(zcat *gz)
Explained:
substr($21,1,8) == strftime("%Y%m%d") { # if the 8 first bytes of $21 match date
i++ # increment counter
}
END { # in the end
print i # output counter
}' * <(zcat *gz) # zcat all gzs to awk
If Perl is an option, this solution works on all 3000 gzipped files:
zcat *.gz | perl -F, -lane 'BEGIN{chomp($date=`date "+%Y%m%d"`); $count=0}; $count++ if $F[20] =~ /^$date/; END{print $count}'
These command-line options are used:
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode – split input lines into the #F array. Defaults to splitting on whitespace.
-n loop around each line of the input file
-e execute the perl code
-F autosplit modifier, in this case splits on ,
BEGIN{} executes before the main loop.
The $date and $count variables are initialized.
The $date variable is set to the result of the shell command date "+%Y%m%d"
$F[20] is the 21st element in #F
If the 21st element starts with $date, increment $count
END{} executes after the main loop
Using grep and cut instead of awk and avoiding regular expressions:
cut -f21 -d, file | grep -Fc "$(date '+%Y%m%d')"
I have a file called words.txt containing a list of words. I also have a file called file.txt containing a sentence per line. I need to quickly delete any lines in file.txt that contain one of the lines from words.txt, but only if the match is found somewhere between { and }.
E.g. file.txt:
Once upon a time there was a cat.
{The cat} lived in the forest.
The {cat really liked to} eat mice.
E.g. words.txt:
cat
mice
Example output:
Once upon a time there was a cat.
Is removed because "cat" is found on those two lines and the words are also between { and }.
The following script successfully does this task:
while read -r line
do
sed -i "/{.*$line.*}/d" file.txt
done < words.txt
This script is very slow. Sometimes words.txt contains several thousand items, so the while loop takes several minutes. I attempted to use the sed -f option, which seems to allow reading a file, but I cannot find any manuals explaining how to use this.
How can I improve the speed of the script?
An awk solution:
awk 'NR==FNR{a["{[^{}]*"$0"[^{}]*}"]++;next}{for(i in a)if($0~i)next;b[j++]=$0}END{printf "">FILENAME;for(i=0;i in b;++i)print b[i]>FILENAME}' words.txt file.txt
It converts file.txt directly to have the expected output.
Once upon a time there was a cat.
Uncondensed version:
awk '
NR == FNR {
a["{[^{}]*" $0 "[^{}]*}"]++
next
}
{
for (i in a)
if ($0 ~ i)
next
b[j++] = $0
}
END {
printf "" > FILENAME
for (i = 0; i in b; ++i)
print b[i] > FILENAME
}
' words.txt file.txt
If files are expected to get too large that awk may not be able to handle it, we can only redirect it to stdout. We may not be able to modify the file directly:
awk '
NR == FNR {
a["{[^{}]*" $0 "[^{}]*}"]++
next
}
{
for (i in a)
if ($0 ~ i)
next
}
1
' words.txt file.txt
you can use grep to match 2 files like this:
grep -vf words.txt file.txt
In think that using the grep command should be way faster. By example:
grep -f words.txt -v file.txt
The f option make grep use the words.txt file as matching patterns
The v option reverse the matching, ie keeping files that do not match one of the patterns.
It doesn't solve the {} constraint, but that is easily avoidable, for example by adding the brackets to the pattern file (or in a temporary file created at runtime).
I think this should work for you:
sed -e 's/.*/{.*&.*}/' words.txt | grep -vf- file.txt > out ; mv out file.txt
This basically just modifies the words.txt file on the fly and uses it as a word file for grep.
In pure native bash (4.x):
#!/bin/env bash4
# ^-- MUST start with a /bin/bash shebang, NOT /bin/sh
readarray -t words <words.txt # read words into array
IFS='|' # use | as delimiter when expanding $*
words_re="[{].*(${words[*]}).*[}]" # form a regex matching all words
while read -r; do # for each line in file...
if ! [[ $REPLY =~ $words_re ]]; then # ...check whether it matches...
printf '%s\n' "$REPLY" # ...and print it if not.
fi
done <file.txt
Native bash is somewhat slower than awk, but this still is a single-pass solution (O(n+m), whereas the sed -i approach was O(n*m)), making it vastly faster than any iterative approach.
You could do this in two steps:
Wrap each word in words.txt with {.* and .*}:
awk '{ print "{.*" $0 ".*}" }' words.txt > wrapped.txt
Use grep with inverse match:
grep -v -f wrapped.txt file.txt
This would be particularly useful if words.txt is very large, as a pure-awk approach (storing all the entries of words.txt in an array) would require a lot of memory.
If would prefer a one-liner and would like to skip creating the intermediate file you could do this:
awk '{ print "{.*" $0 ".*}" }' words.txt | grep -v -f - file.txt
The - is a placeholder which tells grep to use stdin
update
If the size of words.txt isn't too big, you could do the whole thing in awk:
awk 'NR==FNR{a[$0]++;next}{p=1;for(i in a){if ($0 ~ "{.*" i ".*}") { p=0; break}}}p' words.txt file.txt
expanded:
awk 'NR==FNR { a[$0]++; next }
{
p=1
for (i in a) {
if ($0 ~ "{.*" i ".*}") { p=0; break }
}
}p' words.txt file.txt
The first block builds an array containing each line in words.txt. The second block runs for every line in file.txt. A flag p controls whether the line is printed. If the line matches the pattern, p is set to false. When the p outside the last block evaluates to true, the default action occurs, which is to print the line.
I am comparing two files, each having one column and n number of rows.
file 1
vincy
alex
robin
file 2
Allen
Alex
Aaron
ralph
robin
if the data of file 1 is present in file 2 it should return 1 or else 0, in a tab seprated file.
Something like this
vincy 0
alex 1
robin 1
What I am doing is
#!/bin/bash
for i in `cat file1 `
do
cat file2 | awk '{ if ($1=="'$i'") print 1 ; else print 0 }'>>binary
done
the above code is not giving me the output which I am looking for.
Kindly have a look and suggest correction.
Thank you
The simple awk solution:
awk 'NR==FNR{ seen[$0]=1 } NR!=FNR{ print $0 " " seen[$0] + 0}' file2 file1
A simple explanation: for the lines in file2, NR==FNR, so the first action is executed and we simply record that a line has been seen. In file1, the 2nd action is taken and the line is printed, followed by a space, followed by a "0" or a "1", depending on if the line was seen in file2.
AWK loves to do this kind of thing.
awk 'FNR == NR {a[tolower($1)]; next} {f = 0; if (tolower($1) in a) {f = 1}; print $1, f}' file2 file1
Swap the positions of file2 and file1 in the argument list to make file1 the dictionary instead of file2.
When FNR (the record number in the current file) and NR (the record number of all records so far) are equal, then the first file is the one being processed. Simply referencing an array element brings it into existence. This sets up the dictionary. The next instruction reads the next record.
Once FNR and NR aren't equal, subsequent file(s) are being processed and their data is looked up in the dictionary array.
The following code should do it.
Take a close look to the BEGIN and END sections.
#!/bin/bash
rm -f binary
for i in $(cat file1); do
awk 'BEGIN {isthere=0;} { if ($1=="'$i'") isthere=1;} END { print "'$i'",isthere}' < file2 >> binary
done
There are several decent approaches. You can simply use line-by-line set math:
{
grep -xF -f file1 file2 | sed $'s/$/\t1/'
grep -vxF -f file1 file2 | sed $'s/$/\t0/'
} > somefile.txt
Another approach would be to simply combine the files and use uniq -c, then just swap the numeric column with something like awk:
sort file1 file2 | uniq -c | awk '{ print $2"\t"$1 }'
The comm command exists to do this kind of comparison for you.
The following approach does only one pass and scales well to very large input lists:
#!/bin/bash
while read; do
if [[ $REPLY = $'\t'* ]] ; then
printf "%s\t0\n" "${REPLY#?}"
else
printf "%s\t1\n" "${REPLY}"
fi
done < <(comm -2 <(tr '[A-Z]' '[a-z]' <file1 | sort) <(tr '[A-Z]' '[a-z]' <file2 | sort))
See also BashFAQ #36, which is directly on-point.
Another solution, if you have python installed.
If you're familiar with Python and are interested in the solution, you only need a bit of formatting.
#/bin/python
f1 = open('file1').readlines()
f2 = open('file2').readlines()
f1_in_f2 = [int(x in f2) for x in f1]
for n,c in zip(f1, f1_in_f2):
print n,c