Get full path name of file and its size using awk - bash

I want to get the file names followed by their sizes for all files whose size is in MB or GB. This is what I have so far:
LIST=$(ls -lh -d -1 $PWD/{*,} | awk '{ print $9":"$5 }')
for i in $LIST
do
    if [[ $( echo "$i" | cut -f2 -d: | egrep "M|G" | wc -l) -ne 0 ]]
    # egrep not working, only finds M
    then
        echo "$i" >> bigfiles
    fi
done
What I am getting is:
amit@C0deDaedalus:~$ test/findbig
/home/amit/Batch:3.8M
/home/amit/Black:3.6M
What I want is:
amit@C0deDaedalus:~$ test/findbig
/home/amit/Batch File Programming.pdf:3.8M
/home/amit/Black Panther - Legend Has It ( Instrumental ).opus:3.6M
Basically, everything is working fine except that the filenames I get are not complete: only the first word is shown. I can't figure out whether the problem is in the logic or the syntax, but I think it has something to do with awk.
So, how do I get the full path names of files (with spaces in them) in the output?
I have tried the loop trick in awk, but I don't know how to fit both columns in.

You can use read and the convenient placement of the filename at the right-hand side of the ls -l listing: read puts all the "extra" fields into the final variable:
function f_getfields
{
    local perm lnk uname grp size d1 d2 d3 filename
    while read -r perm lnk uname grp size d1 d2 d3 filename
    do
        echo "$filename $size"
    done < <(ls -l)
}
f_getfields
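Building on the same read trick, here is a sketch tuned to the original question (full paths plus human-readable sizes, keeping only the M/G entries); the function name and the >> bigfiles usage are just illustrative:
function f_bigfiles
{
    local perm lnk uname grp size d1 d2 d3 filename
    # -h gives sizes like 3.8M or 1.2G; passing "$PWD"/* as arguments
    # (with -d) yields full paths and avoids the "total" header line
    while read -r perm lnk uname grp size d1 d2 d3 filename
    do
        # keep only entries whose size ends in M or G
        [[ $size == *[MG] ]] && echo "$filename:$size"
    done < <(ls -lhd "$PWD"/*)
}
f_bigfiles >> bigfiles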

The problem is due to the spaces in your file names. The for loop splits on whitespace, so the first item in your list is "/home/amit/Batch", the second item "File", and so on.
You can use a while loop instead of for, and have awk glue fields 9 through NF back together so that names with spaces survive, something like:
ls -lh -d -1 $PWD/{*,} | awk '{ name=$9; for (i=10; i<=NF; i++) name=name" "$i; print name":"$5 }' | while read LINE
do
    echo ${LINE}
    # do your stuff here
done
As an aside, if your only intention is to find large files, you may want to check out the disk usage command:
$ du -a | sort -rn | head
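And if the goal is specifically "files of a megabyte or more, with full path names", a find-based sketch avoids parsing ls output altogether (note that -printf is a GNU find feature):
# files larger than 1 MiB in the current directory, printed as path:size
find "$PWD" -maxdepth 1 -type f -size +1M -printf '%p:%s bytes\n'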

Related

How to only concatenate files with same identifier using bash script?

I have a directory with files, some have the same ID, which is given in the first part of the file name before the first underscore (always). e.g.:
S100_R1.txt
S100_R2.txt
S111_1_R1.txt
S111_R1.txt
S111_R2.txt
S333_R1.txt
I want to concatenate the files with identical IDs (and, if possible, move the original files to another dir), e.g. output:
original files (folder)
S100_merged.txt
S111_merged.txt
S333_R1.txt
Small note: I imagine that a solution might be to place all the files to be processed in a new directory, and then in a second step move the files with the appended "merged" back to the original dir, or something like this...
I am extremely new to bash scripting, so I really can't produce this code. I am used to the R language and can think of how it should work, but I can't write it.
My pitiful attempt is something like this:
while IFS= read -r -d '' id; do
cat *"$id" > "./${id%.txt}_grouped.txt"
done < <(printf '%s\0' *.txt | cut -zd_ -f1- | sort -uz)
or this:
for ((k=100;k<400;k=k+1));
do
IDList= echo "S${k}_S*.txt" | awk -F'[_.]' '{$1}'
while [ IDList${k} == IDList${k+n} ]; do
cat IDList${k}_S*.txt IDList${k+n}_S*.txt S${k}_S*.txt S${k}_S*.txt >cat/S${k}_merged.txt &;
done
Sometimes there is only one version of a file (e.g. S333_R1.txt), sometimes two (S100*), three (S111*) or more of the same.
I am prepared for harsh critique of this question because I am so far from a solution, but if someone would be willing to help me out I would greatly appreciate it!
while read line
do
    if [[ "$(find . -maxdepth 1 -name "${line}_*.txt" | wc -l)" -gt "1" ]]
    then
        cat ${line}_*.txt >> "${line}_merged.txt"
    fi
done <<< "$(for i in *_*.txt; do echo $i; done | awk -F_ '{ print $1 }')"
Search for files matching *_*.txt and run the output through awk, printing the string before the "_". Feed that into a while loop. Using find, check whether the number of files for each prefix is greater than 1, and if it is, cat the files with that prefix into a merged file.
for id in $(ls | grep -Po '^[^_]+' | uniq) ; do
    if [ $(ls ${id}_*.txt 2> /dev/null | wc -l) -gt 1 ] ; then
        # the leading _ keeps the merged file from matching ${id}_*.txt below
        cat ${id}_*.txt > _${id}_merged.txt
        mv ${id}_*.txt folder
    fi
done
# strip the temporary leading underscore again
for f in _*_merged.txt ; do
    mv ${f} ${f:1}
done
A plain bash loop with preprocessing:
# first get the list of files
find . -type f |
# then extract the prefix
sed 's#./\([^_]*\)_#\1\t&#' |
# then in a loop merge the files
while IFS=$'\t' read prefix file; do
    cat "$file" >> "${prefix}_merged.txt"
done
That script is iterative: one file at a time. To detect whether there is only one file with a given prefix, we have to look at all the files at once. So first, an awk script that joins the filenames sharing a common prefix:
find . -type f | # maybe `sort |` ?
# join filenames with common prefix
awk '{
    f=$0;                            # remember the file path
    gsub(/.*\//,"");gsub(/_.*/,"");  # extract prefix from filepath and store it in $0
    a[$0]=a[$0]" "f                  # append path, with a leading space, to an array indexed by prefix
}
# Output prefix and filenames separated by spaces.
# TBH a tab would be a better separator..
END{for (i in a) print i a[i]}
' |
# Read input separated by spaces into a bash array
while IFS=' ' read -ra files; do
    # first array element is the prefix
    prefix=${files[0]}
    unset files[0]
    # the rest are the files
    case "${#files[@]}" in
    0) echo super error; ;;
    # one file - preserve the filename
    1) cat "${files[@]}" > "$outdir"/"${files[1]}"; ;;
    # more files - use a _merged.txt suffix
    *) cat "${files[@]}" > "$outdir"/"${prefix}_merged.txt"; ;;
    esac
done
Tested on repl.
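Note that $outdir is assumed to be an existing directory; a short preamble along these lines would be needed (the directory name is hypothetical):
outdir=merged_output   # hypothetical destination directory
mkdir -p "$outdir"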
IDList= echo "S${k}_S*.txt"
Executes the command echo with the environment variable IDList exported and set to the empty string, with one argument equal to S<insert value of k here>_S*.txt.
Filename expansion (i.e. * -> list of files) is not performed inside " double quotes.
To assign the result of a command to a variable, use command substitution: var=$( something something | something )
IDList${k+n}_S*.txt
The ${var+pattern} is a parameter expansion; it does not add two variables together. It expands to pattern when var is set and to nothing when var is unset. See shell parameter expansion and my answer on ${var-pattern}, which is similar.
To add two numbers use arithmetic expansion: $((k + n)).
awk -F'[_.]' '{$1}'
A bare $1 does nothing useful here. To print the field, print it: {print $1}.
Remember to check your scripts with http://shellcheck.net
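Putting those fixes together, a minimal illustration (the values of k and n are arbitrary):
k=100; n=1
# command substitution assigns the output of the pipeline to the variable
IDList=$(echo "S${k}_S*.txt" | awk -F'[_.]' '{print $1}')  # yields S100
# arithmetic expansion adds the two numbers
next=$((k + n))                                            # yields 101
echo "$IDList $next"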
A pure bash way below. It uses only globs (no need for external commands like ls or find for this task) to enumerate the filenames, and an associative array (supported by bash since version 4.0) to count the frequency of each id. Parsing ls output to list files in bash is questionable; you may want to read ParsingLs.
#!/bin/bash
backupdir=original_files # The directory to move the original files to
declare -A count         # Associative array to hold id counts
# If it is assumed that the backup directory exists prior to the call,
# drop the line below
mkdir "$backupdir" || exit
for file in [^_]*_*; do ((++count[${file%%_*}])); done
for id in "${!count[@]}"; do
    if ((count[$id] > 1)); then
        mv "$id"_* "$backupdir"
        cat "$backupdir/$id"_* > "$id"_merged.txt
    fi
done

Bash Shellscript Column Check Error Handling

I am writing a bash shell script. I need to check, for each file, whether column $value1 contains $value2. $value1 is a column number (1, 4, 5 as examples) and $value2 is the string I am looking for ($value2 can be '03', '04', '09', etc). If the column contains $value2, move the file to an error directory. I was wondering what the best approach to this is. I was thinking awk, or is there another way?
$value1 and $value2 are stored in a config file. I have control over the format. Here's an example. The field separator is octal \036; I just depicted it with | below.
Example
$value1=5
$value2=04
Input example1.txt
example|42|udajha|llama|04
example|22|udajha|llama|02
Input example2.txt
example|22|udajha|llama|02
Result
move example1.txt to /home/user/error_directory and example2.txt stays in current directory (nothing happens)
awk can report out which files meet this condition:
awk -F"|" -v columnToSearch=$value1 -v valueToFind=$value2 '$columnToSearch==valueToFind{print FILENAME}' example1.txt example2.txt
Then you can do your mv based on that.
Example using a pipe to xargs (with smaller variable names since you get the idea by now):
awk -F"|" -v c=$value1 -v v=$value2 '$c==v{print FILENAME}' example1.txt example2.txt | xargs -I{} mv -i {} /home/user/error_directory
If you're writing a bash shell script then you can break it down by column using cut.
There are really so many options that it depends on what you want to get done.
In my experience with data, I'd use a colon rather than a pipe because it lets me avoid escaping the delimiter with the cut command.
Changing the data files to:
cat example1.txt
example:42:udajha:llama:04
example:22:udajha:llama:02
I'd write it like this (adding -x so that you can see the processing; in your own code you'd not need that):
[root@]# cat mysript.sh
#!/bin/sh -x
one=`cat example1.txt | cut -d: -f5`
two=`cat example2.txt | cut -d: -f5`
for i in $one
do
    if [ $i -eq $two ]
    then
        movethis=`grep $two example1.txt`
        echo $movethis >> /home/me/error.txt
    fi
done
cat /home/me/error.txt
cat /home/me/error.txt
[root@]# ./mysript.sh
++ cat example1.txt
++ cut -d: -f5
+ one='04
02 '
++ cat example2.txt
++ cut -d: -f5
+ two=02
+ for i in '$one'
+ '[' 04 -eq 02 ']'
+ for i in '$one'
+ '[' 02 -eq 02 ']'
++ grep 02 example1.txt
+ movethis='example:22:udajha:llama:02 '
+ echo example:22:udajha:llama:02
+ cat /home/me/error.txt
example:22:udajha:llama:02
You can use any command you like to move your content: touch, cp, mv, whatever you want to use there.

Bash script read specifc value from files of an entire folder

I have a problem creating a script that reads specific values from all the files in a folder.
I have a number of email files in a directory and I need to extract 2 specific values from each file.
After that I have to put them into a new file that looks like this:
--------------
To: value1
value2
--------------
This is what I want to do, but I don't know how to create the script:
# I am putting the name of the files into a temp file
`ls -l | awk '{print $9 }' >tmpfile`
# use for the name of a file
`date=`date +"%T"
# The first specific value from file (phone number)
var1=`cat tmpfile | grep "To: 0" | awk '{print $2 }' | cut -b -10 `
# The second specific value from file(subject)
var2=cat file | grep Subject | awk '{print $2$3$4$5$6$7$8$9$10 }'
# Put the first value in a new file on the first row
echo "To: 4"$var1"" > sms-$date
# Put the second value in the same file on the second row
echo ""$var2"" >>sms-$date
.......
and do the same for every file in the directory
I tried using while and for loops but I couldn't finalize the script.
Thank You
I've made a few changes to your script, hopefully they will be useful to you:
#!/bin/bash
for file in *; do
    var1=$(awk '/To: 0/ {print substr($2,0,10)}' "$file")
    var2=$(awk '/Subject/ {for (i=2; i<=10; ++i) s=s$i; print s}' "$file")
    date=$(date +"%T")
    outfile="sms-$date"
    # append a numeric suffix if a file with this name already exists
    i=0
    while [ -f "$outfile" ]; do outfile="sms-$date-"$((i++)); done
    echo "To: 4$var1" > "$outfile"
    echo "$var2" >> "$outfile"
done
The for loop just goes through every file in the folder that you run the script from.
I have added an additional suffix $i to the end of the file name. If no file with the same date already exists, the file is created without the suffix. Otherwise the value of $i keeps increasing until there is no file with the same name.
I'm using $( ) rather than backticks, this is just a personal preference but it can be clearer in my opinion, especially when there are other quotes about.
There's not usually any need to pipe the output of grep to awk. You can do the search in awk using the / / syntax.
I have removed the cut -b -10 and replaced it with substr($2, 0, 10), which prints the first 10 characters from column 2.
It's not much shorter but I used a loop rather than the $2$3..., I think it looks a bit neater.
There's no need for all the extra " in the two output lines.
I suggest trying the following:
#!/bin/sh
RESULT_FILE=sms-`date +"%T"`
DIR=.
fgrep -l 'To: 0' "$DIR"/* | while read FILE; do
    var1=`fgrep 'To: 0' "$FILE" | awk '{print $2 }' | cut -b -10`
    var2=`fgrep 'Subject' "$FILE" | awk '{print $2$3$4$5$6$7$8$9$10 }'`
    echo "To: 4$var1" >>"$RESULT_FILE"
    echo "$var2" >>"$RESULT_FILE"
done

Elegant way to check for equal values within an array or any given textfile

Hello, I'm fairly new to scripting and struggling with testing whether 4 lines in a text file are equal to each other; I cannot figure this out since the comparison examples all use two variables. I've come up with this:
#!/bin/sh
#check if mxf videofiles are older than 10 minutes and parse them into tclist.txt
find . -amin +10 |sed "s/^..//" >tclist.txt
#grep timecode and cut : from the output of mxfprobe and place that into variable TC
for z in $(cat tclist.txt); do TC=$(mxfprobe -i "$z" 2>&1 |grep timecode|sed "s/[^0-9]*//"|sed "s/://"|sed "s/://"|sed "s/://")
echo $TC >>offsetcheck.txt
done;
The output of offsetcheck.txt then looks like this:
10194013
10194013
10194014
10194014
How can I test whether those 4 values are equal to each other? (In this example two files have drifted by one frame.)
I've tried to place those values into an array and check them for uniqueness...
exec 10<&0
exec < offsetcheck.txt
let count=0
while read LINE; do
ARRAY[$count]=$LINE
((count++))
done
echo ${ARRAY[@]}
exec 0<&10 10<&-
if ($ARRAY !== array_unique($ARRAY))
{
echo There were duplicate values
}
... struggling with trying to test/check if 4 lines in a textfile are
equal to eachother
You could use sort and wc to count the unique values in the file. The following tells whether all the values in the file are identical:
(( $(sort -u offsetcheck.txt | wc -l) == 1 )) && echo "All values are identical" || echo "File contains differing values"
If you wanted to do the same for an array, you could say:
for i in "${ARRAY[#]}"; do echo "$i" ; done | sort -u | wc -l
to get the number of unique values in the array.
If the values in the array are guaranteed not to contain any spaces, then saying:
echo "${ARRAY[@]}" | tr ' ' '\n' | sort -u | wc -l
would suffice. (But note the if above.)
Looks to me like the whole process can be reduced to
n=$(
find . -amin +10 |
sed "s/^..//" |
xargs -I FILE mxfprobe -i "FILE" 2>&1 |
grep -h timecode |
sed 's/[^0-9]//g' |
sort -u |
wc -l
)
Then check if n == 1
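For example, the final test could reuse the arithmetic-check pattern shown earlier:
if (( n == 1 )); then
    echo "all timecodes match"
else
    echo "timecode drift detected"
fi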

BASH script - print sorted contents from all files in directory with no rep's

In the current directory there are files with names of the form "gradesXXX" (where XXX is a course number) which look like this:
ID GRADE (this line is not contained in the files)
123456789 56
213495873 84
098342362 77
. .
. .
. .
I want to write a BASH script that prints all the IDs that have a grade above a certain number, which is given as the first parameter to said script.
The requirements are that an ID must be printed once at most, and that no intermediate files are used.
I was guided to use two scripts: the first with a length of one line, and the second with a length of up to six lines (not including the "#!" line).
I'm quite lost with this one, so any suggestions will be appreciated.
Cheers.
The answer I was looking for was:
# internal script
#!/bin/bash
while read line; do
    line_split=( $line )
    if (( ${line_split[1]} > $1 )); then
        echo ${line_split[0]}
    fi
done
# external script
#!/bin/bash
cat grades* | sort -r -n -k 1 | internalScript $1 | cut -f1 -d" " | uniq
OK, a simple solution.
cat grades[0-9][0-9][0-9] | sort -nurk 2 | while read ID GRADE ; do if [ $GRADE -lt 60 ] ; then break ; fi ; echo $ID ; done | sort -u
I'm not sure why two scripts should be necessary. All in one script:
#!/bin/bash
threshold=$1
cat grades[0-9][0-9][0-9] | sort -nurk 2 | while read ID GRADE ; do if [ $GRADE -lt $threshold ] ; then break ; fi ; echo $ID ; done | sort -u
We first cat all the grade files, then sort them by grade in reverse order. The while loop breaks once a grade falls below the threshold, so only lines with higher grades get their ID printed. sort -u makes sure that every ID is printed only once.
You can use awk:
awk '{ if ($2 > 70) print $1 }' grades777
It prints the first column of every line whose second column is greater than 70. If you need to change the threshold:
N=71
awk '{ if ($2 > '$N') print $1 }' grades777
The single quotes are required to splice shell variables into the awk program. To work with all grade??? files in the current directory and remove duplicate IDs:
awk '{ if ($2 > '$N') print $1 }' grades??? | sort -u
A simple one-line solution.
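A variant using awk's -v option (as the xargs example earlier on this page does) avoids the quote splicing and keeps the program in single quotes; a sketch:
N=71
awk -v n="$N" '$2 > n { print $1 }' grades??? | sort -u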
Yet another solution:
cat grades[0-9][0-9][0-9] | awk -v MAX=70 '{ if ($2 > MAX) foo[$1]=1 }END{for (id in foo) print id }'
Append | sort -n after that if you want the IDs in sorted order.
In pure bash:
N=60
for file in /path/*; do
while read id grade; do ((grade > N)) && echo "$id"; done < "$file"
done
OUTPUT
213495873
098342362
