Shell Script for combining 3 files - bash

I have 3 files with below data
$cat File1.txt
Apple,May
Orange,June
Mango,July
$cat File2.txt
Apple,Jan
Grapes,June
$cat File3.txt
Apple,March
Mango,Feb
Banana,Dec
I require the below output file.
$Output_file.txt
Apple,May|Jan|March
Orange,June
Mango,July|Feb
Grapes,June
Banana,Dec
The requirement is to treat the first column as the key: wherever the same column-1 value appears in more than one file, its second-column values should be joined with "|". If a column-1 value appears in only one file, that line should be printed in the output file unchanged.
I have tried doing this in a while loop, but it gets slow as the file size increases. I want a simple solution using a shell script.

This should work :
#!/bin/bash
for FRUIT in $( cat "$@" | cut -d "," -f 1 | sort | uniq )
do
echo -ne "${FRUIT},"
awk -F "," "\$1 == \"$FRUIT\" {printf(\"%s|\",\$2)}" "$@" | sed 's/.$/\'$'\n/'
done
Run it as :
$ ./script.sh File1.txt File2.txt File3.txt

A purely native-bash solution (calling no external tools, and thus limited only by the performance constraints of bash itself) might look like:
#!/usr/bin/env bash
case $BASH_VERSION in ''|[123].*) echo "ERROR: Bash 4 or newer required" >&2; exit 1;; esac
declare -A items=( )
for file in "$@"; do
    while IFS=, read -r key value; do
        items[$key]+="|$value"
    done <"$file"
done
for key in "${!items[@]}"; do
    value=${items[$key]}
    printf '%s,%s\n' "$key" "${value#'|'}"
done
...called as ./yourscript File1.txt File2.txt File3.txt
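Note that bash associative arrays do not remember insertion order, so the script above prints keys in an arbitrary order. If the first-seen order from the question's expected output matters, a minimal sketch (assuming bash 4 or newer; merge_in_order is a hypothetical helper name, not part of the answer above) could keep a second, indexed array of keys:

```shell
# Sketch only: merge "key,value" files, keeping keys in first-seen order.
# Requires bash 4+ for associative arrays. merge_in_order is a made-up name.
merge_in_order() {
    local -A items=()   # key -> "|"-joined values
    local -a order=()   # keys in the order they first appeared
    local file key value
    for file in "$@"; do
        while IFS=, read -r key value; do
            [[ -n $key ]] || continue             # skip blank lines
            if [[ -z ${items[$key]+x} ]]; then    # first time this key is seen
                order+=("$key")
                items[$key]=$value
            else
                items[$key]+="|$value"
            fi
        done <"$file"
    done
    for key in "${order[@]}"; do
        printf '%s,%s\n' "$key" "${items[$key]}"
    done
}
```

It would be called as merge_in_order File1.txt File2.txt File3.txt.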

This is fairly easily done with a single awk command:
awk 'BEGIN{FS=OFS=","} {a[$1] = a[$1] (a[$1] == "" ? "" : "|") $2}
END {for (i in a) print i, a[i]}' File{1,2,3}.txt
Orange,June
Banana,Dec
Apple,May|Jan|March
Grapes,June
Mango,July|Feb
If you want output in the same order as strings appear in original files then use this awk:
awk 'BEGIN{FS=OFS=","} !($1 in a) {b[++n] = $1}
{a[$1] = a[$1] (a[$1] == "" ? "" : "|") $2}
END {for (i=1; i<=n; i++) print b[i], a[b[i]]}' File{1,2,3}.txt
Apple,May|Jan|March
Orange,June
Mango,July|Feb
Grapes,June
Banana,Dec

Related

UNIX average of specific employee as per designation

This is an example of a text file to be given as input
Name,Designation,Salary
Hari,Engineer,35000
Suresh,Consultant,80000
Umesh,Engineer,45500
Maya,Analyst,50000
Guru,Consultant,100000
Sushma,Engineer,30000
Mohan,Engineer,30000
My code should find the average salary for a particular designation. For example,
bash script.sh employees.txt Analyst
Then my output should be
50000
My current code to find just the average of all employees doesn't work. I am new to shell. This is my current code
count="$(tail -n 1 salary.txt | grep -o '^[^\s]\+')"
echo "$count"
salary="$(grep -o '[^ ]\+$' salary.txt | paste -sd+)"
echo "$salary"
echo "($salary)/$count" | bc
I get empty values as results.
This is better done in awk:
awk -F, -v dgn='Engineer' '$2 == dgn{s += $3; ++c} END{printf "%.2f\n", s/c}' file.csv
35125.00
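The one-liner above divides by zero when no row matches the designation, and it only skips the header row by accident (the header's second field is "Designation"). A hedged sketch that takes the file and designation as arguments, skips the header explicitly, and returns a non-zero status when nothing matches (avg_salary is a hypothetical helper name):

```shell
# Sketch only: avg_salary FILE DESIGNATION
# Prints the mean of column 3 for rows whose column 2 equals DESIGNATION.
avg_salary() {
    awk -F, -v dgn="$2" '
        NR > 1 && $2 == dgn { sum += $3; count++ }   # NR > 1 skips the header row
        END {
            if (count == 0) exit 1                   # no matching rows: avoid dividing by zero
            printf "%.2f\n", sum / count
        }
    ' "$1"
}
```

For example, avg_salary employees.txt Engineer, and the caller can test the exit status to detect a designation that does not occur.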
Could you please try the following? Since the OP asked for a script, this takes the input file name as the 1st argument and the designation whose average is needed as the 2nd argument.
cat script.ksh
file="$1"
name="$2"
awk -F, -v field="$name" '{a[$2]+=$3;b[$2]++} END{for(i in a){if(i == field){print a[i]/b[i]}}}' "$file"
Now run the script as follows.
./script.ksh Input_file Analyst
50000
GNU datamash is a useful tool for calculating this kind of thing:
$ datamash -sHt, groupby 2 mean 3 < employees.txt
Combine with grep to limit it to just the title you're interested in.
If you want to do this in the shell:
#!/bin/bash
file=$1
designation=$2
# code to validate user input here ...
sum=0
count=0
while IFS=, read -r n d s; do
if [[ ${designation,,} == "${d,,}" ]]; then
(( sum += s ))
(( count++ ))
fi
done < "$file"
if (( count == 0 )); then
echo "No $designation found in $file"
else
echo $((sum / count))
fi
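One thing to keep in mind with the shell version: $((sum / count)) is integer division, so any fractional part is dropped. If two decimals are wanted without calling awk or bc, a rough sketch using scaled integer arithmetic (fixed_avg is a hypothetical helper; it assumes non-negative integers and a count greater than zero):

```shell
# Sketch only: fixed_avg SUM COUNT
# Prints SUM/COUNT rounded to two decimals using integer arithmetic only.
fixed_avg() {
    # Work in hundredths; adding COUNT/2 before dividing rounds to nearest.
    scaled=$(( ($1 * 100 + $2 / 2) / $2 ))
    printf '%d.%02d\n' $(( scaled / 100 )) $(( scaled % 100 ))
}
```

In the script above, the last echo could then be replaced by fixed_avg "$sum" "$count".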
Using Perl
perl -F, -lane ' if(/Engineer/) { $dsg+=$F[2];$c++ } END { print $dsg/$c } ' file
with your given inputs
$ cat john.txt
Name,Designation,Salary
Hari,Engineer,35000
Suresh,Consultant,80000
Umesh,Engineer,45500
Maya,Analyst,50000
Guru,Consultant,100000
Sushma,Engineer,30000
Mohan,Engineer,30000
$ perl -F, -lane ' if(/Engineer/) { $dsg+=$F[2];$c++ } END { print $dsg/$c } ' john.txt
35125
$

Script returned '/usr/bin/awk: Argument list too long' in using -v in awk command

Here is the part of my script that uses awk.
ids=`cut -d ',' -f1 $file | sed ':a;N;$!ba;s/\n/,/g'`
awk -vdata="$ids" -F',' 'NR > 1 {if(index(data,$2)>0){print $0",true"}else{print $0",false"}}' $input_file >> $output_file
This works perfectly, but when I tried to get data to two or more files like this.
ids=`cut -d ',' -f1 $file1 $file2 $file3 | sed ':a;N;$!ba;s/\n/,/g'`
It returned this error.
/usr/bin/awk: Argument list too long
As I researched, it was not caused by the number of files, but the number of ids fetched.
Does anybody have an idea on how to solve this? Thanks.
You could use an environment variable to pass the data to awk. In awk the environment variables are accessible via an array ENVIRON.
So try something like this:
export ids=`cut -d ',' -f1 $file | sed ':a;N;$!ba;s/\n/,/g'`
awk -F',' 'NR > 1 {if(index(ENVIRON["ids"],$2)>0){print $0",true"}else{print $0",false"}}' $input_file >> $output_file
Change the way you generate your ids so they come out one per line, like this, which I use as a very simple way to generate ids 2,3 and 9:
echo 2; echo 3; echo 9
2
3
9
Now pass that as the first file to awk and your $input_file as the second file to awk:
awk '...' <(echo 2; echo 3; echo 9) "$input_file"
In bash you can generate a pseudo-file with the output of a process using <(some commands), and that is what I am using.
Now, in your awk, pick up the ids from the first file like this:
awk 'FNR==NR{ids[$1]++;next}' <(echo 2; echo 3; echo 9)
which will set ids[2]=1, ids[3]=1 and ids[9]=1.
Then pass both your files and add in your original processing:
awk 'FNR==NR{ids[$1]++;next} {if($2 in ids) print $0",true"; else print $0",false"}' <(echo 2; echo 3; echo 9) "$input_file"
So, for my final answer, your entire code will look like:
awk 'FNR==NR{ids[$1]++;next} {if($2 in ids) print $0",true"; else print $0",false"}' <(cut ... file1 file2 file3 | sed ...) "$input_file"
As @hek2mgl alludes in the comments, you can likely just pass the files which include the ids to awk "as is" and let awk find the ids itself rather than using cut and sed. If there are many, you can make them all come to awk as the first file with:
awk '...' <(cat file1 file2 file3) "$input_file"
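If process substitution is not available (it is a bash feature, not plain sh), the same two-file idea can be sketched with a pipe, using "-" so awk reads the ids from standard input as its first file. The ids then travel through the pipe instead of the argument list, which is what avoids the "Argument list too long" error. This is only a sketch: mark_ids is a hypothetical helper name, and it assumes the id list is not empty (the FNR==NR test misfires on an empty first input):

```shell
# Sketch only: mark_ids INPUT_FILE ID_FILE...
# Appends ",true" or ",false" to each data row of INPUT_FILE depending on
# whether its 2nd field occurs in column 1 of any ID_FILE.
mark_ids() {
    input_file=$1
    shift
    cut -d, -f1 "$@" |
        awk -F, '
            FNR == NR { ids[$1]; next }   # first input (stdin): collect the ids
            FNR > 1   { print $0 "," (($2 in ids) ? "true" : "false") }   # skip header row
        ' - "$input_file"
}
```

It would be called as mark_ids "$input_file" "$file1" "$file2" "$file3" >> "$output_file".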
There are 2 problems in your script:
awk -vdata="$ids" -F',' 'NR > 1 {if(index(data,$2)>0){print $0",true"}else{print $0",false"}}' $input_file >> $output_file
that could be causing that error:
-vdata=.. - that is gawk-specific, in other awks you need to leave a space between -v and data=. So if you aren't running gawk then idk what your awk will make of that statement but it might treat it as multiple args.
$input_file - you MUST quote shell variables unless you have a specific purpose in mind by leaving them unquoted. If $input_file contains globbing chars or spaces then you leaving it unquoted will cause them to be expanded into potentially multiple files/args.
So try this:
awk -v data="$ids" -F',' 'NR > 1 {if(index(data,$2)>0){print $0",true"}else{print $0",false"}}' "$input_file" >> "$output_file"
and see if you still have the problem. Your script does have other unrelated issues of course, some of which have already been pointed out, and you can post a followup question if you want help with those, but just FYI that awk script could be written more concisely as:
awk -v data="$ids" 'BEGIN{FS=OFS=","} NR > 1{print $0, (index(data,$2) ? "true" : "false")}' "$input_file" >> "$output_file"
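One of those other issues is worth spelling out: index(data,$2) is a substring test, so an id of 2 is reported as present when the list only contains 12. A small sketch of one way around that, wrapping both the list and the field in the delimiter before searching (the sample ids and rows here are made up for illustration):

```shell
# Sketch only: exact id matching against a comma-separated list.
# ids and the sample rows are invented for the demonstration.
ids='12,345'
result=$(printf '%s\n' 'name,id' 'a,2' 'b,12' 'c,34' |
    awk -v data="$ids" 'BEGIN { FS = OFS = "," }
        # Search for ",id," inside ",list," so only whole ids can match.
        NR > 1 { print $0, (index("," data ",", "," $2 ",") ? "true" : "false") }')
printf '%s\n' "$result"
```

With plain index(data,$2), rows a and c would both come out as true because "2" and "34" occur inside "12" and "345".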

How can I specify a row in awk in for loop?

I'm using the following awk command:
my_command | awk -F "[[:space:]]{2,}+" 'NR>1 {print $2}' | egrep "^[[:alnum:]]"
which successfully returns my data like this:
fileName1
file Name 1
file Nameone
f i l e Name 1
So as you can see some file names have spaces. This is fine as I'm just trying to echo the file name (nothing special). The problem is calling that specific row within a loop. I'm trying to do it this way:
i=1
for num in $rows
do
fileName=$(my_command | awk -F "[[:space:]]{2,}+" 'NR==$i {print $2}' | egrep "^[[:alnum:]])"
echo "$num $fileName"
$((i++))
done
But my output is always null
I've also tried using awk -v record=$i and then printing $record but I get the below results.
f i l e Name 1
EDIT
Sorry for the confusion: rows is a variable that list ids like this 11 12 13
and each one of those ids ties to a file name. My command without doing any parsing looks like this:
id  File Info    OS
11  File Name1   OS1
12  Fi leNa me2  OS2
13  FileName 3   OS3
I can only use the id field to run a the command that I need, but I want to use the File Info field to notify the user of the actual File that the command is being executed against.
I think your $i does not expand as expected. You should quote your arguments this way:
fileName=$(my_command | awk -F "[[:space:]]{2,}+" "NR==$i {print \$2}" | egrep "^[[:alnum:]]")
And you forgot the other ).
EDIT
As an update to your requirement you could just pass the rows to a single awk command instead of a repetitive one inside a loop:
#!/bin/bash
ROWS=(11 12)
function my_command {
    # This function just emulates my_command and should be removed later.
    echo " id  File Info    OS
11  File Name1   OS1
12  Fi leNa me2  OS2
13  FileName 3   OS3"
}
awk -- '
BEGIN {
    input = ARGV[1]
    while ((getline line < input) > 0) {
        sub(/^ +/, "", line)
        # Columns are separated by two or more spaces.
        split(line, a, /[ ][ ]+/)
        for (i = 2; i < ARGC; ++i) {
            if (a[1] == ARGV[i]) {
                printf "%s %s\n", a[1], a[2]
                break
            }
        }
    }
    exit
}
' <(my_command) "${ROWS[@]}"
That awk command could be condensed to one line as:
awk -- 'BEGIN { input = ARGV[1]; while ((getline line < input) > 0) { sub(/^ +/, "", line); split(line, a, /[ ][ ]+/); for (i = 2; i < ARGC; ++i) { if (a[1] == ARGV[i]) { printf "%s %s\n", a[1], a[2]; break } } } exit }' <(my_command) "${ROWS[@]}"
Or better yet just use Bash instead as a whole:
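#!/bin/bash
shopt -s extglob    # needed for the +( ) pattern below
ROWS=(11 12)
while IFS= read -r LINE; do
    # Turn every run of two or more spaces into a single "|" separator.
    IFS='|' read -ra FIELDS <<< "${LINE// +( )/|}"
    for R in "${ROWS[@]}"; do
        if [[ ${FIELDS[0]} == "$R" ]]; then
            echo "${R} ${FIELDS[1]}"
            break
        fi
    done
done < <(my_command)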
It should give an output like:
11 File Name1
12 Fi leNa me2
Shell variables aren't expanded inside single-quoted strings. Use the -v option to set an awk variable to the shell variable:
fileName=$(my_command | awk -v i="$i" -F "[[:space:]]{2,}+" 'NR==i {print $2}' | egrep "^[[:alnum:]]")
This method avoids having to escape all the $ characters in the awk script, as required in konsolebox's answer.
As you already heard, you need to populate an awk variable from your shell variable to be able to use the desired value within the awk script, so this:
awk -F "[[:space:]]{2,}+" 'NR==$i {print $2}' | egrep "^[[:alnum:]]"
should be this:
awk -v i="$i" -F "[[:space:]]{2,}+" 'NR==i {print $2}' | egrep "^[[:alnum:]]"
Also, though, you don't need awk AND grep since awk can do anything grep can do, so you can change this part of your script:
awk -v i="$i" -F "[[:space:]]{2,}+" 'NR==i {print $2}' | egrep "^[[:alnum:]]"
to this:
awk -v i="$i" -F "[[:space:]]{2,}+" '(NR==i) && ($2~/^[[:alnum:]]/){print $2}'
and you don't need a + after a numeric range so you can change {2,}+ to just {2,}:
awk -v i="$i" -F "[[:space:]]{2,}" '(NR==i) && ($2~/^[[:alnum:]]/){print $2}'
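To see the -v technique end to end, here is a small self-contained sketch. my_command below is only a stand-in that emulates the real command's output, row_name is a hypothetical helper, and the separator is written as '[ ][ ]+' (two or more spaces) on the assumption that some awk builds lack interval expressions such as {2,}:

```shell
# Sketch only: my_command emulates the real command; row_name is a made-up helper.
my_command() {
    # printf reuses the format for each group of three arguments (one row each).
    printf '%-4s%-14s%s\n' id 'File Info' OS  11 'File Name1' OS1 \
        12 'Fi leNa me2' OS2  13 'FileName 3' OS3
}

# row_name N: print the File Info column of output line N (line 1 is the header).
row_name() {
    my_command | awk -v i="$1" -F '[ ][ ]+' 'NR == i { print $2 }'
}

row_name 2
row_name 3
```

Because i is a real awk variable here, the awk program can stay in single quotes and $2 needs no escaping.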
Most importantly, though, instead of invoking awk once for every invocation of my_command, you can just invoke it once for all of them, i.e. instead of this (assuming this does what you want):
i=1
for num in $rows
do
    fileName=$(my_command | awk -v i="$i" -F "[[:space:]]{2,}" '(NR==i) && ($2~/^[[:alnum:]]/){print $2}')
    echo "$num $fileName"
    ((i++))
done
you can do something more like this:
for num in $rows
do
    my_command
done |
awk -F '[[:space:]]{2,}' '$2~/^[[:alnum:]]/{print NR, $2}'
I say "something like" because you don't tell us what "my_command", "rows" or "num" are so I can't be precise but hopefully you see the pattern. If you give us more info we can provide a better answer.
It's pretty inefficient to rerun my_command (and awk) every time through the loop just to extract one line from its output. Especially when all you're doing is printing out part of each line in order. (I'm assuming that my_command really is exactly the same command and produces the same output every time through your loop.)
If that's the case, this one-liner should do the trick:
paste -d' ' <(printf '%s\n' $rows) <(my_command |
awk -F '[[:space:]]{2,}' 'NR > 1 && ($2 ~ /^[[:alnum:]]/) {print $2}')

How can I print the duplicates in a file only once?

I have an input file that contains:
123,apple,orange
123,pineapple,strawberry
543,grapes,orange
790,strawberry,apple
870,peach,grape
543,almond,tomato
123,orange,apple
I want the output to be:
The following numbers are repeated:
123
543
Is there a way to get this output using awk? I'm writing the script in bash on Solaris.
sed -e 's/,/ , /g' <filename> | awk '{print $1}' | sort | uniq -d
awk -vFS=',' \
'{KEY=$1;if (KEY in KEYS) { DUPS[KEY]; }; KEYS[KEY]; } \
END{print "Repeated Keys:"; for (i in DUPS){print i} }' \
< yourfile
There are solutions with sort/uniq/cut as well (see above).
If you can live without awk, you can use this to get the repeating numbers:
cut -d, -f 1 my_file.txt | sort | uniq -d
Prints
123
543
Edit: (in response to your comment)
You can buffer the output and decide if you want to continue. For example:
out=$(cut -d, -f 1 a.txt | sort | uniq -d | tr '\n' ' ')
if [[ -n $out ]] ; then
echo "The following numbers are repeated: $out"
exit
fi
# continue...
This script prints only the numbers in the first column that appear more than once:
awk -F, '{a[$1]++}END{printf "The following numbers are repeated: ";for (i in a) if (a[i]>1) printf "%s ",i; print ""}' file
Or in a bit shorter form:
awk -F, 'BEGIN{printf "Repeated "}(a[$1]++ == 1){printf "%s ", $1}END{print ""} ' file
If you want your script to exit when a dup is found, you can have awk exit with a non-zero exit code. For example:
awk -F, 'a[$1]++==1{dup=1}END{if (dup) {printf "The following numbers are repeated: ";for (i in a) if (a[i]>1) printf "%s ",i; print "";exit(1)}}' file
In your main script you can do:
awk -F, 'a[$1]++==1{dup=1}END{if (dup) {printf "The following numbers are repeated: ";for (i in a) if (a[i]>1) printf "%s ",i; print "";exit(-1)}}' file || exit -1
Or in a more readable format:
awk -F, '
a[$1]++==1{
dup=1
}
END{
if (dup) {
printf "The following numbers are repeated: ";
for (i in a)
if (a[i]>1)
printf "%s ",i;
print "";
exit(-1)
}
}
' file || exit -1
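Since shell exit statuses are limited to the range 0 to 255, exit(-1) ends up being reported as 255. A slightly tidier sketch of the same idea, using a plain exit status of 1 and listing the repeated keys in the order they are first seen repeating (check_dups is a hypothetical helper name):

```shell
# Sketch only: check_dups FILE
# Prints the repeated column-1 values and returns 1 if any exist, else returns 0.
check_dups() {
    awk -F, '
        seen[$1]++ == 1 { dups = dups " " $1 }   # second sighting: record the key once
        END {
            if (dups != "") {
                print "The following numbers are repeated:" dups
                exit 1
            }
        }
    ' "$1"
}
```

In the main script this could be used as check_dups file || exit 1.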

put awk code in bash and sort the result

I have awk code that combines 2 files and appends the result to the end of file.txt using ">>".
My code:
NR==FNR && $2!=0 {two[$0]++;j=1; next }{for(i in two) {split(i,one,FS); if(one[3] == $NF){x=$4;sub( /[[:digit:]]/, "A", $4); print j++,$1,$2,$3,x,$4 | "column -t" ">>" "./Desktop/file.txt"}}}
I want to put my awk code in a bash script, then finally sort file.txt and save the sorted result back to file.txt using ">".
I tried this:
#!/bin/bash
command=$(awk '{NR==FNR && $2!=0 {two[$0]++;j=1; next }{for(i in two) {split(i,one,FS); if(one[3] == $NF){x=$4;sub( /[[:digit:]]/, "A", $4); print $1,$2,$3,$4 | "column -t" ">>" "./Desktop/file.txt"}}}}')
echo -e "$command" | column -t | sort -s -n -k4 > ./Desktop/file.txt
But it gives me the error "for reading (no such a file or directory)".
Where is my mistake?
Thanks in advance
1) you aren't specifying the input files for your awk script. This:
command=$(awk '{...stuff...}')
needs to be:
command=$(awk '{...stuff...}' file1 file2)
2) You moved your awk condition "NR == ..." inside the action part so it no longer behaves as a condition.
3) Your awk script output is going into "file.txt" so "command" is empty when you echo it on the subsequent line.
4) You have unused variables x and j
5) You pass the arg FS to split() unnecessarily.
etc...
I THINK what you want is:
command=$( awk '
NR==FNR && $2!=0 { two[$0]++; next }
{
for(i in two) {
split(i,one)
if(one[3] == $NF) {
sub(/[[:digit:]]/, "A", $4)
print $1,$2,$3,$4
}
}
}
' file1 file2 )
echo -e "$command" | column -t >> "./Desktop/file.txt"
echo -e "$command" | column -t | sort -s -n -k4 >> ./Desktop/file.txt
but it's hard to tell.
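On the "sort file.txt and save it back to file.txt" part of the question: redirecting with > onto the same file that sort is reading does not work, because the shell truncates the file before sort gets to read it. A minimal sketch, reusing the sort flags from the question, that lets sort write the result itself with -o (sort_in_place is a hypothetical helper name):

```shell
# Sketch only: sort_in_place FILE
# Stable numeric sort on column 4, written back to the same file.
# sort reads all of its input before it opens the -o output file,
# so naming the same file twice is safe.
sort_in_place() {
    sort -s -n -k4 -o "$1" "$1"
}
```

It would be called as sort_in_place ./Desktop/file.txt once all the appends are done.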

Resources