Adding data to line in CSV if value exists in external file - bash

Here is my sample data:
1,32425,New Zealand,number,21004
1,32425,New Zealand,number,20522
1,32434,Australia,number,1542
1,32434,Australia,number,986
1,32434,Fiji,number,1
Here is my expected output:
1,32425,New Zealand,number,21004,No
1,32425,New Zealand,number,20522,No
1,32434,Australia,number,1542,No
1,32434,Australia,number,986,No
1,32434,Fiji,number,1,Yes
Basically I am trying to append Yes/No to each line based on whether field 3 is contained in an external file. Here is what I have currently, but as I understand it grep is eating all the stdin in the while loop, so I am only getting No added to the end of each line, as the first value is not contained in the external file.
while IFS=, read -r type id country number volume
do
if grep $country externalfile.csv
then
echo "${country}"
sed 's/$/,Yes/' >> file2.csv
else
echo "${country}"
sed 's/$/,No/' >> file2.csv
fi
done < file1.csv
I added the echo "${country}" as I was trying to troubleshoot and that's how I discovered it was only parsing the first line.

Assuming there are no headers -
awk -F, 'NR==FNR{lookup[$1]=$1; next;}
{ if ( lookup[$3] == $3 ) { print $0 ",Yes" } else { print $0 ",No" } }
' externalfile.csv file2.csv
This will parse both files in one pass. NR==FNR is only true while awk reads the first file, so the lookup array is fully loaded from externalfile.csv before the data file is tested.
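Note that this assumes the country sits in the first comma-separated field (or is the whole line) of externalfile.csv - for example, a hypothetical:
$ cat externalfile.csv
Fiji
Tonga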
If you just prefer to do it in pure bash (associative arrays require bash 4+),
declare -A lookup
while read c; do lookup["$c"]="$c"; done < externalfile.csv
declare -p lookup # this is just to show you what my example loaded
declare -A lookup='([USA]="USA" [Fiji]="Fiji" )'
while IFS=, read a b c d; do
[[ -n "${lookup[$c]}" ]] && echo "$a,$b,$c,$d,Yes" || echo "$a,$b,$c,$d,No"
done < file2.csv
1,32425,New Zealand,number,21004,No
1,32425,New Zealand,number,20522,No
1,32434,Australia,number,1542,No
1,32434,Australia,number,986,No
1,32434,Fiji,number,1,Yes
No grep needed.

awk -F, -v OFS=, 'NR == FNR { ++a[$1]; next } { $(++NF) = $3 in a ? "Yes" : "No" } 1' externalfile.csv file2.csv
Here $(++NF) appends a new last field, $3 in a tests membership without creating an entry, and the trailing 1 prints the rebuilt record.

Try this:
while read -r line
do
country=`echo $line | cut -d',' -f3`
if grep "$country" externalfile.csv
then
echo "$line,Yes" >> file2.csv
else
echo "$line,No" >> file2.csv
fi
done < test.txt
You need to put $country inside double quotes, because some countries contain more than one word, for example New Zealand. You can also set the country variable more easily using the cut command. The -q flag keeps grep from printing the matching lines to stdout.
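As a quick hypothetical illustration of why the quotes matter:
$ country='New Zealand'
$ grep $country externalfile.csv
grep: Zealand: No such file or directory
$ grep -q "$country" externalfile.csv
Unquoted, the shell splits the value, so grep searches for "New" in a file named "Zealand"; quoted, grep receives the full two-word string as its pattern.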

Related

UNIX average of specific employee as per designation

This is an example of a text file to be given as input
Name,Designation,Salary
Hari,Engineer,35000
Suresh,Consultant,80000
Umesh,Engineer,45500
Maya,Analyst,50000
Guru,Consultant,100000
Sushma,Engineer,30000
Mohan,Engineer,30000
My script should be able to find the average salary for a particular designation. For example,
bash script.sh employees.txt Analyst
Then my output should be
50000
My current code, which only tries to find the average of all employees, doesn't work. I am new to shell. This is my current code:
count="$(tail -n 1 salary.txt | grep -o '^[^\s]\+')"
echo "$count"
salary="$(grep -o '[^ ]\+$' salary.txt | paste -sd+)"
echo "$salary"
echo "($salary)/$count" | bc
I get empty values as results.
This is better done in awk:
awk -F, -v dgn='Engineer' '$2 == dgn{s += $3; ++c} END{printf "%.2f\n", s/c}' file.csv
35125.00
Could you please try the following (since the OP requested a script, this version takes the input file name as its first argument and the designation whose average is needed as its second):
cat script.ksh
file="$1"
name="$2"
awk -F, -v field="$name" '{a[$2]+=$3;b[$2]++} END{for(i in a){if(i == field){print a[i]/b[i]}}}' "$file"
Now run the script as follows:
./script.ksh Input_file Analyst
50000
GNU datamash is a useful tool for calculating this kind of thing:
$ datamash -sHt, groupby 2 mean 3 < employees.txt
Combine with grep to limit it to just the title you're interested in.
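For example (a sketch - assuming GNU datamash is available and the header row is present):
$ datamash -sHt, groupby 2 mean 3 < employees.txt | grep '^Analyst'
Analyst,50000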
If you want to do this in the shell:
#!/bin/bash
file=$1
designation=$2
# code to validate user input here ...
sum=0
count=0
while IFS=, read -r n d s; do
if [[ ${designation,,} == "${d,,}" ]]; then
(( sum += s ))
(( count++ ))
fi
done < "$file"
if (( count == 0 )); then
echo "No $designation found in $file"
else
echo $((sum / count))
fi
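Assuming the script is saved as script.sh, a run should look like this (note the final step is integer division, so results are truncated toward zero):
$ bash script.sh employees.txt Analyst
50000
$ bash script.sh employees.txt Engineer
35125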
Using Perl
perl -F, -lane ' if(/Engineer/) { $dsg+=$F[2];$c++ } END { print $dsg/$c } ' file
with your given inputs
$ cat john.txt
Name,Designation,Salary
Hari,Engineer,35000
Suresh,Consultant,80000
Umesh,Engineer,45500
Maya,Analyst,50000
Guru,Consultant,100000
Sushma,Engineer,30000
Mohan,Engineer,30000
$ perl -F, -lane ' if(/Engineer/) { $dsg+=$F[2];$c++ } END { print $dsg/$c } ' john.txt
35125
$

Linux: Appending values into files, to the end of particular lines, and at the bottom of the file if there is no "key"

I have one file, file1, that has values like so:
key1|value1|
key2|value2|
key3|value3|
I have another file, file2, that has key-based values I would like to add to file1:
key2 value4
key3 value5
key4 value6
I would like to add values to the lines in file1 where the "key" matches, and if the "key" is not in file1, simply add the new key & value at the bottom:
key1|value1|
key2|value2|value4|
key3|value3|value5|
key4|value6|
It seems like this is something that could be done with 2 calls to awk, but I am not familiar enough with it. I'm also open to using bash or shell commands.
UPDATE
I found this to work (the field separator has to be switched per file so the pipe-delimited file1 splits correctly):
awk 'NR==FNR {a[$1]=$2; next} {if ($1 in a) print $0 a[$1] "|"; else print; delete a[$1]} END {for (k in a) print k "|" a[k] "|"}' file2 FS='|' file1
The only deviation from the desired output is that keys from file1 that are not in file2 are not known ahead of time, so they are printed at the end to keep things semi-online (here data1.txt stands for the pipe-delimited file1 and data2.txt for file2):
awk -v first=data1.txt -f script.awk data2.txt
BEGIN {
OLD=FS
FS="|"
while (getline < first)
table[$1] = $0
OFS=FS
FS=OLD
}
!($1 in table) {
queue[$1] = $0
}
$1 in table {
id=$1
gsub(FS, OFS)
sub(/[^|]*\|/, "")
print table[id] $0 OFS
delete table[id]
}
END {
for (id in table)
print table[id]
for (id in queue) {
gsub(FS, OFS, queue[id])
print queue[id] OFS
}
}
key2|value2|value4|
key3|value3|value5|
key1|value1|
key4|value6|
This is the LOL answer... ha ha. I basically loop over both files, keeping track of the keys, and sort at the end. Silly-ish; probably not even something you would want to use bash for, perhaps...
declare -a checked
checked=()
file="/tmp/file.txt"
> "${file}"
while IFS= read -r line1 ;do
key1=$(echo $line1 | cut -d'|' -f1)
if ! grep -qi ${key1} "/tmp/file2.txt" ; then
echo "$line1" >> "${file}"
continue
fi
while IFS= read -r line2 ;do
key2=$(echo $line2 | cut -d' ' -f1)
if ! grep -qi ${key2} "/tmp/file1.txt" ; then
if ! [[ "${checked[#]}" =~ $key2 ]] ;then
echo "$(echo $line2| awk '{print $1"|"$2}')|" >> "${file}"
checked+=(${key2})
continue
fi
fi
if [[ "$key2" == "$key1" ]] ;then
echo "${line1}$(echo $line2 | cut -d' ' -f2-)|" >> "${file}"
continue
fi
done < "/tmp/file2.txt"
done < "/tmp/file1.txt"
sort -k2 -n ${file}
[[ -f "${file}" ]] && rm -f "${file}"
Output:
key1|value1|
key2|value2|value4|
key3|value3|value5|
key4|value6|

pulling information out of a string in shell script

I am having trouble pulling out the information I need from a string in my shell script. I have read and tried to come up with the correct awk or sed command to do it, but I just can't figure it out. Hopefully you guys can help.
Lets say I have a string as follows:
["ids":2817262,"isvalid":true,"name":"somename","hasproperty":false,"ids":2262,"isvalid":false,"name":"somename","hasproperty":false,"ids":28182,"isvalid":true,"name":"somename","hasproperty":true]
Now what I want to do is pull out all of these properties into individual arrays of strings. For example:
I would like to have an array of ids 2817262 2262 28182
an array of name somename somename somename
an array of hasproperty false false true
Can anyone help me come up with the commands I need to pull this out? Also keep in mind the string will likely be much longer than this, so a solution that is not specific to three records would be helpful. Thanks so much in advance.
You could use grep:
grep -oP '"ids":\K\d+' file
The -P flag enables Perl-compatible regexes, and \K resets the match start, so -o prints only the digits that follow "ids":.
Example:
$ echo '["ids":2817262,"isvalid":true,"name":"somename","hasproperty":false,"ids":2262,"isvalid":false,"name":"somename","hasproperty":false,"ids":28182,"isvalid":true,"name":"somename","hasproperty":true]' | grep -oP '"ids":\K\d+'
2817262
2262
28182
Since the question is tagged awk (note that the three-argument form of match() used here is a GNU awk extension):
awk '{while(x=match($0,/"ids":([^,]+)/,a)){print a[1];$0=substr($0,x+RLENGTH)}}' file
This just keeps matching the next "ids" value, printing it, and trimming the line to whatever follows the match.
Output
2817262
2262
28182
You could also do this (inspired by Wintermute's comment on another answer); RS=",|]" splits the input into records at every comma or closing bracket, and a successful sub() makes the record print:
awk -v RS=",|]" 'sub(/^.*"ids":/,"")' file
The grep solution is beautiful. Your question was tagged awk. The awk solution is ugly:
echo '["ids":2817262,"isvalid":true,"name":"somename","hasproperty":false,"ids":2262,"isvalid":false,"name":"somename","hasproperty":false,"ids":28182,"isvalid":true,"name":"somename","hasproperty":true]' \
| awk '{split(substr($0,2,length($0)-2),x,",");
for(i=1;i<=length(x);i++) {split(x[i],a,":");
if(a[1]=="\"ids\"") print a[1],a[2]}}'
Output:
"ids" 2817262
"ids" 2262
"ids" 28182
Please choose the grep solution as the correct answer.
Here is a pure bash solution (long-winded, isn't it? I tend to agree with @chepner):
str='["ids":2817262,"isvalid":true,"name":"somename","hasproperty":false,
"ids":2262,"isvalid":false,"name":"somename","hasproperty":false,"ids":28182,
"isvalid":true,"name":"somename","hasproperty":true]'
#Remove [ ]
str=${str/[/}
str=${str/]/}
declare -a ids
declare -a names
declare -a properties
oldIFS="$IFS"
IFS=','
for record in $str
do
type=${record%%:*}
value=${record##*:}
if [[ $type == \"ids\" ]]
then
ids[ids_i++]="$value"
elif [[ $type == \"name\" ]]
then
names[names_i++]="$value"
elif [[ $type == \"hasproperty\" ]]
then
properties[properties_i++]="$value"
else
echo "Ignored type: '$type'" >&2
fi
done
IFS="$oldIFS"
echo "ids: ${ids[#]}"
echo "names: ${names[#]}"
echo "properties: ${properties[#]}"
The only thing going for it is that there are no child processes.
Another awk approach (FS must be set in BEGIN; the gsub calls then force $0 to be re-split on the commas):
awk 'BEGIN {
FS=","
Field = 1
Index = 0
}
{
gsub( /[][]/,"")
gsub( /"[a-z]*":/, "")
while ( Field < NF) {
ThisID[ Index]=$Field
ThisName[ Index]=$(Field + 2)
ThisProperty [ Index]=$(Field + 3)
Index+=1
Field+=4
}
}
END {
for ( Iter=0;Iter<Index;Iter+=1) printf( "%s ", ThisID[Iter])
printf "\n"
for ( Iter=0;Iter<Index;Iter++) printf( "%s ", ThisName[Iter])
printf "\n"
for ( Iter=0;Iter<Index;Iter++) printf( "%s ", ThisProperty[Iter])
printf "\n"
}' YourFile
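With the sample string saved in YourFile, this should print each collection on one line (each entry followed by a space):
2817262 2262 28182
somename somename somename
false false true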
Still another option - assign the values to arrays named after each key:
unset n
string='["ids":2817262,"isvalid":true,"name":"somename","hasproperty":false,"ids":2262,"isvalid":false,"name":"somename","hasproperty":false,"ids":28182,"isvalid":true,"name":"somename","hasproperty":true]'
while IFS=',' read -ra line
do
((n++))
for i in "${line[#]//\"/}"
do
eval ${i%:*}[$n]=${i#*:}
done
done < <(sed 's/[][]//g;s/,"ids/\n"ids/g' <<<$string)
The above will produce 4 arrays (ids, isvalid, name, hasproperty). If you don't need isvalid, use this instead:
unset n
string='["ids":2817262,"isvalid":true,"name":"somename","hasproperty":false,"ids":2262,"isvalid":false,"name":"somename","hasproperty":false,"ids":28182,"isvalid":true,"name":"somename","hasproperty":true]'
while IFS=',' read -ra line
do
((n++))
for i in "${line[#]//\"/}"
do
[ "${i%:*}" != "isvalid" ] && eval ${i/:/[$n]=}
done
done < <(sed 's/[][]//g;s/,"ids/\n"ids/g' <<<$string)
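With the sample string, either loop should leave the arrays holding (indices starting at 1):
$ echo "${ids[@]}"
2817262 2262 28182
$ echo "${hasproperty[@]}"
false false true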
Given your posted input, if all you wanted was the list of each type of item then this is all you'd need:
$ awk -v RS=, -F: '{gsub(/[[\]"\n]/,"")} /^ids/{print $2}' file
2817262
2262
28182
$ awk -v RS=, -F: '{gsub(/[[\]"\n]/,"")} /^name/{print $2}' file
somename
somename
somename
$ awk -v RS=, -F: '{gsub(/[[\]"\n]/,"")} /^hasproperty/{print $2}' file
false
false
true
$ awk -v RS=, -F: '{gsub(/[[\]"\n]/,"")} /^isvalid/{print $2}' file
true
false
true
but it's extremely unlikely that this is the right way to approach your problem. As I mentioned in a comment, edit your question to provide more information if you'd like some real help with it.

Simple bash script to split csv file by week number

I'm trying to separate a large pipe-delimited file based on a week number field. The file contains data for a full year thus having 53 weeks. I am hoping to create a loop that does the following:
1) check if the week number is less than 10 - if it is, paste a '0' in front
2) use grep to send the rows to a file (i.e. `grep '|01|' bigFile.txt > smallFile.txt`)
3) gzip the smaller file (i.e. `gzip smallFile.txt`)
4) repeat
Is there a resource that would show how to do this?
EDIT :
Data looks like this:
1|#gmail|1|0|0|0|1|01|com
1|#yahoo|0|1|0|0|0|27|com
The column I care about is the 2nd from the right.
EDIT 2:
Here's the script I'm using but it's not functioning:
for (( i = 1; i <= 12; i++ )); do
#statements
echo 'i :'$i
q=$i
# echo $q
# $q==10
if [[ q -lt 10 ]]; then
#statements
k='0'$q
echo $k
grep '|$k|' 20150226_train.txt > 'weeks_files/week'$k
gzip weeks_files/week $k
fi
if [[ q -gt 9 ]]; then
#statements
echo $q
grep \'|$q|\' 20150226_train.txt > 'weeks_files/week'$q
gzip 'weeks_files/week'$q
fi
done
Very simple in awk ...
awk -F'|' '{ print > ("smallfile-" $(NF-1) ".txt") }' bigfile.txt
Edit: brackets added for "original-awk".
You're almost there.
#!/bin/bash
for (( i = 1; i <= 12; i++ )); do
#statements
echo 'i :'$i
q=$i
# echo $q
# $q==10
#OLD if [[ q -lt 10 ]]; then
if [[ $q -lt 10 ]]; then
#statements
k='0'$q
echo $k
#OLD grep '|$k|' 20150226_train.txt > 'weeks_files/week'$k
grep "|$k|" 20150226_train.txt > 'weeks_files/week'$k
#OLD gzip weeks_files/week $k
gzip weeks_files/week$k
#OLD fi
#OLD if [[ q -gt 9 ]]; then
elif [[ $q -gt 9 ]] ; then
#statements
echo $q
#OLD grep \'|$q|\' 20150226_train.txt > 'weeks_files/week'$q
grep "|$q|" 20150226_train.txt > 'weeks_files/week'$q
gzip 'weeks_files/week'$q
fi
done
You didn't always use $ in front of your variable names. You can only get away with using k or q without a $ inside the shell arithmetic substitution feature, i.e. z=$(( x+k )), or just to operate on a variable like (( k++ )). There are others.
You need to learn the difference between single quoting and double quoting. You need to use double quoting when you want a value substituted for a variable, as in your lines
grep "|$q|" 20150226_train.txt > 'weeks_files/week'$q
and others.
I'm guessing that your use of grep \'|$q|\' 20150226_train.txt was an attempt to get the value of $q.
The way to get comfortable with debugging this sort of situation is to set the shell debugging option with set -x (turn it off with set +x). You'll see each line that is executed, with the values substituted for the variables. Advanced debugging requires print statements like echo "var of interest now = $var". Also, you can use set -vx (and set +vx) to see each line or block of code before it is executed; the -x output then shows which lines were actually executed. For your script, you'd see the whole if ... elif ... fi block printed, and then just the lines of -x output with values for variables. It can be confusing, even after years of looking at it. ;-)
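For example, a short hypothetical set -x session makes the quoting bug visible right away:
$ set -x
$ q=3; k="0$q"
+ q=3
+ k=03
$ grep '|$k|' 20150226_train.txt
+ grep '|$k|' 20150226_train.txt
$ grep "|$k|" 20150226_train.txt
+ grep '|03|' 20150226_train.txt
The single-quoted trace still shows a literal $k in the pattern, while the double-quoted line substitutes the value before grep runs.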
So you can go thru and remove all lines with the prefix #OLD, and I'm hoping your code will work for you.
IHTH
mkdir -p weeks_files &&
awk -F'|' '
{ file=sprintf("weeks_files/week%02d",$(NF-1)); print > file }
!seen[file]++ { print file }
' 20150226_train.txt |
xargs gzip
The !seen[file]++ condition prints each output filename only the first time it appears, so xargs gzips every file once awk has finished writing them.
If your data is ordered so that all of the rows for a given week number are contiguous you can make it simpler and more efficient:
mkdir -p weeks_files &&
awk -F'|' '
$(NF-1) != prev { file=sprintf("weeks_files/week%02d",$(NF-1)); print file }
{ print > file; prev=$(NF-1) }
' 20150226_train.txt |
xargs gzip
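If there were many more distinct groups, you could also close each file as the group changes, to avoid hitting open-file limits - a sketch of the same idea, still assuming sorted input:
mkdir -p weeks_files &&
awk -F'|' '
$(NF-1) != prev { if (file) close(file); file=sprintf("weeks_files/week%02d",$(NF-1)); print file }
{ print > file; prev=$(NF-1) }
' 20150226_train.txt |
xargs gzip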
There are certainly a number of approaches - the 'awk' line below will reformat your data. If you take a sequential approach, then:
1) awk to reformat
awk -F '|' '{printf "%s|%s|%s|%s|%s|%s|%s|%02d|%s\n", $1, $2, $3, $4, $5, $6, $7, $8, $9}' SOURCE_FILE > bigFile.txt
2) loop through the weeks, create each small file and zip it (the zero-padded range {01..53} needs bash 4)
for N in {01..53}
do
grep "|${N}|" bigFile.txt > smallFile.${N}.txt
gzip smallFile.${N}.txt
done
3) test script showing reformat step
#!/bin/bash
function show_data {
# Data set w/9 'fields'
# 1| 2 |3|4|5|6|7| 8|9
cat << EOM
1|#gmail|1|0|0|0|1|01|com
1|#gmail|1|0|0|0|1|2|com
1|#gmail|1|0|0|0|1|5|com
1|#yahoo|0|1|0|0|0|27|com
EOM
}
###
function stars {
echo "## $# ##"
}
###
stars "Raw data"
show_data
stars "Modified data"
# 1| 2| 3| 4| 5| 6| 7| 8|9 ##
show_data | awk -F '|' '{printf "%s|%s|%s|%s|%s|%s|%s|%02d|%s\n", $1, $2, $3, $4, $5, $6, $7, $8, $9}'
Sample run:
$ bash test.sh
## Raw data ##
1|#gmail|1|0|0|0|1|01|com
1|#gmail|1|0|0|0|1|2|com
1|#gmail|1|0|0|0|1|5|com
1|#yahoo|0|1|0|0|0|27|com
## Modified data ##
1|#gmail|1|0|0|0|1|01|com
1|#gmail|1|0|0|0|1|02|com
1|#gmail|1|0|0|0|1|05|com
1|#yahoo|0|1|0|0|0|27|com

How to pass filename through variable to be read it by awk

Good day,
I was wondering how to pass a filename to awk as a variable, so that awk reads that file.
So far I have done:
echo file1 > Aenumerar
echo file2 >> Aenumerar
echo file3 >> Aenumerar
AE=`grep -c '' Aenumerar`
r=1
while [ $r -le $AE ]; do
lista=`awk "NR==$r {print $0}" Aenumerar`
AEList=`grep -c '' $lista`
s=1
while [ $s -le $AEList ]; do
word=`awk -v var=$s 'NR==var {print $1}' $lista`
echo $word
let "s = s + 1"
done
let "r = r + 1"
done
Thanks so much in advance for any clue, or another simple way to do it from the bash command line.
Instead of:
awk "NR==$r {print $0}" Aenumerar
You need to use:
awk -v r="$r" 'NR==r' Aenumerar
With the double-quoted version the shell expands $0 (the name of the current script) before awk ever sees the program; passing the row number in with -v keeps the program safely single-quoted, and since the default action for a true condition is print $0, no explicit block is needed.
Judging by what you've posted, you don't actually need all the NR stuff; you can replace your whole script with this:
while IFS= read -r lista ; do
awk '{print $1}' "$lista"
done < Aenumerar
(This will print the first field of each line in each of file1, file2, file3. I think that's what you're trying to do?)
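For instance, with hypothetical files:
$ cat Aenumerar
file1
file2
$ cat file1
alpha 1
beta 2
$ cat file2
gamma 3
$ while IFS= read -r lista; do awk '{print $1}' "$lista"; done < Aenumerar
alpha
beta
gamma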
