Add prefix to all rows and columns efficiently - shell

My aim is to add a prefix to all rows and columns returned from an SQL query (all rows of the same column should take the same prefix). The way I am doing it at the moment is
echo "$(<my_sql_query> | awk '$0="prefixA_"$0' |
awk '$2="prefixB_"$2' |
awk '$3="prefixC_"$3' |
awk '$4="prefixD_"$4')"
The script above does exactly what I want, but what I would like to know is whether there is a faster way of doing it.

If you are happy with an echo + awk solution, you could do it in a single awk call, prefixing all the values in one pass. I am not sure what your query returns, so this assumes the fields are separated by spaces only.
echo "$<my_sql-query>" |
awk '{$0="prefixA_"$0;$2="prefixB_"$2;$3="prefixC_"$3;$4="prefixD_"$4} 1'
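For illustration, a sample run with made-up space-separated data (the input line is my assumption, since no query output was shown):
echo "id1 foo bar baz" |
awk '{$0="prefixA_"$0;$2="prefixB_"$2;$3="prefixC_"$3;$4="prefixD_"$4} 1'
prefixA_id1 prefixB_foo prefixC_bar prefixD_baz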
EDIT: Adding a generic solution here, where we can pass the field numbers and their respective prefix values, and they will be prepended to the corresponding fields. Fair warning: it has not been tested much because no samples were given.
echo "$<my_sql-query>" |
awk '
function addPrefix(fieldNumbers,fieldValues){
num=split(fieldNumbers,arr1,"#")
split(fieldValues,arr2,"#")
for(i=1;i<=num;i++){
$arr1[i]=arr2[i]$arr1[i]
}
}
addPrefix("1#2#3#4","prefixA_#prefixB_#prefixC_#prefixD_")
1'
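As a quick illustration of the generic version, here is a hypothetical run (the sample input a b c d is my own, since no data was posted); only the call line changes, here picking fields 2 and 4:
echo "a b c d" |
awk '
function addPrefix(fieldNumbers, fieldValues,   num, arr1, arr2, i) {
  num = split(fieldNumbers, arr1, "#")
  split(fieldValues, arr2, "#")
  for (i = 1; i <= num; i++) $arr1[i] = arr2[i] $arr1[i]
}
{ addPrefix("2#4", "X_#Y_") }   # prefix field 2 with X_ and field 4 with Y_
1'
which should print: a X_b c Y_d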

Related

How to catch the string which contains the highest number?

I have a variable which looks like that:
asgname='Company-DEV-API-65-ServerAutoScalingGroup-122MJNZLAAKW4 Company-DEV-API-68-ServerAutoScalingGroup-1SFNH4CSKKWA4'
I want to update the most current AutoScaling Group, which in this case is 68.
The asg names are separated by a space.
How can I catch the full asg name that contains the highest number?
With bash and GNU sort:
tr ' ' '\n' <<< $asgname | sort -V | tail -n 1
Output:
Company-DEV-API-68-ServerAutoScalingGroup-1SFNH4CSKKWA4
I assume that all strings start with Company-DEV-API-.
awk can help here too.
echo "$asgname" | awk -F"-" '{len=$4>len?$4:(len?len:$4)} END{print len}'
Output will be 68.
Explanation: echo prints the variable asgname (double-quoted), and the | (pipe) sends its standard output to awk as standard input. In the awk command the field separator is set to - (dash), so the number you want is in $4. In the main body, the variable len is compared with $4 on every record and replaced whenever $4 is larger (the (len?len:$4) part just initialises len on the first record), so at the end len holds the highest value seen. The END block then prints len.
EDIT: I just saw the edit to your question. If your variable has spaces in it, a minor change to the above code gives the desired result as well:
echo "$asgname" | awk -v RS=" " -F"-" '{len=$4>len?$4:(len?len:$4)} END{print len}'

Get all the duplicate records in a csv if a column is different

I have a csv file with column-wise data, like
EvtsUpdated,IR23488670,15920221,ESTIMATED
EvtsUpdated,IR23488676,11014018,ESTIMATED
EvtsUpdated,IR23488700,7273867,ESTIMATED
EvtsUpdated,IR23486360,7273881,ESTIMATED
EvtsUpdated,IR23488670,7273807,ESTIMATED
EvtsUpdated,IR23488670,9738420,ESTIMATED
EvtsUpdated,IR23488670,7273845,ESTIMATED
EvtsUpdated,IR23488676,12149463,ESTIMATED
and I just want to find all the duplicate rows while ignoring one column, namely column 3. The output should be like:
EvtsUpdated,IR23488670,15920221,ESTIMATED
EvtsUpdated,IR23488676,11014018,ESTIMATED
EvtsUpdated,IR23488700,7273867,ESTIMATED
EvtsUpdated,IR23488670,7273807,ESTIMATED
EvtsUpdated,IR23488670,9738420,ESTIMATED
EvtsUpdated,IR23488670,7273845,ESTIMATED
EvtsUpdated,IR23488676,12149463,ESTIMATED
I tried it by first removing column 3 (the one that may differ) and writing the result into another file using
cut --complement -f 3 -d, filename
and then tried an awk command like awk -F, '{if(FNR==NR){print}}' secondfile.
As I don't have complete knowledge of awk, I'm not able to get it working.
You can use awk arrays to store the count of each group of columns to identify duplicates.
awk -F "," '{row[$1$2$4]++ ; rec[$0","NR] = $1$2$4 }
END{ for ( key in rec ) { if (row[rec[key]] > 1) { print key } } }' filename | sort -t',' -k5 | cut -f1-4 -d','
An additional sort was required to maintain the original ordering expected in your output.
Note: in your expected output, the row with IR23488700 is shown as a duplicate even though it is not.
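An alternative that avoids the extra sort and cut is to read the file twice: the first pass counts each $1/$2/$4 combination, the second pass prints the rows whose combination occurs more than once, preserving the original order (a sketch, assuming the same comma-separated layout):
awk -F, 'NR==FNR {cnt[$1 FS $2 FS $4]++; next} cnt[$1 FS $2 FS $4] > 1' filename filename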
I did the same by first cutting out the 3rd column (the one that may be different) and then running awk '++A[$0]==2' file. Thanks for your help.

print 3 consecutive columns after a specific string from a CSV

I need to print a specific string (in my case 64) together with the 2 columns that follow it. There can be multiple instances of 64 within the same CSV row, but the next instance will not occur within 3 columns of the previous occurrence. Each instance should be output on its own line, and the output should be unique. The problem is that the specific string does not fall in the same column for all rows; every row has dynamic data and the CSV has no header. Say the input file is as below (it is just a sample; the actual file has approx 300 columns and 5 million rows):
00:TEST,123453103279586,ABC,XYZ,123,456,65,906,06149,NIL TS21,1,64,906,06149,NIL TS22,1,64,916,06149,NIL BS20,1,64,926,06149,NIL BS30,1,64,906,06149,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222
00:TEST,123458131344169,ABC,XYZ,123,456,OCCF,1,1,1,64,857,19066,NIL TS21,1,64,857,19066,NIL TS22,1,64,857,19066,NIL BS20,1,64,857,19067,NIL BS30,1,64,857,19068,NIL PSS,1,E2 EPSDATA,GRANTED,NONE,1,N,N,256000,5
00:TEST,123458131016844,ABC,XYZ,123,456,HOLD,,1,64,938,36843,NIL TS21,1,64,938,36841,NIL TS22,1,64,938,36823,NIL BS20,1,64,938,36843,NIL BS30,1,64,938,36843,NIL CAML,1,ORIG,0,TERM,00,50000,N,N,N,N
00:TEST,123453102914690,ABC,XYZ,123,456,HOLD,,1,PBS,TS11,64,938,64126,NIL TS21,1,64,938,64126,NIL TS22,1,64,938,64126,NIL BS20,1,64,938,64226,NIL BS30,1,64,938,64326,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222,2222,R
Output required (only unique entries):
64,906,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,36843
64,938,36843
64,938,64326
There are no performance-related concerns. I have searched many threads but could not find anything closely related. Please help.
We can use a pipe of two commands: the first puts each 64 at the start of a new line, and the second prints the first three columns whenever it sees a leading 64.
sed 's/,64[,\n]/\n64,/g' file | awk -F, '/^64/ { print $1 FS $2 FS $3 }'
There are ways of doing this with a single awk command, but this felt quick and easy to me.
Though the sample data from the question contains redundant lines, karakfa (see below) reminds me that the question speaks of a "unique data" requirement. This version uses the keys of an associative array to keep track of duplicate records.
sed 's/,64[,\n]/\n64,/g' file | awk -F, 'BEGIN { split("",a) } /^64/ && !((x=$1 FS $2 FS $3) in a) { a[x]=1; print x }'
gawk:
awk -F, '{for(i=0;++i<=NF;){if($i=="64")a=4;if(--a>0)s=s?s","$i:$i;if(a==1){print s;s=""}}}' file
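For readability, here is the same logic expanded with comments (this is my reading of the one-liner; note it does not de-duplicate the output):
awk -F, '{
  for (i = 0; ++i <= NF; ) {
    if ($i == "64") a = 4                 # marker found: start a 3-field countdown (64 plus the next two)
    if (--a > 0) s = s ? s "," $i : $i    # while the countdown is running, append the current field
    if (a == 1) { print s; s = "" }       # after the third field, print the trio and reset the buffer
  }
}' file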
Sed for fun
sed -n -e 's/$/,n,n,n/' -e ':a' -e 'G;s/[[:blank:],]\(64,.*\)\(\n\)$/\2\1/;s/.*\(\n\)\(64\([[:blank:],][^[:blank:],]\{1,\}\)\{2\}\)\([[:blank:],][^[:blank:],]\{1,\}\)\{3\}\([[:blank:],].*\)\{0,1\}$/\1\2\1\5/;s/^.*\n\(.*\n\)/\1/;/^64.*\n/P;s///;ta' YourFile | sort -u
This assumes columns are separated by blank space or comma.
A sort -u is needed at the end for uniqueness (it is possible in sed, but it would mean adding another "simple" action of the same kind).
awk to the rescue!
$ awk -F, '{for(i=1;i<=NF;i++)
              if($i==64)
                {k=$i FS $(++i) FS $(++i);
                 if (!a[k]++)
                   print k
                }
           }' file
64,906,06149
64,916,06149
64,926,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,64126
64,938,64226
64,938,64326
ps. your sample output doesn't match the given input.

find uniq lines in file, but ignore certain columns

So I've looked around now for a few hours but haven't found anything helpful.
I want to sort through a file that has a large number of lines formatted like
Values1, values2, values3, values4, values5, values6,
but I want to return only the lines that are uniquely related to
Values1, values2, values3, values6
That is, I have multiple instances of Values1, values2, values3, values6 whose only difference is values4, values5, and I don't want to return all of those, just one instance of the line (preferably the line with the largest values4, values5, but that's not a big deal).
I have tried using
uniq -s ##
but that doesn't work because the lengths of my values are variable.
I have also tried
sort -u -k 1,3
but that doesn't seem to work either.
Mainly my issue is that my values are variable in length. I'm not that concerned with sorting by values6, but it would be nice.
Any help would be greatly appreciated.
With awk, you can print the first time the "key" is seen:
awk '
{ key = $1 OFS $2 OFS $3 OFS $6 }
!seen[key]++
' file
The magic !seen[key]++ is an awk idiom: it is true only the first time that key is encountered, and the post-increment ensures it is not true for any subsequent encounter.
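If you do want to keep, for each key, the line with the largest values4 (the nice-to-have mentioned in the question), here is a sketch along the same lines, assuming whitespace-separated fields as above and a numeric values4; note the output order is not preserved:
awk '
{ key = $1 OFS $2 OFS $3 OFS $6 }
!(key in best) || $4+0 > val[key] { best[key] = $0; val[key] = $4+0 }
END { for (k in best) print best[k] }
' file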
alternative to awk
cut -d" " -f1-3,6 filename | sort -u
extract only required fields, sort unique
If for some reason you can't use the very clean cut method suggested by karakfa, then with a csv file as input you could use uniq -f <num>, which skips the first <num> fields for the uniqueness comparison.
Since uniq expects blanks as separators, we need to change this and also reorder the columns to meet your requirements.
sed 's/,/\t/g' textfile.csv | awk '{ print $4,$5,$1,$2,$3,$6}' | \
sort -k3 | uniq -f 2 | \
awk 'BEGIN{OFS=",";} { print $3,$4,$5,$1,$2,$6}'
This way, for each group, only the $4 and $5 values from its first line (after sorting) are kept.
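For example, with a hypothetical 3-line textfile.csv (my own sample data):
a,b,c,10,x,z
a,b,c,20,y,z
d,e,f,5,w,q
the pipeline keeps one line per $1/$2/$3/$6 group and prints:
a,b,c,10,x,z
d,e,f,5,w,q
(with GNU sort's default last-resort ordering, the 10 line sorts before the 20 line and is therefore the one kept).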

How to use AWK to grab a row in a file by a certain column

1|1001|399.00|123
1|1001|29.99|234
2|1002|98.00|345
2|1002|29.98|456
3|1003|399.00|567
4|1004|234.56|456
How would I use awk to grab all the rows with '1002' in column 2?
If I wanted to grab all the rows with '2' in the first column, I could use grep ^2, but how do I search by different columns?
The typical solution is:
awk '$2 == 1002' FS=\| input-file
You get a slightly different result with:
$2 ~ 1002, which technically satisfies your query but is probably not what you want: it does a regex match, and so will also match if the second column is "341002994".
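Two hedged variations of the same idea, in case the value comes from a shell variable or you want a regex that does not over-match (the variable name val is mine):
val=1002
awk -F'|' -v v="$val" '$2 == v' input-file     # exact comparison against a shell variable
awk -F'|' '$2 ~ /^1002$/' input-file           # anchored regex; will not match 341002994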
