I need to print 2 columns after specific string (in my case it is 64). There can be multiple instances of 64 within same CSV row, however next instance will not occur within 3 columns of previous occurrence. Output of each instance should be in next line and unique. The problem is, the specific string does not fall in same column for all rows. All row is having kind of dynamic data and there is no header for CSV. Let say, below is input file (its just a sample, actual file is having approx 300 columns & 5 Million raws):
00:TEST,123453103279586,ABC,XYZ,123,456,65,906,06149,NIL TS21,1,64,906,06149,NIL TS22,1,64,916,06149,NIL BS20,1,64,926,06149,NIL BS30,1,64,906,06149,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222
00:TEST,123458131344169,ABC,XYZ,123,456,OCCF,1,1,1,64,857,19066,NIL TS21,1,64,857,19066,NIL TS22,1,64,857,19066,NIL BS20,1,64,857,19067,NIL BS30,1,64,857,19068,NIL PSS,1,E2 EPSDATA,GRANTED,NONE,1,N,N,256000,5
00:TEST,123458131016844,ABC,XYZ,123,456,HOLD,,1,64,938,36843,NIL TS21,1,64,938,36841,NIL TS22,1,64,938,36823,NIL BS20,1,64,938,36843,NIL BS30,1,64,938,36843,NIL CAML,1,ORIG,0,TERM,00,50000,N,N,N,N
00:TEST,123453102914690,ABC,XYZ,123,456,HOLD,,1,PBS,TS11,64,938,64126,NIL TS21,1,64,938,64126,NIL TS22,1,64,938,64126,NIL BS20,1,64,938,64226,NIL BS30,1,64,938,64326,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222,2222,R
Output required(only unique entries):
64,906,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,36843
64,938,36843
64,938,64326
There is no performance related concerns. I have tried to search many threads but could not get anything near related. Please help.
We can use a pipe of two commands... first to put the 64's leading on a line and a second to print first three columns if we see a leading 64.
sed 's/,64[,\n]/\n64,/g' | awk -F, '/^64/ { print $1 FS $2 FS $3 }'
There are ways of doing this with a single awk command, but this felt quick and easy to me.
Though the sample data from the question contains redundant lines, karakfa (see below) reminds me that the question speaks of a "unique data" requirement. This version uses the keys of an associative array to keep track of duplicate records.
sed 's/,64[,\n]/\n64,/g' | awk -F, 'BEGIN { split("",a) } /^64/ && !((x=$1 FS $2 FS $3) in a) { a[x]=1; print x }'
gawk:
awk -F, '{for(i=0;++i<=NF;){if($i=="64")a=4;if(--a>0)s=s?s","$i:$i;if(a==1){print s;s=""}}}' file
Sed for fun
sed -n -e 's/$/,n,n,n/' -e ':a' -e 'G;s/[[:blank:],]\(64,.*\)\(\n\)$/\2\1/;s/.*\(\n\)\(64\([[:blank:],][^[:blank:],]\{1,\}\)\{2\}\)\([[:blank:],][^[:blank:],]\{1,\}\)\{3\}\([[:blank:],].*\)\{0,1\}$/\1\2\1\5/;s/^.*\n\(.*\n\)/\1/;/^64.*\n/P;s///;ta' YourFile | sort -u
assuming column are separated by blank space or comma
need a sort -u for uniq (possible in sed but a new "simple" action of the same kind to add in this case)
awk to the rescue!
$ awk -F, '{for(i=1;i<=NF;i++)
if($i==64)
{k=$i FS $(++i) FS $(++i);
if (!a[k]++)
print k
}
}' file
64,906,06149
64,916,06149
64,926,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,64126
64,938,64226
64,938,64326
ps. your sample output doesn't match the given input.
I have a csv file that looks like this:
arruba,jamaica, bermuda, bahama, keylargo, montigo, kokomo
80,70,90,85,86,89,83
77,75,88,87,83,85,77
76,77,83,86,84,86,84
I want to have a shell script set up so that I can extract out the data so that I can categorize data by columns.
I know that the line of code:
IFS="," read -ra arr <"vis.degrib"
for ((i=0 ; i<${#arr[#]} ; i++));do
ifname=`printf "%s\n" "${arr[i]}"`
echo "$ifname"
done
will print out the individual column components for the first row. How do I also do this again for subsequent rows?
Thank you for your time.
I'm extrapolating from the OP
awk -F, 'NR==1{for(i=1;i<=NF;i++) {gsub(/^ +/,"",$i);print $i}}' vis.degrib
will print
arruba
jamaica
bermuda
bahama
keylargo
montigo
kokomo
note that there is trimming of the space from the beginning of each field. If you remove the condition NR==1, the same will be done for all rows. Was this your request? Please comment...
Perhaps you want to convert columnar format to row based format (transpose)? There are many ways, this awk script will do
awk -F, -v OFS=, '{sep=(NR==1)?"":OFS} {for(i=1;i<=NF;i++) a[i]=a[i] sep $i} END{for(i=1;i<=NF;i++) print a[i]}' vis.degrib
will print
arruba,80,77,76
jamaica,70,75,77
bermuda,90,88,83
bahama,85,87,86
keylargo,86,83,84
montigo,89,85,86
kokomo,83,77,84
you can again trim the space from the beginning of the labels as shown above.
Another approach without awk.
tr ',' '\n' <vis.degrib | pr -4ts,
will generate the same
arruba,80,77,76
jamaica,70,75,77
bermuda,90,88,83
bahama,85,87,86
keylargo,86,83,84
montigo,89,85,86
kokomo,83,77,84
4 is the number of rows in the original file.
How do i output the first duplicate of a csv file?
for example if i have:
00:0D:67:24:D7:25,1,-34,123,135
00:0D:67:24:D7:25,1,-84,567,654
00:0D:67:24:D7:26,1,-83,456,234
00:0D:67:24:D7:26,1,-86,123,124
00:0D:67:24:D7:2C,1,-56,245,134
00:0D:67:24:D7:2C,1,-83,442,123
00:18:E7:EB:BC:A9,5,-70,123,136
00:18:E7:EB:BC:A9,5,-90,986,545
00:22:A4:25:A8:F9,6,-81,124,234
00:22:A4:25:A8:F9,6,-90,456,654
64:0F:28:D9:6E:F9,1,-67,789,766
64:0F:28:D9:6E:F9,1,-85,765,123
74:9D:DC:CB:73:89,10,-70,253,777
i want my output to look like this:
00:0D:67:24:D7:25,1,-34,123,135
00:0D:67:24:D7:26,1,-83,456,234
00:0D:67:24:D7:2C,1,-56,245,134
00:18:E7:EB:BC:A9,5,-70,123,136
00:22:A4:25:A8:F9,6,-81,124,234
64:0F:28:D9:6E:F9,1,-67,789,766
74:9D:DC:CB:73:89,10,-70,253,777
i was thinking along the lines of first outputting the first line of the csv file so like awk (code that outputs first row) >> file.csv then compare the first field of the row to the first field of the next row, if they are the same, check the next row. Until it comes to a new row, the code will output the new different row so again awk (code that outputs) >> file.csv and it will repeat until the check is complete
im kinda of new to bash coding, but i love it so far, im currently phrasing a csv file and i need some help. Thanks everyone
Using awk:
awk -F, '!a[$1]++' file.csv
awk forms an array where the 1st column is the key and the value is the count of no. of times the particular key is present. '!a[$1]++' will be true only when the 1st occurence of the 1st column, and hence the first occurrence of the line gets printed.
If I understand what you're getting at you want something like this:
prev_field=""
while read line
do
current_field=$(echo $line | cut -d ',' -f 1)
[[ $current_field != $prev_field ]] && echo $line
prev_field=$current_field
done < "stuff.csv"
Where stuff.csv is the name of your file. That's assuming that what you're trying to do is take the first field in the csv row and only print the first unique occurrence of it, which if that's the case I think your output may be missing a few.
Using uniq:
sort lines.csv | uniq -w 17
Provided your first column is fixed size (17). lines.csv is a file with your original input.
perl -F, -lane '$x{$F[0]}++;print if($x{$F[0]}==1)' your_file
if you want to change the file inplace:
perl -i -F, -lane '$x{$F[0]}++;print if($x{$F[0]}==1)' your_file