Extracting data from a particular column number in a csv file - shell

I have a csv file that looks like this:
arruba,jamaica, bermuda, bahama, keylargo, montigo, kokomo
80,70,90,85,86,89,83
77,75,88,87,83,85,77
76,77,83,86,84,86,84
I want to set up a shell script that extracts the data so that I can categorize it by column.
I know that this code:
IFS="," read -ra arr <"vis.degrib"
for ((i=0; i<${#arr[@]}; i++)); do
    ifname=$(printf "%s\n" "${arr[i]}")
    echo "$ifname"
done
will print out the individual column components for the first row. How do I do the same for the subsequent rows?
Thank you for your time.

I'm extrapolating from the OP:
awk -F, 'NR==1{for(i=1;i<=NF;i++) {gsub(/^ +/,"",$i);print $i}}' vis.degrib
will print
arruba
jamaica
bermuda
bahama
keylargo
montigo
kokomo
Note that the space at the beginning of each field is trimmed. If you remove the NR==1 condition, the same will be done for all rows. Was this your request? Please comment...
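If you'd rather stay in bash, you can wrap the read from your question in a while loop so it repeats for every row; a minimal sketch (the ${field# } strips the single leading space, mirroring the gsub above):
while IFS=, read -ra arr; do
    for ((i=0; i<${#arr[@]}; i++)); do
        field=${arr[i]}
        echo "${field# }"
    done
done < vis.degrib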
Perhaps you want to convert the columnar format to a row-based format (transpose)? There are many ways; this awk script will do:
awk -F, -v OFS=, '{sep=(NR==1)?"":OFS} {for(i=1;i<=NF;i++) a[i]=a[i] sep $i} END{for(i=1;i<=NF;i++) print a[i]}' vis.degrib
will print
arruba,80,77,76
jamaica,70,75,77
bermuda,90,88,83
bahama,85,87,86
keylargo,86,83,84
montigo,89,85,86
kokomo,83,77,84
You can again trim the space from the beginning of the labels as shown above.
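For example, folding the same gsub into the transpose script (a sketch):
awk -F, -v OFS=, '{sep=(NR==1)?"":OFS} {for(i=1;i<=NF;i++) {gsub(/^ +/,"",$i); a[i]=a[i] sep $i}} END{for(i=1;i<=NF;i++) print a[i]}' vis.degrib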
Another approach without awk.
tr ',' '\n' <vis.degrib | pr -4ts,
will generate the same output:
arruba,80,77,76
jamaica,70,75,77
bermuda,90,88,83
bahama,85,87,86
keylargo,86,83,84
montigo,89,85,86
kokomo,83,77,84
4 is the number of rows in the original file.
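If you don't want to hard-code the row count, it can be computed on the fly; a sketch, assuming GNU coreutils (where wc -l reading stdin prints a bare number):
tr ',' '\n' <vis.degrib | pr -"$(wc -l <vis.degrib)"ts,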

Related

AWK write to a file based on number of fields in csv

I want to iterate over a csv file and, while writing the rows to a file, discard those that don't have all the columns.
I have an input file mtest.csv like this
IP##Process##Device##ID
TestIP1##TestProcess2##TestDevice1##TestID1
TestIP2##TestProcess2##TestDevice2
TestIP3##TestProcess3##TestDevice3##TestID3
But I am trying to write only those records where all 4 columns are present. The output should not include the complete TestIP2 row, as it has only 3 columns.
Sample output should look like this:
IP##Process##Device##ID
TestIP1##TestProcess2##TestDevice1##TestID1
TestIP3##TestProcess3##TestDevice3##TestID3
I used to do it like this to get all the columns, but it writes the TestIP2 row as well, which has only 3 columns:
awk -F "\##" '{print $1"\##"substr($2,1,50)"\##"substr($3,1,50)"\##"substr($4,1,50)}' mtest.csv >output2.csv
But when I try to ensure that it writes to the file only when all 4 columns are present, it doesn't work:
awk -F "\##", 'NF >3 {print $1"\##"substr($2,1,50)"\##"substr($3,1,50)"\##"substr($4,1,50); exit}' mtest.csv >output2.csv
You are making things harder than they need to be. All you need to do is check NF==4 to output only the records containing four fields. Your whole awk expression would be:
awk -F'##' NF==4 < mtest.csv
(note: the default action by awk is print so there is no explicit print required.)
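Written out with the explicit action, that is:
awk -F'##' 'NF==4 { print $0 }' < mtest.csv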
Example Use/Output
With your sample input in mtest.csv, you would receive:
$ awk -F'##' NF==4 < mtest.csv
IP##Process##Device##ID
TestIP1##TestProcess2##TestDevice1##TestID1
TestIP3##TestProcess3##TestDevice3##TestID3
Thanks David and vukung. Both your solutions are okay. I want to write to a file so that I can trim the length of each field as well. I think the statement below works:
awk -F "##" 'NF>3 {print $1"\##"substr($2,1,50)"\##"substr($3,1,2)"\##"substr($4,1,3)}' mtest.csv >output2.csv

Print 3 consecutive columns after a specific string from a CSV

I need to print a specific string (in my case it is 64) and the 2 columns that follow it. There can be multiple instances of 64 within the same CSV row; however, the next instance will not occur within 3 columns of the previous occurrence. The output for each instance should go on its own line, and only unique entries should be kept. The problem is that the specific string does not fall in the same column for all rows. Every row holds somewhat dynamic data, and the CSV has no header. Say the input file is as below (it's just a sample; the actual file has approx 300 columns & 5 million rows):
00:TEST,123453103279586,ABC,XYZ,123,456,65,906,06149,NIL TS21,1,64,906,06149,NIL TS22,1,64,916,06149,NIL BS20,1,64,926,06149,NIL BS30,1,64,906,06149,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222
00:TEST,123458131344169,ABC,XYZ,123,456,OCCF,1,1,1,64,857,19066,NIL TS21,1,64,857,19066,NIL TS22,1,64,857,19066,NIL BS20,1,64,857,19067,NIL BS30,1,64,857,19068,NIL PSS,1,E2 EPSDATA,GRANTED,NONE,1,N,N,256000,5
00:TEST,123458131016844,ABC,XYZ,123,456,HOLD,,1,64,938,36843,NIL TS21,1,64,938,36841,NIL TS22,1,64,938,36823,NIL BS20,1,64,938,36843,NIL BS30,1,64,938,36843,NIL CAML,1,ORIG,0,TERM,00,50000,N,N,N,N
00:TEST,123453102914690,ABC,XYZ,123,456,HOLD,,1,PBS,TS11,64,938,64126,NIL TS21,1,64,938,64126,NIL TS22,1,64,938,64126,NIL BS20,1,64,938,64226,NIL BS30,1,64,938,64326,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222,2222,R
Output required (only unique entries):
64,906,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,36843
64,938,36843
64,938,64326
There are no performance-related concerns. I have searched many threads but could not find anything closely related. Please help.
We can use a pipe of two commands: the first puts each 64 at the start of a line, and the second prints the first three columns whenever it sees a leading 64.
sed 's/,64[,\n]/\n64,/g' file | awk -F, '/^64/ { print $1 FS $2 FS $3 }'
There are ways of doing this with a single awk command, but this felt quick and easy to me.
Though the sample data from the question contains redundant lines, karakfa (see below) reminds me that the question speaks of a "unique data" requirement. This version uses the keys of an associative array to keep track of duplicate records.
sed 's/,64[,\n]/\n64,/g' file | awk -F, 'BEGIN { split("",a) } /^64/ && !((x=$1 FS $2 FS $3) in a) { a[x]=1; print x }'
gawk:
awk -F, '{for(i=0;++i<=NF;){if($i=="64")a=4;if(--a>0)s=s?s","$i:$i;if(a==1){print s;s=""}}}' file
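The same countdown logic, expanded with comments (a sketch of what the one-liner does):
awk -F, '{
    for (i = 1; i <= NF; i++) {
        if ($i == "64") a = 4                 # arm a countdown covering 64 and the next 2 fields
        if (--a > 0) s = s ? s "," $i : $i    # while the countdown is live, append the field
        if (a == 1) { print s; s = "" }       # third field appended: print and reset
    }
}' file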
Sed for fun
sed -n -e 's/$/,n,n,n/' -e ':a' -e 'G;s/[[:blank:],]\(64,.*\)\(\n\)$/\2\1/;s/.*\(\n\)\(64\([[:blank:],][^[:blank:],]\{1,\}\)\{2\}\)\([[:blank:],][^[:blank:],]\{1,\}\)\{3\}\([[:blank:],].*\)\{0,1\}$/\1\2\1\5/;s/^.*\n\(.*\n\)/\1/;/^64.*\n/P;s///;ta' YourFile | sort -u
This assumes columns are separated by blank space or a comma.
The sort -u is needed for uniqueness (it is possible in sed too, but it would mean adding another "simple" action of the same kind in this case).
awk to the rescue!
$ awk -F, '{for(i=1;i<=NF;i++)
              if($i==64) {
                  k=$i FS $(++i) FS $(++i);
                  if (!a[k]++)
                      print k
              }
          }' file
64,906,06149
64,916,06149
64,926,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,64126
64,938,64226
64,938,64326
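Note that k=$i FS $(++i) FS $(++i) both builds the key and, via the ++i side effects, moves the loop past the two fields just consumed; this relies on the concatenation being evaluated left to right, which holds in gawk.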
PS: your sample output doesn't match the given input.

Indexing variable created by awk in bash

I'm having some trouble indexing a variable (consisting of 1 line with 4 values) derived from a text file with awk.
In particular, I have a text-file containing all input information for a loop. Every row contains 4 specific input values, and every iteration makes use of a different row of the input file.
Input file looks like this:
/home/hannelore/TVB-pipe_local/subjects/CON02T1/ 10012 100000 1001 --> used for iteration 1
/home/hannelore/TVB-pipe_local/subjects/CON02T1/ 10013 7200 1001 --> used for iteration 2
...
From this input text file, I identified the different columns (path, seed, count, target), and then I wanted to index these variables in each iteration of the loop. However, index 0 returns the entire variable and higher indices return no output. Using awk, cut, or IFS on the obtained variable, I wasn't able to split it. Can anyone help me with this?
Some code that I used:
seed=$(awk '{print $2}' $input_file)
--> extract column information from input file, this works
seedsplit=$(awk '{print $2}' $seed)
seedsplit=$(cut -f2 -d ' ' $seed)
Thank you in advance!
Kind regards,
Hannelore
If I understand you correctly, you want to extract the values from the input file row by row.
while read a b c d; do echo "var1:" ${a}; done < file
will print
var1: /home/hannelore/TVB-pipe_local/subjects/CON02T1/
var1: /home/hannelore/TVB-pipe_local/subjects/CON02T1/
Similarly, you can access the other values in b, c, and d.
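For instance, naming the variables after the columns identified in the question (path, seed, count, target) — a sketch:
while read path seed count target; do
    echo "path=$path seed=$seed count=$count target=$target"
done < file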
If you want an array, then use array assignment notation:
seed=( $(awk '{print $2}' $input_file) )
Now you will have the words from each line of output from awk in a separate array element.
col1=( $(awk '{print $1}' $input_file) )
col3=( $(awk '{print $3}' $input_file) )
Now you have three arrays which can be indexed in parallel.
for ((i=0; i<${#col1[@]}; i++))
do
    echo "${col1[$i]} in col1; ${seed[$i]} in col2; ${col3[$i]} in col3"
done

Get the contents of one column given another column

I have a tab separated file with 3 columns. I'd like to get the contents of the first column, but only for the rows where the 3rd column is equal to 8. How do I extract these values? If I just wanted to extract the values in the first column, I would do the following:
cat file1 | tr "\t" "~" | cut -d"~" -f1 >> file_with_column_3
I'm thinking something like:
cat file1 | tr "\t" "~" | if cut -d"~" -f3==8; then cut -d"~" -f1 ; fi>> file_with_column_3
But that doesn't quite seem to work.
Given that your file is tab delimited, it seems like this problem would be well suited for awk.
Something simple like below should work for you, though without any sample data I can't say for sure (try to always include this on questions on SO)
awk -F'\t' '$3==8 {print $1}' inputfile > outputfile
The -F'\t' sets the input delimiter as tab.
$3==8 compares if the 3rd column based on that delimiter is 8.
If so, the {print $1} is executed, which prints the first column.
Otherwise, nothing is done and awk proceeds to the next line.
If your file had a header you wanted to preserve, you could modify this as follows, which tells awk to also print when the current record number is 1:
awk -F'\t' 'NR==1 {print;} $3==8 {print $1}' inputfile > outputfile
awk can handle this better:
awk -F '\t' '$3 == 8 { print $1 }' file1
You can do it with bash only too:
cat x | while read y; do split=(${y}); [ "${split[2]}" == '8' ] && echo "${split[0]}"; done
The input is read into the variable y, then split into an array. The IFS (input field separator) defaults to <space><tab><newline>, so it splits on tabs too. The third field of the array is then compared to '8'. If they are equal, the first field of the array is printed. Remember that array fields start counting at zero.
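A slightly more robust variant (a sketch) reads each line directly into an array with an explicit tab IFS and quotes the expansions:
while IFS=$'\t' read -ra split; do
    [ "${split[2]}" = "8" ] && echo "${split[0]}"
done < x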

Output the first duplicate in a csv file

How do I output only the first occurrence of each duplicate in a csv file?
for example if i have:
00:0D:67:24:D7:25,1,-34,123,135
00:0D:67:24:D7:25,1,-84,567,654
00:0D:67:24:D7:26,1,-83,456,234
00:0D:67:24:D7:26,1,-86,123,124
00:0D:67:24:D7:2C,1,-56,245,134
00:0D:67:24:D7:2C,1,-83,442,123
00:18:E7:EB:BC:A9,5,-70,123,136
00:18:E7:EB:BC:A9,5,-90,986,545
00:22:A4:25:A8:F9,6,-81,124,234
00:22:A4:25:A8:F9,6,-90,456,654
64:0F:28:D9:6E:F9,1,-67,789,766
64:0F:28:D9:6E:F9,1,-85,765,123
74:9D:DC:CB:73:89,10,-70,253,777
I want my output to look like this:
00:0D:67:24:D7:25,1,-34,123,135
00:0D:67:24:D7:26,1,-83,456,234
00:0D:67:24:D7:2C,1,-56,245,134
00:18:E7:EB:BC:A9,5,-70,123,136
00:22:A4:25:A8:F9,6,-81,124,234
64:0F:28:D9:6E:F9,1,-67,789,766
74:9D:DC:CB:73:89,10,-70,253,777
I was thinking along the lines of first outputting the first line of the csv file, something like awk (code that outputs first row) >> file.csv, then comparing the first field of that row to the first field of the next row; if they are the same, check the next row. When the code reaches a different row, it outputs that new row, again with awk (code that outputs) >> file.csv, and it repeats until the check is complete.
I'm kind of new to bash coding, but I love it so far. I'm currently parsing a csv file and I need some help. Thanks, everyone.
Using awk:
awk -F, '!a[$1]++' file.csv
awk builds an associative array in which the 1st column is the key and the value counts how many times that key has been seen. '!a[$1]++' is true only on the first occurrence of a given 1st column, so only the first line for each key gets printed. For example, a["00:0D:67:24:D7:25"] is 0 (false) when the first line is read, so the negated test is true and the line prints; by the second line it is already 1, the test is false, and the line is skipped.
If I understand what you're getting at, you want something like this:
prev_field=""
while read line
do
    current_field=$(echo "$line" | cut -d ',' -f 1)
    [[ $current_field != $prev_field ]] && echo "$line"
    prev_field=$current_field
done < "stuff.csv"
Here stuff.csv is the name of your file. That's assuming that what you're trying to do is take the first field of each csv row and print only its first occurrence; if that's the case, I think your expected output may be missing a few lines.
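Note that comparing only against the previous field catches duplicates because they happen to be adjacent in the sample; if the file weren't grouped by its first column, the awk array approach above would still work, while this loop would not.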
Using uniq:
sort lines.csv | uniq -w 17
Provided your first column is fixed size (17). lines.csv is a file with your original input.
perl -F, -lane '$x{$F[0]}++;print if($x{$F[0]}==1)' your_file
If you want to change the file in place:
perl -i -F, -lane '$x{$F[0]}++;print if($x{$F[0]}==1)' your_file
