Related
I have a CSV file (test.csv) that looks like this:
WH_01,TRAINAMS,A10,1221-ESD
WH_03,TRAINLON,L10A3,3005-21
WH_01,TRAINAMS,A101,PWR-120
WH_02,TRAINCLE,A1,074-HD-SATA
WH_01,TRAINAMS,A10,PWR-120
WH_02,TRAINCLE,A15,102-55665
WH_03,TRAINLON,L10A3,3005-20
WH_03,TRAINLON,UK-B3,101859
1). I can sort the file based on the value in column# 2 as follows:
sort -t, -k2,2 test.csv > testsort.csv
2). Next I would like to split the file based on the value in column# 2. Using the above example, it should create 3 files:
testsort_1.csv:
WH_01,TRAINAMS,A10,1221-ESD
WH_01,TRAINAMS,A101,PWR-120
WH_01,TRAINAMS,A10,PWR-120
testsort_2.csv:
WH_02,TRAINCLE,A1,074-HD-SATA
WH_02,TRAINCLE,A15,102-55665
testsort_3.csv:
WH_03,TRAINLON,L10A3,3005-20
WH_03,TRAINLON,L10A3,3005-21
WH_03,TRAINLON,UK-B3,101859
How can I do this? Not sure if the sort is even required and if the above can be achieved without sorting.
Thank you.
Good move separating sort and awk.
$ sort -t, -k2,2 test.csv |awk -F, '!($2 in T) {T[$2]=++i} {print > ("testsort_" i ".csv")}'
$ tail -n +1 testsort*
==> testsort_1.csv <==
WH_01,TRAINAMS,A10,1221-ESD
WH_01,TRAINAMS,A101,PWR-120
WH_01,TRAINAMS,A10,PWR-120
==> testsort_2.csv <==
WH_02,TRAINCLE,A1,074-HD-SATA
WH_02,TRAINCLE,A15,102-55665
==> testsort_3.csv <==
WH_03,TRAINLON,L10A3,3005-20
WH_03,TRAINLON,L10A3,3005-21
WH_03,TRAINLON,UK-B3,101859
!($2 in T) - If the second field is not found in the indices of array T,
{T[$2]=++i} - increment the counter and save the second field as an index.
{print} - print every line
> "file" - overwrite, redirect, and append output to file
("." i ".") - concatenate "strings" and variable
Since you're not sure if you need to sort that almost certainly means you don't and you just think it'd be useful for some reason plus you're just sorting on $2 and then splitting into different files based on the value of $2 so sorting is doing no good whatsoever.
All you actually need to do is:
awk -F, '{print > ($2".csv")}'
Look:
$ ls
test.csv
$ awk -F, '{print > ($2".csv")}' test.csv
$ ls
test.csv TRAINAMS.csv TRAINCLE.csv TRAINLON.csv
$ tail -n +1 TRAIN*
==> TRAINAMS.csv <==
WH_01,TRAINAMS,A10,1221-ESD
WH_01,TRAINAMS,A101,PWR-120
WH_01,TRAINAMS,A10,PWR-120
==> TRAINCLE.csv <==
WH_02,TRAINCLE,A1,074-HD-SATA
WH_02,TRAINCLE,A15,102-55665
==> TRAINLON.csv <==
WH_03,TRAINLON,L10A3,3005-21
WH_03,TRAINLON,L10A3,3005-20
WH_03,TRAINLON,UK-B3,101859
If you got past about 20 output file names and weren't using GNU awk then you'd have to close() each one whenever $2 changes and use >> instead of > to append to them.
If for some reason you really do need to use the output file names from your question then that'd be:
awk -F, '!($2 in map){map[$2]="testsort_"++cnt".csv"} {print > map[$2]}' test.csv
You can do it in a fairly simple fashion by keeping a counter for the filename and using sprintf to create the filename for each successive group of files. You use the FNR (file record number) to distinguish between the first and subsequent records.
For example:
$ sort -t, -k2 file.csv |
awk -F, -v cnt=1 -v fn="testsort_1.csv" '
FNR==1 {
prev=$2
print $0 > fn
}
FNR>1 {
if ($2!=prev) {
cnt++
fn=sprintf("%s_%d.csv", "testsort", cnt)
}
print $0 > fn
prev=$2
}'
(note: you set the initial filename as a variable to begin, and then create all subsequent filenames from your cnt (count) using sprintf. prev tracks the second field from the previous record. fn is the filename created by sprintf and the counter.)
A shorter version of the same script declaring prev as a variable initially, would be:
sort -t, -k2 file.csv |
awk -F, -v cnt=0 -v prev="" '{
if ($2!=prev) {
cnt++
fn = "testsort_" cnt ".csv"
prev=$2
}
print $0 > fn
}'
If you do not wish to have sequentially numbered files, but instead want the "testsort_number.csv" taken from the sorted records, look at #Cyrus now-deleted answer that provides an excelled (and shorter) solution in that regard. (I see you already have great answer)
Example Use/Output
With your input in file.csv, the following output files would be created:
$ for i in testsort_{1..3}.csv; do printf "\n%s\n" $i; cat $i; done
testsort_1.csv
WH_01,TRAINAMS,A10,1221-ESD
WH_01,TRAINAMS,A10,PWR-120
WH_01,TRAINAMS,A101,PWR-120
testsort_2.csv
WH_02,TRAINCLE,A1,074-HD-SATA
WH_02,TRAINCLE,A15,102-55665
testsort_3.csv
WH_03,TRAINLON,L10A3,3005-20
WH_03,TRAINLON,L10A3,3005-21
WH_03,TRAINLON,UK-B3,101859
I have two files:
In the first one (champions.csv) I have the number and the name of some LoL champions
1,Annie
2,Olaf
3,Galio
4,Twisted Fate
5,Xin Zhao
6,Urgot
7,LeBlanc
8,Vladimir
9,Fiddlesticks
10,Kayle
11,Master Yi
In the second one (top.csv) I have couples of champions (first and second column) and the number of won matches by them (third column)
2,1,3
3,1,5
4,1,6
5,1,1
6,1,10
7,1,9
8,1,11
10,4,12
7,5,2
3,3,6
I need to substitute the numbers of the second file with the respective names of the first file.
I tried using awk and storing the names in an array but it didn't work
lengthChampions=`cat champions.csv | wc -l`
for i in `seq 1 $length`; do
name=`cat champions.csv | head -$i | tail -1 | awk -F',' '{print $2}'`
champions[$i]=$name
done
for i in `seq 1 10`; do
champion1=${champions[`cat top.csv | head -$i | tail -1 | awk -F',' '{print $1}'`]}
champion2=${champions[`cat top.csv | head -$i | tail -1 | awk -F',' '{print $2}'`]}
awk -F',' 'NR=='$i' {$1='$champion1'} {$2='$champion2'} {print $1","$2","$3}' top.csv > tmptop.csv && mv tmptop.csv top.csv
done
I would like a solution for this problem maybe with less code than this. The result should be something like that (not the actual result for my files):
Ahri,Ashe,1502
Camille,Ezreal,892
Ekko,Dr. Mundo,777
Fizz,Caitlyn,650
Gnar,Ezreal,578
Fiora,Irelia,452
Janna,Graves,321
Jax,Jinx,245
Ashe,Corki,151
Katarina,Lee Sin,102
this can be accomplished in a single awk call. associate numbers with champions in an array and use it for replacing numbers in second file.
awk 'BEGIN{FS=OFS=","} NR==FNR{a[$1]=$2;next} {$1=a[$1];$2=a[$2]} 1' champions.csv top.csv
Olaf,Annie,3
Galio,Annie,5
Twisted Fate,Annie,6
Xin Zhao,Annie,1
Urgot,Annie,10
LeBlanc,Annie,9
Vladimir,Annie,11
Kayle,Twisted Fate,12
LeBlanc,Xin Zhao,2
Galio,Galio,6
in case there should be some numbers in top.csv that don't exist in champions.csv, use the following instead to prevent those numbers from being deleted:
awk 'BEGIN{FS=OFS=","} NR==FNR{a[$1]=$2;next} ($1 in a){$1=a[$1]} ($2 in a){$2=a[$2]} 1' champions.csv top.csv
Assuming that the 2nd column of champions.csv isn't too huge, (i.e. larger than the maximum size of the bash array ${c[#]}), then using bash and cut:
readarray -t -O 1 c < <(cut -d, -f2 champions.csv)
while IFS=, read x y z; do
printf '%s,%s,%s\n' "${c[$x]}" "${c[$y]}" "$z"
done < top.csv
Output:
Olaf,Annie,3
Galio,Annie,5
Twisted Fate,Annie,6
Xin Zhao,Annie,1
Urgot,Annie,10
LeBlanc,Annie,9
Vladimir,Annie,11
Kayle,Twisted Fate,12
LeBlanc,Xin Zhao,2
Galio,Galio,6
I am trying to split a file (testfile.csv) that contains the following:
1,2,4,5,6,7,8,9
a,b,c,d,e,f,g,h
q,w,e,r,t,y,u,i
a,s,d,f,g,h,j,k
z,x,c,v,b,n,m,z
into a file
1,2
a,b
q,w
a,s
z,x
and another file
4,5
c,d
e,r
d,f
c,v
but I cannot seem to do that in awk using an iterative solution.
awk -F, '{print $1, $2}'
awk -F, '{print $3, $4}'
does it for me but I would like a looping solution.
I tried
awk -F, '{ for (i=1;i< NF;i+=2) print $i, $(i+1) }' testfile.csv
but it gives me a single column. It appears that I am iterating over the first row and then moving onto the second row skipping every other element of that specific row.
You can use cut:
$ cut -d, -f1,2 file > file_1
$ cut -d, -f3,4 file > file_2
If you are going to use awk be sure to set the OFS so that the columns remain a CSV file:
$ awk 'BEGIN{FS=OFS=","}
{print $1,$2 >"f1"; print $3,$4 > "f2"}' file
$ cat f1
1,2
a,b
q,w
a,s
z,x
$cat f2
4,5
c,d
e,r
d,f
c,v
Is there a quick and dirty way of renaming the resulting files with the first row and first column (like first file would be 1.csv, second file would be 4.csv:
awk 'BEGIN{FS=OFS=","}
FNR==1 {n1=$1 ".csv"; n2=$3 ".csv"}
{print $1,$2 >n1; print $3,$4 > n2}' file
awk -F, '{ for (i=1; i < NF; i+=2) print $i, $(i+1) > i ".csv"}' tes.csv
works for me. I was trying to get the output in bash which was all jumbled up.
It's do-able in bash, but it will be much slower than awk:
f=testfile.csv
IFS=, read -ra first < <(head -1 "$f")
for ((i = 0; i < (${#first[#]} + 1) / 2; i++)); do
slice_file="${f%.csv}$((i+1)).csv"
cut -d, -f"$((2 * i + 1))-$((2 * (i + 1)))" "$f" > "$slice_file"
done
with sed:
sed -r '
h
s/(.,.),./\1/w file1.txt
g
s/.,.,(.,.),./\1/w file2.txt' file.txt
I have a text file like this
A;green;3
B;blue;2
A;red;4
C;red;2
C;blue;3
B;green;3
I have to write a script that if started with parameter "B" gives me the color of the row with the biggest number (from the rows starting with B). In this case it would be the last line, so the output would be "green".
How do I separate the elements by ";"-s and newlines and store them into a matrix so I can work with it? Do I even need to do that, or is there an easier solution?
Thanks in advance!
awk + sort solution:
awk -v param="B" -F';' '$1==param{ print $2; exit }' <(sort -t';' -k1,1 -k3nr file.txt)
The output:
green
Or in addition to #William Pursell's answer - to extract only color value:
awk -F';' '/^B/ && $3>m{ m=$3; c=$2 }END{ print c }' file.txt
green
Via bash script:
get_max_color.sh script:
#!/bin/bash
awk -F';' -v p="$1" '$0~"^"p && $3>m{ m=$3; c=$2 }END{ print c }' "$2"
Usage:
bash get_max_color.sh B "file.txt"
green
You just need to filter out the appropriate lines and store the one with the max value seen. The obvious solution is:
awk '/^B/ && $3 > m{s=$0} END { print s}' FS=\; input
To use a parameter, do
awk "/^$1/"' && $3 > m{s=$0} END { print s}' FS=\; input
A non-awk solution, possibly less elegant and slower than the already proposed solution:
sort -r -t\; -k1,1 -k3 file | uniq -w1 | grep "B" | cut -f2 -d\;
awk to the rescue!
probably not understood what you want to achieve but
awk -v key="$c" -F\; 'm[$1]<$3{m[$1]=$3; c[$1]=$2} END{print c[key]}' file
will pick the highest coded color from the file for the key
some poor usage pattern
$ for c in A B C;
do
echo $c "->" $(awk -v key="$c" -F\; 'm[$1]<$3 {m[$1]=$3; c[$1]=$2}
END {print c[key]}' file);
done;
A -> red
B -> green
C -> blue
you can probably implement the rest of the script in awk and do this process once.
Or, perhaps you want an associative array, can be done as below:
$ declare -A colors;
while IFS=\; read k c _ ;
do
colors[$k]=$c;
done < <(sort -t\; -k1,1 -k3nr file | uniq -w1)
$ echo ${colors[A]}
red
Attempting to correlate GEO IP data from one CSV to an access log of another.
Sample lines of data:
CSV1
Bob,App1,8-Jan-15,8.8.8.8
April,App3,2-Jan-15,5.5.5.5
George,App2,1-Feb-15,8.8.8.8
CSV2
8.8.8.8,US,United States,CA,California,Mountain View,94040,America/Los_Angeles
5.5.5.5,US,United States,FL,Florida,Miami
I want to search CSV1 for any IP listed in in CSV2 and append fields 1,2,4 to CSV1 when the IP matches.
So far I have, but I'm getting errors I believe at the SED portion.
#!/bin/bash
for LINE in $( cat CSV2 | awk -F',' '{print $1 "," $2 "," $4}' )
do
$IP = $( echo $LINE | cut -d, -f1 )
sed -i.bak "s/"$IP/\""$LINE\"" CSV1
done
Desired Output:
Bob,App1,8-Jan-15,8.8.8.8,United States,CA
Dawn,App3,2-Jan-15,5.5.5.5,United States,FL
George,App2,1-Feb-15,8.8.8.8,United States,CA
Using the join command:
$ join -t , -1 4 -2 1 -o 1.1,1.2,1.3,1.4,2.3,2.4 <(sort -t, -k4,4 CSV1) <(sort -t, CSV2)
Bob,App1,8-Jan-15,8.8.8.8,United States,CA
Using sort is overkill here, but for >1 line files, join requires the files to be sorted on the join key
With awk
$ awk -F, -v OFS=, 'NR == FNR {a[$1] = $3 OFS $4; next} $4 in a {print $0, a[$4]}' CSV2 CSV1
Bob,App1,8-Jan-15,8.8.8.8,United States,CA