how to summarize data based on a field in a row - bash

In bash, how can I read in a large .csv file and summarize the data? I need to get totals for each person.
example input:
joey 4
joey 3
joey 4
joey 6
paul 7
paul 3
paul 1
paul 4
trevor 5
trevor 6
henry 7
mark 8
mark 9
tom 0
It should end up like this:
joey 17
paul 15
trevor 11
henry 7
mark 17
tom 0

list=$(awk '{print $1}' input.txt | uniq)   # input.txt holds your example input
(uniq works here because equal names are on adjacent lines.) It gives you something like this:
joey
paul
trevor
henry
mark
tom
Now let's write two for loops:
for i in $list
do
    counter=0
    for j in $(grep "^$i " input.txt | awk '{print $2}')
    do
        counter=$((counter + j))
    done
    echo "$i $counter"
done
The first loop goes over the names and the inner one adds up the values for each name. It should work, and it's a fairly easy approach.
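For what it's worth, the same totals can be computed in a single pass with awk's associative arrays (a minimal sketch, assuming the same whitespace-separated input.txt as above):
awk '{ sum[$1] += $2 } END { for (name in sum) print name, sum[name] }' input.txt
The for-in order in awk is unspecified, so pipe the output through sort if the names need to come out in a particular order.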

Adding data from an array to a new column in a file using bash [duplicate]

So I have name, age, and city data:
name=(Alex Barbara Connor Daniel Matt Peter Stan)
age=(22 23 55 32 21 8 89)
city=(London Manchester Rome Alberta Naples Detroit Amsterdam)
and I want to set this up as 3-column data with the headings Name, Age, and City. I can easily get the first column using
touch info.txt
echo "Name Age City" > info.txt
for n in "${name[@]}"; do
echo $n >> info.txt
done
but I can't figure out how to get the rest of the data in, and I can't seem to find anything on how to add different data as a new column.
Any help would be greatly appreciated, thank you.
Try something like this:
name=(Alex Barbara Connor Daniel Matt Peter Stan)
age=(22 23 55 32 21 8 89)
city=(London Manchester Rome Alberta Naples Detroit Amsterdam)
touch info.txt
echo "Name Age City" > info.txt
for n in $(seq 0 6); do    # indices 0..6 cover all 7 elements
echo "${name[$n]} ${age[$n]} ${city[$n]}" >> info.txt
done
Output in info.txt:
Name Age City
Alex 22 London
Barbara 23 Manchester
Connor 55 Rome
Daniel 32 Alberta
Matt 21 Naples
Peter 8 Detroit
Stan 89 Amsterdam
JoseLinares solved your problem. For your information, here is a solution with the paste command, whose purpose is exactly this: putting data from different sources into separate columns.
$ printf 'Name\tAge\tCity\n'
$ paste <(printf '%s\n' "${name[@]}") \
        <(printf '%3d\n' "${age[@]}") \
        <(printf '%s\n' "${city[@]}")
Name Age City
Alex 22 London
Barbara 23 Manchester
Connor 55 Rome
Daniel 32 Alberta
Matt 21 Naples
Peter 8 Detroit
Stan 89 Amsterdam
You can fix a specific width for each column (here 20 is used):
name=(Alex Barbara Connor Daniel Matt Peter Stan)
age=(22 23 55 32 21 8 89)
city=(London Manchester Rome Alberta Naples Detroit Amsterdam)
for i in "${!name[#]}"; do
printf "%-20s %-20s %-20s\n" "${name[i]}" "${age[i]}" "${city[i]}"
done
Output:
Alex 22 London
Barbara 23 Manchester
Connor 55 Rome
Daniel 32 Alberta
Matt 21 Naples
Peter 8 Detroit
Stan 89 Amsterdam
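As a side note, the util-linux column tool can also do the alignment for you; a rough sketch reusing the arrays from the question (column -t computes the widths itself):
{
    echo "Name Age City"
    for i in "${!name[@]}"; do
        echo "${name[i]} ${age[i]} ${city[i]}"
    done
} | column -t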

reading a file into an array in bash

Here is my code
#!bin/bash
IFS=$'\r\n'
GLOBIGNORE='*'
command eval
'array=($(<'$1'))'
sorted=($(sort <<<"${array[*]}"))
for ((i = -1; i <= ${array[-25]}; i--)); do
echo "${array[i]}" | awk -F "/| " '{print $2}'
done
I keep getting an error that says "line 5: array=($(<)): command not found".
This is my problem.
As a whole, my code should read in a file given as a command-line argument, sort the elements, then print out column 2 of the last 25 lines. I haven't been able to test that far, so if there's a problem there too, any help would be appreciated.
This is some of what the file contains:
290729 123456
79076 12345
76789 123456789
59462 password
49952 iloveyou
33291 princess
21725 1234567
20901 rockyou
20553 12345678
16648 abc123
16227 nicole
15308 daniel
15163 babygirl
14726 monkey
14331 lovely
14103 jessica
13984 654321
13981 michael
13488 ashley
13456 qwerty
13272 111111
13134 iloveu
13028 000000
12714 michelle
11761 tigger
11489 sunshine
11289 chocolate
11112 password1
10836 soccer
10755 anthony
10731 friends
10560 butterfly
10547 purple
10508 angel
10167 jordan
9764 liverpool
9708 justin
9704 loveme
9610 fuckyou
9516 123123
9462 football
9310 secret
9153 andrea
9053 carlos
8976 jennifer
8960 joshua
8756 bubbles
8676 1234567890
8667 superman
8631 hannah
8537 amanda
8499 loveyou
8462 pretty
8404 basketball
8360 andrew
8310 angels
8285 tweety
8269 flower
8025 playboy
7901 hello
7866 elizabeth
7792 hottie
7766 tinkerbell
7735 charlie
7717 samantha
7654 barbie
7645 chelsea
7564 lovers
7536 teamo
7518 jasmine
7500 brandon
7419 666666
7333 shadow
7301 melissa
7241 eminem
7222 matthew
In Linux you can simply do:
sort -nbr file_to_sort | head -n 25 | awk '{print $2}'
read in a file as a command line argument, sort the elements, then
print out column 2 of the last 25 lines.
From that description of the problem, I suggest:
#! /bin/sh
sort -bn "$1" | tail -n 25 | awk '{print $2}'
As a rule, use the shell to operate on filenames, and never use the shell to operate on data. Utilities like sort and awk are far faster and more powerful than the shell when it comes to processing a file.
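That said, if you specifically want the file in a bash array first (as the title asks), here is a minimal sketch using mapfile (bash 4+; the file comes in as $1, and the variable name is just illustrative):
#!/bin/bash
# Read the file named by $1 into an array, one element per line.
mapfile -t array < "$1"
# Print the elements one per line, sort numerically,
# keep the last 25 lines, and print column 2.
printf '%s\n' "${array[@]}" | sort -n | tail -n 25 | awk '{print $2}'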

Dividing one file into separate based on line numbers

I have the following test file:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
I want to separate it in a way that each file contains the last line of the previous file as the first line. The example would be:
file 1:
1
2
3
4
5
file2:
5
6
7
8
9
file3:
9
10
11
12
13
file4:
13
14
15
16
17
file5:
17
18
19
20
That would make 4 files with 5 lines and 1 file with 4 lines.
As a first step, I tried to test the following commands, which I wrote to get only the first file, containing the first 5 lines. I can't figure out why the awk command in the if statement prints all 20 lines instead of the first 5.
d=$(wc test)
a=$(echo $d | cut -f1 -d " ")
lines=$(echo $a/5 | bc -l)
integer=$(echo $lines | cut -f1 -d ".")
for i in $(seq 1 $integer); do
start=$(echo $i*5 | bc -l)
var=$((var+=1))
echo start $start
echo $var
if [[ $var = 1 ]]; then
awk 'NR<=$start' test
fi
done
Thanks!
Why not just use the split utility available in your POSIX toolkit? It has an option to split by number of lines, which you can set to 5:
split -l 5 input-file
From the split man page:
-l, --lines=NUMBER
put NUMBER lines/records per output file
Note that -l is also POSIX-compliant. (Plain split won't repeat each chunk's last line at the start of the next file, though; the awk answer below handles that.)
$ ls
$
$ seq 20 | awk 'NR%4==1{ if (out) { print > out; close(out) } out="file"++c } {print > out}'
$
$ ls
file1 file2 file3 file4 file5
$ cat file1
1
2
3
4
5
$ cat file2
5
6
7
8
9
$ cat file3
9
10
11
12
13
$ cat file4
13
14
15
16
17
$ cat file5
17
18
19
20
If you're ever tempted to use a shell loop to manipulate text again, make sure to read https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice first to understand at least some of the reasons to use awk instead. To learn awk, get the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
Oh, and regarding why your awk command awk 'NR<=$start' test didn't work: awk is not shell; it has no more access to shell variables (or vice versa) than a C program does. To initialize an awk variable named awkstart with the value of a shell variable named start, and then use that awk variable in your script, you'd do awk -v awkstart="$start" 'NR<=awkstart' test. The awk variable can also be named start or anything else sensible; it is completely unrelated to the name of the shell variable.
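Concretely, a quick sketch against the same 20-line test file:
start=5
awk -v start="$start" 'NR <= start' test    # prints lines 1 through 5 of "test"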
You could improve your code by removing the unnecessary echo, cut, and bc calls, and do it like this:
#!/bin/bash
for i in $(seq "$(wc -l < test)"); do
    (( i % 4 != 1 )) && continue            # chunks start at lines 1, 5, 9, ...
    tail -n +"$i" test | head -n 5 > "file$(( 1 + i/4 ))"
done
But still, the awk solution is much better. Reading the file only once and taking action based on readily available information (like the line number) is the way to go. In shell you have to count the lines first; there is no way around it. awk gives you that (and a lot of other things) for free.
Use split:
$ seq 20 | split -l 5
$ for fn in x*; do echo "$fn"; cat "$fn"; done
xaa
1
2
3
4
5
xab
6
7
8
9
10
xac
11
12
13
14
15
xad
16
17
18
19
20
Or, if you have a file:
$ split -l 5 test_file

Print names alphabetically and how many appearances for each name

I have a file that includes names, one per line. I want to print the names alphabetically, but (and here is where it gets confusing, at least for me) next to each name I must print the number of appearances of that name, with exactly one space between the name and the count.
For example if the file includes these names:
Barry
Don
John
Sam
Harry
Don
Don
Sam
it must print
Barry 1
Don 3
Harry 1
John 1
Sam 2
Any ideas?
sort | uniq -c will get you very close, just with the columns reversed.
$ sort file | uniq -c
1 Barry
3 Don
1 Harry
1 John
2 Sam
If you really need them in the prescribed order, you could swap them with awk.
$ sort test.txt | uniq -c | awk '{print $2, $1}'
Barry 1
Don 3
Harry 1
John 1
Sam 2
With awk:
% awk '{
a[$1]++
}
END{
for (i in a) {
print i, a[i]
}
}' file
Output:
Barry 1
Harry 1
Don 3
John 1
Sam 2
Given:
$ cat file
Barry
Don
John
Sam
Harry
Don
Don
Sam
You can do:
$ awk '{a[$1]++} END { for (e in a) print e, a[e] }' file | sort
Barry 1
Don 3
Harry 1
John 1
Sam 2

split file into multiple files (by columns)

I have a file data.txt in which there are 200 columns and rows (a square matrix). So I have been trying to split my file into 200 files, each of them with one of the columns from the big data file. These were my two attempts employing cut and awk; however, I don't understand why it is not working.
NM=`awk 'NR==1{print NF-2}' < file.txt`
echo $NM
for (( i=1; i = $NM; i++ ))
do
echo $i
cut -f ${i} file.txt > tmpgrid_0${i}.dat
#awk '{print '$i'}' file.txt > tmpgrid_0${i}.dat
done
Any suggestions?
EDIT: Thank you very much to all of you. All answers were valid, but I cannot vote for all of them.
awk '{for(i=1;i<=5;i++){name=FILENAME"_"i;print $i> name}}' your_file
Tested with 5 columns:
> cat temp
PHE 5 2 4 6
PHE 5 4 6 4
PHE 5 4 2 8
TRP 7 5 5 9
TRP 7 5 7 1
TRP 7 5 7 3
TYR 2 4 4 4
TYR 2 4 4 0
TYR 2 4 5 3
> nawk '{for(i=1;i<=5;i++){name=FILENAME"_"i;print $i> name}}' temp
> ls -1 temp_*
temp_1
temp_2
temp_3
temp_4
temp_5
> cat temp_1
PHE
PHE
PHE
TRP
TRP
TRP
TYR
TYR
TYR
>
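If you don't want to hardcode the column count, the same idea generalizes with NF (a sketch, untested against a real 200-column file):
# One output file per column, for however many fields each row has.
# gawk juggles many open files automatically; other awks may need close().
awk '{ for (i = 1; i <= NF; i++) print $i > (FILENAME "_" i) }' your_file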
To summarise my comments, I suggest something like this (untested as I have no sample file):
NM=$(awk 'NR==1{print NF-2}' file.txt)
echo $NM
for (( i=1; i <= $NM; i++ ))
do
echo $i
awk '{print $'$i'}' file.txt > tmpgrid_0${i}.dat
done
An alternative solution using tr and split:
< file.txt tr ' ' '\n' | split -nr/200
This assumes that the file is space-delimited, but the tr command could be tweaked as appropriate for any delimiter. Essentially this puts each entry on its own line, and then uses split's round-robin mode to write every 200th line to the same file.
paste -d' ' x* | cmp - file.txt
verifies that it worked, assuming split wrote files with an x prefix.
I got this solution from Reuti on the coreutils mailing list.
