Creating a mapping count - bash

I have this data with two columns:
Id Users
123 2
123 1
234 5
234 6
34 3
I want to create this count mapping from the given data:
123 3
234 11
34 3
How can I do it in bash?

You have to use associative arrays, something like
declare -A newmap
newmap["123"]=2
newmap["123"]=$(( ${newmap["123"]} + 1))
Obviously you have to iterate through your input, check whether the entry exists and add to it, or initialize it otherwise; a complete loop is sketched below.
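A minimal sketch of that loop, assuming the data (including the Id Users header) lives in a file named file:
declare -A newmap
while read -r id users; do
    [[ $id == Id ]] && continue        # skip the header line if present
    (( newmap[$id] += users ))         # an unset entry counts as 0 in arithmetic context
done < file
for id in "${!newmap[@]}"; do
    echo "$id ${newmap[$id]}"
done
Because (( ... )) treats an unset array element as 0, there is no need to special-case the first occurrence of a key.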

It will be easier with awk.
Solution 1: Doesn't expect the file to be sorted. Stores entire file in memory
awk '{a[$1]+=$2}END{for(x in a) print x,a[x]}' file
34 3
234 11
123 3
What we are doing here is using the first column as the key and adding the second column to its running total. In the END block we iterate over the array and print each key together with its accumulated value.
If you have the Id Users line in your input file and want to exclude it from the output, then add an NR>1 condition by saying:
awk 'NR>1{a[$1]+=$2}END{for(x in a) print x,a[x]}' file
NR>1 tells awk to skip the first line. NR contains the line number, so we instruct awk to start building our array from the second line onwards.
Solution 2: Expects the file to be sorted. Does not store the file in memory.
awk '$1!=prev && NR>1{print prev,sum; sum=0}{prev=$1; sum+=$2}END{print prev,sum}' file
123 3
234 11
34 3
If you have the Id Users line in your input file and want to exclude it from the output, then shift the conditions by one line by saying:
awk '$1!=prev && NR>2{print prev, sum; sum=0}NR>1{prev = $1; sum+=$2}END{print prev, sum}' file
123 3
234 11
34 3

A Bash (4.0+) solution:
declare -Ai count                 # -i: values are evaluated as arithmetic expressions
while read -r a b ; do
    count[$a]+=b                  # thanks to the -i attribute, this adds b numerically
done < "$infile"
for idx in "${!count[@]}"; do
    echo "${idx} ${count[$idx]}"
done
For a sorted output the last line should read
done | sort -n
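With the sample input from the question (the Id Users header line removed), the sorted variant prints:
34 3
123 3
234 11
The sort matters because the iteration order over a bash associative array is unspecified.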

Related

Update values in column in a file based on values from an array using bash script

I have a text file with the following details.
#test.txt
team_id team_level team_state
23 2
21 4
45 5
I have an array in my code, teamsstatearr=(12 34 45 ...), and I want to add the values from the array as the third column. The array could have many elements, and the test.txt shown here is just a small portion of the real file.
Details of the file contents:
The text file has only three headers. The headers are separated by tabs. The number of rows in the file is equal to the number of items in the array.
Thus my test.txt would look like the following.
team_id team_level team_state
23 2 12
21 4 34
45 5 45
(many more rows are present)
What I have done as of now (I don't see the third column in the file being updated with the values):
# Write the issue ids to file
for item in "${teamstatearr[@]}"
do
printf '%s\n' "item id in loop: ${item}"
awk -F, '{$2=($item)}1' OFS='\t', test.txt
done
I would appreciate it if anyone could help me find the easiest and most efficient way to do it.
If you don't mind a slightly different table layout, you could do:
teamsstatearr=(12 34 45)
{
  # print header
  head -n1 test.txt
  # combine the remaining lines of test.txt and the array values
  paste <(tail -n+2 test.txt) <(printf '%s\n' "${teamsstatearr[@]}")
  # use `column -t` to format the output as a table
} | column -t
Output:
team_id team_level team_state
23 2 12
21 4 34
45 5 45
To write the output to the same file, you can redirect the output to a new file and overwrite the original file with mv:
teamsstatearr=(12 34 45)
{
  head -n1 test.txt
  paste <(tail -n+2 test.txt) <(printf '%s\n' "${teamsstatearr[@]}")
} | column -t > temp && mv temp test.txt
If you have sponge from the moreutils package installed, you could do this without a temporary file:
teamsstatearr=(12 34 45)
{
  head -n1 test.txt
  paste <(tail -n+2 test.txt) <(printf '%s\n' "${teamsstatearr[@]}")
} | column -t | sponge test.txt
Or using awk and column (with the same output):
teamsstatearr=(12 34 45)
awk -v str="${teamsstatearr[*]}" '
BEGIN{split(str, a)} # split `str` into array `a`
NR==1{print; next} # print header
{print $0, a[++cnt]} # print current line and next array element
' test.txt | column -t
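If you would rather do the pairing in pure bash, a sketch along the same lines (it assumes, as the question states, that the array has exactly one element per data line):
teamsstatearr=(12 34 45)
{
  IFS= read -r header && printf '%s\n' "$header"        # pass the header through untouched
  i=0
  while IFS= read -r line; do
    printf '%s\t%s\n' "$line" "${teamsstatearr[i++]}"   # append one array element per line
  done
} < test.txt | column -t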

Adding the last number in each file to the numbers in the following file

I have some directories, each of which contains a file with a list of integers 1-N; the integers are not necessarily consecutive, and the lists may have different lengths. What I want to achieve is a single file with a list of all those integers, as though they had been generated as one list.
What I am trying to do is to add the final value N from file 1 to all the values in file 2, then take the new final value of file 2 and add it to all the values in file 3 etc.
I have tried this by setting a counter and looping over the files, resetting the counter when I get to the end of a file. The problem is that p=0 keeps resetting on every iteration, which is kind of obvious from the code, but I am not sure how else to do it.
What I tried:
p=0
for i in dirx/dir_*; do
(cd "$i" || exit;
awk -v p=$p 'NR>1{print last+p} {last=$0} END{$0=last; p=last; print}' file >> /someplace/bigfile)
done
This is similar to the answer suggested in the question Replacing value in column with another value in txt file using awk.
Now I'm wondering whether I need an if/else: if it's the first dir then p=0, otherwise p should be the last value from the previous file, though I'm not sure about that or how I'd get it to take the last value. I used awk because that's what I understand a small amount of and would usually use.
With GNU awk, whose ENDFILE block runs after each input file has been read completely, you can carry a running offset across files:
gawk '{print $1 + last} ENDFILE {last = last + $1}' file ...
Demo:
$ cat a
1
2
4
6
8
$ cat b
2
3
5
7
$ cat c
1
2
3
$ gawk '{print $1 + last} ENDFILE {last = last + $1}' a b c
1
2
4
6
8
10
11
13
15
16
17
18
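If gawk's ENDFILE is not available, a portable POSIX awk sketch can reset the offset at the start of each new file instead (same sample files assumed):
awk 'FNR==1 { offset = last }             # entering a new file: freeze the running offset
     { last = $1 + offset; print last }   # shift each value and remember the last one
' a b c
This prints the same 1 through 18 sequence as the gawk command above.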

Add line numbers for duplicate lines in a file

My text file would read as:
111
111
222
222
222
333
333
My resulting file would look like:
1,111
2,111
1,222
2,222
3,222
1,333
2,333
Or the resulting file could alternatively look like the following:
1
2
1
2
3
1
2
I've specified a comma as a delimiter here, but it doesn't matter what the delimiter is; I can modify that at a future date. In reality, I don't even need the original text file contents, just the line numbers, because I can paste the line numbers against the original text file.
I am just not sure how to go about numbering the lines based on repeated entries.
All items in the list are duplicated at least once; there are no single occurrences of a line in the file.
$ awk -v OFS=',' '{print ++cnt[$0], $0}' file
1,111
2,111
1,222
2,222
3,222
1,333
2,333
Use a variable to save the previous line, and compare it to the current line. If they're the same, increment the counter, otherwise set it back to 1.
awk '{if ($0 == prev) counter++; else counter = 1; prev=$0; print counter}'
Perl solution:
perl -lne 'print ++$c{$_}' file
-n reads the input line by line
-l strips the newline from each input line and appends one to every print
++$c{$_} increments the value assigned to the contents of the current line $_ in the hash table %c.
Software tools method, given textfile as input:
uniq -c textfile | cut -d' ' -f7 | xargs -L 1 seq 1
(The -f7 field position relies on uniq -c padding single-digit counts with six leading spaces; for counts of 10 or more, extracting the count with awk '{print $1}' instead of cut is more robust.)
Shell loop-based variant of the above:
uniq -c textfile | while read a b ; do seq 1 $a ; done
Output (of either method):
1
2
1
2
3
1
2
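A single awk step can also replace the cut/xargs pipe or the shell loop, and it is not sensitive to uniq's count padding (a sketch, again assuming the duplicates are grouped):
uniq -c textfile | awk '{for (i = 1; i <= $1; i++) print i}'
Here $1 is the repeat count on each uniq -c line, and the loop prints 1 through that count.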

Count how many occurrences in a line are greater than or equal to a defined value

I've a file (F1) with N=10000 lines, each line containing M=20000 numbers. I've another file (F2) with N=10000 lines and only 1 column. How can I count the number of values in line i of file F1 that are greater than or equal to the number found at line i of file F2? I tried using a bash loop with awk/sed but my output is empty.
Edit:
For now I've only succeeded in printing the number of values that are higher than a defined value. Here is an example with a 3-line file and a defined value of 15 (sorry, it's very dirty code ...):
for i in {1..3};do sed -n "$i"p tmp.txt | sed 's/\t/\n/g' | awk '{if($1 > 15){print $1}}' | wc -l; done;
Thanks in advance,
awk 'FNR==NR{a[FNR]=$1; next}
     {
       count=0
       for(i=1; i<=NF; i++)
         if($i >= a[FNR])
           count++
       print count
     }' file2 file1
While processing file2 (the first file, where FNR==NR), store each value in array a, indexed by the current record number, and skip to the next record.
For every line of file1, initialize count to 0.
Loop through the fields, incrementing the counter whenever a field is greater than or equal to the value at index FNR in array a.
Print the count value.
$ cat file1
1 3 5 7 3 6
2 5 6 8 7 7
4 6 7 8 9 4
$ cat file2
6
3
1
$ awk -f file.awk file2 file1   # the program above saved as file.awk
2
5
6
You could do it in a single awk command:
awk 'NR==FNR{a[FNR]=$1;next}{c=0;for(i=1;i<=NF;i++)c+=($i>=a[FNR]);print c}' file2 file1

Assign sequential numbers to the 1st column of data, restarting from 1 at each blank line, using awk and/or sed

I have a big data file consisting of blocks of xy data separated by blank lines. I want to change all the x values to a sequential number that restarts from 1 in each new block. The number of rows within each block can differ.
input:
165168 14653
5131655 51365
155615 1356
13651625 13651
12 51
55165 51656
64 64
651456 546546
desired output:
1 14653
2 51365
3 1356
1 13651
2 51
3 51656
1 64
2 546546
I would use:
$ awk '!NF{i=0; print; next} {print ++i, $2}' file
1 14653
2 51365
3 1356
1 13651
2 51
3 51656
1 64
2 546546
Explanation
It is a matter of keeping a counter i and resetting it appropriately.
!NF{i=0; print; next} if there are no fields, that is, if the line is empty, print an empty line and reset the counter.
{print ++i, $2} otherwise, increment the counter and print it together with the 2nd field.
Maybe even
awk '!NF { n=NR } NF { $1=NR-n } 1' file
So on an empty line, we set n to the current line number. On nonempty lines, we change the first field to the current line number minus n. Print all lines.
