Combining lines with same string in Bash

I have a file with a bunch of lines that looks like this:
3 world
3 moon
3 night
2 world
2 video
2 pluto
1 world
1 pluto
1 moon
1 mars
I want to take the lines that contain the same word and combine them, adding up the preceding numbers, so that the result looks like this:
6 world
4 moon
3 pluto
3 night
2 video
1 mars
I've been trying combinations with sed, but I can't seem to get it right. My next idea was to sort the lines, check whether the following line contained the same word, and then add them together, but I couldn't figure out how to make sort use the word rather than the number.
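(If the sort-by-word part is the sticking point: sort can key on the second field with something like sort -k2,2 file, which groups identical words together so adjacent lines can then be summed; the answer below avoids the explicit sort-and-compare pass altogether.)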

Sum and sort:
awk -F" " '{c[$2]+=$1} END {for (i in c){print c[i], i}}' | sort -n -r

Related

Extract blocks of lines with sed

How would one go about using sed to extract n lines from a file every m-th line?
Say my textfile looks like this:
myfile.dat:
1
2
3
4
5
6
7
8
9
10
Say that I want to extract blocks of three lines and then skip two lines, repeating throughout the entire file, such that my output looks like this:
output.dat:
1
2
3
6
7
8
Any suggestions on how one could achieve this with sed?
Edit:
For my example I could just have used
sed -n 'p;n;p;n;p;n;n' myfile.dat > output.dat
or with GNU sed (not preferred due to portability)
sed '1~5b;2~5b;3~5b;d' myfile.dat > output.dat
However, I typically want to print blocks of 2450 lines from a file with 49 002 450 lines, such that my output file contains 247 450 lines.
This might work for you (GNU sed):
sed -n '1~5,+2p' file
Starting at line 1, this prints every fifth line and the two lines that follow it.
An alternative:
sed -n 'N;N;p;n;n' file
In your case the following would work. It checks whether the line number's remainder when divided by 5 is between 1 and 3:
awk 'NR%5==1, NR%5==3' myfile.dat
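If the block and skip sizes need to be parameters, for instance the 2450-line blocks mentioned in the edit above, a small awk sketch along these lines should also work; block and period are names chosen here, with period being the block size plus the number of skipped lines:

# keep the first block lines of every period-line group
awk -v block=3 -v period=5 '(NR - 1) % period < block' myfile.dat > output.dat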

Checking for changes in two lists in Bash

I have two files, each containing a list of items with the quantity of each item separated from its name by a space. The lists are supposed to be in the same order and always contain the same number of items, but I would prefer code that relies on the item name rather than on the line number.
I need output where only the changes are present, for example an echo for every item whose associated value has changed. I know I could use diff or meld for this, but I need very specific output, because I then have to send a mail for each of these changes, so I guess I should be using something like awk.
cat before.txt
Apples 3
Oranges 5
Bananas 7
Avocados 2
cat after.txt
Apples 3
Oranges 7
Bananas 7
Avocados 3
output wanted:
Oranges has changed from 5 to 7
Avocados has changed from 2 to 3
awk is your friend
awk 'NR==FNR { price[$1] = $2; next }      # first file: remember each item's old value
     $1 in price {
         if (price[$1] != $2) {            # the value differs between the two files
             printf "%s has changed from %s to %s%s", $1, price[$1], $2, ORS
         }
     }' before.txt after.txt
Output
Oranges has changed from 5 to 7
Avocados has changed from 2 to 3
If you're new to awk, consider buying Effective awk Programming by Arnold Robbins.
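Since the question says the two lists are only supposed to contain the same items, here is a slightly extended sketch (my own variant, not part of the answer above, with made-up wording for the extra messages) that also reports items present in only one of the files:

awk 'NR==FNR         { price[$1] = $2; next }
     !($1 in price)  { printf "%s is new with value %s\n", $1, $2; next }
     price[$1] != $2 { printf "%s has changed from %s to %s\n", $1, price[$1], $2 }
                     { delete price[$1] }          # item seen in both files
     END             { for (i in price) printf "%s has been removed (was %s)\n", i, price[i] }
    ' before.txt after.txt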
Not as robust as the other answer, but simple to understand. This is not a very economical way to do the task; I've added it because it keeps things simple, assuming performance is not really a concern.
paste before.txt after.txt | awk '$2!=$4 {print $1 " has changed from " $2 " to " $4}'
Oranges has changed from 5 to 7
Avocados has changed from 2 to 3
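If relying on line order is a concern, as the question mentions, a variant of the same idea that keys on the item name instead is to join sorted copies of the two files; this is just a sketch, and it does lose the original file order:

join <(sort before.txt) <(sort after.txt) |
awk '$2 != $3 {print $1 " has changed from " $2 " to " $3}'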

Linux command to remove lines containing a duplicated value in a text file?

If I have a text file with the following form
1 1
1 3
3 4
2 2
5 7
...
Is there a Linux command that can give me the following result?
1 3
3 4
5 7
...
So, I want to delete the lines 1 1 and 2 2.
Yes, you can use something like:
awk '$1!=$2{print}' inputfilename
or the slightly less verbose (thanks to ooga):
awk '$1!=$2' inputfilename
which uses the "missing action means print" feature of awk.
Both these awk commands print lines where the columns don't match, and throw away everything else.
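If awk is not available, a grep one-liner with a basic-regex backreference does the same thing for this two-column, space-separated format:

grep -v '^\([^ ]*\) \1$' inputfilename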

Frequency count of particular field appended to line without deleting duplicates

Trying to work out how to get a frequency appended or prepended to each line in a file WITHOUT deleting duplicate occurrences (which uniq can do for me).
So, if input file is:
mango
mango
banana
apple
watermelon
banana
I need output:
mango 2
mango 2
banana 2
apple 1
watermelon 1
banana 2
All the solutions I have seen delete the duplicates. In other words, what I DON'T want is:
mango 2
banana 2
apple 1
watermelon 1
Basically, you cannot do it in one pass without keeping everything in memory. If that is acceptable, use python/perl/awk/whatever; the algorithm is quite simple.
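For example, a two-pass awk sketch that reads the file twice, so only the per-line counts, not the whole file, are held in memory:

# pass 1: count each distinct line; pass 2: print each line with its count
awk 'NR==FNR { count[$0]++; next } { print $0, count[$0] }' input input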
Let's do it with standard Unix tools instead. This is a bit cumbersome and can be improved, but it should do the job:
$ sort input | uniq -c > input.count
$ nl input | sort -k 2 > input.line
$ join -1 2 -2 2 input.line input.count | sort -k2,2n | awk '{print $1 " " $3}'
The first step counts the number of occurrences of each word.
As you noted, the counting step cannot both keep the repeats and preserve the line ordering, so we have to fix that: the second step prepends the line number, which we will use later to restore the original ordering.
In the last step, we join the two temporary files on the original word; the second column of the joined output contains the original line number, so we sort on that key and then strip it from the final output.
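For the sample input above, the join step produces word / line-number / count triples, which may make the pipeline easier to follow (output shown for the sample input, modulo whitespace):

$ join -1 2 -2 2 input.line input.count
apple 4 1
banana 3 2
banana 6 2
mango 1 2
mango 2 2
watermelon 5 1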

Extract a range of rows, with overlap using sed

I have a (dummy) file that looks like this:
header
1
2
3
4
5
6
7
8
9
10
And I need a command that would give me separate files made of rows extracted in blocks of four lines, with two overlapping rows between consecutive blocks. So I would have something like this:
1
2
3
4
3
4
5
6
5
6
7
8
7
8
9
10
So here is what I got (it is not much, sorry):
tail -n +2 file | sed -n '1,4p' > window1.txt
But I don't know how to apply this over all the file, with an overlap.
Thanks in advance.
This might work for you (GNU sed and split):
sed -nr '1{N;N;N};:a;p;$q;s/^.*\n.*\n(.*\n.*)$/\1/;N;N;ba' file | split -dl4
EDIT:
To make this programmable use:
sed -nr ':a;$!{N;s/[^\n]+/&/4;Ta};p;$q;s/.*((\n[^\n]*){2})$/\1/;D' file |
split -dl4 - file-name-prefix
Where 4 is the number of lines per file and 2 is the number of overlapping lines.
file-name-prefix is your chosen file name, which will have numbers appended (see man split).
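If GNU sed is not a requirement, an awk alternative (my own sketch, not part of the answer above) that writes each window straight to its own file, skipping the header with tail as in the question's attempt, could look like this; block and step are names chosen here:

tail -n +2 file |
awk -v block=4 -v step=2 '
    { line[NR] = $0 }                         # buffer the input (fine for modest files)
    END {
        for (start = 1; start + block - 1 <= NR; start += step) {
            out = sprintf("window%d.txt", (start - 1) / step + 1)
            for (i = start; i < start + block; i++)
                print line[i] > out
            close(out)
        }
    }'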
