Using grep and awk together - bash

I have a file (A.txt) with 4 columns of numbers and another file with 3 columns of numbers (B.txt). I need to solve the following problems:
Find all lines in A.txt whose 3rd column contains a number that appears anywhere in the 3rd column of B.txt.
Assume that I have many files like A.txt in a directory and I need to run this for every file in that directory.
How do I do this?

You should never see someone using grep and awk together because whatever grep can do, you can also do in awk:
Using grep and awk:
grep "foo" file.txt | awk '{print $1}'
Using only awk:
awk '/foo/ {print $1}' file.txt
I had to get that off my chest. Now to your problem...
Awk is a programming language built around a single loop through all the lines in a set of files, and that is not quite what you want here. Instead, you want to treat B.txt as a special file and loop through your other files. That normally calls for something like Python or Perl. (Older versions of Bash didn't have hashed key arrays, i.e. associative arrays, so they won't work for this; newer Bash can, as sketched below.) However, slitvinov seems to have found an answer.
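For what it's worth, here is a minimal pure-Bash sketch of that idea, assuming bash 4+ for declare -A and plain whitespace-separated columns as in your samples:
#!/bin/bash
# Requires bash 4+ for associative arrays (declare -A).
declare -A valid

# Mark every value that appears in column 3 of B.txt.
while read -r _ _ col3 _; do
    valid[$col3]=1
done < B.txt

# Print lines from the other files whose column 3 was marked above.
for f in A*.txt; do
    while read -r line; do
        set -- $line                       # split the line on whitespace
        [[ ${valid[$3]+set} ]] && echo "$line"
    done < "$f"
done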
Here's a Perl solution anyway:
use strict;
use warnings;
use feature qw(say);
use autodie;

my $b_file = shift;
open my $b_fh, "<", $b_file;

#
# This tracks the values in "B"
#
my %valid_lines;
while ( my $line = <$b_fh> ) {
    chomp $line;
    my @array = split /\s+/, $line;
    $valid_lines{$array[2]} = 1;    # Third column
}
close $b_fh;

#
# This handles the rest of the files
#
while ( my $line = <> ) {    # The rest of the files
    chomp $line;
    my @array = split /\s+/, $line;
    next unless exists $valid_lines{$array[2]};    # Skip unless field #3 was in B.txt too
    say $line;
}
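Assuming the script is saved as filter.pl (a name I'm making up here), you would invoke it with B.txt first and the remaining files after it:
perl filter.pl B.txt A*.txt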

Here is an example. Create the following files and run
awk -f c.awk B.txt A*.txt
c.awk
# While reading the first file (B.txt), FNR==NR is true:
# remember each value of column 3 as a key in array s.
FNR==NR {
    s[$3]
    next
}
# For the remaining files, print any line whose column 3
# appeared in B.txt, prefixed with the file name.
$3 in s {
    print FILENAME, $0
}
A1.txt
1 2 3
1 2 6
1 2 5
A2.txt
1 2 3
1 2 6
1 2 5
B.txt
1 2 3
1 2 5
2 1 8
The output should be:
A1.txt 1 2 3
A1.txt 1 2 5
A2.txt 1 2 3
A2.txt 1 2 5

Related

Adding the last number in each file to the numbers in the following file

I have some directories, each of which contains a file with a list of integers 1-N; the integers are not necessarily consecutive, and the lists may be different lengths. What I want to achieve is a single file with a list of all those integers as though they had been generated as one list.
What I am trying to do is to add the final value N from file 1 to all the values in file 2, then take the new final value of file 2 and add it to all the values in file 3 etc.
I have tried this by setting a counter and looping over the files, resetting the counter when I get to the end of each file. The problem is that p=0 keeps resetting, which is fairly obvious from the code, but I am not sure how else to do it.
What I tried:
p=0
for i in dirx/dir_*; do
    (cd "$i" || exit;
    awk -v p=$p 'NR>1{print last+p} {last=$0} END{$0=last; p=last; print}' file >> /someplace/bigfile)
done
Which is similar to the answer suggested in this question Replacing value in column with another value in txt file using awk
Now I'm wondering whether I need an if/else: if it's the first dir then p=0, otherwise p = the last value from the previous file, though I'm not sure about that or how I'd get it to take the last value. I used awk because that's what I understand a small amount of and would usually use.
With GNU awk:
gawk '{print $1 + last} ENDFILE {last = last + $1}' file ...
Demo:
$ cat a
1
2
4
6
8
$ cat b
2
3
5
7
$ cat c
1
2
3
$ gawk '{print $1 + last} ENDFILE {last = last + $1}' a b c
1
2
4
6
8
10
11
13
15
16
17
18
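ENDFILE is a GNU extension; with a POSIX awk the same running offset can be kept by folding the carry in at each file boundary with FNR==1 instead. A sketch, assuming single-column input as in the demo:
awk 'FNR==1 { last += prev; prev = 0 }   # new file: absorb the final value of the prior file
     { print $1 + last; prev = $1 }' a b c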

Add line numbers for duplicate lines in a file

My text file would read as:
111
111
222
222
222
333
333
My resulting file would look like:
1,111
2,111
1,222
2,222
3,222
1,333
2,333
Or the resulting file could alternatively look like the following:
1
2
1
2
3
1
2
I've specified a comma as a delimiter here, but it doesn't matter what the delimiter is; I can modify that at a future date. In reality, I don't even need the original text file contents, just the line numbers, because I can just paste the line numbers against the original text file.
I am just not sure how I can go through numbering the lines based on repeated entries.
All items in list are duplicated at least once. There are no single occurrences of a line in the file.
$ awk -v OFS=',' '{print ++cnt[$0], $0}' file
1,111
2,111
1,222
2,222
3,222
1,333
2,333
Use a variable to save the previous line, and compare it to the current line. If they're the same, increment the counter; otherwise reset it to 1. (This relies on duplicate lines being adjacent, which is the case in the sample input.)
awk '{if ($0 == prev) counter++; else counter = 1; prev=$0; print counter}'
Perl solution:
perl -lne 'print ++$c{$_}' file
-n reads the input line by line
-l strips the newline from each input line and adds it back when printing
++$c{$_} increments the value assigned to the contents of the current line $_ in the hash table %c.
Software tools method, given textfile as input (note that the field number in cut depends on how uniq pads its count column):
uniq -c textfile | cut -d' ' -f7 | xargs -L 1 seq 1
Shell loop-based variant of the above:
uniq -c textfile | while read a b ; do seq 1 $a ; done
Output (of either method):
1
2
1
2
3
1
2
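Since the hard-coded field number in the cut-based pipeline depends on uniq's padding, a sketch that reads the count as an ordinary awk field, making the padding width irrelevant:
uniq -c textfile | awk '{ for (i = 1; i <= $1; i++) print i }'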

How to merge specific columns from many files in one file

I have 100+ tab-separated files in one directory and I want to merge the 2nd column from each file into one file.
I was trying to use paste like this:
paste -d" " *.tsv >> result.tsv
It appends everything; I can't figure out how to apply awk '{print $2}' to it. Can anyone suggest how to approach such a task?
Example input:
file1
1 2 3 4
2 3 4 5
file2
3 4 5 6
5 6 7 8
file3
7 6 5 6
2 3 4 4
Desired output file:
2 4 6
3 6 3
gawk
awk '{a[FNR]=a[FNR]?a[FNR]" "$2:$2}END{for(i=1;i<=length(a);i++)print a[i]}' *
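Unrolled with whitespace and comments, the same program reads as follows (note that length() on an array is a gawk extension, hence the gawk heading):
awk '
{
    # Append column 2 of the current file to the row collected for
    # input line number FNR; the ternary avoids a leading separator.
    a[FNR] = a[FNR] ? a[FNR] " " $2 : $2
}
END {
    for (i = 1; i <= length(a); i++)
        print a[i]
}' *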
If python is good for you, then you can use this script for any number of files:
#! /usr/bin/env python
# invoke with the column nr to extract as the first parameter, followed by
# filenames. The files should all have the same number of rows.
import sys

col = int(sys.argv[1])
res = {}
for file_name in sys.argv[2:]:
    for line_nr, line in enumerate(open(file_name)):
        res.setdefault(line_nr, []).append(line.split('\t')[col - 1])
for line_nr in sorted(res):
    print('\t'.join(res[line_nr]))
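Saved as merge_cols.py (a hypothetical name), pulling column 2 from the sample files would look like:
python merge_cols.py 2 file1 file2 file3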
Note: Script suggested by someone on the Unix StackExchange forum.
There is another solution here as well: Link
Trying for a solution without awk:
rm -f r.tsv
for i in *.tsv; do
    if [[ -f r.tsv ]]; then
        paste r.tsv <(cut -f 2 "$i") > tmp.txt
    else
        cut -f 2 "$i" > tmp.txt
    fi
    mv tmp.txt r.tsv
done
It's longer than the awk solution, even when put on a single line.
Here's a simple script that illustrates how a command-line utility capable of transposition (here datamash) can be used to paste together a specific column from each of a potentially large number of files.
#!/bin/bash
# requires datamash
TMP=$(mktemp /tmp/reshape.XXX)
for file
do
    cut -f 2 < "$file" | tr '\n' '\t' >> "$TMP"
    echo >> "$TMP"
done
# -W means: Use whitespace (one or more spaces and/or tabs)
# for field delimiters; the output will have tab-separated values
datamash --no-strict -W transpose < "$TMP"
/bin/rm "$TMP"
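Saved as reshape.sh (again, a made-up name) and made executable, it takes the files as arguments:
./reshape.sh file1 file2 file3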

AWK: parsing arguments in a loop

I'm trying to write a simple script that will display the fields specified by the user as bash arguments. For example, I've got a text file that looks like this:
1 2 3 4 5
1 2 3 4 5
a b c d e
And for example user types:
./script.sh text 1 2 5
where $1 = text and the other parameters ($2, $3, and $4) are the field numbers, so the output will look like this:
1 2 5
1 2 5
a b e
I've got this code which prints all the columns defined as a arguments, but one below the others:
#!/bin/bash
text="$1"
shift
for x in "$@"; do
    awk '{print $var}' var="$x" "$text"
done
Output for example ./script.sh text 1 2 5:
1
1
a
2
2
b
5
5
e
I guess the output looks like that because the "for" loop is outside of awk. Would placing the loop inside awk be a good solution for this task? I tried a few things but always have trouble with the syntax.
Thank you for your time and help!
file="$1"
shift
awk -v flds="$*" 'BEGIN{n=split(flds,f)} {for (i=1;i<=n;i++) printf "%s%s", $(f[i]), (i<n?OFS:ORS)}' "$file"
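Spread out with comments, that one-liner reads:
awk -v flds="$*" '
BEGIN { n = split(flds, f) }   # f[1..n] holds the requested field numbers
{
    for (i = 1; i <= n; i++)
        # print field f[i]; separate fields with OFS, end the record with ORS
        printf "%s%s", $(f[i]), (i < n ? OFS : ORS)
}' "$file"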
You don't need to loop over the params, pass all of them to awk with -v option:
awk -v v1=$2 -v v2=$3 -v v3=$4 '{print $v1, $v2, $v3;}' $1
You may want to perform additional checks such as whether the file ($1) contains enough fields, the file ($1) exists etc. But the idea is the same.
In your code, you are reading the file multiple times, each pass checking only one particular field; to get your desired output, each line must be checked against all the requested fields in a single pass.
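A minimal sketch of the extra checks mentioned above (hypothetical messages; $4 happens to be the largest requested field in the example invocation ./script.sh text 1 2 5):
if [ ! -r "$1" ]; then
    echo "cannot read file: $1" >&2
    exit 1
fi
awk -v max="$4" 'NF < max { exit 1 }' "$1" ||
    { echo "a line has fewer than $4 fields" >&2; exit 1; }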
Pass the columns to awk, split them into an array, and print the field corresponding to each value in the array:
file=$1
shift
p="$*"
awk -v l="$p" '{t=split(l,a," "); for (i=1;i<=t;i++) printf "%s ", $(a[i]); printf "\n"}' "$file"

Sort subgroups of lines with command-line tools

I've been trying to find a way of sorting this with standard command-line tools (bash, awk, sort, whatever) but can't find a way apart from using Perl or similar.
Any hint?
Input data
header1
3
2
5
1
header2
5
1
3
.....
.....
Output data
header1
1
2
3
5
header2
1
....
Thanks
Assumes sections are separated by blank lines and the header doesn't necessarily contain the string "header". Leaves the sections in the original order so the sort is stable. Reads from stdin, displays on stdout.
#!/bin/bash
function read_section() {
    while read LINE && [ "$LINE" ]; do echo "$LINE"; done
}
function sort_section() {
    read HEADER && (echo "$HEADER"; sort; echo)
}
while read_section | sort_section; do :; done
Or as a one-liner:
cat test.txt | while (while read LINE && [ "$LINE" ]; do echo "$LINE"; done) | (read HEADER && (echo "$HEADER"; sort; echo)); do :; done
Try this:
mark@ubuntu:~$ cat /tmp/test.txt
header1
3
2
5
1
header2
5
1
3
mark@ubuntu:~$ cat /tmp/test.txt | awk '/header/ {colname=$1; next} {print colname, "," , $0}' | sort | awk '{if ($1 != header) {header = $1; print header} print $3}'
header1
1
2
3
5
header2
1
3
5
To get rid of the blank lines, I guess you can add a "| grep -v '^$'" at the end...
Another way: use awk to prefix the header to each number line, sort the resulting file, then remove the prefix to return the file to its original format; a sketch of this follows below.
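A minimal sketch of that decorate-sort-undecorate idea, assuming headers start with a non-digit, there are no blank lines (as in the sample input), and the numbers are positive (the header line's own second field is non-numeric, so it sorts as 0, ahead of its numbers):
awk '/^[^0-9]/ { h = $0 }          # remember the current header
     { print h "\t" $0 }' file |   # prefix every line, headers included
sort -t$'\t' -k1,1 -k2,2n |        # group by header, then sort numerically
cut -f2-                           # strip the prefix again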
With GNU awk, you can use its built-in sort functions:
awk 'BEGIN{ RS="" }
{
    print $1
    for (i = 2; i <= NF; i++) {
        a[i] = $i
    }
    b = asort(a, d)
    for (i = 1; i <= b; i++) {
        print d[i]
    }
    delete d
    delete a
}' file
Demo:
# more file
header1
3
2
5
1
header2
5
1
3
# ./test.sh
header1
1
2
3
5
header2
1
3
5
