Compare files by position - bash

I want to compare two files only by their first column.
My first looks like this:
0009608a4138a8e7 hdisk26 altinst_rootvg
000f7d4a8234a675 hdisk12 vgdbf
000f7d4a8234d5c9 hdisk22 vgarcbkp
My second file looks like this:
000f7d4a8234a675 hdiskpower64 [Lun_vgdbf]
000f7d4a8234d5c9 hdiskpower61 [Lun_vgarcbkp]
This is the output I would like to generate:
0009608a4138a8e7 hdisk26 altinst_rootvg
000f7d4a8234a675 hdisk12 vgdbf hdiskpower64 [Lun_vgdbf]
000f7d4a8234d5c9 hdisk22 vgarcbkp hdiskpower61 [Lun_vgarcbkp]
I wonder why diff does not support positional compare.
Something like diff -y -p1-17 file1 file2. Any idea?

You can use join to produce your desired output:
join -a 1 file1 file2
The -a 1 option tells join to also output lines from the first file that have no correspondence in the second, so this assumes the first file contains every id that is present in the second.
It also relies on the files being sorted on their first field, which seems to be the case with your sample data. If it's not, you will need to sort them beforehand (the join command will warn you if your files are not sorted).
Sample execution:
$ echo '1 a b
> 2 c d
> 3 e f' > test1
$ echo '2 9 8
> 3 7 6' > test2
$ join -a 1 test1 test2
1 a b
2 c d 9 8
3 e f 7 6
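If your real files aren't already sorted on the join field, a sketch using process substitution (bash) that recreates the question's sample data and sorts both inputs on the fly:

```shell
# Recreate the question's two sample files
printf '%s\n' \
  '0009608a4138a8e7 hdisk26 altinst_rootvg' \
  '000f7d4a8234a675 hdisk12 vgdbf' \
  '000f7d4a8234d5c9 hdisk22 vgarcbkp' > file1
printf '%s\n' \
  '000f7d4a8234a675 hdiskpower64 [Lun_vgdbf]' \
  '000f7d4a8234d5c9 hdiskpower61 [Lun_vgarcbkp]' > file2

# Sort both inputs on the join field on the fly;
# -a 1 keeps unmatched lines from file1
join -a 1 <(sort -k1,1 file1) <(sort -k1,1 file2)
```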

Related

Merge multiple text files on one column in bash

How do you merge multiple plain text files (>2) on the first column? For example, I have three files like the below:
cat file1.txt
a 1
b 2
c 3
cat file2.txt
a 2
b 3
c 4
cat file3.txt
a 3
b 4
c 5
I am trying to merge these files on the first column into one file, like this:
cat ideal.txt
a 1 2 3
b 2 3 4
c 3 4 5
How about join?
Join lines of two sorted files on a common field.
More information: https://www.gnu.org/software/coreutils/join
join file1.txt file2.txt > join1.txt
join join1.txt file3.txt > ideal.txt
cat ideal.txt
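The two-step join above can be reproduced end to end (a sketch that recreates the three sample files in the current directory):

```shell
# Recreate the three sample files from the question
printf 'a 1\nb 2\nc 3\n' > file1.txt
printf 'a 2\nb 3\nc 4\n' > file2.txt
printf 'a 3\nb 4\nc 5\n' > file3.txt

# Chain the joins: the intermediate file holds the first merge
join file1.txt file2.txt > join1.txt
join join1.txt file3.txt
```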
Here's a script (I named the file "jj") that you can use to join any number of files. To run it, type: ./jj file1.txt file2.txt file3.txt
#!/usr/bin/env bash
# define temporary location, WIP/CACHE
tmp="/tmp/outjointmp"
# define target location
out="/tmp/outjoin"
# truncate both files, just in case there is any residue from anything
: > "$out"
: > "$tmp"
# first, copy the contents of the first file into the target file
cat "$1" > "$out"
# loop through all remaining arguments
# loop through all remaining arguments
while [[ $# -gt 1 ]]; do
    join "$out" "$2" > "$tmp"
    shift
    # copy over the temp into destination file
    cat "$tmp" > "$out"
done
cat "$out"
Result and output:
$ ./jj file1.txt file2.txt file3.txt
a 1 2 3
b 2 3 4
c 3 4 5
A recursive function using process substitution should do the trick in order to join more than two files:
#!/bin/bash
join_rec() {
    if (($# <= 2)); then
        join "$@"
    else
        join "$1" <(join_rec "${@:2}")
    fi
}
join_rec file*.txt > joined_file
This assumes the input files are sorted.

Bash: reshape a dataset of many rows to dataset of many columns

Suppose I have the following data:
# all the numbers are their own number. I want to reshape exactly as below
0 a
1 b
2 c
0 d
1 e
2 f
0 g
1 h
2 i
...
And I would like to reshape the data such that it is:
0 a d g ...
1 b e h ...
2 c f i ...
Without writing a complex composition. Is this possible using the unix/bash toolkit?
Yes, trivially I can do this inside a language. The idea is NOT TO "just" do that. So if some cat X.csv | rs [magic options] sort of solution (and rs, or the bash reshape command, would be great, except it isn't working here on debian stretch) exists, that is what I am looking for.
Otherwise, an equivalent answer that involves a composition of commands or script is out of scope: already got that, but would rather not have it.
Using GNU datamash:
$ datamash -s -W -g 1 collapse 2 < file
0 a,d,g
1 b,e,h
2 c,f,i
Options:
-s sort
-W use whitespace (spaces or tabs) as delimiters
-g 1 group on the first field
collapse 2 print comma-separated list of values of the second field
To convert the tabs and commas to space characters, pipe the output to tr:
$ datamash -s -W -g 1 collapse 2 < file | tr '\t,' ' '
0 a d g
1 b e h
2 c f i
bash version:
function reshape {
    local index number key
    declare -A result
    while read -r index number; do
        result[$index]+=" $number"
    done
    for key in "${!result[@]}"; do
        echo "$key${result[$key]}"
    done
}
reshape < input
We just need to make sure the input is in Unix format (LF line endings).
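If datamash isn't available, plain awk can do the same grouping (a sketch; for (k in a) has no guaranteed iteration order, hence the final sort):

```shell
# Recreate the sample input
printf '0 a\n1 b\n2 c\n0 d\n1 e\n2 f\n0 g\n1 h\n2 i\n' > input

# Accumulate column 2 into an array keyed on column 1, then print
awk '{ a[$1] = a[$1] " " $2 } END { for (k in a) print k a[k] }' input | sort
```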

Joining lines, modulo the number of records

Say my stream is x*N lines long, where x is the number of records and N is the number of columns per record, and the stream is output column-wise. For example, x=2, N=3:
1
2
Alice
Bob
London
New York
How can I join every line, modulo the number of records, back into columns:
1 Alice London
2 Bob New York
If I use paste with -s, I get the transposed output. I could use split with the -l option equal to N, then recombine the pieces afterwards with paste, but I'd like to do it within the stream without spitting out temporary files all over the place.
Is there an "easy" solution (i.e., rather than invoking something like awk)? I'm thinking there may be some magic join solution, but I can't see it...
EDIT Another example, when x=5 and N=3:
1
2
3
4
5
a
b
c
d
e
alpha
beta
gamma
delta
epsilon
Expected output:
1 a alpha
2 b beta
3 c gamma
4 d delta
5 e epsilon
You are looking for pr to "columnate" the stream:
pr -T -s$'\t' -3 <<'END_STREAM'
1
2
Alice
Bob
London
New York
END_STREAM
1 Alice London
2 Bob New York
pr is in coreutils.
Most systems should include a tool called pr, intended to print files. It's part of POSIX.1 so it's almost certainly on any system you'll use.
$ pr -3 -t < inp1
1 a alpha
2 b beta
3 c gamma
4 d delta
5 e epsilon
Or if you prefer,
$ pr -3 -t -s, < inp1
1,a,alpha
2,b,beta
3,c,gamma
4,d,delta
5,e,epsilon
or
$ pr -3 -t -w 20 < inp1
1 a alpha
2 b beta
3 c gamma
4 d delta
5 e epsilo
Check the link above for standard usage information, or man pr for specific options in your operating system.
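A self-contained run of the pr approach, recreating the five-record sample in a file (the -s' ' separator keeps the columns single-space delimited):

```shell
# Column-wise stream: 5 records, 3 columns
printf '%s\n' 1 2 3 4 5 a b c d e alpha beta gamma delta epsilon > inp1

# -3: three columns, -t: no headers, -s' ': single-space separator
pr -3 -t -s' ' < inp1
```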
In order to reliably process the input you need to either know the number of columns in the output file or the number of lines in the output file. If you just know the number of columns, you'd need to read the input file twice.
Hackish coreutils solution
# If you know the number of output columns c in advance, the number
# of output lines can be calculated from the total line count:
#   olines=$(( $(wc -l < file) / c ))
# Split the file into chunks of that many lines
split -l "$olines" file FOO # FOO is a prefix. Choose a better one
paste FOO*
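Filled in end to end, the split/paste idea looks like this (a sketch assuming the column count c is known; file names are illustrative):

```shell
# Column-wise stream: 2 records, 3 columns
printf '%s\n' 1 2 Alice Bob London 'New York' > stream

c=3                                   # known number of output columns
olines=$(( $(wc -l < stream) / c ))   # 6 lines / 3 columns = 2 output lines
split -l "$olines" stream FOO         # produces FOOaa, FOOab, FOOac
paste FOO*                            # tab-separated columns, side by side
```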
AWK solutions
If you know the number of output columns in advance you can use this awk script:
convert.awk:
BEGIN {
    # Split the file into one big record where fields are separated
    # by newlines
    RS = ""
    FS = "\n"
}
FNR == NR {
    # We are reading the file twice (see invocation below).
    # When reading it the first time we store the number
    # of fields (lines) in the variable n because we need it
    # when processing the file, then skip to the next record.
    n = NF
    next
}
{
    # n / c is the number of output lines
    # For every output line ...
    for (i = 0; i < n/c; i++) {
        # ... print the columns belonging to it
        for (ii = 1 + i; ii <= NF; ii += n/c) {
            printf "%s ", $ii
        }
        print "" # Adds a newline
    }
}
and call it like this:
awk -vc=3 -f convert.awk file file # Twice the same file
If you know the number of output lines in advance you can use the following awk script:
convert.awk:
BEGIN {
    # Split the file into one big record where fields are separated
    # by newlines
    RS = ""
    FS = "\n"
}
{
    # x is the number of output lines and has been passed to the
    # script. For each output line ...
    for (i = 0; i < x; i++) {
        # ... print the columns belonging to it
        for (ii = i + 1; ii <= NF; ii += x) {
            printf "%s ", $ii
        }
        print "" # Adds a newline
    }
}
And call it like this:
awk -vx=2 -f convert.awk file
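The same logic can be run inline, passing x on the command line (a sketch using the two-record sample; note awk strings must use double quotes):

```shell
# Column-wise stream: 2 records, 3 columns
printf '%s\n' 1 2 Alice Bob London 'New York' > file

awk -v x=2 'BEGIN { RS = ""; FS = "\n" }
{
    for (i = 0; i < x; i++) {
        for (ii = i + 1; ii <= NF; ii += x) printf "%s ", $ii
        print ""
    }
}' file
```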

Average number of rows in 10000 text files

I have a set of 10000 text files (file1.txt, file2.txt, ... file10000.txt). Each one has a different number of rows. I'd like to know the average number of rows among these 10000 files, excluding the last row of each. For example:
File1:
a
b
c
d
last
File2:
a
b
c
last
File3:
a
b
c
d
e
last
Here I should obtain 4 as the result. I tried with Python, but it takes too much time to read all the files. How could I do it with a shell script?
Here's one way. First create three test files, where file i has i lines:
$ for i in {1..3}; do seq "$i" > file${i}.txt; done
$ for i in {1..3}; do wc -l file${i}.txt; done | awk '{sum+=$1}END{print sum/NR}'
2
To exclude the last row of each file, as the question asks, sum $1-1 instead: awk '{sum+=$1-1}END{print sum/NR}'.
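Since the question asks to exclude each file's last row, a portable sketch that does that in one pass over wc's per-file counts (sample files recreated here; names are illustrative):

```shell
# Recreate the question's three files (each ends with a "last" row)
printf '%s\n' a b c d last   > f1.txt
printf '%s\n' a b c last     > f2.txt
printf '%s\n' a b c d e last > f3.txt

# Subtract 1 per file to drop the last row; skip wc's "total" line
wc -l f1.txt f2.txt f3.txt | awk '$2 != "total" { sum += $1 - 1; n++ } END { print sum/n }'
```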

Replacing values in large table using conversion table

I am trying to replace values in a large space-delimited text-file and could not find a suitable answer for this specific problem:
Say I have a file "OLD_FILE", containing a header and approximately 2 million rows:
COL1 COL2 COL3 COL4 COL5
rs10 7 92221824 C A
rs1000000 12 125456933 G A
rs10000010 4 21227772 T C
rs10000012 4 1347325 G C
rs10000013 4 36901464 C A
rs10000017 4 84997149 T C
rs1000002 3 185118462 T C
rs10000023 4 95952929 T G
...
I want to replace the first value of each row with a corresponding value, using a large (2.8M rows) conversion table. In this conversion table, the first column lists the value I want to have replaced, and the second column lists the corresponding new values:
COL1_b36 COL2_b37
rs10 7_92383888
rs1000000 12_126890980
rs10000010 4_21618674
rs10000012 4_1357325
rs10000013 4_37225069
rs10000017 4_84778125
rs1000002 3_183635768
rs10000023 4_95733906
...
The desired output would be a file where all values in the first column have been changed according to the conversion table:
COL1 COL2 COL3 COL4 COL5
7_92383888 7 92221824 C A
12_126890980 12 125456933 G A
4_21618674 4 21227772 T C
4_1357325 4 1347325 G C
4_37225069 4 36901464 C A
4_84778125 4 84997149 T C
3_183635768 3 185118462 T C
4_95733906 4 95952929 T G
...
Additional info:
Performance is an issue (the following command takes approximately a year):
while read a b; do sed -i "s/\b$a\b/$b/g" OLD_FILE ; done < CONVERSION_TABLE
A complete match is necessary before replacing
Not every value in the OLD_FILE can be found in the conversion table...
...but every value that can be replaced can be found in the conversion table.
Any help is very much appreciated.
Here's one way using awk:
awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { $1=a[$1] }1' TABLE OLD_FILE
Results:
COL1 COL2 COL3 COL4 COL5
7_92383888 7 92221824 C A
12_126890980 12 125456933 G A
4_21618674 4 21227772 T C
4_1357325 4 1347325 G C
4_37225069 4 36901464 C A
4_84778125 4 84997149 T C
3_183635768 3 185118462 T C
4_95733906 4 95952929 T G
Explanation, in order of appearance:
NR==1 { next } # simply skip processing the first line (header) of
# the first file in the arguments list (TABLE)
FNR==NR { ... } # This is a construct that only returns true for the
# first file in the arguments list (TABLE)
a[$1]=$2 # So when we loop through the TABLE file, we add the
# column one to an associative array, and we assign
# this key the value of column two
next # This simply skips processing the remainder of the
# code by forcing awk to read the next line of input
$1 in a { ... } # Now when awk has finished processing the TABLE file,
# it will begin reading the second file in the
# arguments list which is OLD_FILE. So this construct
# is a condition that returns true literally if column
# one exists in the array
$1=a[$1] # re-assign column one's value to be the value held
# in the array
1 # The 1 on the end simply enables default printing. It
# would be like saying: $1 in a { $1=a[$1]; print $0 }'
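A minimal end-to-end run of this awk approach, on trimmed-down versions of the question's files (two data rows each; file names as in the answer):

```shell
# Conversion table: old id -> new id, with a header row
printf 'COL1_b36 COL2_b37\nrs10 7_92383888\nrs1000000 12_126890980\n' > TABLE
# Data file: header row plus two records keyed on the old id
printf 'COL1 COL2 COL3 COL4 COL5\nrs10 7 92221824 C A\nrs1000000 12 125456933 G A\n' > OLD_FILE

# Load TABLE into an array, then rewrite column one of OLD_FILE
awk 'NR==1 { next } FNR==NR { a[$1]=$2; next } $1 in a { $1=a[$1] } 1' TABLE OLD_FILE
```

Note that OLD_FILE's own header passes through untouched, because "COL1" is never a key in the array.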
This might work for you (GNU sed):
sed -r '1d;s|(\S+)\s*(\S+).*|/^\1\\>/s//\2/;t|' table | sed -f - file
You can use join:
join -o '2.2 1.2 1.3 1.4 1.5' <(tail -n+2 file1 | sort) <(tail -n+2 file2 | sort)
This drops the headers of both files, you can add it back with head -n1 file1.
Output:
12_126890980 12 125456933 G A
4_21618674 4 21227772 T C
4_1357325 4 1347325 G C
4_37225069 4 36901464 C A
4_84778125 4 84997149 T C
3_183635768 3 185118462 T C
4_95733906 4 95952929 T G
7_92383888 7 92221824 C A
Another way with join. Assuming the files are sorted on the 1st column:
head -1 OLD_FILE
join <(tail -n+2 CONVERSION_TABLE) <(tail -n+2 OLD_FILE) | cut -f 2-6 -d' '
But with data of this size you should consider using a database engine.
