How to merge specific columns from many files in one file - bash

I have 100+ tab-separated files in one directory and I want to merge the 2nd column from each file into one file.
I was trying to use paste like this:
paste -d" " *.tsv >> result.tsv
That pastes the entire lines; I can't figure out how to apply awk '{print $2}' to it. Can anyone suggest how to approach such a task?
Example input:
file1
1 2 3 4
2 3 4 5
file2
3 4 5 6
5 6 7 8
file3
7 6 5 6
2 3 4 4
Desired output file:
2 4 6
3 6 3
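For a small, fixed number of files, paste can be combined with the per-file column extraction directly via process substitution; a minimal sketch for the three example files (the answers below generalize this to 100+ files):
paste -d" " <(awk '{print $2}' file1) <(awk '{print $2}' file2) <(awk '{print $2}' file3)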

gawk
awk '{a[FNR]=a[FNR]?a[FNR]" "$2:$2}END{for(i=1;i<=length(a);i++)print a[i]}' *
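Written out with comments, the same program looks like this (a sketch; length() on an array is a gawk extension, all files are assumed to have the same number of lines, and *.tsv is used instead of * so a result file in the same directory is not picked up):
gawk '{
    # append the 2nd field of the current line to the row collected for this line number
    a[FNR] = (FNR in a) ? a[FNR] " " $2 : $2
}
END {
    # one output row per input line number
    for (i = 1; i <= length(a); i++)
        print a[i]
}' *.tsv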

If python is good for you, then you can use this script for any number of files:
#! /usr/bin/env python
# invoke with column nr to extract as first parameter followed by
# filenames. The files should all have the same number of rows
import sys

col = int(sys.argv[1])
res = {}
for file_name in sys.argv[2:]:
    for line_nr, line in enumerate(open(file_name)):
        res.setdefault(line_nr, []).append(line.split('\t')[col - 1])
for line_nr in sorted(res):
    print('\t'.join(res[line_nr]))
Note: this script was suggested by someone on the Unix & Linux Stack Exchange forum.
There is another solution here: Link
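Assuming the script above is saved as merge_col.py (the file name is mine), pulling the 2nd column out of every .tsv file would look like:
python merge_col.py 2 *.tsv > result.tsv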

Trying for a solution without awk:
rm -f r.tsv
for i in *.tsv; do
    if [[ -f r.tsv ]]; then
        paste r.tsv <(cut -f 2 "$i") > tmp.txt
    else
        cut -f 2 "$i" > tmp.txt
    fi
    mv tmp.txt r.tsv
done
It's longer than the awk solution, even when put on a single line.

Here's a simple script that illustrates how a command-line utility capable of transposition (here datamash) can be used to paste together a specific column from each of a potentially large number of files.
#!/bin/bash
# requires datamash
TMP=$(mktemp /tmp/reshape.XXX)
for file
do
    cut -f 2 < "$file" | tr '\n' '\t' >> $TMP
    echo >> $TMP
done
# -W means: Use whitespace (one or more spaces and/or tabs)
# for field delimiters; the output will have tab-separated values
datamash --no-strict -W transpose < $TMP
/bin/rm $TMP
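Assuming the script is saved as reshape.sh and made executable, it takes the files as arguments, e.g.:
./reshape.sh *.tsv > result.tsv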

Related

Delete values in line based on column index using shell script

I want to be able to delete the values to the RIGHT (starting from a given column index) in test.txt, based on a given length, N.
Column index refers to the character position you see when you open the file in the vim editor on Linux.
If my test.txt contains 1234 5678 and I call my delete_var function with column index 2 to start deleting from and length N of 2, then test.txt should contain 14 5678, since the values in columns 2 and 3 (length 2) were deleted.
I have the following code as of now but I am unable to understand what I would put in the sed command.
delete_var() {
    sed -i -r 's/not sure what goes here' test.txt
}
clmn_index=$1
_N=$2
delete_var "$clmn_index" "$_N" # call the method with the column index and length to delete
#sample test.txt (before call to fn)
1234 5678
#sample test.txt (after call to fn)
14 5678
Can someone guide me?
You should avoid using regex for this task. It is easier to get this done in awk with simple substr function calls:
awk -v i=2 -v n=2 'i>0{$0 = substr($0, 1, i-1) substr($0, i+n)} 1' file
14 5678
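The awk command above writes to stdout; since the question asks for test.txt itself to change, one option (a sketch) is to write to a temporary file and move it back:
awk -v i=2 -v n=2 'i>0{$0 = substr($0, 1, i-1) substr($0, i+n)} 1' test.txt > tmp.txt
mv tmp.txt test.txt
With GNU awk 4.1 or later, gawk -i inplace with the same program edits the file directly and avoids the temporary file.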
Assuming OP must use sed (otherwise other options could include cut and awk, but those would require some extra file I/O to replace the original file with the modified results) ...
Starting with the sed command to remove the 2 characters starting in column 2:
$ echo '1234 5678' > test.txt
$ sed -i -r "s/(.{1}).{2}(.*$)/\1\2/g" test.txt
$ cat test.txt
14 5678
Where:
(.{1}) - match first character in line and store in buffer #1
.{2} - match next 2 characters but don't store in buffer
(.*$) - match rest of line and store in buffer #2
\1\2 - output contents of buffers #1 and #2
Now, how to get variables for start and length into the sed command?
Assume we have the following variables:
$ s=2 # start
$ n=2 # length
To map these variables into our sed command we can break the sed search-replace pattern into parts, replacing the first 1 and 2 with our variables like so:
replace {1} with {$((s-1))}
replace {2} with {${n}}
Bringing this all together gives us:
$ s=2
$ n=2
$ echo '1234 5678' > test.txt
$ set -x # echo what sed sees to verify the correct mappings:
$ sed -i -r "s/(.{"$((s-1))"}).{${n}}(.*$)/\1\2/g" test.txt
+ sed -i -r 's/(.{1}).{2}(.*$)/\1\2/g' test.txt
$ set +x
$ cat test.txt
14 5678
Alternatively, do the subtraction (s-1) before the sed call and just pass in the new variable, eg:
$ x=$((s-1))
$ sed -i -r "s/(.{${x}}).{${n}}(.*$)/\1\2/g" test.txt
$ cat test.txt
14 5678
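Folding this back into the OP's delete_var skeleton might look like the following sketch (assumes GNU sed, since -i and -r are used):
delete_var() {
    local s=$1 n=$2
    local x=$((s-1))
    sed -i -r "s/(.{${x}}).{${n}}(.*$)/\1\2/g" test.txt
}
clmn_index=$1
_N=$2
delete_var "$clmn_index" "$_N"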
One idea using cut, keeping in mind that storing the results back into the original file will require an intermediate file (eg, tmp.txt) ...
Assume our variables:
$ s=2 # start position
$ n=2 # length of string to remove
$ x=$((s-1)) # last column to keep before the deleted characters (1 in this case)
$ y=$((s+n)) # start of first column to keep after the deleted characters (4 in this case)
At this point we can use cut -c to designate the columns to keep:
$ echo '1234 5678' > test.txt
$ set -x # display the cut command with variables expanded
$ cut -c1-${x},${y}- test.txt
+ cut -c1-1,4- test.txt
14 5678
Where:
1-${x} - keep range of characters from position 1 to position ${x} (1-1 in this case)
${y}- - keep range of characters from position ${y} to end of line (4-EOL in this case)
NOTE: You could also use cut's ability to work with the complement (ie, explicitly tell what characters to remove ... as opposed to above which says what characters to keep). See KamilCuk's answer for an example.
Obviously (?) the above does not overwrite test.txt so you'd need an extra step, eg:
$ echo '1234 5678' > test.txt
$ cut -c1-${x},${y}- test.txt > tmp.txt # store result in intermediate file
$ cat tmp.txt > test.txt # copy intermediate file over original file
$ cat test.txt
14 5678
Looks like:
cut --complement -c $1-$(($1 + $2 - 1))
This should just work: it deletes $2 characters starting at column $1.
Please provide code showing how to change test.txt.
cut can't modify a file in place, so either write the result to a temporary file or use sponge.
tmp=$(mktemp)
cut --complement -c $1-$(($1 + $2 - 1)) test.txt > "$tmp"
mv "$tmp" test.txt
The command below removes the 2nd character of each line. Try using it in a loop:
sed 's/.//2' test.txt
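For example, deleting n characters starting at column s by repeating the single-character deletion (my own sketch; not efficient, and it assumes GNU sed for in-place editing with -i):
s=2
n=2
for ((k = 0; k < n; k++)); do
    sed -i "s/.//$s" test.txt
done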

How can I know the number of lines and columns that many txt files have

I have many files in my directory.
It is very tedious to open them one by one to see how many lines and columns each has.
I want to know if there is an automatic way to do it.
As an example, I create a txt file on my desktop and call it myfile:
check myfile Myname
FALSE 0 Q9Y383
FALSE 1 Q9Y383
FALSE 2 Q9Y383
FALSE 3 Q15366-2
FALSE 6 Q15366-2
FALSE 7 Q15366-2
I paste this in there, so I am sure I have 3 columns and 7 rows (when I open it as an xls file).
I tried to do it for one single file like
wc -l mytextfile
but it shows 0.
This is only one file; what if I have 1000 files?
Given:
$ cat /tmp/f.txt
check myfile Myname
FALSE 0 Q9Y383
FALSE 1 Q9Y383
FALSE 2 Q9Y383
FALSE 3 Q15366-2
FALSE 6 Q15366-2
FALSE 7 Q15366-2
For a single file, you can use awk:
$ awk 'NR==1{cols=NF} END{print cols, NR}' /tmp/f.txt
3 7
If you have gawk you can handle multiple files (*.ext) easily:
$ gawk 'BEGIN { printf "%4s%8s\n", "cols", "lines"}
FNR==1{cols=NF}
ENDFILE{cnt++;printf "%3i %10i %-60s\n", cols, FNR, FILENAME}
END{ printf "%14i lines in %i files\n", NR, cnt}' /tmp/*.txt
Which produces (for me)
cols lines
3 7 /tmp/f.txt
1 20000000 /tmp/test.txt
20000007 lines in 2 files
Edit
If you have ancient Mac files (where the newlines are not some form of \n) you can do:
$ awk -v RS='\r' 'NR==1{cols=NF} END{print cols, NR}' your_file
Or,
$ gawk -v RS='\r' 'BEGIN { printf "%4s%8s\n", "cols", "lines"}
FNR==1 { cols=NF }
ENDFILE { cnt++;printf "%3i %10i %-60s\n", cols, FNR, FILENAME }
END { printf "%14i lines in %i files\n", NR, cnt}' *.files
wc -l file will show you the number of lines; assuming whitespace-separated values, read -r -a cols <file && echo "${#cols[@]}" will give you the number of columns (in the first line).
All of these will work with wildcards. If you have 1000 files, you can run:
printf '%s\0' *.txt | xargs -0 wc -l
...or...
for file in *.txt; do
    read -r -a cols <"$file" && echo "$file ${#cols[@]}"
done
Note that in at least one other question, you had a text file with CR newlines rather than LF or CRLF newlines. For those, you'll want to use read -r -d $'\r' -a cols.
Similarly, if your text file format prevents wc -l from working correctly for that same reason, you might need the following much-less-efficient alternative:
for file in *.txt; do
    printf '%s\t' "$file"
    tr '\r' '\n' <"$file" | wc -l
done
Just use a for statement.
for f in *
do
    wc -l "$f"
done
and add more commands to the for loop when you have other things to repeat per file; see the sketch below.
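For example, reporting both counts per file could look like this (a sketch; assumes whitespace-separated columns and normal \n line endings):
for f in *.txt
do
    lines=$(wc -l < "$f")
    cols=$(awk 'NR==1{print NF; exit}' "$f")
    printf '%s\t%s lines\t%s columns\n' "$f" "$lines" "$cols"
done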
Your file has ‘mac’ line endings – that is, lines separated by carriage-return rather than newline (which are ‘unix’ line endings), and it appears that wc can recognise only the latter.
You have two options: convert your input files to 'unix' line endings once, or do the conversion on the fly.
For example
% alias frommac="tr '\r' '\n'"
% frommac <myfile >myfile.unix
% wc -l myfile.unix
or
% frommac <myfile | wc -l
If you have lots of these files, then you could do something like
% wc -l *.unix
(if you've pre-converted the input files as above), or
% for f in *; do frommac <$f | wc -l; done
...or something along those lines.

Combine two lines from different files when the same word is found in those lines

I'm new to bash, and I want to combine two lines from different files when the same word is found in those lines.
E.g.:
File 1:
organism 1
1 NC_001350
4 NC_001403
organism 2
1 NC_001461
1 NC_001499
File 2:
NC_001499 » Abelson murine leukemia virus
NC_001461 » Bovine viral diarrhea virus 1
NC_001403 » Fujinami sarcoma virus
NC_001350 » Saimiriine herpesvirus 2 complete genome
NC_022266 » Simian adenovirus 18
NC_028107 » Simian adenovirus 19 strain AA153
I wanted an output like:
File 3:
organism 1
1 NC_001350 » Saimiriine herpesvirus 2 complete genome
4 NC_001403 » Fujinami sarcoma virus
organism 2
1 NC_001461 » Bovine viral diarrhea virus 1
1 NC_001499 » Abelson murine leukemia virus
Is there any way to get anything like that output?
You can get something pretty similar to your desired output like this:
awk 'NR == FNR { a[$1] = $0; next }
{ print $1, ($2 in a ? a[$2] : $2) }' file2 file1
This reads in each line of file2 into an array a, using the first field as the key. Then for each line in file1 it prints the first field followed by the matching line in a if one is found, else the second field.
If the spacing is important, then it's a little more effort but totally possible.
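For instance, one way to keep file1's lines exactly as they are and just append the description (a sketch along the same lines):
awk 'NR == FNR {
         key = $1
         line = $0
         sub(/^[^ ]+[ ]+/, "", line)   # drop the accession, keep "» description"
         d[key] = line
         next
     }
     $2 in d { print $0, d[$2]; next }  # append description, original spacing kept
     { print }                           # lines like "organism 1" pass through
' file2 file1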
For a more Bash 4 ish solution:
declare -A descriptions
while read -r line; do
    name=$(echo "$line" | cut -d '»' -f 1 | xargs echo)
    description=$(echo "$line" | cut -d '»' -f 2)
    descriptions[$name]=" »$description"
done < file2

while read -r line; do
    name=$(echo "$line" | cut -d ' ' -f 2)
    if [[ -n "$name" && -n "${descriptions[$name]}" ]]; then
        echo "${line}${descriptions[$name]}"
    else
        echo "$line"
    fi
done < file1
We could create a sed script from the second file and apply it to the first file. It is straightforward: we use the sed s command to construct another sed s command from each line and store it in a variable for later use:
sc=$(sed -rn 's#^\s+(\w+)([^\w]+)(.*)$#s/\1/\1\2\3/g;#g; p;' file2 )
sed "$sc" file1
The first command looks weird because we use # as the delimiter in the outer sed s command and the more common / in the inner one.
Do an echo "$sc" to study the inner one. It just captures the parts of each line of file2 in different capture groups and then combines the captured strings into an s/find/replace/g; where
find is \1
replace is \1\2\3
You want to rebuild file2 into a sed-command file.
sed 's# \(\w\+\) \(.*\)#s/\1/\1 \2/#' File2
You can use process substitution to use the result without storing it in a temp file.
sed -f <(sed 's# \(\w\+\) \(.*\)#s/\1/\1 \2/#' File2) File1

Using grep and awk together

I have a file (A.txt) with 4 columns of numbers and another file with 3 columns of numbers (B.txt). I need to solve the following problems:
Find all lines in A.txt whose 3rd column has a number that appears anywhere in the 3rd column of B.txt.
Assume that I have many files like A.txt in a directory and I need to run this for every file in that directory.
How do I do this?
You should never see someone using grep and awk together because whatever grep can do, you can also do in awk:
Grep and Awk
grep "foo" file.txt | awk '{print $1}'
Using Only Awk:
awk '/foo/ {print $1}' file.txt
I had to get that off my chest. Now to your problem...
Awk is a programming language that assumes a single loop through all the lines in a set of files, and that is not quite what you want here. Instead, you want to treat B.txt as a special file and loop through your other files. That normally calls for something like Python or Perl. (Older versions of bash didn't handle associative arrays, so those versions of bash won't work.) However, it looks like slitvinov found an awk answer.
Here's a Perl solution anyway:
use strict;
use warnings;
use feature qw(say);
use autodie;

my $b_file = shift;
open my $b_fh, "<", $b_file;

#
# This tracks the values in "B"
#
my %valid_lines;
while ( my $line = <$b_fh> ) {
    chomp $line;
    my @array = split /\s+/, $line;
    $valid_lines{$array[2]} = 1;    # Third column
}
close $b_fh;

#
# This handles the rest of the files
#
while ( my $line = <> ) {    # The rest of the files
    chomp $line;
    my @array = split /\s+/, $line;
    next unless exists $valid_lines{$array[2]};    # Next unless field #3 was in B.txt too
    say $line;
}
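Invocation would look something like this (the script name is mine):
perl filter_by_b.pl B.txt A*.txt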
Here is an example. Create the following files and run
awk -f c.awk B.txt A*.txt
c.awk
FNR==NR {
    s[$3]
    next
}
$3 in s {
    print FILENAME, $0
}
A1.txt
1 2 3
1 2 6
1 2 5
A2.txt
1 2 3
1 2 6
1 2 5
B.txt
1 2 3
1 2 5
2 1 8
The output should be:
A1.txt 1 2 3
A1.txt 1 2 5
A2.txt 1 2 3
A2.txt 1 2 5

Reorder lines of file by given sequence

I have a document A which contains n lines. I also have a sequence of n integers all of which are unique and <n. My goal is to create a document B which has the same contents as A, but with reordered lines, based on the given sequence.
Example:
A:
Foo
Bar
Bat
sequence: 2,0,1 (meaning: First line 2, then line 0, then line 1)
Output (B):
Bat
Foo
Bar
Thanks in advance for the help
Another solution:
You can create a sequence file by doing (assuming sequence is comma delimited):
echo $sequence | sed s/,/\\n/g > seq.txt
Then, just do:
paste seq.txt A.txt | sort -n | sed "s/^[0-9]*\s//"
Here's a bash function. The order can be delimited by anything.
Usage: schwartzianTransform "A.txt" 2 0 1
function schwartzianTransform {
    local file="$1"
    shift
    local sequence="$@"
    echo -n "$sequence" | sed 's/[^[:digit:]][^[:digit:]]*/\
/g' | paste -d ' ' - "$file" | sort -n | sed 's/^[[:digit:]]* //'
}
Read the file into an array and then use the power of indexing:
echo "Enter the input file name"
read ip
index=0
while read line ; do
    NAME[$index]="$line"
    index=$(($index+1))
done < $ip
echo "Enter the file having order"
read od
while read line ; do
    echo "${NAME[$line]}";
done < $od
[aman@aman sh]$ cat test
Foo
Bar
Bat
[aman@aman sh]$ cat od
2
0
1
[aman@aman sh]$ ./order.sh
Enter the input file name
test
Enter the file having order
od
Bat
Foo
Bar
An awk one-liner could do the job:
awk -vs="$s" '{d[NR-1]=$0}END{split(s,a,",");for(i=1;i<=length(a);i++)print d[a[i]]}' file
$s is your sequence.
Take a look at this example:
kent$ seq 10 >file # get a 10-line file
kent$ s=$(seq 0 9 |shuf|tr '\n' ','|sed 's/,$//') # get a random sequence by shuf
kent$ echo $s #check the sequence in var $s
7,9,1,0,5,4,3,8,6,2
kent$ awk -vs="$s" '{d[NR-1]=$0}END{split(s,a,",");for(i=1;i<=length(a);i++)print d[a[i]]}' file
8
10
2
1
6
5
4
9
7
3
One way (not an efficient one for big files, though):
$ seq="2 0 1"
$ for i in $seq
> do
> awk -v l="$i" 'NR==l+1' file
> done
Bat
Foo
Bar
If your file is a big one, you can use this one:
$ seq='2,0,1'
$ x=$(echo $seq | awk '{printf "%dp;", $0+1;print $0+1> "tn.txt"}' RS=,)
$ sed -n "$x" file | awk 'NR==FNR{a[++i]=$0;next}{print a[$0]}' - tn.txt
The 2nd line prepares a sed print instruction, which is then used with sed in the 3rd line. The sed command prints only the line numbers present in the sequence, but not in the order of the sequence; the awk command then reorders the sed output according to the sequence (which was written line by line to tn.txt).
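For the same seq='2,0,1', the intermediate values should look roughly like this (my own quick illustration of what the 2nd line produces):
$ echo $seq | awk '{printf "%dp;", $0+1;print $0+1> "tn.txt"}' RS=,
3p;1p;2p;
$ cat tn.txt
3
1
2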
