concatenating multiple files - bash

I have multiple files, and in each file is the following:
>HM001
ATGCT...
>HM002
ATGTC...
>HM003
ATGCC...
That is, each file contains one gene sequence for species HM001 to HM050. I would like to concatenate all these files, so I have a single file that contains the genome for species HM001 to HM050:
>HM001
ATGCT...ATGAA...ATGTT
>HM002
ATGTC...ATGCT...ATGCT
>HM003
ATGCC...ATGC...ATGAT
The ellipses are not actually required in the final file. I suppose cat should be used, but I'm not sure how. Any ideas would be appreciated.

Data parsing and formatting will be a lot easier with awk. Try this:
awk -v RS=">" 'FNR>1{a[$1]=a[$1]?a[$1] FS $2:$2}END{for(x in a) print RS x ORS a[x]}' f1 f2 f3
For files like:
==> f1 <==
>HM001
ATGCT...
>HM002
ATGTC...
>HM003
ATGCC...
==> f2 <==
>HM001
ATGDD...
>HM002
ATGDD...
>HM003
ATGDD...
==> f3 <==
>HM001
ATGEE...
>HM002
ATGEE...
>HM003
ATGEE...
awk -v RS=">" 'FNR>1{a[$1]=a[$1]?a[$1] FS $2:$2}END{for(x in a) print RS x ORS a[x]}' f1 f2 f3
>HM001
ATGCT... ATGDD... ATGEE...
>HM002
ATGTC... ATGDD... ATGEE...
>HM003
ATGCC... ATGDD... ATGEE...

Might I suggest converting your group of files into a CSV? It's almost exactly what you're suggesting, and it is easily incorporated into just about any application for processing (e.g., Excel, R, Python).
Up front, I'll assume that all species and gene sequences are simply alphanumeric, with no spaces or quote-like characters. I'm also assuming access to sed, sort, and uniq, which are all standard on *nix and macOS, and easily accessible on Windows via msys or cygwin, to name two.
First, generate an array of file names and an array of species. I'm assuming the files are named file1, file2, etc.; just adjust the first line accordingly (note that it's just a glob, not an executed command):
FILES=(file*)
SPECIES=($(sed -ne 's/^>//gp' file* | sort | uniq))
This gives us one line per species, sorted, with no repeats. This
ensures that our columns are independent and the set is complete.
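A quick way to eyeball both arrays before going further (this just prints them, one entry per line, purely for checking):
printf 'file: %s\n' "${FILES[@]}"
printf 'species: %s\n' "${SPECIES[@]}"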
Next, create a CSV header row with named columns, dumping it into a
CSV file named csvfile:
echo -n "\"Species\"" > csvfile
for fn in "${FILES[@]}" ; do echo -n ",\"${fn}\"" ; done >> csvfile
echo >> csvfile
Now iterate through each gene sequence and extract it from all files:
for sp in "${SPECIES[@]}" ; do
    echo -n "\"${sp}\""
    for fn in "${FILES[@]}" ; do
        ANS=$(sed -ne '/>'"${sp}"'/,/^/ { /^[^>]/p }' "${fn}")
        echo -n ",\"${ANS}\""
    done
    echo
done >> csvfile
This works but is inefficient for larger data sets (i.e., large
numbers of files and/or species). Better implementations (e.g., Python, Ruby, Perl, even R) would read each file once, forming an internally maintained matrix, dictionary, or associative array, and write out the CSV in one chunk.
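Just to illustrate that single-pass idea, here is a rough sketch in awk rather than one of those languages. It assumes every species header appears in every file, exactly one sequence line per header, and the file1, file2, ... naming used above:
awk -v RS=">" '
FNR == 1 { header = header ",\"" FILENAME "\""; next }   # one CSV column per input file
{
    if (!($1 in row)) species[++n] = $1                  # remember first-seen species order
    row[$1] = row[$1] ",\"" $2 "\""                      # append this file's sequence
}
END {
    print "\"Species\"" header
    for (i = 1; i <= n; i++) print "\"" species[i] "\"" row[species[i]]
}' file* > csvfile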

What about appending them using echo, along these lines?
find . -type f -exec bash -c 'echo "append this" >> "$0"' {} \;
Source: https://stackoverflow.com/a/15604608/1662973
On MS-DOS/Windows I would do it with "type", but the above should work for you.

The simplest way I can think of is to use cat. For example (assuming you're on a *nix-type system):
cat file1 file2 file3 > outfile
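If the files follow a common naming pattern you don't have to list them all by hand; a brace expansion works just as well (a sketch assuming names like file1 ... file50):
cat file{1..50} > outfile    # brace expansion keeps numeric order, unlike the glob file*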

Another awk implementation:
awk '
{key=$0; getline; value[key] = value[key] $0}
END {for (key in value) {print key; print value[key]}}
' file ...
Now, this will probably not output the keys in sorted order: awk array keys are inherently unordered. To ensure sorted output, use gawk and its asorti() function:
awk '
{key=$0; getline; val[key] = val[key] $0}
END {
n = asorti(val, keys)
for (i=1; i<=n; i++) {print keys[i]; print val[keys[i]]}
}
' file ...

Related

cat multiple files into one using same amount of rows as file B from A B C

This is a strange question; I have been looking around and I wasn't able to find anything matching what I wish to do.
What I'm trying to do is:
File A, File B, File C
5 Lines, 3 Lines, 2 Lines.
Join all the files into one file, matching the number of lines of file B.
The output should be
File A, File B, File C
3 Lines, 3 Lines, 3 Lines.
So in file A I have to remove two lines, and in file C I have to duplicate one line, so I can match the same number of lines as file B.
I was thinking to do a count to see how many lines each file has first
count1=`wc -l FileA| awk '{print $1}'`
count2=`wc -l FileB| awk '{print $1}'`
count3=`wc -l FileC| awk '{print $1}'`
Then, if a file has more lines than file B, remove lines; otherwise, add lines.
But I have got lost, as I'm not sure how to continue with this; I've never seen anyone try to do this.
Can anyone point me to an idea?
thanks.
Could you please try the following. I have used # as a separator; you could change it as per your need.
paste -d'#' file1 file2 file3 |
awk -v file2_lines="$(wc -l < file2)" '
BEGIN{
    FS=OFS="#"
}
FNR<=file2_lines{
    $1=$1?$1:prev_first
    $3=$3?$3:prev_third
    print
    prev_first=$1
    prev_third=$3
}'
Example of running above code:
Let's say the following are the Input_file(s):
cat file1
File1_line1
File1_line2
File1_line3
File1_line4
File1_line5
cat file2
File2_line1
File2_line2
File2_line3
cat file3
File3_line1
File3_line2
When I run the above code in the form of a script, the following will be the output:
./script.ksh
File1_line1#File2_line1#File3_line1
File1_line2#File2_line2#File3_line2
File1_line3#File2_line3#File3_line2
You can get the first n lines of a file with the head command (or sed), and you can generate new (empty) lines with echo.
Wrapping that in a small bash function that pads or truncates a file to a given number of lines keeps things tidy:
#!/bin/bash
fix_numlines() {
    local filename=$1
    local wantlines=$2
    local havelines=$(grep -c . "${filename}")
    head -n "${wantlines}" "${filename}"
    if [ $havelines -lt $wantlines ]; then
        for i in $(seq $((wantlines-havelines))); do echo; done
    fi
}

lines=$(grep -c . fileB)
fix_numlines fileA ${lines}
fix_numlines fileB ${lines}
fix_numlines fileC ${lines}
if you want columnated output, it's even simpler:
paste fileA fileB fileC | head -$(grep -c . fileB)
Another for GNU awk that outputs in columns:
$ gawk -v seed=$RANDOM -v n=2 '     # the n parameter is the file index number
BEGIN {                             # ... which defines the record count
    srand(seed)                     # a random record is printed when there are not enough records
}
{
    a[ARGIND][c[ARGIND]=FNR]=$0     # hash all data to array a first
}
END {
    for(r=1;r<=c[n];r++)            # loop records
        for(f=1;f<=ARGIND;f++)      # and fields for the output below
            printf "%s%s",((r in a[f])?a[f][r]:a[f][int(rand()*c[f])+1]),(f==ARGIND?ORS:OFS)
}' a b c                            # -v n=2 means the second file, ie. b
Output:
a1 b1 c1
a2 b2 c2
a3 b3 c1
If you don't like the random pick of a record, replace int(rand()*c[f])+1 with c[f] (the last record of that file).
$ gawk '                        # remember, GNU awk only
NR==FNR {                       # count the given file's records
    bnr=FNR
    next
}
{
    print                       # output records of a b c
    if(FNR==bnr)                # ... up to bnr records
        nextfile                # and skip to the next file
}
ENDFILE {                       # if you get to the end of a file
    if(bnr>FNR)                 # but bnr is not big enough
        for(i=FNR;i<bnr;i++)    # loop some
            print               # and duplicate the last record of the file
}' b a b c                      # first the file to count, then all the files to print
To make a file have n lines you can use the following function (usage: toLength n file). This omits lines at the end if the file is too long and repeats the last line if the file is too short.
toLength() {
{ head -n"$1" "$2"; yes "$(tail -n1 "$2")"; } | head -n"$1"
}
To set all files to the length of FileB and show them side by side use
n="$(wc -l < FileB)"
paste <(toLength "$n" FileA) FileB <(toLength "$n" FileC) | column -ts$'\t'
As observed by the user umläute, the side-by-side output makes things even easier. However, their solution pads short files with empty lines. The following solution repeats the last line instead, to make short files longer.
stretch() {
cat "$1"
yes "$(tail -n1 "$1")"
}
paste <(stretch FileA) FileB <(stretch FileC) | column -ts$'\t' |
head -n"$(wc -l < FileB)"
This is a clean way using awk where we read each file only a single time:
awk -v n=2 '
BEGIN{ while(1) {
    for(i=1;i<ARGC;++i) {
        if (b[i]=(getline tmp < ARGV[i])) a[i] = tmp
    }
    if (b[n]) for(i=1;i<ARGC;++i) print a[i] > (ARGV[i] ".new")
    else { break }
}
}' f1 f2 f3 f4 f5 f6
This works in the following way:
The lead file is defined by the index n. Here we choose the lead file to be f2.
We do not process the files with awk's standard record-by-record reading; instead we use the BEGIN block, where we read the files in parallel.
We run an infinite loop, while(1), which we break out of when the lead file has no more input.
Per cycle, we read a new line of each file using getline. If file i has a new line, we store it in a[i] and store the outcome of getline in b[i]. If file i has reached its end, the last line read stays in a[i].
We check the outcome of the lead file with b[n]. If we still read a line, we print all the lines to the files f1.new, f2.new, ...; otherwise, we break out of the infinite loop.
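A quick way to try it out (made-up inputs of 4, 3 and 2 lines; the .new files appear next to the originals):
printf '%s\n' a1 a2 a3 a4 > f1
printf '%s\n' b1 b2 b3    > f2    # lead file (n=2)
printf '%s\n' c1 c2       > f3
# run the awk command above on just f1 f2 f3, then:
cat f1.new    # a1 a2 a3   (truncated to 3 lines)
cat f3.new    # c1 c2 c2   (last line repeated to reach 3 lines)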

How can I make a script that calls awk in a loop over k/v pairs faster?

I have a large number of text files that I would like to loop through. While looping I would like to find lines that match a list of strings and extract each to a separate folder. I have a variable "ij" that needs to be split into "i" and "j" to match two columns. For example, 2733 needs to be split into 27 and 33. The script searches each text file and extracts every line that has an i and j of 2733.
The problem here is that I have nearly 100 different strings, so it takes about 35 hours to get through all of them.
Is there any way to extract all of the variables to separate files in just one loop? I am trying to loop through a text file, extract all the lines that are in my list of strings and output them to their own folder, then move on to the next text file.
I am currently using the "awk" command to accomplish this.
list="2741 2740 2739 2738 2737 2641 2640 2639 2638 2541 2540 2539 2538 2441 2440 2439 2438 2341 2340 2339 2241 2240 2141"
for string in $list
do
    for i in ${string:0:2}
    do
        for j in ${string:2:2}
        do
            awk -v i=$i -v j=$j '$2==j && $3==i {print $0}' $datadir/*.txt >"${fileout}${i}_${j}_Output.txt"
        done
    done
done
So I did this:
# for each 4 digits in the list
# add "a[" and "];" before and after the four numbers
# so awk array is "a[2741]; a[2740]; a[2739]; ...."
awkarray=$(sed -E 's/[0-9]{4}/a[&];/g' <<<"$list")
awk -v fileout="$fileout" '
BEGIN {'"$awkarray"'}
$3 $2 in a {
    print $0 > (fileout $3 "_" $2 "_Output.txt")
}
' "$datadir"/*.txt
So first I transform the list to load it as an array in awk. The array has only indexes, so I can check whether an index exists; the array elements need no values. Then I simply check if the concatenation of $3 and $2 (i followed by j, matching the four-digit strings) exists in the array; if it does, the output is redirected to the proper filename.
Remember to quote your variables. $datadir/*.txt may not work when datadir contains spaces; use "$datadir"/*.txt. The newlines in the awk script above can be removed, so if you prefer a one-liner:
awk -v fileout="$fileout" 'BEGIN {'"$(sed -E 's/[0-9]{4}/a[&];/g' <<<"$list")"'} $3 $2 in a { print $0 > (fileout $3 "_" $2 "_Output.txt") }' "$datadir"/*.txt
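To see what that sed step generates on its own, here it is with the first few list entries (result shown as a comment):
list="2741 2740 2739"
sed -E 's/[0-9]{4}/a[&];/g' <<<"$list"
# a[2741]; a[2740]; a[2739];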

Shell script - copy lines from file by key

I have two input files such that:
file1
123
456
789
file2
123|foo
456|bar
999|baz
I need to copy the lines from file2 whose keys are in file1, so the end result is:
file3
123|foo
456|bar
Right now, I'm using a shell script that loops through the key file and uses grep for each one:
grep "^${keys[$keyindex]}|" $datafile >&4
But as you can imagine, this is extremely slow. The key file (file1) has approximately 400,000 keys and the data file (file2) has about 750,000 rows. Is there a better way to do this?
You can try using join:
join -t'|' file1.txt file2.txt > file3.txt
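Note that join expects both inputs to be sorted on the join field; if they are not already sorted, process substitution takes care of it (a sketch):
join -t'|' <(sort file1.txt) <(sort -t'|' -k1,1 file2.txt) > file3.txt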
I would use something like Python, which would process it pretty fast if you used an optimized data type like a set. Not sure of your exact requirements, so you may need to adjust accordingly.
#!/usr/bin/python
# Create a set to store all of the items in file1
Set1 = set()
for line in open('file1', 'r'):
    Set1.add(line.strip())

# Open a file to write to
file4 = open('file4', 'w')

# Loop over file2, and only write out the items found in Set1
for line in open('file2', 'r'):
    if '|' not in line:
        continue
    parts = line.strip().split('|', 1)
    if parts[0] in Set1:
        file4.write(parts[1] + "\n")
join is the best solution, if sorting is OK. An awk solution:
awk -F \| '
FILENAME==ARGV[1] {key[$1];next}
$1 in key
' file1 file2

Comparing values in two files

I am comparing two files, each having one column and n number of rows.
file 1
vincy
alex
robin
file 2
Allen
Alex
Aaron
ralph
robin
If the data of file 1 is present in file 2 it should return 1, or else 0, in a tab-separated file.
Something like this
vincy 0
alex 1
robin 1
What I am doing is
#!/bin/bash
for i in `cat file1`
do
    cat file2 | awk '{ if ($1=="'$i'") print 1 ; else print 0 }' >> binary
done
The above code is not giving me the output I am looking for.
Kindly have a look and suggest a correction.
Thank you.
The simple awk solution:
awk 'NR==FNR{ seen[$0]=1 } NR!=FNR{ print $0 " " seen[$0] + 0}' file2 file1
A simple explanation: for the lines in file2, NR==FNR, so the first action is executed and we simply record that a line has been seen. In file1, the 2nd action is taken and the line is printed, followed by a space, followed by a "0" or a "1", depending on if the line was seen in file2.
AWK loves to do this kind of thing.
awk 'FNR == NR {a[tolower($1)]; next} {f = 0; if (tolower($1) in a) {f = 1}; print $1, f}' file2 file1
Swap the positions of file2 and file1 in the argument list to make file1 the dictionary instead of file2.
When FNR (the record number in the current file) and NR (the record number of all records so far) are equal, then the first file is the one being processed. Simply referencing an array element brings it into existence. This sets up the dictionary. The next instruction reads the next record.
Once FNR and NR aren't equal, subsequent file(s) are being processed and their data is looked up in the dictionary array.
The following code should do it.
Take a close look at the BEGIN and END sections.
#!/bin/bash
rm -f binary
for i in $(cat file1); do
    awk 'BEGIN {isthere=0;} { if ($1=="'$i'") isthere=1;} END { print "'$i'",isthere}' < file2 >> binary
done
There are several decent approaches. You can simply use line-by-line set math:
{
    grep -xF -f file2 file1 | sed $'s/$/\t1/'
    grep -vxF -f file2 file1 | sed $'s/$/\t0/'
} > somefile.txt
Another approach would be to simply combine the files and use uniq -c, then just swap the numeric column with something like awk:
sort file1 file2 | uniq -c | awk '{ print $2"\t"$1 }'
The comm command exists to do this kind of comparison for you.
The following approach does only one pass and scales well to very large input lists:
#!/bin/bash
while read; do
    if [[ $REPLY = $'\t'* ]] ; then
        printf "%s\t1\n" "${REPLY#?}"
    else
        printf "%s\t0\n" "${REPLY}"
    fi
done < <(comm -2 <(tr '[A-Z]' '[a-z]' <file1 | sort) <(tr '[A-Z]' '[a-z]' <file2 | sort))
See also BashFAQ #36, which is directly on-point.
Another solution, if you have Python installed.
If you're familiar with Python, the output only needs a bit of formatting.
#!/usr/bin/python
f1 = open('file1').readlines()
f2 = open('file2').readlines()
f1_in_f2 = [int(x in f2) for x in f1]
for n, c in zip(f1, f1_in_f2):
    print n, c

Deleting lines from one file which are in another file

I have a file f1:
line1
line2
line3
line4
..
..
I want to delete all the lines which are in another file f2:
line2
line8
..
..
I tried something with cat and sed, which wasn't even close to what I intended. How can I do this?
grep -v -x -f f2 f1 should do the trick.
Explanation:
-v to select non-matching lines
-x to match whole lines only
-f f2 to get patterns from f2
One can instead use grep -F or fgrep to match fixed strings from f2 rather than patterns (in case you want to remove the lines in a "what you see is what you get" manner rather than treating the lines in f2 as regex patterns).
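For example, the fixed-string variant would look like:
grep -v -x -F -f f2 f1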
Try comm instead (assuming f1 and f2 are "already sorted")
comm -2 -3 f1 f2
For exclude files that aren't too huge, you can use AWK's associative arrays.
awk 'NR == FNR { list[tolower($0)]=1; next } { if (! list[tolower($0)]) print }' exclude-these.txt from-this.txt
The output will be in the same order as the "from-this.txt" file. The tolower() function makes it case-insensitive, if you need that.
The algorithmic complexity will probably be O(n) (exclude-these.txt size) + O(n) (from-this.txt size)
Similar to Dennis Williamson's answer (mostly syntactic changes, e.g. setting the file number explicitly instead of the NR == FNR trick):
awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 exclude-these.txt f=2 from-this.txt
Accessing r[$0] creates the entry for that line, no need to set a value.
Assuming awk uses a hash table with constant lookup and (on average) constant update time, the time complexity of this will be O(n + m), where n and m are the lengths of the files. In my case, n was ~25 million and m ~14000. The awk solution was much faster than sort, and I also preferred keeping the original order.
If you have Ruby (1.9+):
#!/usr/bin/env ruby
b = File.read("file2").split
open("file1").each do |x|
    x.chomp!
    puts x if !b.include?(x)
end
This has O(N^2) complexity. If you care about performance, here's another version:
b=File.read("file2").split
a=File.read("file1").split
(a-b).each {|x| puts x}
which uses a hash to effect the subtraction, so is complexity O(n) (size of a) + O(n) (size of b)
Here's a little benchmark of the above, courtesy of user576875, but with 100K lines:
$ for i in $(seq 1 100000); do echo "$i"; done|sort --random-sort > file1
$ for i in $(seq 1 2 100000); do echo "$i"; done|sort --random-sort > file2
$ time ruby test.rb > ruby.test
real 0m0.639s
user 0m0.554s
sys 0m0.021s
$ time sort file1 file2 | uniq -u > sort.test
real 0m2.311s
user 0m1.959s
sys 0m0.040s
$ diff <(sort -n ruby.test) <(sort -n sort.test)
$
diff was used to show there are no differences between the 2 files generated.
Some timing comparisons between various other answers:
$ for n in {1..10000}; do echo $RANDOM; done > f1
$ for n in {1..10000}; do echo $RANDOM; done > f2
$ time comm -23 <(sort f1) <(sort f2) > /dev/null
real 0m0.019s
user 0m0.023s
sys 0m0.012s
$ time ruby -e 'puts File.readlines("f1") - File.readlines("f2")' > /dev/null
real 0m0.026s
user 0m0.018s
sys 0m0.007s
$ time grep -xvf f2 f1 > /dev/null
real 0m43.197s
user 0m43.155s
sys 0m0.040s
sort f1 f2 | uniq -u isn't even a symmetrical difference, because it removes lines that appear multiple times in either file.
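A tiny illustration of that pitfall with made-up files:
printf '%s\n' a a b > f1
printf '%s\n' b     > f2
sort f1 f2 | uniq -u    # prints nothing: "a" disappears because it is duplicated within f1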
comm can also be used with stdin and here strings:
echo $'a\nb' | comm -23 <(sort) <(sort <<< $'c\nb') # a
Seems to be a job suitable for the SQLite shell:
create table file1(line text);
create index if1 on file1(line ASC);
create table file2(line text);
create index if2 on file2(line ASC);
-- comment: if you have | in your files then specify “ .separator ××any_improbable_string×× ”
.import 'file1.txt' file1
.import 'file2.txt' file2
.output result.txt
select * from file1 where line not in (select line from file2);
.q
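Assuming those statements are saved in a file, say diff.sql (a name chosen here just for illustration), the sqlite3 shell can run them non-interactively:
sqlite3 temp.db < diff.sql    # temp.db is just a throwaway database file
cat result.txt                # lines of file1.txt that are not in file2.txt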
Did you try this with sed?
sed 's#^#sed -i '"'"'s%#g' f2 > f2.sh
sed -i 's#$#%%g'"'"' f1#g' f2.sh
sed -i '1i#!/bin/bash' f2.sh
sh f2.sh
Not a 'programming' answer but here's a quick and dirty solution: just go to http://www.listdiff.com/compare-2-lists-difference-tool.
Obviously won't work for huge files but it did the trick for me. A few notes:
I'm not affiliated with the website in any way (if you still don't believe me, then you can just search for a different tool online; I used the search term "set difference list online")
The linked website seems to make network calls on every list comparison, so don't feed it any sensitive data
A Python way of filtering one list using another list.
Load files:
>>> f1 = open('f1').readlines()
>>> f2 = open('f2').readlines()
Remove '\n' string at the end of each line:
>>> f1 = [i.replace('\n', '') for i in f1]
>>> f2 = [i.replace('\n', '') for i in f2]
Print only the f1 lines that are not in the f2 file:
>>> [a for a in f1 if all(b not in a for b in f2)]
$ cat values.txt
apple
banana
car
taxi
$ cat source.txt
fruits
mango
king
queen
number
23
43
sentence is long
so what
...
...
I made a small shell script to "weed" out the values in the source file which are present in the values.txt file.
$ cat weed_out.sh
from=$1
cp -p $from $from.final
for x in `cat values.txt`;
do
    grep -v $x $from.final > $from.final.tmp
    mv $from.final.tmp $from.final
done
executing...
$ ./weed_out.sh source.txt
and you get a nicely cleaned up file....
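For what it's worth, the same weeding can also be done in a single grep call along the lines of the first answer above (fixed-string, unanchored matches, roughly equivalent to the loop):
grep -vFf values.txt source.txt > source.txt.final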
