How to merge three text files into three columns on screen - bash

How can I merge three text files into three columns on screen?
1  A  1
2  B  2
3  C  3
   D
   E
I tried...
paste file1.txt file2.txt file3.txt | column -s $'\t' -t
...but I always get
1 A 1
2 B 2
3 C 3
D
E
Thanks in advance for your help!
Lines 1-2 of file1.txt:
USB Device Class ID:
CdRom&Ven_ZALMAN&Prod__Virtual_CD-Rom&Rev_
Lines 1-2 of file2.txt:
USB Instance ID:
______XX00000001&1
Lines 1-2 of file3.txt:
Last updated (Subkey):
2015-01-12 15:08:45 UTC+0000

I don't know your input files, but paste works as intended.
$ paste <(seq 1 4) <(seq 10 17) <(seq 5 9)
1       10      5
2       11      6
3       12      7
4       13      8
        14      9
        15
        16
        17

:|paste -d ' ' file1 - file2 - file3 | column -ts "| "
This combines many files into one table: column -t aligns the output and -s "| " sets the separator characters.
The output will look like this:
1  A  1
2  B  2
3  C  3
   D
   E

If you only have 3 files or a few to deal with you can do this:
$ paste foo[12].txt | expand -t 45 | paste - foo3.txt | expand -t 12
USB Device Class ID:                         USB Instance ID:            Last updated (Subkey):
CdRom&Ven_ZALMAN&Prod__Virtual_CD-Rom&Rev_   ______XX00000001&1          2015-01-12 15:08:45 UTC+0000
                                             ______XY0000000182
$
You need to choose the tab expansions 45 and 12 depending on the maximum line widths in foo1.txt and foo2.txt.
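If you would rather not pick the stop by hand, here is a minimal sketch (assuming GNU awk and expand; the extra 3 columns of padding are an arbitrary choice) that derives it from the longest line of foo1.txt:
# pick a tab stop a little wider than the longest line of foo1.txt
stop=$(awk '{ if (length($0) > w) w = length($0) } END { print w + 3 }' foo1.txt)
paste foo1.txt foo2.txt | expand -t "$stop"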

Related

Count overlapping occurrences of a substring *in a very large file* using Bash

I have files on the order of a few dozen gigabytes (genome data) on which I need to find the number of occurrences for a substring. While the answers I've seen here use grep -o then wc -l, this seems like a hacky way that might not work for the very large files I need to work with.
Does the grep -o/wc -l method scale well for large files? If not, how else would I go about doing it?
For example,
aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt
111
    222
     333
            444
             555
              666
must return 6 occurrences for aaa; the numbered markers above show where each overlapping match starts. (Except there are maybe 10 million more lines of this.)
Find 6 overlapping substrings aaa in the string
line="aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt"
You don't want to see the strings, you want to count them.
When you try
# wrong
grep -o -F "aaa" <<< "${line}" | wc -l
you are missing the overlapping strings.
With the substring aaa you have 5 hits in aaaaaaa, so how do we handle ${line}?
Start with
grep -Eo "a{3,}" <<< "${line}"
Result
aaa
aaaa
aaaaa
How many hits do we have? 1 for aaa, 2 for aaaa and 3 for aaaaa.
Compare the total count of characters with the number of lines (both from wc; note that wc -c also counts the newline after each match, which is why aaa shows 4 characters):
match    lines    chars    add_to_total
aaa        1        4           1
aaaa       1        5           2
aaaaa      1        6           3
For each line, subtract 3 from the character count of that line.
When the result has 3 lines and 15 characters in total, calculate
15 characters - (3 lines * 3 characters) = 15 - 9 = 6
In code:
read -r lines chars < <(grep -Eo "a{3,}" <<< "${line}" | wc -lc)
echo "Substring count: $((chars - (3 * lines)))"
Or for a file
read -r lines chars < <(grep -Eo "a{3,}" "${file}" | wc -lc)
echo "Substring count: $((chars - (3 * lines)))"
aaa was "easy"; how about other search strings?
I think you have to look for the substring and think of a formula that works for that substring. abcdefghi will have no overlapping strings, but abcdabc might.
Potential matches with abcdabc are
abcdabc
abcdabcdabc
abcdabcdabcdabc
Use the test line
line="abcdabcdabcdabc something else abcdabcdabcdabc no match here abcdabc and abcdabcdabc"
You need "abc(dabc)+" and get:
match              lines    chars    add_to_total
abcdabcdabcdabc      1       16           3
abcdabcdabcdabc      1       16           3
abcdabc              1        8           1
abcdabcdabc          1       12           2
For each line, subtract 4 from the character count of that line and divide the result by 4; summed over all lines that is (total characters / 4) - number of lines. When the result has 4 lines and 52 characters, calculate
52 characters / 4 - 4 lines = 13 - 4 = 9
In code:
read -r lines chars < <(grep -Eo "abc(dabc)+" <<< "${line}" | wc -lc)
echo "Substring count: $(( chars / 4 - lines))"
When you have a large file, you might want to split it first.
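If working out a formula per search string feels fragile, here is a more general (but slower) sketch that counts overlapping occurrences of any fixed string by advancing one character past each hit; needle and ${file} are placeholders for your own values:
awk -v needle="aaa" '
{
    line = $0
    while ((pos = index(line, needle)) > 0) {   # index() searches for a fixed string, no regex
        count++
        line = substr(line, pos + 1)            # step one character forward so overlaps are counted
    }
}
END { print count + 0 }
' "${file}"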
I suppose there are 2 approaches to this (both methods report 29/6 for the 2 test lines):
Use the summation method:
# WHINY_USERS=1 is a shell param for mawk-1 to pre-sort array
${input……} | WHINY_USERS=1 {m,g}awk '
BEGIN {
1 FS = "[^a]+(aa?[^a]+)*"
1 OFS = "|"
1 PROCINFO["sorted_in"] = "#ind_str_asc"
} {
2 _ = ""
2 OFS = "|"
2 gsub("^[|]*|[|]*$",_, $!(NF=NF))
2 split(_,__)
split($-_,___,"[|]+")
12 for (_ in ___) {
12 __[___[_]]++
}
2 _____=____=_<_
2 OFS = "\t"
2 print " -- line # "(NR)
7 for (_ in __) {
7 print sprintf(" %20s",_), __[_], \
______=__[_] * (length(_)-2),\
"| "(____+=__[_]), _____+=______
}
print "" }'
|
-- line # 1
aaa 3 3 | 3 3
aaaa 2 4 | 5 7
aaaaa 3 9 | 8 16
aaaaaaaaaaaaaaa 1 13 | 9 29
-- line # 2
aaa 1 1 | 1 1
aaaa 1 2 | 2 3
aaaaa 1 3 | 3 6
Print out all the copies of that substring:
{m,g}awk' {
2 printf("%s%.*s",____=$(_=_<_),_, NF=NF)
9 do { _+=gsub(__,_____)
} while(index($+__,__))
2 if(_) {
2 ____=substr(____,-_<_,_)
2 gsub(".", (":")__, ____)
2 print "}-[(# " (_) ")]--;\f\b" substr(____, 2)
} else { print "" } }' FS='[^a]+(aa?[^a]+)*' OFS='|' __='aaa' _____='aa'
|
aaagtcgaaaaagtccatgcaaataaaagtcgaaaaagtccatgcatatgatactttttttttt
tttttttaaagtcgaaaaagaaaaaaaaaaaaaaatataaaatccatgc}-[(# 29)]--;
aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:
aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa:aaa
aaataaaagtcgaaaaagtccatgcatatgatacttttttttttttttttt}-[(# 6)]--;
aaa:aaa:aaa:aaa:aaa:aaa

Dividing one file into separate files based on line numbers

I have the following test file:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
I want to separate it in a way that each file contains the last line of the previous file as the first line. The example would be:
file 1:
1
2
3
4
5
file2:
5
6
7
8
9
file3:
9
10
11
12
13
file4:
13
14
15
16
17
file5:
17
18
19
20
That would make 4 files with 5 lines and 1 file with 4 lines.
As a first step, I tried the following commands to get only the first file, which should contain the first 5 lines. I can't figure out why the awk command in the if statement prints the whole 20 lines instead of only the first 5.
d=$(wc test)
a=$(echo $d | cut -f1 -d " ")
lines=$(echo $a/5 | bc -l)
integer=$(echo $lines | cut -f1 -d ".")
for i in $(seq 1 $integer); do
    start=$(echo $i*5 | bc -l)
    var=$((var+=1))
    echo start $start
    echo $var
    if [[ $var = 1 ]]; then
        awk 'NR<=$start' test
    fi
done
Thanks!
Why not just use the split utility available in your POSIX toolkit? It has an option to split on a number of lines, which you can give as 5:
split -l 5 input-file
From the split man page:
-l, --lines=NUMBER
put NUMBER lines/records per output file
Note that -l is POSIX compliant as well.
$ ls
$
$ seq 20 | awk 'NR%4==1{ if (out) { print > out; close(out) } out="file"++c } {print > out}'
$
$ ls
file1 file2 file3 file4 file5
$ cat file1
1
2
3
4
5
$ cat file2
5
6
7
8
9
$ cat file3
9
10
11
12
13
$ cat file4
13
14
15
16
17
$ cat file5
17
18
19
20
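For readability, here is the same one-liner spread out with comments (a sketch of the logic above, not a different program):
seq 20 | awk '
    NR % 4 == 1 {            # lines 1, 5, 9, ... start a new chunk
        if (out) {           # if an output file is already open...
            print > out      # ...repeat the boundary line at the end of it
            close(out)
        }
        out = "file" ++c     # then switch to the next output file: file1, file2, ...
    }
    { print > out }          # every line also goes to the current file
'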
If you're ever tempted to use a shell loop to manipulate text again, make sure to read https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice first to understand at least some of the reasons to use awk instead. To learn awk, get the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
Oh, and as for why your awk command awk 'NR<=$start' test didn't work: awk is not shell; it has no more access to shell variables (or vice versa) than a C program does. To initialize an awk variable named awkstart with the value of a shell variable named start and then use that awk variable in your script, you'd do awk -v awkstart="$start" 'NR<=awkstart' test. The awk variable can also be named start or anything else sensible; it is completely unrelated to the name of the shell variable.
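Applied to the loop from the question, only the awk line changes:
if [[ $var = 1 ]]; then
    awk -v awkstart="$start" 'NR<=awkstart' test
fi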
You could improve your code by removing the unnecessary echo, cut and bc calls and doing it like this:
#!/bin/bash
for i in $(seq $(wc -l < test)); do
    (( i % 4 != 1 )) && continue
    tail -n +$i test | head -5 > "file$(( 1 + i/4 ))"
done
But still, the awk solution is much better. Reading the file only once and taking actions based on readily available information (like the line number) is the way to go. In shell you have to count the lines first; there is no way around it. awk gives you that (and a lot of other things) for free.
Use split:
$ seq 20 | split -l 5
$ for fn in x*; do echo "$fn"; cat "$fn"; done
xaa
1
2
3
4
5
xab
6
7
8
9
10
xac
11
12
13
14
15
xad
16
17
18
19
20
Or, if you have a file:
$ split -l 5 test_file

Add column to csv file

I have two files, and I need to take the last column of the first file and append it to the other file.
file1
1 2 3
1 2 3
1 2 3
file2
5 5
5 5
5 5
Initial proposal
#!/usr/bin/env bash
column=$(awk '{print $(NF)}' $file1)
paste -d',' $file2 < $column
Expected result
file2
5 5 3
5 5 3
5 5 3
But this script does not work yet.
Note: I do not know how many columns the files have, so I need a more generic solution.
You can use this paste command:
paste -d " " file2 <(awk '{print $NF}' file1)
5 5 3
5 5 3
5 5 3
To append last column of file1 to file2:
paste -d " " file2 <(rev file1 | cut -d " " -f 1 | rev)
Output:
5 5 3
5 5 3
5 5 3
To paste the second column of file 1 to file 2:
while read line; do
    read -u 3 c1 c2 c3
    echo $line $c2
done < file2 3< file1
You can use Perl too:
$ paste -d ' ' file2.txt <(perl -lne 'print $1 if m/(\S+)\s*$/' file1.txt)
5 5 3
5 5 3
5 5 3
Or grep:
$ paste -d ' ' file2.txt <(grep -Eo '(\S+)\s*$' file1.txt)
5 5 3
5 5 3
5 5 3

How to grep two columns from a single file

cat Error00
4 0 375
4 2001 21
4 2002 20
cat Error01
4 0 465
4 2001 12
4 2002 40
4 2016 1
I want output as below
4 0 375 465
4 2001 21 12
4 2002 20 20
4 2016 - 1
I am using the query below. The problem is that I am not able to make grep handle the two fields together because of the space between them.
Please suggest how I can get around this.
keylist=$(awk '{print $1,$2}' Error0[0-1] | sort | uniq)
for key in ${keylist} ; do
    echo ${key}
    val_a=$(grep "^${key}" Error00 | awk '{print $3}') ; val_a=${val_a:---}
    val_b=$(grep "^${key}" Error01 | awk '{print $1,$2}') ; val_b=${val_b:--- --}
    echo $key ${val_a} >>testreport
done
I am getting the output as below:
4 375 465
0
4 21 12
2001
4 20 20
2002
4 - 1
2016
A single awk one-liner can handle this easily:
awk 'FNR==NR{a[$1,$2]=$3;next}{print $1,$2,(a[$1,$2]?a[$1,$2]:"-"),$3}' err0 err1
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
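Spelled out with comments, the same logic reads like this (a sketch, identical in behavior):
awk '
    FNR == NR {                  # true only while reading the first file (err0)
        a[$1,$2] = $3            # remember column 3, keyed by the first two columns
        next
    }
    {                            # now reading the second file (err1)
        print $1, $2, (a[$1,$2] ? a[$1,$2] : "-"), $3
    }
' err0 err1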
For formatted output you can use printf instead of print, as Jonathan Leffler suggests:
printf "%s %-6s %-6s %s\n",$1,$2,(a[$1,$2]?a[$1,$2]:"-"),$3
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
However, a general solution is to use column -t for nicely formatted table output:
awk '{....}' err0 err1 | column -t
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
grep is not really the right tool for this job. You can either play with awk or Perl (or Python, or …), or you can use join. However, join only joins on a single column at a time, and you appear to need to join on two columns. So, we're going to have to massage the data so that it will work with join. I'm about to assume you're using bash and so have process substitution available. You can do the job without, but it is fiddlier and involves temporary files (and traps to clean them up, etc).
The key to the join will be to replace the blank between the first two columns with a colon (or any other convenient character — control-A would work fine too), then join the files on that combined first column. The inputs must be sorted; in the output, the colon must be replaced with a blank again.
$ join -o 0,1.2,2.2 -a 1 -a 2 -e '-' \
> <(sed 's/ */:/' Error00 | sort) \
> <(sed 's/ */:/' Error01 | sort) |
> sed 's/:/ /'
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
$
The 's/ */:/' operation replaces the first sequence of one or more blanks with a colon; the input data has two blanks between the 4 and the 0 in the first line of Error00. The input to join must be in sorted order of the joining field, here the first field. The output is the join field, the second column of Error00 and the second column of Error01 (remembering that means the second column after the first two have been fused by the colon). If there's an unmatched line in the first file, generate an output line (-a 1); ditto for the second file; and for the missing fields, insert a dash (-e '-'). The final sed removes the colon that was added.
If you want the data formatted, pipe it through awk.
$ join -o 0,1.2,2.2 -a 1 -a 2 -e '-' \
> <(sed 's/ */:/' Error00 | sort) \
> <(sed 's/ */:/' Error01 | sort) |
> sed 's/:/ /' |
> awk '{printf("%s %-6s %-6s %s\n", $1, $2, $3, $4)}'
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
$

Paste side by side multiple files by numerical order

I have many files in a directory with similar file names like file1, file2, file3, file4, file5, ..... , file1000. They are of the same dimension, and each one of them has 5 columns and 2000 lines. I want to paste them all together side by side in a numerical order into one large file, so the final large file should have 5000 columns and 2000 lines.
I tried
for x in $(seq 1 1000); do
paste `echo -n "file$x "` > largefile
done
Instead of writing all file names in the command line, is there a way I can paste those files in a numerical order (file1, file2, file3, file4, file5, ..., file10, file11, ..., file1000)?
for example:
file1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
...
file2
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
....
file3
3 3 3 3 3
3 3 3 3 3
3 3 3 3 3
....
paste file1 file2 file3 .... file1000 > largefile
largefile
1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
....
Thanks.
If your current shell is bash: paste -d " " file{1..1000}
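If the number of files is not fixed in advance, a sketch using GNU sort -V (version sort) gets the same numerical order without renaming anything, assuming the file names contain no whitespace:
paste -d ' ' $(printf '%s\n' file* | sort -V) > largefile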
You need to rename the files with leading zeroes, like:
paste <(ls -1 file* | sort -te -k2.1n) <(seq -f "file%04g" 1000) | xargs -n2 echo mv
The above is a "dry run"; remove the echo once you are satisfied with the result.
Or you can use e.g. perl:
ls file* | perl -nlE 'm/file(\d+)/; rename $_, sprintf("file%04d", $1);'
and afterwards you can simply run
paste file*
With zsh:
setopt extendedglob
paste -d ' ' file<->(n)
<x-y> is to match positive decimal integer numbers from x to y. x and/or y can be omitted so <-> is any positive decimal integer number. It could also be written [0-9]## (## being the zsh equivalent of regex +).
The (n) is a globbing qualifier. The n qualifier turns on numeric sorting, which sorts on all sequences of decimal digits appearing in the file names.
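With the [0-9]## form mentioned above, the equivalent command would be:
paste -d ' ' file[0-9]##(n)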
