One file has 3 columns and looks like this:
1 249250621 225280621
2 243199373 238207373
3 198022430 194797140
4 191154276 187661676
5 180915260 177695260
6 171115067 167395067
7 159138663 155353663
8 146364022 142888922
9 141213431 120143431
10 135534747 131314747
11 135006516 131129516
12 133851895 130481895
13 115169878 95589878
14 107349540 88289540
15 102531392 81694769
16 90354753 78884753
17 81195210 77795210
18 78077248 74657248
19 59128983 55808983
20 63025520 59505520
21 48129895 35108702
22 51304566 34894566
X 155270560 151100560
Y 59373566 25653566
My other file looks like:
5677533
4506000
2272564
2753699
4015846
2163243
3812595
2885199
8064159
3522086
2006115
1490517
1072244
1423429
3009679
2705191
1479591
800436
929876
648000
347993
972862
7812943
9660863
When I try
paste file1.txt file2.txt > file3.txt
I get:
1 567753321 225280621
2 450600073 238207373
3 227256430 194797140
4 275369976 187661676
5 401584660 177695260
6 216324367 167395067
7 381259563 155353663
8 288519922 142888922
9 806415931 120143431
10 352208647 131314747
11 200611516 131129516
12 149051795 130481895
13 107224478 95589878
14 142342940 88289540
15 300967992 81694769
16 27051913 78884753
17 14795910 77795210
18 80043648 74657248
19 92987683 55808983
20 64800020 59505520
21 34799395 35108702
22 97286266 34894566
X 781294360 151100560
Y 96608636 25653566
The data from file2.txt overwrites the 2nd column of file1.txt. I want the data in file2.txt appended as a new, tab-separated column at the end of each line of file1.txt, with the result written to file3.txt. But it does not seem to be working. Thoughts? Thanks.
Edit: For file2, I can create it with the same first-column index (1-22, X, Y). But when I then try join, it doesn't work either. This is my join output:
1 249250621 225280621
5677533
2 243199373 238207373
4506000
3 198022430 194797140
2272564
4 191154276 187661676
2753699
5 180915260 177695260
4015846
6 171115067 167395067
2163243
7 159138663 155353663
3812595
8 146364022 142888922
2885199
9 141213431 120143431
8064159
10 135534747 131314747
3522086
11 135006516 131129516
2006115
12 133851895 130481895
1490517
13 115169878 95589878
1072244
14 107349540 88289540
1423429
15 102531392 81694769
3009679
16 90354753 78884753
2705191
17 81195210 77795210
1479591
18 78077248 74657248
800436
19 59128983 55808983
929876
20 63025520 59505520
648000
21 48129895 35108702
347993
22 51304566 34894566
972862
X 155270560 151100560
7812943
Y 59373566 25653566 9660863
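Try running dos2unix on both file1.txt and file2.txt first. Your output suggests the files have Windows (CRLF) line endings: the carriage return at the end of each file1.txt line sends the cursor back to the start of the line, so whatever paste appends is drawn over the existing text instead of after it.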
==> dos2unix file*.txt
file1.txt: done.
file2.txt: done.
==> paste file1.txt file2.txt > file3.txt
==> cat file3.txt
1 249250621 225280621 5677533
2 243199373 238207373 4506000
3 198022430 194797140 2272564
4 191154276 187661676 2753699
5 180915260 177695260 4015846
6 171115067 167395067 2163243
7 159138663 155353663 3812595
8 146364022 142888922 2885199
9 141213431 120143431 8064159
10 135534747 131314747 3522086
11 135006516 131129516 2006115
12 133851895 130481895 1490517
13 115169878 95589878 1072244
14 107349540 88289540 1423429
15 102531392 81694769 3009679
16 90354753 78884753 2705191
17 81195210 77795210 1479591
18 78077248 74657248 800436
19 59128983 55808983 929876
20 63025520 59505520 648000
21 48129895 35108702 347993
22 51304566 34894566 972862
X 155270560 151100560 7812943
Y 59373566 25653566 9660863
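If dos2unix is not installed, the same cleanup can be sketched with tr. This is only an illustration under assumptions: the two sample files are shortened stand-ins written with CRLF endings on purpose, and the *.unix.txt names are made up for the example.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical shortened sample files with Windows (CRLF) line endings.
printf '1 249250621 225280621\r\n2 243199373 238207373\r\n' > file1.txt
printf '5677533\r\n4506000\r\n' > file2.txt

# Count the lines that contain a carriage return; non-zero means CRLF.
cr=$(printf '\r')
cr_lines=$(grep -c "$cr" file1.txt)
echo "lines with CR in file1.txt: $cr_lines"

# Strip every carriage return (for these files, the same effect as dos2unix).
tr -d '\r' < file1.txt > file1.unix.txt
tr -d '\r' < file2.txt > file2.unix.txt

# paste now appends the new column after a tab, as intended.
paste file1.unix.txt file2.unix.txt > file3.txt
cat file3.txt
```

Running cat -A (GNU) or od -c on the original files is another way to make the stray \r characters visible before deciding to convert.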
So I have a C program that outputs many numbers. I have to check them all. The problem is, each time I run my program, I need to change seeds. In order to do that, I've been doing it manually and was trying to make a shell script to work around this.
I've tried using sed but couldn't manage to do it.
I'm trying to get the output like this:
a=$(./algorithm < input.txt)
b=$(./algorithm2 < input.txt)
c=$(./algorithm3 < input.txt)
The output of each algorithm program is something like this:
12 13 315
1 2 3 4 5 6 7 8 10 2 8 9 1 0 0 2 3 4 5
So the variable a has all this output, and what I need is
variable a to contain this whole string
and variable a1 to contain only the third number, in this case, 315.
Another example:
2 3 712
1 23 15 12 31 23 3 2 5 6 6 1 2 3 5 51 2 3 21
echo $b should give this output:
2 3 712
1 23 15 12 31 23 3 2 5 6 6 1 2 3 5 51 2 3 21
and echo $b1 should give this output:
712
Thanks!
Not exactly what you are asking, but one way to do this would be to store the results of your algorithm in arrays, and then dereference the item of interest. You'd write something like:
a=( $(./algorithm < input.txt) )
b=( $(./algorithm2 < input.txt) )
c=( $(./algorithm3 < input.txt) )
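Notice the extra () that encloses each command substitution. Now a, b and c are arrays, and you can access the item of interest with ${a[0]}, ${a[1]} and so on (the braces are required; a plain $a[1] would expand $a and append the literal text [1]).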
For your particular case, since you want the 3rd element, that would have index = 2, hence:
a1=${a[2]}
b1=${b[2]}
c1=${c[2]}
Since you are using the Bash shell (see your tags), you can use Bash arrays to easily access the individual fields in your output strings. For example like so:
#!/bin/bash
# Your lines to gather the output:
# a=$(./algorithm < input.txt)
# b=$(./algorithm2 < input.txt)
# c=$(./algorithm3 < input.txt)
# Just to use your example output strings:
a="$(printf "12 13 315 \n 1 2 3 4 5 6 7 8 10 2 8 9 1 0 0 2 3 4 5")"
b="$(printf "2 3 712 \n 1 23 15 12 31 23 3 2 5 6 6 1 2 3 5 51 2 3 21")"
# Put the output in arrays.
a_array=($a)
b_array=($b)
# You can access the array elements individually.
# The array index starts from 0.
# (The names a1 and b1 for the third elements were your choice.)
a1="${a_array[2]}"
b1="${b_array[2]}"
# Print output strings.
# (The newlines in $a and $b are lost to word splitting, since the expansions are not quoted.)
echo "Output a:" $a
echo "Output b:" $b
# Print third elements.
echo "3rd from a: $a1"
echo "3rd from b: $b1"
This script outputs
Output a: 12 13 315 1 2 3 4 5 6 7 8 10 2 8 9 1 0 0 2 3 4 5
Output b: 2 3 712 1 23 15 12 31 23 3 2 5 6 6 1 2 3 5 51 2 3 21
3rd from a: 315
3rd from b: 712
Explanation:
The trick here is that array constants (literals) in Bash have the form
(<space_separated_list_of_elements>)
for example
(1 2 3 4 a b c nearly_any_string 99)
Any variable that is assigned such an array automatically becomes an array variable. In the script above, this is what happens in a_array=($a): Bash expands the $a to the <space_separated_list_of_elements> and reads the whole expression again, interpreting it as an array constant.
Individual elements of such arrays can be referenced by using expressions of the form
<array_name>[<idx>]
in place of a variable name. Here, <array_name> is the name of the array and <idx> is an integer that selects the individual element. For arrays created from array constants, the index counts elements consecutively, starting from zero. Therefore, in the script, ${a_array[2]} expands to the third element of the array a_array. If the array had fewer than three elements, a_array[2] would be considered unset.
You can output all elements in the array a_array, the corresponding index array, and the number of elements in the array respectively by
echo "${a_array[@]}"
echo "${!a_array[@]}"
echo "${#a_array[@]}"
These commands can be used to track down the fate of the newline: Given the script above, it is still in $a, as can be seen by (watch the quotes)
echo "$a"
which yields
12 13 315
1 2 3 4 5 6 7 8 10 2 8 9 1 0 0 2 3 4 5
But the newline did not make it into the array a_array. This is because Bash considers it as part of the whitespace that separates the third and the fourth element in the array assignment. The same applies if there are no extra spaces around the newline, like here:
12 13 315\n1 2 3 4 5 6 7 8 10 2 8 9 1 0 0 2 3 4 5
I actually assume that the output of your C program comes in this form.
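The fate of the newline can be checked with a short Bash sketch. The printf call is only a stand-in for the real program output, and the helper names count, third, fourth and lines are made up for this illustration.

```shell
#!/bin/bash
# Stand-in for: a=$(./algorithm < input.txt)
a=$(printf '12 13 315\n1 2 3 4 5 6 7 8 10 2 8 9 1 0 0 2 3 4 5')

# Unquoted on purpose: word splitting treats spaces and newlines alike.
a_array=($a)

count=${#a_array[@]}     # number of elements across both lines
third=${a_array[2]}      # last number of the first line
fourth=${a_array[3]}     # first number of the second line

# The string itself still has two lines; only the array lost the newline.
lines=$(printf '%s\n' "$a" | wc -l | tr -d ' ')

echo "elements: $count, third: $third, fourth: $fourth, lines in \$a: $lines"
```

The third and fourth elements sit on different lines of $a, yet they are neighbours in the array, which is the point made above.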
This will store the full string in a[0] and the individual fields in a[1-N]:
$ tmp=$(printf '12 13 315\n1 2 3 4 5 6 7 8 10 2 8 9 1 0 0 2 3 4 5\n')
$ a=( $(printf '_ %s\n' "$tmp") )
$ a[0]="$tmp"
$ echo "${a[0]}"
12 13 315
1 2 3 4 5 6 7 8 10 2 8 9 1 0 0 2 3 4 5
$ echo "${a[3]}"
315
Obviously replace $(printf '12 13 315\n1 2 3 4 5 6 7 8 10 2 8 9 1 0 0 2 3 4 5\n') with $(./algorithm < input.txt) in your real code.
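If an array is not needed at all, a variant that keeps the whole output in a and pulls the third number out with awk might look like the sketch below. It assumes the number of interest is always the third field of the first output line; printf again stands in for ./algorithm.

```shell
# Stand-in for: a=$(./algorithm < input.txt)
a=$(printf '12 13 315\n1 2 3 4 5 6 7 8 10 2 8 9 1 0 0 2 3 4 5\n')

# Look only at the first line and keep its third field.
a1=$(printf '%s\n' "$a" | awk 'NR == 1 { print $3 }')

echo "$a"     # whole output; the quotes preserve the newline
echo "$a1"    # third number of the first line
```

This keeps the variable names from the question (a and a1) and works in any POSIX shell, since it uses no arrays.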
I have two files.
File A: is a list of indexes.
1
2
3
4
13
14
15
16
19
20
The second file, file B, contains the information I want to extract: the lines whose numbers are not in the list in file A.
File B:
#SRR4293698.1 5 length=35
GCTGGNCTTTGTGCATGCAATCTAGNNTCTTCTT
+SRR4293698.1 5 length=35
AAAAA#FFFFFFFFFFFFFFFFFFF##FFFFFFF
#SRR4293698.5 5 length=35
GCTGGNCTTTGTGCATGCAATCTAGNNTCTTCTT
+SRR4293698.5 5 length=35
AAAAA#FFFFFFFFFFFFFFFFFFF##FFFFFFF
#SRR4293698.8 8 length=36
CTGGCNTCTACAATATCTGGACGAGNTTCCGCATNA
+SRR4293698.8 8 length=36
AAAAA#FFFFFFAFFFFFFF.FF)F#FFFFFFFF#F
#SRR4293698.9 9 length=76
CTTCANATCATTTTCAGACTTTTCANACTGCTTGNT
+SRR4293698.9 9 length=76
AAAAA#FFFFFFFFFFF7FF7FFFF#FFFFFFFF#F
#SRR4293698.10 10 length=76
...
I would expect to extract lines 5-12, 17 and so on.
#SRR4293698.5 5 length=35
GCTGGNCTTTGTGCATGCAATCTAGNNTCTTCTT
+SRR4293698.5 5 length=35
AAAAA#FFFFFFFFFFFFFFFFFFF##FFFFFFF
#SRR4293698.8 8 length=36
CTGGCNTCTACAATATCTGGACGAGNTTCCGCATNA
+SRR4293698.8 8 length=36
AAAAA#FFFFFFAFFFFFFF.FF)F#FFFFFFFF#F
#SRR4293698.10 10 length=76
...
I tried a few things and found that sed -nf <(sed 's/.*/&p/' A) B extracts the lines of B whose numbers are in A. I was thinking of generating an "all indexes minus A" file to get a new list of indexes, but I suspect there is a smarter way to do it.
Thanks in advance!
Short awk approach:
awk 'NR == FNR{ ind[$1]; next }!(FNR in ind)' file_a file_b
The output:
#SRR4293698.5 5 length=35
GCTGGNCTTTGTGCATGCAATCTAGNNTCTTCTT
+SRR4293698.5 5 length=35
AAAAA#FFFFFFFFFFFFFFFFFFF##FFFFFFF
#SRR4293698.8 8 length=36
CTGGCNTCTACAATATCTGGACGAGNTTCCGCATNA
+SRR4293698.8 8 length=36
AAAAA#FFFFFFAFFFFFFF.FF)F#FFFFFFFF#F
#SRR4293698.10 10 length=76
...
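How the one-liner works can be seen on a tiny example. The file contents below are hypothetical stand-ins (three indexes, six lines), not the FASTQ data from the question.

```shell
# Scratch directory with small, made-up input files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 1 2 5 > file_a
printf '%s\n' line1 line2 line3 line4 line5 line6 > file_b

# NR == FNR holds only while awk reads the first file (file_a):
#   store each index as an array key, then skip to the next record.
# While reading file_b, FNR is the line number within that file:
#   the bare condition prints the line when its number was not stored.
kept=$(awk 'NR == FNR { ind[$1]; next } !(FNR in ind)' file_a file_b)
echo "$kept"
```

One caveat of the NR == FNR idiom: if file_a is empty, the condition is also true for every line of file_b, so nothing is printed.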
With GNU sed and bash:
sed -f <(sed 's/.*/&d/' numbers.txt) file.txt
Output:
#SRR4293698.5 5 length=35
GCTGGNCTTTGTGCATGCAATCTAGNNTCTTCTT
+SRR4293698.5 5 length=35
AAAAA#FFFFFFFFFFFFFFFFFFF##FFFFFFF
#SRR4293698.8 8 length=36
CTGGCNTCTACAATATCTGGACGAGNTTCCGCATNA
+SRR4293698.8 8 length=36
AAAAA#FFFFFFAFFFFFFF.FF)F#FFFFFFFF#F
#SRR4293698.10 10 length=76
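The sed version generates a small sed script from the index list and then runs it. A sketch with the intermediate script written to a file (delete.sed is a made-up name, and the inputs are hypothetical) makes both steps visible and avoids the Bash-only process substitution.

```shell
# Scratch directory with small, made-up input files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 1 2 5 > numbers.txt
printf '%s\n' line1 line2 line3 line4 line5 line6 > file.txt

# Step 1: turn every index N into the sed command "Nd" (delete line N).
sed 's/.*/&d/' numbers.txt > delete.sed
cat delete.sed

# Step 2: run the generated script; lines that are not deleted are printed.
kept=$(sed -f delete.sed file.txt)
echo "$kept"
```

The command from the question works the same way, except that it generates "Np" commands and runs sed with -n, so only the listed lines are printed.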
I am trying to apply a simple awk script to a dataset file.
The file has 150 columns; I need only columns 20 to 30.
Below is the script I used to get the records with fields 20 to 30:
BEGIN{}
{
for(f=20;f<=30;f++){
print $f;
}
}
I don't know why each field value is printed on its own line.
That is, for this sample dataset
1 2 3 4 5 6 7
2 2 3 4 5 6 7
3 3 3 4 5 6 7
4 4 4 4 5 6 7
5 5 5 5 5 6 7
6 6 6 6 6 6 7
7 7 7 7 7 7 7
I get output as
1
2
3
4
5
6
7
2
2
3
4
5
6
7
...so on
Below is another way of doing the same
awk -v f=20 -v t=30 '{for(i=f;i<=t;i++) \
printf("%s%s",$i,(i==t)?"\n":OFS)}' file
Notes
f and t are the starting and the ending columns respectively.
We used the ternary operator to control the field separator between the needed columns.
Edit
If you need columns 20 through 30 and the last column, the following would suffice:
awk -v f=20 -v t=30 '{for(i=f;i<=t;i++) \
printf("%s%s",$i,(i==t)?OFS""$NF"\n":OFS)}' file
Solution
BEGIN{FS=" ";}
{
for(f=20;f<=30;f++){
printf("%s ",$f);
}print "";
}
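The reason for the original behaviour is that print ends every call with the output record separator (a newline by default), so one print per field gives one field per line. A variant that builds the line first and prints it once, without a trailing space, could be sketched as follows. The names from and to are made up and are set to 2 and 4 so that the 7-column sample from the question can be used; for the real file they would be 20 and 30.

```shell
# Two rows of the 7-column sample; from/to play the role of 20 and 30.
result=$(printf '1 2 3 4 5 6 7\n2 2 3 4 5 6 7\n' |
  awk -v from=2 -v to=4 '{
    line = $from                        # start with the first wanted field
    for (i = from + 1; i <= to; i++)    # append the others, separated by OFS
      line = line OFS $i
    print line                          # a single newline per input record
  }')
echo "$result"
```

If the fields are separated by exactly one space, cut -d ' ' -f 20-30 file is a shorter alternative.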
I want to extract and combine a certain column from a bunch of text files into a single file as shown.
File1_example.txt
A 123 1
B 234 2
C 345 3
D 456 4
File2_example.txt
A 123 5
B 234 6
C 345 7
D 456 8
File3_example.txt
A 123 9
B 234 10
C 345 11
D 456 12
...
..
.
File100_example.txt
A 123 55
B 234 66
C 345 77
D 456 88
How can I loop through my files of interest and paste these columns together so that the final result is like below without having to type out 1000 unique file names?
1 5 9 ... 55
2 6 10 ... 66
3 7 11 ... 77
4 8 12 ... 88
Try this:
paste File[0-9]*_example.txt | awk '{for(i=3;i<=NF;i+=3){printf("%s ",$i)}printf("\n")}'
Example:
File1_example.txt:
A 123 1
B 234 2
C 345 3
D 456 4
File2_example.txt:
A 123 5
B 234 6
C 345 7
D 456 8
Run command as:
$ paste File[0-9]*_example.txt | awk '{for(i=3;i<=NF;i+=3){printf("%s ",$i)}printf("\n")}'
Output:
1 5
2 6
3 7
4 8
I tested the code below with the first 3 files:
cat File*_example.txt | awk '{a[$1$2]= a[$1$2] $3 " "} END{for(x in a){print a[x]}}' | sort
1 5 9
2 6 10
3 7 11
4 8 12
1) Use an awk array: in a[$1$2]= a[$1$2] $3 " ", the index is column 1 concatenated with column 2, and the array value accumulates every column 3 value seen for that key.
2) END{for(x in a){print a[x]}} traverses array a and prints all values.
3) Use sort to sort the output.
When concatenating the files with cat, you need to ensure that the file order is preserved; one way is to specify the files explicitly:
cat File{1..100}_example.txt | awk '{print $NF}' | pr -4ts' '
This extracts the last column with awk and arranges the values in columns using pr.
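File order matters for all of the approaches above, because a glob such as File*_example.txt sorts File10 before File2. A sketch that builds the file list in numeric order and picks every third field from the paste output is shown below; it recreates the first three example files in a scratch directory, and the variable names files and combined are made up.

```shell
# Recreate the first three example files in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'A 123 1\nB 234 2\nC 345 3\nD 456 4\n'    > File1_example.txt
printf 'A 123 5\nB 234 6\nC 345 7\nD 456 8\n'    > File2_example.txt
printf 'A 123 9\nB 234 10\nC 345 11\nD 456 12\n' > File3_example.txt

# Build the list in numeric order; use $(seq 1 100) for the full data set.
files=""
for n in 1 2 3; do
  files="$files File${n}_example.txt"
done

# paste joins the files side by side; every third field is a value column.
# $files is left unquoted on purpose so that it splits into file names.
combined=$(paste $files | awk '{
  out = $3
  for (i = 6; i <= NF; i += 3) out = out " " $i
  print out
}')
echo "$combined"
```

Looping up to NF instead of testing the field value means a value of 0 in the third column does not end the row early.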
I have a file of integers in which each line may have a different length, like this:
1 2 3 4 5
16 7 8
9 10 101 102 13 14
15 6 17
24 28 31 30 18
I would like to print the number of elements on a line together with the number of lines that have that many elements; the output for this example should be:
3 2
5 2
6 1
The first column is the number of elements per line; the second is the number of lines that have that number of elements.
For example, the first line in the file has 5 elements, and so does the 5th one.
Print the count for the number of fields:
$ awk '{a[NF]++}END{for(k in a)print k,a[k]}' file
5 2
6 1
3 2
Pipe to sort -n for numerically ordered output:
$ awk '{a[NF]++}END{for(k in a)print k,a[k]}' file | sort -n
3 2
5 2
6 1
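Once some line has ten or more fields, the kind of sort starts to matter: a plain sort compares the lines as text and would place 10 before 3. The sketch below uses the sample from the question plus one extra, hypothetical 10-element line to show the numeric variant.

```shell
# Sample from the question plus one made-up line with 10 elements.
counts=$(printf '%s\n' \
    '1 2 3 4 5' \
    '16 7 8' \
    '9 10 101 102 13 14' \
    '15 6 17' \
    '24 28 31 30 18' \
    '1 2 3 4 5 6 7 8 9 10' |
  awk '{ a[NF]++ } END { for (k in a) print k, a[k] }' |
  sort -n)    # -n compares the leading field count as a number
echo "$counts"
```

The for (k in a) loop visits the keys in an unspecified order, which is why the sorting is left to sort rather than to awk.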