extracting lines if the first field matches another list saved in a different file -- shell command - macos

I have two files. One contains a list of items, e.g.,
Allie
Bob
John
Laurie
Another file (file2) contains a different list of items in a different order, but some items might overlap with the items in file 1, e.g.,
Laurie 45 56 6 75
Moxipen 10 45 56 56
Allie 45 56 67 23
I want to intersect these two files and extract only those lines from file 2 whose first field matches an item in file 1.
i.e., my output should be
Allie 45 56 67 23
Laurie 45 56 6 75
(preferably in this order, but it's OK if not)
grep -f file1 file2 doesn't do what I want.
I also need something efficient because the second file is HUGE.
I also tried this:
awk -F, 'FNR==NR {a[$1]=$0; next}; $1 in a {print a[$1]}' file2 file1

If order doesn't matter then
awk 'FNR==NR{ arr[$1]; next }$1 in arr' file1 file2
Explanation
FNR==NR{ arr[$1]; next } runs only while the first file (file1) is being read: it creates an entry in the array arr whose key is the first field, $1, and then skips to the next line.
$1 in arr is evaluated for the second file (file2): it is true when the first field of the current line exists as a key in arr, in which case awk prints the current record/row/line of file2.
Test Results:
akshay@db-3325:/tmp$ cat file1
Allie
Bob
John
Laurie
akshay@db-3325:/tmp$ cat file2
Laurie 45 56 6 75
Moxipen 10 45 56 56
Allie 45 56 67 23
akshay@db-3325:/tmp$ awk 'FNR==NR{ arr[$1]; next }$1 in arr' file1 file2
Laurie 45 56 6 75
Allie 45 56 67 23

No need for complex joins, it is a filtering function
$ grep -wFf file1 file2
Laurie 45 56 6 75
Allie 45 56 67 23
This has the benefit of keeping the order in file2 as well. The -w option requires full-word matches, which eliminates false positives from substring matches. Of course, if your sample input is not representative and your data may contain key-like entries in other fields, this will not work without anchoring the match to the beginning of the line.
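One way to anchor the match to the beginning of the line is to turn each key into a pattern such as "^Allie " first. This is only a sketch: it assumes the fields are separated by a single space, that every line of file2 has at least two fields, and that the keys contain no regular-expression metacharacters. The extra 'Zed Allie 1 2' line is invented to show the false positive being avoided.

```shell
dir=$(mktemp -d)
printf '%s\n' Allie Bob John Laurie > "$dir/file1"
# The last line carries a key-like word in a later field
printf '%s\n' 'Laurie 45 56 6 75' 'Moxipen 10 45 56 56' \
    'Allie 45 56 67 23' 'Zed Allie 1 2' > "$dir/file2"

# Build one anchored pattern per key, e.g. "^Allie "
sed 's/.*/^& /' "$dir/file1" > "$dir/patterns"

anchored=$(grep -f "$dir/patterns" "$dir/file2")
printf '%s\n' "$anchored"
rm -rf "$dir"
```

Because the patterns are regular expressions, -F can no longer be used, so this is likely to be slower than the fixed-string version on very large inputs.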

This is the job that join is built for.
Providing a reproducer testable via copy-and-paste with shell functions (which you could replace with your actual input files):
cat_file1() {
printf '%s\n' Allie Bob John Laurie
}
cat_file2() {
printf '%s\n' 'Laurie 45 56 6 75' \
'Moxipen 10 45 56 56' \
'Allie 45 56 67 23'
}
join <(cat_file1 | sort) <(cat_file2 | sort)
...properly emits:
Allie 45 56 67 23
Laurie 45 56 6 75
Of course, don't cat file1 | sort -- run sort <file1 to provide a real handle for better efficiency, or (better!) store your inputs in sorted form in the first place.
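With real files, a version that avoids process substitution might look like the sketch below. The scratch directory and the .sorted file names are assumptions for the demo. Setting LC_ALL=C for both sort and join keeps the two tools agreeing on the collation order.

```shell
dir=$(mktemp -d)
printf '%s\n' Allie Bob John Laurie > "$dir/file1"
printf '%s\n' 'Laurie 45 56 6 75' 'Moxipen 10 45 56 56' 'Allie 45 56 67 23' > "$dir/file2"

# join expects both inputs sorted on the join field (field 1 by default)
LC_ALL=C sort -k1,1 "$dir/file1" > "$dir/file1.sorted"
LC_ALL=C sort -k1,1 "$dir/file2" > "$dir/file2.sorted"

joined=$(LC_ALL=C join "$dir/file1.sorted" "$dir/file2.sorted")
printf '%s\n' "$joined"
rm -rf "$dir"
```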

Related

Unix sort function that sorts a space delimited txt file according to specific column by ASCII value

I have tried to come up with this answer but everything I try does not work.
My code below is what I have come up with:
sort -k$field_number "$1".db > temp.txt && cp temp.txt "$1.db"
Shouldn't this line of code sort the .db file by ASCII value (the sort function should sort by ASCII by default?). In the code, field_number corresponds to the column I wish to sort the lines of the file by. When I use my code to format the file (where I am sorting by column 2), I get the output below.
Textfile (the .db file) format:
a 5 5 5
Green 72 72 72
Smith 84 72 93
Jones 85 73 94
z 9 9 9
Ford 92 64 93
Miller 93 73 87
bobua che Apple Xor
Maybe your problem is with your collation. Try this please:
LC_COLLATE=C sort -n --ignore-case -k$field_number "$1".db > temp.txt && cp temp.txt "$1.db"
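Two details may also be worth checking; what follows is a sketch under assumptions rather than a confirmed fix. -k N on its own means "from field N to the end of the line", so -k N,N is needed to sort on a single column, and sort -o can write back to the input file without a temporary copy. The file name grades.db and the sample rows are made up for the demo.

```shell
dir=$(mktemp -d)
db="$dir/grades.db"          # hypothetical stand-in for "$1".db
printf '%s\n' 'Miller 93 73 87' 'a 5 5 5' 'Jones 85 73 94' \
    'z 9 9 9' 'Green 72 72 72' > "$db"
field_number=2

# Byte-order (ASCII) sort on exactly one space-delimited field, written back in place
LC_ALL=C sort -t ' ' -k "$field_number,$field_number" -o "$db" "$db"
sorted=$(cat "$db")
printf '%s\n' "$sorted"

# For comparison: the same key compared numerically
numeric=$(LC_ALL=C sort -t ' ' -k "$field_number,${field_number}n" "$db")
printf '%s\n' "$numeric"
rm -rf "$dir"
```

In byte order "9" sorts after "85" because the comparison goes character by character; the numeric variant puts 9 before 72.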

How to add the elements in a for loop [duplicate]

So basically my code looks through the data and grabs whatever each line begins with, and I've been trying to figure out a way to add those values together.
the sample input is
35 45 75 76
34 45 53 55
33 34 32 21
my code:
for id in $(awk '{ print $1 }' < $3); do echo $id; done
I'm printing it right now to see the values, but basically what's output is
35
34
33
I'm trying to add them all together but I can't figure out how; some help would be appreciated.
my desired output would be
103
Lots of ways to do this, a few ideas ...
$ cat numbers.dat
35 45 75 76
34 45 53 55
33 34 32 21
Tweaking OP's current code:
$ sum=0
$ for id in $(awk '{ print $1 }' < numbers.dat); do ((sum+=id)); done
$ echo "${sum}"
102
Eliminating awk:
$ sum=0
$ while read -r id rest_of_line; do sum=$((sum+id)); done < numbers.dat
$ echo "${sum}"
102
Using just awk (looks like Aivean beat me to it):
$ awk '{sum+=$1} END {print sum}' numbers.dat
102
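A small variation on the awk one-liner, sketched here with a hypothetical helper name (sum_first_column): printing sum + 0 instead of sum makes empty input print 0 rather than a blank line, because an awk variable that was never assigned prints as an empty string.

```shell
sum_first_column() {
    # sum + 0 forces a numeric result even when no line was read
    awk '{ sum += $1 } END { print sum + 0 }' "$@"
}

total=$(printf '%s\n' '35 45 75 76' '34 45 53 55' '33 34 32 21' | sum_first_column)
empty=$(printf '' | sum_first_column)

echo "$total"
echo "$empty"
```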
awk '{ sum += $1 } END { print sum }'
Test:
35 45 75 76
34 45 53 55
33 34 32 21
Result:
102
(sum(35, 34, 33) = 102, that's what you want, right?)
Here is the detailed explanation of how this works:
$1 is the first column of the input.
sum is the variable that holds the sum of all the values in the first column.
END { print sum } is the action to be performed after all the input has been processed.
So the awk program is basically summing up the first column of the input and printing the result.
This answer was partially generated by Davinci Codex model, supervised and verified by me.

How to exclude lines in a file based on a range of values taken from a second file

I have a file with a list of value ranges:
2 4
6 9
13 14
and a second file that looks like this:
HiC_scaffold_1 1 26
HiC_scaffold_1 2 27
HiC_scaffold_1 3 27
HiC_scaffold_1 4 31
HiC_scaffold_1 5 34
HiC_scaffold_1 6 35
HiC_scaffold_1 7 37
HiC_scaffold_1 8 37
HiC_scaffold_1 9 38
HiC_scaffold_1 10 39
HiC_scaffold_1 11 39
HiC_scaffold_1 12 39
HiC_scaffold_1 13 39
HiC_scaffold_1 14 39
HiC_scaffold_1 15 42
and I would like to exclude rows from file 2 where the value of column 2 falls within a range defined by file 1. The ideal output would be:
HiC_scaffold_1 1 26
HiC_scaffold_1 5 34
HiC_scaffold_1 10 39
HiC_scaffold_1 11 39
HiC_scaffold_1 12 39
HiC_scaffold_1 15 42
I know how to extract a single range with awk:
awk '$2 == "2", $2 == "4"' file2.txt
but my file 1 has many many range values (lines) and I need to exclude rather than extract the rows that correspond to these values.
This is one way:
$ awk '
NR==FNR { # first file
min[NR]=$1 # store mins and maxes in pairs
max[NR]=$2
next
}
{ # second file
for(i in min)
if($2>=min[i]&&$2<=max[i])
next
}1' ranges data
Output:
HiC_scaffold_1 1 26
HiC_scaffold_1 5 34
HiC_scaffold_1 10 39
HiC_scaffold_1 11 39
HiC_scaffold_1 12 39
HiC_scaffold_1 15 42
If the ranges are not huge and integer valued but the data is huge, you could make an exclude map of the values to speed up comparing:
$ awk '
NR==FNR { # ranges file
for(i=$1;i<=$2;ex[i++]); # each value in the range goes to exclude hash
next
}
!($2 in ex)' ranges data # print if not found in ex hash
If your ranges aren't huge:
$ cat tst.awk
NR==FNR {
for (i=$1; i<=$2; i++) {
bad[i]
}
next
}
!($2 in bad)
$ awk -f tst.awk file1 file2
HiC_scaffold_1 1 26
HiC_scaffold_1 5 34
HiC_scaffold_1 10 39
HiC_scaffold_1 11 39
HiC_scaffold_1 12 39
HiC_scaffold_1 15 42
sedception
If the second column of file2.txt always equals the index of its line, you can use sed to prune the lines. If this is not your case, please refer to the awkception paragraph.
sed $(sed 's/^\([0-9]*\)[[:space:]]*\([0-9]*\)/-e \1,\2d/' file1.txt) file2.txt
Where file1.txt contains your ranges and file2.txt is the data itself.
Basically it constructs a sed call that chains a list of -e i,jd expressions, meaning that it will delete lines between the ith line and the jth line.
In your example sed 's/^\([0-9]*\)[[:space:]]*\([0-9]*\)/-e \1,\2d/' file1.txt would output -e 2,4d -e 6,9d -e 13,14d which is the list of expressions for calling sed on file2.txt.
In the end it will call:
sed -e 2,4d -e 6,9d -e 13,14d file2.txt
This command deletes all lines between the 2nd and the 4th, and all lines between the 6th and the 9th, and all lines between the 13th and the 14th.
Obviously it does not work if the second column of file2.txt does not match the index of its own line.
awkception
awk "{$(awk '{printf "if ($2>=%d && $2<=%d) next\n", $1, $2}' file1.txt)}1" file2.txt
This solution works even if the second column does not match the index of its line.
The method uses awk to create an awk program, just like sed created sed expressions in the sedception solution.
In the end this will call :
awk '{
if ($2>=2 && $2<=4) next
if ($2>=6 && $2<=9) next
if ($2>=13 && $2<=14) next
}1' file2.txt
It should be noted that this solution is significantly slower than sed.

Count duplicated couple of lines

I have a configuration file with this format:
cod 11
loc1 23
pto1 33
loc2 55
pto2 66
cod 12
loc1 55
pto1 66
loc2 88
pto2 77
...
I want to count how many times a pair of numbers appears in a loc/pto sequence (independently of the loc/pto number). In the example, the pair 55/66 appears 2 times (once as loc1/pto1 and once as loc2/pto2).
I have googled around and tried some combinations of grep, uniq and awk, but I only managed to count duplicated single lines or numbers. I read the man pages of those commands without finding any clue relevant to my problem.
You could use the following:
$ sort file | uniq -f1 -dc
2 loc1 55
2 pto1 66
-f1 is skipping the 1st field when comparing lines
-dc is printing duplicate line with its associated count
Despite no visible effort on the part of the OP, this was an interesting question to work out.
awk '{for (i=1 ; i < 10 ; i++) if (NR == i) array[i]=$2} END {for (i=1 ; i < 10 ; i++) print array[i] "," array[i+1]}' file | sort | uniq -c
Output-
1 11,23
1 12,55
1 23,33
1 33,55
2 55,66
1 66,12
1 66,88
1 88,
The output tells you that 55 is followed by 66 twice. Other pairs only occur once.
Explanation-
I define an array in awk whose elements are the ith number in the second column. The part after the END concatenates the ith and i+1th element. Then there is a sort | uniq -c to see if these pairs occur more than once.
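To count only genuine loc/pto pairs (rather than every pair of adjacent lines), a sketch along the following lines may help. It assumes each loc line is followed at some point by its matching pto line; the variable names and the scratch file are made up for the demo.

```shell
dir=$(mktemp -d)
printf '%s\n' 'cod 11' 'loc1 23' 'pto1 33' 'loc2 55' 'pto2 66' \
    'cod 12' 'loc1 55' 'pto1 66' 'loc2 88' 'pto2 77' > "$dir/config"

pairs=$(awk '
    $1 ~ /^loc/ { loc = $2; have_loc = 1; next }   # remember the loc value
    $1 ~ /^pto/ && have_loc {                       # pair it with the next pto value
        count[loc "/" $2]++
        have_loc = 0
    }
    END {
        for (p in count)
            if (count[p] > 1) print count[p], p     # report repeated pairs only
    }
' "$dir/config" | sort)

printf '%s\n' "$pairs"
rm -rf "$dir"
```

The trailing sort only makes the output order predictable when several pairs repeat, since awk does not guarantee the iteration order of for (p in count).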
If you want to know how many times a duplicate number appeared in the file:
awk '{print $2}' <filename> | sort | uniq -dc
Output:
2 55
2 66
If you want to know how many times a number appeared in the file regardless of being duplicate or not:
awk '{print $2}' <filename> | sort | uniq -c
Output:
1 11
1 12
1 23
1 33
2 55
2 66
1 77
1 88
If you want to print the full line on duplicate match based on second column:
awk '{print $2}' <filename> | sort | uniq -d | grep -F -f - <filename>
Output:
loc2 55
pto2 66
loc1 55
pto1 66

Removing multiple block of lines of a text file in bash

Assume a text file with 40 lines of data. How can I remove lines 3 to 10, 13 to 20, 23 to 30, 33 to 40, in place using bash script?
I already know how to remove lines 3 to 10 with sed, but I wonder if there is a way to do all the removing, in place, with only one command line. I can use for loop but the problem is that with each iteration of loop the lines number will be changed and it needs some additional calculation of line numbers to be removed.
Here is an awk one-liner that works whether your file has 40 lines or 40k lines:
awk 'NR~/[12]$/' file
for example, with 50 lines:
kent$ seq 50|awk 'NR~/[12]$/'
1
2
11
12
21
22
31
32
41
42
sed -i '3,10d;13,20d;23,30d;33,40d' file
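Note that -i is not portable: GNU sed accepts it with no argument, while BSD/macOS sed expects a backup suffix (sed -i '' ...). A sketch that sidesteps the difference by writing to a temporary file, using a made-up data.txt filled by seq:

```shell
dir=$(mktemp -d)
file="$dir/data.txt"
seq 40 > "$file"             # 40 demo lines: 1..40

# Line numbers refer to the original input, so the ranges need no adjustment
sed '3,10d;13,20d;23,30d;33,40d' "$file" > "$file.tmp" && mv "$file.tmp" "$file"

remaining=$(cat "$file")
printf '%s\n' "$remaining"
rm -rf "$dir"
```

All the ranges are applied in one pass over the original file, which is why the line numbers do not shift between deletions.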
This might work for you (GNU sed):
sed '3~10,+7d' file
Starting at line 3 and then at every 10th line after it (3, 13, 23, ...), this deletes that line and the 7 lines that follow it.
If the file was longer than 40 lines and you were only interested in the first 40 lines:
sed '41,$b;3~10,+7d' file
The first instruction tells sed to ignore lines 41 to end-of-file.
Could also be written:
sed '1,40{3~10,+7d}' file
@Kent's answer is the way to go for this particular case, but in general:
$ seq 50 | awk '{idx=(NR%10)} idx>=1 && idx<=2'
1
2
11
12
21
22
31
32
41
42
The above will work even if you want to select the 4th through 7th lines out of every 13, for example:
$ seq 50 | awk '{idx=(NR%13)} idx>=4 && idx<=7'
4
5
6
7
17
18
19
20
30
31
32
33
43
44
45
46
It's not constrained to N out of 10.
Or to select just the 3rd, 5th and 6th lines out of every 13:
$ seq 50 | awk 'BEGIN{split("3 5 6",tmp); for (i in tmp) tgt[tmp[i]]=1} tgt[NR%13]'
3
5
6
16
18
19
29
31
32
42
44
45
The point is - selecting ranges of lines is a job for awk, definitely not sed.
awk '{m=NR%10} !(m==0 || m>=3)' file > tmp && mv tmp file
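The same idea can be parameterised. The helper below is a hypothetical sketch (keep_leading, period and keep are made-up names) that keeps the first keep lines of every block of period lines:

```shell
keep_leading() {
    # usage: keep_leading PERIOD KEEP [FILE...]
    period=$1
    keep=$2
    shift 2
    # (NR - 1) % period is the 0-based position of the line inside its block
    awk -v period="$period" -v keep="$keep" '(NR - 1) % period < keep' "$@"
}

kept=$(seq 40 | keep_leading 10 2)
printf '%s\n' "$kept"
```

To edit a file in place, redirect the output to a temporary file and mv it over the original, as in the command above.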
