Selecting a random read from each pair of reads in fastq files - bash

I have a question about randomly selecting a read from sampled paired-end fastq files. I have read some topics on this matter, but none could solve my problem, which is:
I have two fastq files, R1.fastq and R2.fastq. What I want to achieve is to randomly sample those files and, from each sampled pair of reads, randomly select only one read.
What I have done so far:
I sampled my files using seqtk:
seqtk sample -s100 R1.fastq 10000 > R1_sample.fastq
seqtk sample -s100 R2.fastq 10000 > R2_sample.fastq
Then I sorted each file by sequence ID like this:
paste - - - - < R1_sample.fastq | sort -k1 -t " " | tr "\t" "\n" > R1_sample_sorted.fastq
I did the same with R2_sample.fastq. Then I merged both sorted files so that R1 would be in one column and R2 in the second column:
pr -mts R1_sample_sorted.fastq R2_sample_sorted.fastq > merged.fastq
The merged file looks like this:
@D3YGT8Q1:297:C7T4RACXX:3:1101:1000 @D3YGT8Q1:297:C7T4RACXX:3:1101:1000
TGATGTTTGGATGTAAAGTGAAATATTAGTTGGCG AGCTTTCCTCACTATCTGCTTCATCCGCCAACTAA
+ +
BBBFFFFFFFFFFFIFFIFFIIIIFIIIFIIFIII B0<FFFFFFFFFFIIIIIIIIIIIIIIIIIIIIII
@D3YGT8Q1:297:C7T4RACXX:3:1101:1000 @D3YGT8Q1:297:C7T4RACXX:3:1101:1000
CCTCCTAGGCGACCCAGACAATTATACCCTAGCCA TGTTTAAGGGGTTGGCTAGGGTATAATTGTCTGGG
+ +
BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIII BBBFFFFFFFFFFIIIIIIIIBFFIIIIIIIIIII
@D3YGT8Q1:297:C7T4RACXX:3:1101:1000 @D3YGT8Q1:297:C7T4RACXX:3:1101:1000
TTCTATTTATTACCTCAGAAGTTTTTTTCTTCGCA GTAAAAGGCTCAGAAAAATCCTGCGAAGAAAAAAA
+ +
BBBFFFFFFFFFFIIIIIIIIFIIFIIIFIIIIII BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIII
And now I want to randomly select only one read from each pair. My initial idea was to use shuf to get a random number from range 1-2:
shuf -i1-2 -n1
and then somehow select the read corresponding to the number I got from shuf. For example, in the first iteration I got 1, so I pick the read from column 1; in the second iteration I got 2, so from the next pair of reads I pick the read in the second column, and so on.
I got stuck here. So my question is: is there a neat way to do this, maybe with awk or some other method? Any help would be very much appreciated.
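To make the intent concrete, this is the kind of awk one-liner I imagine (untested; it assumes merged.fastq is tab-separated, as pr -mts should produce, and single_reads.fastq is just a name I picked for the output):
awk -F'\t' '
BEGIN { srand() }                             # seed the random number generator
NR % 4 == 1 { col = (rand() < 0.5) ? 1 : 2 }  # at each header line, pick column 1 or 2 for this record
{ print $col }                                # print the chosen column for all 4 lines of the record
' merged.fastq > single_reads.fastq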
Comment on Ashafix's answer:
Thanks for your response, and sorry for the huge delay!
I've tested your solutions and they both seem to have flaws.
For the first script I constructed test fastq files R1 and R2, each containing 6 reads. After running the script I expect it to output 6 reads as well (24 lines) in the correct order (ID, seq, desc, qual), but as a set of reads randomly selected from either R1 or R2. What I got from the script is:
@D3YGT8Q1:297:C7T4RACXX:3:1101:10002:27381 2:N:0:ATGCTCGTTCTCTCGT
AGCTTTCCTCACTATCTGCTTCATCCGCCAACTAATATTTCACTTTACATCCAAACATCAAGATC
+
B0<FFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIFIFIIIIIIIIII
@D3YGT8Q1:297:C7T4RACXX:3:1101:10004:50631 2:N:0:ATGCTCGTTCTCTCGT
@D3YGT8Q1:297:C7T4RACXX:3:1101:10007:32152 1:N:0:ATGCTCGTTCTCTCGT
GTAAGGTTAGGAGGGTGTTAATTATTAAAATTAAGGCGAAGTTTATTACTCTTTTTTGAATGTTG
+
BBBFFFFFFFFFFIIBFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFFFF
You can see that the output is not correct. The second read is missing three lines, and there should be six reads in total, not three. In addition, each time I run the script it outputs a different number of reads.
For the second script I input a merged fastq file as described above. The output looks similar to the first script's output:
@D3YGT8Q1:297:C7T4RACXX:3:1101:10002:27381 2:N:0:ATGCTCGTTCTCTCGT
AGCTTTCCTCACTATCTGCTTCATCCGCCAACTAATATTTCACTTTACATCCAAACATCAAGATC
+
B0<FFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIFIFIIIIIIIIII
@D3YGT8Q1:297:C7T4RACXX:3:1101:10004:50631 2:N:0:ATGCTCGTTCTCTCGT
@D3YGT8Q1:297:C7T4RACXX:3:1101:10004:50631 2:N:0:ATGCTCGTTCTCTCGT
TGTTTAAGGGGTTGGCTAGGGTATAATTGTCTGGGTCGCCTAGGAGGAGATCGGAAGAGCGTCGT
+
BBBFFFFFFFFFFIIIIIIIIBFFIIIIIIIIIIIFFFIIIIIIFIIIIIFIIIFFFFFFFFFFF
@D3YGT8Q1:297:C7T4RACXX:3:1101:10004:88140 1:N:0:ATGCTCGTTCTCTCGT
ACTGTAACTTAAAAATGATCAAATTATGTTTCCCATGCATCAGGTGCAATGAGAAGCTCTTCATC
+
BBBFFFFFFFFFFIIIIIIIIIIFIIIIIIFIIIIIIIIIIIIIFIIIIIIIIIIIIIIIIIIII
@D3YGT8Q1:297:C7T4RACXX:3:1101:10007:32152 2:N:0:ATGCTCGTTCTCTCGT
CTAGTTTTGACAACATTCAAAAAAGAGTAATAAACTTCGCCTTAATTTTAATAATTAACACCCTC
+
BBBFFFFFFFFFFIIIIIIIIIIIIIIFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIII
but this time I always get five reads, so one is still missing. Also, the second and third read headers are the same, which should not happen.

You could try the following script (it also works as a one-liner). First it gets all the headers from your first fastq file; then, for each header, it randomly picks one of the two fastq files and returns 4 lines from it.
Please note: This only works if both files have identical headers at identical positions.
#!/bin/bash
headers=$(grep @ R1_sample.fastq)
var=1
for line in $headers ; do
r=$(shuf -i1-2 -n1)
tail -n +$var "R$r"_sample.fastq | grep -m 1 -A 4 $line
var=$((var+4))
done
Alternatively, you could expand your merge-and-pick-a-column approach. Here cut is used to select a random column from the merged output.
#!/bin/bash
headers=$(grep @ merged.fastq)
var=1
for line in $headers ; do
r=$(shuf -i1-2 -n1)
tail -n +$var merged.fastq | grep -m 1 -A 4 $line | cut -d$'\t' -f$r
var=$((var+4))
done

Related

How to average the values of different files and save them in a new file

I have about 140 files with data which I would like to process with a script.
The files have two types of names:
sys-time-4-16-80-15-1-1.txt
known-ratio-4-16-80-15-1-1.txt
where the last two numbers vary. The penultimate number takes the values 1, 50, 100, 150, ..., 300, and the last number ranges over 1, 2, 3, ..., 10. A sample of these files is in this link.
I would like to write a new file with 3 columns as follows:
A 1st column with the penultimate number of the file, i.e., 1, 50, 100, ...
A 2nd column with the mean value of the second column in each sys-time-.. file.
A 3rd column with the mean value of the second column in each known-ratio-.. file.
The result might have a row for each pair of averaged 2nd columns of sys and known files:
1 mean-sys-1 mean-know-1
1 mean-sys-2 mean-know-2
.
.
1 mean-sys-10 mean-know-10
50 mean-sys-1 mean-know-1
50 mean-sys-2 mean-know-2
.
.
50 mean-sys-10 mean-know-10
100 mean-sys-1 mean-know-1
100 mean-sys-2 mean-know-2
.
.
100 mean-sys-10 mean-know-10
....
....
300 mean-sys-10 mean-know-10
where each row corresponds to the sys and known files with the same last two numbers.
Besides, I would like to put the penultimate number of the files in the first column.
I know how to compute the mean value of the second column of a file with awk:
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n; }' sys-time-4-16-80-15-1-5.txt
but I do not know how to iterate on all the files and build a result file with the three columns as above.
Here's a shell script that uses GNU datamash to compute the averages (though you can easily swap in awk if desired; I prefer datamash for calculating stats):
#!/bin/sh
nums=$(mktemp)
sysmeans=$(mktemp)
knownmeans=$(mktemp)
for systime in sys-time-*.txt
do
knownratio=$(echo -n "$systime" | sed -e 's/sys-time/known-ratio/')
echo "$systime" | sed -E 's/.*-([0-9]+)-[0-9]+\.txt/\1/' >> "$nums"
datamash -W mean 2 < "$systime" >> "$sysmeans"
datamash -W mean 2 < "$knownratio" >> "$knownmeans"
done
paste "$nums" "$sysmeans" "$knownmeans"
rm -f "$nums" "$sysmeans" "$knownmeans"
It creates three temporary files, one per column; after populating them with the data from each pair of files (one pair per line of each), it uses paste to combine them and print the result to standard output.
I've used GNU Awk for easy, per-file operations. This is untested; please let me know how it runs. You might want to look into printf() for pretty-printed output.
mapfile -t Files < <(find . -type f -name "*-4-16-80-15-*" |sort -t\- -k7,7 -k8,8) #1
gawk '
BEGINFILE {n=split(FILENAME, f, "-"); type=(f[1] ~ /sys/ ? "sys" : "known"); a[type]=0; c=0} #2
{a[type] = ($2 + a[type] * c++) / c} #3
ENDFILE {if (type=="sys") print f[n], a["sys"], a["known"]} #4
' "${Files[#]}"
Create a Bash array with matching files sorted by the last two "keys". We will feed this array to Awk later. Notice how we alternate between "sys" and "known" files in this sample:
./known-ratio-4-16-80-15-2-150
./sys-time-4-16-80-15-2-150
./known-ratio-4-16-80-15-3-1
./sys-time-4-16-80-15-3-1
./known-ratio-4-16-80-15-3-50
./sys-time-4-16-80-15-3-50
At the beginning of every file, clear any existing average value and save the type as either "sys" or "known".
On every line, calculate the Cumulative Moving Average.
At the end of every file, check the file type. If we just handled a "sys" file, print the last part of the filename followed by our averages.
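To see what that per-line update does, here is a tiny illustration with made-up values 2, 4, 6 (the running average goes 2, 3, 4):
$ printf '2\n4\n6\n' | awk '{a = ($1 + a * c++) / c; print a}'
2
3
4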

How to compare multiple extension-less files in Bash

I'm new to bash shell scripting.
How can I compare the outputs of 8 extension-less files (containing only binary values, 0 or 1, all of the same length)?
To clarify things, this is what I've done so far.
for d in */; do
find . -name base -execdir sh -c 'cat {} >> out' \;
done
I've found all the base files located in sub-folders and concatenated each into an out file.
Now I have 8 out files (one per parent folder) that I need to compare.
I've tried both "diff" and "cmp" - but they both work only with 2 files.
At the end, I need to check whether there is a difference between these 8 binary files and eventually export the results represented in HEX format. For example: if one of the out files is all '1's it becomes F, and if all '0's it becomes 0; hence the final result could be, for example, FFFF 0000 (the first 4 files are all '1', the last 4 files are all '0').
What is the best option to do so? I hope I've managed to clarify my case.
Thanks a lot for the help.
Let me assume:
We have 8 (presumably binary) files, say: dir1/out.txt, dir2/out.txt, ..., dir8/out.txt.
We want to compare these files and identify which are identical and which are not.
Then how about these steps:
Generate hash values of the files with e.g. sha256sum.
Compare the hash values and divide the files into groups based on them.
I have created 8 test files; of those, dir1/out.txt, dir2/out.txt and dir4/out.txt are identical, dir3/out.txt and dir7/out.txt are identical, and the others differ.
Then the hash values will look like:
sha256sum dir*/out.txt
298497ad818c3d927498537ed5ab4f9ae663747b6d00ec9a5d0bd9e30a6b714b dir1/out.txt
298497ad818c3d927498537ed5ab4f9ae663747b6d00ec9a5d0bd9e30a6b714b dir2/out.txt
e962879ef251f2117460cf0d5ce714e36a9ab79f2548c48e2121b4e573cf179b dir3/out.txt
298497ad818c3d927498537ed5ab4f9ae663747b6d00ec9a5d0bd9e30a6b714b dir4/out.txt
f45151f5253c62de69c95935f083b5649876fdb661412d4f32065a7b018bf68b dir5/out.txt
bdc26931acfb734b142a8d675f205becf27560dc461f501822de13274fe6fc8a dir6/out.txt
e962879ef251f2117460cf0d5ce714e36a9ab79f2548c48e2121b4e573cf179b dir7/out.txt
11a77c3d96c06974b53d7f40a577e6813739eb5c811b2a86f59038ea90add772 dir8/out.txt
To summarize the result, let me replace the hash values with a group id, assigning the same number to identical files in order of occurrence.
Here's the script:
sha256sum dir*/out.txt | awk '{if (!gid[$1]) gid[$1] = ++n; print $2 " " gid[$1]}'
The output:
dir1/out.txt 1
dir2/out.txt 1
dir3/out.txt 2
dir4/out.txt 1
dir5/out.txt 3
dir6/out.txt 4
dir7/out.txt 2
dir8/out.txt 5
where the second field shows the group id to indicate which files are identical.
Note that the group id does not represent the content of each file (as in: if one of the out.txt files is all '1's it maps to F, and if all '0's to 0), because I have no idea what the files look like. If the OP can provide example files, I could be of more help.
BTW, I'm still in doubt whether the files are binary in the ordinary sense, because the OP mentions that "it's simply a file that contains 0 or 1 in its value when I open it". It sounds to me like the files are composed of ASCII "0"s and "1"s. My script above should work for both binary and text files anyway.
[Update]
According to the OP's information, here's a solution for the specific case:
#!/bin/bash
for f in dir*/out.txt; do
if [[ $(uniq "$f" | wc -l) = 1 ]]; then
echo -n "$(head -1 "$f" | tr 1 F)"
else
echo -n "-"
fi
done
echo
It digests the contents of each file to one of: 0 for all 0's, F for all 1's, or - for a mixture (a possible error case).
For instance, if dir{1..4}/out.txt are all 0's, dir5/out.txt is a mixture, and dir{6..8}/out.txt are all 1's, then the output will look like:
0000-FFF
I hope it will meet the OP's requirements.
If you are looking for records that are unique in your list of files:
cat $path/$files | uniq -u > /tmp/output.txt
grep -f /tmp/output.txt $path/$files

Use Bash scripting to select columns and rows with specific name

I'm working with a very large text file (4GB) and I want to make a smaller file with only the data I need in it. It is a tab-delimited file and there are row and column headers. I basically want to select a subset of the data that has a given column and/or row name.
colname_1 colname_2 colname_3 colname_4
row_1 1 2 3 5
row_2 4 6 9 1
row_3 2 3 4 2
I'm planning to have a file with a list of the columns I want.
colname_1 colname_3
I'm a newbie to bash scripting and I really don't know how to do this. I saw other examples, but they all knew what column number they wanted in advance, and I don't. Sorry if this is a repeat question; I tried to search.
I would want the result to be
colname_1 colname_3
row_1 1 3
row_2 4 9
row_3 2 4
Bash works best as "glue" between standard command-line utilities. You can write loops which read each line in a massive file, but it's painfully slow because bash is not optimized for speed. So let's see how to use a few standard utilities -- grep, tr, cut and paste -- to achieve this goal.
For simplicity, let's put the desired column headings into a file, one per line. (You can always convert a tab-separated line of column headings to this format; we're going to do just that with the data file's column headings. But one thing at a time.)
$ printf '%s\n' colname_{1,3} > columns
$ cat columns
colname_1
colname_3
An important feature of the printf command-line utility is that it repeats its format until it runs out of arguments.
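For example, a single %s conversion applied to three arguments gives three lines:
$ printf '%s\n' one two three
one
two
three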
Now, we want to know which column in the data file each of these column headings corresponds to. We could try to write this as a loop in awk or even in bash, but if we convert the header line of the data file into a file with one header per line, we can use grep to tell us, by using the -n option (which prefixes the output with the line number of the match).
Since the column headers are tab-separated, we can turn them into separate lines just by converting tabs to newlines using tr:
$ head -n1 giga.dat | tr '\t' '\n'

colname_1
colname_2
colname_3
colname_4
Note the blank line at the beginning. That's important, because colname_1 actually corresponds to column 2, since the row headers are in column 1.
So let's look up the column names. Here, we will use several grep options:
-F The pattern argument consists of several patterns, one per line, which are interpreted as ordinary strings instead of regexes.
-x The pattern must match the complete line.
-n The output should be prefixed by the line number of the match.
If we have GNU grep, we could also use -f columns to read the patterns from the file named columns. Or if we're using bash, we could use the bashism "$(<columns)" to insert the contents of the file as a single argument to grep. But for now, we'll stay POSIX-compliant:
$ head -n1 giga.dat | tr '\t' '\n' | grep -Fxn "$(cat columns)"
2:colname_1
4:colname_3
OK, that's pretty close. We just need to get rid of everything other than the line number; comma-separate the numbers, and put a 1 at the beginning.
$ { echo 1
> grep -Fxn "$(<columns)" < <(head -n1 giga.dat | tr '\t' '\n')
> } | cut -f1 -d: | paste -sd,
1,2,4
cut -f1 Select field 1. The argument could be a comma-separated list, as in cut -f1,2,4.
cut -d: Use : instead of tab as a field separator ("delimiter")
paste -s Concatenate the lines of a single file instead of corresponding lines of several files
paste -d, Use a comma instead of tab as a field separator.
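For example, paste -s with a comma delimiter, taken on its own:
$ printf '1\n2\n4\n' | paste -sd, -
1,2,4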
So now we have the argument we need to pass to cut in order to select the desired columns:
$ cut -f"$({ echo 1
> head -n1 giga.dat | tr '\t' '\n' | grep -Fxn -f columns
> } | cut -f1 -d: | paste -sd,)" giga.dat
colname_1 colname_3
row_1 1 3
row_2 4 9
row_3 2 4
You can actually do this by keeping track of the array indexes for the columns that match the column names in your column list file. After you have found the array indexes in the data file for the column names in your column list file, you simply read your data file (beginning at the second line) and output the row label plus the data for the columns at the array indexes you determined.
There are probably several ways to approach this and the following assumes the data in each column does not contain any whitespace. The use of arrays presumes bash (or other advanced shell supporting arrays) and not POSIX shell.
The script takes two file names as input. The first is your original data file. The second is your column list file. An approach could be:
#!/bin/bash
declare -a cols ## array holding original columns from original data file
declare -a csel ## array holding columns to select (from file 2)
declare -a cpos ## array holding array indexes of matching columns
cols=( $(head -n 1 "$1") ) ## fill cols from 1st line of data file
csel=( $(< "$2") ) ## read select columns from file 2
## fill column position array
for ((i = 0; i < ${#csel[@]}; i++)); do
for ((j = 0; j < ${#cols[@]}; j++)); do
[ "${csel[i]}" = "${cols[j]}" ] && cpos+=( $j )
done
done
printf " "
for ((i = 0; i < ${#csel[@]}; i++)); do ## output header row
printf " %s" "${csel[i]}"
done
printf "\n" ## output newline
unset cols ## unset cols to reuse in reading lines below
while read -r line; do ## read each data line in data file
cols=( $line ) ## separate into cols array
printf "%s" "${cols[0]}" ## output row label
for ((j = 0; j < ${#cpos[@]}; j++)); do
[ "$j" -eq "0" ] && { ## handle format for first column
printf "%5s" "${cols[$((${cpos[j]}+1))]}"
continue
} ## output remaining columns
printf "%13s" "${cols[$((${cpos[j]}+1))]}"
done
printf "\n"
done < <( tail -n+2 "$1" )
Using your example data as follows:
Data File
$ cat dat/col+data.txt
colname_1 colname_2 colname_3 colname_4
row_1 1 2 3 5
row_2 4 6 9 1
row_3 2 3 4 2
Column Select File
$ cat dat/col.txt
colname_1 colname_3
Example Use/Output
$ bash colnum.sh dat/col+data.txt dat/col.txt
colname_1 colname_3
row_1 1 3
row_2 4 9
row_3 2 4
Give it a try and let me know if you have any questions. Note, bash isn't known for its blinding speed handling large files, but as long as the column list isn't horrendously long, the script should be reasonably fast.

Getting specific lines of a file

I have a file with 25 million rows, and I want to get a specific 10 million lines from it.
I have the indices of these lines in another file. How can I do it efficiently?
Assuming that the list of lines is in a file list-of-lines and the data is in data-file, and that the numbers in list-of-lines are in ascending order, then you could write:
current=0
while read wanted
do
while ((current < wanted))
do
if read -u 3 line
then ((current++))
else break 2
fi
done
echo "$line"
done < list-of-lines 3< data-file
This uses the Bash extension that allows you to specify which file descriptor read should read from (read -u 3 to read from file descriptor 3). The list of line numbers to be printed is read from standard input; the data file is read from file descriptor 3. This makes one pass through each of the two files, which is within a constant factor of optimal.
If the list-of-lines is not sorted, replace the last line with the following, which uses the Bash extension called process substitution:
done < <(sort -n list-of-lines) 3< data-file
Assume that the file containing line indices is called "no.txt" and the data file is "input.txt".
awk '{printf "%08d\n", $1}' no.txt > no.1.txt
nl -n rz -w 8 input.txt | join - no.1.txt | cut -d " " -f1 --complement > output.txt
output.txt will have the wanted lines. I am not sure if this is efficient enough; it seems to be faster than this script (https://stackoverflow.com/a/22926494/3264368) in my environment, though.
Some explanations:
The 1st command preprocesses the indices file so that the numbers are right-justified with leading zeroes and width 8 (since the number of rows in input.txt is known to be 25M).
The 2nd command prints the rows with line numbers in exactly the same format as the preprocessed index file, then joins them to get the wanted rows (cut removes the line numbers).
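For instance, with hypothetical indices 5 and 12, the zero-padding from the 1st command looks like this:
$ printf '5\n12\n' | awk '{printf "%08d\n", $1}'
00000005
00000012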
Since you said the file with lines you're looking for is sorted, you can loop through the two files in awk:
awk 'BEGIN{getline nl < "line_numbers.txt"} NR == nl {print; getline nl < "line_numbers.txt"}' big_file.txt
This will read each line in each file precisely once.
Say your index file is index.txt and your data file is data.txt; then you can do it using sed as follows:
#!/bin/bash
while read line_no
do
sed "${line_no}q;d" data.txt
done < index.txt
You could run a loop that reads from the 25-million-line file and, when the loop counter reaches a line number that you want, writes that line out. Example:
String line = "";
int count = 0;
while ((line = br.readLine()) != null)   // br is an existing BufferedReader over the big file
{
    count++;                             // 1-based line counter
    if (count == indice)                 // indice: the wanted line number
    {
        System.out.println(line);        // or write to a file
    }
}

Detect if a series of numbers is sequential in bash/awk

So I have a series of scripts that generate intermediary text files along the way as a means of storing information across different scripts. Essentially the scripts detect rows within data that have been approved by the user for removal. The line numbers that are to be removed from the source file are stored in a file.
For example, say I have a source data file like this:
a1,b1,c1,d1
a2,b2,c2,d2
a3,b3,c3,d3
a4,b4,c4,d4
a5,b5,c5,d5
a6,b6,c6,d6
a7,b7,c7,d7
And the intermediary file would contain something like this:
1 3 4 5 6
Which would result, when the script is run, in an output data file as follows:
a2,b2,c2,d2
a7,b7,c7,d7
This all works fine; there is nothing to fix in this code. The problem is that, when I'm dealing with actual data files, sometimes there are literally thousands of numbers stored in the intermediary file for removal. This means I can't use a loop, because it would take a massive amount of time, and my current method of using sed gets overloaded with an error: too many arguments. Many of the line numbers are consecutive, so here's where I get to my question:
Is there a way in bash or awk to detect whether a series of space-separated numbers are consecutive?
I can sort out everything beyond that; I'm just stumped on how to do this in one step or a series of steps. My plan, if I can detect consecutive values, is to change the intermediary file from:
1 3 4 5 6
To:
1 3-6
And then I'll be able to write code that will run on each range of values in a more manageable way.
If possible I'd like to avoid looping through each value and checking individually whether or not it's one step above the previous value, since I'm dealing with tens of thousands of numbers in a list.
If this is not possible in bash/awk, is there another way to accomplish this task to reduce the overall number of arguments passed to my script and greatly reduce the chances of encountering an error for too many arguments?
What about this?
BEGIN {
getline < "intermediate.txt"
split($0, skippedlines, " ")
skipindex = 1
}
{
if (skippedlines[skipindex] == NR)
++skipindex;
else
print
}
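Assuming the script above is saved as skip.awk (a name chosen here just for illustration) and the source data from the question is in source.csv, it would be run like this, with intermediate.txt hard-coded in the BEGIN block:
$ awk -f skip.awk source.csv
a2,b2,c2,d2
a7,b7,c7,d7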
Use cat, join, and cut:
Files infile and ids (shown side by side):
a1,b1,c1,d1 1
a2,b2,c2,d2 3
a3,b3,c3,d3 4
a4,b4,c4,d4 5
a5,b5,c5,d5 6
a6,b6,c6,d6
a7,b7,c7,d7
Removal of selected lines:
$ join -v 2 ids <(cat -n infile) | cut -f 2 -d ' '
a2,b2,c2,d2
a7,b7,c7,d7
What's going on:
First, the initial file receives an id on each line, with cat -n infile;
then, the resulting file is joined on the first column with the file holding the ids;
only non-matching lines from second file are printed -- join -v 2;
the first column, with the ids, is removed;
and, it's a neat shell one-liner (:
In case your file with ids is written as a single line, you can still make use of the above one-liner by simply adding a translation of the ids file, as follows:
$ join -v 2 <(tr ' ' '\n' < ids) <(cat -n infile) | cut -f 2 -d ' '
@jmihalicza's answer nicely uses awk to solve the whole problem of selecting the lines from the source file that match those in the intermediate file. For completeness, the following awk program reduces the list of individual line numbers to ranges, where possible, which I think answers the original question:
{ for (j = 1; j <= NF; j++) {
lin[i++] = $j;
}
}
END {
start = lin[0];
j = 1;
while (j <= i) {
end = start
while (lin[j] == (lin[j-1]+1)) {
end = lin[j++];
}
if ((end+0) > (start+0)) {
printf "%d-%d ",start,end
} else {
printf "%d ",start
}
start = lin[j++];
}
}
Given this script, which I've called merge.awk, and a file testlin.txt as follows:
1 3 4 5 6 9 10 11 13 15
... we can do this:
$ awk -f merge.awk <testlin.txt
1 3-6 9-11 13 15
This might work for you (GNU sed):
sed -r 's/\S+/&d/g;s/\s+/\n/g' intermediate_file | sed -f - source_file
Change the intermediate file into a sed script.
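For example, with the intermediate file 1 3 4 5 6 from above, the first sed builds a deletion script:
$ sed -r 's/\S+/&d/g;s/\s+/\n/g' intermediate_file
1d
3d
4d
5d
6d
The second sed (-f -) then applies that script to source_file, leaving only lines 2 and 7.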
