Print lines indexed by a second file - bash

I have two files:
File with strings (new line terminated)
File with integers (one per line)
I would like to print the lines from the first file indexed by the lines in the second file. My current solution is to do this
while read index
do
sed -n ${index}p $file1
done < $file2
It essentially reads the index file line by line and runs sed to print that specific line. The problem is that it is slow for large index files (thousands to tens of thousands of lines).
Is it possible to do this faster? I suspect awk can be useful here.
I searched SO as best I could, but could only find people trying to print line ranges instead of indexing by a second file.
UPDATE
The index is generally not shuffled. It is expected for the lines to appear in the order defined by indices in the index file.
EXAMPLE
File 1:
this is line 1
this is line 2
this is line 3
this is line 4
File 2:
3
2
The expected output is:
this is line 3
this is line 2

If I understand you correctly, then
awk 'NR == FNR { selected[$1] = 1; next } selected[FNR]' indexfile datafile
should work, under the assumption that the index is sorted in ascending order or you want lines to be printed in their order in the data file regardless of the way the index is ordered. This works as follows:
NR == FNR { # while processing the first file
selected[$1] = 1 # remember if an index was seen
next # and do nothing else
}
selected[FNR] # after that, select (print) the selected lines.
If the index is not sorted and the lines should be printed in the order in which they appear in the index:
NR == FNR { # processing the index:
++counter
idx[$0] = counter # remember that and at which position you saw
next # the index
}
FNR in idx { # when processing the data file:
lines[idx[FNR]] = $0 # remember selected lines by the position of
} # the index
END { # and at the end: print them in that order.
for(i = 1; i <= counter; ++i) {
print lines[i]
}
}
This can be inlined as well (with semicolons after ++counter and idx[$0] = counter), but I'd probably put it in a file, say foo.awk, and run awk -f foo.awk indexfile datafile. With an index file
1
4
3
and a data file
line1
line2
line3
line4
this will print
line1
line4
line3
The remaining caveat is that this assumes that the entries in the index are unique. If that, too, is a problem, you'll have to remember a list of index positions, split it while scanning the data file and remember the lines for each position. That is:
NR == FNR {
++counter
idx[$0] = idx[$0] " " counter # remember a list here
next
}
FNR in idx {
split(idx[FNR], pos) # split that list
for(p in pos) {
lines[pos[p]] = $0 # and remember the line for
# all positions in them.
}
}
END {
for(i = 1; i <= counter; ++i) {
print lines[i]
}
}
This, finally, is the functional equivalent of the code in the question. How complicated you have to go for your use case is something you'll have to decide.
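If the data file fits in memory, a shorter route to the same result is to swap the argument order: load the data file into an array first, then look up each index as it is read. This is only a sketch, not the code above, and it assumes every index is a valid line number of the data file. The file names are made up for the demo.

```shell
#!/bin/sh
# Demo files (hypothetical names) modelled on the example in the question.
tmp=$(mktemp -d)
printf 'this is line %s\n' 1 2 3 4 > "$tmp/datafile"
printf '%s\n' 3 2 3 > "$tmp/indexfile"

# The data file goes FIRST here: remember every line by its line number,
# then print one stored line per index, in index order (duplicates included).
out=$(awk 'NR == FNR { line[FNR] = $0; next }
           { print line[$1] }' "$tmp/datafile" "$tmp/indexfile")

printf '%s\n' "$out"
rm -rf "$tmp"
```

The trade-off is memory: the whole data file is held in the array, whereas the versions above only hold the index and the selected lines.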

This awk script does what you want:
$ cat lines
1
3
5
$ cat strings
string 1
string 2
string 3
string 4
string 5
$ awk 'NR==FNR{a[$0];next}FNR in a' lines strings
string 1
string 3
string 5
The first block only runs for the first file, where the line number for the current file FNR is equal to the total line number NR. It sets a key in the array a for each line number that should be printed. next skips the rest of the instructions. For the file containing the strings, if the line number is in the array, the default action is performed (so the line is printed).

Use nl to number the lines in your strings file, then use join to merge the two:
~ $ cat index
1
3
5
~ $ cat strings
a
b
c
d
e
~ $ join index <(nl strings)
1 a
3 c
5 e
If you want the inverse (show the lines that are NOT in your index):
$ join -v 2 index <(nl strings)
2 b
4 d
Mind also the comment by @glennjackman: if your files are not lexically sorted, then you need to sort them before passing them in:
$ join <(sort index) <(nl strings | sort -b)

In order to complete the answers that use awk, here's a solution in Python that you can use from your bash script:
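cat << EOF | python3
# Read the wanted line numbers into a set for fast membership tests.
with open("$file2") as f:
    wanted = {int(line) for line in f}
# Print each line of the first file whose 1-based number is wanted.
with open("$file1") as f:
    for i, line in enumerate(f, start=1):
        if i in wanted:
            print(line, end="")
EOF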
The only advantage here is that Python is much easier to understand than awk :).

Related

Change names of a columns using a mapping file

I have a file with 3 columns like this:
NC_0001 10 x
NC_0001 11 x
NC_0002 90 y
I want to change the names of the first column using another file .txt that contains the conversion, it's like:
NC_0001 1
NC_0001 1
NC_0002 2
...
So finally I should have:
1 10 x
1 11 x
2 90 y
How can I do that?
P.S. The first file is huge (50 GB), so I must use a Unix command like awk.
awk -f script.awk map_file data_file
NR == FNR { # for the first file
tab[$1] = $2 # create a k/v of the colname and rename value
}
NR != FNR { # for the second file
$1 = tab[$1] # set first column equal to the map value
print # print
}
As a one-liner
awk 'NR==FNR{t[$1]=$2} NR!=FNR{$1=t[$1];print}' map_file data_file
If possible, you should split the first file and run this command on each partition file in parallel. Then, join the results.
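The split-and-parallelize idea could look roughly like the sketch below. The chunk size, the degree of parallelism, and all file and directory names are assumptions for the demo, and xargs -P is a widely available extension rather than POSIX. This variant also leaves column 1 unchanged when a key is missing from the map, which the one-liner above does not do.

```shell
#!/bin/sh
# Demo inputs (hypothetical); the real data_file would be the 50 GB one.
tmp=$(mktemp -d)
printf '%s\n' 'NC_0001 1' 'NC_0002 2' > "$tmp/map_file"
printf '%s\n' 'NC_0001 10 x' 'NC_0001 11 x' 'NC_0002 90 y' > "$tmp/data_file"

cat > "$tmp/rename.awk" <<'EOF'
NR == FNR { t[$1] = $2; next }   # first file: build the name map
($1 in t) { $1 = t[$1] }         # known key: rename column 1
{ print }                        # print every data line
EOF

# 1. Cut the data file into chunks of whole lines (2 lines each for the demo).
mkdir "$tmp/parts"
split -l 2 "$tmp/data_file" "$tmp/parts/chunk_"

# 2. Process the chunks, at most 4 awk processes at a time.
printf '%s\n' "$tmp"/parts/chunk_* |
  xargs -P 4 -I{} sh -c 'awk -f "$1" "$2" "$3" > "$3.out"' sh \
        "$tmp/rename.awk" "$tmp/map_file" {}

# 3. Concatenate the results; the glob sorts chunk names, preserving line order.
out=$(cat "$tmp"/parts/chunk_*.out)
printf '%s\n' "$out"
rm -rf "$tmp"
```

Each worker re-reads the map file, so this only pays off when the map is small compared with the data.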

How can i use bash to find 2 values that appear on the same line of a file?

I have 3 files:
File 1:
1111111
2222222
3333333
4444444
5555555
File 2:
6666666
7777777
8888888
9999999
File 3
8888888 7777777
9999999 6666666
4444444 8888888
I want to search file 3 for lines that contain a string from both file 1 and file 2, so the result of this example would be:
4444444 8888888
because 4444444 is in file 1 and 8888888 is in file 2.
I currently have a solution, however my files contain 500+ lines and it can take a very long time to run my script:
#!/bin/sh
cat file1 | while read line
do
cat file2 | while read line2
do
grep -w -m 1 "$line" file3 | grep -w -m 1 "$line2" >> results
done
done
How can i improve this script to run this faster?
The current process is going to be slow due to the repeated scans of file2 (once for each row in file1) and file3 (once for each row in the cartesian product of file1 and file2). The additional invocation of sub-processes(as a result of the pipes |) is also going to slow things down.
So, to speed this up we want to look at reducing the number of times each file is scanned and limit the number of sub-processes we spawn.
Assumptions:
there are only 2 fields (when using white space as the delimiter) in each row of file3 (eg, we won't see a row like "field1 has several strings" "and field2 does, too"); otherwise we will need to come back and revisit the parsing of file3
First our data files (I've added a couple extra lines):
$ cat file1
1111111
2222222
3333333
4444444
5555555
5555555 # duplicate entry
$ cat file2
6666666
7777777
8888888
9999999
$ cat file3
8888888 7777777
9999999 6666666
4444444 8888888
8888888 4444444 # switch position of values
8888888XX 4444444XX # larger values; we want to validate that we're matching on exact values and not sub-strings
5555555 7777777 # want to make sure we get a single hit even though 5555555 is duplicated in `file1`
One solution using awk:
$ awk '
BEGIN { filenum=0 }
FNR==1 { filenum++ }
filenum==1 { array1[$1]++ ; next }
filenum==2 { array2[$1]++ ; next }
filenum==3 { if ( (($1 in array1) && ($2 in array2)) || (($2 in array1) && ($1 in array2)) ) print $0 }
' file1 file2 file3
Explanation:
this single awk script will process our 3 files in the order in which they're listed (on the last line)
in order to apply different logic for each file we need to know which file we're processing; we'll use the variable filenum to keep track of which file we're currently processing
BEGIN { filenum=0 } - initialize our filenum variable; while the variable should automatically be set to zero the first time it's referenced, it doesn't hurt to be explicit
FNR maintains a running count of the records processed for the current file; each time a new file is opened FNR is reset to 1
when FNR==1 we know we just started processing a new file, so increment our variable { filenum++ }
as we read values from file1 and file2 we're going to use said values as the indexes for the associative arrays array1[] and array2[], respectively
filenum==1 { array1[$1]++ ; next } - create entry in our first associative array (array1[]) with the index equal to field1 (from file1); value of the array will be a number > 0 (1 == field exists once in file, 2 == field exists twice in file); next says to skip the rest of processing and go to the next row in the current file
filenum==2 { array2[$1]++ ; next } - same as previous command except in this case we're saving fields from file2 in our second associative array (array2[])
filenum==3 - optional because if we get this far in this script we have to be on our third file (file3); again, doesn't hurt to be explicit (and makes this easier to read/understand)
{ if ( ... ) } - test if the fields from file3 exist in both file1 and file2
($1 in array1) && ($2 in array2) - true if (file3) field1 is in file1 and field2 is in file2; we test membership with in rather than adding up the counts because a value duplicated in one file (eg, 5555555 in file1) has a count of 2 on its own and would pass a sum >= 2 test without any match in the other file
($2 in array1) && ($1 in array2) - same as previous test except we're testing for our 2 fields (file3) being in the opposite source files/arrays
print $0 - if our test returns true (ie, the current fields from file3 exist in both file1 and file2) then print the current line (to stdout)
Running this awk script against my 3 files generates the following output:
4444444 8888888 # same as the desired output listed in the question
8888888 4444444 # verifies we still match if we swap positions; also verifies
# we're matching on actual values and not a sub-string (ie, no
# sign of the row `8888888XX 4444444XX`)
5555555 7777777 # only shows up in output once even though 5555555 shows up
# twice in `file1`
At this point we've a) limited ourselves to a single scan of each file and b) eliminated all sub-process calls, so this should run rather quickly.
NOTE: One trade-off of this awk solution is the requirement for memory to store the contents of file1 and file2 in the arrays; which shouldn't be an issue for the relatively small data sets referenced in the question.
You can do it faster if you load all the data first and then process it:
f1=$(cat file1)
f2=$(cat file2)
IFSOLD=$IFS; IFS=$'\n'
f3=( $(cat file3) )
IFS=$IFSOLD
for item in "${f3[@]}"; {
sub=( $item )
test1=${sub[0]}; test1=${f1//[!$test1]/}
test2=${sub[1]}; test2=${f2//[!$test2]/}
[[ "$test1 $test2" == "$item" ]] && result+="$item\n"
}
echo -e "$result" > result

Retrieve entire column to a new file if it matches from list of another file

I have a huge file and I need to retrieve specific columns from File1 which is ~ 200000 rows and ~ 1000 Columns if it matches with the list of file2. (Prefer Bash over R )
for example my dummy data files are as follows,
file1
gene s1 s2 s3 s4 s5
a 1 2 1 2 1
b 2 3 2 3 3
c 1 1 1 1 1
d 1 1 2 2 2
and file2
sample
s4
s3
s7
s8
My desired output is
gene s3 s4
a 1 2
b 2 3
c 1 1
d 2 2
likewise, i have 3 different file2 and i have to pick different samples from the same file1 into a new file.
I would be very grateful if you guys can provide me with your valuable suggestions
P.S: I am a Biologist, i have very little coding experience
Regards
Ateeq
$ cat file1
gene s1 s2 s3 s4 s5
a 1 2 1 2 1
b 2 3 2 3 3
c 1 1 1 1 1
d 1 1 2 2 2
$ cat file2
gene
s4
s3
s8
s7
$ cat a
awk '
NR == FNR {
columns[ NR ] = $0
printf "%s\t", $0
next
}
FNR == 1 {
print ""
split( $0, headers )
for (x = 1 ; x <= length(headers) ; x++ )
{
aheaders[ headers[x]] = x
}
next
}
{
for ( x = 1 ; x <= length( columns ) ; x++ )
{
if (length( aheaders[ columns[x] ] ) == 0 )
printf "N/A\t"
else
printf "%s\t" , $aheaders[ columns[x] ]
}
print ""
}
' $*
$ ./a file2 file1 | column -t
gene s4 s3 s8 s7
a 2 1 N/A N/A
b 3 2 N/A N/A
c 1 1 N/A N/A
d 2 2 N/A N/A
The above should get you on your way. It's an extremely optimistic program and no negative testing was performed.
Awk is a tool that applies a set of commands to every line of every file that matches an expression. In general, the awk script has the form:
<pattern> <command>
There are three such pairs above. Each needs a little explanation:
NR == FNR {
columns[ NR ] = $0
printf "%s\t", $0
next
}
NR == FNR is an awk-ism. NR is the record number and FNR is the record number in the current file. NR is always increasing but FNR resets to 1 when awk parses the next file. NR==FNR is an idiom that is only true when parsing the first file.
I've designed the awk program to read the columns file first (you are calling this file2). File2 has a list of columns to output. As you can see, we are storing each line in the first file (file2) into an array called columns. We are also printing the columns out as we read them. In order to avoid newlines after each column name (since we want all the column headers to be on the same line), we use printf which doesn't output a newline (as opposed to print which does).
The 'next' at the end of the stanza tells awk to read the next line in the file without processing any of the other stanzas. After all, we just want to read the first file.
In summary, the first stanza remembers the column names (and order) and prints them out on a single line (without a newline).
The second "stanza":
FNR == 1 {
print ""
split( $0, headers )
for (x = 1 ; x <= length(headers) ; x++ )
{
aheaders[ headers[x]] = x
}
next
}
FNR==1 will match on the first line of any file. Due to the next in the previous stanza, we'll only hit this stanza when we are on the first line of the second file (file1). The first print "" statement adds the newline that was missing from the first stanza. Now the line with the column headers is complete.
The split command takes the first parameter, $0, the current line, and splits it according to whitespace. We know the current line is the first line and has the column headers in it. The split command writes to an array named in the second parameter, headers. Now headers[1] = "gene", headers[2] = "s1", headers[3] = "s2", etc.
We're going to need to map the column names to the column numbers. The next bit of code takes each header value and creates an aheaders entry. aheaders is an associative array that maps column header names to the column number.
aheaders["gene"] = 1
aheaders["s1"] = 2
aheaders["s2"] = 3
aheaders["s3"] = 4
aheaders["s4"] = 5
aheaders["s5"] = 6
When we're done making the aheaders array, the next command tells awk to skip to the next line of the input. From this point on, only the third stanza is going to have a true condition.
{
for ( x = 1 ; x <= length( columns ) ; x++ )
{
if (length( aheaders[ columns[x] ] ) == 0 )
printf "N/A\t"
else
printf "%s\t" , $aheaders[ columns[x] ]
}
print ""
}
The third stanza has no explicit <pattern>. Awk will process this as always true. So this last <command> is executed for every line of the second file.
At this point, we want to print the columns that are specified in the columns array. We walk through each element of the array in order. The first time through the loop, columns[1] = "gene". This gives us:
printf "%s\t" , $aheaders[ "gene" ]
And since aheaders["gene"] = 1 this gives us:
printf "%s\t" , $1
And awk understands $1 to be the first field (or column) in the input line. Thus the first column is passed to printf which outputs the value with a tab (\t) appended.
The loop then executes another time with x=2 and columns[2]="s4". This results in the following print executing:
printf "%s\t" , $5
This prints the fifth column followed by a tab. The next iteration:
columns[3] = "s3"
aheaders["s3"] = 4
Which results in:
printf "%s\t" , $4
That is, the fourth field is output.
The next iteration we hit a failure situation:
columns[4] = "s8"
aheaders["s8"] = ''
In this case, the length( aheaders[ columns[x] ] ) == 0 is true so we just print out a placeholder - something to tell the operator their input may be invalid:
printf "N/A\t"
The same is output when we process the last columns[x] value "s7".
Now, since there are no more entries in columns, the loop exits and we hit the final print:
print ""
The empty string is provided to print because print by itself defaults to print $0 - the entire line.
At this point, awk reads the next line out of file1 and hits the third block again (only). Thus awk continues until the second file is completely read.
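The program above prints the columns in file2's order and marks missing ones as N/A. If you would rather get exactly the output shown in the question (only the columns that exist, in file1's order), a variation could look like the sketch below. It assumes the first column (gene) should always be kept; the array names want and keep are invented for this example.

```shell
#!/bin/sh
# Demo copies of the question's file1 and file2.
tmp=$(mktemp -d)
printf '%s\n' 'gene s1 s2 s3 s4 s5' 'a 1 2 1 2 1' 'b 2 3 2 3 3' \
              'c 1 1 1 1 1' 'd 1 1 2 2 2' > "$tmp/file1"
printf '%s\n' sample s4 s3 s7 s8 > "$tmp/file2"

out=$(awk '
    NR == FNR { want[$1]; next }        # file2: remember the requested names
    FNR == 1  {                         # file1 header: choose column numbers
        for (i = 1; i <= NF; i++)
            if (i == 1 || ($i in want)) keep[++n] = i
    }
    {                                   # every file1 line, header included
        row = $(keep[1])
        for (j = 2; j <= n; j++) row = row OFS $(keep[j])
        print row
    }' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Names in file2 that have no matching column (s7, s8, and the "sample" header) are silently ignored here, so there is no warning if the list contains a typo.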

Find nth row using AWK and assign them to a variable

Okay, I have two files: one is a baseline and the other is a generated report. I have to validate that a specific block of text matches in both files; it is not just a single word, see the example below:
.
.
name os ksd
56633223223
some text..................
some text..................
My search criteria here is to find unique number such as "56633223223" and retrieve above 1 line and below 3 lines, i can do that on both the basefile and the report, and then compare if they match. In whole i need shell script for this.
Since the strings above and below are unique but the line count varies, I had put it in a file called "actlist":
56633223223 1 5
56633223224 1 6
56633223225 1 3
.
.
Now from below "Rcount" I get how many iterations to be performed, and in each iteration i have to get ith row and see if the word count is 3, if it is then take those values into variable form and use something like this
I'm stuck at the below, which command to be used. I'm thinking of using AWK but if there is anything better please advise. Here's some pseudo-code showing what I'm trying to do:
xxxxx=/root/xxx/xxxxxxx
Rcount=`wc -l $xxxxx | awk -F " " '{print $1}'`
i=1
while ((i <= Rcount))
do
record=_________________'(Awk command to retrieve ith(1st) record (of $xxxx),
wcount=_________________'(Awk command to count the number of words in $record)
(( i=i+1 ))
done
Note: record, wcount values are later printed to a log file.
Sounds like you're looking for something like this:
#!/bin/bash
while read -r word1 word2 word3 junk; do
if [[ -n "$word1" && -n "$word2" && -n "$word3" && -z "$junk" ]]; then
echo "all good"
else
echo "error"
fi
done < /root/shravan/actlist
This will go through each line of your input file, assigning the three columns to word1, word2 and word3. The -n tests that read hasn't assigned an empty value to each variable. The -z checks that there are only three columns, so $junk is empty.
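If you do need the whole record and its word count in variables, as in the pseudo-code from the question, something along these lines may be enough without calling awk at all. This is a sketch: the temporary actlist stands in for the real path, and the second line is deliberately short to show a count other than 3.

```shell
#!/bin/sh
tmp=$(mktemp -d)
actlist="$tmp/actlist"     # stand-in for the real file
printf '%s\n' '56633223223 1 5' '56633223224 1' '56633223225 1 3' > "$actlist"

out=$(
    set -f                 # no filename expansion while word-splitting
    while IFS= read -r record; do
        set -- $record     # split the record on whitespace into $1, $2, ...
        wcount=$#          # number of words in this record
        printf '%s words: %s\n' "$wcount" "$record"
    done < "$actlist"
)

printf '%s\n' "$out"
rm -rf "$tmp"
```

Inside the loop, record holds the i-th line and wcount its number of words, so they can be written to a log file from there.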
I PROMISE you you are going about this all wrong. To find words in file1 and search for those words in file2 and file3 is just:
awk '
NR==FNR{ for (i=1;i<=NF;i++) words[$i]; next }
{ for (word in words) if ($0 ~ word) print FILENAME, word }
' file1 file2 file3
or similar (assuming a simple grep -f file1 file2 file3 isn't adequate). It DOES NOT involve shell loops to call awk to pull out strings to save in shell variables to pass to other shell commands, etc, etc.
So far all you're doing is asking us to help you implement part of what you think is the solution to your problem, but we're struggling to do that because what you're asking for doesn't make sense as part of any kind of reasonable solution to what it sounds like your problem is so it's hard to suggest anything sensible.
If you tell us what you are trying to do AS A WHOLE with sample input and expected output for your whole process then we can help you.
We don't seem to be getting anywhere so let's try a stab at the kind of solution I think you might want and then take it from there.
Look at these 2 files "old" and "new" side by side (line numbers added by the cat -n):
$ paste old new | cat -n
1 a b
2 b 56633223223
3 56633223223 c
4 c d
5 d h
6 e 56633223225
7 f i
8 g Z
9 h k
10 56633223225 l
11 i
12 j
13 k
14 l
Now let's take this "actlist":
$ cat actlist
56633223223 1 2
56633223225 1 3
and run this awk command on all 3 of the above files (yes, I know it could be briefer, more efficient, etc. but favoring simplicity and clarity for now):
$ cat tst.awk
ARGIND==1 {
numPre[$1] = $2
numSuc[$1] = $3
}
ARGIND==2 {
oldLine[FNR] = $0
if ($0 in numPre) {
oldHitFnr[$0] = FNR
}
}
ARGIND==3 {
newLine[FNR] = $0
if ($0 in numPre) {
newHitFnr[$0] = FNR
}
}
END {
for (str in numPre) {
if ( str in oldHitFnr ) {
if ( str in newHitFnr ) {
for (i=-numPre[str]; i<=numSuc[str]; i++) {
oldFnr = oldHitFnr[str] + i
newFnr = newHitFnr[str] + i
if (oldLine[oldFnr] != newLine[newFnr]) {
print str, "mismatch at old line", oldFnr, "new line", newFnr
print "\t" oldLine[oldFnr], "vs", newLine[newFnr]
}
}
}
else {
print str, "is present in old file but not new file"
}
}
else if (str in newHitFnr) {
print str, "is present in new file but not old file"
}
}
}
.
$ awk -f tst.awk actlist old new
56633223225 mismatch at old line 12 new line 8
j vs Z
It's outputting that result because the 2nd line after 56633223225 is j in file "old" but Z in file "new" and the file "actlist" said the 2 files had to be common from one line before until 3 lines after that pattern.
Is that what you're trying to do? The above uses GNU awk for ARGIND but the workaround is trivial for other awks.
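For awks without ARGIND, one common workaround is to keep your own file counter and bump it whenever FNR is 1. A minimal sketch follows; the variable name argind and the three demo files are made up. Note that an empty input file never produces a record with FNR == 1, so the counter would fall behind in that case.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf '%s\n' a b > "$tmp/one"
printf '%s\n' c   > "$tmp/two"
printf '%s\n' d e > "$tmp/three"

out=$(awk '
    FNR == 1    { argind++ }                 # a new file just started
    argind == 1 { first[FNR]  = $0 }         # replaces ARGIND==1
    argind == 2 { second[FNR] = $0 }         # replaces ARGIND==2
    argind == 3 { print "file 3, line " FNR ": " $0 }
    END         { print "files seen: " argind }
' "$tmp/one" "$tmp/two" "$tmp/three")

printf '%s\n' "$out"
rm -rf "$tmp"
```

In tst.awk this would mean adding the FNR == 1 rule at the top and replacing each ARGIND with the counter.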
Use the below code:
awk '{if (NF == 3) { word1=$1; word2=$2; word3=$3; print "Words are:" word1, word2, word3} else {print "Line", NR, "is having", NF, "Words" }}' filename.txt
I have given the solution as per the requirement.
awk '{ # awk starts from here and read a file line by line
if (NF == 3) # It will check if current line is having 3 fields. NF represents number of fields in current line
{ word1=$1; # If current line is having exact 3 fields then 1st field will be assigned to word1 variable
word2=$2; # 2nd field will be assigned to word2 variable
word3=$3; # 3rd field will be assigned to word3 variable
print word1, word2, word3} # It will print all 3 fields
}' filename.txt >> output.txt # These 3 fields will be redirected to a file which can be used for further processing.
This follows the requirement as stated; there are many other ways of doing this, but the question asked for awk.

finding common rows in files based on one column

I have 15 files like
file1.csv
a,cg2,0,0,0,21,0
a,cq1,10,0,0,0,0
a,cm2,0,19,0,0,0
...
a,ad10,0,0,0,37,0
file2.csv
d,cm1,0,3,0,0,0
d,cs2,0,32,0,0,0
d,cg2,0,0,9,0,0
...
d,az2,0,0,0,21,0
.
.
.
.
file15.csv
s,sd1,0,23,0,0,0
s,cw1,0,0,7,0,0
s,c23,0,0,90,0,0
...
s,cg2,0,45,0,0,0
I have different number of lines in each file and I want to compare the second field of all 15 files and extract the lines which are common to second field of all 15 files.
in this above case
output is:
cg2
(taking it is common to second field of all 15 files)
I am little new to unix and shell scripting, please help
Do you want the full lines from each of the fifteen files where field 2 appears in all fifteen files? Or do you only want a list of the field 2 values that appear in all fifteen files?
The former:
a,cg2,0,0,0,21,0
d,cg2,0,0,9,0,0
. . .
s,cg2,0,45,0,0,0
. . .
The latter:
cg2
. . .
If the latter, then this should work
awk -F, '{arr[$2]++; if (FILENAME != prevfile) {c++; prevfile = FILENAME}} END {for (i in arr) {if (arr[i] >= c) {print i}}}' file*.csv
Broken out on multiple lines:
awk -F, '{
arr[$2]++;
if (FILENAME != prevfile) {
c++;
prevfile = FILENAME
}
}
END {
for (i in arr) {
if (arr[i] >= c) {
print i
}
}
}' file*.csv
Explanation:
increment the count of the number of times a field 2 value occurs
if the filename changes, increment the count of files (the first file changes from a null string to its filename and the count increments from 0 to 1)
save the current filename
once all the counting is done, iterate over the array by its keys
if the count contained in the array is greater than or equal to the number of files, then the field 2 value is taken to have appeared in all the files (>= is used instead of == so that a value appearing more than once in a single file still qualifies; note that this also lets through a value that is repeated in some files but missing from others)
so print the key (which is a field 2 value)
a glob is used to get all the files, but you could list them explicitly
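Because arr[$2]++ counts occurrences rather than files, a value that is repeated in some files can reach the threshold without being present in every file. A sketch that counts each value at most once per file avoids that; the names seen, count and nfiles, and the small demo files, are invented for this example.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# cq1 occurs three times in total but is missing from file2.csv.
printf '%s\n' 'a,cg2,0' 'a,cq1,10' 'a,cq1,11' > "$tmp/file1.csv"
printf '%s\n' 'd,cm1,0' 'd,cg2,9'             > "$tmp/file2.csv"
printf '%s\n' 's,cg2,45' 's,cq1,3' 's,cg2,7'  > "$tmp/file3.csv"

out=$(awk -F, '
    FNR == 1 { nfiles++ }                    # count the (non-empty) files
    !seen[FILENAME, $2]++ { count[$2]++ }    # first sighting in this file only
    END { for (key in count) if (count[key] == nfiles) print key }
' "$tmp"/file*.csv)

printf '%s\n' "$out"
rm -rf "$tmp"
```

With per-file counting the comparison can go back to ==, since no value can be counted more than once per file.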
Edit:
Here's a way to print the full matching lines using a two-pass technique. It's a modification of the version above. Make sure to list the files twice.
awk -F, '
FILENAME == first && flag {
exit
}
! first {
first = FILENAME
}
FILENAME != first {
flag = 1
}
{
arr[$2]++;
if (FILENAME != prevfile) {
c++;
prevfile = FILENAME
}
}
END {
# print the matching lines (field 2 was seen at least once per file)
do {
if (arr[$2] >= c) {
print;
}
} while ((getline) > 0);
# print the list of words
for (i in arr) {
if (arr[i] >= c) {
print i
}
}
}' file*.csv file*.csv
It depends on the first file in the first group being the same name as the first file in the second group. Using globbing similar to what I've shown will take care of that requirement.
It prints the matching lines (not grouped, though), then it prints the list of words. If you want only one or the other, comment out or remove the loop that you don't want (do/while or for).
If you print only the full lines, you can pipe the output to:
sort -t , -k2,2
to have them grouped.
Piping only the list of words to:
sort
will put them in the same order for easier comparison.
Fun problem.
One way to do it, entirely in Bash, is as follows.
One thing you will need to invoke is join -t ',' -1 2 -2 2 file1 file2 to join on the second column of two files. Before you can join, though, you must sort on the second column.
Do successive joins in a for-loop, because join takes only two files as arguments.
ADDENDUM
Here is a little transcript showing successive joins. You can adapt it fairly easily, I think.
$ cat 1.csv
a,b,c,d
e,f,g,h
i,j,k,l
$ cat 2.csv
7,5,4,3
3,b,s,e
2,f,5,5
$ cat 3.csv
4,5,6,7
0,0,0,0
1,b,4,4
$ join -t ',' -1 2 -2 2 1.csv 2.csv | cut -f 1 -d ',' > temp
$ cat temp
b
f
$ join -t ',' -2 2 temp 3.csv | cut -f 1 -d ','
b
The first join (on the first two files) produces the joined value in the first column of the result. So as you join to file3, file4, file5, etc., you will be using the first column of the result you are generating, which is why you only need the -2 option. To keep things very efficient, always cut out all but the first column each time you do the join.
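Put into a loop, the successive joins might look like the sketch below. It cuts out and sorts the second column of each file before joining, so the inputs do not have to be pre-sorted. The temporary file names (keys, common) are made up, and LC_ALL=C is set only so that sort and join agree on the ordering.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf '%s\n' 'a,b,c,d' 'e,f,g,h' 'i,j,k,l' > "$tmp/1.csv"
printf '%s\n' '7,5,4,3' '3,b,s,e' '2,f,5,5' > "$tmp/2.csv"
printf '%s\n' '4,5,6,7' '0,0,0,0' '1,b,4,4' > "$tmp/3.csv"

LC_ALL=C; export LC_ALL     # one collation order for both sort and join

first=1
for f in "$tmp"/*.csv; do
    # Second column of this file, sorted and de-duplicated.
    cut -d, -f2 "$f" | sort -u > "$tmp/keys"
    if [ "$first" -eq 1 ]; then
        cp "$tmp/keys" "$tmp/common"     # start with the first file's values
        first=0
    else
        # Keep only the values that are also present in this file.
        join -t, "$tmp/common" "$tmp/keys" > "$tmp/common.new"
        mv "$tmp/common.new" "$tmp/common"
    fi
done

out=$(cat "$tmp/common")
printf '%s\n' "$out"
rm -rf "$tmp"
```

Since only single-column key lists are joined, the intermediate results stay small no matter how wide the original files are.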

Resources