Bash - select lines of a file based on values in another file

I have 2 files; let's call them file1 and file2. file1 contains a start and an end coordinate in each row, e.g.:
start end
2000 2696
3465 3688
8904 9546
etc.
file2 has several columns, of which the first is the most relevant for the question:
position v2 v3 v4
3546 value12 value13 value14
9847 value22 value23 value24
12000 value32 value33 value34
Now, I need to output a new file, which will contain only the lines of file2 for which the 'position' value (1st column) is in between the 'start' and the 'end' value of any of the rows of file1. In R I'd just make a double loop, but it takes too much time (the files are large), so I need to do it in bash. In case the question is unclear, here's the R loop that would do the job:
for(i in 1:dim(file1)[1]){
  for(j in 1:dim(file2)[1]){
    if(file2[j,1]>file1$start[i] & file2[j,1]<file1$end[i]) file2$select=1 else file2$select=0
  }
}
Very sure there's a simple way of doing this using bash / awk...

The awk will look like this, but you'll need to remove the first line from file1 and file2 first:
awk 'FNR==NR{x[i]=$1;y[i++]=$2;next}{for(j=0;j<i;j++){if($1>=x[j]&&$1<=y[j]){print $0}}}' file1 file2
The bit in curly braces after "FNR==NR" only applies to the processing of file1 and it says to store field 1 in array x[] and field 2 in array y[] so we have the upper and lower bounds of each range. The bit in the second set of curly braces applies to processing file2 only. It says to iterate through all the bounds in array x[] and y[] and see if field 1 is between the bounds, and print the whole record if it is.
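For example, you could strip the header lines into temporary copies first and run the one-liner on those (the .noheader names here are just made up for this sketch):
tail -n +2 file1 > file1.noheader
tail -n +2 file2 > file2.noheader
awk 'FNR==NR{x[i]=$1;y[i++]=$2;next}{for(j=0;j<i;j++){if($1>=x[j]&&$1<=y[j]){print $0}}}' file1.noheader file2.noheader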
If you don't want to remove the header line at the start, you can make the awk a little more complicated and ignore it like this:
awk 'FNR==1{next}FNR==NR{x[i]=$1;y[i++]=$2;next}{for(j=0;j<i;j++){if($1>=x[j]&&$1<=y[j]){print $0}}}' file1 file2
EDITED
Ok, I have added code to check "chromosome" (whatever that is!) assuming it is in the first field in both files, like this:
File1
x 2000 2696
x 3465 3688
x 8904 9546
File2
x 3546 value12 value13 value14
y 3467 value12 value13 value14
x 9847 value22 value23 value24
x 12000 value32 value33 value34
So the code now stores the chromosome in array c[] as well and checks they are equal before outputting.
awk 'BEGIN{i=0}FNR==NR{c[i]=$1;x[i]=$2;y[i++]=$3;next}{for(j=0;j<i;j++){if(c[j]==$1&&$2>=x[j]&&$2<=y[j]){print $0;next}}}' file1 file2
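With the sample File1 and File2 above, only the first row of File2 has a position that falls inside a range on the same chromosome (3546 is between 3465 and 3688 on x), so the output should be just:
x 3546 value12 value13 value14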

Don't know how to do this in bash...
I would try a perl script, reading the first file and storing it in memory (if it's possible, it depends on its size) and then going through the second file line by line and doing the comparisons to output the line or not.
I think you can do this in R too, the same way: storing the first file and looping over each line of the second file.
Moreover if the intervals don't overlap, you can do a sort on the files to speed up your algorithm.
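If both files are sorted by position and the ranges don't overlap, a single merge-style pass is enough; here is a rough awk sketch of that idea (assuming headers already removed and both inputs sorted ascending):
awk 'NR==FNR{s[++n]=$1; e[n]=$2; next}
     {while (i<n && $1>e[i+1]) i++; if (i<n && $1>=s[i+1] && $1<=e[i+1]) print}' file1 file2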

This should be faster than the for loop:
res <- apply(file2, 1, function(row) {
  any(as.numeric(row["position"]) > file1$start & as.numeric(row["position"]) < file1$end)
})

Assuming the delimiter for the files is a space (if not, change the -d setting).
The script uses cut to extract the first field of file2.
Then a simple grep searches for the field in file1. If present, the line from file2 is printed.
#!/bin/bash
while read -r line
do
    word=$(echo "$line" | cut -f1 -d" ")
    if grep -q "$word" file1; then
        echo "$line"
    fi
done < file2

Related

How can I use bash to find 2 values that appear on the same line of a file?

I have 3 files:
File 1:
1111111
2222222
3333333
4444444
5555555
File 2:
6666666
7777777
8888888
9999999
File 3
8888888 7777777
9999999 6666666
4444444 8888888
I want to search file 3 for lines that contain a string from both file 1 and file 2, so the result of this example would be:
4444444 8888888
because 4444444 is in file 1 and 8888888 is in file 2.
I currently have a solution, however my files contain 500+ lines and it can take a very long time to run my script:
#!/bin/sh
cat file1 | while read line
do
cat file2 | while read line2
do
grep -w -m 1 "$line" file3 | grep -w -m 1 "$line2" >> results
done
done
How can I improve this script to run faster?
The current process is going to be slow due to the repeated scans of file2 (once for each row in file1) and file3 (once for each row in the Cartesian product of file1 and file2). The additional invocation of sub-processes (as a result of the pipes |) is also going to slow things down.
So, to speed this up we want to look at reducing the number of times each file is scanned and limit the number of sub-processes we spawn.
Assumptions:
there are only 2 fields (when using white space as the delimiter) in each row of file3 (e.g., we won't see a row like "field1 has several strings" "and field2 does, too"); otherwise we will need to come back and revisit the parsing of file3
First our data files (I've added a couple extra lines):
$ cat file1
1111111
2222222
3333333
4444444
5555555
5555555 # duplicate entry
$ cat file2
6666666
7777777
8888888
9999999
$ cat file3
8888888 7777777
9999999 6666666
4444444 8888888
8888888 4444444 # switch position of values
8888888XX 4444444XX # larger values; we want to validate that we're matching on exact values and not sub-strings
5555555 7777777 # want to make sure we get a single hit even though 5555555 is duplicated in `file1`
One solution using awk:
$ awk '
BEGIN { filenum=0 }
FNR==1 { filenum++ }
filenum==1 { array1[$1]++ ; next }
filenum==2 { array2[$1]++ ; next }
filenum==3 { if ( array1[$1]+array2[$2] >= 2 || array1[$2]+array2[$1] >= 2) print $0 }
' file1 file2 file3
Explanation:
this single awk script will process our 3 files in the order in which they're listed (on the last line)
in order to apply different logic for each file we need to know which file we're processing; we'll use the variable filenum to keep track of which file we're currently processing
BEGIN { filenum=0 } - initialize our filenum variable; while the variable should automatically be set to zero the first time it's referenced, it doesn't hurt to be explicit
FNR maintains a running count of the records processed for the current file; each time a new file is opened FNR is reset to 1
when FNR==1 we know we just started processing a new file, so increment our variable { filenum++ }
as we read values from file1 and file2 we're going to use said values as the indexes for the associative arrays array1[] and array2[], respectively
filenum==1 { array1[$1]++ ; next } - create an entry in our first associative array (array1[]) with the index equal to field1 (from file1); the value of the array will be a number > 0 (1 == field exists once in the file, 2 == field exists twice in the file); next says to skip the rest of the processing and go to the next row in the current file
filenum==2 { array2[$1]++ ; next } - same as previous command except in this case we're saving fields from file2 in our second associative array (array2[])
filenum==3 - optional because if we get this far in this script we have to be on our third file (file3); again, doesn't hurt to be explicit (and makes this easier to read/understand)
{ if ( ... ) } - test if the fields from file3 exist in both file1 and file2
array1[$1]+array2[$2] >= 2 - if (file3) field1 is in file1 and field2 is in file2 then we should find matches in both arrays and the sum of the array element values should be >= 2
array1[$2]+array2[$1] >= 2 - same as the previous test except we're checking for our 2 fields (file3) being in the opposite source files/arrays
print $0 - if our test returns true (ie, the current fields from file3 exist in both file1 and file2) then print the current line (to stdout)
Running this awk script against my 3 files generates the following output:
4444444 8888888 # same as the desired output listed in the question
8888888 4444444 # verifies we still match if we swap positions; also verifies
# we're matching on actual values and not a sub-string (ie, no
# sign of the row `8888888XX 4444444XX`)
5555555 7777777 # only shows up in output once even though 5555555 shows up
# twice in `file1`
At this point we've a) limited ourselves to a single scan of each file and b) eliminated all sub-process calls, so this should run rather quickly.
NOTE: One trade-off of this awk solution is the requirement for memory to store the contents of file1 and file2 in the arrays; which shouldn't be an issue for the relatively small data sets referenced in the question.
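If the values are always whole words, a shorter alternative sketch is two grep passes with fixed-string, whole-word matching; it trades the explicit arrays for grep's own pattern lists:
grep -wFf file1 file3 | grep -wFf file2 > results
The first grep keeps the lines of file3 that contain a value from file1; the second keeps only those that also contain a value from file2.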
You can do it faster if you load all the data first and then process it:
f1=$(cat file1)
f2=$(cat file2)
IFSOLD=$IFS; IFS=$'\n'
f3=( $(cat file3) )
IFS=$IFSOLD
for item in "${f3[@]}"; {
    sub=( $item )
    test1=${sub[0]}; test1=${f1//[!$test1]/}
    test2=${sub[1]}; test2=${f2//[!$test2]/}
    [[ "$test1 $test2" == "$item" ]] && result+="$item\n"
}
echo -e "$result" > result

How to average the values of different files and save them in a new file

I have about 140 files with data which I would like to process with a script.
The files have two types of names:
sys-time-4-16-80-15-1-1.txt
known-ratio-4-16-80-15-1-1.txt
where the two last numbers vary. The penultimate number takes the values 1, 50, 100, 150, ..., 300, and the last number ranges from 1 to 10.
I would like to write a new file with 3 columns as follows:
A 1st column with the penultimate number of the file, i.e., 1, 50, 100, ...
A 2nd column with the mean value of the second column in each sys-time-.. file.
A 3rd column with the mean value of the second column in each known-ratio-.. file.
The result might have a row for each pair of averaged 2nd columns of sys and known files:
1 mean-sys-1 mean-know-1
1 mean-sys-2 mean-know-2
.
.
1 mean-sys-10 mean-know-10
50 mean-sys-1 mean-know-1
50 mean-sys-2 mean-know-2
.
.
50 mean-sys-10 mean-know-10
100 mean-sys-1 mean-know-1
100 mean-sys-2 mean-know-2
.
.
100 mean-sys-10 mean-know-10
....
....
300 mean-sys-10 mean-know-10
where each row corresponds with the sys and known files with the same two last numbers.
Besides, I would like to copy into the first column the penultimate number of the files.
I know how to compute the mean value of the second column of a file with awk:
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n; }' sys-time-4-16-80-15-1-5.txt
but I do not know how to iterate on all the files and build a result file with the three columns as above.
Here's a shell script that uses GNU datamash to compute the averages (Though you can easily swap out to awk if desired; I prefer datamash for calculating stats):
#!/bin/sh
nums=$(mktemp)
sysmeans=$(mktemp)
knownmeans=$(mktemp)
for systime in sys-time-*.txt
do
    knownratio=$(echo -n "$systime" | sed -e 's/sys-time/known-ratio/')
    echo "$systime" | sed -E 's/.*-([0-9]+)-[0-9]+\.txt/\1/' >> "$nums"
    datamash -W mean 2 < "$systime" >> "$sysmeans"
    datamash -W mean 2 < "$knownratio" >> "$knownmeans"
done
paste "$nums" "$sysmeans" "$knownmeans"
rm -f "$nums" "$sysmeans" "$knownmeans"
It creates three temporary files, one per column, and after populating them with the data from each pair of files, one pair per line of each, uses paste to combine them all and print the result to standard output.
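If datamash isn't installed, the two datamash calls could presumably be replaced with the awk one-liner from the question, for example:
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n }' "$systime" >> "$sysmeans"
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n }' "$knownratio" >> "$knownmeans"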
I've used GNU Awk for easy, per-file operations. This is untested; please let me know how it runs. You might want to look into printf() for pretty-printed output.
mapfile -t Files < <(find . -type f -name "*-4-16-80-15-*" | sort -t\- -k7,7n -k8,8n) #1
gawk '
BEGINFILE {n=split(FILENAME, f, "-"); type=f[1]; sub(/^.*\//, "", type); a[type]=0; c=0} #2
{a[type] = ($2 + a[type] * c++) / c} #3
ENDFILE {if(type=="sys") print f[n], a["sys"], a["known"]} #4
' "${Files[@]}"
Create a Bash array with matching files sorted by the last two "keys". We will feed this array to Awk later. Notice how we alternate between "sys" and "known" files in this sample:
./known-ratio-4-16-80-15-2-150
./sys-time-4-16-80-15-2-150
./known-ratio-4-16-80-15-3-1
./sys-time-4-16-80-15-3-1
./known-ratio-4-16-80-15-3-50
./sys-time-4-16-80-15-3-50
At the beginning of every file, reset the running average and the line counter, and save the type as either "sys" or "known" (stripping the leading path from the filename).
On every line, calculate the Cumulative Moving Average.
At the end of every file, check the file type. If we just handled a "sys" file, print the last part of the filename followed by our averages.
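For reference, the running-average update (step #3 above) is just the incremental form of the mean: new_average = (value + old_average * count) / (count + 1); the post-increment c++ supplies the old count in the numerator and the new count in the denominator.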

How to select a specific percentage of lines?

Goodmorning !
I have a file.csv with 140 lines and 26 columns. I need to sort the lines according to the values in column 23. Here is an example:
Controller1,NA,ASHEBORO,ASH,B,,3674,4572,1814,3674,4572,1814,1859,#NAME?,0,124.45%,49.39%,19%,1,,"Big Risk, No Spare disk",45.04%,4.35%,12.63%,160,464,,,,,,0,1,1,1,0,410,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller2,EU,FR,URG,D,,0,0,0,0,0,0,0,#NAME?,0,#DIV/0!,#DIV/0!,#DIV/0!,1,,#N/A,0.00%,0.00%,#DIV/0!,NO STATS,-1088,,,,,,#N/A,#N/A,#N/A,#N/A,0,#N/A,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller3,EU,FR,URG,D,,0,0,0,0,0,0,0,#NAME?,0,#DIV/0!,#DIV/0!,#DIV/0!,1,,#N/A,0.00%,0.00%,#DIV/0!,NO STATS,-2159,,,,,,#N/A,#N/A,#N/A,#N/A,0,#N/A,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller4,NA,STARR,STA,D,,4430,6440,3736,4430,6440,3736,693,#NAME?,0,145.38%,84.35%,18%,1,,No more Data disk,65.17%,19.18%,-2.18%,849,-96,,,,,,0,2,1,2,2,547,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
To sort the lines according to the values of column 23, I do this:
awk -F "%*," '$23 > 4' myfile.csv
The result :
Controller1,NA,ASHEBORO,ASH,B,,3674,4572,1814,3674,4572,1814,1859,#NAME?,0,124.45%,49.39%,19%,1,,"Big Risk, No Spare disk",45.04%,4.35%,12.63%,160,464,,,,,,0,1,1,1,0,410,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller4,NA,STARR,STA,D,,4430,6440,3736,4430,6440,3736,693,#NAME?,0,145.38%,84.35%,18%,1,,No more Data disk,65.17%,19.18%,-2.18%,849,-96,,,,,,0,2,1,2,2,547,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
In my example, I use the value of 4% for column 23, the goal being to retrieve all the rows whose percentage in column 23 is significantly high. The problem is that I can't rely on the 4% value because it is only representative of the current table, so I have to find another way to retrieve the rows that have a high value in column 23.
I have to sort the Controllers in descending order according to the percentage in column 23, and I would prefer to process the first 10% of the sorted lines to make sure I get the controllers with a large percentage.
The goal is to be able to vary the percentage according to the number of lines in the table.
Do you have any tips for that ?
Thanks ! :)
I could have sworn that this question was a duplicate, but so far I couldn't find a similar question.
Whether your file is sorted or not does not really matter. From any file you can extract the first NUMBER lines with head -n NUMBER. There is no built-in way to specify the count as a percentage, but you can compute the NUMBER of lines that corresponds to PERCENT% of your file.
percentualHead() {
    percent="$1"
    file="$2"
    linesTotal="$(wc -l < "$file")"
    (( lines = linesTotal * percent / 100 ))
    head -n "$lines" "$file"
}
or shorter but less readable
percentualHead() {
    head -n "$(( "$(wc -l < "$2")" * "$1" / 100 ))" "$2"
}
Calling percentualHead 10 yourFile will print the first 10% of lines from yourFile to stdout.
Note that percentualHead only works with files because the file has to be read twice. It does not work with FIFOs, <(), or pipes.
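If the input really is a pipe, one workaround is to buffer the lines in awk and emit the first p% at the end; a minimal sketch (it trades the second read for memory):
somecommand | awk -v p=10 '{ line[NR] = $0 } END { n = int(NR * p / 100); for (i = 1; i <= n; i++) print line[i] }'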
If you want to use standard tools, you'll need to read the file twice. But if you're content to use perl, you can simply do:
perl -e 'my @sorted = sort <>; print @sorted[0..$#sorted * .10]' input-file
Here is one for GNU awk to get the top p% from the file, but the lines are output in their order of appearance:
$ awk -F, -v p=0.5 '                  # 50 % of top $23 records
NR==FNR {                             # first run
    a[NR]=$23                         # hash percentages to a, NR as key
    next
}
FNR==1 {                              # second run, at beginning
    n=asorti(a,a,"@val_num_desc")     # sort percentages into descending order
    for(i=1;i<=n*p;i++)               # get only the top p %
        b[a[i]]                       # hash their NRs to b
}
(FNR in b)                            # top p % BUT not in order
' file file | cut -d, -f 23           # file processed twice, cut 23rd field for demo
45.04%
19.18%

Getting specific lines of a file

I have this file with 25 million rows. I want to get 10 million specific lines from this file.
I have the indices of these lines in another file. How can I do it efficiently?
Assuming that the list of lines is in a file list-of-lines and the data is in data-file, and that the numbers in list-of-lines are in ascending order, then you could write:
current=0
while read wanted
do
    while ((current < wanted))
    do
        if read -u 3 line
        then ((current++))
        else break 2
        fi
    done
    echo "$line"
done < list-of-lines 3< data-file
This uses the Bash extension that allows you to specify which file descriptor read should read from (read -u 3 to read from file descriptor 3). The list of line numbers to be printed is read from standard input; the data file is read from file descriptor 3. This makes one pass through each of the two files, which is within a constant factor of optimal.
If the list-of-lines is not sorted, replace the last line with the following, which uses the Bash extension called process substitution:
done < <(sort -n list-of-lines) 3< data-file
Assume that the file containing line indices is called "no.txt" and the data file is "input.txt".
awk '{printf "%08d\n", $1}' no.txt > no.1.txt
nl -n rz -w 8 input.txt | join - no.1.txt | cut -d " " -f1 --complement > output.txt
The output.txt will have the wanted lines. I am not sure if this is efficient enough, but it seems to be faster than this script (https://stackoverflow.com/a/22926494/3264368) in my environment.
Some explanations:
The 1st command preprocesses the indices file so that the numbers are right-adjusted with leading zeroes and a width of 8 (since the number of rows in input.txt is known to be 25M).
The 2nd command prints the rows with their line numbers in exactly the same format as in the preprocessed index file, then joins them to get the wanted rows (cut removes the line numbers).
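For illustration, the zero-padded numbering produced by nl -n rz -w 8 looks roughly like this (the data lines here are invented):
00000001    first data line
00000002    second data line
which is what makes the textual join against the reformatted index file line up.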
Since you said the file with lines you're looking for is sorted, you can loop through the two files in awk:
awk 'BEGIN{getline nl < "line_numbers.txt"} NR == nl {print; getline nl < "line_numbers.txt"}' big_file.txt
This will read each line in each file precisely once.
Say your index file is index.txt and your data file is data.txt; then you can do it using sed as follows:
#!/bin/bash
while read line_no
do
    sed "${line_no}q;d" data.txt
done < index.txt
You could run a loop that reads from the 25-million-line file and, when the loop counter reaches a line number that you want, tell it to write that line. For example (in Java):
String line = "";
int count = 0;
while ((line = br.readLine()) != null)
{
    count++;                       // current line number (1-based)
    if (count == indice)           // indice = the wanted line number
    {
        System.out.println(line);  // or write to a file
    }
}

Bash: Sum fields of a line

I have a file with the following format:
a 1 2 3 4
b 7 8
c 120
I want it to be parsed into:
a 10
b 15
c 120
I know this can be easily done with awk, but I'm not familiar with the syntax and can't get it to work for me.
Thanks for any help
OK, simple awk primer:
awk '{ for (i=2;i<=NF;i++) { total+=$i }; print $1,total; total=0 }' file
NF is an internal variable that is reset on each line and is equal to the number of fields on that line so
for (i=2;i<=NF;i++) starts a for loop starting at 2
total+=$i means the var total has the value of the i'th field added to it, and this is performed for each iteration of the loop above.
print $1,total prints the 1st field followed by the contents of OFS variable (space by default) then the total for that line.
total=0 resets the total var ready for the next iteration.
all of the above is done on each line of input.
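Running it on the sample input from the question gives:
$ awk '{ for (i=2;i<=NF;i++) { total+=$i }; print $1,total; total=0 }' file
a 10
b 15
c 120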
For more info see Grymoire's awk intro.
Start from column two and add them:
awk '{tot=0; for(i=2;i<=NF;i++) tot+=$i; print $1, tot;}' file
A pure bash solution:
$ while read f1 f2
> do
> echo $f1 $((${f2// /+}))
> done < file
On running it, got:
a 10
b 15
c 120
The first field is read into variable f1 and the rest of the fields into f2. In variable f2, spaces are replaced with + and the resulting expression is then evaluated arithmetically.
Here's a tricky way to use a subshell, positional parameters and IFS. Works with various amounts of whitespace between the fields.
while read label numbers; do
echo $label $(set -- $numbers; IFS=+; bc <<< "$*")
done < filename
This works because the shell expands "$*" into a single string of the positional parameters joined by the first character of $IFS.
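For example, with the middle line of the sample file the subshell effectively does:
set -- 7 8; IFS=+; echo "$*"   # prints 7+8
bc <<< "7+8"                   # prints 15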
