Get variable value by finding keyword in unix environment - bash

In a UNIX environment, I have a file.txt that contains the following details:
Data recording started:
0001100 Matched at 412090
0001101 Mismatched at 414798
0001102 Matched at 420007
0001103 Mismatched at 420015
Job completed
How can I get the first "Matched" value by searching for the word "Matched" (line 2), and likewise the first "Mismatched" value (line 3)?
Then find the difference between them and store it as a variable, "dif".
The result should be Matched minus Mismatched, and the data cannot be found by specifying line numbers (i.e. take the last integer of line 3 minus the last integer of line 2), because the Mismatched line may come first, like the following:
Data recording started:
0001100 Mismatched at 412090
0001101 Matched at 414798
0001102 Mismatched at 420007
0001103 Matched at 420015
Job completed

One way:
echo $((
$(grep -w Matched input | head -1 | sed 's/.*at //')
- $(grep -w Mismatched input | head -1 | sed 's/.*at //')
))
or using only sed:
echo $((
$(sed -n 's/.* Matched at //p' input | head -1)
- $(sed -n 's/.* Mismatched at //p' input | head -1)
))
Output
-2708

We can use grep -m 1 to avoid the extra head call.
dif=$((
$(grep -m 1 -w 'Matched' a.txt | sed 's/.*at \([0-9]*\).*/\1/')
- $(grep -m 1 -w 'Mismatched' a.txt | sed 's/.*at \([0-9]*\).*/\1/')
))
echo $dif
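If you prefer a single pass over the file, here is an awk sketch. It assumes, as in the sample, that the keyword is field 2 and the number is field 4, and that the file is called file.txt:
dif=$(awk '$2 == "Matched"    && m  == "" { m  = $4 }   # first Matched value
           $2 == "Mismatched" && mm == "" { mm = $4 }   # first Mismatched value
           END { print m - mm }' file.txt)              # Matched minus Mismatched
echo "$dif"
With the first sample input this also prints -2708.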

Related

Compare some specific columns of lines within a file using bash script

I want to compare the 2nd and 4th columns of every pair of lines in a file: line 1 against lines 2,3,4...N, then line 2 against lines 3,4,5...N, and so on.
I have written a script. It works, but it runs far too long, over 30 minutes.
The file has 1733 lines including the header, and my code is:
for line1 in {2..1733}; do
for line2 in $(seq $((line1+1)) 1733); do
i7_diff=$(cmp -bl \
<(sed -n "${line1}p" Lab_primers.tsv | cut -f 2) \
<(sed -n "${line2}p" Lab_primers.tsv | cut -f 2) | wc -l);
i5_diff=$(cmp -bl \
<(sed -n "${line1}p" Lab_primers.tsv | cut -f 4) \
<(sed -n "${line2}p" Lab_primers.tsv | cut -f 4) | wc -l);
if [ $i7_diff -lt 3 ]; then
if [ $i5_diff -lt 3 ]; then
echo -e "$(sed -n "${line1}p" Lab_primers.tsv)\n" >> primer_collision.txt
echo -e "$(sed -n "${line2}p" Lab_primers.tsv)\n\n" >> primer_collision.txt
fi;
fi;
done
done
I used nested for loops, then sed to print exactly line $line, then cut to extract the desired column. Finally, cmp and wc count the number of differing characters between the two columns of a pair of lines.
If the condition is met (the 2nd and the 4th columns of a pair of lines both differ in fewer than 3 positions), the code prints the pair of lines to the output file.
Here is an excerpt of the input (it has 1733 lines):
I7_Index_ID index I5_Index_ID index2 primer
D703 CGCTCATT D507 ACGTCCTG 27
D704 GAGATTCC D507 ACGTCCTG 28
D701 ATTACTCG D508 GTCAGTAC 29
S6779 CGCTAATC S6579 ACGTCATA 559
D708 TAATGCGC D503 AGGATAGG 44
D705 ATTCAGAA D504 TCAGAGCC 45
D706 GAATTCGT D504 TCAGAGCC 46
i796 ATATGCGC i585 AGGATAGC R100
D714 TGCTTGCT D510 AACCTCTC 102
D715 GGTGATGA D510 AACCTCTC 103
D716 AACCTACG D510 AACCTCTC 104
i787 TGCTTCCA i593 ATCGTCTC R35
Then the expected output is:
D703 CGCTCATT D507 ACGTCCTG 27
S6779 CGCTAATC S6579 ACGTCATA 559
D708 TAATGCGC D503 AGGATAGG 44
i796 ATATGCGC i585 AGGATAGC R100
D714 TGCTTGCT D510 AACCTCTC 102
i787 TGCTTCCA i593 ATCGTCTC R35
My question is: what is a better way to code this, and how can I reduce the running time?
Thank you for your help!
You could start by sorting on fields 2 and 4.
Then there is no need for the double loop: if a pair exists, the two lines will be adjacent.
sort -k 2,2 -k 4,4 myfile.txt
After that, we only need to print the bunches of consecutive lines that share the same fields 2 and 4.
first=yes
sort -k 2,2 -k 4,4 test.txt | while read l
do
fields=(${l})
new2=${fields[1]}
new4=${fields[3]} # Fields 2 and 4, bash-way
if [[ "$new2" = "$old2" ]] && [[ "$new4" = "$old4" ]]
then
if [[ $first ]]
then
# first time we print something for this series: we need
# to also print the previous line (the first of the series)
echo; echo "$oldl"
# But if the next line is identical (series of 3, no need to repeat this line)
first=
fi
echo "$l"
else
# This line is not identical to the previous. So nothing to print
# If the next one is identical to this one, then, this one will
# be the first of its series
first=yes
fi
old2=$new2
old4=$new4
oldl="${l}"
done
One frustrating thing: uniq -D almost does all the work this script does, except that it cannot restrict the comparison to specific fields in the middle of the line.
But we can rewrite the lines so that uniq can work on them.
I'm not very fluent in awk (if I were, I'm pretty sure awk could do the uniq work for me as well), but:
sort -k 2,2 -k 4,4 test.txt | awk '{print $0" "$2" "$4}' | uniq -D -f 5 | awk '{printf "%-12s %-9s %-12s %-9s %s\n",$1,$2,$3,$4,$5}'
does the job.
sort sorts the lines by fields 2 and 4. awk appends a copy of fields 2 and 4 to the end of each line, which makes uniq usable, since uniq can ignore the first N fields. Here we tell uniq to ignore the first 5 fields, so it compares only the copies of fields 2 and 4; with -D, uniq displays only the duplicated lines.
Then the final awk removes the copied fields we no longer need.
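If you would rather stay in awk for the whole job, here is a sketch along the same lines. Like the pipeline above, it looks for exact duplicates of fields 2 and 4 rather than applying the "fewer than 3 differences" test from the question:
awk '{
    key = $2 SUBSEP $4                 # group lines on fields 2 and 4
    count[key]++
    lines[key] = lines[key] $0 ORS     # remember every line of the group
}
END {
    for (k in count)
        if (count[k] > 1)              # print only groups that occur more than once
            printf "%s", lines[k]
}' Lab_primers.tsv
Unlike the sort | uniq pipeline it does not need the input to be sorted, but the groups come out in arbitrary order.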

Loop Script from Input File

I have a reference file with device names in it, for example WABEL8499IPM101. I'm using this script to take the base name (without the last 3 digits), look at the reference file, and see what is already used. If 101 is used it will create a file for me with 102 and 103 if I request 2 in total. I'm looking to use an input file to run it multiple times. I'm also trying to figure out how to start at 101 if no matching name is found when searching the reference file.
I would like to loop this using an input file instead of manually entering bash test.sh WABEL8499IPM 2 each time. I would like to be able to build an input file of all the names that need to be compared and then produce the output. It would also be nice if, when there isn't a match, it starts creating names at WABEL8499IPM101 instead of just WABEL8499IPM1.
Input file example:
ColumnA (BASE NAME) ColumnB (QUANTITY)
WABEL8499IPM 2
Script:
SRCFILE="~/Desktop/deviceinfo.csv"
LOGDIR="~/Desktop/"
LOGFILE="$LOGDIR/DeviceNames.csv"
# base name, such as "WABEL8499IPM"
device_name=$1
# quantity, such as "2"
quantityNum=$2
# the largest in sequence, such as "WABEL8499IPM108"
max_sequence_name=$(cat $SRCFILE | grep -o -e "$device_name[0-9]*" | sort --reverse | head -n 1)
# extract the last 3digit number (such as "108") from max_sequence_name
max_sequence_num=$(echo $max_sequence_name | rev | cut -c 1-3 | rev)
# create new sequence_name
# such as ["WABEL8499IPM109", "WABEL8499IPM110"]
array_new_sequence_name=()
for i in $(seq 1 $quantityNum);
do
cnum=$((max_sequence_num + i))
array_new_sequence_name+=($(echo $device_name$cnum))
done
#CODE FOR CREATING OUTPUT FILE HERE
#for fn in "${array_new_sequence_name[@]}"; do touch "$fn"; done;
# write log
for sqn in "${array_new_sequence_name[@]}";
do
echo $sqn >> $LOGFILE
done
Usage:
bash test.sh WABEL8499IPM 2
Result in the log file:
WABEL8499IPM109
WABEL8499IPM110
Just wrap a loop around your code instead of assuming the args come in on the command line.
SRCFILE="~/Desktop/deviceinfo.csv"
LOGDIR="~/Desktop/"
LOGFILE="$LOGDIR/DeviceNames.csv"
while read device_name quantityNum
do max_sequence_name=$( grep -o -e "$device_name[0-9]*" $SRCFILE |
sort --reverse | head -n 1)
max_sequence_num=${max_sequence_name: -3}
array_new_sequence_name=()
for i in $(seq 1 $quantityNum)
do cnum=$((max_sequence_num + i))
array_new_sequence_name+=("$device_name$cnum")
done
for sqn in "${array_new_sequence_name[@]}";
do echo $sqn >> $LOGFILE
done
done < input.file
I'd maybe pass the input file as the parameter now.
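One thing the question asks for that neither version handles: when the base name is not found in deviceinfo.csv at all, max_sequence_num ends up empty and the arithmetic starts the numbering at 1 (WABEL8499IPM1). A small guard inside the loop, right after max_sequence_num is set, would start it at 101 instead (this is my addition, not part of the original script):
# If the base name was not found, pretend the last used number was 100
# so that the first generated name ends in 101.
if [ -z "$max_sequence_num" ]; then
    max_sequence_num=100
fi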

Get the first real number from a series of files

I am trying to take the first number from each .dat file of the form:
5.01 1 56.413481000 -0.00063400 0.00095770
5.01 2 61.193808800 0.00102170 0.00078280
5.01 3 65.974136600 -0.00108170 0.00102620
5.01 4 70.754464300 0.00082490 0.00103630
and then use this number (5.01) as the title of a .png file.
I use a bash script and I know the command line=$(head -n 1 $f), as found in a question here, but that gives me the whole first line of the file $f.
In that case the spaces in the line are kept as well, and the .png file title becomes:
plot 5.01 1 56.413481000 -0.00063400 0.00095770.png
Is there some way to take only 5.01 and get a trimmed title for the plot?
Thanks to all.
I'd probably just do it with perl:
VAL=$( echo "$line" | perl -pe 's/^[^\d]+//g;s/[^\d\.].*$//' )
Something like that anyway.
Should remove:
anything that isn't a digit from the start of the line;
anything that isn't a digit or a ., plus everything after it, up to the end of the line.
Or with grep:
grep -o "[0-9]*\.[0-9]*" file.dat | head -1
Edit:
Testing without the head -1 on a one-line input:
echo " 5.01 2 61.193808800 0.00102170 0.00078280" | grep -o "[0-9]*\.[0-9]*"
5.01
61.193808800
0.00102170
0.00078280
Adding head -1 returns only the first match.
When we know the match will be on the first line, we can also ignore files with an incorrect first line (and avoid grepping through the complete file):
Make a two-headed monster:
head -1 file.dat | grep -o "[0-9]*\.[0-9]*" | head -1
To extract the first field, assuming they are tab separated:
val=$(head -n 1 $f | cut -f 1)
or, if they are space separated instead:
val=$(head -n 1 $f | cut -f 1 -d ' ')
OR you can avoid calling any extra processes and keep all data manipulation in the bash shell with
while read realNum restOfLine; do
break
done < "$f"
echo $realNum
This grabs the first "word" and puts the remainder of the line into "restOfLine".
The break ensures that you only read the first line of the file.
IHTH
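Tying it back to the original goal, here is a minimal sketch that loops over the .dat files and builds a trimmed title from the first field; the *.dat glob and the echo are placeholders, and the actual plotting command is not shown:
for f in *.dat; do
    val=$(head -n 1 "$f" | awk '{print $1}')   # first field of the first line, e.g. "5.01"
    title="plot ${val}.png"                    # trimmed title instead of the whole line
    echo "$title"
done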

Bash: Merging 4 lines in 4 files into one single file

I'm looking for a way to merge 4 lines of DNA probing results into one line.
The problem here is:
I don't want to append the lines but to interleave them.
The 4 lines of dna probing:
A----A----------A----A-A--AAAA-
-CC----CCCC-C-----CCC-C-------C
------G----G--G--G------G------
---TT--------T-T---------T-----
I need these to be 1 line, not just appended but intermixed without the dashes.
First characters of the result:
ACCTTAGCCCCGC...
This seems to be a fairly general problem, so the language chosen to solve it doesn't matter.
For fun: one bash way:
lines=(
A----A----------A----A-A--AAAA-
-CC----CCCC-C-----CCC-C-------C
------G----G--G--G------G------
---TT--------T-T---------T-----
)
result=""
for ((i=0;i<${#lines[0]};i++)) ;do
chr=- c=()
for ((l=0;l<${#lines[@]};l++)) ;do
[ "${lines[l]:i:1}" != "-" ] &&
chr="${lines[l]:i:1}" &&
c+=($l)
done
[ ${#c[@]} -eq 0 ] && printf 'Char #%d not replaced.\n' $i
[ ${#c[@]} -gt 1 ] && c="${c[*]}" && chr="*" &&
printf "Conflict at char #%d (lines: %s).\n" $i "${c// /, }"
result+=$chr
done
echo $result
With the provided input there is no conflict and every character is replaced, so the output is:
ACCTTAGCCCCGCTGTAGCCCACAGTAAAAC
Note: The question talks about 4 different files, so the lines= assignment could be:
lines=($(cat file1 file2 file3 file4))
But with faulty input:
lines=(
A----A---A-----A-----A-A--AAAA-
-CC----CCCC-C-----CCC-C-------C
------G----G---G-G------G------
---TT--------T-T---------T-----
)
output could be:
Conflict at char #9 (lines: 0, 1).
Char #14 not replaced.
Conflict at char #15 (lines: 0, 2, 3).
Char #16 not replaced.
and
echo $result
ACCTTAGCC*CGCT-*-GCCCACAGTAAAAC
Small perl filter
But if the input does not need to be verified, this little perl filter can do the job:
(Thanks @jm666 for the }{ syntax)
perl -nlE 'y+-+\0+;$,|=$_}{say$,' <(cat file1 file2 file3 file4)
where
-n process all lines without printing them
-l strip the trailing newline from each input line (and add it back on output)
y+lhs+rhs+ replace (translate) the chars in 'lhs' with those in 'rhs'
\0 is the *null* character, binary 0
$, is a variable (Perl's output field separator, used here only as an accumulator)
|= bitwise OR between the variable itself and the current line ($_)
}{ at END, once all lines are processed
Alternative way - not very efficient - but short:
file="./gene"
line1=$(head -1 "$file")
seq ${#line1} | xargs -n1 -I% cut -c% "$file" | paste -s - | tr -cd '[A-Z\n]'
prints:
ACCTTAGCCCCGCTGTAGCCCACAGTAAAAC
Assumption: each line has the same length.
Decomposition:
the line1=$(head -1 "$file") reads the 1st line into the variable line1
the seq ${#line1} generates a sequence of numbers 1..number_of_characters_in_line1, like
1
2
..
31
the xargs -n1 -I% cut -c% "$file" runs, for each of the above numbers, a cut command such as cut -c22 filename, which extracts the given character position from every line of the file, e.g. you will get output like:
A
-
-
-
-
C
-
-
# and so on
the paste -s - will join the above lines into one long line with the \t (tab) separator, like:
A - - - - C - - - C - - - - - T ... etc...
finally the tr -cd '[A-Z\n]' removes everything that isn't an uppercase letter or a newline, so you get the final
ACCTTAGCCCCGCTGTAGCCCACAGTAAAAC
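For comparison, an awk sketch of the same idea. It assumes the four files are named file1..file4 as above and that all lines have the same length, and unlike the bash version it does no conflict checking:
awk '{
    if (length($0) > len) len = length($0)   # remember the line length
    for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c != "-") out[i] = c             # keep the non-dash character for position i
    }
}
END {
    for (i = 1; i <= len; i++)
        printf "%s", (i in out ? out[i] : "-")
    print ""
}' file1 file2 file3 file4
With the original four lines it prints the same ACCTTAGCCCCGCTGTAGCCCACAGTAAAAC.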

How do I pick random unique lines from a text file in shell?

I have a text file with an unknown number of lines. I need to grab some of those lines at random, but I don't want there to be any risk of repeats.
I tried this:
jot -r 3 1 `wc -l<input.txt` | while read n; do
awk -v n=$n 'NR==n' input.txt
done
But this is ugly, and doesn't protect against repeats.
I also tried this:
awk -vmax=3 'rand() > 0.5 {print;count++} count>max {exit}' input.txt
But that obviously isn't the right approach either, as I'm not guaranteed even to get max lines.
I'm stuck. How do I do this?
This might work for you:
shuf -n3 file
shuf is one of GNU coreutils.
If you have Python accessible (change the 10 to what you'd like):
python -c 'import random, sys; print("".join(random.sample(sys.stdin.readlines(), 10)).rstrip("\n"))' < input.txt
(This will work in Python 2.x and 3.x.)
Also, (again change the 10 to the appropriate value):
sort -R input.txt | head -10
If jot is on your system, then I guess you're running FreeBSD or OSX rather than Linux, so you probably don't have tools like rl or sort -R available.
No worries. I had to do this a while ago. Try this instead:
$ printf 'one\ntwo\nthree\nfour\nfive\n' > input.txt
$ cat rndlines
#!/bin/sh
# default to 3 lines of output
lines="${1:-3}"
# default to "input.txt" as input file
input="${2:-input.txt}"
# First, put a random number at the beginning of each line.
while read line; do
printf '%8d%s\n' $(jot -r 1 1 99999999) "$line"
done < "$input" |
sort -n | # Next, sort by the random number.
sed 's/^.\{8\}//' | # Last, remove the number from the start of each line.
head -n "$lines" # Show our output
$ ./rndlines input.txt
two
one
five
$ ./rndlines input.txt
four
two
three
$
Here's a 1-line example that also inserts the random number a little more cleanly using awk:
$ printf 'one\ntwo\nthree\nfour\nfive\n' | awk 'BEGIN{srand()} {printf("%8d%s\n", rand()*10000000, $0)}' | sort -n | head -n 3 | cut -c9-
Note that different versions of sed (in FreeBSD and OSX) may require the -E option instead of -r to select the ERE instead of the BRE dialect in the regular expression, if you want to use that explicitly, though everything I've tested works with escaped bounds in BRE. (Ancient versions of sed (HP/UX, etc.) might not support this notation, but you'd only be using those if you already knew how to do this.)
This should do the trick, at least with bash and assuming your environment has the other commands available:
cat chk.c | while read x; do
echo $RANDOM:$x
done | sort -t: -k1 -n | tail -10 | sed 's/^[0-9]*://'
It basically outputs your file, placing a random number at the start of each line.
Then it sorts on that number, grabs the last 10 lines, and removes that number from them.
Hence, it gives you ten random lines from the file, with no repeats.
For example, here's a transcript of it running three times with that chk.c file:
====
pax$ testprog chk.c
} else {
}
newNode->next = NULL;
colm++;
====
pax$ testprog chk.c
}
arg++;
printf (" [%s] n", currNode->value);
free (tempNode->value);
====
pax$ testprog chk.c
char tagBuff[101];
}
return ERR_OTHER;
#define ERR_MEM 1
===
pax$ _
sort -Ru filename | head -5
will ensure no duplicates. Not all implementations of sort have the -R option.
To get N random lines from FILE with Perl:
perl -MList::Util=shuffle -e 'print shuffle <>' FILE | head -N
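With the question's own numbers (3 random lines from input.txt) that would be:
perl -MList::Util=shuffle -e 'print shuffle <>' input.txt | head -3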
Here's an answer using ruby if you don't want to install anything else:
cat filename | ruby -e 'puts ARGF.read.split("\n").uniq.shuffle.join("\n")'
for example, given a file (dups.txt) that looks like:
1 2
1 3
2
1 2
3
4
1 3
5
6
6
7
You might get the following output (or some permutation):
cat dups.txt| ruby -e 'puts ARGF.read.split("\n").uniq.shuffle.join("\n")'
4
6
5
1 2
2
3
7
1 3
Further example from the comments:
printf 'test\ntest1\ntest2\n' | ruby -e 'puts ARGF.read.split("\n").uniq.shuffle.join("\n")'
test1
test
test2
Of course if you have a file with repeated lines of test you'll get just one line:
printf 'test\ntest\ntest\n' | ruby -e 'puts ARGF.read.split("\n").uniq.shuffle.join("\n")'
test
