Output matching lines in linux - bash

I want to match the numbers in the first file with the 2nd column of second file and get the matching lines in a separate output file. Kindly let me know what is wrong with the code?
I have a list of numbers in a file IDS.txt
10028615
1003
10096344
10100
10107393
10113978
10163178
118747520
I have a second File called src1src22.txt
From src:'1' To src:'22'
CHEMBL3549542 118747520
CHEMBL548732 44526300
CHEMBL1189709 11740251
CHEMBL405440 44297517
CHEMBL310280 10335685
expected newoutput.txt
CHEMBL3549542 118747520
I have written this code
while read line; do cat src1src22.txt | grep -i -w "$line" >> newoutput.txt done<IDS.txt

Your command line works - except you're missing a semicolon:
while read line; do grep -i -w "$line" src1src22.txt; done < IDS.txt >> newoutput.txt

A more efficient way to do this avoids the loop entirely. The -f option tells grep to read its patterns, one per line, from the file named right after it, and to search for them in the next file. This reduces the chance of the invalid-character-length errors that grep can raise inside the loop, and it avoids the slowdown of starting one grep per ID.
grep -iw -f IDS.txt src1src22.txt >> newoutput.txt
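The -f approach can be tried on the sample data from the question. This is a minimal sketch: the scratch directory is only for illustration, and the -F flag is an addition that assumes the IDs are plain numbers, so grep can treat them as fixed strings rather than regular expressions.

```shell
# Recreate the two sample files in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 10028615 1003 10096344 10100 10107393 10113978 10163178 118747520 > IDS.txt

cat > src1src22.txt <<'EOF'
From src:'1' To src:'22'
CHEMBL3549542 118747520
CHEMBL548732 44526300
CHEMBL1189709 11740251
CHEMBL405440 44297517
CHEMBL310280 10335685
EOF

# -f IDS.txt : read one pattern per line from IDS.txt
# -w         : match whole words only (1003 would not match inside 100344)
# -F         : treat each pattern as a fixed string (assumption: IDs are plain numbers)
grep -F -w -f IDS.txt src1src22.txt > newoutput.txt

cat newoutput.txt
```

Note that grep looks at the whole line, so an ID that appeared as a separate word in any column would also match.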

Try this -
awk 'NR==FNR{a[$2]=$1;next} $1 in a{print a[$1],$0}' f2 f1
CHEMBL3549542 118747520
Where f2 is src1src22.txt and f1 is IDS.txt.
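A variation on the same two-file awk idiom, sketched here under the assumption that only an exact match on the second column should count. It reads IDS.txt first and prints the lines of src1src22.txt unchanged. The shortened ID list and the scratch directory are just for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 10028615 1003 118747520 > IDS.txt

cat > src1src22.txt <<'EOF'
From src:'1' To src:'22'
CHEMBL3549542 118747520
CHEMBL548732 44526300
CHEMBL310280 10335685
EOF

# While reading the first file (NR==FNR), remember every ID as an array key.
# While reading the second file, print the line only if column 2 is one of those keys.
awk 'NR==FNR { ids[$1]; next } $2 in ids' IDS.txt src1src22.txt > newoutput.txt

cat newoutput.txt
```

Because the lookup is an exact string comparison on one column, partial matches and matches in other columns are not a concern.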

Related

How to get values in a line while looping line by line in a file (shell script)

I have a file which looks like this (file.txt)
{"key":"AJGUIGIDH568","rule":squid:111-some_random_text_here
{"key":"TJHJHJHDH568","rule":squid:111-some_random_text_here
{"key":"YUUUIGIDH566","rule":squid:111-some_random_text_here
{"key":"HJHHIGIDH568","rule":squid:111-some_random_text_here
{"key":"ATYUGUIDH556","rule":squid:111-some_random_text_here
{"key":"QfgUIGIDH568","rule":squid:111-some_random_text_here
I want to loop through this line by line and extract the key values.
So the result should look like this:
AJGUIGIDH568
TJHJHJHDH568
YUUUIGIDH566
HJHHIGIDH568
ATYUGUIDH556
QfgUIGIDH568
So I wrote code like this to loop line by line and extract the value between {"key":" and ","rule": because the key value sits between these two patterns.
while read p; do
echo $p | sed -n "/{"key":"/,/","rule":,/p"
done < file.txt
But this is not working. Can someone help me figure this out? Thanks in advance.
Your sample input is almost valid json. You could tweak it to make it valid and then extract the values with jq with something like:
sed -e 's/squid/"squid/' -e 's/$/"}/' file.txt | jq -r .key
Or, if your actual input really is valid json, then just use jq:
jq -r .key file.txt
If the "random-txt" may include double quotes, making it difficult to massage the input to make it valid json, perhaps you want something like:
awk '{print $4}' FS='"' file.txt
or
sed -n '/{"key":"\([^"]*\).*/s//\1/p' file.txt
or
while IFS=\" read open_brace key colon val _; do echo "$val"; done < file.txt
For the shown data, you can try this awk:
awk -F '"[:,]"' '{print $2}' file
AJGUIGIDH568
TJHJHJHDH568
YUUUIGIDH566
HJHHIGIDH568
ATYUGUIDH556
QfgUIGIDH568
With the given example you can simply use
cut -d'"' -f4 file.txt
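One thing to be aware of with cut: a line that contains no delimiter at all is printed unchanged unless -s is given. The sketch below assumes the file might contain such a line (the "some unrelated line" entry is made up for the example).

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > file.txt <<'EOF'
{"key":"AJGUIGIDH568","rule":squid:111-some_random_text_here
some unrelated line
{"key":"TJHJHJHDH568","rule":squid:111-some_random_text_here
EOF

# -d'"' splits each line on double quotes; field 4 is the key value.
# -s suppresses lines that contain no double quote at all.
cut -s -d'"' -f4 file.txt > keys.txt

cat keys.txt
```

Without -s, the unrelated line would appear in keys.txt as well.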
Assumptions:
there may be other lines in the file so we need to focus on just the lines with "key" and "rule"
the only text between "key" and "rule" is the desired string (eg, squid never shows up between the two patterns of interest)
Adding some additional lines:
$ cat file.txt
{"key":"AJGUIGIDH568","rule":squid:111-some_random_text_here
ignore this line}
{"key":"TJHJHJHDH568","rule":squid:111-some_random_text_here
ignore this line}
{"key":"YUUUIGIDH566","rule":squid:111-some_random_text_here
ignore this line}
{"key":"HJHHIGIDH568","rule":squid:111-some_random_text_here
ignore this line}
{"key":"ATYUGUIDH556","rule":squid:111-some_random_text_here
ignore this line}
{"key":"QfgUIGIDH568","rule":squid:111-some_random_text_here
ignore this line}
One sed idea:
$ sed -nE 's/^(.*"key":")([^"]*)(","rule".*)$/\2/p' file.txt
AJGUIGIDH568
TJHJHJHDH568
YUUUIGIDH566
HJHHIGIDH568
ATYUGUIDH556
QfgUIGIDH568
Where:
-E - enable extended regex support (and capture groups without need to escape sequences)
-n - suppress printing of pattern space
^(.*"key":") - [1st capture group] everything from start of line up to and including "key":"
([^"]*) - [2nd capture group] everything that is not a double quote (")
(","rule".*)$ - [3rd capture group] everything from ","rule" to end of line
\2/p - replace the line with the contents of the 2nd capture group and print

Trying to create a script that counts the length of all the reads in a fastq file but getting no return

I am trying to count the length of each read in a fastq file from Illumina sequencing and output this to a tsv or any other sort of file, so that I can later look at it and also count the number of reads per file. So I need to cycle down the file, extract each line that has a read on it (every 4th line), then get its length and store this as output.
num=2
for file in *.fastq
do
echo "counting $file"
function file_length(){
wc -l $file | awk '{print$FNR}'
}
for line in $file_length
do
awk 'NR==$num' $file | chrlen > ${file}read_length.tsv
num=$((num + 4))
done
done
Currently all I get is the "counting $file" message and no other output, but also no errors.
Your script contains a lot of errors in both syntax and algorithm. Please try shellcheck to see what the problems are. The main issue is the $file_length part. You may have wanted to call the function file_length() here, but $file_length is just an undefined variable, which expands to nothing in the for loop, so the loop body never runs.
If you just want to count the length of the 4th line of each *.fastq file, please try something like:
for file in *.fastq; do
awk 'NR==4 {print length}' "$file" > "${file}_length.tsv"
done
Or if you want to put the results together in a single tsv file, try:
tsvfile="read_length.tsv"
for file in *.fastq; do
echo -n -e "$file\t" >> "$tsvfile"
awk 'NR==4 {print length}' "$file" >> "$tsvfile"
done
Hope this helps.
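If the goal is the length of every read rather than a single line, the same awk idea extends with a modulo test. This is only a sketch: it assumes the usual four-line FASTQ record layout, with the sequence on line 2 of each record and no wrapped sequence lines, and sample.fastq is a made-up file for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical two-read FASTQ file, made up for this sketch.
cat > sample.fastq <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
ACGTA
+
IIIII
EOF

for file in *.fastq; do
    # NR % 4 == 2 selects lines 2, 6, 10, ... (the sequence line of each record).
    # Each output row is: running read number, a tab, the read length.
    awk 'NR % 4 == 2 { printf "%d\t%d\n", ++n, length($0) }' "$file" \
        > "${file%.fastq}_read_length.tsv"
done

cat sample_read_length.tsv
```

The number of rows in each tsv file is then also the number of reads in the corresponding fastq file.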

Alternating output in bash for loop from two grep

I'm trying to search through files and extract two pieces of relevant information every time they appear in the file. The code I currently have:
#!/bin/bash
echo "Utilized reads from ustacks output" > reads.txt
str1="utilized reads:"
str2="Parsing"
for file in /home/desaixmg/novogene/stacks/sample01/conda_ustacks.o*; do
reads=$(grep $str1 $file | cut -d ':' -f 3)
samples=$(grep $str2 $file | cut -d '/' -f 8)
echo $samples $reads >> reads.txt
done
It is doing each line for the file (the files have varying numbers of instances of these phrases) and gives me the output per row for each file:
PopA_15.fq 1081264
PopA_16.fq PopA_17.fq 1008416 554791
PopA_18.fq PopA_20.fq PopA_21.fq 604610 531227 595129
...
I want it to match each instance (i.e. the 1st instance of both greps next to each other):
PopA_15.fq 1081264
PopA_16.fq 1008416
PopA_17.fq 554791
PopA_18.fq 604610
PopA_20.fq 531227
PopA_21.fq 595129
...
How do I do this? Thank you
Assuming your Input_file looks like the sample shown, with an even number of columns on each line (the PopA names first, followed by the same number of numeric values), the following awk may help.
awk '{for(i=1;i<=(NF/2);i++){print $i,$((NF/2)+i)}}' Input_file
Output will be as follows.
PopA_15.fq 1081264
PopA_16.fq 1008416
PopA_17.fq 554791
PopA_18.fq 604610
PopA_20.fq 531227
PopA_21.fq 595129
If you want to pass the output of a command to awk instead, pipe it in, as in your_command | awk '...'; there is then no need to give Input_file to the awk command above.
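As a quick check of the piped form, the three sample rows from the question can be fed to awk with printf. The variable name half is introduced here only for readability.

```shell
# Sample rows copied from the question, piped straight into awk.
result=$(printf '%s\n' \
    'PopA_15.fq 1081264' \
    'PopA_16.fq PopA_17.fq 1008416 554791' \
    'PopA_18.fq PopA_20.fq PopA_21.fq 604610 531227 595129' |
    awk '{
        half = NF / 2                    # first half: names, second half: counts
        for (i = 1; i <= half; i++)
            print $i, $(half + i)        # pair the i-th name with the i-th count
    }')

printf '%s\n' "$result"
```

This relies on every row having as many counts as names; a row with an odd number of fields would leave the last name or count unpaired.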
This is what ended up working for me...any tips for more efficient code are definitely welcome
#!/bin/bash
echo "Utilized reads from ustacks output" > reads.txt
str1="utilized reads:"
str2="Parsing"
for file in /home/desaixmg/novogene/stacks/sample01/conda_ustacks.o*; do
reads=$(grep $str1 $file | cut -d ':' -f 3)
samples=$(grep $str2 $file | cut -d '/' -f 8)
paste <(echo "$samples" | column -t) <(echo "$reads" | column -t) >> reads.txt
done
This provides the desired output described above.
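The pairing itself is done by paste, which joins line N of its first input with line N of its second. A stripped-down sketch of just that step, using temporary files instead of process substitution so that it also runs in a plain POSIX shell (the file names are made up for the example):

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two values per list, taken from the sample output in the question.
printf '%s\n' PopA_16.fq PopA_17.fq > samples.txt
printf '%s\n' 1008416 554791 > reads.txt

# paste joins corresponding lines of the two files;
# -d' ' uses a space as the separator instead of the default tab.
paste -d' ' samples.txt reads.txt > pairs.txt

cat pairs.txt
```

If one list is longer than the other, paste still emits a row for the extra lines, with the missing side left empty.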

Reading recent entry from a file based on a key

Input file, fruits.txt:
JAN,APPLE
FEB,MANGO
JAN,ORANGE
MAR,APPLE
FEB,APPLE
Expected output file:
MAR,APPLE
FEB,APPLE
JAN,ORANGE
For getting the above output, below code is used:
#!/bin/bash
declare -A m_arr
cat fruits.txt > /tmp/ID.part
while read line
do
Month=$(echo $line | cut -d, -f1)
Fruits=$(echo $line | cut -d, -f2)
m_arr[${Month}]=${Fruits}
done < /tmp/ID.part
for i in ${!m_arr[@]}
do
echo "$i,${m_arr[$i]}"
done
This works fine for a small amount of input data. I have 200 000 entries and observed that the cut command is very slow. I tried awk as well and did not get a better result. My requirement is to read the file from row 1, using column 1 as the key. I need the most recent (updated) entry for each key.
I think this can be done pretty easily with Awk: you just need to store the value of $2 in an array keyed by $1, after splitting each line on the , separator
awk -v FS=, -v OFS=, '{key[$1]=$2; next}END{for (i in key) print i,key[i]}' file
Also, if you want to speed things up when processing a million-line file, you can change the locale settings for the run by passing LC_ALL=C locally to the command. See Stéphane Chazelas's answer on what "LC_ALL=C" does.
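Passing the setting "locally" means prefixing the command with the assignment, so that it applies to that one command only. A small sketch on the sample data; the array name last and the final sort are additions for the example (sort is there because the order of awk's for-in loop is unspecified).

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > fruits.txt <<'EOF'
JAN,APPLE
FEB,MANGO
JAN,ORANGE
MAR,APPLE
FEB,APPLE
EOF

# LC_ALL=C in front of a command sets the locale for that command only.
# Later rows overwrite earlier ones, so each key keeps its most recent value.
LC_ALL=C awk -F, -v OFS=, '
    { last[$1] = $2 }
    END { for (k in last) print k, last[k] }
' fruits.txt | LC_ALL=C sort > latest.txt

cat latest.txt
```

Whether the locale change makes a measurable difference depends on the awk implementation and the data, so it is worth timing both variants on the real file.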
In bash version 4, you can declare an associative array and populate it with the result of read, splitting your lines with a custom IFS:
$ declare -A a
$ while IFS=, read key value; do a["$key"]="$value"; done < fruits.txt
$ declare -p a
declare -A a=([MAR]="APPLE" [FEB]="APPLE" [JAN]="ORANGE" )
If you want to generate that specific output from the array, you'll also require a loop:
$ for key in "${!a[@]}"; do printf '%s,%s\n' "$key" "${a[$key]}"; done
MAR,APPLE
FEB,APPLE
JAN,ORANGE
The shortest one using GNU datamash:
datamash -st, -g1 last 2 <file
-g1 - group by the 1st column
last 2 - keep the last value of the group
The output:
FEB,APPLE
JAN,ORANGE
MAR,APPLE

UNIX : Deleting all the lines containing string & number

I have a TXT file with about 1 million lines.
#Test.txt
zs272
zs273
zs277
zs278
zs282
zs285
zs288
zs289
zs298
zs300
zs7
zsa
zsag
zsani179yukkie
zsani182zaide
zsaqgiw
zsb86581
zsbguepqtkcn
zscazn
zscfhlsv
zscgxadrwijl
zsclions111yuen
zscwqtk
zscz
zsder
zsdfdgdgg
I want to delete the lines that contain numbers, keeping only the purely alphabetic strings.
I tried:
grep -v '^[1-9]' Test.txt > 1_Test.txt
Couldn't get the desired result.
Expected output:
#1_Test.txt
zsa
zsag
zsbguepqtkcn
zscazn
zscfhlsv
zscgxadrwijl
zscwqtk
zscz
zsder
zsdfdgdgg
sed '/[0-9]/d' file
If you want to edit your file "in place" use sed's option -i.
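The -i option is an extension rather than part of POSIX sed, and GNU and BSD sed differ in how they accept it. Giving a backup suffix directly after -i, as in the sketch below, is a form both accept. The six sample lines are a subset of Test.txt from the question.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' zs272 zs7 zsa zsag zsani179yukkie zscz > Test.txt

# -i.bak edits Test.txt in place and keeps the original as Test.txt.bak.
sed -i.bak '/[0-9]/d' Test.txt

cat Test.txt
```

The backup file can be removed once the result has been checked.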
awk '!/[0-9]/' file
With bash:
while read -r line; do [[ ! $line =~ [0-9] ]] && printf "%s\n" "$line"; done < file
Just remove the start-of-line anchor ^.
The regex ^[1-9] only matches a digit from 1 to 9 at the very start of a line.
grep -v '[1-9]' Test.txt > 1_Test.txt
To make it work for all digits, including 0:
grep -v '[0-9]' Test.txt > 1_Test.txt
A solution in AWK!
awk '!/[0-9]+/{print}' file
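To see the effect of the anchor side by side, the sketch below runs the original and the corrected grep pattern on a few lines. Most are taken from Test.txt; the line 9lives is made up so that the anchored pattern has something to remove.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' zs300 zs7 zsa zsb86581 zscz 9lives > Test.txt

# Anchored: drops only lines that START with a digit 1-9.
grep -v '^[1-9]' Test.txt > anchored.txt

# Unanchored, full digit range: drops every line containing any digit.
grep -v '[0-9]' Test.txt > 1_Test.txt

cat 1_Test.txt
```

Here anchored.txt still contains zs300, zs7 and zsb86581, while 1_Test.txt keeps only zsa and zscz.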

Resources