Match a particular letter and print word after that using SED - bash

I have a file "Log.txt" which looks like this:
bla bla.. line1
bla bla.. line2
bla bla.. lineN
:000000 ... 239e670... A bla1.txt
:000000 ... 76fd777... M bla2.txt
:000000 ... e69de29... A bla3.txt
Let's say that I am looking for the letters 'A' and 'M'.
How would I look for them ONLY in the 4th field, i.e. only on lines whose 4th field is exactly one of these letters? I need to match "A" and "M" only and print the file name that follows. The final output should be:
A bla1.txt
M bla2.txt
A bla3.txt
I used awk to match the 4th column against A and M and print the next word, but I am not getting the expected output: I also get the extra "bla bla" lines.
Does anyone have an idea how to achieve this using sed?

awk for this:
awk '$4 ~ /^[AM]$/ { print $4, $5 }' Log.txt
sed for it:
sed -En 's/^([^ ]+ ){3}([AM] .*)/\2/p' Log.txt
Both of these confirm that the A or M is in the 4th field.

Awk can actually do the job; you just need to add a condition. Use single quotes so that the shell does not expand $4 and $5 before awk sees them:
awk '/ (A|M) /{print $4,$5}' Log.txt
As for sed, you can do this:
sed -nr '/ (A|M) /{s/.*((A|M)\s+.*)$/\1/;p}' Log.txt
I am not sure what your real data looks like, but you should be able to adjust the command to suit it.

Based on your input file and expected output, please try the following awk command:
awk '{if ($4 == "A" || $4 == "M") {print $4,$5}}' Log.txt
Output:
A bla1.txt
M bla2.txt
A bla3.txt

This might work for you (GNU sed):
sed 's/^\(\S*\s\)\{3\}\([AM]\s\)/\2/p;d' file
Match the fourth field to be A or M and if so, remove the first three fields and print the remainder.
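The answers above differ in how strictly they tie the letter to the 4th field. A minimal sketch of the strict variant, wrapped in a shell function; the name extract_am is hypothetical and only used for illustration, and fields are assumed to be separated by whitespace:

```shell
# extract_am FILE
# Hypothetical helper (the name is illustrative, not from the thread).
# Prints "<letter> <filename>" for every line whose 4th
# whitespace-separated field is exactly A or M.
extract_am() {
    awk '$4 == "A" || $4 == "M" { print $4, $5 }' "$1"
}
```

With the sample above, extract_am Log.txt should print the three expected lines. A line such as "bla A bla.. line" is skipped because its 4th field is "line", whereas a loose pattern like / (A|M) / would match it.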

Related

Extract the last three columns from a text file with awk

I have a .txt file like this:
ENST00000000442 64073050 64074640 64073208 64074651 ESRRA
ENST00000000233 127228399 127228552 ARF5
ENST00000003100 91763679 91763844 CYP51A1
I want to get only the last 3 columns of each line.
As you can see, there are sometimes empty lines between two lines, and these must be ignored. Here is the output that I want to produce:
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
awk  '/a/ {print $1- "\t" $-2 "\t" $-3}'  file.txt.
it does not return what I want. do you know how to correct the command?
The following awk command may help:
awk 'NF{print $(NF-2),$(NF-1),$NF}' OFS="\t" Input_file
Output will be as follows.
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
EDIT: Adding an explanation of the command. This is the same program spread over several lines with comments; the one-liner above is what you would normally run.
awk '
NF {                              # NF is the built-in number of fields in the current line; blank lines have NF == 0 and are skipped
    print $(NF-2), $(NF-1), $NF   # $(NF-2) is the third-last field, $(NF-1) the second-last and $NF the last
}
' OFS="\t" Input_file             # OFS (output field separator) is set to TAB; Input_file is the file to read
You can use sed too (GNU sed, assuming the fields are separated by tabs):
sed -E '/^$/d;s/.*\t([^\t]*\t[^\t]*\t[^\t]*)$/\1/' infile
With some piping:
$ cat file | tr -s '\n' | rev | cut -f 1-3 | rev
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
First, cat the file to tr to squeeze out repeated \ns and get rid of the empty lines. Then reverse the lines, cut the first three fields and reverse again. You could drop the useless cat by starting the pipeline with rev file instead.
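One case the answers above do not spell out is a line with fewer than three fields, where $(NF-2) would refer to $0 or to a negative field number. A minimal sketch that guards against it; the function name last3 is hypothetical, and whitespace-separated input is assumed:

```shell
# last3 FILE
# Hypothetical helper (the name is illustrative, not from the thread).
# Prints the last three whitespace-separated fields of each line,
# joined by tabs. Blank lines and lines with fewer than three fields
# are skipped, so $(NF-2) never points below field 1.
last3() {
    awk 'NF >= 3 { print $(NF-2), $(NF-1), $NF }' OFS='\t' "$1"
}
```

The only change from the one-liner above is the condition NF >= 3 in place of NF.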

AWK print if found three matches, one false

There are several lines in a file that look like:
A B C H
A B C D
and, I want to print all lines that contain this RE:
/A\tB/
But if the line contains an H in the fourth field, it should not be printed, so the output would be:
A B C D
Could this be written in one line in sed, awk or grep?
The only thing that I know is:
awk '/^A\tB/'
This will work:
awk '$1$2 == "AB" && $4 != "H"' file
If all entries are single characters this will also work:
awk '$1$2$3$4 ~ /^AB.[^H]/' file
With awk one-liner:
awk -F'\t' '$1=="A" && $2=="B" && $4!="H"' file
-F'\t' - tab char \t is treated as field separator
The output:
A B C D
This might work for you (GNU sed):
sed '/^A\tB\t.\t[^H]/!d' file
If a line does not start with A, B, any single character and a character other than H, separated by tabs, delete it.
Could be written:
sed -n '/^A\tB\t.\t[^H]/p' file
Use this:
awk '/^A\tB/ { if ( $4 != "H" ) print }' file
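The choice of field separator matters when a field can be empty. With awk's default whitespace splitting, a line such as A, B, an empty third field and H (tab-separated) has only three fields, so $4 is empty and the line would be printed. A minimal sketch that splits on tabs only; the function name keep_ab_without_h is hypothetical:

```shell
# keep_ab_without_h FILE
# Hypothetical helper (the name is illustrative, not from the thread).
# Splits each line on single tab characters, so empty fields keep
# their position, and prints lines whose first two fields are A and B
# and whose fourth field is not H.
keep_ab_without_h() {
    awk -F'\t' '$1 == "A" && $2 == "B" && $4 != "H"' "$1"
}
```

This assumes the file really is tab-separated; with space-separated data the -F option would have to be dropped.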

Concatenating characters on each field of CSV file

I am dealing with a CSV file which has the following form:
Dates;A;B;C;D;E
"1999-01-04";1391.12;3034.53;66.515625;86.2;441.39
"1999-01-05";1404.86;3072.41;66.3125;86.17;440.63
"1999-01-06";1435.12;3156.59;66.4375;86.32;441
Since the BLAS routine I need to implement on such data takes double-floats only, I guess the easiest way is to concatenate d0 at the end of each field, so that each line looks like:
"1999-01-04";1391.12d0;3034.53d0;66.515625d0;86.2d0;441.39d0
In pseudo-code, that would be:
For every line except the first line
For every field except the first field
Substitute ; with d0; and Substitute newline with d0 newline
I imagine it should be something like
cat file.csv | awk -F; 'NR>1 & NF>1'{print line} | sed 's/;/d0\n/g' | sed 's/\n/d0\n/g'
Any input?
Could use this sed
sed '1!{s/\(;[^;]*\)/\1d0/g}' file
Skips the first line then replaces each field beginning with ;(skipping the first) with itself and d0.
Output
Dates;A;B;C;D;E
"1999-01-04";1391.12d0;3034.53d0;66.515625d0;86.2d0;441.39d0
"1999-01-05";1404.86d0;3072.41d0;66.3125d0;86.17d0;440.63d0
"1999-01-06";1435.12d0;3156.59d0;66.4375d0;86.32d0;441d0
I would say:
$ awk 'BEGIN{FS=OFS=";"} NR>1 {for (i=2;i<=NF;i++) $i=$i"d0"} 1' file
Dates;A;B;C;D;E
"1999-01-04";1391.12d0;3034.53d0;66.515625d0;86.2d0;441.39d0
"1999-01-05";1404.86d0;3072.41d0;66.3125d0;86.17d0;440.63d0
"1999-01-06";1435.12d0;3156.59d0;66.4375d0;86.32d0;441d0
That is, set the field separator to ;. Starting on line 2, loop through all the fields from the 2nd one appending d0. Then, use 1 to print the line.
Your data format looks a bit odd. The double quotes around the first column make me think that it can contain the delimiter, the semicolon, itself. I don't know the application that produces the data, but if this is the case, you can use the following GNU awk command:
awk 'NR>1{for(i=2;i<=NF;i++){$i=$i"d0"}}1' OFS=\; FPAT='("[^"]+")|([^;]+)' file
The key here is the FPAT variable. With it you can define what a field looks like, instead of being limited to specifying a set of field delimiters.
big-prices.csv
Dates;A;B;C;D;E
"1999-01-04";1391.12;3034.53;66.515625;86.2;441.39
"1999-01-05";1404.86;3072.41;66.3125;86.17;440.63
"1999-01-06";1435.12;3156.59;66.4375;86.32;441
preprocess script
head -n 1 big-prices.csv 1>output.txt; \
tail -n +2 big-prices.csv | \
sed 's/;/d0;/g' | \
sed 's/$/d0/g' | \
sed 's/"d0/"/g' 1>>output.txt;
output.txt
Dates;A;B;C;D;E
"1999-01-04";1391.12d0;3034.53d0;66.515625d0;86.2d0;441.39d0
"1999-01-05";1404.86d0;3072.41d0;66.3125d0;86.17d0;440.63d0
"1999-01-06";1435.12d0;3156.59d0;66.4375d0;86.32d0;441d0
Note: the second sed would need a minor modification if the file has trailing whitespace at the end of lines.
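One way to handle the trailing-whitespace case is to strip it before appending the suffix. A minimal sketch in a single sed call; the function name append_d0 is hypothetical, and the first line is assumed to be the header:

```shell
# append_d0 FILE
# Hypothetical helper (the name is illustrative, not from the thread).
# Leaves line 1 (the header) untouched. On every other line it first
# removes trailing whitespace, then appends d0 to each field that
# follows a semicolon, so the first (date) field is not changed.
append_d0() {
    sed '1!{
        s/[[:space:]]*$//
        s/;[^;]*/&d0/g
    }' "$1"
}
```

Because [[:space:]] also matches a carriage return, this should cope with files that have Windows line endings as well, at the cost of silently dropping them.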
Using awk
Input
$ cat file
Dates;A;B;C;D;E
"1999-01-04";1391.12;3034.53;66.515625;86.2;441.39
"1999-01-05";1404.86;3072.41;66.3125;86.17;440.63
"1999-01-06";1435.12;3156.59;66.4375;86.32;441
gsub (any awk)
$ awk 'FNR>1{ gsub(/;[^;]*/,"&d0")}1' file
Dates;A;B;C;D;E
"1999-01-04";1391.12d0;3034.53d0;66.515625d0;86.2d0;441.39d0
"1999-01-05";1404.86d0;3072.41d0;66.3125d0;86.17d0;440.63d0
"1999-01-06";1435.12d0;3156.59d0;66.4375d0;86.32d0;441d0
gensub (gawk)
$ awk 'FNR>1{ print gensub(/(;[^;]*)/,"\\1d0","g"); next }1' file
Dates;A;B;C;D;E
"1999-01-04";1391.12d0;3034.53d0;66.515625d0;86.2d0;441.39d0
"1999-01-05";1404.86d0;3072.41d0;66.3125d0;86.17d0;440.63d0
"1999-01-06";1435.12d0;3156.59d0;66.4375d0;86.32d0;441d0

extract text between two strings bash sed excluding

I am trying to extract some text between two strings (which appear only once in the file).
Suppose the file is,
....Some Data
Your name is:
Dean/Winchester
You are male. Some data .....
I want to extract the text between 'Your name is:' and 'You are male.' both are unique and occur only once.
So, the output should be,
Dean/Winchester
I tried using sed,
sed -n 's/Your name is:\(.*\)You are male./\1/' abcd
But it doesn’t output anything.
Any help will be appreciated.
Thanks
$ sed -n '0,/Your name is/ d; /You are male/,$ d; /^$/d; p' abcd
Dean/Winchester
For variety, here is an awk solution:
$ awk '/Your name is/ {p=1; next} /You are male/ {exit} /^$/ {next} p==1 {print}' abcd
Dean/Winchester
$ sed -n -e '/^Your name is:/,/^You are male/{ /^Your name is:/d; /^You are male/d; p; }' test
Dean/Winchester
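The attempt in the question prints nothing for two reasons: sed processes one line at a time, so a pattern spanning "Your name is:" and "You are male." never matches when they are on different lines, and -n suppresses output unless a p command or flag is given. A minimal sketch of the flag-based approach with the markers passed as arguments; the function name between is hypothetical, and the markers are assumed to sit on their own lines and to be plain strings without backslashes:

```shell
# between FILE START STOP
# Hypothetical helper (the name is illustrative, not from the thread).
# Prints the non-blank lines that lie strictly between the first line
# containing START and the next line containing STOP. Both markers are
# matched as fixed strings with index(), not as regular expressions.
between() {
    awk -v start="$2" -v stop="$3" '
        inside && index($0, stop) { exit }        # end marker reached: stop reading
        inside && NF              { print }       # non-blank line inside the range
        index($0, start)          { inside = 1 }  # start marker seen: begin after it
    ' "$1"
}
```

Called as between abcd 'Your name is:' 'You are male.', this should print Dean/Winchester for the sample file. It does not handle the case where both markers and the wanted text are on a single line.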

Join lines based on pattern

I have the following file:
test
1
My
2
Hi
3
I need a way to use cat, grep or awk to produce the following output:
test1
My2
Hi3
How can I achieve this in a single command? Something like
cat file.txt | grep ... | awk ...
Note that it's always a string followed by a number in the original text file.
sed 'N;s/\n//' file.txt
This should give the desired output when the content is in file.txt
paste -d "" - - < filename
This takes consecutive lines and pastes them together delimited by the empty string.
awk '{printf("%s", $0);} !(NR%2){printf("\n");}' file.txt
EDIT: I just noticed that your question requires the use of cat and grep. Both of those programs are unnecessary to achieve your stated aims. If you have some reason for including them that you haven't mentioned, try this (uselessly inefficient) version of the line I wrote immediately above:
cat file.txt | grep '^' | awk '{printf("%s", $0);} !(NR%2){printf("\n");}'
It is possible that this command uses features not present in the original awk program. You may need to invoke the new awk program, nawk instead.
If your input file always alternates one string line and one number line, and you only want the strings, all you have to do is take every other line.
If you only want the odd lines (the strings), you can do awk 'NR % 2' file.txt
If you want the even lines (the numbers), this becomes awk 'NR % 2 == 0' file.txt
Here is the answer:
cat file.txt | awk 'BEGIN { lno = 0 } { if (lno % 2 == 1) {printf "%s\n", $0} else {printf "%s", $0}; ++lno}'
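The printf-based answers above leave the output without a final newline when the file has an odd number of lines. A minimal sketch that holds each odd line until its partner arrives and flushes a leftover line at the end; the function name join_pairs is hypothetical:

```shell
# join_pairs FILE
# Hypothetical helper (the name is illustrative, not from the thread).
# Joins line 1 with line 2, line 3 with line 4, and so on, with no
# separator. If the file has an odd number of lines, the last line is
# printed on its own instead of being left without a newline.
join_pairs() {
    awk '
        NR % 2 { held = $0; pending = 1; next }   # odd line: remember it
               { print held $0; pending = 0 }     # even line: emit the pair
        END    { if (pending) print held }        # unpaired last line
    ' "$1"
}
```

For the sample file this should give test1, My2 and Hi3 on separate lines.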
