print text between pipes - bash

I have a text file with ids as below
ref|XP_029641976.1|
ref|XP_014779594.1|
gb|KOF78315.1|
I wish to print only the text between the pipes. I've tried sed -n '/|/,/|/p' and also tried substituting, but neither works. The string in front of the first pipe varies. Any ideas?
Thank you

You can get the text between the pipes with a sed regexp like this (note that POSIX extended regexps have no lazy .*? quantifier; a plain greedy .* works here, because the leading .*\| can only stop at the first pipe if the trailing \| is still to match):
#!/bin/bash
echo "ref|XP_029641976.1|
ref|XP_014779594.1|
gb|KOF78315.1|" | sed -E 's:.*\|(.*)\|:\1:'

As others have suggested, awk is a good candidate for this:
awk -F'|' '{ print $2 }' file
Specify the field delimiter with -F and then print the second pipe delimited field with $2
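For example, feeding the sample ids from the question straight into awk (using printf here as a stand-in for the real file):

```shell
# Split each line on '|' and print the second field, i.e. the id between the pipes.
printf 'ref|XP_029641976.1|\nref|XP_014779594.1|\ngb|KOF78315.1|\n' |
  awk -F'|' '{ print $2 }'
```

which prints XP_029641976.1, XP_014779594.1 and KOF78315.1, one per line.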

Related

Duplicate first column of multiple text files in bash

I have multiple text files each containing two columns and I would like to duplicate the first column in each file in bash to have three columns in the end.
File:
sP100227 1
sP100267 1
sP100291 1
sP100493 1
Output file:
sP100227 sP100227 1
sP100267 sP100267 1
sP100291 sP100291 1
sP100493 sP100493 1
I tried:
txt=path/to/*.txt
echo "$(paste <(cut -f1-2 $txt) > "$txt"
Could you please try the following, written and tested with the shown samples in GNU awk. This will add the duplicated field only to those lines which have exactly 2 fields in them.
awk 'NF==2{$1=$1 OFS $1} 1' Input_file
In case you don't care about the number of fields and simply want the value of the 1st field twice, then try the following.
awk '{$1=$1 OFS $1} 1' Input_file
OR if you only have 2 fields in your Input_file, then we need not rewrite the complete line; we can simply print the fields as follows.
awk '{print $1,$1,$2}' Input_file
To save the output into the Input_file itself, append > temp && mv temp Input_file to the above solutions (after testing).
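As a quick check of the first command above, with two of the sample lines piped in via printf instead of a real Input_file:

```shell
# Duplicate the first field only on lines that have exactly 2 fields;
# the trailing 1 is an always-true pattern whose default action prints the line.
printf 'sP100227 1\nsP100267 1\n' | awk 'NF==2{$1=$1 OFS $1} 1'
```

Assigning to $1 makes awk rebuild the record using the output field separator (a space by default), which is why the duplicated field comes out space-separated.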
Use a temp file, with cut -f1 and paste, like so:
paste <(cut -f1 in_file) in_file > tmp_file
mv tmp_file in_file
Alternatively, use a Perl one-liner, like so:
perl -i.bak -lane 'print join "\t", $F[0], $_;' in_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in the -F option.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
The default delimiter in cut and paste is TAB, but your file looks to be space-separated.
You can't use the same file as input and output redirection, because when the shell opens the file for output it truncates it, so there's nothing for the program to read. Write to a new file and then rename it.
Your paste command is only being given one input file. And there's no need to use echo.
paste -d' ' <(cut -d' ' -f1 "$txt") "$txt" > "$txt.new" && mv "$txt.new" "$txt"
You can do this more easily using awk.
awk '{print $1, $0}' "$txt" > "$txt.new" && mv "$txt.new" "$txt"
GNU awk has an in-place extension, so you can use that if you like. See Save modifications in place with awk
Try sed -Ei 's/\s*(\S+)\s+/\1 \1 /' "$txt" if your fields are separated by runs of one or more whitespace characters. This uses the Stream Editor (sed) to replace the first run of non-space characters (\S+), together with the whitespace that follows it (\s+), with the same text repeated and intervening spaces (\1 \1 ); the rest of the line is kept. The -E tells sed to use extended pattern matching (+, ( vs. \(). The -i means do it in place, replacing the file with the output. Note that \s and \S are GNU sed extensions.
You could use awk and do awk '{ printf "%s %s\n",$1,$0 }'. This takes the first whitespace-delimited field ($1) and follows it with a space and the whole line ($0) followed by a newline. This is a little clearer than sed but it doesn't have the advantage of being in-place.
If you can guarantee they are delimited by only one space, with no leading spaces, you can use paste -d' ' <(cut -d' ' -f1 ${txt}) ${txt} > ${txt}.new; mv ${txt}.new ${txt}. The -d' ' sets the delimiter to space for both cut and paste. You know this but for others -f1 means extract the first -d-delimited field. The mv command replaces the input with the output.
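A quick sanity check of the sed variant above on a single sample line (this assumes GNU sed, since \s and \S are GNU extensions):

```shell
# Repeat the first non-space run, keeping the rest of the line.
printf 'sP100227 1\n' | sed -E 's/\s*(\S+)\s+/\1 \1 /'
```

which prints sP100227 sP100227 1.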

Awk strings enclosed in brackets

I'm trying to make a script to make reading logs easier. I'm having trouble extracting a string enclosed in brackets.
I want to extract the thread ID of a log which looks like this:
[CURRENT_DATE][THREAD_ID][PROCESS_NAME]Some random text here
I have tried this but it prints the CURRENT_DATE:
awk -F '[][]' '{print $2}'
If I use print $3 it prints the Some random text here part.
Is there any way that I could somehow read the string enclosed in brackets?
You may use this awk:
s='[CURRENT_DATE][THREAD_ID][PROCESS_NAME]Some random text here'
awk -F '\\]\\[' '{print $2}' <<< "$s"
THREAD_ID
-F '\\]\\[' makes the two-character text ][ the field delimiter; the brackets are backslash-escaped so awk treats them literally rather than as a regex bracket expression.
How about this? (Note that multi-character delimiters seem not to be available in the awk version the OP is using.) The substr strips the trailing ] from the third [-delimited field:
pattern='[CURRENT_DATE][THREAD_ID][PROCESS_NAME]Some random text here'
echo "$pattern"
awk -F '[' '{print substr($3, 1, length($3)-1)}' <<< "$pattern"
Different versions of awk behave in different ways. Without knowing what you're running, it's difficult to say why your existing code behaves as it does.
You already know that with a field separator of [][] or just [, you have an empty field at the beginning of each line. Instead, I'd try this:
awk -F']' '{gsub(/\[/,""); print $2}' input.log
This simply strips out the left-square-bracket and uses its fellow as your field delimiter. The advantage of using ] instead of [ is that it makes $1 your first field.
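For example, with the log line from the question:

```shell
# Delete every '[' from the record, then split on ']'; $2 is the thread id.
printf '[CURRENT_DATE][THREAD_ID][PROCESS_NAME]Some random text here\n' |
  awk -F']' '{ gsub(/\[/, ""); print $2 }'
```

which prints THREAD_ID. The gsub modifies $0, which makes awk re-split the record with the ] delimiter before $2 is read.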

Extract only a part of data from a file

My input is test.txt which contains data in this format:
'X'=>'ABCDEF',
'X'=>'XYZ',
'X'=>'GHIJKLMN',
I want to get something like:
'ABCDEF',
'XYZ',
'GHIJKLMN',
How do I go about this in bash?
Thanks!
If the input never contains the character > elsewhere than in the "fat arrow", you can use cut:
cut -f2 -d\> file
-d specifies the delimiter, here > (backslash needed to prevent the shell from interpreting it as the redirection operator)
-f specifies which field to extract
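Using a couple of the sample lines inline instead of the file:

```shell
# '>' is the delimiter; field 2 is everything after the fat arrow.
printf "'X'=>'ABCDEF',\n'X'=>'XYZ',\n" | cut -f2 -d\>
```

which prints 'ABCDEF', and 'XYZ',.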
Here's a solution using sed:
curl -sL https://git.io/fjeX4 | sed 's/^.*>//'
Sed is passed a single s/// command. ^.*> is a regex that matches any characters (.*) from the beginning of the line (^) up to the last '>'. The replacement is an empty string, so essentially sed is just deleting all the characters on the line up to the last >. As with the other solutions, this solution assumes that there is only one '>' on the line.
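The same substitution can be checked without the download by piping a sample line straight into sed:

```shell
# Delete everything from the start of the line up to the last '>'.
printf "'X'=>'ABCDEF',\n" | sed 's/^.*>//'
```

which prints 'ABCDEF',.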
If the data is really uniform, then you could just run cut (on example input):
$ curl -sL https://git.io/fjeX4 | cut -d '>' -f 2
'ABCDEF',
'XYZ',
'GHIJKLMN',
You can see flag explanations on explainshell.
With awk, it would look similar:
$ curl -sL https://git.io/fjeX4 | awk -F '>' '{ print $2 }'
'ABCDEF',
'XYZ',
'GHIJKLMN',
Using awk
awk 'BEGIN{FS="=>"}{print $2}' file
'ABCDEF',
'XYZ',
'GHIJKLMN',
FS in awk stands for field separator. The code inside BEGIN is executed only at the beginning, ie, before processing the first record. $2 prints the second field.
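For example, with two of the sample lines piped in:

```shell
# FS is set to the two-character string "=>" before any input is read.
printf "'X'=>'ABCDEF',\n'X'=>'XYZ',\n" | awk 'BEGIN{FS="=>"}{print $2}'
```

which prints 'ABCDEF', and 'XYZ',.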
A variation is to use the second field itself as a pattern guarding an explicit action:
awk 'BEGIN{FS="=>"}$2{print $2}' file
'ABCDEF',
'XYZ',
'GHIJKLMN',
The default action in awk for a bare pattern is to print the whole record, so a bare $2 on its own would echo the entire matching lines; here $2 is true whenever the second field is non-empty, and the explicit {print $2} action prints just that field.

unix bash - extract a line from file

need your help!!! I tried looking for this but to no avail.
How can I achieve the following using bash?
I've a flat file called "cube.mdl" that contains:
[...]
bla bla bla bla lots of lines above
Cube 8007841 "BILA_" MdcFile "BILA_CO_PM_MKT_BR_CUBE.mdc"
bla bla bla more lines below
[...]
I need to open that file, look for the word "MdcFile" and get the string that follows between quotes, which would be BILA_CO_PM_MKT_BR_CUBE.mdc
I know AWK or grep are powerful enough to do this in one line, but I couldn't find an example that could help me do it on my own.
Thanks in advance!
JMA
You can use:
grep -o -P "MdcFile.*" cube.mdl | awk -F\" '{ print $2 }'
This will use grep's regex to only return MdcFile and everything after it in the current line. Then, awk will use the " as a delimiter and print only the second word - which would be your "in-quotes" word(s), returned without the quotes of course.
The option -o, --only-matching tells grep to return only the matching text, and -P, --perl-regexp says the pattern is a Perl regex. Some versions of grep do not support these options; the OP's version is one of them, but the following appears to work for him instead:
grep "MdcFile.*" cube.mdl | awk -F\" '{ print $2 }'
grep MdcFile cube.mdl | awk '{print $5}'
would do it, assuming there's no spaces in any of those bits to throw off the position count.
This might do it.
sed -n '/MdcFile/ s/.*MdcFile "\([^"]\+\)".*/\1/p' INPUTFILE
(The brackets of the character class must not be backslash-escaped in a basic regex, and the substitution can carry the p flag directly instead of a second addressed p command.)
Use awk for the whole thing with " as record separator:
awk -v RS='"' '/MdcFile/ { getline; print }' cube.mdl
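Checked against the sample line from the question:

```shell
# With RS='"', the record after the one containing MdcFile is the file name.
printf 'Cube 8007841 "BILA_" MdcFile "BILA_CO_PM_MKT_BR_CUBE.mdc"\n' |
  awk -v RS='"' '/MdcFile/ { getline; print }'
```

which prints BILA_CO_PM_MKT_BR_CUBE.mdc. The getline call discards the record that matched and loads the following record (the quoted file name) into $0 before printing.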

How do I print a field from a pipe-separated file?

I have a file with fields separated by pipe characters and I want to print only the second field. This attempt fails:
$ cat file | awk -F| '{print $2}'
awk: syntax error near line 1
awk: bailing out near line 1
bash: {print $2}: command not found
Is there a way to do this?
Or just use one command:
cut -d '|' -f FIELDNUMBER
The key point here is that the pipe character (|) must be escaped to the shell. Use "\|" or "'|'" to protect it from shell interpretation and allow it to be passed to awk on the command line.
Reading the comments I see that the original poster presents a simplified version of the original problem which involved filtering file before selecting and printing the fields. A pass through grep was used and the result piped into awk for field selection. That accounts for the wholly unnecessary cat file that appears in the question (it replaces the grep <pattern> file).
Fine, that will work. However, awk is largely a pattern matching tool on its own, and can be trusted to find and work on the matching lines without needing to invoke grep. Use something like:
awk -F\| '/<pattern>/{print $2;}{next;}' file
The /<pattern>/ bit tells awk to perform the action that follows on lines that match <pattern>.
The lost-looking {next;} is a default action skipping to the next line in the input. It does not seem to be necessary, but I have this habit from long ago...
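For instance, on some made-up pipe-delimited lines (the data here is hypothetical, with foo standing in for <pattern>):

```shell
# Print field 2 only for lines matching the pattern foo.
printf 'foo|one|x\nbar|two|y\nfoo|three|z\n' |
  awk -F\| '/foo/{print $2;}{next;}'
```

which prints one and three, skipping the non-matching middle line.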
The pipe character needs to be escaped so that the shell doesn't interpret it. A simple solution:
$ awk -F\| '{print $2}' file
Another choice would be to quote the character:
$ awk -F'|' '{print $2}' file
Another way using awk
awk 'BEGIN { FS = "|" } ; { print $2 }'
Note that as written this command reads from standard input rather than from 'file', so it prints nothing until given input. You should either use 'cat file' to pipe the data in, or simply list the file after the awk program.
