Grep and returning only column of match - bash

If I want to search a file with a varying number of columns, like this:
ppl:apple age:5 F add:blabla love:dog
ppl:tom M add:blablaa love:cat
ppl:jay age:3 M love:apple
ppl:jenny acc:jen age:8 F add:blabla
...
The file is tab-separated, and the output I want is:
age:5
age:3
age:8
...
using grep age: will return the entire row, while
using cut -f2 will return some unwanted column:
age:5
M
age:3
acc:jen
and neither cut -f2 | grep age: nor grep age: | cut -f2 works.
My data may range from 11 to 23 columns.
Is there a simpler way to handle this using grep, sed, or awk?
Many thanks.

grep itself can do this, with no additional tools, by using the -o/--only-matching switch. You should be able to just do:
grep -o '\<age:[0-9]\+'
To explain the less common parts of the regex:
\< is a zero-width assertion that you're at the beginning of a word (that is, age is preceded by a non-word character or occurs at the beginning of the line, but it's not actually matching that non-word character); this prevents you from matching, say image:123. It doesn't technically require whitespace, so it would match :age: or the like; if that's a problem, match \t itself and use cut or tr to remove it later.
\+ means "match 1 or more occurrences of the preceding character class" (which is [0-9], so it matches one or more digits). \+ is equivalent to repeating the class twice, with the second copy followed by *, e.g. [0-9][0-9]*, except it's shorter, and some regex engines can optimize \+ better.
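To see the word-boundary behaviour in action, here is a minimal sketch. The function name extract_age and the sample rows are just for illustration, and it assumes GNU grep, since \< and \+ are GNU extensions to basic regular expressions.

```shell
# Hypothetical helper: print every age:N field, one per line.
extract_age() {
  grep -o '\<age:[0-9]\+' "$@"
}

# Demo on tab-separated sample rows. The image:9 field is skipped because
# "age" does not start a word there.
printf 'ppl:apple\tage:5\tF\tadd:blabla\nppl:tom\tM\timage:9\nppl:jay\tage:3\tM\n' | extract_age
```

The helper also accepts file names, e.g. extract_age file.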

You can use the script below:
awk '{for(i=1;i<=NF;i++){if($i ~ /^age:/)print $i}}' file

You can also use sed
sed -nr 's/^.*(age:[0-9]+).*$/\1/p' input_pattern.txt
where input_pattern.txt contains your data.

ShadowRanger's simple grep-based answer is probably the best choice.
A solution that works with both GNU sed and BSD/OSX sed:
sed -nE 's/^.*[[:blank:]](age:[0-9]+).*$/\1/p' file
With GNU sed you can simplify to:
sed -nr 's/^.*\t(age:[0-9]+).*$/\1/p' file
Both commands match the entire input line, if it contains an age: field of interest, replace it with that captured field (\1), and print the result; other lines are ignored.
Original answer, before the requirements were clarified:
Assuming that on lines where age: is present, it is always the 2nd tab-separated field, awk is the best solution:
awk '$2 ~ /^age:/ { print $2 }' file
$2 ~ /^age:/ only matches lines whose 2nd whitespace-separated field starts with literal age:
{ print $2 } simply prints that field.

Limit search for regexp to columns 11 to 23:
awk '{ for(i = 11; i <= 23; i++) { if ($i ~ /^age:/) print $i } }' file

Related

Ignore comma after backslash in a line in a text file using awk or sed

I have a text file containing several lines of the following format:
name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school
I need to parse the text file and print the output of fields ignoring the escaped commas. Here those will be fields 2 or 3 like this:
science, social
tennis, ping_pong, chess
I do not know how to ignore escaped characters. How can I do it with awk or sed in terminal?
Substitute \, with a character that your records do not contain normally (e.g. \n), and restore it before printing. For example:
$ awk -F',' 'NR>1{ if(gsub(/\\,/,"\n")) gsub(/\n/,",",$2); print $2 }' file
science,social
painting
Since first gsub is performed on the whole record (i.e $0), awk is forced to recompute fields. But the second one is performed on only second field (i.e $2), so it will not affect other fields. See: Changing Fields.
To be able to extract multiple fields with properly escaped commas you need to gsub \ns in all fields with a for loop as in the following example:
$ awk 'BEGIN{ FS=OFS="," } NR>1{ if(gsub(/\\,/,"\n")) for(i=1;i<=NF;++i) gsub(/\n/,"\\,",$i); print $2,$3 }' file
science\,social,football
painting,tennis\,ping_pong\,chess
See also: What's the most robust way to efficiently parse CSV using awk?.
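To get the output the question asks for (only the fields that contain escaped commas, with the backslashes dropped), the same placeholder idea can be sketched as follows. This is a sketch under assumptions: the input must never contain awk's SUBSEP character (\034 by default), which serves as the placeholder here, and escaped_lists is a hypothetical name.

```shell
# Hypothetical helper: print each field that contained an escaped comma.
escaped_lists() {
  awk -F',' '{
    gsub(/\\,/, SUBSEP)            # hide escaped commas; awk re-splits $0
    for (i = 1; i <= NF; i++)
      if (index($i, SUBSEP)) {     # this field originally held an escaped comma
        f = $i
        gsub(SUBSEP, ", ", f)      # restore the commas for display
        print f
      }
  }' "$@"
}

# Demo on the two sample records from the question.
printf '%s\n' 'john,science\,social,football,florence_school' \
              'james,painting,tennis\,ping_pong\,chess,highmount_school' | escaped_lists
```

Working on a copy (f) leaves the fields themselves untouched, so other columns can still be printed afterwards.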
You could replace the \, sequences by another character that won't appear in your text, split the text around the remaining commas then replace the chosen character by commas :
sed $'s/\\\,/\037/g' input | awk -F, '{ printf "Name: %s\nSubjects : %s\nSports: %s\nSchool: %s\n\n", $1, $2, $3, $4 }' | tr $'\037' ','
This uses the ASCII control character "Unit Separator" (octal 037, decimal 31), which I'm pretty sure your input won't contain.
Why awk and sed when bash with coreutils is just enough:
# Sorry my cat. Using `cat` as input pipe
cat <<EOF |
name,list_of_subjects,list_of_sports,school
Eg1: john,science\,social,football,florence_school
Eg2: james,painting,tennis\,ping_pong\,chess,highmount_school
EOF
# remove first line!
tail -n+2 |
# substitute `\,` by an unreadable character:
sed 's/\\\,/\xff/g' |
# read the comma separated list
while IFS=, read -r name list_of_subjects list_of_sports school; do
# read the \xff separated list into an array
IFS=$'\xff' read -r -d '' -a list_of_subjects < <(printf "%s" "$list_of_subjects")
# read the \xff separated list into an array
IFS=$'\xff' read -r -d '' -a list_of_sports < <(printf "%s" "$list_of_sports")
echo "list_of_subjects : ${list_of_subjects[@]}"
echo "list_of_sports : ${list_of_sports[@]}"
done
will output:
list_of_subjects : science social
list_of_sports : football
list_of_subjects : painting
list_of_sports : tennis ping_pong chess
Note that this will most probably be slower than a solution using awk.
Note that the principle of operation is the same as in the other answers: substitute the \, string with some other unique character and then use that character to iterate over the elements of the second and third fields.
This might work for you (GNU sed):
sed -E 's/\\,/\n/g;y/,\n/\n,/;s/^[^,]*$//Mg;s/\n//g;/^$/d' file
Replace quoted commas by newlines and then revert newlines to commas and commas to newlines. Remove all lines that do not contain a comma. Delete empty lines.
Using Perl. Change the \, to some control char say \x01 and then replace it again with ,
$ cat laxman.txt
john,science\,social,football,florence_school
james,painting,tennis\,ping_pong\,chess,highmount_school
$ perl -ne ' s/\\,/\x01/g and print ' laxman.txt | perl -F, -lane ' for(@F) { if( /\x01/ ) { s/\x01/,/g ; print } } '
science,social
tennis,ping_pong,chess
You can perhaps join columns with a function.
function joincol(col, i) {
$col=$col FS $(col+1)
for (i=col+1; i<NF; i++) {
$i=$(i+1)
}
NF--
}
This might get used thusly:
{
for (col=1; col<=NF; col++) {
if ($col ~ /\\$/) {
joincol(col)
}
}
}
Note that decrementing NF is undefined behaviour in POSIX. It may delete the last field, or it may not, and still be POSIX compliant. This works for me in BSDawk and Gawk. YMMV. May contain nuts.
Use gawk's FPAT:
awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print $3}' file
#list_of_sports
#football
#tennis\,ping_pong\,chess
then use gensub to remove the backslashes:
awk -v FPAT='(\\\\.|[^,\\\\]*)+' '{print gensub("\\\\", "", "g", $3)}' file
#list_of_sports
#football
#tennis,ping_pong,chess

awk: copy from A to B and output..?

My file is a bookmarks backup, backup-6.session.
The file consists of one very long line, and I need to copy all of the URLs (there are many) out of it. Here is an example of what is inside:
......"charset":"UTF-8","ID":3602197775,"docshellID":0,"originalURI":"https://www.youtube.com/watch?v=axxxxxxxxsxsx","docIdentifier":470,"structuredCloneState":"AAAAA.....
The result should go to the output file text.txt:
https://www.youtube.com/watch?v=axxxxxxxxsxsx
https://www.youtube.com/watch?v=bxxxxxxxxsxsx
https://www.youtube.com/watch?v=cxxxxxxxxsxsx
https://www.youtube.com/watch?v=dxxxxxxxxsxsx
....
....
Each URL starts after "originalURI":" and ends at the next ".
The command could be awk, sed, ... (I don't know which is the best command for this).
Thank you.
With GNU awk for multi-char RS and RT:
$ awk -v RS='"originalURI":"[^"]+' 'sub(/.*"/,"",RT){print RT}' file
https://www.youtube.com/watch?v=axxxxxxxxsxsx
You could also use grep, for example:
grep -oh "https://www\.youtube\.com/watch?v=[A-Za-z0-9]*" backup-6.session > text.txt
That is if the axxxxxxxxsxsx part contains only letters from A-Z, a-z or digits 0-9, and is not followed by any of those.
Notice the flags for grep:
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
-h, --no-filename
Suppress the prefixing of file names on output. This is the default
when there is only one file (or only standard input) to search.
The awk solution would be as follows:
awk -F, '{ for (i=1;i<=NF;i++) { if ( $i ~ "originalURI") { split($i,add,":");print gensub("\"","","g",add[2])":"gensub("\"","","g",add[3])} } }' filename
We loop through each comma-separated field and match it against "originalURI". We then split the matching field on ":" with the function split and remove the quotation marks with the function gensub (which is specific to GNU awk).
The sed solution would be as follows:
sed -rn 's/^.*originalURI":"(.*)","docIdentifier.*$/\1/p' filename
Run sed with extended regular expressions (-r) and suppress the automatic output (-n). Substitute the whole line with the part captured by the parenthesized group (\1) and print the result.
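Because the session file keeps everything on one long line, a substitution that matches the whole line can report at most one URL per line. A minimal sketch that lists every originalURI value instead, assuming the URLs contain no escaped double quotes (extract_uris is a hypothetical name):

```shell
# Hypothetical helper: print the value of every "originalURI" key.
extract_uris() {
  # -o prints each match on its own line, -h suppresses file name prefixes.
  # Splitting a match on double quotes puts the URL in the 4th field.
  grep -oh '"originalURI":"[^"]*"' "$@" | cut -d'"' -f4
}

# Demo with two entries on a single line.
printf '%s\n' '{"ID":1,"originalURI":"https://www.youtube.com/watch?v=axxxxxxxxsxsx","docIdentifier":470},{"originalURI":"https://www.youtube.com/watch?v=bxxxxxxxxsxsx","docIdentifier":471}' | extract_uris
```

For the file in the question this would be run as extract_uris backup-6.session > text.txt.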

How to grep the last occurrence of a line pattern

I have a file with contents
x
a
x
b
x
c
I want to grep the last occurrence,
x
c
when I try
sed -n "/x/,/b/p" file
it lists all the lines, beginning x to c.
I'm not sure if I got your question right, so here are some shots in the dark:
Print last occurrence of x (regex):
grep x file | tail -1
Alternatively:
tac file | grep -m1 x
Print file from first matching line to end:
awk '/x/{flag = 1}; flag' file
Print file from last matching line to end (prints all lines in case of no match):
tac file | awk '!flag; /x/{flag = 1};' | tac
grep -A 1 x file | tail -n 2
-A 1 tells grep to print one line after a match line
with tail you get the last two lines.
or in a reversed way:
tac file | grep -B 1 x -m1 | tac
Note: You should make sure your pattern is "strong" enough so it gets you the right lines. i.e. by enclosing it with ^ at the start and $ at the end.
This might work for you (GNU sed):
sed 'H;/x/h;$!d;x' file
Saves the last x and what follows in the hold space and prints it out at end-of-file.
not sure how to do it using sed, but you can try awk
awk '{a=a"\n"$0; if ($0 == "x"){ a=$0}} END{print a}' file
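A variant of the same buffering idea that takes the pattern as an argument and prints nothing when there is no match. This is only a sketch: last_block is a hypothetical name, and the pattern is treated as an awk regular expression.

```shell
# Hypothetical helper: print the last line matching the pattern
# together with every line that follows it.
last_block() {
  pat=$1; shift
  awk -v pat="$pat" '
    $0 ~ pat { buf = ""; seen = 1 }     # a new match restarts the buffer
    seen     { buf = buf $0 ORS }       # collect lines from the match onward
    END      { printf "%s", buf }
  ' "$@"
}

# Demo on the sample from the question.
printf 'x\na\nx\nb\nx\nc\n' | last_block x
```

Unlike the tac-based pipelines, this reads the input once and does not need GNU tac.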
POSIX vi (or ex or ed), in case it is useful to someone
Done in Command mode, of course
:set wrapscan
Go to the first line and just search Backwards!
1G?pattern
Slower way, without :set wrapscan
G$?pattern
Explanation:
G go to the last line
Move to the end of that line $
? search Backwards for pattern
The first backwards match will be the same as the last forward match
Either way, you may now delete all lines above current (match)
:1,.-1d
or
kd1G
You could also delete to the beginning of the matched line prior to the line deletions with d0 in case there were multiple matches on the same line.
POSIX awk, as suggested at
get last line from grep search on multiple files
awk '(FNR==1)&&s{print s; s=""}/PATTERN/{s=$0}END{if(s) print s}'
if you wanna do awk in truly hideous one-liner fashion but getting awk to resemble closer to functional programming paradigm syntax without having to keep track when the last occurrence is
mawk/mawk2/gawk 'BEGIN { FS = "=7713[0-9]+="; RS = "^$";
} END { print ar0[split($(0 * sub(/\n.+$/,"",$NF)), ar0, ORS)] }'
Here i'm employing multiple awk short-hands :
sub(/\n.+$/, "", $NF) # trimming all extra rows after pattern
g/sub() returns # of substitutions made, so multiplying that by 0 forces the split() to be splitting $0, the full file, instead.
split() returns # of items in the array (which is another way of saying the position of last element), so even though I've already trimmed out the trailing \n, i still can directly print ar0[split()], knowing that ORS will fill in the missing trailing \n.
That's why this code looks like i'm trying to extract array items before the array itself is defined, but due to flow of logic needed, the array will become defined by the time it reaches print.
Now if you want something simpler, these 2 also work
mawk/gawk 'BEGIN { FS="=7713[0-9]+="; RS = "^$"
} END { $NF = substr($NF, 1, index($NF, ORS));
FS = ORS; $0 = $0; print $(NF-1) }'
or
mawk/gawk '/=7713[0-9]+=/ { lst = $0 } END { print lst }'
I didn't use the same x|c requirements as OP just to showcase these work regardless of whether you need fixed-strings or regex based matches.
The above solutions only work on a single file. To print the last occurrence in each of many files (say, with the suffix .txt), use the following bash script:
#!/bin/bash
# Let the shell expand the glob directly; parsing ls output breaks on odd file names.
for fn in *.txt
do
result=$(grep 'pattern' "$fn" | tail -n 1)
echo "$result"
done
where 'pattern' is what you would like to grep for.

Split a big txt file to do grep - unix

I work (Unix, shell scripts) with text files that contain millions of fields separated by pipes, with no \n or \r between records.
something like this:
field1a|field2a|field3a|field4a|field5a|field6a|[...]|field1d|field2d|field3d|field4d|field5d|field6d|[...]|field1m|field2m|field3m|field4m|field5m|field6m|[...]|field1z|field2z|field3z|field4z|field5z|field6z|
All text is in the same line.
The number of fields is fixed for every file.
(in this example I have field1=name; field2=surname; field3=mobile phone; field4=email; field5=office phone; field6=skype)
When I need to find a field (e.g. field2), a command like grep does not help, because everything is on the same line.
I think a good solution would be a script that inserts a "\n" after every 6 fields and then runs grep. Am I right? Thank you very much!
With awk :
$ cat a
field1a|field2a|field3a|field4a|field5a|field6a|field1d|field2d|field3d|field4d|field5d|field6d|field1m|field2m|field3m|field4m|field5m|field6m|field1z|field2z|field3z|field4z|field5z|field6z|
$ awk -F"|" '{for (i=1;i<NF;i=i+6) {for (j=0; j<6; j++) printf "%s|", $(i+j); printf "\n"}}' a
field1a|field2a|field3a|field4a|field5a|field6a|
field1d|field2d|field3d|field4d|field5d|field6d|
field1m|field2m|field3m|field4m|field5m|field6m|
field1z|field2z|field3z|field4z|field5z|field6z|
Here you can easily set the length of line.
Hope this helps !
you can use sed to split the line in multiple lines:
sed 's/\(\([^|]*|\)\{6\}\)/\1\n/g' input.txt > output.txt
explanation:
we have to use heavy backslash-escaping of (){} which makes the code slightly unreadable.
but in short:
the term (([^|]*|){6}) (backslashes removed for readability) between s/ and /\1, will match:
[^|]* any character but '|', repeated multiple times
| followed by a '|'
the above is obviously one column and it is grouped together with enclosing parentheses ( and )
the entire group is repeated 6 times {6}
and this is again grouped together with enclosing parentheses ( and ), to form one full set
the rest of the term is easy to read:
replace the above (the entire dataset of 6 fields) with \1\n, the part between / and /g
\1 refers to the "first" group in the sed-expression (the "first" group that is started, so it's the entire dataset of 6 fields)
\n is the newline character
so replace the entire dataset of 6 fields by itself followed by a newline
and do so repeatedly (the trailing g)
you can use sed to convert every 6th | to a newline.
In my version of tcsh I can do:
sed 's/\(\([^|]\+|\)\{6\}\)/\1\n/g' filename
Consider this example, which uses a group size of 2 instead of 6:
> cat bla
a1|b2|c3|d4|
> sed 's/\(\([^|]\+|\)\{2\}\)/\1\n/g' bla
a1|b2|
c3|d4|
This is how the regex works:
[^|] is any non-| character.
[^|]\+ is a sequence of at least one non-| characters.
[^|]\+| is a sequence of at least one non-| characters followed by a |.
\([^|]\+|\) is a sequence of at least one non-| characters followed by a |, grouped together
\([^|]\+|\)\{6\} is 6 consecutive such groups.
\(\([^|]\+|\)\{6\}\) is 6 consecutive such groups, grouped together.
The replacement just takes this sequence of 6 groups and adds a newline to the end.
Here is how I would do it with awk
awk -v RS="|" '{printf "%s%s", $0, (NR%7?RS:"\n")}' file
field1a|field2a|field3a|field4a|field5a|field6a|[...]
field1d|field2d|field3d|field4d|field5d|field6d|[...]
field1m|field2m|field3m|field4m|field5m|field6m|[...]
field1z|field2z|field3z|field4z|field5z|field6z|
Just adjust the 7 in NR%7 to the number of fields you want per line.
What about printing the lines on blocks of six?
$ awk 'BEGIN{FS=OFS="|"} {for (i=1; i<NF; i+=6) {print $(i), $(i+1), $(i+2), $(i+3), $(i+4), $(i+5)}}' file
field1a|field2a|field3a|field4a|field5a|field6a
field1d|field2d|field3d|field4d|field5d|field6d
field1m|field2m|field3m|field4m|field5m|field6m
field1z|field2z|field3z|field4z|field5z|field6z
Explanation
BEGIN{FS=OFS="|"} sets the input and output field separators to |.
{for (i=1; i<NF; i+=6) {print $(i), $(i+1), $(i+2), $(i+3), $(i+4), $(i+5)}} loops through the fields in blocks of 6 and prints six of them each time. The loop stops at i<NF because the trailing | leaves one empty field at the end of the record. Since print ends every call with a newline, each block lands on its own line.
If you want to treat the file as if it had multiple lines, turn every | into \n so that each field sits on its own line. For example, to get the 2nd field, just do:
tr \| \\n < input-file | sed -n 2p
To see which fields match a regex (grep -n prints each field's position), do:
tr \| \\n < input-file | grep -n regex
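Putting the split-then-grep idea from the question together: a minimal sketch that rebuilds one record per line for any group size and then greps the result. The name split_records and the sample data are made up for illustration, and it assumes the whole file fits in one awk record.

```shell
# Hypothetical helper: split_records N [file...] prints N fields per line.
split_records() {
  n=$1; shift
  awk -F'|' -v n="$n" '{
    count = ($NF == "") ? NF - 1 : NF      # ignore the empty field after a trailing |
    for (i = 1; i <= count; i++)
      printf "%s%s", $i, ((i % n == 0 || i == count) ? "\n" : "|")
  }' "$@"
}

# Demo: two records of six fields on a single line, then a normal grep.
printf 'ann|lee|111|ann@x|222|ann.s|bob|ray|333|bob@x|444|bob.s|\n' |
  split_records 6 | grep 'ray'
```

To match on one particular field only, the grep can be swapped for something like awk -F'|' '$2 == "ray"'.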

Trim leading and trailing spaces from a string in awk

I'm trying to remove leading and trailing space in 2nd column of the below input.txt:
Name, Order  
Trim, working
cat,cat1
I have used the below awk to remove leading and trailing space in 2nd column but it is not working. What am I missing?
awk -F, '{$2=$2};1' input.txt
This gives the output as:
Name, Order  
Trim, working
cat,cat1
Leading and trailing spaces are not removed.
If you want to trim all spaces, only in lines that have a comma, and use awk, then the following will work for you:
awk -F, '/,/{gsub(/ /, "", $0); print} ' input.txt
If you only want to remove spaces in the second column, change the expression to
awk -F, '/,/{gsub(/ /, "", $2); print$1","$2} ' input.txt
Note that gsub substitutes the character in // with the second expression, in the variable that is the third parameter - and does so in-place - in other words, when it's done, the $0 (or $2) has been modified.
Full explanation:
-F, use comma as field separator
(so the thing before the first comma is $1, etc)
/,/ operate only on lines with a comma
(this means empty lines are skipped)
gsub(a,b,c) match the regular expression a, replace it with b,
and do all this with the contents of c
print$1","$2 print the contents of field 1, a comma, then field 2
input.txt use input.txt as the source of lines to process
EDIT I want to point out that @BMW's solution is better, as it actually trims only leading and trailing spaces with two successive gsub commands. Whilst giving credit I will give an explanation of how it works.
gsub(/^[ \t]+/,"",$2); - starting at the beginning (^) replace all (+ = one or more, greedy)
consecutive tabs and spaces with an empty string
gsub(/[ \t]+$/,"",$2)} - do the same, but now for all space up to the end of string ($)
1 - ="true". Shorthand for "use default action", which is print $0
- that is, print the entire (modified) line
remove leading and trailing white space in 2nd column
awk 'BEGIN{FS=OFS=","}{gsub(/^[ \t]+/,"",$2);gsub(/[ \t]+$/,"",$2)}1' input.txt
another way by one gsub:
awk 'BEGIN{FS=OFS=","} {gsub(/^[ \t]+|[ \t]+$/, "", $2)}1' infile
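The single-gsub form generalizes to any column. A minimal sketch, assuming comma-separated input with only spaces and tabs to strip (trim_field is a hypothetical name):

```shell
# Hypothetical helper: trim_field N [file...] trims leading and trailing
# spaces/tabs from comma-separated field N and leaves everything else alone.
trim_field() {
  col=$1; shift
  awk -v c="$col" 'BEGIN { FS = OFS = "," }
    c <= NF { gsub(/^[ \t]+|[ \t]+$/, "", $c) }   # strip both ends of field c
    { print }' "$@"
}

# Demo on the input from the question (the first line has trailing spaces).
printf 'Name, Order  \nTrim, working\ncat,cat1\n' | trim_field 2
```

Setting OFS to a comma matters here: modifying a field makes awk rebuild the line with OFS, which would otherwise be a space.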
Warning by @Geoff: see my note below, only one of the suggestions in this answer works (though on both columns).
I would use sed:
sed 's/, /,/' input.txt
This will remove one leading space after the comma.
Output:
Name,Order
Trim,working
cat,cat1
More general is the following, which removes an optional space or tab after every comma (\? matches at most one character; use * in its place to remove a whole run of spaces and tabs):
sed 's/,[ \t]\?/,/g' input.txt
It will also work with more than two columns because of the global modifier /g
@Floris asked in the discussion for a solution that removes leading and trailing whitespace in each column (even the first and last) while not removing whitespace in the middle of a column:
sed 's/[ \t]\?,[ \t]\?/,/g; s/^[ \t]\+//g; s/[ \t]\+$//g' input.txt
EDIT by @Geoff: I've appended the input file name to this one, and now it only removes all leading & trailing spaces (though from both columns). The other suggestions within this answer don't work. But try: " Multiple spaces , and 2 spaces before here "
IMO sed is the optimal tool for this job. However, here comes a solution with awk because you've asked for that:
awk -F', ' '{printf "%s,%s\n", $1, $2}' input.txt
Another simple solution that comes to mind, which removes all spaces, is tr -d:
cat input.txt | tr -d ' '
I just came across this. The correct answer is:
awk 'BEGIN{FS=OFS=","} {gsub(/^[[:space:]]+|[[:space:]]+$/,"",$2)} 1'
just use a regex as a separator:
', *' - for leading spaces
' *,' - for trailing spaces
for both leading and trailing:
awk -F' *,? *' '{print $1","$2}' input.txt
Simplest solution is probably to use tr
$ cat -A input
^I Name, ^IOrder $
Trim, working $
cat,cat1^I
$ tr -d '[:blank:]' < input | cat -A
Name,Order$
Trim,working$
cat,cat1
The following seems to work:
awk -F',[[:blank:]]*' '{$2=$2}1' OFS="," input.txt
If it is safe to assume only one set of spaces in column two (which is the original example):
awk '{print $1$2}' /tmp/input.txt
Adding another field, e.g. awk '{print $1$2$3}' /tmp/input.txt will catch two sets of spaces (up to three words in column two), and won't break if there are fewer.
If you have an indeterminate (large) number of space delimited words, I'd use one of the previous suggestions, otherwise this solution is the easiest you'll find using awk.

Resources