Extracting a word from the middle of a sentence - bash

I have a log file like this:
2013-07-10 21:40:54 [INFO] Janus_Mesca joined the game
2013-07-10 21:40:54 [INFO] Fenlig joined the game
2013-07-10 21:41:21 [INFO] BigRedHoodie joined the game
I'm trying to print whatever appears in between "[INFO]" and "joined".
With my attempts I've only been able to remove the two words themselves.
tail -500 $rfile | grep "INFO.*joined the game" | \
sed -e 's/\[INFO\]\(.*\)joined/\1/'
Can you help?

Pure grep version with lookahead/lookbehind.
P.S. Option -P might not be available everywhere, but I thought it was clever.
tail test.log | grep -Po '(?<=\[INFO\] ).*(?= joined .*)'

You're almost there. You just need to make the pattern match the entire line, and replace it with the name you've captured.
You can also eliminate the need for grep by using a lesser-known feature of sed: Use the -n flag to prevent it from printing each line by default, and add a p command to make it print the matching lines:
tail -n 500 "$rfile" | sed -n 's/.*INFO] \(.*\) joined .*/\1/p'
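A minimal sketch of the same sed approach, run against lines shaped like the ones in the question. The helper name extract_names and the non-matching third line are made up for illustration, and the input comes from a here-document rather than a real log file.

```shell
# Print whatever sits between "[INFO] " and " joined" on each matching line.
# extract_names is a hypothetical helper name, not part of any tool.
extract_names() {
  # -n suppresses default printing; the p flag prints only substituted lines.
  sed -n 's/.*\[INFO\] \(.*\) joined .*/\1/p'
}

# The third line is an invented non-matching entry and produces no output.
extract_names <<'EOF'
2013-07-10 21:40:54 [INFO] Janus_Mesca joined the game
2013-07-10 21:40:54 [INFO] Fenlig joined the game
2013-07-10 21:41:30 [INFO] Server is saving
EOF
```

In real use the function would sit at the end of the pipeline, for example tail -n 500 "$rfile" | extract_names.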

This is an awk answer:
awk -F" " '{print $4}' data
where data is the input file. Provided the delimiter is a space, the output is like:
Janus_Mesca
Fenlig
BigRedHoodie
If you want to stick more strictly to the between [INFO] and joined here's an alternative:
awk -F"\\[INFO\\] " '{ split( $2, arr, " joined" ); print arr[1] }' data
for which I had to check out this answer to find out how to escape the square brackets. If you want the leading and trailing spaces left in the user name, take them out of each respective pattern.

Related

How to grep only the first string in a line

I'm writing a script that checks a list of all the users connected to the server (using who) and writes to the file Information the list of usernames of only those having letters a, b, c or d. This is what I have so far:
who | grep '[a-d]' >> Information
However, the command who displays this:
username pts/148 2019-01-29 16:09 (IP address)
What I don't understand is why my grep search is also displaying the pts/148, date, time, and IP address. I just want it to send the username to the file Information.
Any help is appreciated.
Another way is to use the command cut to get the first part of the string only.
who | cut -f 1 -d ' ' | grep '[a-d]' >> Information
Using awk to output records where the first column matches [a-d]:
$ who | awk '$1~/[a-d]/' >> Information
Using grep to search for lines with [a-d] before the first space:
$ who | grep -o "^[^ ]*[a-d][^ ]*" >> Information
You need to get the first word, otherwise grep will display the entire line that has the matching text. You could use awk:
who | awk '{ if (substr($1,1,1) ~ /^[a-d]/ ) print $1 }' >>Information
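The answers above can be combined into one sketch that tests only the first field and prints only the first field. The helper name filter_users and the sample logins are invented for illustration; the sample only imitates the layout of who output.

```shell
# Keep only usernames (field 1) that contain at least one of a, b, c or d.
# filter_users is a hypothetical helper; in real use, feed it from `who`.
filter_users() {
  awk '$1 ~ /[a-d]/ { print $1 }'
}

# Made-up sample in the same layout as `who` output.
filter_users <<'EOF'
alice pts/1 2019-01-29 16:09 (10.0.0.1)
zoe pts/2 2019-01-29 16:10 (10.0.0.2)
bob pts/3 2019-01-29 16:11 (10.0.0.3)
EOF
```

A user with several sessions would be listed once per session, so something like who | filter_users | sort -u >> Information may be closer to what the question asks for.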

Remove certain characters or keywords from a TXT file in bash

I was wondering if there was a way to remove certain keywords from a text file, say I have a large file with lines saying
My name is John
My name is Peter
My name is Joe
Would there be a way to remove "My name is" without removing the entire line? Could this be done with grep somehow? I tried to find a solution but pretty much all of the ones I came across simply focus on deleting entire lines. Even if I could delete the text up until a certain column, that would fix my issue.
You need a text processing tool like sed or awk to do this, but not grep.
Try this:
sed 's/My name is//g' file
EDIT
Purpose of grep:
$ man grep | grep -A2 DESCRIPTION
DESCRIPTION
grep searches the named input FILEs (or standard input if no files are named, or if a single hyphen-minus (-) is given as file name) for lines containing a
match to the given PATTERN. By default, grep prints the matching lines.
With GNU grep:
grep -Po "My name is\K.*" file
Output with a leading white space:
John
Peter
Joe
-P: Interpret PATTERN as a Perl regular expression
-o: Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
\K: Remove matched part before \K.
Here is one more simple grep:
grep -o '[^ ]*$' Input_file
-o prints only the matched part of the line; the regex matches everything from the last space to the end of the line.
An awk solution that skips empty lines and prints the last field of each remaining line:
awk '!/^$/{print $NF}' file
John
Peter
Joe
Using cut:
cut -d' ' -f4 input_file
GNU cut features a complement option, used to remove the area specified with -f. If the input_file had surnames such as "My name is John Doe", the previous code would print "John", and this would print "John Doe":
cut --complement -d' ' -f1-3 input_file
The cut binary is also smaller than the other utilities, as a comparison of file sizes shows:
# these numbers will vary by *nix version and distro...
wc -c `which cut sed awk grep` | head -n -1 | sort -n
43224 /usr/bin/cut
109000 /bin/sed
215360 /bin/grep
662240 /usr/bin/awk
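For comparison with the sed and cut answers above, here is a sketch that strips a fixed leading phrase with shell parameter expansion only. The helper name strip_prefix is invented for illustration, and a read loop like this is likely to be slower than sed on a large file, so treat it as an illustration rather than a recommendation.

```shell
# Strip a fixed leading phrase from each line using only shell built-ins.
# strip_prefix is a hypothetical helper; the prefix is passed as $1.
strip_prefix() {
  prefix=$1
  while IFS= read -r line; do
    # ${line#"$prefix"} removes the prefix only when the line starts with it.
    printf '%s\n' "${line#"$prefix"}"
  done
}

strip_prefix 'My name is ' <<'EOF'
My name is John
My name is John Doe
EOF
```

Lines that do not start with the prefix pass through unchanged, and anything after the prefix (including a surname) is kept.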

How do I separate a link to get the end of a URL in shell?

I have some data that looks like this
"thumbnailUrl": "http://placehold.it/150/adf4e1"
I want to know how I can get the trailing part of the URL, I want the output to be
adf4e1
I was trying to grep when starting with / and ending with " but I'm only a beginner in shell scripting and need some help.
I came up with a quick and dirty solution, using grep (with the \w shorthand, which is a GNU extension) and cut:
$ cat file
"thumbnailUrl": "http://placehold.it/150/adf4e1"
"anotherUrl": "http://stackoverflow.com/questions/3979680"
"thumbnailUrl": "http://facebook.com/12f"
"randortag": "http://google.com/this/is/how/we/roll/3fk19as1"
$ cat file | grep -o '/\w*"$' | cut -d'/' -f2- | cut -d'"' -f1
adf4e1
3979680
12f
3fk19as1
We could kill this with a thousand little cuts, or just one blow from Awk:
awk -F'[/"]' '{ print $(NF-1); }'
Test:
$ echo '"thumbnailUrl": "http://placehold.it/150/adf4e1"' \
| awk -F'[/"]' '{ print $(NF-1); }'
adf4e1
Filter through Awk using double quotes and slashes as field separators. This means that the trailing part ../adf4e1" is separated as {..}</>{adf4e1}<">{} where curly braces denote fields and angle brackets denote separators. The Awk variable NF gives the number of fields (counted from 1), so $NF is the last field. That's not the one we want, because it is blank; we want $(NF-1), the second-to-last field.
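The field layout described above can be checked by printing every field Awk produces for the sample line. This is only a sketch; show_fields is a made-up helper name.

```shell
# Print every field awk sees when splitting on "/" or a double quote.
# show_fields is a hypothetical helper used only to illustrate the split.
show_fields() {
  awk -F'[/"]' '{ for (i = 1; i <= NF; i++) printf "%d:{%s}\n", i, $i }'
}

echo '"thumbnailUrl": "http://placehold.it/150/adf4e1"' | show_fields
```

The listing ends with an empty field (the text after the closing quote) preceded by the field holding adf4e1, which is why $(NF-1) is the one to print.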
"Golfed" version:
awk -F[/\"] '$0=$(NF-1)'
If the original string is coming from a larger JSON object, use something like jq to extract the value you want.
For example:
$ jq -n '{thumbnail: "http://placehold.it/150/adf4e1"}' |
> jq -r '.thumbnail|split("/")[-1]'
adf4e1
(The first command just generates a valid JSON object representing the original source of your data; the second command parses it and extracts the desired value. The split function splits the URL into an array, from which you only care about the last element.)
You can also do this purely in bash using string replacement and substring removal if you wrap your string in single quotes and assign it to a variable.
#!/bin/bash
string='"thumbnailUrl": "http://placehold.it/150/adf4e1"'
string="${string//\"}"
echo "${string##*/}"
adf4e1 #output
You can do that with the cut command on Linux: split the string on '/' and keep the last piece. Try it, it's fun!
See http://www.thegeekstuff.com/2013/06/cut-command-examples

Using sed/awk to limit/parse output of LDAP DN's

I have a large list of LDAP DN's that are all related in that they failed to import into my application. I need to query these against my back-end database based on a very specific portion of the CN, but I'm not entirely sure on how I can restrict down the strings to a very specific value that is not necessarily located in the same position every time.
Using the following bash command:
grep 'Failed to process entry' /var/log/tomcat6/catalina.out | awk '{print substr($0, index($0,$14))}'
I am able to return a list of DN's similar to: (sorry for the redacted nature, security dictates)
"cn=[Last Name] [Optional Middle Initial or Suffix] [First Name] [User name],ou=[value],ou=[value],o=[value],c=[value]".
The CN value can be confusing as the order of surname, given name, middle initial, prefix or suffix can be displayed in any order if the values even exist, but one thing does remain consistent, the username is always the last field in the cn (followed by a "," then the first of many potential OU's). I need to parse out that user name for querying, preferably into a comma separated list for easy copy and paste for use in a SQL IN() query or use in a bash script. So as an example, imagine the following short list of abbreviated DNs, only showing the CN value (since the rest of the DN is irrelevant):
"cn=Doe Jr. John john.doe,ou=...".
"cn=Doe A. Jane jane.a.doe,ou=...".
"cn=Smith Bob J bsmith,ou=...".
"cn=Powers Richard richard.powers1,ou=...".
I would like to have a csv list returned that looks like:
john.doe,jane.a.doe,bsmith,richard.powers1
Can a mix of awk and/or sed accomplish this?
sed -e 's/^"[^,]* \([^ ,]*\),.*/\1/'
will isolate the username, which is the last word of the common name. Follow up with
| tr '\n' , | sed -e 's/,$/\n/'
to convert the one-per-line username format into comma-separated form.
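As a sketch of the whole pipeline, the joining step can also be done with paste, which leaves no trailing comma to clean up. The helper name usernames_csv is invented for illustration, and the input is assumed to start each line with a double quote, as in the question's sample.

```shell
# Extract the last word of the CN, then join the results with commas.
# usernames_csv is a hypothetical helper; input lines are assumed to start
# with a double quote, as in the sample from the question.
usernames_csv() {
  sed -e 's/^"[^,]* \([^ ,]*\),.*/\1/' | paste -s -d, -
}

usernames_csv <<'EOF'
"cn=Doe Jr. John john.doe,ou=...".
"cn=Smith Bob J bsmith,ou=...".
EOF
```

paste -s merges all input lines into one, and -d, sets the comma as the joining character.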
Here is one quick and dirty way of doing it -
awk -v FS="[\"=,]" '{ print $3}' file | awk -v ORS="," '{print $NF}' | sed 's/,$//'
Test:
[jaypal:~/Temp] cat ff
"cn=Doe Jr. John john.doe,ou=...".
"cn=Doe A. Jane jane.a.doe,ou=...".
"cn=Smith Bob J bsmith,ou=...".
"cn=Powers Richard richard.powers1,ou=...".
[jaypal:~/Temp] awk -v FS="[\"=,]" '{ print $3}' ff | awk -v ORS="," '{print $NF}' | sed 's/,$//'
john.doe,jane.a.doe,bsmith,richard.powers1
OR
If you have gawk then
gawk '{ print gensub(/.* (.*[^,]),.*/,"\\1",1)}' filename | sed ':a;{N;s/\n/,/}; ba'
Test:
[jaypal:~/Temp] gawk '{ print gensub(/.* (.*[^,]),.*/,"\\1",1)}' ff | sed ':a;{N;s/\n/,/}; ba'
john.doe,jane.a.doe,bsmith,richard.powers1
Given a file "Document1.txt" containing
cn=Smith Jane batty.cow,ou=ou1_value,ou=oun_value,o=o_value,c=c_value
cn=Marley Bob reggae.boy,ou=ou1_value,ou=oun_value,o=o_value,c=c_value
cn=Clinton J Bill ex.president,ou=ou1_value,ou=oun_value,o=o_value,c=c_value
you can do a
cat Document1.txt | sed -n "s/^cn=.* \([A-Za-z0-9._]*\),ou=.*/\1/p"
which gets you
batty.cow
reggae.boy
ex.president
Using tr to translate the end-of-line character:
cat Document1.txt | sed -n "s/^cn=.* \([A-Za-z0-9._]*\),ou=.*/\1/p" | tr '\n' ','
produces
batty.cow,reggae.boy,ex.president,
You will need to deal with the trailing comma.
If you want the usernames in a database instead (Oracle, for example), you could use a script containing:
#!/bin/bash
doc=$1
cat "${doc}" | sed -n "s/^cn=.* \([A-Za-z0-9._]*\),ou=.*/\1/p" | while read -r username
do
sqlplus -s username/password@instance <<+++
insert into mytable (user_name) values ('${username}');
exit
+++
done
N.B.
The A-Za-z0-9._ in the sed expression is every type of character you expect in the username - you may need to play with that one.
Caveat: I didn't test the last bit with the database insert in it!
Perl regex solution that I consider more readable than the alternatives, in case you're interested:
perl -ne 'print "$1," if /(([[:alnum:]]|[[:punct:]])+),ou/' input.txt
Prints the string preceding 'ou', accepts alphanumeric and punctuation chars (but no spaces, so it stops at the username).
Output:
john.doe,jane.a.doe,bsmith,
It has been over a year since there has been an idea posted to this, but wanted a place to refer to in the future when this class of question comes up again. Also, I did not see a similar answer posted.
Of the pattern of data provided, my interpretation is that we can strip away everything after the first comma, leaving us with a true CN rather than a DN that starts with a CN.
In the CN, we strip everything before and including the last white space.
This will leave us with the username.
awk -F',' '/^cn=/{print $1}' ldapfile | awk '{print $NF}' >> usernames
Passing your ldap file to awk, with the field separator set to comma, and the match string set to cn= at the beginning of a line, we print everything up to the first comma. Then we pipe that output into an awk with the default field separator and print only the last field, resulting in just the username. We redirect and append this to a file in the current directory named usernames, and we end up with one username per line.
To convert this into a single comma separated line of usernames, we change the last print command to printf, leaving out the \n newline character, but adding a comma.
awk -F',' '/^cn=/{print $1}' ldapfile | awk '{printf "%s,", $NF}' >> usernames
This leaves the only line in the file with a trailing comma, but since it is only intended to be used for cut and paste, simply do not cut the last character. :)
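If the trailing comma is a nuisance, both steps can be folded into one awk call that writes the separator before every name except the first. This is a sketch under the same assumption as above (each line starts with cn=); the helper name cn_usernames and the shortened sample DNs are for illustration only.

```shell
# Take the text before the first comma, keep its last word, and join the
# words with commas without leaving one at the end.
# cn_usernames is a hypothetical helper; lines must start with "cn=".
cn_usernames() {
  awk -F',' '/^cn=/ {
    n = split($1, words, " ")        # words[n] is the last word of the CN
    printf "%s%s", sep, words[n]
    sep = ","                        # separator is empty before the first name
  }
  END { if (sep != "") print "" }'   # finish the line only if something printed
}

cn_usernames <<'EOF'
cn=Smith Jane batty.cow,ou=ou1_value,o=o_value
cn=Marley Bob reggae.boy,ou=ou1_value,o=o_value
EOF
```

Lines that do not begin with cn= are skipped, just as in the two-command version.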

Awk replace a column with its hash value

How can I replace a column with its hash value (like MD5) in awk or sed?
The original file is super huge, so I need this to be really efficient.
So, you don't really want to be doing this with awk. Any of the popular high-level scripting languages -- Perl, Python, Ruby, etc. -- would do this in a way that was simpler and more robust. Having said that, something like this will work.
Given input like this:
this is a test
(E.g., a row with four columns), we can replace a given column with its md5 checksum like this:
awk '{
tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
tmp | getline cksum
$2=cksum
print
}' < sample
This relies on GNU awk (you'll probably have this by default on a Linux system), and it uses openssl to generate the md5 checksum. We first build a shell command line in tmp to pass the selected column to the md5 command. Then we pipe the output into the cksum variable, and replace column 2 with the checksum. Given the sample input above, the output of this awk script would be:
this 7e1b6dbfa824d5d114e96981cededd00 a test
I copy pasted larsks's response, but I have added the close line, to avoid the problem indicated in this post: gawk / awk: piping date to getline *sometimes* won't work
awk '{
tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
tmp | getline cksum
close(tmp)
$2=cksum
print
}' < sample
This might work using Bash/GNU sed:
<<<"this is a test" sed -r 's/(\S+\s)(\S+)(.*)/echo "\1 $(md5sum <<<"\2") \3"/e;s/ - //'
this 7e1b6dbfa824d5d114e96981cededd00 a test
or a mostly sed solution:
<<<"this is a test" sed -r 'h;s/^\S+\s(\S+).*/md5sum <<<"\1"/e;G;s/^(\S+).*\n(\S+)\s\S+\s(.*)/\2 \1 \3/'
this 7e1b6dbfa824d5d114e96981cededd00 a test
Both replace is in this is a test with its md5sum.
Explanation:
In the first: identify the columns and use back references as parameters in the Bash command, which is substituted in and evaluated; then make cosmetic changes to drop the file description (in this case standard input, shown as -) generated by the md5sum command.
In the second: similar to the first, but save the input string in the hold space (h); then, after evaluating the md5sum command, append the hold space to the pattern space (the md5sum result) with G and use substitution to rearrange the pieces.
You can also do that with perl :
echo "aze qsd wxc" | perl -MDigest::MD5 -ne 'print "$1 ".Digest::MD5::md5_hex($2)." $3" if /([^ ]+) ([^ ]+) ([^ ]+)/'
aze 511e33b4b0fe4bf75aa3bbac63311e5a wxc
If you want to obfuscate a large amount of data, this might be faster than sed and awk, which need to fork an md5sum process for each line.
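If Perl is not an option, the per-line fork cost mentioned above can at least be reduced when values repeat, by caching each hash in an awk array. This is only a sketch: hash_col2 is a made-up helper name, it assumes md5sum is installed, and it assumes column 2 contains no shell metacharacters, because the value is pasted into a shell command line.

```shell
# Replace column 2 with its MD5, running md5sum once per distinct value.
# hash_col2 is a hypothetical helper. Assumption: column 2 holds no shell
# metacharacters, because the value is pasted into a shell command line.
hash_col2() {
  awk '{
    if (!($2 in cache)) {
      cmd = "echo " $2 " | md5sum"
      cmd | getline line             # md5sum prints the hash, then "  -"
      close(cmd)                     # release the pipe before the next value
      split(line, parts, " ")
      cache[$2] = parts[1]
    }
    $2 = cache[$2]
    print
  }'
}

printf 'this is a test\nthat is a test\n' | hash_col2
```

Both sample lines share the value is in column 2, so md5sum runs once. If every value is distinct, the cache gives no benefit and the cost is the same as in the answers above.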
You might have a better time with read than awk, though I haven't done any benchmarking.
the input (scratch001.txt):
foo|bar|foobar|baz|bang|bazbang
baz|bang|bazbang|foo|bar|foobar
transformed using read:
while IFS="|" read -r one fish twofish red fishy bluefishy; do
twofish=`echo -n $twofish | md5sum | tr -d " -"`
echo "$one|$fish|$twofish|$red|$fishy|$bluefishy"
done < scratch001.txt
produces the output:
foo|bar|3858f62230ac3c915f300c664312c63f|baz|bang|bazbang
baz|bang|19e737ea1f14d36fc0a85fbe0c3e76f9|foo|bar|foobar

Resources