Using sed/awk to limit/parse output of LDAP DN's - bash

I have a large list of LDAP DN's that are all related in that they failed to import into my application. I need to query these against my back-end database based on a very specific portion of the CN, but I'm not entirely sure on how I can restrict down the strings to a very specific value that is not necessarily located in the same position every time.
Using the following bash command:
grep 'Failed to process entry' /var/log/tomcat6/catalina.out | awk '{print substr($0, index($0,$14))}'
I am able to return a list of DN's similar to: (sorry for the redacted nature, security dictates)
"cn=[Last Name] [Optional Middle Initial or Suffix] [First Name] [User name],ou=[value],ou=[value],o=[value],c=[value]".
The CN value can be confusing as the order of surname, given name, middle initial, prefix or suffix can be displayed in any order if the values even exist, but one thing does remain consistent, the username is always the last field in the cn (followed by a "," then the first of many potential OU's). I need to parse out that user name for querying, preferably into a comma separated list for easy copy and paste for use in a SQL IN() query or use in a bash script. So as an example, imagine the following short list of abbreviated DNs, only showing the CN value (since the rest of the DN is irrelevant):
"cn=Doe Jr. John john.doe,ou=...".
"cn=Doe A. Jane jane.a.doe,ou=...".
"cn=Smith Bob J bsmith,ou=...".
"cn=Powers Richard richard.powers1,ou=...".
I would like to have a csv list returned that looks like:
john.doe,jane.a.doe,bsmith,richard.powers1
Can a mix of awk and/or sed accomplish this?

sed -e 's/^"[^,]* \([^ ,]*\),.*/\1/'
will parse the username part of the common name and isolate the username. Follow up with
| tr '\n' , | sed -e 's/,$/\n/'
to convert the one-per-line username format into comma-separated form.
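As a quick check, the two steps can be run end to end on the sample lines from the question. This is only a sketch: the dn_list variable is a stand-in for the real log output, the anchor is written as ^" so that it matches the opening quote at the start of each line, and paste -sd, is swapped in for the tr/sed pair because it joins lines without leaving a trailing comma.

```shell
# Sample input (assumption): the four abbreviated DNs from the question, quotes included.
dn_list='"cn=Doe Jr. John john.doe,ou=...".
"cn=Doe A. Jane jane.a.doe,ou=...".
"cn=Smith Bob J bsmith,ou=...".
"cn=Powers Richard richard.powers1,ou=...".'

# Keep the last space-delimited word before the first comma,
# then join the resulting lines with commas.
usernames=$(printf '%s\n' "$dn_list" |
  sed -e 's/^"[^,]* \([^ ,]*\),.*/\1/' |
  paste -sd, -)

echo "$usernames"
```

If the real lines do not start with a double quote, drop the " after the ^ in the pattern.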

Here is one quick and dirty way of doing it -
awk -v FS="[\"=,]" '{ print $3}' file | awk -v ORS="," '{print $NF}' | sed 's/,$//'
Test:
[jaypal:~/Temp] cat ff
"cn=Doe Jr. John john.doe,ou=...".
"cn=Doe A. Jane jane.a.doe,ou=...".
"cn=Smith Bob J bsmith,ou=...".
"cn=Powers Richard richard.powers1,ou=...".
[jaypal:~/Temp] awk -v FS="[\"=,]" '{ print $3}' ff | awk -v ORS="," '{print $NF}' | sed 's/,$//'
john.doe,jane.a.doe,bsmith,richard.powers1
OR
If you have gawk then
gawk '{ print gensub(/.* (.*[^,]),.*/,"\\1",1)}' filename | sed ':a;{N;s/\n/,/}; ba'
Test:
[jaypal:~/Temp] gawk '{ print gensub(/.* (.*[^,]),.*/,"\\1",1)}' ff | sed ':a;{N;s/\n/,/}; ba'
john.doe,jane.a.doe,bsmith,richard.powers1

Given a file "Document1.txt" containing
cn=Smith Jane batty.cow,ou=ou1_value,ou=oun_value,o=o_value,c=c_value
cn=Marley Bob reggae.boy,ou=ou1_value,ou=oun_value,o=o_value,c=c_value
cn=Clinton J Bill ex.president,ou=ou1_value,ou=oun_value,o=o_value,c=c_value
you can do a
cat Document1.txt | sed -n "s/^cn=.* \([A-Za-z0-9._]*\),ou=.*/\1/p"
which gets you
batty.cow
reggae.boy
ex.president
Using tr to translate the end-of-line character
cat Document1.txt | sed -n "s/^cn=.* \([A-Za-z0-9._]*\),ou=.*/\1/p" | tr '\n' ','
produces
batty.cow,reggae.boy,ex.president,
you will need to deal with the last comma
but if you want it in a database say oracle for example, a script containing:
#!/bin/bash
doc=$1
sed -n "s/^cn=.* \([A-Za-z0-9._]*\),ou=.*/\1/p" "${doc}" | while read -r username
do
sqlplus -s username/password@instance <<+++
insert into mytable (user_name) values ('${username}');
exit
+++
done
N.B.
The A-Za-z0-9._ in the sed expression is every type of character you expect in the username - you may need to play with that one.
Caveat: I didn't test the last bit with the database insert in it!

Perl regex solution that I consider more readable than the alternatives, in case you're interested:
perl -ne 'print "$1," if /(([[:alnum:]]|[[:punct:]])+),ou/' input.txt
Prints the string preceding 'ou', accepts alphanumeric and punctuation chars (but no spaces, so it stops at the username).
Output:
john.doe,jane.a.doe,bsmith,

It has been over a year since there has been an idea posted to this, but wanted a place to refer to in the future when this class of question comes up again. Also, I did not see a similar answer posted.
Of the pattern of data provided, my interpretation is that we can strip away everything after the first comma, leaving us with a true CN rather than a DN that starts with a CN.
In the CN, we strip everything before and including the last white space.
This will leave us with the username.
awk -F',' '/^cn=/{print $1}' ldapfile | awk '{print $NF}' >> usernames
Passing your ldap file to awk, with the field separator set to comma, and the match string set to cn= at the beginning of a line, we print everything up to the first comma. Then we pipe that output into an awk with the default field separator and print only the last field, resulting in just the username. We redirect and append this to a file in the current directory named usernames, and we end up with one username per line.
To convert this into a single comma separated line of usernames, we change the last print command to printf, leaving out the \n newline character, but adding a comma.
awk -F',' '/^cn=/{print $1}' ldapfile | awk '{printf "%s,", $NF}' >> usernames
This leaves the only line in the file with a trailing comma, but since it is only intended to be used for cut and paste, simply do not cut the last character. :)
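Since the question mentions pasting the result into a SQL IN() query, the same two ideas (cut at the first comma, keep the last word) can be sketched in a single awk call that also quotes each name and avoids the trailing comma. The sample DNs and their ou/o/c values below are made up for illustration, and the in_list variable name is an assumption.

```shell
# Hypothetical sample data; real input would come from the LDAP file.
ldap_sample='cn=Doe Jr. John john.doe,ou=people,o=example,c=us
cn=Smith Bob J bsmith,ou=staff,ou=people,o=example,c=us'

in_list=$(printf '%s\n' "$ldap_sample" | awk -F',' '
  /^cn=/ {
    n = split($1, words, " ")               # words[n] is the last word of the CN
    list = list sep "\047" words[n] "\047"  # \047 is the single-quote character
    sep = ","                               # separator is empty for the first name only
  }
  END { print list }')

echo "SELECT * FROM users WHERE user_name IN ($in_list);"
```

The users table and user_name column in the echo line are placeholders, not names from the original post.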

Related

Remove certain characters or keywords from a TXT file in bash

I was wondering if there was a way to remove certain keywords from a text file, say I have a large file with lines saying
My name is John
My name is Peter
My name is Joe
Would there be a way to remove "My name is" without removing the entire line? Could this be done with grep somehow? I tried to find a solution but pretty much all of the ones I came across simply focus on deleting entire lines. Even if I could delete the text up until a certain column, that would fix my issue.
You need a text processing tool like sed or awk to do this, but not grep.
Try this:
sed 's/My name is//g' file
EDIT
Purpose of grep:
$ man grep | grep -A2 DESCRIPTION
DESCRIPTION
grep searches the named input FILEs (or standard input if no files are named, or if a single hyphen-minus (-) is given as file name) for lines containing a
match to the given PATTERN. By default, grep prints the matching lines.
With GNU grep:
grep -Po "My name is\K.*" file
Output with a leading white space:
John
Peter
Joe
-P: Interpret PATTERN as a Perl regular expression
-o: Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
\K: Remove matched part before \K.
Try one more simple grep:
grep -o '[^ ]*$' Input_file
-o prints only the matched part of the line; the regex matches everything from the last space to the end of the line.
An awk solution which skips empty lines and prints the last field of every other line.
awk '!/^$/{print $NF}' file
John
Peter
Joe
Using cut:
cut -d' ' -f4 input_file
GNU cut features a complement option, used to remove the area specified with -f. If the input_file had surnames such as "My name is John Doe", the previous code would print "John", and this would print "John Doe":
cut --complement -d' ' -f1-3 input_file
cut is also the smallest binary of these utilities (binary size is only a rough proxy for memory use):
# these numbers will vary by *nix version and distro...
wc -c `which cut sed awk grep` | head -n -1 | sort -n
43224 /usr/bin/cut
109000 /bin/sed
215360 /bin/grep
662240 /usr/bin/awk
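If no external tool is wanted at all, the same prefix removal can be sketched with the shell's own parameter expansion. This is a different technique from the answers above, shown here only as an illustration; the sample lines, including the two-word name, are assumptions.

```shell
# ${line#PREFIX} removes the shortest match of PREFIX from the start of the value.
names=$(printf '%s\n' 'My name is John' 'My name is Peter Parker' |
  while IFS= read -r line; do
    printf '%s\n' "${line#My name is }"
  done)

echo "$names"
```

Unlike cut -d' ' -f4, this keeps everything after the prefix, so multi-word names survive intact.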

How can I determine the number of fields in a CSV, from the shell?

I have a well-formed CSV file, which may or may not have a header line; and may or may not have quoted data. I want to determine the number of columns in it, using the shell.
Now, if I can be sure there are no quoted commas in the file, the following seems to work:
x=$(tail -1 00-45-19-tester-trace.csv | grep -o , | wc -l); echo $((x + 1))
but what if I can't make that assumption? That is, what if I can't assume a comma is always a field separator? How do I do it then?
If it helps, you're allowed to make the assumption of there being no quoted quotes (i.e. \"s within quoted strings); but better not to make that one either.
If you cannot make any optimistic assumptions about the data, then there won't be a simple solution in Bash. It's not trivial to parse a general CSV format with possible embedded newlines and embedded separators. You're better off not writing that in bash, but using an existing proper CSV parser. For example, Python has one built into its standard library.
If you can assume that there are no embedded newlines and no embedded separators, then it's simple to split on commas using awk:
awk -F, '{ print NF; exit }' input.csv
-F, tells awk to use comma as the field separator, and the automatic NF variable is the number of fields on the current line.
If you want to allow embedded separators, but you can assume no embedded double quotes, then you can eliminate the embedded separators with a simple filter, before piping to the same awk as earlier:
head -n 1 input.csv | sed -e 's/"[^"]*"//g' | awk ...
Note that both of these examples use the first line to decide the number of fields. If the input has a header line, this should work quite well, as the header should not contain embedded newlines.
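A minimal worked example of the filter-then-count idea, assuming a header line that contains quoted commas but no embedded double quotes or newlines. The header text and the field_count variable are invented for illustration.

```shell
# Hypothetical header with commas inside two of its three quoted/unquoted fields.
header='"last, first",age,"city, state"'

# Drop every quoted field (and the commas inside it), then count
# what is left by splitting on the remaining commas.
field_count=$(printf '%s\n' "$header" |
  sed -e 's/"[^"]*"//g' |
  awk -F, '{ print NF; exit }')

echo "$field_count"
```

A naive comma count on the same header would report 5 fields; removing the quoted parts first gives 3.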
Count the fields in the first row, then verify that all rows have the same number:
CNT=$(head -n1 hhdata.csv | awk -F ',' '{print NF}')
awk -F ',' '{print NF}' hhdata.csv | grep -vx "$CNT"
This doesn't cope with embedded commas, but it will highlight rows where they exist.
If the file has no double quotes, use the command below:
awk -F"," '{ print NF }' filename| sort -u
If every column in the file is enclosed in double quotes, use the command below:
awk -F, '{gsub(/"[^"]*"/,x);print NF}' filename | sort -u

Using sed to extract strings from a text file

I have text data in this form:
^Well/Well[ADV]+ADV ^John/John[N]+N ^has/have[V]+V+3sg+PRES ^a/a[ART]
^quite/quite[ADV]+ADV ^different/different[ADJ]+ADJ ^not/not[PART]
^necessarily/necessarily[ADV]+ADV ^more/more[ADV]+ADV
^elaborated/elaborate[V]+V+PPART ^theology/theology[N]+N *edu$
And I want it to be processed to this form:
Well John have a quite different not necessarily more elaborate theology
Basically, I need every string between the starting character / and the ending character [.
Here is what I tried, but I just get empty files...
#!/bin/bash
for file in probe/*.txt
do sed '///,/[/d' $file > $file.aa
mv $file.aa $file
done
awk to the rescue!
$ awk -F/ -v RS=^ -v ORS=' ' '{print $1}' file
Well John has a quite different not necessarily more elaborated theology
Explanation: set the record separator (RS) to ^ to separate the logical groups, set the field separator (FS) to /, and print the first field. Finally, setting the output record separator (ORS) to a space (instead of the default newline) keeps the extracted fields on the same line. Note that the first field is the word before the /, so this prints the surface forms (has, elaborated) rather than the lemmas shown in the desired output.
With GNU grep and Perl compatible regular expressions (-P):
$ echo $(grep -Po '(?<=/)[^[]*' infile)
Well John have a quite different not necessarily more elaborate theology
-o retains just the matches, (?<=/) is a positive look-behind ("make sure there is a /, but don't include it in the match"), and [^[]* is "a sequence of characters other than [".
grep -Po prints one match per line; by using the output of grep as arguments to echo, we convert the newlines into spaces (could also be done by piping to tr '\n' ' ').
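Where grep -P is not available, a similar result can be sketched without the look-behind: match the slash together with the lemma, delete the slash afterwards, and join the lines. This swaps in plain grep -o, tr and paste for the PCRE version above; the tagged variable holds the first line of the question's sample.

```shell
# First line of the sample data from the question.
tagged='^Well/Well[ADV]+ADV ^John/John[N]+N ^has/have[V]+V+3sg+PRES ^a/a[ART]'

# grep -o emits "/lemma" once per token; tr strips the slash;
# paste joins the one-per-line matches with single spaces.
lemmas=$(printf '%s\n' "$tagged" |
  grep -o '/[^[]*' |
  tr -d '/' |
  paste -sd' ' -)

echo "$lemmas"
```

Deleting every slash with tr is safe here only because the lemmas themselves contain no slashes.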
cat file|grep -oE "\/[^\[]*\[" |sed -e 's#^/##' -e 's/\[$//' | tr -s "\n" " "

bash, text file remove all text in each line before the last space

I have a file with a format like this:
First Last UID
First Middle Last UID
Basically, some names have middle names (and sometimes more than one middle name). I just want a file that only has UIDs.
Is there a sed or awk command I can run that removes everything before the last space?
awk
Print the last field of each line using awk.
The last field is indexed using the NF variable which contains the number of fields for each line. We index it using a dollar sign, the resulting one-liner is easy.
awk '{ print $NF }' file
rs, cat & tail
Another way is to transpose the content of the file, then grab the last line and transpose again (this is fairly easy to see).
The resulting pipe is:
cat file | rs -T | tail -n1 | rs -T
cut & rev
Using cut and rev we could also achieve this goal by reversing the lines, cutting the first field and then reverse it again.
rev file | cut -d ' ' -f1 | rev
sed
Using sed we simply remove all chars until a space is found with the regex ^.* [^ ]*$. This regex means match the beginning of the line ^, followed by any sequence of chars .* and a space . The rest is a sequence of non spaces [^ ]* until the end of the line $. The sed one-liner is:
sed 's/^.* \([^ ]*\)$/\1/' file
Where we capture the last part (in between \( and \)) and sub it back in for the entire line. \1 means the first group caught, which is the last field.
Notes
As Ed Norton cleverly pointed out we could simply not catch the group and remove the former part of the regex. This can be as easily achieved as
sed 's/.* //' file
Which is remarkably less complicated and more elegant.
For more information see man sed and man awk.
Using grep:
$ grep -o '[^[:blank:]]*$' file
UID
UID
-o tells grep to print only the matching part. The regex [^[:blank:]]*$ matches the last word on the line.
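A small side-by-side check of the awk and sed variants, on made-up names with zero, one and two middle names. The u1001-style UIDs and the variable names are assumptions for the example.

```shell
# Hypothetical input: names with a varying number of middle names.
names='First Last u1001
First Middle Last u1002
First Middle Other Last u1003'

via_awk=$(printf '%s\n' "$names" | awk '{ print $NF }')   # last whitespace-separated field
via_sed=$(printf '%s\n' "$names" | sed 's/.* //')        # delete through the last space

echo "$via_awk"
```

The two agree on clean input. They differ when a line has trailing spaces: awk still prints the last word, while sed 's/.* //' deletes the whole line's content because the last space is at the end.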

Unix cut: Print same Field twice

Say I have file - a.csv
ram,33,professional,doc
shaym,23,salaried,eng
Now I need this output (pls dont ask me why)
ram,doc,doc,
shaym,eng,eng,
I am using cut command
cut -d',' -f1,4,4 a.csv
But the output remains
ram,doc
shaym,eng
That means cut can only print a Field just one time. I need to print the same field twice or n times.
Why do I need this ? (Optional to read)
Ah. It's a long story. I have a file like this
#,#,-,-
#,#,#,#,#,#,#,-
#,#,#,-
I have to convert this to
#,#,-,-,-,-,-
#,#,#,#,#,#,#,-
#,#,#,-,-,-,-
Here each '#' and '-' refers to different numerical data. Thanks.
You can't print the same field twice. cut prints a selection of fields (or characters or bytes) in order. See Combining 2 different cut outputs in a single command? and Reorder fields/characters with cut command for some very similar requests.
The right tool to use here is awk, if your CSV doesn't have quotes around fields.
awk -F , -v OFS=, '{print $1, $4, $4}'
If you don't want to use awk (why? what strange system has cut and sed but no awk?), you can use sed (still assuming that your CSV doesn't have quotes around fields). Match the first four comma-separated fields and select the ones you want in the order you want.
sed -e 's/^\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\)/\1,\4,\4/'
$ sed 's/,.*,/,/; s/\(,.*\)/\1\1,/' a.csv
ram,doc,doc,
shaym,eng,eng,
What this does:
Replace everything between the first and last comma with just a comma
Repeat the last ",something" part and tack on a comma. Voilà!
Assumptions made:
You want the first field, then twice the last field
No escaped commas within the first and last fields
Why do you need exactly this output? :-)
using perl:
perl -F, -ane 'chomp($F[3]);$a=$F[0].",".$F[3].",".$F[3];print $a."\n"' your_file
using sed:
sed 's/\([^,]*\),.*,\(.*\)/\1,\2,\2/g' your_file
As others have noted, cut doesn't support field repetition.
You can combine cut and sed, for example if the repeated element is at the end:
< a.csv cut -d, -f1,4 | sed 's/,[^,]*$/&&,/'
Output:
ram,doc,doc,
shaym,eng,eng,
Edit
To make the repetition variable, you could do something like this (assuming you have coreutils available):
n=10
rep=$(seq $n | sed 's:.*:\&:' | tr -d '\n')
< a.csv cut -d, -f1,4 | sed 's/,[^,]*$/'"$rep"',/'
Output:
ram,doc,doc,doc,doc,doc,doc,doc,doc,doc,doc,
shaym,eng,eng,eng,eng,eng,eng,eng,eng,eng,eng,
I had the same problem, but instead of adding all the columns to awk, I just used (to duplicate the 2nd column):
awk -F'\t' -v OFS='\t' '$2=$2"\t"$2' # for tab-delimited files
For CSVs you can just use
awk -F , -v OFS=, '$2=$2","$2'
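For an arbitrary number of repetitions, the awk approach can be wrapped in a small loop. This is only a sketch: repeat_field is a hypothetical helper name, it assumes unquoted CSV as in the answers above, and it prints the first field followed by the chosen field repeated the requested number of times (without the trailing comma from the question).

```shell
# usage: repeat_field FIELD_NUMBER TIMES < file.csv
repeat_field() {
  awk -F, -v OFS=, -v f="$1" -v n="$2" '{
    out = $1                                  # always start with the first field
    for (i = 1; i <= n; i++) out = out OFS $f # append field f, n times
    print out
  }'
}

result=$(printf 'ram,33,professional,doc\nshaym,23,salaried,eng\n' | repeat_field 4 2)
echo "$result"
```

Appending OFS to out before the print would reproduce the trailing comma if it is really needed.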

Resources