I have a well-formed CSV file, which may or may not have a header line; and may or may not have quoted data. I want to determine the number of columns in it, using the shell.
Now, if I can be sure there are no quoted commas in the file, the following seems to work:
x=$(tail -1 00-45-19-tester-trace.csv | grep -o , | wc -l); echo $((x + 1))
but what if I can't make that assumption? That is, what if I can't assume a comma is always a field separator? How do I do it then?
If it helps, you're allowed to assume there are no quoted quotes (i.e. no \"s within quoted strings); but it's better not to make that assumption either.
If you cannot make any optimistic assumptions about the data, then there won't be a simple solution in Bash. It's not trivial to parse a general CSV format with possible embedded newlines and embedded separators. You're better off not writing that in bash, but using an existing proper CSV parser. For example, Python has one built into its standard library.
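As a minimal sketch of that approach (assuming Python 3 is available; input.csv is a placeholder name), the csv module can read the first record and report its field count even with quoted commas or embedded newlines:
python3 -c 'import csv, sys; print(len(next(csv.reader(open(sys.argv[1], newline="")))))' input.csv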
If you can assume that there are no embedded newlines and no embedded separators, then it's simple to split on commas using awk:
awk -F, '{ print NF; exit }' input.csv
-F, tells awk to use comma as the field separator, and the automatic NF variable is the number of fields on the current line.
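For example, on a three-field line:
echo 'a,b,c' | awk -F, '{ print NF; exit }'
prints 3.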
If you want to allow embedded separators, but you can assume no embedded double quotes, then you can eliminate the embedded separators with a simple filter, before piping to the same awk as earlier:
head -n 1 input.csv | sed -e 's/"[^"]*"//g' | awk ...
Note that both of these examples use the first line to decide the number of fields. If the input has a header line, this should work quite well, as the header should not contain embedded newlines.
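Spelled out, the combined pipeline (a sketch, reusing the same input.csv name) looks like:
head -n 1 input.csv | sed -e 's/"[^"]*"//g' | awk -F, '{ print NF }'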
Count fields in the first row, then verify that all rows have the same number:
CNT=$(head -n1 hhdata.csv | awk -F ',' '{print NF}')
cat hhdata.csv | awk -F ',' '{print NF}' | grep -vx "$CNT"
Doesn't cope with embedded commas but will highlight any rows where they exist (grep -x matches the whole line, so a count of 15 isn't hidden by a CNT of 5).
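As a quick illustration with a made-up two-row input where CNT would be 3:
printf 'a,b,c\nd,"e,f",g\n' | awk -F ',' '{print NF}' | grep -vx 3
This prints 4, flagging the row whose naive comma count disagrees.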
If the file has no double quotes, then use the command below:
awk -F"," '{ print NF }' filename| sort -u
If the file has every column enclosed in double quotes, then use the command below:
awk -F, '{gsub(/"[^"]*"/,x);print NF}' filename | sort -u
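As a quick illustration on a hypothetical one-line input, the gsub empties the quoted fields so only the separators are left to count:
echo '"a,b","c"' | awk -F, '{gsub(/"[^"]*"/,x);print NF}'
This prints 2, the true number of columns.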
I am trying to use awk to remove first three fields in a text file. Removing the first three fields is easy. But the rest of the line gets messed up by awk: the delimiters are changed from tab to space
Here is what I have tried:
head pivot.threeb.tsv | awk 'BEGIN {IFS="\t"} {$1=$2=$3=""; print }'
The first three columns are properly removed. The Problem is the output ends up with the tabs between columns $4 $5 $6 etc converted to spaces.
Update: The other question for which this was marked as duplicate was created later than this one: look at the dates.
First, as ED commented, you have to use FS as the field separator in awk; there is no IFS variable in awk.
Tab becomes space in your output because you didn't define OFS: assigning to a field makes awk rebuild the record using OFS, which defaults to a single space.
awk 'BEGIN{FS=OFS="\t"}{$1=$2=$3="";print}' file
This will remove the first 3 fields and leave the rest of the text "untouched" (you will see the 3 leading tabs); the <tab> separators are also kept in the output.
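A quick illustration on a made-up five-column line:
printf 'a\tb\tc\td\te\n' | awk 'BEGIN{FS=OFS="\t"}{$1=$2=$3="";print}'
The output is three leading tabs followed by d<tab>e.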
awk 'BEGIN{FS=OFS="\t"}{print $4,$5,$6}' file
will output without leading spaces/tabs, but if you have 500 columns you would have to print them in a loop, use the sub function, or consider other tools, cut for example.
Actually this can be done with a very simple cut command like this:
cut -f4- inFile
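cut's default delimiter is already tab, so on the same made-up line:
printf 'a\tb\tc\td\te\n' | cut -f4-
prints d<tab>e with the tabs intact.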
If you don't want the field separation altered then use sed to remove the first 3 columns instead:
sed -r 's/(\S+\s+){3}//' file
To store the changes back to the file you can use the -i option:
sed -ri 's/(\S+\s+){3}//' file
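For instance (GNU sed, since \S and \s are GNU extensions), on the same kind of tab-separated line:
printf 'a\tb\tc\td\te\n' | sed -r 's/(\S+\s+){3}//'
leaves d<tab>e.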
awk '{for (i=4; i<NF; i++) printf "%s ", $i; print $NF}'
I have a file as show below
1.2.3.4.ask
sanma.nam.sam
c.d.b.test
I want to remove the last field from each line; the delimiter is . and the number of fields is not constant.
Can anybody help me with an awk or sed solution? I can't use perl here.
Both these sed and awk solutions work independently of the number of fields.
Using sed:
$ sed -r 's/(.*)\..*/\1/' file
1.2.3.4
sanma.nam
c.d.b
Note: -r is the flag for extended regexp; it could be -E instead, so check with man sed. If your version of sed doesn't have a flag for this, then just escape the brackets:
sed 's/\(.*\)\..*/\1/' file
1.2.3.4
sanma.nam
c.d.b
The sed solution does a greedy match up to the last . and captures everything before it; it then replaces the whole line with only the matched part (n-1 fields). Use the -i option if you want the changes to be stored back to the file.
Using awk:
$ awk 'BEGIN{FS=OFS="."}{NF--; print}' file
1.2.3.4
sanma.nam
c.d.b
The awk solution simply prints n-1 fields; to store the changes back to the file, use redirection:
$ awk 'BEGIN{FS=OFS="."}{NF--; print}' file > tmp && mv tmp file
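Note that decrementing NF and relying on awk to rebuild the record is not strictly guaranteed by POSIX (gawk handles it); if your awk doesn't, a sketch of a portable loop version:
awk 'BEGIN{FS=OFS="."}{out=$1; for (i=2; i<NF; i++) out=out OFS $i; print out}' file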
Reverse, cut, reverse back.
rev file | cut -d. -f2- | rev >newfile
Or, replace from last dot to end with nothing:
sed 's/\.[^.]*$//' file >newfile
The regex [^.] matches one character which is not dot (or newline). You need to exclude the dot because the repetition operator * is "greedy"; it will select the leftmost, longest possible match.
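To see why that matters, compare the greedy .* with the restricted character class on the first sample line:
echo '1.2.3.4.ask' | sed 's/\..*$//'      # prints 1 (matches from the first dot)
echo '1.2.3.4.ask' | sed 's/\.[^.]*$//'   # prints 1.2.3.4 (matches only the last field)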
With cut on the reversed string
cat yourFile | rev | cut -d "." -f 2- | rev
If you want to keep the trailing ".", use the command below:
awk '{gsub(/[^\.]*$/,"");print}' your_file
I have a large list of LDAP DNs that are all related in that they failed to import into my application. I need to query these against my back-end database based on a very specific portion of the CN, but I'm not entirely sure how to narrow the strings down to a specific value that is not necessarily located in the same position every time.
Using the following bash command:
grep 'Failed to process entry' /var/log/tomcat6/catalina.out | awk '{print substr($0, index($0,$14))}'
I am able to return a list of DN's similar to: (sorry for the redacted nature, security dictates)
"cn=[Last Name] [Optional Middle Initial or Suffix] [First Name] [User name],ou=[value],ou=[value],o=[value],c=[value]".
The CN value can be confusing as the order of surname, given name, middle initial, prefix or suffix can be displayed in any order if the values even exist, but one thing does remain consistent, the username is always the last field in the cn (followed by a "," then the first of many potential OU's). I need to parse out that user name for querying, preferably into a comma separated list for easy copy and paste for use in a SQL IN() query or use in a bash script. So as an example, imagine the following short list of abbreviated DNs, only showing the CN value (since the rest of the DN is irrelevant):
"cn=Doe Jr. John john.doe,ou=...".
"cn=Doe A. Jane jane.a.doe,ou=...".
"cn=Smith Bob J bsmith,ou=...".
"cn=Powers Richard richard.powers1,ou=...".
I would like to have a csv list returned that looks like:
john.doe,jane.a.doe,bsmith,richard.powers1
Can a mix of awk and/or sed accomplish this?
sed -e 's/"^[^,]* \([^ ,]*\),.*/\1/'
will parse the username part of the common name and isolate the username. Follow up with
| tr '\n' ',' | sed -e 's/,$/\n/'
to convert the one-per-line username format into comma-separated form.
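Put together, a sketch of the full pipeline (assuming the DNs sit one per line in a file named input.txt):
sed -e 's/^"[^,]* \([^ ,]*\),.*/\1/' input.txt | tr '\n' ',' | sed -e 's/,$/\n/'
which should yield john.doe,jane.a.doe,bsmith,richard.powers1 for the sample lines above.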
Here is one quick and dirty way of doing it -
awk -v FS="[\"=,]" '{ print $3}' file | awk -v ORS="," '{print $NF}' | sed 's/,$//'
Test:
[jaypal:~/Temp] cat ff
"cn=Doe Jr. John john.doe,ou=...".
"cn=Doe A. Jane jane.a.doe,ou=...".
"cn=Smith Bob J bsmith,ou=...".
"cn=Powers Richard richard.powers1,ou=...".
[jaypal:~/Temp] awk -v FS="[\"=,]" '{ print $3}' ff | awk -v ORS="," '{print $NF}' | sed 's/,$//'
john.doe,jane.a.doe,bsmith,richard.powers1
OR
If you have gawk then
gawk '{ print gensub(/.* (.*[^,]),.*/,"\\1","$0")}' filename | sed ':a;{N;s/\n/,/}; ba'
Test:
[jaypal:~/Temp] gawk '{ print gensub(/.* (.*[^,]),.*/,"\\1","$0")}' ff | sed ':a;{N;s/\n/,/}; ba'
john.doe,jane.a.doe,bsmith,richard.powers1
Given a file "Document1.txt" containing
cn=Smith Jane batty.cow,ou=ou1_value,ou=oun_value,o=o_value,c=c_value
cn=Marley Bob reggae.boy,ou=ou1_value,ou=oun_value,o=o_value,c=c_value
cn=Clinton J Bill ex.president,ou=ou1_value,ou=oun_value,o=o_value,c=c_value
you can do a
cat Document1.txt | sed -n "s/^cn=.* \([A-Za-z0-9._]*\),ou=.*/\1/p"
which gets you
batty.cow
reggae.boy
ex.president
using tr to translate the end of line character
cat Document1.txt | sed -n "s/^cn=.* \([A-Za-z0-9._]*\),ou=.*/\1/p" | tr '\n' ','
produces
batty.cow,reggae.boy,ex.president,
you will need to deal with the last comma
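one way to deal with it, for example, is to strip it with another sed:
cat Document1.txt | sed -n "s/^cn=.* \([A-Za-z0-9._]*\),ou=.*/\1/p" | tr '\n' ',' | sed 's/,$//'
which gives batty.cow,reggae.boy,ex.president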
But if you want it in a database, say Oracle for example, a script containing:
#!/bin/bash
doc=$1
cat ${doc} | sed -n "s/^cn=.* \([A-Za-z0-9._]*\),ou=.*/\1/p" | while read username
do
    sqlplus -s username/password@instance <<+++
insert into mytable (user_name) values ('${username}');
exit
+++
done
N.B.
The A-Za-z0-9._ in the sed expression is every type of character you expect in the username - you may need to play with that one.
Caveat - I didn't test the last bit with the database insert in it!
Perl regex solution that I consider more readable than the alternatives, in case you're interested:
perl -ne 'print "$1," if /(([[:alnum:]]|[[:punct:]])+),ou/' input.txt
It prints the string preceding ',ou', accepting alphanumeric and punctuation chars (but no spaces, so only the username token is captured).
Output:
john.doe,jane.a.doe,bsmith,richard.powers1,
It has been over a year since an idea was posted to this, but I wanted a place to refer to in the future when this class of question comes up again. Also, I did not see a similar answer posted.
Of the pattern of data provided, my interpretation is that we can strip away everything after the first comma, leaving us with a true CN rather than a DN that starts with a CN.
In the CN, we strip everything before and including the last white space.
This will leave us with the username.
awk -F',' '/^cn=/{print $1}' ldapfile | awk '{print $NF}' >> usernames
Passing your ldap file to awk, with the field separator set to comma, and the match string set to cn= at the beginning of a line, we print everything up to the first comma. Then we pipe that output into an awk with the default field separator and print only the last field, resulting in just the username. We redirect and append this to a file in the current directory named usernames, and we end up with one username per line.
To convert this into a single comma separated line of usernames, we change the last print command to printf, leaving out the \n newline character, but adding a comma.
awk -F',' '/^cn=/{print $1}' ldapfile | awk '{printf "%s,", $NF}' >> usernames
This leaves the only line in the file with a trailing comma, but since it is only intended to be used for cut and paste, simply do not cut the last character. :)
I want to use bash to process a tab-delimited file. I only need the second and third columns written to a new file.
cut(1) was made expressly for this purpose:
cut -f 2-3 input.txt > output.txt
Cut is probably the best choice here; second to that is awk:
awk -F"\t" '{print $2 "\t" $3}' input > out
Expanding on the answer of carl-norum, using only tab as the delimiter, not all blanks:
cut -d$'\t' -f 2-3 input.txt > output.txt
Don't put a space between -d and $'\t'.