How to transpose data when a match is found in an array - shell

I want to get data in CSV format from XML data.
The XML data to test with is available in this question; the task is to extract the data between the <t> </t> tags.
Note: the data in the XML file might differ, and the headers will differ as well.
In this XML there are 3 headers -> NAME, AGE, COURSE
I have used the below command to get the data in horizontal format, all on one line:
awk -F'</*t>' 'NF>1{for(i=2; i<NF; i+=2) printf("%s%s", $i, (i+1==NF)?ORS:OFS)}' OFS=',' demo.xml
After running the above command, below is the output:
"NAME","Vikas","Vijay","Vilas","AGE","24","34","35","COURSE","MCA","MMS","MBA"
How I am trying to implement the logic:
Take a parameter from the user for how many header values there will be.
In the above XML there are 3 headers -> NAME, AGE, COURSE
header_count=3
So 3 headers means each group will be HEADER + 3 values.
NAME,Vikas,Prabhas,Arjun -> this will be transposed to the below
Output:
NAME
Vikas
Prabhas
Arjun
Same goes for header value AGE -> AGE + 3 values.
AGE,25,34,21 will be transposed vertically:
AGE
25
34
21
Same goes for header value COURSE -> COURSE + 3 values.
COURSE,MCA,MBA,MMS will be transposed vertically:
COURSE
MCA
MBA
MMS
**Expected output after combining all data for NAME, AGE, COURSE**
NAME,AGE,COURSE
Vikas,"25",MCA
Prabhash,"34",MBA
Arjun,"21",MMS

With your shown samples, please try the following. Written and tested in GNU awk; it should work in any awk.
awk '
BEGIN {
  FS = OFS = ","
}
{
  for (i = 1; i <= NF; i++) {
    # A header field starts a new group; reset the other two flags.
    if ($i ~ /^"NAME"$/) {
      found1 = 1
      found2 = found3 = ""
    }
    if ($i ~ /^"AGE"$/) {
      found1 = found3 = ""
      found2 = 1
    }
    if ($i ~ /^"COURSE"$/) {
      found1 = found2 = ""
      found3 = 1
    }
    if (found1) { name[++count1] = $i }
    if (found2) { age[++count2] = $i }
    if (found3) { course[++count3] = $i }
  }
}
END {
  # The longest of the three columns decides the number of output rows.
  if (count1 >= count2 && count1 >= count3) { max = count1 }
  if (count2 >= count1 && count2 >= count3) { max = count2 }
  if (count3 >= count1 && count3 >= count2) { max = count3 }
  for (i = 1; i <= max; i++) {
    print (name[i] ? name[i] : "NA", age[i] ? age[i] : "NA", course[i] ? course[i] : "NA")
  }
}
' Input_file
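Run against the one-line output shown earlier, this should print:
"NAME","AGE","COURSE"
"Vikas","24","MCA"
"Vijay","34","MMS"
"Vilas","35","MBA"
Since the question asks for a user-supplied header_count rather than hardcoded header names, a more generic variant (a sketch, not part of the original answer) could compute the group size from NF and walk the groups in parallel:
awk -v header_count=3 '
BEGIN { FS = OFS = "," }
{
  per = NF / header_count                  # fields per group: the header plus its values
  for (v = 1; v <= per; v++)               # v = 1 emits the header row itself
    for (h = 0; h < header_count; h++)
      printf "%s%s", $(h * per + v), (h == header_count - 1 ? ORS : OFS)
}' Input_file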

Awk Standard deviation for each unique identifier

I have the following dataset with multiple different ids in column 1, and I wish to calculate the mean and standard deviation of column 2 for each id:
123456 0.1234
123456 0.5673
123456 0.0011
123456 -0.0947
123457 0.9938
123457 0.0001
123457 0.2839
I have the following code to get the mean per id, but I am struggling to amend it to get the SD as well:
awk '{sum4[$1] += $2; count4[$1]++}; END{ for (id in sum4) { print id, sum4[id]/count4[id] } }' < want3.txt > mean_id.txt
The desired output is a file of id, mean, and sd:
123456 0.149275 0.2926
123457 0.425933 0.5118
Any advice would be much appreciated.
Thanks
Here is another approach which is more memory-efficient, but possibly less precise for large means:
$ awk -v t=1 '{s[$1]+=$2; ss[$1]+=$2*$2; c[$1]++}
END {for(k in s) print k,m=s[k]/c[k],sqrt((ss[k]-m^2*c[k])/(c[k]-t))}' file
123456 0.149275 0.292628
123457 0.425933 0.51185
This computes the sample standard deviation. If you have the full population, not just a sample, you can set t=0 to get the population standard deviation, which will be slightly lower; for large N the two are practically equivalent (within the margin of error due to measurement noise).
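If precision for large means is a concern, a numerically stabler single-pass alternative is Welford's online algorithm; here is a sketch (not from the original answers):
awk '{
  d = $2 - m[$1]                 # delta from the running mean
  n[$1]++
  m[$1] += d / n[$1]             # update the running mean
  s[$1] += d * ($2 - m[$1])      # accumulate squared deviations
}
END { for (k in n) print k, m[k], sqrt(s[k] / (n[k] - 1)) }' file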
With GNU awk, derived from Ivan's answer, which used the population standard deviation (division by n); I switched to the sample standard deviation (division by n-1).
awk '
{
  numrec[$1] += 1                 # per-id count
  sum[$1] += $2                   # per-id sum
  array[$1, numrec[$1]] = $2      # keep every value for the second pass
}
END {
  for (w in numrec) {
    for (x = 1; x <= numrec[w]; x++)
      sumsq[w] += (array[w, x] - (sum[w] / numrec[w]))^2
    printf("%d %.6f %.4f\n", w, sum[w] / numrec[w], sqrt(sumsq[w] / (numrec[w] - 1)))
  }
}
' file
Output:
123456 0.149275 0.2926
123457 0.425933 0.5118

How to read non-ASCII characters in a file as ASCII using AWK

My input file:
000000000 vélIstine IOBAN 00000004960
000000000 shankargu kumar 00000000040
TTTTTTTTT 0000000200000000050000000000000000000000
Whenever I have a non-ASCII character in the file like the above, my code snippet below does not calculate the sum (d_amt_sum+=substr($0,27,10)) properly: sometimes it skips that row, and sometimes it gives an incorrect value, returning 49 instead of 496 for substr($0,27,10).
Besides that, I want to know how to add a print statement inside AWK; for example, I need to print the value of substr($0,27,10) inside the if block. How do I do that?
set -A out_result -- `LC_ALL=en_US.UTF-8 awk 'BEGIN{
  d_amt_sum=d_rec_count=d_trailer_out_amt_sum=d_trailer_rec_count=0;
}
{
  if(substr($0,1,9) != "TTTTTTTTT")
  {
    d_amt_sum+=substr($0,27,10); d_rec_count+=1
  }
  else if(substr($0,1,9) == "TTTTTTTTT")
  {
    d_trailer_out_amt_sum+=substr($0,39,12);
    d_trailer_rec_count+=substr($0,31,8);
  }
}
END{print d_amt_sum, d_rec_count, d_trailer_out_amt_sum, d_trailer_rec_count}' ${OUTDIR}/${OUT_FILE}`
Expected output
500,2,500,2
You have a logic error in the ordering of the if/else statements, and another error in comparing a 1-character string against a 9-character substring. Fixing both gives...
awk '{
  k = substr($0, 1, 9)
  if (k == "TTTTTTTTT") {
    d_trailer_out_amt_sum += substr($0, 39, 12)
    d_trailer_rec_count += substr($0, 31, 8)
  }
  else if (k != "999999999") {
    d_amt_sum += substr($0, 27, 10)
    d_rec_count += 1
  }
}
END { print d_amt_sum, d_rec_count, d_trailer_out_amt_sum, d_trailer_rec_count }' file
500 2 500 2
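To the side question about printing inside AWK: print is an ordinary statement and can go anywhere inside a block, e.g. (amt is an illustrative variable name):
awk '{
  k = substr($0, 1, 9)
  if (k != "TTTTTTTTT") {
    amt = substr($0, 27, 10)
    print "amount field: [" amt "]"   # debug print inside the if block
    d_amt_sum += amt
  }
}
END { print d_amt_sum }' file
As for the non-ASCII characters: under a UTF-8 locale, gawk counts characters rather than bytes, so substr offsets shift when a multibyte character such as é appears. If the fixed-width layout is defined in bytes, running the script with LC_ALL=C instead of LC_ALL=en_US.UTF-8 should make substr count bytes again.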

Parsing CSV file with \n in double quoted fields

I'm parsing a CSV file that has line breaks inside double-quoted fields. I'm reading the file line by line with a Groovy script, but I get an ArrayIndexOutOfBoundsException when I try to access the missing tokens.
I was trying to pre-process the file to remove those characters, and I was thinking of doing that with a bash script or with Groovy itself.
Could you please suggest an approach I can use to resolve the problem?
This is what the CSV looks like:
header1,header2,header3,header4
timestamp, "abcdefghi", "abcdefghi","sdsd"
timestamp, "zxcvb
fffffgfg","asdasdasadsd","sdsdsd"
This is the Groovy script I'm using:
def csv = new File(args[0]).text
def retString = ""
def parsedFile = new File("Parsed_" + args[0]);
csv.eachLine { line, lineNumber ->
def splittedLine = line.split(',');
retString += new Date(splittedLine[0]) + ",${splittedLine[1]},${splittedLine[2]},${splittedLine[3]}\n";
if(lineNumber % 1000 == 0){
parsedFile.append(retString);
retString = "";
}
}
parsedFile.append(retString);
UPDATE:
Finally I did this and it works (I needed to format the first column from a timestamp to a human-readable date):
gawk -F',' '{print strftime("%Y-%m-%d %H:%M:%S", substr( $1, 0, length($1)-3 ) )","($2)","($3)","($4)}' TobeParsed.csv > Parsed.csv
Thank you @karakfa
If you use a proper CSV parser rather than trying to do it with split (which as you can see doesn't work with any form of quoting), then it works fine:
@Grab('com.xlson.groovycsv:groovycsv:1.1')
import static com.xlson.groovycsv.CsvParser.parseCsv
def csv = '''header1,header2,header3,header4
timestamp, "abcdefghi", "abcdefghi","sdsd"
timestamp, "zxcvb
fffffgfg","asdasdasadsd","sdsdsd"'''
def data = parseCsv(csv)
data.eachWithIndex { line, index ->
println """Line $index:
| 1:$line.header1
| 2:$line.header2
| 3:$line.header3
| 4:$line.header4""".stripMargin()
}
Which prints:
Line 0:
1:timestamp
2:abcdefghi
3:abcdefghi
4:sdsd
Line 1:
1:timestamp
2:zxcvb
fffffgfg
3:asdasdasadsd
4:sdsdsd
awk to the rescue!
this will merge the newline-split fields back together; your process can take it from there
$ awk -F'"' '!(NF%2){getline remainder;$0=$0 OFS remainder}1' splitted.csv
header1,header2,header3
xxxxxx, "abcdefghi", "abcdefghi"
yyyyyy, "zxcvb fffffgfg","asdasdasadsd"
This assumes that an odd number of quotes means a split field, and it replaces the newline with OFS. If you want to simply delete the newline (the split parts will be joined directly), remove OFS.
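If a quoted field can contain more than one embedded newline, the single getline above only repairs the first; a looped variant along the same lines keeps reading until the quotes balance:
awk -F'"' '{
  while (!(NF % 2)) {                      # even field count = unbalanced quotes
    if ((getline nextline) <= 0) break     # stop at end of file
    $0 = $0 OFS nextline                   # append and let awk re-split the record
  }
} 1' splitted.csv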

Manipulating contents of a file for determining deadlock

I've been working on a short assignment to determine deadlocks in a database soon after they are identified. For this I will be regularly monitoring the process details once I dump them to a text file.
After certain manipulation, the output file (a miniature version) to be monitored is:
INPUT FILE:
SPID    BLOCKING_SPID
23      50
71      46
50      60
60      96
This means that process 23 is being blocked by process 50, 50 in turn by 60, and 60 by 96.
So in this case the process chain dependency is 23->50->60->96, and the other is 71->46.
In short, I want the final culprit spids, in this case 96 and 46, along with the chain counts (4 and 2).
OUTPUT FILE:
CULPRIT SPID CHAIN_COUNT
1) 96 4
2) 46 2
For this, the entire file will have to be looped over, comparing column 2 against every column 1 in the file. This has to be repeated until there is no match for column 2.
I need to achieve this via awk or sed; however, I believe this could also be done with a linked list.
Any kind of suggestions are welcome.
@JS, just one more question: for the sample file
SPID    BLOCKING_SPID
45 11
12 34
34 35
23 60
71 45
60 71
OUTPUT I RECEIVE:
71=>45
23=>60->71
12=>34->35
OUTPUT I AM SUPPOSED TO GET
23=>60=>71=>45
12=>34->35
I hope you got what I am trying to say.
Let's try to solve this in two steps. Step one is where we shall create the chain using your input file (assuming that it might be useful for you to go back and check the trail):
awk '
NR>1 {
  # If the current SPID is the tail of an existing chain, extend that chain.
  for (value in values) {
    if ($1 == values[value]) {
      chain[$1] = chain[value] "->" $2
      values[$1] = $2
      delete chain[value]
      delete values[value]
      next
    }
  }
  # Otherwise start a new chain.
  values[$1] = $2
  chain[$1] = $1 "->" $2
}
END {
  # One merge pass: join chains whose tail is another chain's head.
  # (With an unlucky array iteration order some chains can stay split,
  # in which case the pass would need to be repeated.)
  for (left in values) {
    for (c in chain) {
      x = split(chain[c], t, /->/)
      if (left == t[x]) {
        chain[left] = chain[c] "->" values[left]
        delete chain[c]
      }
    }
  }
  for (trail in chain) print chain[trail]
}' file
For your input file, this gives the output as:
23->60->71->45->11
12->34->35
Now we will pipe the output of the above script to another awk for formatting.
awk -F'->' '
BEGIN {
  OFS = "\t"
  print "CULPRIT SPID", "CHAIN_COUNT"
}
{
  # The last field is the chain tail (the culprit); NF is the chain length.
  print $NF, NF
}'
This will give the output as:
CULPRIT SPID CHAIN_COUNT
11 5
35 3
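The two steps can also be run as one pipeline, with the first script's output fed straight into the formatter (schematically):
awk '...build the chains as above...' file | awk -F'->' '
BEGIN { OFS = "\t"; print "CULPRIT SPID", "CHAIN_COUNT" }
{ print $NF, NF }'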
As I said before, that assumes you need to look at the trail. If the trail is of no use to you, then you can simply do the following:
awk '
BEGIN {
  OFS = "\t"
  print "CULPRIT SPID", "CHAIN_COUNT"
}
NR>1 {
  for (value in values) {
    if ($1 == values[value]) {
      chain[$1] = chain[value] FS $2
      values[$1] = $2
      delete chain[value]
      next
    }
  }
  values[$1] = $2
  chain[$1] = $1 FS $2
}
END {
  for (left in values) {
    for (c in chain) {
      x = split(chain[c], t)
      if (left == t[x]) {
        chain[left] = chain[c] FS values[left]
        delete chain[c]
      }
    }
  }
  for (v in chain) {
    num = split(chain[v], culprit)
    print culprit[num], num
  }
}' file
Output will be:
CULPRIT SPID CHAIN_COUNT
11 5
35 3
You can redirect it to an output file as you please.

Comparing many files in Bash

I'm trying to automate a task at work that I normally do by hand: taking database output of the permissions of multiple users and comparing them to see what they have in common.
Part of the problem comes from comm only dealing with two files at once, and I need to compare at least three to find a trend. I also need to determine when two out of the three have something in common that the third one doesn't (so comparing the output of two comm commands doesn't work). I need the results as comma-separated values so they can be imported into Excel: each user gets a column, and at the end is a listing of everything they have in common. comm would work perfectly if it could compare more than two files (and show two-out-of-three comparisons).
In addition to the code I have to clean all the extra cruft off the raw csv file, here's what I have so far for comparing four users. It's highly inefficient, but it's what I know:
cat foo1 | sort > foo5
cat foo2 | sort > foo6
cat foo3 | sort > foo7
cat foo4 | sort > foo8
comm foo5 foo6 > foomp
comm foo7 foo8 > foomp2
paste foomp foomp2 > output2
sed 's/[\t]/,/g' output2 > output4.csv
cat output4.csv
Right now this outputs two users, their similarities and differences, then does the same for another two users and pastes it together. This works better than doing it by hand, but I know I could be doing more.
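For what it's worth, the sort/comm/paste steps above can be written a little more compactly with equivalent output, e.g.:
for i in 1 2 3 4; do
  sort "foo$i" > "foo$((i + 4))"    # foo1..foo4 sorted into foo5..foo8
done
comm foo5 foo6 > foomp
comm foo7 foo8 > foomp2
paste foomp foomp2 | tr '\t' ',' > output4.csv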
An example input file would be something like:
User1
Active Directory
Internet
S: Drive
Sales Records
User2
Active Directory
Internet
Pricing Lookup
S: Drive
User3
Active Directory
Internet
Novell
Sales Records
where they have Active Directory and Internet in common, two out of three have Sales Records access and S: Drive permission, and only one each has Novell and Pricing Lookup access.
Can someone give me a hand with what I'm missing?
Using GNU AWK (gawk) you can print a table that shows how multiple users' permissions correlate. You could also do the same thing in any language that supports associative arrays (hashes), such as Bash 4, Python, Perl, etc.
#!/usr/bin/awk -f
{
    array[FILENAME, $0] = $0            # remember each (user, permission) pair
    perms[$0] = $0                      # collect every permission seen
    if (length($0) > maxplen) {
        maxplen = length($0)
    }
    users[FILENAME] = FILENAME          # one input file per user
}
END {
    pcount = asort(perms)               # GNU awk: sort values into perms[1..pcount]
    ucount = asort(users)
    maxplen += 2
    colwidth = 8
    printf("%*s", maxplen, "")
    for (u = 1; u <= ucount; u++) {
        printf("%-*s", colwidth, users[u])
    }
    printf("\n")
    for (p = 1; p <= pcount; p++) {
        printf("%-*s", maxplen, perms[p])
        for (u = 1; u <= ucount; u++) {
            if (array[users[u], perms[p]]) {
                printf("%-*s", colwidth, " X")
            } else {
                printf("%-*s", colwidth, "")
            }
        }
        printf("\n")
    }
}
Save this file, perhaps calling it "correlate", then set it to be executable:
$ chmod u+x correlate
Then, assuming that the filenames correspond to the usernames or are otherwise meaningful (your examples are "user1" through "user3" so that works well), you can run it like this:
$ ./correlate user*
and you would get the following output based on your sample input:
                  user1   user2   user3
Active Directory   X       X       X
Internet           X       X       X
Novell                             X
Pricing Lookup             X
S: Drive           X       X
Sales Records      X               X
Edit:
This version doesn't use asort() and so it should work on non-GNU versions of AWK. The disadvantage is that the order of rows and columns is unpredictable.
#!/usr/bin/awk -f
{
    array[FILENAME, $0] = $0
    perms[$0] = $0
    if (length($0) > maxplen) {
        maxplen = length($0)
    }
    users[FILENAME] = FILENAME
}
END {
    maxplen += 2
    colwidth = 8
    printf("%*s", maxplen, "")
    for (u in users) {
        printf("%-*s", colwidth, u)
    }
    printf("\n")
    for (p in perms) {
        printf("%-*s", maxplen, p)
        for (u in users) {
            if (array[u, p]) {
                printf("%-*s", colwidth, " X")
            } else {
                printf("%-*s", colwidth, "")
            }
        }
        printf("\n")
    }
}
You can use the diff3 program. From the man page:
diff3 - compare three files line by line
Given your sample inputs above, running diff3 (e.g. diff3 user1 user2 user3) results in:
====
1:3,4c
S: Drive
Sales Records
2:3,4c
Pricing Lookup
S: Drive
3:3,4c
Novell
Sales Records
Does this get you any closer to what you're looking for?
I would use the strings command to remove any binary content from the files, concatenate them, sort them, and then use uniq -c to get a count of occurrences of each string (uniq only collapses adjacent duplicates, hence the sort). A sketch follows.
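A minimal sketch of that pipeline, assuming one file per user as in the samples above (a count of 3 marks a permission common to all three users, provided no file lists a permission twice):
strings user1 user2 user3 | sort | uniq -c | sort -rn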
