Shell programming: extract the values of 2 keywords - shell

Input file (HTTP request log file):
GET /dynamic_branding_playlist.fmil?domain=915oGLbNZhb&pluginVersion=3.2.7_2.6&pubchannel=usa&sdk_ver=2.4.6.3&width=680&height=290&embeddedIn=http%3A%2F%2Fviewster.com%2Fsplash%2FOscar-Videos-1.aspx%3Futm_source%3Dadon_272024_113535_24905_24905%26utm_medium%3Dcpc%26utm_campaign%3DUSYME%26adv%3D573900%26req%3D5006e9ce1ca8b26347b88a7.1.825&sdk_url=http%3A%2F%2Fdivaag.vo.llnwd.net%2Fo42%2Fhttp_only%2Fviewster_com%2Fv25%2Fyume%2F&viewport=42
Output file:
domain sdk_version
915oGLbNZhb 2.4.6.3
There are thousands of log lines similar to the example above, so I need a way to extract the values of domain and sdk_ver. The positions of domain and sdk_ver are not fixed: sometimes they appear in the 2nd field, sometimes in the last field (if split by &).
Could anyone help me with this problem (using the sed command)? Thanks so much in advance.

Using sed:
sed -n 's/.*domain=\([^&]*\).*sdk_ver=\([^&]*\).*/\1 \2/p' input_file
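Note that this substitution assumes domain appears before sdk_ver on each line. Since the question says the order is not fixed, a sketch covering both orders (assuming each line contains both keys exactly once) chains two expressions:
sed -n -e 's/.*domain=\([^&]*\).*sdk_ver=\([^&]*\).*/\1 \2/p' \
       -e 's/.*sdk_ver=\([^&]*\).*domain=\([^&]*\).*/\2 \1/p' input_file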

This might work for you (GNU sed):
sed 's/.*\<\(domain\)=\([^&]*\).*\<\(sdk_ver\)=\([^&]*\).*/\1 \3sion\n\2 \4/p;d' file
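(The \< word-boundary anchors are a GNU extension; they stop keys that merely end in domain, e.g. a hypothetical subdomain=..., from matching.)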

Using awk:
BEGIN {
    FS = "[&?]"
    printf "domain\tsdk_version\n"
}
{
    for (i = 1; i <= NF; i++) {
        split($i, array, "=")
        if (array[1] == "domain") {
            printf "%s", array[2]      # always pass a format string to printf
        }
        if (array[1] == "sdk_ver") {
            printf "\t%s", array[2]
        }
    }
    printf "\n"
}
Or as a one-liner:
awk -F "[&?]" 'BEGIN { printf "domain\tsdk_version\n" } { for (i = 1; i <= NF; i++) { split($i, array, "="); if (array[1] == "domain") printf "%s", array[2]; if (array[1] == "sdk_ver") printf "\t%s", array[2] } printf "\n" }' file.txt
Results:
domain sdk_version
915oGLbNZhb 2.4.6.3

Related

how to find common columns and their records from two files using awk

I have two files:
File 1:
id|name|address|country
1|abc|efg|xyz
2|asd|dfg|uio
File 2 (only headers):
id|name|country
Now, I want an output like:
OUTPUT:
id|name|country
1|abc|xyz
2|asd|uio
Basically, I have a user record file (file1) and a header file (file2). Now I want to extract only those records from file1 whose columns match those in the header file.
I want to do this using awk or bash.
I tried using:
awk 'BEGIN { OFS="..."} FNR==NR { a[(FNR"")] = $0; next } { print a[(FNR"")], $0 > "test.txt"}' header.txt file.txt
and I have no idea what to do next.
Thank You
The following awk may help you with this.
awk -F"|" 'FNR==NR{for(i=1;i<=NF;i++){a[$i]};next} FNR==1 && FNR!=NR{for(j=1;j<=NF;j++){if($j in a){b[++p]=j}}} {for(o=1;o<=p;o++){printf("%s%s",$b[o],o==p?ORS:OFS)}}' OFS="|" File2 File1
Here is a non-one-liner form of the same solution:
awk -F"|" '
FNR==NR{
  for(i=1;i<=NF;i++){
    a[$i]};
  next}
FNR==1 && FNR!=NR{
  for(j=1;j<=NF;j++){
    if($j in a){ b[++p]=j }}
}
{
  for(o=1;o<=p;o++){
    printf("%s%s",$b[o],o==p?ORS:OFS)}
}
' OFS="|" File2 File1
Edit by Ed Morton: FWIW here's the same script written with normal indenting/spacing and a couple of more meaningful variable names:
BEGIN { FS=OFS="|" }
NR==FNR {
    for (i=1; i<=NF; i++) {
        names[$i]
    }
    next
}
FNR==1 {
    for (i=1; i<=NF; i++) {
        if ($i in names) {
            f[++numFlds] = i
        }
    }
}
{
    for (i=1; i<=numFlds; i++) {
        printf "%s%s", $(f[i]), (i<numFlds ? OFS : ORS)
    }
}
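For reference, if Ed's version is saved as, say, common_cols.awk (name assumed here), it runs the same way as the one-liner:
$ awk -f common_cols.awk File2 File1
id|name|country
1|abc|xyz
2|asd|uio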
with (lots of) unix pipes as Doug McIlroy intended...
$ function p() { sed 1q "$1" | tr '|' '\n' | cat -n | sort -k2; }
$ cut -d'|' -f"$(join -j2 <(p header) <(p file) | sort -k2n | cut -d' ' -f3 | paste -sd,)" file
id|name|country
1|abc|xyz
2|asd|uio
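To unpack it: p numbers a file's header fields and sorts them by name, join then pairs the names common to both headers, and the trailing sort/cut/paste turns the surviving positions back into a comma-separated field list, so for these files the inner command expands to cut -d'|' -f1,2,4 file.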
Solution using bash > 4:
IFS='|' headers1=($(head -n1 $file1))
IFS='|' headers2=($(head -n1 $file2))
IFS=$'\n'
# find the indices we want to output, i.e. the mapping of headers1 to headers2
idx=()
for i in $(seq 0 $((${#headers2[@]}-1))); do
    for j in $(seq 0 $((${#headers1[@]}-1))); do
        if [ "${headers2[$i]}" == "${headers1[$j]}" ]; then
            idx+=($j)
            break
        fi
    done
done
# idx=(0 1 3) for example
# simple join output function from https://stackoverflow.com/questions/1527049/join-elements-of-an-array
join_by() { local IFS="$1"; shift; echo "$*"; }
# first line - output headers
join_by '|' "${headers2[@]}"
isfirst=true
while IFS='|' read -a vals; do
    # ignore the first (header) line
    if $isfirst; then
        isfirst=false
        continue
    fi
    # keep only the columns whose indices are in idx
    tmp=()
    for i in "${idx[@]}"; do
        tmp+=("${vals[$i]}")
    done
    # join output with '|'
    join_by '|' "${tmp[@]}"
done < $file1
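Assuming the snippet is saved as filter.sh (name assumed) and the variables point at the question's files, it can be run as:
$ file1=File1 file2=File2 bash filter.sh
id|name|country
1|abc|xyz
2|asd|uio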
This one respects the order of columns in file1. For testing, I changed the column order:
$ cat file1
id|country|name
The awk:
$ awk '
BEGIN { FS=OFS="|" }
NR==1 {                 # file1
    n=split($0,a)
    next
}
NR==2 {                 # file2 header
    for(i=1;i<=NF;i++)
        b[$i]=i
}
{                       # output part
    for(i=1;i<=n;i++)
        printf "%s%s", $b[a[i]], (i==n?ORS:OFS)
}' file1 file2
id|country|name
1|xyz|abc
2|uio|asd
(Another version using cut for outputting is in the revisions.)
This is similar to RavinderSingh13's solution in that it first reads the headers from the shorter file, then decides which columns to keep from the longer file based on the headers on its first line.
However, it produces the output differently: instead of constructing a string, it shifts the columns to the left when a particular field is not wanted.
BEGIN { FS = OFS = "|" }
# read headers from first file
NR == FNR { for (i = 1; i <= NF; ++i) header[$i]; next }
# mark fields in second file as "selected" if the header corresponds
# to a header in the first file
FNR == 1 {
    for (i = 1; i <= NF; ++i)
        select[i] = ($i in header)
}
{
    skip = 0
    pos = 1
    for (i = 1; i <= NF; ++i)
        if (!select[i]) {           # we don't want this field
            ++skip
            $pos = $(pos + skip)    # shift fields left
        } else
            ++pos
    NF -= skip                      # adjust number of fields
    print
}
Running this:
$ mawk -f script.awk file2 file1
id|name|country
1|abc|xyz
2|asd|uio
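One caveat: shrinking NF to drop trailing fields is not strictly guaranteed by POSIX, but gawk, mawk and BWK awk all rebuild the record as expected here.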

awk to process the first two lines then the next two and so on

Suppose I have a very big file which I created from two files, one old and another the updated file, by using cat & sort on the primary key.
File1
102310863||7097881||6845193||271640||06007709532577||||
102310863||7097881||6845123||271640||06007709532577||||
102310875||7092992||6840808||023740||10034500635650||||
102310875||7092992||6840818||023740||10034500635650||||
So the pattern of this file is: line 1 = old value, line 2 = updated value, and so on.
Now I want to process the file in such a way that awk first processes the first two lines of the file, finds the difference, and then moves on to the next two lines.
The process is:
if ($[old record] != $[new record])
    i = [new record]#[old record];
Desired output
102310863||7097881||6845123#6845193||271640||06007709532577||||
102310875||7092992||6840818#6840808||023740||10034500635650||||
$ cat tst.awk
BEGIN { FS="[|][|]"; OFS="||" }
NR%2 { split($0,old); next }
{
    for (i=1; i<=NF; i++) {
        if (old[i] != $i) {
            $i = $i "#" old[i]
        }
    }
    print
}
$
$ awk -f tst.awk file
102310863||7097881||6845123#6845193||271640||06007709532577||||
102310875||7092992||6840818#6840808||023740||10034500635650||||
This awk could help:
$ awk -F '\\|\\|' '{
    getline new;
    split(new, new_array, "\\|\\|");
    for (i=1; i<=NF; i++) {
        if ($i != new_array[i]) {
            $i = new_array[i] "#" $i;
        }
    }
} 1' OFS="||" < input_file
102310863||7097881||6845123#6845193||271640||06007709532577||||
102310875||7092992||6840818#6840808||023740||10034500635650||||
I think you are good enough in awk to understand the above code, so I'm skipping the explanation.
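As an aside, the unguarded getline above will leave new holding its previous value if the file ends on an odd number of lines; testing the return value, e.g. if ((getline new) > 0), guards against a stray duplicated record.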
Updated version, and thanks @martin for the double | trick:
$ cat join.awk
BEGIN { new=0; FS="[|]{2}"; OFS="||" }
new==0 {
    split($0, old_data, "[|]{2}")
    new=1
    next
}
new==1 {
    split($0, new_data, "[|]{2}")
    for (i = 1; i <= 7; i++) {
        if (new_data[i] != old_data[i]) new_data[i] = new_data[i] "#" old_data[i]
    }
    print new_data[1], new_data[2], new_data[3], new_data[4], new_data[5], new_data[6], new_data[7]
    new = 0
}
$ awk -f join.awk data.txt
102310863||7097881||6845123#6845193||271640||06007709532577||||
102310875||7092992||6840818#6840808||023740||10034500635650||||

How do I convert the following log file to a CSV-type file?

I have a log file formatted like this:
timestamp=123;
data1=value1;
data2=value2;
<-- empty line
timestamp=456;
data3=value3;
data4=value4;
What unix commands can I use to convert it to this format:
timestamp=123,data1=value1,data2=value2
timestamp=456,data3=value3,data4=value4
How about awk?
#!/bin/bash
awk '
BEGIN {
    FS = ";";                # $1 will contain everything before the semicolon
    first_item = 1;
}
{
    if ($1 == "") {          # handle empty line
        printf "\n";
        first_item = 1;
        next;
    }
    if (first_item != 1) {   # avoid a comma at the end of the line
        printf ",";
    } else {
        first_item = 0;
    }
    printf "%s", $1;         # print the item
}
END {
    printf "\n";
}'
If the input is saved in input.txt and the above script is named to_csv.sh,
the following command will produce the desired output:
./to_csv.sh < input.txt
This might work for you (GNU sed):
sed -r ':a;$!N;s/;\n/,/;ta;s/,(\n)/\1/;$s/;//;P;D' file
or this:
sed -r ':a;$!N;s/;\n(timestamp)/\n\1/;s/;\n/,/;ta;s/,(\n)/\1/;$s/;//;P;D' file
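Another route, not taken in the answers above, is awk's paragraph mode: setting RS to the empty string makes each blank-line-separated block a single record, so a sketch can be as short as:
awk 'BEGIN { RS="" } { gsub(/;\n/, ","); sub(/;$/, ""); print }' file
Each semicolon-newline pair inside a record becomes a comma, and the record's trailing semicolon is dropped.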

how to sum each column in a file using bash

I have a file in the following format:
id_1,1,0,2,3,lable1
id_2,3,2,2,1,lable1
id_3,5,1,7,6,lable1
and I want the summation of each column (I have over 300 columns):
9,3,11,10,lable1
How can I do that using bash?
I tried using what is described here, but it didn't work.
Using awk:
$ awk -F, '{for (i=2;i<NF;i++) a[i]+=$i} END {for (i=2;i<NF;i++) printf "%s,", a[i]; print $NF}' file
9,3,11,10,lable1
This will print the sum of each column (from i=2 to i=NF-1) as a comma-separated line, followed by the value of the last column from the last row (i.e. lable1).
If the totals would need to be grouped by the label in the last column, you could try this:
awk -F, '
{
    L[$NF]
    for (i=2; i<NF; i++) T[$NF,i]+=$i
}
END {
    for (i in L) {
        s=i
        for (j=NF-1; j>1; j--) s=T[i,j] FS s
        print s
    }
}
' file
If the labels in the last column are sorted then you could try without arrays and save memory:
awk -F, '
function labelsum() {
    s=p
    for (i=NF-1; i>1; i--) s=T[i] FS s
    print s
    split(x,T)
}
p!=$NF {
    if (p) labelsum()
    p=$NF
}
{
    for (i=2; i<NF; i++) T[i]+=$i
}
END {
    labelsum()
}
' file
Here's a Perl one-liner:
<file perl -lanF, -E 'for ( 0 .. $#F ) { $sums{ $_ } += $F[ $_ ]; } END { say join ",", map { $sums{ $_ } } sort keys %sums; }'
It will only do sums, so the first and last column in your example will be 0.
This version will follow your example output:
<file perl -lanF, -E 'for ( 1 .. $#F - 1 ) { $sums{ $_ } += $F[ $_ ]; } END { $sums{ $#F } = $F[ -1 ]; say join ",", map { $sums{ $_ } } sort keys %sums; }'
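(One caveat with both one-liners: sort keys %sums sorts lexically, so with 300 columns the output order would be wrong; a numeric sort { $a <=> $b } keys %sums is safer.)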
A modified version based on the solution you linked:
#!/bin/bash
colnum=6
filename="temp"
for ((i=2; i<$colnum; ++i))
do
    sum=$(cut -d ',' -f $i $filename | paste -sd+ | bc)
    echo -n $sum','
done
head -1 $filename | cut -d ',' -f $colnum
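Note the trade-off: this runs a cut | paste | bc pipeline per column, re-reading the file once for each of the 300 columns, so the single-pass awk solutions above will be much faster on large inputs.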
Pure bash solution:
#!/usr/bin/bash
while IFS=, read -a arr
do
    for ((i=1; i<${#arr[*]}-1; i++))
    do
        ((farr[$i]=${farr[$i]}+${arr[$i]}))
    done
    farr[$i]=${arr[$i]}
done < file
(IFS=,; echo "${farr[*]}")

Swap the 2 columns in output - sed or awk?

Input file:
GET /static_register_ad_request_1_2037_0_0_0_1_1_4_8335086462.gif?pa=99439_50491&country=US&state_fips_code=US_CA&city_name=Los%2BAngeles&dpId=2&dmkNm=apple&dmlNm=iPod%2Btouch&osNm=iPhone%2BOS&osvNm=5.1.1&bNm=Safari&bvNm=null&spNm=SBC%2BInternet%2BServices&kv=0_0&sessionId=0A80187E0138A0AE42E4DE3F783E7A08&sdk_version=4.0.5.6%20&domain=805AOEtUaMu&ad_catalog=99439_50491&make=APPLE&width=320&height=460&slot_type=PREROLL&model=iPod%20touch%205.1.1&iabcat=artsandentertainment&iabsubcat=music&age=113&gender=2&zip=92869 HTTP/1.1
Output file:
domain sdk_version
805AOEtUaMu 4.0.5.6%20
I could use sed -n 's/.*sdk_version=\([^&]*\).*domain=\([^&]*\).*/\1 \2/p' to get the result, but that puts sdk_version in the first column; what I need is to swap the sdk_version and domain columns in the output file.
Could anyone help me with this? Thank you so much in advance:)
Just swap your backreferences:
sed -n 's/.*sdk_version=\([^&]*\).*domain=\([^&]*\).*/\2 \1/p'
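Run against the sample line (assuming the log is in infile, as in the awk answer below):
$ sed -n 's/.*sdk_version=\([^&]*\).*domain=\([^&]*\).*/\2 \1/p' infile
805AOEtUaMu 4.0.5.6%20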
One way using awk:
awk '
BEGIN {
    FS = "&";
}
{
    for ( i = 15; i <= 16; i++ ) {
        split( $i, f, /=/ );
        printf( "%s ", f[2] );
    }
}
END {
    printf "\n";
}
' infile
Output:
4.0.5.6%20 805AOEtUaMu
If you want to handle arbitrary order, I would suggest switching to awk or Perl.
perl -ne '$d = $1 if m/[?&]domain=([^&]+)/;
          $s = $1 if m/[?&]sdk_version=([^&]+)/;
          print "$d\t$s\n"' logfile
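On the sample request line this prints, tab-separated:
$ perl -ne '$d = $1 if m/[?&]domain=([^&]+)/; $s = $1 if m/[?&]sdk_version=([^&]+)/; print "$d\t$s\n"' logfile
805AOEtUaMu	4.0.5.6%20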
