finding maximum from partial string - shell

I have a list where the first 8 digits are a date in yyyymmdd format. The next 6 digits are a time of day (hhmmss). I want to select only the lines that carry the maximum timestamp for each day.
20160905092900
20160905212900
20160906092900
20160906213000
20160907093000
20160907213000
20160908093000
20160908213000
20160910093000
20160910213100
20160911093100
20160911213100
20160912093100
This means that, from the above list, the output should be the list below.
20160905212900
20160906213000
20160907213000
20160908213000
20160910213100
20160911213100
20160912093100

$ sort -r file | awk '!seen[substr($0,1,8)]++' | sort
20160905212900
20160906213000
20160907213000
20160908213000
20160910213100
20160911213100
20160912093100
If the file's already sorted, you can use tac instead of the first sort -r.
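For instance, assuming GNU tac is available and the file is sorted ascending, the whole pipeline becomes:

tac file | awk '!seen[substr($0,1,8)]++' | sort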

You can use awk:
awk '{
    dt = substr($0, 1, 8)   # date part: yyyymmdd
    ts = substr($0, 9)      # time-of-day part
}
ts > max[dt] {
    max[dt] = ts
    rec[dt] = $0
}
END {
    for (i in rec)
        print rec[i]
}' file
20160905212900
20160906213000
20160907213000
20160908213000
20160910213100
20160911213100
20160912093100
We use an associative array max, keyed by the first 8 characters (the date), whose values are the remaining characters (the time). It stores the maximum timestamp value seen so far for each date. A second array rec stores the full line for a date whenever we encounter a timestamp greater than the value stored in max.
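Note that the order of a for (i in rec) traversal is unspecified in awk, so if you need the output in date order, pipe it through sort:

awk '{ dt = substr($0,1,8); ts = substr($0,9) } ts > max[dt] { max[dt] = ts; rec[dt] = $0 } END { for (i in rec) print rec[i] }' file | sort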

Related

Copy columns of a file to specific location of another pipe delimited file

I have a file, suppose xyz.dat, which has data like below -
a1|b1|c1|d1|e1|f1|g1
a2|b2|c2|d2|e2|f2|g2
a3|b3|c3|d3|e3|f3|g3
Due to some requirement, I am making two new files (m.dat and o.dat) from the original xyz.dat.
m.dat contains columns 2|4|6, like below, after running some logic on them -
b11|d11|f11
b22|d22|f22
b33|d33|f33
o.dat contains all the columns except 2|4|6, like below, without any change -
a1|c1|e1|g1
a2|c2|e2|g2
a3|c3|e3|g3
Now I want to merge the m and o files to recreate xyz.dat in its original format.
a1|b11|c1|d11|e1|f11|g1
a2|b22|c2|d22|e2|f22|g2
a3|b33|c3|d33|e3|f33|g3
Please note the column positions can change for another file. I will get the column positions as input; in the above example they are 2, 4 and 6. So I need either a generic command to run in a loop to merge the new m and o files, or one command to which I can pass the column positions and which will copy the columns from the m.dat file and paste them into o.dat.
I tried paste, sed and cut but was not able to build a working command.
Please help.
To perform a column-wise merge of two files, it is better to use a scripting engine (Python, Awk or Perl, or even bash). Tools like paste, sed and cut do not have enough flexibility for these tasks (join may come close, but requires extra work).
Consider the following awk-based script:
awk -F'|' -v OFS='|' '
{
    # Read the matching line from o.dat; its fields go into array a
    getline s < "o.dat"
    n = split(s, a)
    # Print output; add a[n], or $n, ... as needed based on the actual number of fields.
    print a[1], $1, a[2], $2, a[3], $3, a[4]
}
' m.dat
The print line can be customized to generate whatever column order is needed (here it interleaves o.dat's fields with m.dat's to restore the original layout).
Based on clarification from the OP, it looks like the goal is: given two input files and a list of columns where data should be merged in from the 2nd file, produce an output file that contains the merged data.
For example:
awk -f mergeCols COLS=2,4,6 M=b.dat a.dat
# If file is marked executable (chmod +x mergeCols)
mergeCols COLS=2,4,6 M=b.dat a.dat
This will insert the columns from b.dat into columns 2, 4 and 6, while the other columns take their data from a.dat.
Implementation, using awk (create a file mergeCols):
#! /usr/bin/awk -f
BEGIN {
    FS = OFS = "|"
}
NR == 1 {
    # Build the column map: output position -> field index in the merged file
    nc = split(COLS, c, ",")
    for (i = 1; i <= nc; i++) {
        cmap[c[i]] = i
    }
}
{
    # Read one line from the merged file, split into tokens in 'a'
    getline s < M
    n = split(s, a)
    # Merge columns using the pre-built 'cmap'
    k = 0
    for (i = 1; i <= NF + nc; i++) {
        # Pick up a column: either from the merged file or the main input
        v = cmap[i] ? a[cmap[i]] : $(++k)
        sep = (i < NF + nc) ? OFS : "\n"
        printf "%s%s", v, sep
    }
}
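For example, with the sample files from the question (m.dat holding the transformed columns b11|d11|f11 and o.dat holding the remaining ones), a run should reproduce the original layout:

$ ./mergeCols COLS=2,4,6 M=m.dat o.dat
a1|b11|c1|d11|e1|f11|g1
a2|b22|c2|d22|e2|f22|g2
a3|b33|c3|d33|e3|f33|g3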

Awk printing out smallest and highest number, in a time format

I'm fairly new to linux/bash shell and I'm really having trouble printing two values (the highest and lowest) from a particular column in a text file. The file is formatted like this:
Geoff Audi 2:22:35.227
Bob Mercedes 1:24:22.338
Derek Jaguar 1:19:77.693
Dave Ferrari 1:08:22.921
As you can see, the final column is a timing. I'm trying to use awk to print out the highest and lowest timing in the column. I'm really stumped; I've tried:
awk '{print sort -n < $NF}' timings.txt
However, that didn't even seem to sort anything; I just received an output of:
1
0
1
0
...
repeating over and over. It went on for longer, but I didn't want to paste a massive block of it when you get the point after the first couple of iterations.
My desired output would be:
Min: 1:08:22.921
Max: 2:22:35.227
After question clarifications: if the time field always has the same number of digits in the same places, e.g. h:mm:ss.sss, the solution can be drastically simplified. Namely, we don't need to convert the time to seconds to compare it anymore; we can do a simple string/lexicographical comparison:
$ awk 'NR==1 {m=M=$3} {$3<m && (m=$3); $3>M && (M=$3)} END {printf("min: %s\nmax: %s\n",m,M)}' file
min: 1:08:22.921
max: 2:22:35.227
The logic is the same as in the (previous) script below, just using a simpler string-only based comparison for ordering values (determining min/max). We can do that since we know all timings will conform to the same format, and if a < b (for example "1:22:33" < "1:23:00") we know a is "smaller" than b. (If values are not consistently formatted, then by using the lexicographical comparison alone, we can't order them, e.g. "12:00:00" < "3:00:00".)
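A quick sanity check of that caveat:

$ awk 'BEGIN { print ("12:00:00" < "3:00:00") }'
1

Lexicographically "12:00:00" sorts before "3:00:00", even though twelve hours is the larger value.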
So, on the first value read (first record, NR==1), we set the initial min/max value to the timing read (from the 3rd field). For each record we test whether the current value is smaller than the current min, and if it is, we set the new min. Similarly for the max. We use short-circuiting instead of if to make the expressions shorter ($3<m && (m=$3) is equivalent to if ($3<m) m=$3; the parentheses around the assignment are required). In the END we simply print the result.
Here's a general awk solution that accepts time strings with a variable number of digits for hours/minutes/seconds per record (the sample output below is for a different input file that exercises this case):
$ awk '{split($3,t,":"); s=t[3]+60*(t[2]+60*t[1]); if (s<min||NR==1) {min=s;min_t=$3}; if (s>max||NR==1) {max=s;max_t=$3}} END{print "min:",min_t; print "max:",max_t}' file
min: 1:22:35.227
max: 10:22:35.228
Or, in a more readable form:
#!/usr/bin/awk -f
{
    split($3, t, ":")
    s = t[3] + 60 * (t[2] + 60 * t[1])
    if (s < min || NR == 1) {
        min = s
        min_t = $3
    }
    if (s > max || NR == 1) {
        max = s
        max_t = $3
    }
}
END {
    print "min:", min_t
    print "max:", max_t
}
For each line, we convert the time components (hours, minutes, seconds) from the third field into seconds, which we can then simply compare as numbers. As we iterate, we track the current min and max values, printing them in the END block. The initial values for min and max are taken from the first line (NR==1).
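For example, 1:08:22.921 becomes s = 22.921 + 60*(8 + 60*1) = 4102.921 and 2:22:35.227 becomes 8555.227, so the min/max comparisons are plain numeric ones.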
Given your statements that the time field is actually a duration and the hours component is always a single digit, this is all you need:
$ awk 'NR==1{min=max=$3} {min=(min<$3?min:$3); max=(max>$3?max:$3)} END{print "Min:", min ORS "Max:", max}' file
Min: 1:08:22.921
Max: 2:22:35.227
You don't want to run sort inside of awk (even with the proper syntax).
Try this:
sort -k3,3 timings.txt | sed -n '1p; $p'
where
sort orders the lines by the 3rd column; a plain lexicographic sort is correct here because every timing has the same fixed h:mm:ss.sss width (-k3,3n would compare only the leading hours digit and then tie-break on the whole line)
sed prints the first and last line, i.e. the min and max
If your real file has a header line, strip it first with an initial sed: sed 1d timings.txt | sort -k3,3 | sed -n '1p; $p'

Sort numerically in a string of text

I tried some sort examples but can't find the way to solve this. I think I should find the right separator and then sort numerically, but it doesn't work the way I want.
This is my file:
abc_bla_bla_bla_reg0_bla_reg_1_0
abc_bla_bla_bla_reg0_bla_reg_5_0
abc_bla_bla_bla_reg0_bla_reg_2_0
abc_bla_bla_bla_reg0_bla_reg_10_0
abc_bla_bla_bla_reg0_bla_reg_15_0
abc_bla_bla_bla_reg2_bla_reg_15_0
abc_bla_bla_bla_reg2_bla_reg_9_0
abc_bla_bla_bla_reg2_bla_reg_7_0
abc_bla_bla_bla_reg3_bla_reg_26_0
abc_bla_bla_bla_reg3_bla_reg_3_0
abc_bla_bla_bla_reg3_bla_reg_5_0
And this is my desire result:
abc_bla_bla_bla_reg0_bla_reg_1_0
abc_bla_bla_bla_reg0_bla_reg_2_0
abc_bla_bla_bla_reg0_bla_reg_5_0
abc_bla_bla_bla_reg0_bla_reg_10_0
abc_bla_bla_bla_reg0_bla_reg_15_0
abc_bla_bla_bla_reg2_bla_reg_7_0
abc_bla_bla_bla_reg2_bla_reg_9_0
abc_bla_bla_bla_reg2_bla_reg_15_0
abc_bla_bla_bla_reg3_bla_reg_3_0
abc_bla_bla_bla_reg3_bla_reg_5_0
abc_bla_bla_bla_reg3_bla_reg_26_0
$ sort -t_ -k5,5 -k8,8n file
abc_bla_bla_bla_reg0_bla_reg_1_0
abc_bla_bla_bla_reg0_bla_reg_2_0
abc_bla_bla_bla_reg0_bla_reg_5_0
abc_bla_bla_bla_reg0_bla_reg_10_0
abc_bla_bla_bla_reg0_bla_reg_15_0
abc_bla_bla_bla_reg2_bla_reg_7_0
abc_bla_bla_bla_reg2_bla_reg_9_0
abc_bla_bla_bla_reg2_bla_reg_15_0
abc_bla_bla_bla_reg3_bla_reg_3_0
abc_bla_bla_bla_reg3_bla_reg_5_0
abc_bla_bla_bla_reg3_bla_reg_26_0
That may or may not produce the output you expect if the regN value in the 5th column can include 2-digit numbers.
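If that is a concern, you can sort that key numerically too, starting at its first digit (character 4 of field 5, right after reg), using sort's field.character offsets:

$ sort -t_ -k5.4,5n -k8,8n file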
Using awk
$ awk -F"_" 'function print_array(arr,max){ for(i=1; i<=max; i++) if(a[i]) { print a[i]; a[i]="" } } key==$5{a[$8]=$0; max=$8>max?$8:max} key!=$5{print_array(a,max); key=$5; a[$8]=$0; max=$8} END{print_array(a,max)}' file
Output:
abc_bla_bla_bla_reg0_bla_reg_1_0
abc_bla_bla_bla_reg0_bla_reg_2_0
abc_bla_bla_bla_reg0_bla_reg_5_0
abc_bla_bla_bla_reg0_bla_reg_10_0
abc_bla_bla_bla_reg0_bla_reg_15_0
abc_bla_bla_bla_reg2_bla_reg_7_0
abc_bla_bla_bla_reg2_bla_reg_9_0
abc_bla_bla_bla_reg2_bla_reg_15_0
abc_bla_bla_bla_reg3_bla_reg_3_0
abc_bla_bla_bla_reg3_bla_reg_5_0
abc_bla_bla_bla_reg3_bla_reg_26_0
Explanation:
awk -F"_" '
function print_array(arr,max)      # print the stored lines in index order, 1..max
{
    for(i=1; i<=max; i++)
        if(a[i]) {
            print a[i]; a[i]=""    # print, then clear the slot for the next set
        }
}
key==$5 {                          # still in the same record set (same 5th field):
    a[$8]=$0                       # store the line, indexed by field 8
    max = $8>max ? $8 : max
}
key!=$5 {                          # new record set: flush the previous one first
    print_array(a,max); key=$5; a[$8]=$0; max=$8
}
END { print_array(a,max) }         # flush the last record set
' file
key==$5 { a[$8]=$0; max=$8>max?$8:max }: key denotes the 5th field, e.g. reg0 on the first line. Initially key is null, so the first line satisfies the condition below, key!=$5. If the 5th field $5 matches the key set on the previous line, the record is pushed into the array, with the value of field 8 (the one you want to sort on) as the index. This works irrespective of the number of digits in $8.
key!=$5 { print_array(a,max); key=$5; a[$8]=$0; max=$8 }: if key doesn't match the 5th field, a new record set has started, so before proceeding we print the array stored for the previous record set.
END { print_array(a,max) } prints the last record set.
Alternatively, GNU sort's version sort handles the embedded numbers directly:
$ sort -V file
-V, --version-sort
natural sort of (version) numbers within text

Awk substring doesn't yield expected result

I've a file whose content is below:
C2:0301,353458082243570,353458082243580,0;
C2:0301,353458082462440,353458082462450,0;
C2:0301,353458082069130,353458082069140,0;
C2:0301,353458082246230,353458082246240,0;
C2:0301,353458082559320,353458082559330,0;
C2:0301,353458080153530,353458080153540,0;
C2:0301,353458082462670,353458082462680,0;
C2:0301,353458081943950,353458081943960,0;
C2:0301,353458081719070,353458081719080,0;
C2:0301,353458081392470,353458081392490,0;
Field 2 and Field 3 (considering , as the separator) contain 15-digit IMEI numbers that define ranges, not individual IMEIs. The usual IMEI format is 8 digits (TAC) + 6 digits (serial number) + a padded 0. The 6-digit serial number part defines the start and end of the range, everything else staying the same. So in order to expand the ranges into individual IMEIs (which is exactly what I want), I need to increment from the serial number of the starting IMEI in Field 2 up to the serial number of the ending IMEI in Field 3. I am using the below AWK script:
awk -F"," '{v = substr($2,9,6); t = substr($3,9,6); while(v <= t) printf "%s%0"6"s%s,%s\n", substr($3,1,8),v++,substr($3,15,2),$4;}' TEMP.OUT.merge_range_part1_21
It gives me the below result:
353458082243570,0
353458082243580,0
353458082462440,0
353458082462450,0
353458082069130,0
353458082069140,0
353458082246230,0
353458082246240,0
353458082559320,0
353458082559330,0
353458080153530,0
353458082462670,0
353458082462680,0
353458081943950,0
353458081943960,0
353458081719070,0
353458081719080,0
353458081392470,0
353458081392480,0
353458081392490,0
The above is as expected except for the below line in the result:
353458080153530,0
The result is actually from the below line in the input file:
C2:0301,353458080153530,353458080153540,0;
But the expected output for the above line in input file is:
353458080153530,0
353458080153540,0
I need to know what's going wrong in my script.
The problem with your script is that you start with two string variables, v and t (typed as strings since they are the result of a string operation, substr()), and then convert one to a number with v++, which strips its leading zero. From then on, v <= t is still a string comparison, since a string (t) compared to a number, string, or numeric string always triggers a string comparison. Yes, you can add zero to each of the variables to force a numeric comparison.
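A minimal illustration of the pitfall, using the serial numbers from the failing range (15354 is v after v++ has stripped the leading zero, "015354" is t):

$ awk 'BEGIN { t = "015354"; v = 15354; print (v <= t), (v+0 <= t+0) }'
0 1

But IMHO this is more like what you're really trying to do: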
$ cat tst.awk
BEGIN { FS=","; re="(.{8})(.{6})(.*)" }
{
    match($2, re, beg)
    match($3, re, end)
    for (i = beg[2]; i <= end[2]; i++) {
        printf "%s%06d%s\n", end[1], i, end[3]
    }
}
$ gawk -f tst.awk file
353458082243570
353458082243580
353458082462440
353458082462450
353458082069130
353458082069140
353458082246230
353458082246240
353458082559320
353458082559330
353458080153530
353458080153540
353458082462670
353458082462680
353458081943950
353458081943960
353458081719070
353458081719080
353458081392470
353458081392480
353458081392490
and when done with appropriate variables like that, no conversion is necessary. Note also that with the above you don't need to repeatedly state the same or related offsets to extract the parts of the string you care about; you state the number of characters to skip (8) and the number to select (6) once. The above uses GNU awk for the 3rd arg to match().
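If GNU awk isn't available, here is a sketch of the same idea in plain POSIX awk, using substr() plus an explicit +0 to force the numeric comparison (it mirrors the OP's field layout):

awk -F',' '{
    head = substr($3, 1, 8)      # TAC: first 8 digits
    tail = substr($3, 15)        # trailing padded 0
    beg  = substr($2, 9, 6) + 0  # start serial, as a number
    fin  = substr($3, 9, 6) + 0  # end serial, as a number
    for (i = beg; i <= fin; i++)
        printf "%s%06d%s\n", head, i, tail
}' file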
The problem was in the while(v <= t) part of the script. With leading 0s the comparison was not happening properly, so I ensured the values are cast to numbers when comparing in the while loop. The AWK documentation says you can cast a value to a number by adding 0 to it, i.e. value+0. So while(v <= t) in the awk script needed to change to while(v+0 <= t+0). So the below AWK script:
awk -F"," '{v = substr($2,9,6); t = substr($3,9,6); while(v <= t) printf "%s%0"6"s%s,%s\n", substr($3,1,8),v++,substr($3,15,2),$4;}' TEMP.OUT.merge_range_part1_21
was changed to :
awk -F"," '{v = substr($2,9,6); t = substr($3,9,6); while(v+0 <= t+0) printf "%s%0"6"s%s,%s\n", substr($3,1,8),v++,substr($3,15,2),$4;}' TEMP.OUT.merge_range_part1_21
That one change got me the expected output for the failing case. For example, this line in my input file:
C2:0301,353458080153530,353458080153540,0;
now gives me the individual IMEIs:
353458080153530,0
353458080153540,0
Use an if statement that checks for a leading zero in variable v, re-padding y accordingly:
awk -F"," '{v = substr($2,9,6); t = substr($3,9,6); while(v <= t) { printf "%s%s%s,%s\n", substr($3,1,8), v, substr($3,15,2), $4; if (substr(v,1,1)=="0") { v++; y="0"v } else { v++; y=v }; v=y } }' TEMP.OUT.merge_range_part1_21
Make sure that the body of the while loop is contained in braces, that the line is printed before v is incremented, and that v is re-padded within the if branches.
Setting v=y at the end of each iteration restores the zero-padding, so the string comparison in the while condition keeps working on later increments. (Note this re-pads a single leading zero only, which is enough for these serial numbers; values with two or more leading zeros would need sprintf("%06d", v) instead.)

Subtract fields from duplicate lines

I have a file with two columns. The first column is a string, the second is a positive number. If the first field (string) doesn't have a duplicate in the file (so the first field is unique in the file), I want to copy that unique line to (let's say) result.txt. If the first field does have a duplicate, then I want to subtract the second fields (numbers) of those duplicated lines and save that in result.txt too. By the way, a first field will have one duplicate at most, no more than that. So the output file will have all lines with unique values of the first field, plus, for each duplicated name, a line with the name and the difference of the two values. The files are not sorted. Here is an example:
INPUT FILE:
hello 7
something 8
hey 9
hello 8
something 12
nathanforyou 23
OUTPUT FILE that I need (result.txt):
hello 1
something 4
hey 9
nathanforyou 23
I can't have negative numbers in the resulting file, so I have to subtract the smaller number from the bigger one. What have I tried so far? All kinds of sort (I figured out how to find non-duplicated lines and put them in a separate file, but choked on the subtraction of duplicates), arrays in awk (I saved all lines in an array and wrote a for loop... the problem is that I don't know how to get the second field out of an array element that is a whole line), etc. By the way, the problem is more complicated than I described (I have four fields, the first two are the same, and so on), but in the end it comes down to this.
$ cat tst.awk
{ val[$1,++cnt[$1]] = $2 }
END {
    for (name in cnt) {
        if ( cnt[name] == 1 ) {
            print name, val[name,1]
        }
        else {
            val1 = val[name,1]
            val2 = val[name,2]
            print name, (val1 > val2 ? val1 - val2 : val2 - val1)
        }
    }
}
$ awk -f tst.awk file
hey 9
hello 1
nathanforyou 23
something 4
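As with any for (name in ...) loop in awk, the output order is unspecified, which is why the lines above come out unordered; pipe through sort if you need result.txt ordered:

$ awk -f tst.awk file | sort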
