Sort numeric values in a string of text - sorting

I tried some sort examples but can't find a way to solve this. I think I should find the right separator and then sort numerically, but it doesn't work as I'd like.
This is my file:
abc_bla_bla_bla_reg0_bla_reg_1_0
abc_bla_bla_bla_reg0_bla_reg_5_0
abc_bla_bla_bla_reg0_bla_reg_2_0
abc_bla_bla_bla_reg0_bla_reg_10_0
abc_bla_bla_bla_reg0_bla_reg_15_0
abc_bla_bla_bla_reg2_bla_reg_15_0
abc_bla_bla_bla_reg2_bla_reg_9_0
abc_bla_bla_bla_reg2_bla_reg_7_0
abc_bla_bla_bla_reg3_bla_reg_26_0
abc_bla_bla_bla_reg3_bla_reg_3_0
abc_bla_bla_bla_reg3_bla_reg_5_0
And this is my desired result:
abc_bla_bla_bla_reg0_bla_reg_1_0
abc_bla_bla_bla_reg0_bla_reg_2_0
abc_bla_bla_bla_reg0_bla_reg_5_0
abc_bla_bla_bla_reg0_bla_reg_10_0
abc_bla_bla_bla_reg0_bla_reg_15_0
abc_bla_bla_bla_reg2_bla_reg_7_0
abc_bla_bla_bla_reg2_bla_reg_9_0
abc_bla_bla_bla_reg2_bla_reg_15_0
abc_bla_bla_bla_reg3_bla_reg_3_0
abc_bla_bla_bla_reg3_bla_reg_5_0
abc_bla_bla_bla_reg3_bla_reg_26_0

$ sort -t_ -k5,5 -k8,8n file
abc_bla_bla_bla_reg0_bla_reg_1_0
abc_bla_bla_bla_reg0_bla_reg_2_0
abc_bla_bla_bla_reg0_bla_reg_5_0
abc_bla_bla_bla_reg0_bla_reg_10_0
abc_bla_bla_bla_reg0_bla_reg_15_0
abc_bla_bla_bla_reg2_bla_reg_7_0
abc_bla_bla_bla_reg2_bla_reg_9_0
abc_bla_bla_bla_reg2_bla_reg_15_0
abc_bla_bla_bla_reg3_bla_reg_3_0
abc_bla_bla_bla_reg3_bla_reg_5_0
abc_bla_bla_bla_reg3_bla_reg_26_0
That may or may not produce the output you expect if the regN value in the 5th field can include 2-digit numbers.
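If that is a concern, GNU sort lets a key start at a character offset within a field, so the digits after the literal reg prefix are compared numerically (a hedged variant, assuming the prefix is always exactly the three characters reg):
$ sort -t_ -k5.4,5n -k8,8n file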

Using awk:
$ awk -F"_" 'function print_array(arr,max, i){for(i=1;i<=max;i++) if(i in arr){print arr[i]; delete arr[i]}} key==$5{a[$8]=$0; max=$8>max?$8:max} key!=$5{print_array(a,max); key=$5; a[$8]=$0; max=$8} END{print_array(a,max)}' file
Output:
abc_bla_bla_bla_reg0_bla_reg_1_0
abc_bla_bla_bla_reg0_bla_reg_2_0
abc_bla_bla_bla_reg0_bla_reg_5_0
abc_bla_bla_bla_reg0_bla_reg_10_0
abc_bla_bla_bla_reg0_bla_reg_15_0
abc_bla_bla_bla_reg2_bla_reg_7_0
abc_bla_bla_bla_reg2_bla_reg_9_0
abc_bla_bla_bla_reg2_bla_reg_15_0
abc_bla_bla_bla_reg3_bla_reg_3_0
abc_bla_bla_bla_reg3_bla_reg_5_0
abc_bla_bla_bla_reg3_bla_reg_26_0
Explanation:
awk -F"_" '
# print_array: print the buffered lines in numeric index order (i = 1 to max),
# deleting each entry so the array is empty for the next group
function print_array(arr, max,   i) {
    for (i = 1; i <= max; i++)
        if (i in arr) {
            print arr[i]
            delete arr[i]
        }
}
key == $5 { a[$8] = $0; max = $8 > max ? $8 : max }
key != $5 { print_array(a, max); key = $5; a[$8] = $0; max = $8 }
END { print_array(a, max) }
' file
key == $5 { ... }: key denotes the 5th field, e.g. reg0 in line one. Initially key is null, so the very first line satisfies the condition below, key != $5. If the 5th field matches the key set on the previous line, push the record into the array, where the array index is the value of field 8, the number you want to sort your results on. This works irrespective of the number of digits in $8.
key != $5 { ... }: if the key doesn't match the 5th field, a new record set has started; before proceeding further, print the array stored for the previous record set, reset key, and start buffering the new set.
END { ... }: prints the last record set.

sort -V file
-V, --version-sort
natural sort of (version) numbers within text
Since these lines differ only in their embedded numbers, GNU sort's version sort alone produces the desired order.

Related

bash: identifying the first value in a list that also exists in another list

I have been trying to come up with a nice way in bash to find the first entry in list A that also exists in list B, where A and B are in separate files.
A B
1024dbeb 8e450d71
7e474d46 8e450d71
1126daeb 1124dae9
7e474d46 7e474d46
1124dae9 3217a53b
In the example above, 7e474d46 is the first entry in A that also appears in B, so I would return 7e474d46.
Note: A can be millions of entries, and B can be around 300.
awk is your friend. NR==FNR is true only while reading the first file named on the command line (here file2, i.e. list B), so its values are collected into the array a; then, while reading file1 (list A), the first value found in a is printed and awk exits.
awk 'NR==FNR{a[$1]++;next}{if(a[$1]>=1){print $1;exit}}' file2 file1
7e474d46
Note: check the [previous version] of this answer too, which assumed that the values were listed as two columns in a single file. This one was written after you clarified in [this] comment that the values are fed as two files.
A few points are unclear, though, such as what should happen if a number in list A appears 2 or more times (in your example, 7e474d46 itself appears twice). Assuming you need all the line numbers of list A that are present in list B, the following will help:
awk '{col1[$1]=col1[$1]?col1[$1]","FNR:FNR;col2[$2];} END{for(i in col1){if(i in col2){print col1[i],i}}}' Input_file
OR(NON-one liner form of above solution)
awk '{
    col1[$1] = col1[$1] ? col1[$1] "," FNR : FNR;
    col2[$2];
}
END{
    for (i in col1) {
        if (i in col2) {
            print col1[i], i
        }
    }
}
' Input_file
Above code will provide following output.
3,5 7e474d46
6 1124dae9
We create array col1, whose index is the first field, and array col2, whose index is $2. col1's value is the current line number, concatenated onto any line numbers already stored for that index. In the END section we traverse col1 and check whether each index is also present in col2; if so, we print col1's value (the line numbers) and the index.
If you have GNU grep, you can try this:
grep -m 1 -f B A
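By default, -f B treats each line of B as a substring pattern, so a short ID could match inside a longer one. With GNU grep you can tighten this to whole-line, fixed-string matching (a hedged variant, not part of the original answer):
grep -m 1 -x -F -f B A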

finding maximum from partial string

I have a list where the first 8 digits are a date in the format yyyymmdd. The next 6 digits are a timestamp (hhmmss). I want to select only those numbers which hold the maximum timestamp for each day.
20160905092900
20160905212900
20160906092900
20160906213000
20160907093000
20160907213000
20160908093000
20160908213000
20160910093000
20160910213100
20160911093100
20160911213100
20160912093100
That is, from the above list, the output should give the list below.
20160905212900
20160906213000
20160907213000
20160908213000
20160910213100
20160911213100
20160912093100
$ sort -r file | awk '!seen[substr($0,1,8)]++' | sort
20160905212900
20160906213000
20160907213000
20160908213000
20160910213100
20160911213100
20160912093100
If the file's already sorted you can use tac instead of sort -r.
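For a file already sorted in ascending order, that pipeline would look like this (a sketch under that assumption; the final tac just restores ascending order):
$ tac file | awk '!seen[substr($0,1,8)]++' | tac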
You can use awk:
awk '{
  dt = substr($0, 1, 8)  # yyyymmdd date part
  ts = substr($0, 9, 6)  # hhmmss timestamp part
}
ts > max[dt] {
  max[dt] = ts
  rec[dt] = $0
}
END {
  for (i in rec)
    print rec[i]
}' file
20160905212900
20160906213000
20160907213000
20160908213000
20160910213100
20160911213100
20160912093100
We are using associative array max, which uses the first 8 characters (the date) as key and the next 6 characters (the timestamp) as value. This array stores the max timestamp value seen so far for a given date. Another array rec stores the full line for a date whenever we encounter a timestamp greater than the value stored in max.
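One caveat: for (i in rec) visits indices in an unspecified order, so the output is not guaranteed to come out date-sorted in every awk. With GNU awk you can request sorted traversal (an assumption: PROCINFO["sorted_in"] is a gawk-only extension); with other awks, piping the output through sort achieves the same:
END {
  PROCINFO["sorted_in"] = "@ind_str_asc"  # visit keys in ascending string order
  for (i in rec)
    print rec[i]
}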

Edit fields in csv files using bash

I have a bunch of csv files that need "cleaning".
Specifically, there is a column that contains timestamp values; however, some lines have a value of '1' instead.
What I wish to do is replace those 1's with the last valid (timestamp) value, i.e. replace the value of the i-th line with that of line i-1.
I provide a sample of the file
URL192.168.2.2,420042,20/07/2015 09:40:00,168430081,168430109
URL192.168.2.2,420042,20/07/2015 09:40:00,3232236038,3232236034
URL192.168.2.2,420042, 1,168430081,168430109
URL192.168.2.2,420042,20/07/2015 09:40:01,3232236038,3232236034
So in this example, the 1 must be replaced with 20/07/2015 09:40:00. I tried it using awk but couldn't nail it.
Assuming no commas in the other fields, an awk program like this should work:
BEGIN { FS = OFS = "," }
$3!=1 { prev = $3 }
$3==1 { $3 = prev }
{ print }
Warning: this is untested code.
The first line sets the field separator to a comma, for both input and output. The second line saves the timestamp of every row that has a timestamp in the third field. The third line writes the most recently saved timestamp to every row that doesn't have a timestamp in the third field. And the fourth line writes every input line, whether modified or not, to the output.
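If it helps, here is the same program in one-liner form, the way you might invoke it (the file names are placeholders):
awk -F, -v OFS=, '$3!=1 {prev=$3} $3==1 {$3=prev} {print}' input.csv > cleaned.csv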
Let me know how you get on.

Change date and data cells in .csv file progressively

I have a file that I'm trying to get ready for my boss in time for his manager's meeting tomorrow morning at 8:00 AM GMT-8. I want to retroactively change the dates in non-consecutive rows of this .csv file (truncated):
,,,,,
,,,,,sideshow
,,,
date_bob,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
bob_available,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383
bob_used,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312
,,,
date_mel,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
mel_available,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537
mel_used,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159
,,,
date_sideshow-ws2,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
sideshow-ws2_available,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239
sideshow-ws2_used,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441
,,,
,,,,,simpsons
,,,
date_bart,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
bart_available,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559
bart_used,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117
,,,
date_homer,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
homer_available,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799
homer_used,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877
,,,
date_lisa,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
lisa_available,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899
lisa_used,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777
In other words a row that now reads:
date_lisa,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
would desirably read:
date_lisa,09-04-14,09-05-14,09-06-14,09-07-14,09-08-14,09-09-14,09-10-14,09-11-14,09-12-14,09-13-14,09-14-14,09-15-14,09-16-14,09-17-14
I'd like to make the daily available numbers smaller at the beginning and have them get progressively bigger day by day. This means the used rows will also have to start proportionately smaller and grow progressively, in lock step with the available rows.
Not by a large amount; don't make it look obvious, just a few GB here and there. I plan to make pivot tables and graphs out of this, so it has to vary a little. BTW the numbers are all in MB, as I generated them using df -m.
Thanks in advance if anyone can help me.
The following awk does what you need:
awk -F, -v OFS=, '
/^date/ {
split ($2, date, /-/);
for (i=2; i<=NF; i++) {
$i = date[1] "-" sprintf ("%02d", date[2] - NF + i) "-" date[3]
}
}
/available|used/ {
for (i=2; i<=NF; i++) {
$i = int (($i*i)/NF)
}
}1' csv
Set the input and output field separator to ,.
On lines that start with date, we split the second column to find the date parts (month, day, year).
We then iterate from the second column to the end of the line, setting each column to a newly calculated date that counts backwards from the current date using the total number of fields; with NF = 15, field 2 becomes day 17 - 15 + 2 = 4 and field 15 keeps day 17. This has the caveat of not rolling over different months correctly.
On the available and used lines, we iterate from the second column to the end of the line and scale each value by i/NF, making the values progressively greater until they match the original value in the last field; for example, int(531383*2/15) = 70851 in field 2, climbing back to 531383 in the last field.
All other lines remain as is and get printed along with the modified lines (the trailing 1).
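If the month rollover matters, gawk's time functions can step backwards through real calendar dates instead (a sketch, assuming gawk, whose mktime and strftime are extensions, and assuming the two-digit years are all 20xx); it would replace the /^date/ block above:
/^date/ {
    split($2, d, /-/)
    # mktime wants "YYYY MM DD HH MM SS"; the file stores MM-DD-YY.
    # Noon is used to sidestep DST edge cases.
    base = mktime("20" d[3] " " d[1] " " d[2] " 12 0 0")
    for (i = 2; i <= NF; i++)
        $i = strftime("%m-%d-%y", base - (NF - i) * 86400)
}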

Removing the subscript/Index of an array in awk

I am using awk's concept of storing values as the subscripts/indexes of an array. Please have a look at the code below:
stringVariable="hi,bye,cool.hot,how,see";
split(stringVariable,stringArray,",");
#This loop will iterate and stores the RIDs in the requestIds variable into an array
for(tr=1;tr<=length(stringArray);tr++)
{
Count++;
referenceIdArray[stringArray[tr]]++;
}
So my referenceIdArray will have hi, bye, cool, hot, how, and see as its subscripts.
Let me consider a sample file which has the following values:
hi
bye
gone
My aim is to read the values from the file, match them against the array declared previously, and print any value that matches.
awk script:
awk '
BEGIN {
    # array loading done previously
}
$0 in referenceIdArray { print $0 }
' file
So this gives me the desired result. But assuming that "hi" appears only once in the array, when the action block finds the value, the value should be printed and the corresponding entry referenceIdArray["hi"] should also be removed, to make the search efficient. Since the values are stored as subscripts, I am not sure how to remove an entry. Any suggestions regarding this? Thank you.
You can remove an individual element of an array using the delete statement:
delete array[index]
ref: http://www.math.utah.edu/docs/info/gawk_12.html
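A minimal sketch putting the two together, assuming the goal is to print each matching value once and drop its subscript afterwards:
awk '
BEGIN {
    n = split("hi,bye,cool,hot,how,see", stringArray, ",")
    for (tr = 1; tr <= n; tr++)
        referenceIdArray[stringArray[tr]]++
}
$0 in referenceIdArray {
    print $0
    delete referenceIdArray[$0]  # remove the subscript so it cannot match again
}
' file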
