Trying to sort a text file with dates in brackets with "sort" - bash

I'm trying to sort a text file by date.
My file format is:
...
[15/08/2019 - 01:58:49] some text here
[15/08/2019 - 02:21:23] more text here
[15/08/2019 - 02:56:11] blah blah blah
...
I've tried multiple different methods with the sort command.
One attempt: "sort -b --key=1n --debug Final_out.txt"
sort: using ‘en_US.UTF-8’ sorting rules
sort: key 1 is numeric and spans multiple fields
sort: option '-b' is ignored
^ no match for key
^ no match for key
...
__
.?
^ no match for key
__
.?
^ no match for key
__
sort: write failed: 'standard output': Input/output error
sort: write error
Second attempt: "sort -n -b --key=10,11 --debug Final_out.txt"
This produced the same output as above.
Just about to tear my hair out. This has to be possible, it's Linux! Can someone kindly give me some pointers?

As Shawnn suggests, how about a bash solution:
#!/bin/bash
pat='^\[([0-9]{2})/([0-9]{2})/([0-9]{4})[[:blank:]]+-[[:blank:]]+([0-9]{2}:[0-9]{2}:[0-9]{2})\]'
while IFS= read -r line; do
    if [[ $line =~ $pat ]]; then
        m=( "${BASH_REMATCH[@]}" )   # make a copy just to shorten the variable name
        echo -e "${m[3]}${m[2]}${m[1]}_${m[4]}\t$line"
    fi
done < file.txt | sort -t $'\t' -k1,1 | cut -f2-
The variable pat is a regular expression that matches the date and time field;
the capture groups fill BASH_REMATCH with day, month, year and time, in that order.
After extracting the date and time, the loop builds a new string composed of
year, month, day and time, a sortable order, and prepends it to the current
line, delimited with a tab (a classic decorate-sort-undecorate).
The decorated lines are then piped to sort, keyed on the 1st field.
Finally the 1st field is cut off.
The input file file.txt:
[10/01/2020 - 01:23:45] lorem ipsum
[15/08/2019 - 02:21:23] more text here
[15/08/2019 - 02:56:11] blah blah blah
[15/08/2019 - 01:58:49] some text here
[14/08/2019 - 12:34:56] dolor sit amet
Output:
[14/08/2019 - 12:34:56] dolor sit amet
[15/08/2019 - 01:58:49] some text here
[15/08/2019 - 02:21:23] more text here
[15/08/2019 - 02:56:11] blah blah blah
[10/01/2020 - 01:23:45] lorem ipsum
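Incidentally, the original attempts failed because a numeric key applied to a field like [15/08/2019 finds no digits at the start (the leading [ stops -n immediately), which is what the "no match for key" debug lines mean. Since the timestamp is fixed-width, the decorate step can even be skipped: GNU sort can key directly on character positions within /-separated fields. A sketch, assuming every line begins with the bracketed timestamp exactly as shown:
sort -t/ -k3.1,3.4n -k2,2n -k1.2,1.3n -k3.8,3.15 file.txt
With -t/, field 1 is [15, field 2 is 08, and field 3 is 2019 - 01:58:49] some text here, so 3.1-3.4 is the year, 2 the month, 1.2-1.3 the day, and 3.8-3.15 the fixed-width HH:MM:SS time.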

Here is an alternative, shorter way to sort, using GNU awk:
cat file
[10/01/2020 - 01:23:45] lorem ipsum
[15/08/2019 - 02:21:23] more text here
[15/08/2019 - 02:56:11] blah blah blah
[15/08/2019 - 01:58:49] some text here
[14/08/2019 - 12:34:56] dolor sit amet
Use this awk (FPAT='[0-9:]+' makes the fields $1=day, $2=month, $3=year, $4=time, so the array subscript $3,$2,$1,$4 is a year-first, sortable key; "@ind_str_asc" iterates the indices in ascending string order):
awk -v FPAT='[0-9:]+' '{ map[$3,$2,$1,$4] = $0 }
END { PROCINFO["sorted_in"]="@ind_str_asc"; for (k in map) print map[k] }' file
[14/08/2019 - 12:34:56] dolor sit amet
[15/08/2019 - 01:58:49] some text here
[15/08/2019 - 02:21:23] more text here
[15/08/2019 - 02:56:11] blah blah blah
[10/01/2020 - 01:23:45] lorem ipsum
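If you'd rather not rely on PROCINFO, a minimal equivalent sketch using gawk's asorti() to sort the keys explicitly:
gawk -v FPAT='[0-9:]+' '{ map[$3,$2,$1,$4] = $0 }
END { n = asorti(map, keys); for (i = 1; i <= n; i++) print map[keys[i]] }' file
asorti() copies the indices of map into keys[1..n] in ascending string order, the same ordering that "@ind_str_asc" produces.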

I had the same issue with my history output when using HISTTIMEFORMAT="%d/%m/%y %T ".
To sort by year, month and day, I used these options in sort:
before
history | awk '/0[78]\/06/{print" "$1" "$2" "$3" command number "NR}'|head -20
1921 07/06/22 09:21:05 command number 925
1922 07/06/22 13:23:31 command number 926
1923 07/06/22 13:24:16 command number 927
1924 07/06/22 13:23:31 command number 928
1925 07/06/22 13:24:16 command number 929
1926 08/06/22 10:59:12 command number 930
1927 08/06/22 10:59:21 command number 931
1928 08/06/22 10:59:26 command number 932
1929 08/06/22 10:59:27 command number 933
1930 08/06/22 10:59:34 command number 934
1931 08/06/22 10:59:44 command number 935
1932 08/06/22 11:01:47 command number 936
1933 08/06/22 11:03:35 command number 937
1934 08/06/22 11:03:44 command number 938
1935 08/06/22 11:03:48 command number 939
1936 08/06/22 11:04:02 command number 940
1937 08/06/22 11:12:17 command number 941
1938 07/06/22 13:24:16 command number 942
1939 08/06/22 09:22:10 command number 943
1940 08/06/22 09:29:41 command number 944
after
history | awk '/0[78]\/06/{print" "$1" "$2" "$3" command number "NR}'|head -20|sort -bn -k2.7,2.8 -k2.4,2.5 -k2.1,2.2 -k3.1,3.2 -k3.4,3.5 -k3.7,3.8 -k1
1921 07/06/22 09:21:05 command number 925
1922 07/06/22 13:23:31 command number 926
1924 07/06/22 13:23:31 command number 928
1923 07/06/22 13:24:16 command number 927
1925 07/06/22 13:24:16 command number 929
1938 07/06/22 13:24:16 command number 942
1939 08/06/22 09:22:10 command number 943
1940 08/06/22 09:29:41 command number 944
1926 08/06/22 10:59:12 command number 930
1927 08/06/22 10:59:21 command number 931
1928 08/06/22 10:59:26 command number 932
1929 08/06/22 10:59:27 command number 933
1930 08/06/22 10:59:34 command number 934
1931 08/06/22 10:59:44 command number 935
1932 08/06/22 11:01:47 command number 936
1933 08/06/22 11:03:35 command number 937
1934 08/06/22 11:03:44 command number 938
1935 08/06/22 11:03:48 command number 939
1936 08/06/22 11:04:02 command number 940
1937 08/06/22 11:12:17 command number 941
Explanation of the sort -bn -k2.7,2.8 -k2.4,2.5 -k2.1,2.2 -k3.1,3.2 -k3.4,3.5 -k3.7,3.8 -k1 command:
-b ignores leading blanks
-n compares numerically
-k2.7,2.8 is the 2nd key (the date), from the 7th to the 8th character (yy)
and so on for the remaining pieces of keys 2 and 3 (the time)
And, for @Ventus, the solution can be sort -n -k1.9,1.12 -k1.5,1.6 -k1.2,1.3 -k3.1,3.2 -k3.4,3.5 -k3.7,3.8
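To see the character-position keys in isolation, here is a tiny synthetic demo (two made-up history lines, not from my actual history):
$ printf '%s\n' '2 08/06/22 09:00:00' '1 07/06/22 13:00:00' | sort -bn -k2.7,2.8 -k2.4,2.5 -k2.1,2.2 -k3.1,3.2 -k3.4,3.5 -k3.7,3.8 -k1
1 07/06/22 13:00:00
2 08/06/22 09:00:00
The yy/mm/dd keys put 07/06/22 before 08/06/22 even though sorting on the leading number alone would give the opposite order.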

Related

Replacing the value of specific field in a table-like string stored as bash variable

I am looking for a way to replace (with 0) a specific value (1043252782) in a "table-like" string stored as a bash variable. The output of echo "$var" looks like this:
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 1043252782
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
After the replacement echo "$var" should look like this:
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 0
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
Is there a way to do this without saving the content of $var to a file and directly manipulating it within the bash (shell script)?
Maybe with awk? I can select the value in the 10th field of the second record with awk and pattern matching ("7 Seek_Error_Rate ....") like this:
echo "$var" | awk '/^ 7/{print $10}'
Maybe there is some way of doing it with awk (or another cli tool) to replace it and store it back into $var? Also, the value changes over time, but the structure remains the same (the value is always in the 10th field).
You can change a specific string directly in the shell:
var=${var/1043252782/0}
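Note that ${var/pattern/string} replaces only the first occurrence; if the value could appear on more than one line, the double-slash form replaces every occurrence:
var=${var//1043252782/0}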
To replace the final number of the second line, you could use awk or sed:
var=$(awk 'NR==2 { sub(/[0-9]+$/,0) }1' <<<"$var")
var=$(sed '2s/[0-9][0-9]*$/0/' <<<"$var")
If you don't know which line it will be, you can match a known string:
var=$(awk '/Seek_Error_Rate/{ sub(/[0-9]+$/,0) }1' <<<"$var")
var=$(sed '/Seek_Error_Rate/s/[0-9][0-9]*$/0/' <<<"$var")
You can use a here-string to feed the variable as input to awk.
Use sub() to perform a regular expression replacement.
var=$(awk '{sub(/1043252782$/, "0")}1' <<<"$var")
Using sed (the empty pattern in s//0/ reuses the address regex /1043252782$/):
$ var=$(sed '/1043252782$/s//0/' <<< "$var")
$ echo "$var"
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 0
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
If you don't want to ruin the formatting of tabs and spaces:
{m,g}awk NF=NF FS=' 1043252782$' OFS=' 0'
which yields:
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 0
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
or doing the whole file in one single shot :
awk NF=NF FS=' 1043252782\n' OFS=' 0\n' RS='^$' ORS=
awk NF=NF FS=' 1043252782\n' OFS=' 0\n' RS= (this might work too, but I'm not too well versed in any side effects of a blank RS)

Display row/column data for csv with max value in another column, same row (bash)

I'm trying to make a script that sorts column 2 for highest value, prints said value, and prints column 3 for every row matching this value. Here's an example of unsorted csv:
Argentina,4.6,2016,some data
Argentina,4.2,2018,some data
Argentina,4.6,1998,some data
Argentina,4.5,2001,some data
Desired output would be:
4.6
2016
1998
Here's what I've got so far, but I'm feeling unsure if I'm going about it correctly:
grep "$2*" "$1"> new.csv
sort -t, -k2,2nr new.csv > new2.csv
cut -f3 -d"," new2.csv
Wherein $2 is the name of country in first column and $1 is the filename. While it sorts the values in the 2nd column just fine, I'd like to show the years for only the rows with max value in column 2. This route just prints the years for all of the rows, and I understand why that's happening, but not sure the best course to get the intended result from there. What are some ways to go about this? Thanks in advance
You could do something like this:
declare maxvalue_found=no
declare maxvalue=''
while read -r line; do
    IFS=',' read -r country value year data <<< "$line"
    if [[ "${maxvalue_found}" == no ]]; then
        echo "$value"
        maxvalue="${value}"
        maxvalue_found=yes
    fi
    if [[ "${value}" == "${maxvalue}" ]]; then
        echo "$year"
    fi
done < new2.csv
new2.csv is your sorted file: we simply read it line by line, then split each line on ',' (https://www.gnu.org/software/bash/manual/bash.html#Word-Splitting):
The first value should be the highest, due to the sort.
Each following value must be tested, because you only want those that match the maximum.
The years are printed in the same order as in new2.csv.
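A quick check, assuming the loop above is saved in a (hypothetical) maxvals.sh and new2.csv comes from the sort in the question:
$ sort -t, -k2,2nr unsorted.csv > new2.csv
$ bash maxvals.sh
4.6
2016
1998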
Assumptions:
comma only shows up as a field delimiter (ie, comma is not part of any data)
no sorting requirement has been defined for the final result
One awk idea requiring 2 passes through the unsorted file:
awk -F, ' # set input field delimiter as comma
FNR==NR { max=($2>max ? $2 : max); next} # 1st pass of file (all rows): keep track of max value from field #2
FNR==1 { print max } # 2nd pass of file (1st row ): print max
$2==max { print $3 } # 2nd pass of file (all rows): if field #2 matches "max" then print field #3
' unsorted.csv unsorted.csv
This generates:
4.6
2016
1998
Another GNU awk idea that requires a single pass through the unsorted file:
awk -F, ' # set input field delimiter as comma
{ arr[$2][$3] # save fields #2 and #3 as indices in array "arr[]"
max = ( $2 > max ? $2 : max) # keep track of max value from field #2
}
END { print max # after file has been processed ... print max and then ...
for (i in arr[max]) # loop through indices of 2nd dimension where 1st dimension == max
print i # print 2nd dimension index (ie, field #3)
}
' unsorted.csv
This generates:
4.6
1998
2016
NOTES:
GNU awk required for arrays of arrays (ie, multidimensional arrays)
while field #3 appears to be sorted, this is not guaranteed unless the code explicitly sorts the 2nd dimension of the array (see the sketch after these notes)
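A sketch of that modification: setting PROCINFO["sorted_in"] just before the loop makes the years come out in guaranteed ascending numeric order:
awk -F, '
{ arr[$2][$3]                                # save fields #2 and #3 as indices in array "arr[]"
  max = ( $2 > max ? $2 : max)               # keep track of max value from field #2
}
END { print max
      PROCINFO["sorted_in"] = "@ind_num_asc" # iterate 2nd-dimension indices numerically, ascending
      for (i in arr[max])
          print i
}
' unsorted.csv
This prints 4.6, then 1998, then 2016, regardless of input order.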
How about a single pass in awk instead of multi-pass? I've generated a synthetic version of the file, randomizing some of the data, to create a 6.24 million row version of it:
INPUT
out9: 177MiB 0:00:01 [ 105MiB/s] [ 105MiB/s] [ <=> ]
rows = 6243584. | UTF8 chars = 186289540. | bytes = 186289540.
CODE (a gawk profile listing; the leading numbers are execution counts, not line numbers)
The default value is initialized to the gigantically negative value -2^512, or more elegantly, -4^4^4, to ensure it always takes on the value on row 1. If you really want to play it safe, make it something very close to negative infinity, e.g. -(3+4+1)^341, -16^255, -256^127, or -1024^102:
{m,g}awk '
BEGIN {
1 _= -(_^= __= _+= _^= FS= OFS = ",")^_^_
1 ___= split("",____)
}
# Rule(s)
6243584 +_ <= +$__ { # 2992
2992 __= $(NF = ++__)
2992 if ((+_)< +$--NF) {
7 _= $NF
7 ___= split("",____)
}
2992 ____[++___]=__
2992 __=NF
}
END {
1 print _
2984 for (__^=_<_; __<=___; __++) {
2984 print ____[__]
}
}
OUTPUT (column 3 printed exactly in input row order):
53.6 1834 1999 1866 1938 1886 1973 1968 1921 1984 1957 1891 1864 1992
1998 1853 1950 1985 1962 2018 1897 1979 2020 1954 1995 1980 1900 1997
1856 1975 1851 1853 1988 1897 1973 1875 1917 1861 1912 1912 1954 1871
1952 1877 2003 1886 1863 1899 1897 1853 2013 1956 1965 1854 1873 1915
1983 1961 1965 1979 1919 1970 1946 1843 1856 1954 1965 1831 1926 1964
1994 1969 1831 1945 1942 1971 1988 1879 1998 1986 1844 1846 1994 1894
2008 1851 1877 1979 1970 1852 1942 1889 1986 2013 1905 1932 2021 1944
1866 1892 1940 1989 1907 1982 2016 1966 1975 1831 1851 2003 1980 1963
1869 1983 1972 2013 1972 1948 1843 1928 1959 1911 1844 1920 1943 1864
1985 1978 1855 1986 1975 1880 2001 1914 1877 1900 1964 1995 1992 1968
1868 1974 2012 1827 1849 1849 1992 1942 1884 1876 2021 1866 1977 1857
1866 1937 1920 1983 1915 1887 1890 1852 1871 1972 1903 1944 1943 1957
1844 1932 1854 1890 1891 1866 1923 1924 1941 1845 1907 2019
(further rows truncated for readability)
A single-pass awk that resets the buffer whenever a new maximum appears and appends when the maximum repeats:
$ awk -F, '{
if($2>=m||m=="") {
b= ($2==m?b:$2) ORS $3 # b is the record buffer
m=$2 # m holds the maximum of $2 so far
}
}
END {
print b
}' file
Output:
4.6
2016
1998

reading a file into an array in bash

Here is my code
#!bin/bash
IFS=$'\r\n'
GLOBIGNORE='*'
command eval
'array=($(<'$1'))'
sorted=($(sort <<<"${array[*]}"))
for ((i = -1; i <= ${array[-25]}; i--)); do
echo "${array[i]}" | awk -F "/| " '{print $2}'
done
I keep getting an error that says "line 5: array=($(<)): command not found"
This is my problem.
As a whole my code should read in a file as a command line argument, sort the elements, then print out column 2 of the last 25 lines. I haven't been able to test this far so if there's a problem there too any help would be appreciated.
This is some of what the file contains:
290729 123456
79076 12345
76789 123456789
59462 password
49952 iloveyou
33291 princess
21725 1234567
20901 rockyou
20553 12345678
16648 abc123
16227 nicole
15308 daniel
15163 babygirl
14726 monkey
14331 lovely
14103 jessica
13984 654321
13981 michael
13488 ashley
13456 qwerty
13272 111111
13134 iloveu
13028 000000
12714 michelle
11761 tigger
11489 sunshine
11289 chocolate
11112 password1
10836 soccer
10755 anthony
10731 friends
10560 butterfly
10547 purple
10508 angel
10167 jordan
9764 liverpool
9708 justin
9704 loveme
9610 fuckyou
9516 123123
9462 football
9310 secret
9153 andrea
9053 carlos
8976 jennifer
8960 joshua
8756 bubbles
8676 1234567890
8667 superman
8631 hannah
8537 amanda
8499 loveyou
8462 pretty
8404 basketball
8360 andrew
8310 angels
8285 tweety
8269 flower
8025 playboy
7901 hello
7866 elizabeth
7792 hottie
7766 tinkerbell
7735 charlie
7717 samantha
7654 barbie
7645 chelsea
7564 lovers
7536 teamo
7518 jasmine
7500 brandon
7419 666666
7333 shadow
7301 melissa
7241 eminem
7222 matthew
On Linux you can simply do:
sort -nbr file_to_sort | head -n 25 | awk '{print $2}'
read in a file as a command line argument, sort the elements, then
print out column 2 of the last 25 lines.
From that description of the problem, I suggest:
#! /bin/sh
sort -bn "$1" | tail -n 25 | awk '{print $2}'
As a rule, use the shell to operate on filenames, and never use the
shell to operate on data. Utilities like sort and awk are far
faster and more powerful than the shell when it comes to processing a
file.
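If you genuinely want the lines in a bash array first, as the original script attempts, mapfile (bash 4+) is the idiomatic way to read a file into an array. A sketch along the lines of the original code:
#!/bin/bash
# read the file named by $1 into an array, one line per element
mapfile -t array < "$1"
# sort numerically on column 1, descending, and keep the top 25 lines
mapfile -t top < <(printf '%s\n' "${array[@]}" | sort -bnr | head -n 25)
# print column 2 of each
for line in "${top[@]}"; do
    awk '{print $2}' <<< "$line"
done
But as noted above, piping the file straight through sort | tail | awk is simpler and faster.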

How can I redirect fixed lines to a new file with shell

I know we can use > to redirect I/O to a file, but I want to write a fixed number of lines to each file.
For example, if more something outputs 3210 lines, then I want:
lines 1~1000 in file1
lines 1001~2000 in file2
lines 2001~3000 in file3
lines 3001~3210 in file4.
How can I do it with a shell script?
Thanks.
The split command is what you need.
split -l 1000 your_file.txt "prefix"
Where:
-l - split by lines.
1000 - the number of lines per output file.
your_file.txt - the file you want to split.
prefix - a prefix for the output files' names.
Example for a file of 3210 lines:
# Generate the file
$ seq 3210 > your_file.txt
# Split the file
$ split -l 1000 your_file.txt "prefix"
# Check the output files' names
$ ls prefix*
prefixaa prefixab prefixac prefixad
# Check all files' ending
$ tail prefixa*
==> prefixaa <==
991
992
993
994
995
996
997
998
999
1000
==> prefixab <==
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
==> prefixac <==
2991
2992
2993
2994
2995
2996
2997
2998
2999
3000
==> prefixad <==
3201
3202
3203
3204
3205
3206
3207
3208
3209
3210
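If you want numbered output names closer to file1 ... file4, GNU split (this assumes the coreutils split, not the BSD one) can use numeric suffixes instead:
$ split -l 1000 -d your_file.txt file
$ ls file*
file00 file01 file02 file03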

Read the number of columns using awk/sed

I have the following test file
Kmax Event File - Text Format
1 4 1000
65 4121 9426 12312
56 4118 8882 12307
1273 4188 8217 12309
1291 4204 8233 12308
1329 4170 8225 12303
1341 4135 8207 12306
63 4108 8904 12300
60 4106 8897 12307
731 4108 8192 12306
...
ÿÿÿÿÿÿÿÿ
In this file I want to delete the first two lines and apply some mathematical calculations. For instance, each column i will become $i-(i-1)*number. A script that does this is the following:
#!/bin/bash
if test $1 ; then
if [ -f $1.evnt ] ; then
rm -f $1.dat
sed -n '2p' $1.evnt | (read v1 v2 v3
for filename in $1*.evnt ; do
echo -e "Processing file $filename"
sed '$d' < $filename > $1_tmp
sed -i '/Kmax/d' $1_tmp
sed -i '/^'"$v1"' '"$v2"' /d' $1_tmp
cat $1_tmp >> $1.dat
done
v3=`wc -l $1.dat | awk '{print $1}' `
echo -e "$v1 $v2 $v3" > .$1.dat
rm -f $1_tmp)
else
echo -e "\a!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
echo -e " Event file $1.evnt doesn't exist !!!!!!"
echo -e "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
fi
else
echo -e "\a!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
echo -e "!!!!! Give name for event files !!!!!"
echo -e "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
fi
awk '{print $1, $2-4096, $3-(2*4096), $4-(3*4096)}' $1.dat >$1_Processed.dat
rm -f $1.dat
exit 0
The file won't always have 4 columns. Is there a way to read the number of columns, print this number and apply those calculations?
EDIT: The idea is to have an input file (*.evnt) and convert it to *.dat or any other ascii file (it doesn't really matter) which will include only the numeric columns, and then apply the calculation $i=$i-(i-1)*number. In addition it will keep the number of columns in a variable that will be used by another program. For instance, in the above file, number=4096, and a sample output file is the following:
65 25 1234 24
56 22 690 19
1273 92 25 21
1291 108 41 20
1329 74 33 15
1341 39 15 18
63 12 712 12
60 10 705 19
731 12 0 18
while in the console I will get the message There are 4 detectors.
Finally a new file_processed.dat will be produced, where file is the initial name of awk's input file.
The way it should be executed is the following
./myscript <filename>
where <filename> is the name without the format. For instance, the files will have the format filename.evnt so it should be executed using
./myscript filename
Let's start with this to see if it's close to what you're trying to do:
$ numdet=$( awk -v num=4096 '
NR>2 && NF>1 {
out = FILENAME "_processed.dat"
for (i=1;i<=NF;i++) {
$i = $i-(i-1)*num
}
nf = NF
print > out
}
END {
printf "There are %d detectors\n", nf | "cat>&2"
print nf
}
' file )
There are 4 detectors
$ cat file_processed.dat
65 25 1234 24
56 22 690 19
1273 92 25 21
1291 108 41 20
1329 74 33 15
1341 39 15 18
63 12 712 12
60 10 705 19
731 12 0 18
$ echo "$numdet"
4
Is that it?
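One detail worth noting in that script: printf ... | "cat>&2" pipes the human-readable message to a cat process attached to stderr, so it shows up on the terminal, while the bare print nf goes to stdout, which is what the $( ... ) command substitution captures into numdet.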
Using awk
awk 'NR<=2{next}{for (i=1;i<=NF;i++) $i=$i-(i-1)*4096}1' file
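To keep the offset configurable instead of hard-coding 4096, the same one-liner can take it as a variable:
awk -v num=4096 'NR<=2{next}{for (i=1;i<=NF;i++) $i=$i-(i-1)*num}1' file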
