Sort a text file by line length including spaces

Sort a text file by line length including spaces - bash

I have a CSV file that looks like this
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st. 110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56
I need to sort it by line length including spaces. The following command doesn't
include spaces, is there a way to modify it so it will work for me?
cat $# | awk '{ print length, $0 }' | sort -n | awk '{$1=""; print $0}'

Answer
cat testfile | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2-
Or, to do your original (perhaps unintentional) sub-sorting of any equal-length lines:
cat testfile | awk '{ print length, $0 }' | sort -n | cut -d" " -f2-
In both cases, we have solved your stated problem by moving away from awk for your final cut.
Lines of matching length - what to do in the case of a tie:
The question did not specify whether or not further sorting was wanted for lines of matching length. I've assumed that this is unwanted and suggested the use of -s (--stable) to prevent such lines being sorted against each other, and keep them in the relative order in which they occur in the input.
(Those who want more control of sorting these ties might look at sort's --key option.)
Why the question's attempted solution fails (awk line-rebuilding):
It is interesting to note the difference between:
echo "hello awk world" | awk '{print}'
echo "hello awk world" | awk '{$1="hello"; print}'
They yield respectively
hello awk world
hello awk world
The relevant section of (gawk's) manual only mentions as an aside that awk is going to rebuild the whole of $0 (based on the separator, etc) when you change one field. I guess it's not crazy behaviour. It has this:
"Finally, there are times when it is convenient to force awk to rebuild the entire record, using the current value of the fields and OFS. To do this, use the seemingly innocuous assignment:"
$1 = $1 # force record to be reconstituted
print $0 # or whatever else with $0
"This forces awk to rebuild the record."
Test input including some lines of equal length:
aa A line with MORE spaces
bb The very longest line in the file
ccb
9 dd equal len. Orig pos = 1
500 dd equal len. Orig pos = 2
ccz
cca
ee A line with some spaces
1 dd equal len. Orig pos = 3
ff
5 dd equal len. Orig pos = 4
g

The AWK solution from neillb is great if you really want to use awk and it explains why it's a hassle there, but if what you want is to get the job done quickly and don't care what you do it in, one solution is to use Perl's sort() function with a custom caparison routine to iterate over the input lines. Here is a one liner:
perl -e 'print sort { length($a) <=> length($b) } <>'
You can put this in your pipeline wherever you need it, either receiving STDIN (from cat or a shell redirect) or just give the filename to perl as another argument and let it open the file.
In my case I needed the longest lines first, so I swapped out $a and $b in the comparison.

Benchmark results
Below are the results of a benchmark across solutions from other answers to this question.
Test method
10 sequential runs on a fast machine, averaged
Perl 5.24
awk 3.1.5 (gawk 4.1.0 times were ~2% faster)
The input file is a 550MB, 6 million line monstrosity (British National Corpus txt)
Results
Caleb's perl solution took 11.2 seconds
my perl solution took 11.6 seconds
neillb's awk solution #1 took 20 seconds
neillb's awk solution #2 took 23 seconds
anubhava's awk solution took 24 seconds
Jonathan's awk solution took 25 seconds
Fritz's bash solution takes 400x longer than the awk solutions (using a truncated test case of 100000 lines). It works fine, just takes forever.
Another perl solution
perl -ne 'push #a, $_; END{ print sort { length $a <=> length $b } #a }' file

Try this command instead:
awk '{print length, $0}' your-file | sort -n | cut -d " " -f2-

Pure Bash:
declare -a sorted
while read line; do
if [ -z "${sorted[${#line}]}" ] ; then # does line length already exist?
sorted[${#line}]="$line" # element for new length
else
sorted[${#line}]="${sorted[${#line}]}\n$line" # append to lines with equal length
fi
done < data.csv
for key in ${!sorted[*]}; do # iterate over existing indices
echo -e "${sorted[$key]}" # echo lines with equal length
done

Python Solution
Here's a Python one-liner that does the same, tested with Python 3.9.10 and 2.7.18. It's about 60% faster than Caleb's perl solution, and the output is identical (tested with a 300MiB wordlist file with 14.8 million lines).
python -c 'import sys; sys.stdout.writelines(sorted(sys.stdin.readlines(), key=len))'
Benchmark:
python -c 'import sys; sys.stdout.writelines(sorted(sys.stdin.readlines(), key=len))'
real 0m5.308s
user 0m3.733s
sys 0m1.490s
perl -e 'print sort { length($a) <=> length($b) } <>'
real 0m8.840s
user 0m7.117s
sys 0m2.279s

The length() function does include spaces. I would make just minor adjustments to your pipeline (including avoiding UUOC).
awk '{ printf "%d:%s\n", length($0), $0;}' "$#" | sort -n | sed 's/^[0-9]*://'
The sed command directly removes the digits and colon added by the awk command. Alternatively, keeping your formatting from awk:
awk '{ print length($0), $0;}' "$#" | sort -n | sed 's/^[0-9]* //'

I found these solutions will not work if your file contains lines that start with a number, since they will be sorted numerically along with all the counted lines. The solution is to give sort the -g (general-numeric-sort) flag instead of -n (numeric-sort):
awk '{ print length, $0 }' lines.txt | sort -g | cut -d" " -f2-

With POSIX Awk:
{
c = length
m[c] = m[c] ? m[c] RS $0 : $0
} END {
for (c in m) print m[c]
}
Example

1) pure awk solution. Let's suppose that line length cannot be more > 1024
then
cat filename | awk 'BEGIN {min = 1024; s = "";} {l = length($0); if (l < min) {min = l; s = $0;}} END {print s}'
2) one liner bash solution assuming all lines have just 1 word, but can reworked for any case where all lines have same number of words:
LINES=$(cat filename); for k in $LINES; do printf "$k "; echo $k | wc -L; done | sort -k2 | head -n 1 | cut -d " " -f1

using Raku (formerly known as Perl6)
~$ cat "BinaryAve.txt" | raku -e 'given lines() {.sort(*.chars).join("\n").say};'
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Atlantis,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Ternary ave.,Some City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mr. Plain Example, 110 Binary ave.,Liberty City,RI,12345,(999)123-5555,1.56
AS2345,ASDF1232, Mrs. Plain Example, 1121110 Ternary st. 110 Binary ave..,Atlantis,RI,12345,(999)123-5555,1.56
To reverse the sort, add .reverse in the middle of the chain of method calls--immediately after .sort(). Here's code showing that .chars includes spaces:
~$ cat "number_triangle.txt" | raku -e 'given lines() {.map(*.chars).say};'
(1 3 5 7 9 11 13 15 17 19 0)
~$ cat "number_triangle.txt"
1
1 2
1 2 3
1 2 3 4
1 2 3 4 5
1 2 3 4 5 6
1 2 3 4 5 6 7
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9 0
Here's a time comparison between awk and Raku using a 9.1MB txt file from Genbank:
~$ time cat "rat_whole_genome.txt" | raku -e 'given lines() {.sort(*.chars).join("\n").say};' > /dev/null
real 0m1.308s
user 0m1.213s
sys 0m0.173s
~$ #awk code from neillb
~$ time cat "rat_whole_genome.txt" | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2- > /dev/null
real 0m1.189s
user 0m1.170s
sys 0m0.050s
HTH.
https://raku.org

Here is a multibyte-compatible method of sorting lines by length. It requires:
wc -m is available to you (macOS has it).
Your current locale supports multi-byte characters, e.g., by setting LC_ALL=UTF-8. You can set this either in your .bash_profile, or simply by prepending it before the following command.
testfile has a character encoding matching your locale (e.g., UTF-8).
Here's the full command:
cat testfile | awk '{l=$0; gsub(/\047/, "\047\"\047\"\047", l); cmd=sprintf("echo \047%s\047 | wc -m", l); cmd | getline c; close(cmd); sub(/ */, "", c); { print c, $0 }}' | sort -ns | cut -d" " -f2-
Explaining part-by-part:
l=$0; gsub(/\047/, "\047\"\047\"\047", l); ← makes of a copy of each line in awk variable l and double-escapes every ' so the line can safely be echoed as a shell command (\047 is a single-quote in octal notation).
cmd=sprintf("echo \047%s\047 | wc -m", l); ← this is the command we'll execute, which echoes the escaped line to wc -m.
cmd | getline c; ← executes the command and copies the character count value that is returned into awk variable c.
close(cmd); ← close the pipe to the shell command to avoid hitting a system limit on the number of open files in one process.
sub(/ */, "", c); ← trims white space from the character count value returned by wc.
{ print c, $0 } ← prints the line's character count value, a space, and the original line.
| sort -ns ← sorts the lines (by prepended character count values) numerically (-n), and maintaining stable sort order (-s).
| cut -d" " -f2- ← removes the prepended character count values.
It's slow (only 160 lines per second on a fast Macbook Pro) because it must execute a sub-command for each line.
Alternatively, just do this solely with gawk (as of version 3.1.5, gawk is multibyte aware), which would be significantly faster. It's a lot of trouble doing all the escaping and double-quoting to safely pass the lines through a shell command from awk, but this is the only method I could find that doesn't require installing additional software (gawk is not available by default on macOS).

Revisiting this one. This is how I approached it (count length of LINE and store it as LEN, sort by LEN, keep only the LINE):
cat test.csv | while read LINE; do LEN=$(echo ${LINE} | wc -c); echo ${LINE} ${LEN}; done | sort -k 2n | cut -d ' ' -f 1

Related

Get Average of Found Numbers in Each File to Two Decimal Places

I have a script that searches through all files in the directory and pulls the number next to the word <Overall>. I want to now get the average of the numbers from each file, and output the filename next to the average to two decimal places. I've gotten most of it to work except displaying the average. I should say I think it works, I'm not sure if it's pulling all of the instances in the file, and I'm definitely not sure if it's finding the average, it's hard to tell without the precision. I'm also sorting by the average at the end. I'm trying to use awk and bc to get the average, there's probably a better method.
What I have now:
path="/home/Downloads/scores/*"
(for i in $path
do
echo `basename $i .dat` `grep '<Overall>' < $i |
head -c 10 | tail -c 1 | awk '{total += $1} END {print total/NR}' | bc`
done) | sort -g -k 2
The output i get is:
John 4
Lucy 4
Matt 5
Sara 5
But it shouldn't be an integer and it should be to two decimal places.
Additionally, the files I'm searching through look like this:
<Student>John
<Math>2
<English>3
<Overall>5
<Student>Richard
<Math>2
<English>2
<Overall>4

In general, your script does not extract all numbers from each file, but only the first digit of the first number. Consider the following file:
<Overall>123 ...
<Overall>4 <Overall>56 ...
<Overall>7.89 ...
<Overall> 0 ...
The command grep '<Overall>' | head -c 10 | tail -c 1 will only extract 1.
To extract all numbers preceded by <Overall> you can use grep -Eo '<Overall> *[0-9.]*' | grep -o '[0-9.]*' or (depending on your version) grep -Po '<Overall>\s*\K[0-9.]*'.
To compute the average of these numbers you can use your awk command or specialized tools like ... | average (from the package num-utils) or ... | datamash mean 1.
To print numbers with two decimal places (that is 1.00 instead of 1 and 2.35 instead of 2.34567) you can use printf.
#! /bin/bash
path=/home/Downloads/scores/
for i in "$path"/*; do
avg=$(grep -Eo '<Overall> *[0-9.]*' "$file" | grep -o '[0-9.]*' |
awk '{total += $1} END {print total/NR}')
printf '%s %.2f\n' "$(basename "$i" .dat)" "$avg"
done |
sort -g -k 2
Sorting works only if file names are free of whitespace (like space, tab, newline).
Note that you can swap out the two lines after avg=$( with any method mentioned above.

You can use a sed command and retrieve the values to calculate their average with bc:
# Read the stdin, store the value in an array and perform a bc call
function avg() { mapfile -t l ; IFS=+ bc <<< "scale=2; (${l[*]})/${#l[#]}" ; }
# Browse the .dat files, then display for each file the average
find . -iname "*.dat" |
while read f
do
f=${f##*/} # Remove the dirname
# Echoes the file basename and a tabulation (no newline)
echo -en "${f%.dat}\t"
# Retrieves all the "Overall" values and passes them to our avg function
sed -E -e 's/<Overall>([0-9]+)/\1/' "$f" | avg
done
Output example:
score-2 1.33
score-3 1.33
score-4 1.66
score-5 .66

The pipeline head -c 10 | tail -c 1 | awk '{total += $1} END {print total/NR}' | bc needs improvement.
head -c 10 | tail -c 1 leaves only the 10th character of the first Overall line from each file; better drop that.
Instead, use awk to "remove" the prefix <Overall> and extract the number; we can do this by using <Overall> for the input field separator.
Also use awk to format the result to two decimal places.
Since awk did the job, there's no more need for bc; drop it.
The above pipeline becomes awk -F'<Overall>' '{total += $2} END {printf "%.2f\n", total/NR}'.
Don't miss to keep the ` after it.

how to find maximum and minimum values of a particular column using AWK [duplicate]

I'm using awk to deal with a simple .dat file, which contains several lines of data and each line has 4 columns separated by a single space.
I want to find the minimum and maximum of the first column.
The data file looks like this:
9 30 8.58939 167.759
9 38 1.3709 164.318
10 30 6.69505 169.529
10 31 7.05698 169.425
11 30 6.03872 169.095
11 31 5.5398 167.902
12 30 3.66257 168.689
12 31 9.6747 167.049
4 30 10.7602 169.611
4 31 8.25869 169.637
5 30 7.08504 170.212
5 31 11.5508 168.409
6 31 5.57599 168.903
6 32 6.37579 168.283
7 30 11.8416 168.538
7 31 -2.70843 167.116
8 30 47.1137 126.085
8 31 4.73017 169.496
The commands I used are as follows.
min=`awk 'BEGIN{a=1000}{if ($1<a) a=$1 fi} END{print a}' mydata.dat`
max=`awk 'BEGIN{a= 0}{if ($1>a) a=$1 fi} END{print a}' mydata.dat`
However, the output is min=10 and max=9.
(The similar commands can return me the right minimum and maximum of the second column.)
Could someone tell me where I was wrong? Thank you!

Awk guesses the type.
String "10" is less than string "4" because character "1" comes before "4".
Force a type conversion, using addition of zero:
min=`awk 'BEGIN{a=1000}{if ($1<0+a) a=$1} END{print a}' mydata.dat`
max=`awk 'BEGIN{a= 0}{if ($1>0+a) a=$1} END{print a}' mydata.dat`

a non-awk answer:
cut -d" " -f1 file |
sort -n |
tee >(echo "min=$(head -1)") \
> >(echo "max=$(tail -1)")
That tee command is perhaps a bit much too clever. tee duplicates its stdin stream to the files names as arguments, plus it streams the same data to stdout. I'm using process substitutions to filter the streams.
The same effect can be used (with less flourish) to extract the first and last lines of a stream of data:
cut -d" " -f1 file | sort -n | sed -n '1s/^/min=/p; $s/^/max=/p'
or
cut -d" " -f1 file | sort -n | {
read line
echo "min=$line"
while read line; do max=$line; done
echo "max=$max"
}

Your problem was simply that in your script you had:
if ($1<a) a=$1 fi
and that final fi is not part of awk syntax so it is treated as a variable so a=$1 fi is string concatenation and so you are TELLING awk that a contains a string, not a number and hence the string comparison instead of numeric in the $1<a.
More importantly in general, never start with some guessed value for max/min, just use the first value read as the seed. Here's the correct way to write the script:
$ cat tst.awk
BEGIN { min = max = "NaN" }
{
min = (NR==1 || $1<min ? $1 : min)
max = (NR==1 || $1>max ? $1 : max)
}
END { print min, max }
$ awk -f tst.awk file
4 12
$ awk -f tst.awk /dev/null
NaN NaN
$ a=( $( awk -f tst.awk file ) )
$ echo "${a[0]}"
4
$ echo "${a[1]}"
12
If you don't like NaN pick whatever you'd prefer to print when the input file is empty.

late but a shorter command and with more precision without initial assumption:
awk '(NR==1){Min=$1;Max=$1};(NR>=2){if(Min>$1) Min=$1;if(Max<$1) Max=$1} END {printf "The Min is %d ,Max is %d",Min,Max}' FileName.dat

A very straightforward solution (if it's not compulsory to use awk):
Find Min --> sort -n -r numbers.txt | tail -n1
Find Max --> sort -n -r numbers.txt | head -n1
You can use a combination of sort, head, tail to get the desired output as shown above.
(PS: In case if you want to extract the first column/any desired column you can use the cut command i.e. to extract the first column cut -d " " -f 1 sample.dat)

#minimum
cat your_data_file.dat | sort -nk3,3 | head -1
#this fill find minumum of column 3
#maximun
cat your_data_file.dat | sort -nk3,3 | tail -1
#this will find maximum of column 3
#to find in column 2 , use -nk2,2
#assing to a variable and use
min_col=`cat your_data_file.dat | sort -nk3,3 | head -1 | awk '{print $3}'`

awk: find minimum and maximum in column

I'm using awk to deal with a simple .dat file, which contains several lines of data and each line has 4 columns separated by a single space.
I want to find the minimum and maximum of the first column.
The data file looks like this:
9 30 8.58939 167.759
9 38 1.3709 164.318
10 30 6.69505 169.529
10 31 7.05698 169.425
11 30 6.03872 169.095
11 31 5.5398 167.902
12 30 3.66257 168.689
12 31 9.6747 167.049
4 30 10.7602 169.611
4 31 8.25869 169.637
5 30 7.08504 170.212
5 31 11.5508 168.409
6 31 5.57599 168.903
6 32 6.37579 168.283
7 30 11.8416 168.538
7 31 -2.70843 167.116
8 30 47.1137 126.085
8 31 4.73017 169.496
The commands I used are as follows.
min=`awk 'BEGIN{a=1000}{if ($1<a) a=$1 fi} END{print a}' mydata.dat`
max=`awk 'BEGIN{a= 0}{if ($1>a) a=$1 fi} END{print a}' mydata.dat`
However, the output is min=10 and max=9.
(The similar commands can return me the right minimum and maximum of the second column.)
Could someone tell me where I was wrong? Thank you!

Awk guesses the type.
String "10" is less than string "4" because character "1" comes before "4".
Force a type conversion, using addition of zero:
min=`awk 'BEGIN{a=1000}{if ($1<0+a) a=$1} END{print a}' mydata.dat`
max=`awk 'BEGIN{a= 0}{if ($1>0+a) a=$1} END{print a}' mydata.dat`

a non-awk answer:
cut -d" " -f1 file |
sort -n |
tee >(echo "min=$(head -1)") \
> >(echo "max=$(tail -1)")
That tee command is perhaps a bit much too clever. tee duplicates its stdin stream to the files names as arguments, plus it streams the same data to stdout. I'm using process substitutions to filter the streams.
The same effect can be used (with less flourish) to extract the first and last lines of a stream of data:
cut -d" " -f1 file | sort -n | sed -n '1s/^/min=/p; $s/^/max=/p'
or
cut -d" " -f1 file | sort -n | {
read line
echo "min=$line"
while read line; do max=$line; done
echo "max=$max"
}

Your problem was simply that in your script you had:
if ($1<a) a=$1 fi
and that final fi is not part of awk syntax so it is treated as a variable so a=$1 fi is string concatenation and so you are TELLING awk that a contains a string, not a number and hence the string comparison instead of numeric in the $1<a.
More importantly in general, never start with some guessed value for max/min, just use the first value read as the seed. Here's the correct way to write the script:
$ cat tst.awk
BEGIN { min = max = "NaN" }
{
min = (NR==1 || $1<min ? $1 : min)
max = (NR==1 || $1>max ? $1 : max)
}
END { print min, max }
$ awk -f tst.awk file
4 12
$ awk -f tst.awk /dev/null
NaN NaN
$ a=( $( awk -f tst.awk file ) )
$ echo "${a[0]}"
4
$ echo "${a[1]}"
12
If you don't like NaN pick whatever you'd prefer to print when the input file is empty.

late but a shorter command and with more precision without initial assumption:
awk '(NR==1){Min=$1;Max=$1};(NR>=2){if(Min>$1) Min=$1;if(Max<$1) Max=$1} END {printf "The Min is %d ,Max is %d",Min,Max}' FileName.dat

A very straightforward solution (if it's not compulsory to use awk):
Find Min --> sort -n -r numbers.txt | tail -n1
Find Max --> sort -n -r numbers.txt | head -n1
You can use a combination of sort, head, tail to get the desired output as shown above.
(PS: In case if you want to extract the first column/any desired column you can use the cut command i.e. to extract the first column cut -d " " -f 1 sample.dat)

#minimum
cat your_data_file.dat | sort -nk3,3 | head -1
#this fill find minumum of column 3
#maximun
cat your_data_file.dat | sort -nk3,3 | tail -1
#this will find maximum of column 3
#to find in column 2 , use -nk2,2
#assing to a variable and use
min_col=`cat your_data_file.dat | sort -nk3,3 | head -1 | awk '{print $3}'`

how to extract columns from a text file with bash

I have a text file like this.
res ABS sum
SER A 1 161.15 138.3
CYS A 2 66.65 49.6
PRO A 3 21.48 15.8
ALA A 4 77.68 72.0
ILE A 5 15.70 9.0
HIS A 6 10.88 5.9
I would like to extract the names of first column(res) based on the values of last column(sum). I have to print resnames if sum >25 and sum<25. How can I get the output like this?

This should do it:
awk 'BEGIN{FS=OFS=" "}{if($5 != 25) print $1}' bla.txt

While you can do this with a while read loop in bash, it's easier, and most likely faster, to use awk
awk '$5 != 25 { print $1 }'
Note that your logic print resnames if sum >25 and sum<25 is the same as print if sum != 25.

Consider using awk. Its a simple tool for processing columns of text (and much more). Here's a simple awk tutorial which will give you an overview. If you want to use it within a bash script, then this tutorial should help.
Run this on the command line to give you an idea of how you could do it:
> echo "SER A 1 161.15 138.3" | awk '{ if($5 > 25) print $1}'
> SER
> echo "SER A 1 161.15 138.3" | awk '{ if($5 > 140) print $1}'
>

while read line
do
v=($line)
sum=${v[4]}
((${sum/.*/} >= 25)) && echo ${v[0]}
done < file
You need to skip the first line.
Since bash doesn't handle floating point values, this will print 25 which isn't exactly bigger than 25.
This can be handled with calling bc for arithmetics.
tail -n +2 ser.dat | while read line
do
v=($line)
sum=${v[4]}
gt=$(echo "$sum > 25" | bc) && echo ${v[0]}
done

what about the good old cut?
:)
say you would like to have the second column,
cat your_file.txt | sed 's, +, ,g' | cut -d" " -f 2
what is doing sed in this command?
cut expects columns to be separated by a character or a string of fixed length (see documentation).

How to print all the columns after a particular number using awk?

On shell, I pipe to awk when I need a particular column.
This prints column 9, for example:
... | awk '{print $9}'
How can I tell awk to print all the columns including and after column 9, not just column 9?

awk '{ s = ""; for (i = 9; i <= NF; i++) s = s $i " "; print s }'

When you want to do a range of fields, awk doesn't really have a straight forward way to do this. I would recommend cut instead:
cut -d' ' -f 9- ./infile
Edit
Added space field delimiter due to default being a tab. Thanks to Glenn for pointing this out

awk '{print substr($0, index($0,$9))}'
Edit:
Note, this doesn't work if any field before the ninth contains the same value as the ninth.

sed -re 's,\s+, ,g' | cut -d ' ' -f 9-
Instead of dealing with variable width whitespace, replace all whitespace as single space. Then use simple cut with the fields of interest.
It doesn't use awk so isn't germane but seemed appropriate given the other answers/comments.

Generally perl replaces awk/sed/grep et. al., and is much more portable (as well as just being a better penknife).
perl -lane 'print "#F[8..$#F]"'
Timtowtdi applies of course.

awk -v m="\x01" -v N="3" '{$N=m$N ;print substr($0, index($0,m)+1)}'
This chops what is before the given field nr., N, and prints all the rest of the line, including field nr.N and maintaining the original spacing (it does not reformat). It doesn't mater if the string of the field appears also somewhere else in the line, which is the problem with Ascherer's answer.
Define a function:
fromField () {
awk -v m="\x01" -v N="$1" '{$N=m$N; print substr($0,index($0,m)+1)}'
}
And use it like this:
$ echo " bat bi iru lau bost " | fromField 3
iru lau bost
$ echo " bat bi iru lau bost " | fromField 2
bi iru lau bost
Output maintains everything, including trailing spaces
For N=0 it returns the whole line, as is, and for n>NF the empty string

Here is an example of ls -l output:
-rwxr-----# 1 ricky.john 1493847943 5610048 Apr 16 14:09 00-Welcome.mp4
-rwxr-----# 1 ricky.john 1493847943 27862521 Apr 16 14:09 01-Hello World.mp4
-rwxr-----# 1 ricky.john 1493847943 21262056 Apr 16 14:09 02-Typical Go Directory Structure.mp4
-rwxr-----# 1 ricky.john 1493847943 10627144 Apr 16 14:09 03-Where to Get Help.mp4
My solution to print anything post $9 is awk '{print substr($0, 61, 50)}'

Using cut instead of awk and overcoming issues with figuring out which column to start at by using the -c character cut command.
Here I am saying, give me all but the first 49 characters of the output.
ls -l /some/path/*/* | cut -c 50-
The /*/*/ at the end of the ls command is saying show me what is in subdirectories too.
You can also pull out certain ranges of characters ala (from the cut man page). E.g., show the names and login times of the currently logged in users:
who | cut -c 1-16,26-38

To display the first 3 fields and print the remaining fields you can use:
awk '{s = ""; for (i=4; i<= NF; i++) s= s $i : "; print $1 $2 $3 s}' filename
where $1 $2 $3 are the first 3 fields.

function print_fields(field_num1, field_num2){
input_line = $0
j = 1;
for (i=field_num1; i <= field_num2; i++){
$(j++) = $(i);
}
NF = field_num2 - field_num1 + 1;
print $0
$0 = input_line
}

Usually it is desired to pass the remaining columns unmodified. That is, without collapsing contiguous white space.
Imagine the case of processing the output of ls -l or ps faux (not recommended, just giving examples where the last column may contain sequences of whitespace)). We'd want any contiguous white space in the remaning columns preserved so that a file named my file.txt doesn't become my file.txt.
Preserving white space for the remainder of the line is surprisingly difficult using awk. The accepted awk-based answer does not, even with the suggested refinements.
sed or perl are better suited to this task.
sed
echo '1 2 3 4 5 6 7 8 9 10' | sed -E 's/^([^ \t]*[ \t]*){8}//'
Result:
9 10
The -E option enables modern ERE regex syntax. This saves me the trouble of backslash escaping the parentheses and braces.
The {8} is a quantifier indicating to match the previous item exactly 8 times.
The sed s command replaces 8 occurrences of white space delimited words by an empty string. The remainder of the line is left intact.
perl
Perl regex supports the \h escape for horizontal whitespace.
echo '1 2 3 4 5 6 7 8 9 10' | perl -pe 's/^(\H*\h*){8}//'
Result:
9 10

ruby -lane 'print $F[3..-1].join(" ")' file

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Sort a text file by line length including spaces - bash

Try this command instead: awk '{print length, $0}' your-file | sort -n | cut -d " " -f2-

With POSIX Awk: { c = length m[c] = m[c] ? m[c] RS $0 : $0 } END { for (c in m) print m[c] } Example

Revisiting this one. This is how I approached it (count length of LINE and store it as LEN, sort by LEN, keep only the LINE): cat test.csv | while read LINE; do LEN=$(echo ${LINE} | wc -c); echo ${LINE} ${LEN}; done | sort -k 2n | cut -d ' ' -f 1

Related

Get Average of Found Numbers in Each File to Two Decimal Places

how to find maximum and minimum values of a particular column using AWK [duplicate]

awk: find minimum and maximum in column

how to extract columns from a text file with bash

How to print all the columns after a particular number using awk?

Categories

Resources