Get all the text blocks with more than a certain number of lines in a file - shell

The text blocks are separated by blank lines, like:
AAA
BBB

AAA'
BBB'

AAA
BBB
CCC
I'd like to get the last text block that has more than 2 lines.
I know I could write a Python script.
How can I do so by using some command line fu?

EDIT: I think I have misunderstood "get the last text block". To simply print all paragraphs with more than 2 lines:
awk -v RS= -v ORS='\n\n' -F '\n' 'NF>2' file
perl -F'\n' -00e 'print if $#F >= 2' file
awk:
awk -v RS= -F '\n' 'NF>2 {rec=$0} END {if (rec!="") print rec}' file
Setting RS to the empty string enables "paragraph mode", in which blank lines separate records. FS is set to \n so that NF equals the number of lines in each paragraph. The awk program saves the latest record matching NF>2 and prints it at the end.
perl using a similar idea (except that perl counts the number of fields differently):
perl -F'\n' -l -00e '$rec=$_ if $#F >= 2; END {print $rec if defined $rec}' file
Depending on the content of the file, it may be faster to read the file backwards, e.g. with tac:
tac file | perl -F'\n' -l -00e 'if ($#F >= 2) {print $_; exit}' | tac
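As a quick check of the paragraph-mode approach, the sketch below builds a small sample file and wraps the awk command from this answer in a shell function. The sample contents and the name last_block are assumptions made for illustration.

```shell
# Sample input (assumed): four blocks separated by blank lines.
sample=$(mktemp)
printf 'AAA\nBBB\n\nA1\nB1\nC1\n\nAAA\nBBB\nCCC\n\nX\n' > "$sample"

# last_block FILE: print the last block that has more than 2 lines.
# RS= turns on paragraph mode; -F '\n' makes NF the line count of a block.
last_block() {
    awk -v RS= -F '\n' 'NF>2 {rec=$0} END {if (rec!="") print rec}' "$1"
}

last_block "$sample"
```

With this sample the function prints the three lines AAA, BBB, CCC; the one-line block after it is skipped because NF is 1.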

Related

Using sed command in shell script for substring and replace position to need

I'm working with data in a text file and I can't find a way with sed to select a substring at a fixed position and put it in place of the original field.
This is what I have:
X|001200000000000000000098765432|1234567890|TQ
This is what I need:
'X','00000098765432','1234567890','TQ'
The following sed code adds the quotes and commas, but it does not replace the second field with the substring I need (00000098765432):
echo " X|001200000000000000000098765432|1234567890|TQ" | sed "s/ *//g;s/|/','/g;s/^/'/;s/$/'/"
Could you help me?
Rather than sed, I would use awk for this.
echo "X|001200000000000000000098765432|1234567890|TQ" | awk 'BEGIN {FS="|";OFS=","} {print $1,substr($2,17,14),$3,$4}'
Gives output:
X,00000098765432,1234567890,TQ
Here is how it works:
FS = Field separator (in the input)
OFS = Output field separator (the way you want output to be delimited)
BEGIN -> think of it as the place where configurations are set. It runs only one time. So you are saying you want output to be comma delimited and input is pipe delimited.
substr($2,17,14) -> Take $2 (the second field; awk counts fields from 1) and apply substr to it: 17 is the starting character position and 14 is the number of characters to take from that position onwards.
In my opinion, this is much more readable and maintainable than the sed version you have.
If you want to put the quotes in, I'd still use awk.
$: awk -F'|' 'BEGIN{q="\047"} {print q $1 q","q substr($2,17,14) q","q $3 q","q $4 q"\n"}' <<< "X|001200000000000000000098765432|1234567890|TQ"
'X','00000098765432','1234567890','TQ'
If you just want to use sed, note that you say above you want to remove 16 characters, but you are actually only removing 14.
$: sed -E "s/^(.)[|].{14}([^|]+)[|]([^|]+)[|]([^|]+)/'\1','\2','\3','\4'/" <<< "X|0012000000000000000098765432|1234567890|TQ"
'X','00000098765432','1234567890','TQ'
Using sed
$ sed "s/|\(0[0-9]\{15\}\)\?/','/g;s/^\|$/'/g" input_file
'X','00000098765432','1234567890','TQ'
Using any POSIX awk:
$ echo 'X|001200000000000000000098765432|1234567890|TQ' |
awk -F'|' -v OFS="','" -v q="'" '{sub(/.{16}/,"",$2); print q $0 q}'
'X','00000098765432','1234567890','TQ'
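The same idea can be wrapped in a small function so that the start position and length are not hard-coded. This is only a sketch: the name quote_fields and its two arguments are hypothetical, and the sample record is rebuilt with printf on the assumption that the second field is 0012 followed by 98765432 zero-padded to 26 digits (30 characters in total).

```shell
# quote_fields START LEN (hypothetical helper): read pipe-delimited records
# on stdin, keep LEN characters of field 2 starting at START, and print
# every field wrapped in single quotes, separated by commas.
quote_fields() {
    awk -F'|' -v OFS="','" -v q="'" -v start="$1" -v len="$2" '
        { $2 = substr($2, start, len); print q $0 q }'
}

# Assumed sample record: field 2 is "0012" plus 98765432 zero-padded to 26 digits.
printf 'X|0012%026d|1234567890|TQ\n' 98765432 | quote_fields 17 14
```

Assigning to $2 makes awk rebuild the record with OFS, so the quotes between fields come from OFS and only the outermost pair has to be added by hand.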
not as elegant as I hoped for, but it gets the job done :
'X','00000098765432','1234567890','TQ'
# gawk profile, created Mon May 9 21:19:17 2022
# BEGIN rule(s)
'BEGIN {
1 _ = sprintf("%*s", (__ = +2)^++__+--__*++__,__--)
1 gsub(".", "[0-9]", _)
1 sub("$", "$", _)
1 FS = "[|]"
1 OFS = "\47,\47"
}
# Rule(s)
1 (NF *= NF == __*__) * sub(_, "|&", $__) * \
sub("^.*[|]", "", $__) * sub(".+", "\47&\47") }'
Tested and confirmed working on gnu gawk 5.1.1, mawk 1.3.4, mawk 1.9.9.6, and macosx nawk
— The 4Chan Teller
awk -v del1="\047" \
-v del2="," \
-v start="3" \
-v len="17" \
'{
gsub(substr($0,start+1,len),"");
gsub(/[\|]/,del1 del2 del1);
print del1$0del1
}' input_file
'X',00000098765432','1234567890','TQ'

I want to use awk to print rearranged fields then print from the 4th field to the end

I have a text file containing filesize, filedate, filetime, and filepath records. The filepath can contain spaces and can be very long (classical music names). I would like to print the file with filedate, filetime, filesize, and filepath. The first part, without the filepath is easy:
awk '{print $2,$3,$1}' filelist.txt
This works, but it prints the record on two lines:
awk '{print $2,$3,$1,$1=$2=$3=""; print $0}' filelist.txt
I've tried using cut -d' ' -f '2 3 1 4-' , but that doesn't allow rearranging fields. I can fix the two line issue using sed to join. There must be a way to only use awk. In summary, I want to print the 2nd, 3rd, 1st, and from the 4th field to the end. Can anyone help?
Since the print statement in awk always prints a newline in the end (technically ORS, which defaults to a newline), your first print will break the output in two lines.
With printf, on the other hand, you completely control the output with your format string. So, you can print the first three fields with printf (without the newline), then set them to "", and just finish off with the print $0 (which is equivalent to print without arguments):
awk '{ printf("%s %s %s",$2,$3,$1); $1=$2=$3=""; print }' file
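One side effect worth knowing: after $1=$2=$3="" awk rebuilds $0 with the three empty fields still joined by OFS, so the path comes out with extra leading spaces. A minimal sketch that strips them, using a made-up record and a hypothetical function name:

```shell
# Print fields 2, 3, 1, then the rest of the line.
# Blanking $1..$3 rebuilds $0 as "   rest..." (three leading separators),
# so sub() removes the leading spaces before the final print.
rearrange_awk() {
    awk '{ printf "%s %s %s ", $2, $3, $1; $1 = $2 = $3 = ""; sub(/^ +/, ""); print }' "$@"
}

# Made-up record: size, date, time, then a path containing spaces.
printf '%s\n' '1024 2022-05-09 21:19 Music/Cello Suite No 1.flac' | rearrange_awk
```

Because $0 is rebuilt, any run of several spaces inside the path is collapsed to a single space. If the path must be kept verbatim, it is safer to work on a copy of $0 with sub().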
I avoid awk when I can. If I understand correctly what you have said -
while read size date time path
do echo "$date $time $size $path"
done < filelist.txt
You could printf instead of echo for more formatting options.
Embedded spaces in $path won't matter since it's the last field.
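A slightly more defensive sketch of the same loop: read -r stops backslashes in a path from being interpreted, and printf avoids echo's handling of strings that start with a dash. The function name and the sample record are assumptions for illustration.

```shell
# Reorder "size date time path" records read from stdin.
# The last variable (path) receives the remainder of the line,
# including any spaces inside it.
rearrange_sh() {
    while read -r size date time path; do
        printf '%s %s %s %s\n' "$date" "$time" "$size" "$path"
    done
}

# Made-up record with a double space and a backslash in the path.
printf '%s\n' '2048 2022-05-09 21:19 Music/Bach  Cello\Suite.flac' | rearrange_sh
```

A shell loop like this is usually much slower than awk on large files, so it suits short lists best.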
I have no awk at hand to test but I suppose you may use printf to format a one-line output. Just locate the third space in $0 and take a substring from that position through the end of the input line.
You may also try to swap fields before a standard print, although I'm not sure it will produce desired results...
It always helps to delimit your fields with something like <tab>, so subsequent operations are easier... (I can see you used cut without -d, so maybe your data is already tab delimited.)
echo 1 2 3 very long name |
sed -e 's/ /\t/' -e 's/ /\t/' -e 's/ /\t/' |
awk -v FS='\t' -v OFS='\t' '{print $2, $3, $1, $4}'
The first line generates data. The sed command substitutes first three spaces in each row with \t. Then the awk works flawlessly, outputting tab delimited data again (you need a reasonably new awk).
With GNU awk for gensub():
$ echo '1 2 3 4 5 6' | awk '{print $3, $2, $1, gensub(/([^ ]+ ){3}/,"",1)}'
3 2 1 4 5 6
With any awk:
$ echo '1 2 3 4 5 6' | awk '{rest=$0; sub(/([^ ]+ ){3}/,"",rest); print $3, $2, $1, rest}'
3 2 1 4 5 6

count the number of words between two lines in a text file

As the title says I'm wondering if there is an easier way of getting the number of words between two lines in a text file, using text processing tools available on *nix.
For example given a text file is as follows,
a bc ae
a b
ae we wke wew
Count of words between lines: 1-2 -> 5, 2-3 -> 6.
You can use sed and wc like this:
sed -n '1,2p' file | wc -w
5
and
sed -n '2,3p' file | wc -w
6
You can do this with a simple awk command:-
awk -v start='1' -v end='2' 'NR>=start && NR <=end{sum+=NF}END{print sum}' file
For the sample file you have provided:-
$ cat file
a bc ae
a b
ae we wke wew
$ awk -v start='1' -v end='2' 'NR>=start && NR <=end{sum+=NF}END{print sum}' file
5
$ awk -v start='2' -v end='3' 'NR>=start && NR <=end{sum+=NF}END{print sum}' file
6
$ awk -v start='1' -v end='3' 'NR>=start && NR <=end{sum+=NF}END{print sum}' file
9
The logic is simple:-
Use the start and end awk variables to specify the line range in the file.
NR>=start && NR<=end selects only the lines in that range.
sum+=NF does the word-count arithmetic. NF is a built-in awk variable holding the number of fields in the current line, split on FS, which by default is whitespace.
END{print sum} prints the final count.
Worked fine on GNU Awk 3.1.7
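For repeated use, the command can be wrapped in a function that also stops reading once the range has been passed. This is a sketch: the name count_words and its argument order are hypothetical.

```shell
# count_words START END FILE (hypothetical helper)
count_words() {
    awk -v start="$1" -v end="$2" '
        NR > end    { exit }          # past the range: skip the rest of the file
        NR >= start { sum += NF }     # NF = number of whitespace-separated words
        END         { print sum + 0 } # +0 prints 0 instead of an empty line
    ' "$3"
}

# The sample file from the question.
words=$(mktemp)
printf 'a bc ae\na b\nae we wke wew\n' > "$words"
count_words 1 2 "$words"
count_words 2 3 "$words"
```

The two calls print 5 and 6, matching the counts in the question. The early exit only matters for large files, where it avoids scanning lines after the range.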

awk syntax — what is OFS and why the 1 at the end?

awk -F"\t" -v OFS="\t" '{if($18~/^ *[0-9]*(\.[0-9]+)?" *$/)sub(/"/,"",$18);else $18=" "}1' sample.txt
The code above is some awk code used in a script I'm modifying. I'm new to Unix, so I'm not able to understand the syntax of the awk above.
-F is for splitting the columns with the delimiter.
What is OFS?
And what is the use of 1 at the end of the awk script?
-v OFS="\t" assigns the awk variable OFS from the command line. Like -F (which sets FS, the input field separator), it is a field separator, but for the output: it is called the output field separator.
You can test it:
awk -v OFS=' ' '{print 1,2}' a.txt
Output separated by spaces:
1 2
1 2
awk -v OFS=';' '{print 1,2}' a.txt
Output separated by ;:
1;2
1;2
In your case it means that the output fields will be separated by tabs, like the input.
The 1 at the end of the awk script is a pattern with no action. An awk script is a list of pattern { action } pairs; the pattern 1 is always true, and when the action is omitted awk runs the default action, which prints the current line. So the 1 prints every line, including any changes the preceding block made to it.
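One detail that often surprises: OFS only shows up when awk rebuilds the record (a field is assigned) or when print is given comma-separated arguments. The sketch below, with made-up input and variable names, shows the 1 pattern printing the record with and without such a rebuild.

```shell
# The pattern 1 with no action prints $0 as it currently stands.
unchanged=$(printf 'a b c\n' | awk -v OFS='-' '1')

# Assigning to any field (even $1 = $1) makes awk rebuild $0 using OFS.
rebuilt=$(printf 'a b c\n' | awk -v OFS='-' '{ $1 = $1 } 1')

printf '%s\n%s\n' "$unchanged" "$rebuilt"
```

The first line printed is a b c and the second is a-b-c. In the question's script both branches modify $18, so each line is rebuilt with tabs before 1 prints it.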

Join lines based on pattern

I have the following file:
test
1
My
2
Hi
3
I need a way to use cat, grep or awk to give the following output:
test1
My2
Hi3
How can I achieve this in a single command? Something like
cat file.txt | grep ... | awk ...
Note that it's always a string followed by a number in the original text file.
sed 'N;s/\n//' file.txt
This should give the desired output when the content is in file.txt
paste -d "" - - < filename
This takes consecutive lines and pastes them together delimited by the empty string.
awk '{printf("%s", $0);} !(NR%2){printf("\n");}' file.txt
EDIT: I just noticed that your question requires the use of cat and grep. Both of those programs are unnecessary to achieve your stated aims. If you have some reason for including them that you haven't mentioned, try this (uselessly inefficient) version of the line I wrote immediately above:
cat file.txt | grep '^' | awk '{printf("%s", $0);} !(NR%2){printf("\n");}'
It is possible that this command uses features not present in the original awk program. You may need to invoke the new awk program, nawk instead.
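If the file might end with an unpaired line, a variant that holds each odd line until its partner arrives is easier to reason about. A sketch, with the function name join_pairs chosen only for illustration:

```shell
# join_pairs [FILE...]: glue each odd-numbered line to the line after it.
join_pairs() {
    awk '
        NR % 2 { held = $0; next }        # odd line: remember it
               { print held $0 }          # even line: emit the joined pair
        END    { if (NR % 2) print held } # odd line count: flush the leftover
    ' "$@"
}

# The sample from the question.
printf 'test\n1\nMy\n2\nHi\n3\n' | join_pairs
```

This prints test1, My2 and Hi3 on separate lines, and a trailing unpaired line is printed on its own.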
If your input file is always 1 number then 1 string, and you only want the strings, all you have to do is take every other line.
If you only want the odd lines, you can do awk 'NR % 2' file.txt
If you want the evens, this becomes awk 'NR % 2==0' file.txt
Here is the answer:
cat file.txt | awk 'BEGIN { lno = 0 } { val=$0; if (lno % 2 == 1) {printf "%s\n", $0} else {printf "%s", $0}; ++lno}'

Resources