Number of total records - shell

I'm quite new to AWK. I just discovered the FNR variable, and I wonder whether it is possible to get the total number of records before processing the file?
That is, the value FNR would have at the end of the file.
I need it to do something like this:
awk 'FNR<TOTALRECORDS-4 {print}'
In order to delete the last 4 lines of the file.
Thanks

If you merely want to print all but the last 4 lines of a file, use a different tool. But if you are doing some other processing with awk and need to incorporate this, just store the lines in a buffer and print them as needed. That is, keep the most recent 4 lines in a buffer, and each time you read a new line beyond the first 4, print the oldest one in the buffer. For example:
awk 'NR>4 { print a[i%4]} {a[i++%4]=$0}' input
This keeps 4 lines in the array a. While we are within the first 4 lines of the file, we do nothing but store the current line in a. Once we are past line 4, the first thing we do is print the line from 4 lines back (stored in a at index i%4). You can put commands that manipulate $0 between these two action statements as needed, as in the sketch below.
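For instance, a hypothetical variation, just to show where extra processing would go: upper-case each kept line while still dropping the last 4 (input is a placeholder file name):
awk 'NR>4 {print a[i%4]} {$0=toupper($0)} {a[i++%4]=$0}' input
The extra action runs on every line before it is stored, so the lines that eventually get printed are the upper-cased ones, and the last 4 lines are still withheld.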

To remove the last 4 lines from a file, you can just use head:
head -n -4 somefile > outputfile
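Note that a negative line count for head -n is a GNU coreutils extension and may not be available on every platform. A portable two-pass sketch, which also answers the original question of knowing the total record count up front (somefile is read twice):
awk -v n=4 'NR==FNR {total = NR; next} FNR <= total - n' somefile somefile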

How to remove lines that appear only once in a file using bash

How can I remove lines that appear only once in a file in bash?
For example, file foo.txt has:
1
2
3
3
4
5
after processing the file, only
3
3
will remain.
Note the file is sorted already.
If your duplicated lines are consecutive, you can use uniq:
uniq -D file
From the man page:
-D print all duplicate lines
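For example, on the sample foo.txt above, where the duplicates are already adjacent:
$ uniq -D foo.txt
3
3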
Just loop the file twice:
$ awk 'FNR==NR {seen[$0]++; next} seen[$0]>1' file file
3
3
The first pass counts how many times each line occurs: the array seen[$0] keeps track of the count.
The second pass prints the lines whose count is greater than one.
Using single pass awk:
awk '{freq[$0]++} END{for(i in freq) for (j=1; freq[i]>1 && j<=freq[i]; j++) print i}' file
3
3
Using freq[$0]++ we count and store the frequency of each line.
In the END block, if the frequency is greater than 1, we print that line as many times as its frequency. Note that for (i in freq) iterates over the array in an unspecified order, so the input order is not necessarily preserved.
Using awk, single pass:
$ awk 'a[$0]++ && a[$0]==2 {print} a[$0]>1' foo.txt
3
3
If the file is unordered, the output appears in the order in which the duplicates are encountered, since this solution does not buffer values.
Here's a POSIX-compliant awk alternative to the GNU-specific uniq -D:
awk '++seen[$0] == 2; seen[$0] >= 2' file
This turned out to be just a shorter reformulation of James Brown's helpful answer.
Unlike uniq, this command doesn't strictly require the duplicates to be grouped, but the output order will only be predictable if they are.
That is, if the duplicates aren't grouped, the output order is determined by the relative ordering of the 2nd instances in each set of duplicates, and in each set the 1st and 2nd instances will be printed together, as the example below shows.
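For instance, with a hypothetical unsorted input unsorted.txt containing the lines 3, 1, 3, 2, 1 (one per line):
$ awk '++seen[$0] == 2; seen[$0] >= 2' unsorted.txt
3
3
1
1
3 is printed before 1 because the second 3 occurs before the second 1 in the input, and both copies of each duplicate are printed together.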
For unsorted (ungrouped) data (and if preserving the input order is also important), consider:
fedorqui's helpful answer (elegant, but requires reading the file twice)
anubhava's helpful answer (single-pass solution, but a little more cumbersome).

Select full block of text delimited by some chars

I have a very large text file (40GB gzipped) where blocks of data are separated by //.
How can I select blocks of data where a certain line matches some criterion? That is, can I grep a pattern and extend the selection in both directions to the // delimiter? I can make no assumptions on the size of the block and the position of the line.
not interesting 1
not interesting 2
//
get the whole block 1
MATCH THIS LINE
get the whole block 2
get the whole block 3
//
not interesting 1
not interesting 2
//
I want to select the block of data with MATCH THIS LINE:
get the whole block 1
MATCH THIS LINE
get the whole block 2
get the whole block 3
I tried sed but can't get my head around the pattern definition. This for example should match from // to MATCH THIS LINE:
sed -n -e '/\/\//,/MATCH THIS LINE/ p' file.txt
But it fails matching the //.
Is it possible to achieve this with GNU command line tools?
With GNU awk (due to multi-char RS), you can set the record separator to //, so that every record is a //-delimited set of characters:
$ awk -v RS="//" '/MATCH THIS LINE/' file
get the whole block 1
MATCH THIS LINE
get the whole block 2
get the whole block 3
Note this leaves an empty line above and below the output, because the newline right after the opening // and the one right before the closing // are part of the record and get printed back. To remove them you can pipe the output to awk 'NF'.
To print the separator between blocks of data you can say (thanks 123):
awk -v RS="//" '/MATCH THIS LINE/{print RT $0 RT}' file

Remove all lines except the last which start with the same string

I'm using awk to process a file and filter its lines down to specific ones of interest. From the output that is generated, I'd like to remove all lines except the last of those that start with the same string.
Here's an example of what is generated:
this is a line
duplicate remove me
duplicate this should go too
another unrelated line
duplicate but keep me
example remove this line
example but keep this one
more unrelated text
Lines 2 and 3 should be removed because they start with duplicate, as does line 5. Therefore line 5 should be kept, as it is the last line starting with duplicate.
The same follows for line 6, since it begins with example, as does line 7. Therefore line 7 should be kept, as it is the last line which starts with example.
Given the example above, I'd like to produce the following output:
this is a line
another unrelated line
duplicate but keep me
example but keep this one
more unrelated text
How could I achieve this?
I tried the following, however it doesn't work correctly:
awk -f initialProcessing.awk largeFile | awk '{currentMatch=$1; line=$0; getline; nextMatch=$1; if (currentMatch != nextMatch) {print line}}' -
Why don't you read the file from the end to the beginning and print the first line containing duplicate? This way you don't have to worry about what was already printed, holding lines back, etc.
tac file | awk '/duplicate/ {if (f) next; f=1}1' | tac
This sets a flag f the first time duplicate is seen. From the second occurrence on, the flag causes the line to be skipped.
If you want to make this generic in a way that every first word is printed just the last time, use an array approach:
tac file | awk '!seen[$1]++' | tac
This keeps track of the first words that have appeared so far. They are stored in the array seen[], so that by saying !seen[$1]++ we make it True just when $1 occurs for the first time; from the second time on, it evaluates as False and the line is not printed.
Test
$ tac a | awk '!seen[$1]++' | tac
this is a line
another unrelated line
duplicate but keep me
example but keep this one
more unrelated text
You could use an (associative) array to always keep the last occurrence:
awk '{last[$1]=$0;} END{for (i in last) print last[i];}' file
Note that for (i in last) iterates over the array in an unspecified order, so this does not preserve the original line order; a two-pass variant that does is sketched below.
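If preserving the input order matters, a two-pass sketch (reading the file twice) keeps only the last occurrence of each first word while printing in the original order:
awk 'NR==FNR {last[$1]=FNR; next} FNR==last[$1]' file file
The first pass records the line number of the last occurrence of each first word; the second pass prints a line only when it is that occurrence.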

Remove every n lines for removing datablock using sed or awk

I have a big file made up of 316125000 lines. This file is made up of 112500 data blocks, and each data block has 2810 lines.
I need to reduce the size of the file, so I want to keep every 10th data block starting with the 1st (that is, the 1st, 11th, 21st, and so on) and remove all the other data blocks. This will give me 11250 data blocks as a result.
In other words, I want to keep lines 1~2810, 28101~30910, and so on, and remove the lines 2811~28100 in between, repeating this pattern through the whole file.
I was thinking of awk, sed or grep, but which one is faster, and how can I achieve this? I know how to remove every 2nd or 3rd line with awk and NR, but I don't know how to remove big chunks of lines repeatedly.
Thanks
Best,
Something along these lines might work:
awk 'int((NR - 1) / 2810) % 10 == 0' <infile >outfile
That is, int((NR - 1) / 2810) gives the (zero-based) number of the 2810-line block that the current line (NR) belongs to, and if the remainder of that block number divided by ten is 0 (% 10 == 0), the line is printed. This should result in every 10th block being printed, starting with the first (block number 0).
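Before running this over all 316125000 lines, the block arithmetic can be sanity-checked on a tiny synthetic input (a sketch with a block size of 3 and every 2nd block kept, purely for illustration):
$ seq 12 | awk 'int((NR - 1) / 3) % 2 == 0'
1
2
3
7
8
9
Lines 1-3 form block 0 and lines 7-9 form block 2, the blocks whose zero-based number is even.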
I wouldn't guess which is fastest, but I can provide a GNU sed recipe for your benchmarking:
sed -e '2811~28100,+25289d' <input >output
This says: starting at line 2811 and every 28100 lines thereafter, delete that line and the next 25289 lines (25290 lines in total, i.e. 9 full blocks of 2810 lines).
Equivalently, we can use sed -n and print lines 1-2810 every 28100 lines:
sed -ne '1~28100,+2809p' <input >output

Bash comparing two different files with different fields

I am not sure if this is possible, but I want to compare two character values from two different files. If they match, I want to print out the field value in slot 2 from one of the files. Here is an example:
# File 1
Date D
Tamb B
# File 2
F gge0001x gge0001y gge0001z
D 12-30-2006 12-30-2006 12-30-2006
T 14:15:20 14:15:55 14:16:27
B 15.8 16.1 15
Here is my thinking behind what I want to do:
if [ (field2) from (file1) == (field1) from (file2) ] ; do
echo (field1 from file1) and also (field2 from file2) on the same line
which prints out "Date 12-30-2006"
"Tamb 15.8"
" ... "
and continually run through every line of file 1, printing out any matches there are. I am assuming this will need some sort of array. Any thoughts on whether this is the correct logic and whether this is even possible?
This reformats file2 based on the abbreviations found in file1:
$ awk 'FNR==NR{a[$2]=$1;next;} $1 in a {print a[$1],$2;}' file1 file2
Date 12-30-2006
Tamb 15.8
How it works
FNR==NR{a[$2]=$1;next;}
This reads each line of file1 and saves the information in array a.
In more detail, NR is the number of lines that have been read in so far and FNR is the number of lines that have been read in so far from the current file. So, when NR==FNR, we know that awk is still processing the first file. Thus, the array assignment a[$2]=$1 is only performed for the first file. The statement next tells awk to skip the rest of the code and jump to the next line.
$1 in a {print a[$1],$2;}
Because of the next statement, above, we know that, if we get to this line, we are working on file2.
If field 1 of file2 matches any field 2 from file1 (that is, if $1 is a key in the array a), then print a reformatted version of the line.
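As a variation (not something the question explicitly asks for), if you wanted to keep every data column from file2 rather than only the first, you could replace field 1 with its full name and print the whole line:
awk 'FNR==NR {a[$2]=$1; next} $1 in a {$1=a[$1]; print}' file1 file2
With the sample files this prints:
Date 12-30-2006 12-30-2006 12-30-2006
Tamb 15.8 16.1 15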
