Get n lines from a file which are equally spaced - bash

I have a big file with 1000 lines. I want to get 110 lines from it.
The lines should be evenly spread across the input file.
For example, here is how I would read 4 lines from a file with 10 lines:
Input File
1
2
3
4
5
6
7
8
9
10
outFile:
1
4
7
10

Use:
sed -n '1~9p' < file
The -n option stops sed from printing anything by default. '1~9p' is a GNU sed address meaning "starting at line 1, every 9th line"; the trailing p tells sed to print the matching lines.
To get close to 110 lines out of 1000 you print every 9th line (1000/110 ≈ 9).
Update: this will actually print 112 lines; if you need exactly 110, you can cap the output with head:
sed -n '1~9p' < file | head -n 110
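For the 10-line example in the question, the same idea with a step of 3 gives the requested output (first~step addressing is a GNU sed extension):
$ seq 10 | sed -n '1~3p'
1
4
7
10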

$ cat tst.awk
NR==FNR { next }
FNR==1 { mod = int((NR-1)/tgt) }
!( (FNR-1)%mod ) { print; cnt++ }
cnt == tgt { exit }
$ wc -l file1
1000 file1
$ awk -v tgt=110 -f tst.awk file1 file1 > file2
$ wc -l file2
110 file2
$ head -5 file2
1
10
19
28
37
$ tail -5 file2
946
955
964
973
982
Note that this will not produce the exact output you posted for your sample input, because that would require an algorithm that doesn't always use the same interval between output lines. You could dynamically calculate mod and adjust it as you parse the input if you like, but the above may be good enough.
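If you do want the evenly spread selection from your example (first and last line always included), here is a rough sketch of that dynamic approach, still reading the file twice (tgt and the want array are names I made up):
awk -v tgt=4 '
  NR==FNR { n = NR; next }      # first pass: count the lines
  FNR==1 {                      # pick tgt evenly spread line numbers
    for (i = 0; i < tgt; i++)
      want[int(i * (n - 1) / (tgt - 1)) + 1] = 1
  }
  FNR in want                   # second pass: print only those lines
' file file
For the 10-line sample this prints 1, 4, 7 and 10; with tgt=110 on a 1000-line file it prints exactly 110 lines, including the first and the last.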

With awk you can do:
awk -v interval=3 '(NR-1)%interval==0' file
where interval is the gap in line numbers between consecutive printed lines. Its value is essentially the total number of lines in the file divided by the number of lines you want printed.
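A minimal way to compute interval from the file itself, following that logic (the file name and target count here are placeholders):
interval=$(( $(wc -l < file) / 110 ))     # e.g. 1000 / 110 = 9
awk -v interval="$interval" '(NR-1) % interval == 0' file | head -n 110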

I often like to use a combination of shell and awk for these sorts of things
#!/bin/bash
filename=$1
toprint=$2
awk -v tot=$(expr $(wc -l < $filename)) -v toprint=$toprint '
  BEGIN { interval = int((tot-1)/(toprint-1)) }
  (NR-1) % interval == 0 {
    print
    nbr++
  }
  nbr == toprint { exit }
' $filename
Some examples:
$ ./spread.sh 1001lines 5
1
251
501
751
1001
$ ./spread.sh 1000lines 110 |head -n 3
1
10
19
$ ./spread.sh 1000lines 110 |tail -n 3
964
973
982

Related

Print first few and last few lines of file through a pipe with "..." in the middle

Problem Description
This is my file
1
2
3
4
5
6
7
8
9
10
I would like to send the cat output of this file through a pipe and receive this
% cat file | some_command
1
2
...
9
10
Attempted solutions
Here are some solutions I've tried, with their output
% cat temp | (head -n2 && echo '...' && tail -n2)
1
2
...
% cat temp | tee >(head -n3) >(tail -n3) >/dev/null
1
2
3
8
9
10
# I don't know how to get the ...
% cat temp | sed -e 1b -e '$!d'
1
10
% cat temp | awk 'NR==1;END{print}'
1
10
# Can only get 2 lines
An awk:
awk -v head=2 -v tail=2 'FNR==NR && FNR<=head
FNR==NR && cnt++==head {print "..."}
NR>FNR && FNR>(cnt-tail)' file file
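The same two-pass awk with the conditions annotated (comments are mine):
awk -v head=2 -v tail=2 '
  FNR==NR && FNR<=head                    # 1st pass: print the first head lines
  FNR==NR && cnt++==head { print "..." }  # 1st pass: count every line, print ... once
  NR>FNR && FNR>(cnt-tail)                # 2nd pass: cnt is now the total, print the last tail lines
' file file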
Or if a single pass is important (and memory allows), you can use perl:
perl -0777 -lanE 'BEGIN{$head=2; $tail=2;}
END{say join("\n", @F[0..$head-1],("..."),@F[-$tail..-1]);}' file
Or, an awk that is one pass:
awk -v head=2 -v tail=2 'FNR<=head
{lines[FNR]=$0}
END{
print "..."
for (i=FNR-tail+1; i<=FNR; i++) print lines[i]
}' file
Or, nothing wrong with being caveman-direct:
head -2 file; echo "..."; tail -2 file
Any of these prints:
1
2
...
9
10
In terms of efficiency, here are some stats.
For small files (i.e., less than 10 MB or so) all of these finish in under 1 second, and the 'caveman' approach takes 2 ms.
I then created a 1.1 GB file with seq 99999999 >file
Two-pass awk: 50 seconds
One-pass perl: 10 seconds
One-pass awk: 29 seconds
'Caveman': 2 ms
You may consider this awk solution (file{,} is brace expansion for file file, i.e. the same file is read twice):
awk -v top=2 -v bot=2 'FNR == NR {++n; next} FNR <= top || FNR > n-bot; FNR == top+1 {print "..."}' file{,}
1
2
...
9
10
Two single pass sed solutions:
sed '1,2b
3c\
...
N
$!D'
and
sed '1,2b
3c\
...
$!{h;d;}
H;g'
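For readers less familiar with sed, here is the first script again with comments added (annotations are mine; behaviour is unchanged, GNU sed assumed):
sed '
  # lines 1 and 2: print them unchanged and start the next cycle
  1,2b
  # line 3: replace it with a literal ...
  3c\
...
  # from line 4 on: append the next line (N) and, while not at end of
  # input, delete the older of the two held lines (D); the final
  # two-line window (the last two lines) is printed automatically
  N
  $!D' file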
Assumptions:
as OP has stated, a solution must be able to work with a stream from a pipe
the total number of lines coming from the stream is unknown
if the total number of lines is less than the sum of the head/tail offsets then we'll print duplicate lines (we can add more logic if OP updates the question with more details on how to address this situation)
A single-pass awk solution that implements a queue in awk to keep track of the most recent N lines; the queue allows us to limit awk's memory usage to just N lines (as opposed to loading the entire input stream into memory, which could be problematic when processing a large volume of lines/data on a machine with limited available memory):
h=2 t=3
cat temp | awk -v head=${h} -v tail=${t} '
{ if (NR <= head) print $0
lines[NR % tail] = $0
}
END { print "..."
if (NR < tail) i=0
else i=NR
do { i=(i+1)%tail
print lines[i]
} while (i != (NR % tail) )
}'
This generates:
1
2
...
8
9
10
Demonstrating the overlap issue:
$ cat temp4
1
2
3
4
With h=3;t=3 the proposed awk code generates:
$ cat temp4 | awk -v head=${h} -v tail=${t} '...'
1
2
3
...
2
3
4
Whether or not this is the 'correct' output will depend on OP's requirements.
I suggest with bash:
(head -n 2; echo "..."; tail -n 2) < file
Output:
1
2
...
9
10

how to print every fifth row in a file

I have a file with numbers
20
18
21
16
14
30
40
24
and I need to output four files with rows printed with intervals of 4
So we have rows 1,5,9...
20
14
Then rows 2,6,10...
18
30
Then 3,7,11...
21
40
and then 4,8,12...
16
24
I tried the code below, but it does not give me control over the starting row:
awk 'NR % 4 == 0'
In a single awk you can do:
awk '{print > ("file" (NR%4))}' inputfile
This will send the output to files file0, file1, file2 and file3
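If you want the numbering to match the grouping in the question (rows 1,5,9,... in the first file), a small variation of the same idea (the output names are just an example):
awk '{print > ("file" ((NR-1)%4 + 1))}' inputfile
Here rows 1,5,9,... go to file1, rows 2,6,10,... to file2, and so on.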
You may use these awk commands:
awk -v n=1 'NR%4 == n%4' file
20
14
awk -v n=2 'NR%4 == n%4' file
18
30
awk -v n=3 'NR%4 == n%4' file
21
40
awk -v n=4 'NR%4 == n%4' file
16
24
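To produce all four output files in one go you could wrap that in a small loop (the output file names are illustrative):
for n in 1 2 3 4; do
  awk -v n="$n" 'NR%4 == n%4' file > "out$n"
done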
IMHO awk is the best solution, but you can also use sed. With an input file generated by seq 12:
for ((i=1; i<5; i++)); do
  sed -n "$i~4w$i.out" <(seq 12)
done
Here the w$i.out command writes the matching lines to file $i.out.
This might work for you (GNU sed):
sed -ne '1~4w file1' -e '2~4w file2' -e '3~4w file3' -e '4~4w file4' file

bash - split but only use certain numbers

Let's say I want to split a large file into files that have, for example, 50 lines each:
split <file> -d -l 50 prefix
How do I make this ignore the first n and the last m lines in the <file>, though?
Use head and tail:
tail -n +N [file] | head -n -M | split -d -l 50
Ex (lines is a textfile with 10 lines, each with a consecutive number):
[bart@localhost playground]$ tail -n +3 lines | head -n -2
3
4
5
6
7
8
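Putting it together for the original split question, a sketch with explicit n and m (note that head -n -M needs GNU head, and that tail -n +N starts at line N, so skipping the first n lines means N = n+1):
n=2   # lines to drop from the top (example values)
m=2   # lines to drop from the bottom
tail -n +"$((n + 1))" file | head -n -"$m" | split -d -l 50 - prefix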
You can use awk on the file that is to be split, by providing the range of lines you need.
awk -v lineStart=2 -v lineEnd=8 'NR>=lineStart && NR<=lineEnd' splitted-file
E.g.
$ cat line
1
2
3
4
5
6
7
8
9
10
For example, awk with a range from 3 to 8:
$ awk -v lineStart=3 -v lineEnd=8 'NR>=lineStart && NR<=lineEnd' file
3
4
5
6
7
8
If n and m hold the start and end line numbers to print, you can do this with sed:
sed -n $n,${m}p file
-n disables the default printing of every line; p prints only the lines matching the range $n,${m}.
With awk
awk "NR>$n && NR<$m" file
where NR represent the number of line
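To tie this back to the original split question, you could feed the selected range straight into split (example line numbers):
n=3; m=8
sed -n "${n},${m}p" file | split -d -l 50 - prefix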

Count how many occurrences in a line are greater than or equal to a defined value

I have a file (F1) with N=10000 lines, where each line contains M=20000 numbers. I have another file (F2) with N=10000 lines and only 1 column. How can I count the number of values in line i of file F1 that are greater than or equal to the number found at line i of file F2? I tried using a bash loop with awk/sed but my output is empty.
Edit:
For now I've only succeeded in printing the number of occurrences that are higher than a fixed value. Here is an example with a 3-line file and a threshold of 15 (sorry, it's very dirty code):
for i in {1..3};do sed -n "$i"p tmp.txt | sed 's/\t/\n/g' | awk '{if($1 > 15){print $1}}' | wc -l; done;
Thanks in advance,
awk 'FNR==NR { a[FNR]=$1; next }
     {
       count = 0
       for (i=1; i<=NF; i++)
         if ($i >= a[FNR])
           count++
       print count
     }' file2 file1
While processing file2 (where the overall record number NR equals the per-file record number FNR), store each value in array a, indexed by the current record number.
For each line of file1, initialize count to 0.
Loop through the fields, incrementing the counter whenever a field is greater than or equal to the value stored at index FNR in array a.
Print the count.
$ cat file1
1 3 5 7 3 6
2 5 6 8 7 7
4 6 7 8 9 4
$ cat file2
6
3
1
$ awk -f file.awk file2 file1
2
5
6
You could do it in a single awk command:
awk 'NR==FNR{a[FNR]=$1;next}{c=0;for(i=1;i<=NF;i++)c+=($i>=a[FNR]);print c}' file2 file1
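With the sample file1 and file2 shown above, this one-liner produces the same counts:
$ awk 'NR==FNR{a[FNR]=$1;next}{c=0;for(i=1;i<=NF;i++)c+=($i>=a[FNR]);print c}' file2 file1
2
5
6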

Counting equal lines in two files

Say, I have two files and want to find out how many equal lines they have. For example, file1 is
1
3
2
4
5
0
10
and file2 contains
3
10
5
64
15
In this case the answer should be 3 (common lines are '3', '10' and '5').
This, of course, is done quite simply with python, for example, but I got curious about doing it from bash (with some standard utils or extra things like awk or whatever). This is what I came up with:
cat file1 file2 | sort | uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
It does seem too complicated for the task, so I'm wondering is there a simpler or more elegant way to achieve the same result.
P.S. Outputting the percentage of common part to the number of lines in each file would also be nice, though is not necessary.
UPD: Files do not have duplicate lines
To find the lines common to your 2 files using awk:
awk 'a[$0]++' file1 file2
This will output 3, 10 and 5.
Now, just pipe this to wc to get the number of common lines :
awk 'a[$0]++' file1 file2 | wc -l
This will output 3.
Explanation:
Here, a works like a dictionary with a default value of 0. When you write a[$0]++, you add 1 to a[$0], but the expression returns the previous value of a[$0] (see the difference between a++ and ++a). So you get 0 (= false) the first time you encounter a given string and 1 (or more, still = true) on subsequent occurrences.
In awk, 'condition' with no action is shorthand for printing every line for which condition is true.
Also be aware that the a[] array grows every time a new key is encountered. At the end of the script, the size of the array is the number of unique values across all input files (in the OP's example, it would be 9).
Note: this solution counts duplicates, i.e if you have:
file1 | file2
1 | 3
2 | 3
3 | 3
awk 'a[$0]++' file1 file2 will output 3 3 3 and awk 'a[$0]++' file1 file2 | wc -l will output 3
If this is a behaviour you don't want, you can use the following code to filter out duplicates :
awk '++a[$0] == 2' file1 file2 | wc -l
With your input example, this works too, but if the files are huge I prefer the awk solutions given by others:
grep -cFwf file2 file1
With your input files, the above line outputs
3
Here's one without awk that instead uses comm:
comm -12 <(sort file1.txt) <(sort file2.txt) | wc -l
comm compares two sorted files. The -1 and -2 flags suppress the lines unique to file1 and file2 respectively, so only the lines common to both remain.
The output is the lines they have in common, on separate lines. wc -l counts the number of lines.
Output without wc -l:
10
3
5
And when counting (obviously):
3
You can also use the comm command. Remember that you will have to sort the files first:
[gc@slave ~]$ sort a > sorted_1
[gc@slave ~]$ sort b > sorted_2
[gc@slave ~]$ comm -1 -2 sorted_1 sorted_2
10
3
5
From man pages for comm command:
comm - compare two sorted files line by line
Options:
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
You can do all with awk:
awk '{ a[$0] += 1} END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print c}' file1 file2
To get the percentage, something like this works:
awk '{ a[$0] += 1; if (NR == FNR) { b = FILENAME; n = NR} } END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print b, c/n; print FILENAME, c/FNR;}' file1 file2
and outputs
file1 0.428571
file2 0.6
In your solution, you can get rid of the cat:
sort file1 file2| uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
How about keeping it nice and simple...
This is all that's needed:
cat file1 file2 | sort -n | uniq -d | wc -l
3
man sort:
-n, --numeric-sort -- compare according to string numerical value
man uniq:
-d, --repeated -- only print duplicate lines
man wc:
-l, --lines -- print the newline counts
Hope this helps.
EDIT - one fewer process (credit martin):
sort file1 file2 | uniq -d | wc -l
One way using awk:
awk 'NR==FNR{a[$0]; next}$0 in a{n++}END{print n}' file1 file2
Output:
3
The first answer by Aserre using awk is good but may have the undesirable effect of counting duplicates - even if the duplicates exist in only ONE of the files, which is not quite what the OP asked for.
I believe this edit will return only the unique lines that exist in BOTH files.
awk 'NR==FNR{a[$0]=1;next}a[$0]==1{a[$0]++;print $0}' file1 file2
If duplicates are desired, but only when they exist in both files, I believe this next version will work. It will only report duplicates in the second file that also exist in the first file. (If duplicates exist in the first file, only those that also exist in file2 will be reported, so file order matters.)
awk 'NR==FNR{a[$0]=1;next}a[$0]' file1 file2
Btw, I tried using grep, but it was painfully slow on files with a few thousand lines each. Awk is very fast!
UPDATE 1: the new version ensures intra-file duplicates are excluded from the count, so only cross-file duplicates show up in the final stats:
mawk '
BEGIN { _*= FS = "^$"
} FNR == NF { split("",___)
} ___[$_]++<NF { __[$_]++
} END { split("",___)
for (_ in __) {
___[__[_]]++ } printf(RS)
for (_ in ___) {
printf(" %\04715.f %s\n",_,___[_]) }
printf(RS) }' \
<( jot - 1 999 3 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 2 1024 7 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 7 1295 17 | mawk '1;1;1;1;1' | shuf )
3 3
2 67
1 413
===========================================
This is probably way overkill, but I wrote something similar to supplement uniq -c:
measuring the frequency of frequencies.
It's like uniq -c | uniq -c without wasting time sorting. The summation and % parts are trivial from here, with 47 overlapping lines in this example. It avoids spending any time on per-row processing, since the current setup only shows the summarized stats.
If you need the actual duplicated rows, they're also available right there, serving as the hash keys of the first array.
gcat <( jot - 1 999 3 ) <( jot - 2 1024 7 ) |
mawk '
BEGIN { _*= FS = "^$"
} { __[$_]++
} END { printf(RS)
for (_ in __) { ___[__[_]]++ }
for (_ in ___) {
printf(" %\04715.f %s\n",
_,___[_]) } printf(RS) }'
2 47
1 386
Add another file and the results reflect the change (I added <( jot - 5 1295 5 )):
3 9
2 115
1 482

Resources