Split into fixed-size pieces and leave the extra out - shell

I would like all output files to have the same fixed length, except that the last one may be any size up to 557 bytes.
This means the number of files can exceed the count given by the -n flag of the split command.
Code 1 (ok)
$ seq -w 1 1671 > /tmp/k && gsplit -n15 /tmp/k && wc -c xaa && wc -c xao
557 xaa
557 xao
where xaa is the first file of the sequence and xao is the last one.
Increasing the sequence by one number causes a 5-byte increase (557 -> 562) in the last file xao, which I do not understand:
$ seq -w 1 1672 > /tmp/k && gsplit -n15 /tmp/k && wc -c xaa && wc -c xao
557 xaa
562 xao
Why does a one-number increase in the sequence grow the last file (xao) by 5 bytes?
Code 2
$ seq -w 1 1671 | gsed ':a;N;$!ba;s/\n//g' > /tmp/k && gsplit -n15 /tmp/k && wc -c xaa && wc -c xao
445 xaa
455 xao
$ seq -w 1 1672 | gsed ':a;N;$!ba;s/\n//g' > /tmp/k && gsplit -n15 /tmp/k && wc -c xaa && wc -c xao
445 xaa
459 xao
so increasing the whole length by one sequence element (4 characters) leads to a 4-character increase (455 -> 459), in contrast to the first code, where the increase is 5 characters.
Code 3
Let's now keep each element of the sequence at a fixed width (5 digits once the dot is removed) using seq -w 0 0.0001 1 | gsed 's/\.//g':
$ seq -w 0 0.0001 1 | gsed 's/\.//g' | gsed ':a;N;$!ba;s/\n//g' > /tmp/k && gsplit -n15 /tmp/k && wc -c xaa && wc -c xao
3333 xaa
3344 xao
$ seq -w 0 0.0001 1.0001 | gsed 's/\.//g' | gsed ':a;N;$!ba;s/\n//g' > /tmp/k && gsplit -n15 /tmp/k && wc -c xaa && wc -c xao
3334 xaa
3335 xao
so adding one element to the sequence increases xaa by one byte but decreases xao by 9 bytes.
This behavior does not seem logical to me.
How can I fix the piece size first, for instance at 557 bytes, and then let the number of output files be determined by that?

Original answer — for Code 1
Because seq -w 1 1671 generates 5 characters per number — 4 digits and 1 newline. So adding one number to the output adds 5 bytes to the output.
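A quick way to check that arithmetic with wc:
$ seq -w 1 1671 | wc -c
8355
$ seq -w 1 1672 | wc -c
8360
8355 is exactly 15 * 557, so all fifteen pieces are 557 bytes; 8360 = 15 * 557 + 5, and split hands the 5 leftover bytes to the last piece, xao (557 -> 562).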
Extra answer — for Code 2
You've asked GNU split (aka gsplit) to split the input file into 15 chunks. It does its best to even the sizes out, but there's a limit to what it can do when the total number of bytes is not a multiple of 15. There are options to control what happens.
However, in the basic form, the -n 15 option means that the first 14 output files each get 445 characters and the last gets 455, because there are 6685 = 445 * 15 + 10 characters in the output file. When you add another number, the file grows by only 4 characters (the newlines have been deleted), and the last output file receives those 4 extra characters (6689 = 445 * 15 + 14).
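The byte counts bear this out; the +1 beyond 4 * 1671 is the trailing newline that sed leaves on its single output line:
$ seq -w 1 1671 | gsed ':a;N;$!ba;s/\n//g' | wc -c
6685
$ seq -w 1 1672 | gsed ':a;N;$!ba;s/\n//g' | wc -c
6689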
Extra answer — for Code 3
First of all, the output from seq -w 0 0.0001 1 looks like:
0.0000
0.0001
0.0002
…
0.9998
0.9999
1.0000
So after the output is edited with the first sed, the numbers from 00000 to 10000 are present, one per line, with 6 characters per line (including the newline). The second sed then eliminates the newlines, as before.
There are 50006 bytes in /tmp/k on one line. That's equal to 15 * 3333 + 11, hence the first output. The second variant has 50011 bytes in /tmp/k, which is 15 * 3334 + 1. Hence the difference of only one.
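As for the final question, fixing the piece size and letting the file count follow is exactly what split's -b option does (-C is similar but avoids splitting mid-line). A minimal sketch with the Code 1 data, assuming no leftover x?? files in the directory:
$ seq -w 1 1672 > /tmp/k && gsplit -b 557 /tmp/k && ls x?? | wc -l
16
$ wc -c xaa && wc -c xap
557 xaa
5 xap
The input is 1672 * 5 = 8360 bytes, and 8360 = 15 * 557 + 5, so you get 15 full 557-byte pieces plus a 5-byte remainder file.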

Related

How to get the number of different lines in bash [duplicate]

I have a command (cmd1) that greps through a log file to filter out a set of numbers. The numbers are in random order, so I use sort -gr to get a reverse sorted list of numbers. There may be duplicates within this sorted list. I need to find the count for each unique number in that list.
For example, if the output of cmd1 is:
100
100
100
99
99
26
25
24
24
I need another command that I can pipe the above output to, so that, I get:
100 3
99 2
26 1
25 1
24 2
How about:
$ echo "100 100 100 99 99 26 25 24 24" \
| tr " " "\n" \
| sort \
| uniq -c \
| sort -k2nr \
| awk '{printf("%s\t%s\n",$2,$1)}END{print}'
The result is :
100 3
99 2
26 1
25 1
24 2
uniq -c works for GNU uniq 8.23 at least, and does exactly what you want (assuming sorted input).
If order is not important:
# echo "100 100 100 99 99 26 25 24 24" | awk '{for(i=1;i<=NF;i++)a[$i]++}END{for(o in a) printf "%s %s ",o,a[o]}'
26 1 100 3 99 2 24 2 25 1
Numerically sort the numbers in reverse, then count the duplicates, then swap the left and the right words. Align into columns.
printf '%d\n' 100 99 26 25 100 24 100 24 99 \
| sort -nr | uniq -c | awk '{printf "%-8s%s\n", $2, $1}'
100 3
99 2
26 1
25 1
24 2
In Bash, we can use an associative array to count instances of each input value. Assuming we have the command $cmd1, e.g.
#!/bin/bash
cmd1='printf %d\n 100 99 26 25 100 24 100 24 99'
Then we can count values in the array variable a using the ++ mathematical operator on the relevant array entries:
while read -r i
do
((++a["$i"]))
done < <($cmd1)
We can print the resulting values:
for i in "${!a[#]}"
do
echo "$i ${a[$i]}"
done
If the order of output is important, we might need an external sort of the keys:
for i in $(printf '%s\n' "${!a[@]}" | sort -nr)
do
echo "$i ${a[$i]}"
done
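Put together, a minimal self-contained version (a sketch: declare -A needs bash 4+, and read -r prevents backslash mangling; with purely numeric keys like these, a plain indexed array would also work):
#!/bin/bash
declare -A a                                       # value -> count
cmd1='printf %d\n 100 99 26 25 100 24 100 24 99'
while read -r i
do
    ((++a["$i"]))                                  # bump the count for this value
done < <($cmd1)
for i in $(printf '%s\n' "${!a[@]}" | sort -nr)    # keys in reverse numeric order
do
    echo "$i ${a[$i]}"
done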
If you have the input stored in my_file you can do:
sort -nr my_file | uniq -c | awk ' { t = $1; $1 = $2; $2 = t; print; } '
Otherwise just pipe the input to be processed into the same command.
Explanation:
sort -nr sorts the input numerically (-n) in reverse order (-r)
uniq -c count duplicates and shows the count side-by-side
awk '{ t = $1; $1 = $2; $2 = t; print; }' swaps the two columns

awk length is counting +1

I'm trying, as an exercise, to output how many words exist in the dictionary for each possible length.
Here is my code:
$ awk '{print length}' dico.txt | sort -nr | uniq -c
Here is the output:
...
1799 5
427 4
81 3
1 2
My problem is that awk's length counts one character too many for each word in my file. The right output should have been:
1799 4
427 3
81 2
1 1
I checked my file and it does not contain any space after the word:
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
...
So I guess awk is counting the newline as a character, even though it is not supposed to.
Is there a solution, or am I doing something wrong?
I'm going to venture a guess: your awk expects Unix-style newlines (LF), but your dico.txt has Windows-style line endings (CR+LF). That would easily give you the +1 on all lengths.
I took your four words:
$ cat dico.txt
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
And ran your line:
$ awk '{print length}' dico.txt | sort -nr | uniq -c
1 11
1 10
1 8
1 7
So far so good. Now the same, but with dico.txt converted to Windows newlines:
$ cat dico.txt | todos > dico_win.txt
$ awk '{print length}' dico_win.txt | sort -nr | uniq -c
1 12
1 11
1 9
1 8
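If that's the cause, strip the carriage returns before counting: either convert the file once (with fromdos, dos2unix, or whatever you have installed), or drop the trailing CR inside awk itself. A minimal sketch of the awk route:
$ awk '{ sub(/\r$/, ""); print length }' dico_win.txt | sort -nr | uniq -c
1 11
1 10
1 8
1 7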

How to grep two columns from a single file

cat Error00
4 0 375
4 2001 21
4 2002 20
cat Error01
4 0 465
4 2001 12
4 2002 40
4 2016 1
I want output as below:
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
I am using the query below. The problem is that I am not able to grep on two fields because of the space between them. Please suggest how to get rid of this.
keylist=$(awk '{print $1,$2}' Error0[0-1] | sort | uniq)
for key in ${keylist} ; do
echo ${key}
val_a=$(grep "^${key}" Error00 | awk '{print $3}') ;val_a=${val_a:---}
val_b=$(grep "^${key}" Error01 | awk '{print $1,$2}') ; val_b=${val_b:--- --}
echo $key ${val_a} >>testreport
done
I am getting the output below:
4 375 465
0
4 21 12
2001
4 20 20
2002
4 - 1
2016
A single awk one-liner can handle this easily:
awk 'FNR==NR{a[$1,$2]=$3;next}{print $1,$2,(a[$1,$2]?a[$1,$2]:"-"),$3}' err0 err1
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
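The same program spelled out with comments (note that the output is driven by the second file, so a key appearing only in err0 would be silently dropped; with this data err1 contains every key, so that is fine):
awk '
    FNR==NR { a[$1,$2] = $3; next }                    # first file: remember column 3, keyed on (col1, col2)
    { print $1, $2, (a[$1,$2] ? a[$1,$2] : "-"), $3 }  # second file: look up the err0 value, "-" if absent
' err0 err1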
For formatted output you can use printf instead of print, as Jonathan Leffler suggests:
printf "%s %-6s %-6s %s\n",$1,$2,(a[$1,$2]?a[$1,$2]:"-"),$3
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
However, a general solution is to use column -t for nice tabular output:
awk '{....}' err0 err1 | column -t
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
grep is not really the right tool for this job. You can either play with awk or Perl (or Python, or …), or you can use join. However, join only joins on a single column at a time, and you appear to need to join on two columns. So we're going to have to massage the data so that it works with join. I'm going to assume you're using bash and so have process substitution available. You can do the job without it, but it is fiddlier and involves temporary files (and traps to clean them up, etc.).
The key to the join is to replace the blank between the first two columns with a colon (or any other convenient character — control-A would work fine too), then join the files on the fused first column. The inputs must be sorted; in the output, the colon must be replaced with a blank.
$ join -o 0,1.2,2.2 -a 1 -a 2 -e '-' \
> <(sed 's/  */:/' Error00 | sort) \
> <(sed 's/  */:/' Error01 | sort) |
> sed 's/:/ /'
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
$
The 's/  */:/' operation replaces the first sequence of one or more blanks with a colon; the input data has two blanks between the 4 and the 0 in the first line of Error00. The input to join must be in sorted order of the joining field, here the first field. The output is the join field, the second column of Error00, and the second column of Error01 (remembering that "second column" means the second column after the first two have been fused by the colon). If there's an unmatched line in the first file, generate an output line (-a 1); ditto for the second file; and for the missing fields, insert a dash (-e '-'). The final sed removes the colon that was added.
If you want the data formatted, pipe it through awk.
$ join -o 0,1.2,2.2 -a 1 -a 2 -e '-' \
> <(sed 's/  */:/' Error00 | sort) \
> <(sed 's/  */:/' Error01 | sort) |
> sed 's/:/ /' |
> awk '{printf("%s %-6s %-6s %s\n", $1, $2, $3, $4)}'
4 0 375 465
4 2001 21 12
4 2002 20 40
4 2016 - 1
$

Bash: Limit output of ls and grep

Let me present an example and then try to explain my problem:
noob#noob:~/Downloads$ ls | grep srt$
Elementary - 01x01 - Pilot.LOL.English.HI.C.orig.Addic7ed.com.srt
Haven - 01x01 - Welcome to Haven.DVDRip.SAiNTS.English.updated.Addic7ed.com.srt
Haven - 01x01 - Welcome to Haven.FQM.English.HI.C.updated.Addic7ed.com.srt
Supernatural - 08x01 - We Need to Talk About Kevin.LOL.English.HI.C.updated.Addic7ed.com.srt
The Big Bang Theory - 06x02 - The Decoupling Fluctuation.LOL.English.HI.C.orig.Addic7ed.com.srt
Torchwood - 1x01 - Everything changes.0TV.English.orig.Addic7ed.com.srt
Torchwood - 1x01 - Everything changes.divx.English.updated.Addic7ed.com.srt
Now I only want to delete the first four results of the above command. Normally, if I had to delete all the files, I would do ls | grep srt$ | xargs -I {} rm {}, but in this case I only want to delete the top four.
So, how can I limit the output of ls and grep, or what alternate way would achieve this?
You can pipe your commands through head -n to limit the output to n lines:
ls | grep 'srt$' | head -4
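To actually delete just those four, feed the limited list to xargs as before (a sketch; -I {} passes xargs one name per line, so the embedded spaces are fine as long as no filename contains a newline):
ls | grep 'srt$' | head -4 | xargs -I {} rm {}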
$ for i in `seq 1 345`; do echo $i ;done | sed -n '1,4p'
1
2
3
4
$ for i in `seq 1 345`; do echo $i ;done | sed -n '335,360p'
335
336
337
338
339
340
341
342
343
344
345
If you don't have too many files, you can use a bash array:
matching_files=( *.srt )
rm "${matching_files[#]:0:4}"
