Solved: Grep and Dynamically Truncate at Same Time - bash

Given the following:
for ...    # whatever condition changes $z
do
    aptitude show "$z" | grep -E 'Uncompressed Size: |x' | sed 's/Uncompressed Size: //'
done
That means 3 items are output to screen ($z, Uncompressed Size, x).
I want all of that to fit on one line, and the line length I'm allowing is 100 characters.
So ($z, Uncompressed Size, x) must fit on one line. But x is very long and will have to be truncated. So there is a requirement to count the characters already "used" by $z and Uncompressed Size, so that x can be truncated dynamically. I love scripting, and being able to do this I deem an absolute must. Needless to say, all 3 items being output change on every iteration, so the characters of the first two outputs must be calculated and subtracted from the characters allowed for x, and the total across all 3 items cannot exceed 100 characters.
sed 's/.//5g'
Lmao, sometimes I wish I thought in simpler terms; complicated description + simple solution = simple problem over complicated by interpreter.
Thank you, Barmar
That only leaves the count for sed to work out: 100 minus the number of characters used by $z, which is given by ${#z}.
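A minimal sketch of the dynamic truncation, assuming the 'Uncompressed Size' line and the third item can be extracted as shown (the sed patterns and the "x:" field name are assumptions about aptitude's output format, adjust as needed):

# Hedged sketch: field extraction below is an assumption about the real output.
size=$(aptitude show "$z" | sed -n 's/^Uncompressed Size: *//p')
x=$(aptitude show "$z" | sed -n 's/^x: *//p')        # placeholder for the third item
max=$(( 100 - ${#z} - ${#size} - 2 ))                # 2 separating spaces; assumes max stays positive
printf '%s %s %s\n' "$z" "$size" "${x:0:max}"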

Related

Counting integer frequency through pipe

Description
I have a for loop in bash with 10^4 iterations in total. Each iteration a list of roughly 10^7 numbers is generated from a pipe, each number an integer between 1 and 10^8. I want to keep track of how many times each integer appeared. The ideal output would be a .txt file with 10^8 lines, each line containing a counter for the integer corresponding to the row number.
As a significant proportion of integers do not appear while others appear nearly every iteration, I imagined using a hashmap, so as to limit analysis to numbers that have appeared. However, I do not know how to fill it with numbers appearing sequentially from a pipe. Any help would be greatly appreciated!
Reproducible example:
sample.R
args = commandArgs(trailingOnly=TRUE)
n_samples = as.numeric(args[1])
n_max = as.numeric(args[2])
v = as.character(sample(1:n_max, n_samples))
writeLines(v)
for loop:
for i in {1..n_loops}
do
Rscript sample.R n_samples n_max | "COLLECT AND INCREMENT HERE"
done
where in my case n_loops = 10^4, n_samples = 10^7, and n_max = 10^8.
Simple Approach
Before doing premature optimization, try the usual approach with sort | uniq -c first -- if that is fast enough, you have less work and a shorter script. To speed things up without too much hassle, give sort a larger buffer with -S and use the simplest locale, LC_ALL=C.
for i in {1..10000}; do
Rscript sample.R n_samples n_max
done | LC_ALL=C sort -nS40% | LC_ALL=C uniq -c
The output will have lines of the form number_of_matches integer_from_the_output. Only integers which appeared at least once will be listed.
To convert this format (inefficiently) into your preferred format with 10^8 lines, each containing the count for the integer corresponding to the line number, replace the ... | sort | uniq -c part with the following command:
... | cat - <(seq 100''000''000) | LC_ALL=C sort -nS40% | LC_ALL=C uniq -c | awk '{$1--;$2=""}1'
This assumes that all the generated integers are between 1 and 10^8 inclusive. The seq supplies exactly one extra occurrence of every possible integer, which the awk script compensates for by decrementing each count ($1--) and dropping the integer itself ($2=""). The result gets mangled if any other values appear more than once.
Hash Map
If you want to go with the hash map, the simplest implementation would probably be an awk script:
for i in {1..10000}; do
Rscript sample.R n_samples n_max
done | awk '{a[$0]++} END {for (ln=1; ln<=100000000; ln++) print int(a[ln])}'
However, I'm unsure whether this is such a good idea. The hash map could allocate much more memory than the actual data requires and is probably slow for that many entries.
Also, your awk implementation has to support large numbers; 32-bit integers are not sufficient. If the entire output is just the same integer repeated over and over again, you can get up to ...
10^4 iterations * 10^7 occurrences/iteration = 10^(4+7) occurrences = 10^11 occurrences
... of that integer. To store the maximal count of 10^11 you need at least 37 bits, since log2(10^11) ≈ 36.5.
GNU awk 5 on a 64-bit system seems to handle numbers of that size.
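A quick sanity check, assuming gawk is installed (10^11 is well below 2^53, so the default double-precision numbers keep it exact):

gawk 'BEGIN { printf "%d\n", 10^11 + 1 }'    # prints 100000000001 if counts of that size survive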
Faster Approach
Counting occurrences in a data structure is a good idea. However, a hash map is overkill as you have "only" 10^8 possible values as output. Therefore, you can use an array with 10^8 entries of 64-bit counters. The array would use ...
64 bit * 10^8 = 8 byte * 10^8 = 8 * 10^8 byte = 800 MByte
... of memory. I think 800 MByte should be free even on old PCs and Laptops from 10 years ago.
To implement this approach, use a "normal" programming language of your choice. Bash is not the right tool for this job. You can use bash to pipe the output of the loop into your program. Alternatively, you can execute the for loop directly in your program.
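For the piping variant, a minimal sketch could look like this (count_occurrences is a hypothetical program that implements the 10^8-entry counter array; only the loop itself stays in bash):

# Hedged sketch: ./count_occurrences is a hypothetical external program that reads
# integers on stdin, keeps a 10^8-entry array of 64-bit counters, and prints one
# count per line at the end.
for i in {1..10000}; do
    Rscript sample.R n_samples n_max
done | ./count_occurrences > counts.txt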

Unexpected arithmetic result with zero padded numbers

I have a problem in my script wherein I'm reading a file and each line has data which is a representation of an amount. The said field always has a length of 12 and it's always a whole number. So let's say I have an amount of 25,000, the data will look like this 000000025000.
Apparently, I have to get the total amount of these lines but the zero prefixes are disrupting the computation. If I add the above mentioned number to a zero value like this:
echo $(( 0 + 000000025000 ))
Instead of getting 25000, I get 10752. I was thinking of looping through 000000025000 and, once I finally hit a non-zero digit, taking the substring from that index onwards. However, I'm hoping there is a more elegant solution for this.
The number 000000025000 is an octal number as it starts with 0.
If you use bash as your shell, you can use the prefix 10# to force the base number to decimal:
echo $(( 10#000000025000 ))
From the bash man pages:
Constants with a leading 0 are interpreted as octal numbers. A leading 0x or 0X denotes hexadecimal. Otherwise, numbers take the form [base#]n, where the optional base is a decimal number between 2 and 64 representing the arithmetic base, and n is a number in that base.
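Applied to the original task of totalling the amounts, a minimal bash sketch might look like this (amounts.txt is a placeholder for the real input, one 12-digit amount per line):

# Hedged sketch: amounts.txt is a placeholder; adjust to however you read the field.
total=0
while read -r amount; do
    total=$(( total + 10#$amount ))   # 10# forces base-10 interpretation
done < amounts.txt
echo "$total"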
Using Perl
$ echo "000000025000" | perl -ne ' { printf("%d\n",scalar($_)) } '
25000

Removing blankspace at the start of a line (size of blankspace is not constant)

I am a beginner to using sed. I am trying to use it to edit down a uniq -c result to remove the spaces before the numbers so that I can then convert it to a usable .tsv.
The furthest I have gotten is to use:
$ sed 's|\([0-9].*$\)|\1|' comp-c.csv
With the input:
8 Delayed speech and language development
15 Developmental Delay and additional significant developmental and morphological phenotypes referred for genetic testing
4 Developmental delay AND/OR other significant developmental or morphological phenotypes
1 Diaphragmatic eventration
3 Downslanted palpebral fissures
The output from this is identical to the input; it recognises (I have tested it with a simple substitute) the first number but also drags in the prior blankspace for some reason.
To clarify, I would like to remove all spaces before the numbers; hardcoding a simple trimming will not work as some lines contain double/triple digit numbers and so do not have the same amount of blankspace before the number.
Bonus points for some way to produce a usable uniq -c result without this faffing around with blank space.
It's all about writing the correct regex:
sed 's/^ *//' comp-c.csv
That is, replace zero or more spaces at the start of lines (as many as there are) with nothing.
Bonus points for some way to produce a usable uniq -c result without this faffing around with blank space.
The uniq command doesn't have a flag to print its output without the leading blanks. There's no other way than to strip it yourself.
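That said, the stripping and the conversion to TSV can be combined into one step; a rough sketch, assuming GNU sed (where \t means a tab) and that comp-c.csv already holds the uniq -c output:

# Strip leading spaces, then turn the first remaining space (between the count
# and the text) into a tab. Requires GNU sed for \t.
sed 's/^ *//; s/ /\t/' comp-c.csv > comp-c.tsv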

Strange column number " :0 " in VT100 terminal protocol

I am interpreting some output from a serial port. The output is in the VT100 protocol, which uses control sequences to set the cursor location on screen. The control sequence looks like this:
ESC[row;columnH
For example,
ESC[01;01H means set cursor to row 1, column 1.
But I see the following sequence when the column number exceeds two digits.
ESC[10;:0H
Note the extra ":" after the semicolon. This control sequence comes after ESC[10;99H, which means row 10, column 99.
My understanding is :0 = 100. But what if the column number is 200?
I don't think that's actually valid or, if it is, it's entirely by accident. The arguments passed to the CUP (cursor position) command (and many others involved in screen coordinates) are limited to one or two digits.
In the ASCII table, the digit 9 is followed by :, so, where 99 would represent 9 * 10 + 9, :0 may represent 10 * 10 + 0, or 100.
Assuming the bug holds up for higher numbers (something I'm not confident of), you're looking for 200, which would be 20 * 10 + 0 or probably D0 (D being the character ten higher than : in the ASCII table).
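For reference, the character codes involved can be checked from a shell (bash's printf treats a leading quote in a numeric argument as "give me this character's code"): ':' is 58 and 'D' is 68, ten positions higher.

printf '%d %d\n' "':" "'D"    # prints: 58 68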
No, the relevant standards do not specify that the number of digits is limited to two, for instance because VT100s can address 24 rows by 132 columns.
Leading zeroes in the parameters are ignored. Likely, OP is reporting a problem (from some unmentioned program) which uses only two digits. That is not related to the terminal itself (except perhaps in the context of a bug report directed to a terminal emulator's developers).
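A quick way to check the multi-digit behaviour yourself is to emit a CUP sequence with a three-digit column parameter; on a VT100-compatible terminal that is wide enough, the cursor lands on column 100 (\033 is the ESC character):

# Move the cursor to row 10, column 100 -- parameters are not limited to two digits.
printf '\033[10;100H'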
The resize program assumes that one's terminal is no larger than 999 by 999 to position the cursor to "past" the lower-right corner of the screen. For those individuals who do not rely upon multiple pixels to discern characters, xterm does use a font called "Unreadable", which could result in larger screens.
By the way, the source given in the question is not very good, although not the worst -- refer to vt100.net and ECMA-48.

Is there a way to split a large file into chunks with random sizes?

I know you can split a file with split, but for test purposes I would like to split a large file into chunks whose sizes differ. Is this possible?
Alternatively, if the above-mentioned file is a zip, is there a way to split it into volumes of unequal sizes?
Any suggestions welcome! Thanks!
So the general question that you're asking is: how can I compute N random integers that sum to S? Specifically, S is the size of your file and N is how many smaller files you want to break it into.
For example, assume that you want to split your file into 4 parts. If a, b, c, and d are four random numbers, then:
a + b + c + d = X
a/X + b/X + c/X + d/X = 1
S*a/X + S*b/X + S*c/X + S*d/X = S
This gives us four numbers that sum to S, the size of your file.
Which means you'd want to write a script that:
Computes N random numbers (any random numbers).
Computes X as the sum of those random numbers.
Multiplies each of those random numbers by S/X (and makes sure you're left with integers greater than 0 that sum to S)
Splits the original file into pieces using the generated random numbers as sizes, using whatever tool you want.
This is a little much for a shell script, but would be pretty straightforward in something like Perl. A rough bash sketch of the same steps is shown below.
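A minimal bash sketch of those four steps (GNU coreutils assumed for stat -c%s and dd's status=none; the filename big.zip and N=4 are placeholders):

# Hedged sketch: big.zip and N=4 are examples; degenerate cases (parts smaller
# than 1 byte) are not handled, and bs=1 is slow but simple.
file=big.zip
S=$(stat -c%s "$file")                      # total size in bytes (GNU stat)
N=4
rands=(); X=0
for ((i=0; i<N; i++)); do                   # step 1: N random numbers
    r=$(( RANDOM + 1 )); rands+=("$r"); X=$(( X + r ))
done                                        # step 2: X is their sum
offset=0
for ((i=0; i<N; i++)); do                   # steps 3 and 4: scale and cut
    if (( i < N - 1 )); then
        size=$(( S * rands[i] / X ))
    else
        size=$(( S - offset ))              # last part takes the remainder so sizes sum to S
    fi
    dd if="$file" of="part$i" bs=1 skip="$offset" count="$size" status=none
    offset=$(( offset + size ))
done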
Since you tagged the question only with shell, I suppose you want to handle it with a shell script and the common Linux commands/tools.
As far as I know there is no existing tool/command that can split a file randomly. To split a file, we can consider using split or dd.
Both tools support options for how big the split files should be or how many files you want. Let's say we use dd/split to first split your file into 500 parts, each with the same size. So we have:
foo.zip.001
foo.zip.002
foo.zip.003
...
foo.zip.500
Then we take this file list as input and merge groups of them back together (cat). This step can be done with awk or a shell script.
For example, we can build a set of cat statements like:
cat foo.zip.001 foo.zip.002 > part1
cat foo.zip.003 foo.zip.004 foo.zip.005 > part2
cat foo.zip.006 foo.zip.007 foo.zip.008 foo.zip.009 > part3
....
Run the generated cat statements and you get the final part1..n, each part with a different size.
For example:
kent$ seq -f'foo.zip.%g' 20|awk 'BEGIN{i=k=2}NR<i{s=s sprintf("%s ",$0);next}{k++;i=(NR+k);print "cat "s$0" >part"k-2;s=""}'
cat foo.zip.1 foo.zip.2 >part1
cat foo.zip.3 foo.zip.4 foo.zip.5 >part2
cat foo.zip.6 foo.zip.7 foo.zip.8 foo.zip.9 >part3
cat foo.zip.10 foo.zip.11 foo.zip.12 foo.zip.13 foo.zip.14 >part4
cat foo.zip.15 foo.zip.16 foo.zip.17 foo.zip.18 foo.zip.19 foo.zip.20 >part5
How well this performs you will have to test on your own... but at least it should work for your requirement.
