I have a file that I want to truncate to 2kb (i.e. keep the first 2kb of data, get rid of the rest). How can I do this with bash?
The command is (surprise, surprise) truncate.
truncate -s 2K file
(With GNU truncate, the suffix 2K means 2×1024 = 2048 bytes; 2KB would mean 2×1000 = 2000 bytes.)
The standards-compliant way to do this (not relying on any Linux-only tools such as truncate) is to use dd:
dd if=/dev/null of=/file/to/truncate seek=1 bs=2k
Unlike the other dd answer, which merely copies the first 2k of a file, this one truncates the target file at that point.
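A quick sanity check you can run yourself (scratch file via mktemp, so no real paths are touched): create a 5000-byte file, truncate it in place with dd, and confirm the new size.

```shell
# dd with if=/dev/null copies zero bytes, but (without conv=notrunc)
# still truncates the output file at the seek offset: 1 block of 2k.
f=$(mktemp)
head -c 5000 /dev/urandom > "$f"
dd if=/dev/null of="$f" seek=1 bs=2k 2>/dev/null
wc -c < "$f"    # 2048
rm -f "$f"
```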
You could do something like this in pure bash:
IFS= read -r -n 2048 first2k < file
printf "%s" "$first2k" > file
but using dd is a much better idea. First, it's more likely to be atomic; an external process could modify the first 2048 bytes of file after the read but before the printf. Second, read/printf will mangle binary data (bash variables cannot hold NUL bytes), while dd copies bytes verbatim. And third, it's less verbose :)
You can also use read's default variable REPLY, which does not require setting IFS to avoid word splitting:
read -r -n 2048 < file
printf "%s" "$REPLY" > file
Use dd:
dd if=yourfile of=firstLump bs=2k count=1
if = the input file
of = the output file
bs = blocksize
count = number of blocks
Available on Linux and macOS.
Related
I'm trying to read out every other pair of bytes in a binary file using dd in a loop, but it is unusably slow.
I have a binary file on a BusyBox embedded device containing data in rgb565 format. Each pixel is 2 bytes and I'm trying to read out every other pixel to do very basic image scaling to reduce file size.
The overall size is 640x480 and I've been able to read every other "row" of pixels by looping dd with a 960 byte block size. But doing the same for every other "column" that remains by looping through with a 2 byte block size is ridiculously slow even on my local system.
i=1
while [[ $i -le 307200 ]]
do
dd bs=2 skip=$((i-1)) seek=$((i-1)) count=1 if=./tmpfile >> ./outfile 2>/dev/null
let i=i+2
done
While I get the output I expect, this method is unusable.
Is there some less obvious way to have dd quickly copy every other pair of bytes?
Sadly I don't have much control over what gets compiled into BusyBox. I'm open to other possible methods, but a dd/sh solution may be all I can use. For instance, one build has omitted head -c...
I appreciate all the feedback. I will check out each of the various suggestions and check back with results.
Skipping every other character is trivial for tools like sed or awk as long as you don't need to cope with newlines and null bytes. But Busybox's support for null bytes in sed and awk is poor enough that I don't think you can cope with them at all. It's possible to deal with newlines, but it's a giant pain because there are 16 different combinations to deal with depending on whether each position in a 4-byte block is a newline or not.
Since arbitrary binary data is a pain, let's translate to hexadecimal or octal! I'll draw some inspiration from the bin2hex and hex2bin scripts by Stéphane Chazelas. Since we don't care about the intermediate format, I'll use octal, which is a lot simpler to deal with because the final step uses printf, and printf only supports octal escapes. Stéphane's hex2bin uses awk for the hexadecimal-to-octal conversion; an oct2bin can use sed. So in the end you need sh, od, sed and printf.
I don't think you can avoid printf: it's critical to outputting null bytes. While od is essential, most of its options aren't, so it should be possible to tweak this code to support a very stripped-down od with a bit more postprocessing.
od -An -v -t o1 -w4 |
sed 's/^ \([0-7]*\) \([0-7]*\).*/printf \\\\\1\\\\\2/' |
sh
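If you want to sanity-check the pipeline away from the device, feed it a tiny literal instead of the framebuffer file; here the 8 bytes "ABCDEFGH" stand in for four 2-byte pixels (note that -w4 is a GNU/BusyBox od extension, not POSIX):

```shell
# Keep the first 2-byte pair of every 4 bytes: "ABCDEFGH" -> "ABEF".
# od emits one line per 4-byte group; sed turns the first two octal
# values of each line into a printf command; sh executes them.
printf 'ABCDEFGH' |
od -An -v -t o1 -w4 |
sed 's/^ \([0-7]*\) \([0-7]*\).*/printf \\\\\1\\\\\2/' |
sh
```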
The reason this is so fast compared to your dd-based approach is that BusyBox runs printf in the parent process, whereas dd requires its own process. Forking is slow. If I remember correctly, there's a compilation option which makes BusyBox fork for all utilities. In this case my approach will probably be as slow as yours. Here's an intermediate approach using dd which can't avoid the forks, but at least avoids opening and closing the file every time. It should be a little faster than yours.
i=$(($(wc -c <"$1") / 4))
exec <"$1"
dd ibs=2 count=1 conv=notrunc 2>/dev/null
while [ $i -gt 1 ]; do
dd ibs=2 count=1 skip=1 conv=notrunc 2>/dev/null
i=$((i - 1))
done
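To convince yourself the loop does what it should, you can run it on an 8-byte stand-in (the scratch file path here is arbitrary); the expected output is the pairs at offsets 0 and 4, i.e. "ABEF":

```shell
# 8 bytes = two 4-byte groups, so i starts at 2.
printf 'ABCDEFGH' > /tmp/pairs.bin         # hypothetical sample file
i=$(($(wc -c < /tmp/pairs.bin) / 4))
exec < /tmp/pairs.bin
dd ibs=2 count=1 conv=notrunc 2>/dev/null  # copies "AB"
while [ $i -gt 1 ]; do
    dd ibs=2 count=1 skip=1 conv=notrunc 2>/dev/null  # skips a pair, copies the next
    i=$((i - 1))
done
```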
No idea if this will be faster or even possible with BusyBox, but it's a thought...
#!/bin/bash
# Empty result file
> result
exec 3< datafile
while true; do
# Read 2 bytes into file "short"
dd bs=2 count=1 <&3 > short 2> /dev/null
[ ! -s short ] && break
# Accumulate result file
cat short >> result
# Read two bytes and discard
dd bs=2 count=1 <&3 > short 2> /dev/null
[ ! -s short ] && break
done
Or this should be more efficient:
#!/bin/bash
exec 3< datafile
for ((i=0;i<76800;i++)) ; do
# Skip 2 bytes then read 2 bytes
dd bs=2 count=1 skip=1 <&3 2> /dev/null
done > result
Or, maybe you could use netcat or ssh to send the file to a sensible (more powerful) computer with proper tools to process it and return it. For example, if the remote computer had ImageMagick it could down-scale the image very simply.
Another option might be to use Lua which has a reputation for being small, fast and well suited to embedded systems - see Lua website. There are pre-built, downloadable binaries of it there too. It is also suggested on the Busybox website here.
I have never written any Lua before, so there may be some inefficiencies but this seems to work pretty well and processes a 640x480 RGB565 image in a few milliseconds on my desktop.
-- scale.lua
-- Usage: lua scale.lua input.bin output.bin
-- Scale an image by skipping alternate lines and alternate columns
-- Set up width, height and bytes per pixel
w = 640
h = 480
bpp = 2
-- Open first argument for input, second for output
inp = assert(io.open(arg[1], "rb"))
out = assert(io.open(arg[2], "wb"))
-- Read image, one line at a time
for i = 0, h-1, 1 do
-- Read a whole line
line = inp:read(w*bpp)
-- Only use every second line
if (i % 2) == 0 then
io.write("DEBUG: Processing row: ",i,"\n")
-- Build up new, reduced line by picking substrings
reduced=""
for p = 1, w*bpp, bpp*2 do
reduced = reduced .. string.sub(line,p,p+bpp-1)
end
io.write("DEBUG: New line length in bytes: ",#reduced,"\n")
out:write(reduced)
end
end
assert(out:close())
I created a greyscale test image with ImageMagick as follows:
magick -depth 16 -size 640x480 gradient: gray:image.bin
Then I ran the above Lua script with:
lua scale.lua image.bin smaller.bin
Then I made a JPEG I could view for testing with:
magick -depth 16 -size 320x240 gray:smaller.bin smaller.jpg
The variable "a=b" contains 1 character 'a' for the name, and 1 character 'b' for the value.
Together, 2 bytes.
How many characters can you store in one byte?
The variable needs a pointer: 8 bytes.
How many bytes do pointers take up?
Together, 10 bytes.
Does a variable "a=b" stored in memory take about 10 bytes? And would 10 variables of the same size take about 100 bytes?
So 1000 variables of 1000 bytes each would be almost 1 MB of memory?
I have a file data.sh that only contains variables.
I need to retrieve the value of one variable in that file.
I do this by using a function.
(called by "'function-name' 'datafile-name' 'variable-name'")
#!/usr/pkg/bin/ksh93
readvar () {
while read -r line
do
typeset "${line}"
done < "${1}"
nameref indirect="${2}"
echo "${indirect}"
}
readvar datafile variable
The function reads the file data.sh line by line, typesetting each line as it goes. After it's done with that, it makes a name reference from the variable name in the function call to one of the variables from data.sh, and finally prints the value of that variable.
When the function is finished it no longer uses up memory, but as long as it is running it does. This means all variables in the file data.sh are at some point stored in memory. Correct?
In reality I have a file with IP addresses as variable names and nicknames as values, so I suppose memory will not be such a problem there. But if I also use this for posts of visitors, the variable values will be larger. In that case it would be possible to have the function store only, for instance, 10 variables in memory at a time.
However, I wonder whether my way of calculating the memory usage of variables makes any sense.
Edit:
This might be a solution to avoid loading the whole file in memory.
#!/bin/ksh
readvar () {
input=$(print "${2}" | sed 's/\[/\\[/g' | sed 's/\]/\\]/g')
line=$(grep "${input}" "${1}")
typeset ${line}
nameref indirect="${2}"
print "${indirect}"
}
readvar ./test.txt input[0]
With the input test.txt
input[0]=192.0.0.1
input[1]=192.0.0.2
input[2]=192.0.0.2
And the output
192.0.0.1
Edit:
Of course!!!
In the original post
Bash read array from an external file
it said:
# you could do some validation here
so:
while read -r line
do
# you could do some validation here
declare "$line"
done < "$1"
lines would be declared (or typeset in ksh) under a condition.
Your real concern seems not to be "how much memory does this take?" but "how can I avoid taking uselessly much memory for this?". I'm going to answer this one first. For a bunch of thoughts about the original question, see the end of my answer.
To avoid using up memory, I propose using grep to get the one line that is of interest to you and ignore all the others:
line=$(grep "^$2=" "$1")
Then you can extract the information you need from this line:
result=$(echo "$line" | cut -d= -f 2)
Now the variable result contains the value which would have been assigned to $2 in the file $1. Since you have no need to store more than one such result value you definitely have no memory issue.
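A throwaway demonstration of the grep/cut approach (the file contents and variable names here are made up):

```shell
# Look up one value without loading the rest of the file into the shell.
f=$(mktemp)
printf 'host=example\naddr=192.0.2.1\nnick=fred\n' > "$f"
line=$(grep "^addr=" "$f")
result=$(echo "$line" | cut -d= -f2)
echo "$result"    # 192.0.2.1
rm -f "$f"
```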
Now, to the original question:
Finding out how much memory a shell uses for each variable is tricky. You would need to look into the shell's source to be sure about the implementation. It can vary from shell to shell (you appear to be using ksh, which can differ from bash in this respect), and it can also vary from version to version.
One way to get an idea would be to watch a shell process's memory usage while making the shell set variables in large amounts:
bash -c 'a="$(head -c 1000 /dev/zero | tr "\0" x)"; for ((i=0; i<1000; i++)); do eval a${i}="$a"; done; grep ^VmPeak /proc/$$/status'
bash -c 'a="$(head -c 1000 /dev/zero | tr "\0" x)"; for ((i=0; i<10000; i++)); do eval a${i}="$a"; done; grep ^VmPeak /proc/$$/status'
bash -c 'a="$(head -c 1000 /dev/zero | tr "\0" x)"; for ((i=0; i<100000; i++)); do eval a${i}="$a"; done; grep ^VmPeak /proc/$$/status'
bash -c 'a="$(head -c 1000 /dev/zero | tr "\0" x)"; for ((i=0; i<200000; i++)); do eval a${i}="$a"; done; grep ^VmPeak /proc/$$/status'
This prints the peak amount of memory in use by a bash which sets 1000, 10000, 100000, and 200000 variables with a value of 1000 x characters. On my machine (using bash 4.2.25(1)-release) this gave the following output:
VmPeak: 19308 kB
VmPeak: 30220 kB
VmPeak: 138888 kB
VmPeak: 259688 kB
This shows that the memory used grows more or less linearly (plus a fixed offset of ~17000 kB) and that each new variable takes ~1.2 kB of additional memory.
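The ~1.2 kB per-variable figure can be cross-checked from any two of the runs, e.g. the 10000- and 200000-variable measurements (VmPeak is reported in kB, i.e. units of 1024 bytes):

```shell
# (259688 - 30220) kB spread over (200000 - 10000) extra variables,
# converted to bytes per variable (integer arithmetic).
echo $(( (259688 - 30220) * 1024 / (200000 - 10000) ))   # 1236, i.e. ~1.2 kB
```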
But as I said, other shells' results may vary.
I have this file:
/root/.aria2/aria2.txt
and I want to move it to:
/var/spool/sms/outgoing/aria2_XXXXX
Note that XXXXX are random characters.
How do I do that using only the facilities exposed by the openwrt (a GNU/Linux distribution for embedded devices) and the ash shell?
A simple way of generating a semi-random number in bash is to use the date +%N command or the shell's built-in $RANDOM:
rn=$(date +%N) # Nanoseconds
rn=${rn:3:5} # to limit to 5 digits
or, using $RANDOM, you need to check you have sufficient digits for your purpose. If 5 is the number of digits you need:
rn=$RANDOM
while [ ${#rn} -lt 5 ]; do
rn="${rn}${RANDOM}"
done
rn=${rn:0:5}
To move while providing the random suffix:
mv /root/.aria2/aria2.txt /var/spool/sms/outgoing/aria2_${rn}
On systems with /dev/random you can obtain a string of random ASCII characters with something like
dd if=/dev/random count=1 |
tr -dc ' -~' |
dd bs=8 count=1
Set bs= in the second dd invocation to the number of characters you want.
The probability of getting the same result twice is very low, but you have not told us what is an acceptable range. You should understand (or help us help you understand) what is an acceptable probability in your scenario.
Use the tempfile command
mv aria2.txt `tempfile -d $dir -p aria2`
see man tempfile for the gory details.
I am making a script to copy a lot of files from one path to another.
Every file has, on the first line, a lot of "garbage" until the word "Return-Path".
File content example:
§°ç§°*é*é*§°ç§°çççççççReturn-PathOTHERTHINGS
REST
OF
THE
FILE
EOF
Probably sed or awk could help on this.
THE PROBLEM:
I want the whole content of the file, except for anything previous then "Return-Path" and it should be stripped ONLY on the first line, in this way:
Return-PathOTHERTHINGS
REST
OF
THE
FILE
EOF
Important thing: anything before Return-Path is "binary"; in fact the files are seen as binary...
How to solve?
Ok, it's a new day, and now I do feel like coding this for you :-)
The algorithm is described in my other answer to your same question.
#!/bin/bash
################################################################################
# behead.sh
# Mark Setchell
#
# Utility to remove stuff preceding specified string near start of binary file
#
# Usage: behead.sh <infile> <outfile>
################################################################################
IN="$1"
OUT="$2"
SEARCH="Return-Path"
for i in {0..80}; do
    str=$(dd if="$IN" bs=1 count=${#SEARCH} skip=$i 2> /dev/null)
    if [ "$str" = "$SEARCH" ]; then
        # The following line will go faster if you exchange the "bs" and
        # "skip" parameters, because it will then work in bigger blocks;
        # it just looks wrong, so I haven't done it.
        dd if="$IN" of="$OUT" bs=1 skip=$i 2> /dev/null
        exit $?
    fi
done
echo "String not found, sorry."
exit 1
You can test it works like this:
#
# Create binary with 15 bytes of bash, then "Return-Path", then entire bash in file "bashed"
(dd if=/bin/bash bs=1 count=15 2>/dev/null; echo -n 'Return-Path'; cat /bin/bash) > bashed
#
# Chop off junk at start of "bashed" and save in "restored"
./behead.sh bashed restored
#
# Check the restored "bash" is exactly 11 bytes longer than original,
# as it has "Return-Path" at the beginning
ls -l bashed restored
If you save my script as "behead.sh" you will need to make it executable like this:
chmod +x behead.sh
Then you can run it like this:
./behead.sh inputfile outputfile
By the way, there is no concept of "a line" in a binary file, so I have assumed the first 80 characters - you are free to change it, of course!
Try:
sed '1s/.*Return-Path/Return-Path/'
This command substitutes anything before "Return-Path" with "Return-Path" only on the first line.
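With a printable stand-in for the binary junk (a real mail file may need a sed build that tolerates arbitrary bytes), the behaviour is easy to verify:

```shell
# Only line 1 is touched; the greedy .* removes everything before the
# *last* "Return-Path" on that line, which is fine when it occurs once.
printf '%s\n' 'JUNKJUNKReturn-PathOTHERTHINGS' 'REST' 'OF' 'THE' 'FILE' |
sed '1s/.*Return-Path/Return-Path/'
```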
I don't feel like coding this at this minute, but can give you a hint maybe. "Return-Path" is 11 characters. You can get 11 characters from a file at offset "n" with
dd if=file bs=1 count=11 skip=n
So if you do a loop with "n" starting at zero and increasing until the result matches "Return-Path", you can calculate how many bytes you need to remove from the front. Then you can do that with another dd.
Alternatively, have a look at running the file through "xxd", editing that with "sed" and then running it back through "xxd" the other way with "xxd -r".
So I am making a program that tests the average read/write speed of the hard drive using the dd command, and my code is as follows (bash):
a=1
b=1
numval=3
for i in `seq 1 3`;
do
TEST$i=$(dd if=/dev/zero of=speedtest bs=1M count=100 conv=fdatasync)
#I think that this is the problem line
done
RESULT=$(($TEST1 + $TEST2))
RESULT=$(($RESULT + $TEST3))
RESULT=$(($RESULT / $numval))
echo $RESULT > Result
The code above returns the following errors (in between the dd outputs):
TEST1=: command not found
TEST2=: command not found
TEST3=: command not found
Please help (believe it or not, this is for a school project).
edit: I understand that my variable does not have a valid name, but I'm wondering if there is a way to do this without messy "^$({$-%})$"-style syntax. Is there a way to do it without that?
You have (at least) two problems.
TEST$i=... is not valid bash syntax for a variable assignment. And if the first "word" in a command line is not a valid assignment, then it's treated as a command name. So bash goes ahead and substitutes the value of $i for $i and the output of the dd command for $(dd ...) (see below), ending up with the successive "commands" TEST1=, TEST2= and TEST3=. Those aren't known commands, so it complains.
In an assignment, the only characters you can put before the = are letters, numbers and _ (unless it is an array assignment), which means that you cannot use parameter substitution to create a variable name. (But you could use an array.)
You seem to be assuming that the dd command will output the amount of time it took, or something like that. It doesn't. In fact, it doesn't output anything on stdout. It will output several lines on stderr, but stderr isn't captured with $(...)
First problem: you can't use a variable name that's defined in terms of other variables (TEST$i=...) without jumping through some special hoops. There are several ways around this. You could use the declare command (declare TEST$i=...), or use an array (TEST[i]=... and then e.g. RESULT=$((TEST[1] + TEST[2]))), or what I'd recommend: accumulate the times as you go without bothering with the numbered TEST1 etc variables:
numval=3
result=0
for i in `seq 1 $numval`; do
test=$(dd if=/dev/zero of=speedtest bs=1M count=100 conv=fdatasync)
result=$((result + test))
done
result=$((result / numval))
(Note that I prefer to use lowercase variable names in shell scripts, to avoid accidentally using one of the shell's predefined variables and making a mess. Also, inside $(( )), variables are automatically replaced, so you don't need $ there.)
However, this still won't work because...
Second problem: dd doesn't output a number. In fact, it doesn't output anything to standard output (which $( ) captures). What it does is output a bunch of numbers and other such things to standard error. Your version of dd is a bit different from mine, but its stderr output is probably something like this:
$ dd if=/dev/zero of=/dev/null bs=1m count=100
100+0 records in
100+0 records out
104857600 bytes transferred in 0.011789 secs (8894645697 bytes/sec)
... and you presumably want to pick out the bytes/sec figure. Depending on your dd's exact output, something like this might work:
$ dd if=/dev/zero of=/dev/null bs=1m count=100 2>&1 | sed -n 's/^.*(\([0-9]*\) bytes.*$/\1/p'
8746239457
What this does is redirect dd's error output to standard output (2>&1), then pipe (|) that to a somewhat messy sed command that looks for something matching "(", then a bunch of digits, then " bytes", and outputs just the digits part.
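You can test the extraction on a canned stderr line (the numbers are from the sample output above, not from a real run):

```shell
# Pull out the digits that follow the last "(" and precede " bytes".
printf '%s\n' '104857600 bytes transferred in 0.011789 secs (8894645697 bytes/sec)' |
sed -n 's/^.*(\([0-9]*\) bytes.*$/\1/p'
# → 8894645697
```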
Here's the full script I wind up with:
#!/bin/bash
numval=3
result=0
for i in `seq 1 $numval`; do
test=$(dd if=/dev/zero of=speedtest bs=1M count=100 conv=fdatasync 2>&1 | sed -n 's/^.*(\([0-9]*\) bytes.*$/\1/p')
result=$((result + test))
done
result=$((result / numval))
echo "$result" >Result