Using Bash to concatenate the first 10 bytes of many files - bash

I am trying to write a simple Bash loop to concatenate the first 10 bytes of all the files in a directory.
So far, I have the code block:
for filename in /content/*.bin;
do
cat -- (`head --bytes 10 $filename`) > "file$i.combined"
done
However, the the syntax is clearly incorrect here. I know the inner command:
head --bytes 10 $filename
...returns what I need; the first 10 bytes of the passed filename. And when I use:
cat -- $filename > "file$i.combined"
...the code works, only it concats the entire file contents.
How can I combine the two functions so that my loop concatenates the first 10 bytes of all the looped files?

The loop will do the concatenation for you (or rather, the output of the sequential executions of head are written in order to the standard output of the loop itself); you don't need to use cat.
for filename in /content/*.bin; do
head --bytes 10 "$filename"
done > "file$i.combined"

With head from GNU coreutils:
head -qc 10 /content/*.bin > combined
For the very unlikely case you run into problems with ARG_MAX (either because of very long filenames, or a huge number of files) you can use
find /content -maxdepth 1 -name '*.bin' -exec head -qc 10 {} + > combined

You should use #chepner's answer, but to show how you could use cat...
Here we have a recursive function that rolls everything up into a single copy of cat.
catHeads() {
local first="$1"; shift || return
case $# in
0) head --bytes 10 "$first" ;;
1) cat <(head --bytes 10 "$first") <(head --bytes 10 "$1");;
*) cat <(head --bytes 10 "$first") <(catHeads "$#");;
esac
}
catHeads /content/*.bin >"file$i.combined"
Don't ever actually do this. Use chepner's answer (or Socowi's) instead.

Related

How to use locale variable and 'delete last line' in sed

I am trying to delete the last three lines of a file in a shell bash script.
Since I am using local variables in combination with the Regex syntax in sed the answer proposed in How to use sed to remove the last n lines of a file does not cover this case. On the contrary, the cases covered deal with sed in a terminal and does not cover syntax in shell scripts, neither does it cover the use of variables in sed expressions.
The commands I have available is limited, since I am not on a Linux but use a MINGW64 for it.
sed does a create job so far, but deleting the last three lines gives me some headaches in relation of how to format the expression.
I use wc to be aware of how many lines the file has and subtract then with expr three lines.
n=$(wc -l < "$distribution_area")
rel=$(expr $n - 3)
The start point for deleting lines is defined by rel but accessing the local variable happens through the $ and unfortunately the syntax of sed is using the $ to define the end of file. Hence,
sed -i "$rel,$d" "$distribution_area"
won't work, and what ever variant of combinations e.g. '"'"$rel"'",$d' gives me sed: -e expression #1, char 1: unknown command: `"' or something similar.
Can somebody show me how to combine the variable with the $d regex syntax of sed?
sed -i "$rel,$d" "$distribution_area"
Here you're missing the variable name (n) for the second arg.
Consider the following example on a file called test that contains 1-10:
n=$(wc -l < test)
rel=$(($n - 3))
sed "$rel,$n d" test
Result:
1
2
3
4
5
6
To make sure the d will not interfere with the $n, you can add a space instead of escaping.
If you have a recent head available, I'd recommend something like:
head -n -3 test
Can somebody show me how to combine the variable with the $d regex syntax of sed?
$d expands to a varibale d, you have to escape it.
"$rel,\$d"
or:
"$rel"',$d'
But I would use:
head -n -3 "$distribution_area" > "$distribution_area".tmp
mv "$distribution_area".tmp "$distribution_area"
You can remove the last N lines using only pure Bash, without forking additional processes (such as sed). Such scripts look ugly, but they would work in any environment where only Bash runs and nothing else is available, no other binaries like sed, awk etc.
If the entire file fits in RAM, a straightforward solution is to split it by lines and print all but the N trailing ones:
delete_last_n_lines() {
local -ir n="$1"
local -a lines
readarray lines
((${#lines[#]} > n)) || return 0
printf '%s' "${lines[#]::${#lines[#]} - n}"
}
If the file does not fit in RAM, you can keep a FIFO buffer that stores N lines (N + 1 in the “implementation” below, but that’s just a technical detail), let the file (arbitrarily large) flow through the buffer and, after reaching the end of the file, not print out what remains in the buffer (the last N lines to remove).
delete_last_n_lines() {
local -ir n="$1 + 1"
local -a lines
local -i pos i
for ((i = 0; i < n; ++i)); do
IFS= read -r lines[i] || return 0
done
printf '%s\n' "${lines[pos]}"
while IFS= read -r lines[pos++]; do
((pos %= n))
printf '%s\n' "${lines[pos]}"
done
}
The following example gets 10 lines of input, 0 to 9, but prints out only 0 to 6, removing 7, 8 and 9 as desired:
printf '%s' {0..9}$'\n' | delete_last_n_lines 3
Last but not least, this simple hack lacks sed’s -i option to edit files in-place. That could be implemented (e.g.) using a temporary file to store the output and then renaming the temporary file to the original. (A more sophisticated approach would be needed to avoid storing the temporary copy altogether. I don’t think Bash exposes an interface like lseek() to read files “backwards”, so this cannot be done in Bash alone.)

Concatenate specific number of files

I have a bunch of files named uv_set_XXXXXXXX where the 6 Xs stand for the usual format year, month and day. Imagine I have 325 files of this type. I would like to concatenate by groups of 50 files, so in the end I have 7 files (6 files of 50 and 1 of 25).
I have been thinking in using cat but I can't see an option to select a number of files from a list. I could do this with Python, but just wondering if some Unix command line utility does it more directly.
Thanks.
With GNU parallel you can use the following command
parallel -n50 "cat {} > out{#}" ::: uv_set_*
This will merge the first 50 files into out1, the next 50 files into out2, and so on.
I would just break down and do this in Awk.
awk 'FNR==1 && (++i%50 == 0) {
if(NR>1) close p;
p = "dest_" ++j }
{ print >p }' uv_set_????????
This creates files dest_1 through dest_7, the first 6 with 50 files in each and the last with the remainder.
Closing the previous file is necessary because the system only allows Awk to have a limited number of open file handles (though the limit is typically higher than 7 so it's probably not important in your example).
Thinking out loud dept, just to prevent anyone else from wasting time on repeating this dead end.
You could use xargs -L 50 cat to concatenate 50 files at a time, but there is no simple way to pass in a new redirection for standard output for each invocation. You could try to hack your way around that with something like
# XXX Do not use: incomplete
printf '%s\n' uv_set_???????? |
xargs -L 50 sh -c 'cat "$#" > ... something' _
but I can't come up with elegant way to have a different something each time.

`uniq` without sorting an immense text file?

I have a stupidly large text file (i.e. 40 gigabytes as of today) that I would like to filter for unique lines without sorting the file.
The file has unix line endings, and all content matches [[:print:]]. I tried the following awk script to display only unique lines:
awk 'a[$0] {next} 1' stupid.txt > less_stupid.txt
The thought was that I'd populate an array by referencing its elements, using the contents of the file as keys, then skip lines that were already in the array. But this fails for two reasons -- firstly, because it inexplicably just doesn't work (even on small test files), and secondly because I know that my system will run out of memory before the entire set of unique lines is loaded into memory by awk.
After searching, I found this answer which recommended:
awk '!x[$0]++'
And while this works on small files, it also will run out of memory before reading my entire file.
What's a better (i.e. working) solution? I'm open to just about anything, though I'm more partial to solutions in languages I know (bash & awk, hence the tags). In trying to visualize the problem, the best I've come up with would be to store an array of line checksums or MD5s rather than the lines themselves, but that only saves a little space and runs the risk of checksum collisions.
Any tips would be very welcome. Telling me this is impossible would also be welcome, so that I stop trying to figure it out. :-P
The awk '!x[$0]++' trick is one of the most elegant solutions to de-duplicate a file or stream without sorting. However, it is inefficient in terms of memory and unsuitable for large files, since it saves all unique lines into memory.
However, a much more efficient implementation would be to save a constant-length hash representation of the lines in the array rather than the whole line. You can achieve this with Perl in one line and it is quite similar to the awk script.
perl -ne 'use Digest::MD5 qw(md5_base64); print unless $seen{md5_base64($_)}++' huge.txt
Here I used md5_base64 instead of md5_hex because the base64 encoding takes 22 bytes, while the hex representation 32.
However, since the Perl implementation of hashes still requires around 120bytes for each key, you may quickly run out of memory for your huge file.
The solution in this case is to process the file in chunks, splitting manually or using GNU Parallel with the --pipe, --keep-order and --block options (taking advantage of the fact that duplicate lines are not far apart, as you mentioned). Here is how you could do it with parallel:
cat huge.txt | pv |
parallel --pipe --keep-order --block 100M -j4 -q \
perl -ne 'use Digest::MD5 qw(md5_base64); print unless $seen{md5_base64($_)}++' > uniq.txt
The --block 100M option tells parallel to process the input in chunks of 100MB. -j4 means start 4 processes in parallel. An important argument here is --keep-order since you want the unique lines output to remain in the same order. I have included pv in the pipeline to get some nice statistics while the long running process is executing.
In a benchmark I performed with a random-data 1GB file, I reached a 130MB/sec throughput with the above settings, meaning you may de-duplicate your 40GB file in 4 minutes (if you have a sufficiently fast hard disk able to write at this rate).
Other options include:
Use an efficient trie structure to store keys and check for duplicates. For example a very efficient implementation is marisa-trie coded in C++ with wrappers in Python.
Sort your huge file with an external merge sort or distribution/bucket sort
Store your file in a database and use SELECT DISTINCT on an indexed column containing your lines or most efficiently md5_sums of your lines.
Or use bloom filters
Here is an example of using the Bloom::Faster module of Perl:
perl -e 'use Bloom::Faster; my $f = new Bloom::Faster({n => 100000000, e => 0.00001}); while(<>) { print unless $f->add($_); }' huge.txt > uniq.txt
You may install Bloom::Faster from cran (sudo cran and then install "Bloom::Faster")
Explanation:
You have to specify the probabilistic error rate e and the number of available buckets n. The memory required for each bucket is about 2.5 bytes. If your file has 100 million unique lines then you will need 100 million buckets and around 260MB of memory.
The $f->add($_) function adds the hash of a line to the filter and returns true if the key (i.e. the line here) is a duplicate.
You can get an estimation of the number of unique lines in your file, parsing a small section of your file with dd if=huge.txt bs=400M count=1 | awk '!a[$0]++' | wc -l (400MB) and multiplying that number by 100 (40GB). Then set the n option a little higher to be on the safe side.
In my benchmarks, this method achieved a 6MB/s processing rate. You may combine this approach with the GNU parallel suggestion above to utilize multiple cores and achieve a higher throughput.
I don't have your data (or anything like it) handy, so I can't test this, but here's a proof of concept for you:
$ t='one\ntwo\nthree\none\nfour\nfive\n'
$ printf "$t" | nl -w14 -nrz -s, | sort -t, -k2 -u | sort -n | cut -d, -f2-
one
two
three
four
five
Our raw data includes one duplicated line. The pipes function as follows:
nl adds line numbers. It's a standard, low-impact unix tool.
sort the first time 'round sorts on the SECOND field -- what would have been the beginning of the line before nl. Adjust this as required for you data.
sort the second time puts things back in the order defined by the nl command.
cut merely strips off the line numbers. There are multiple ways to do this, but some of them depend on your OS. This one's portable, and works for my example.
Now... For obscenely large files, the sort command will need some additional options. In particular, --buffer-size and --temporary-directory. Read man sort for details about this.
I can't say I expect this to be fast, and I suspect you'll be using a ginormous amount of disk IO, but I don't see why it wouldn't at least work.
Assuming you can sort the file in the first place (i.e. that you can get sort file to work) then I think something like this might work (depends on whether a large awk script file is better then a large awk array in terms of memory usage/etc.).
sort file | uniq -dc | awk '{gsub("\"", "\\\"", $0); print "$0==\""substr($0, index($0, $1) + 2)"\"{x["NR"]++; if (x["NR"]>1){next}}"} END{print 7}' > dedupe.awk
awk -f dedupe.awk file
Which on a test input file like:
line 1
line 2
line 3
line 2
line 2
line 3
line 4
line 5
line 6
creates an awk script of:
$0=="line 2"{x[1]++; if (x[1]>1){next}}
$0=="line 3"{x[2]++; if (x[2]>1){next}}
7
and run as awk -f dedupe.awk file outputs:
line 1
line 2
line 3
line 4
line 5
line 6
If the size of the awk script itself is a problem (probably unlikely) you could cut that down by using another sentinel value something like:
sort file | uniq -dc | awk 'BEGIN{print "{f=1}"} {gsub("\"", "\\\"", $0); print "$0==\""substr($0, index($0, $1) + 2)"\"{x["NR"]++;f=(x["NR"]<=1)}"} END{print "f"}'
which cuts seven characters off each line (six if you remove the space from the original too) and generates:
{f=1}
$0=="line 2"{x[1]++;f=(x[1]<=1)}
$0=="line 3"{x[2]++;f=(x[2]<=1)}
f
This solution will probably run slower though because it doesn't short-circuit the script as matches are found.
If runtime of the awk script is too great it might even be possible to improve the time by sorting the duplicate lines based on match count (but whether that matters is going to be fairly data dependent).
I'd do it like this:
#! /bin/sh
usage ()
{
echo "Usage: ${0##*/} <file> [<lines>]" >&2
exit 1
}
if [ $# -lt 1 -o $# -gt 2 -o ! -f "$1" ]; then usage; fi
if [ "$2" ]; then
expr "$2" : '[1-9][0-9]*$' >/dev/null || usage
fi
LC_ALL=C
export LC_ALL
split -l ${2:-10000} -d -a 6 "$1"
for x in x*; do
awk '!x[$0]++' "$x" >"y${x}" && rm -f "$x"
done
cat $(sort -n yx*) | sort | uniq -d | \
while IFS= read -r line; do
fgrep -x -n "$line" /dev/null yx* | sort -n | sed 1d | \
while IFS=: read -r file nr rest; do
sed -i -d ${nr}d "$file"
done
done
cat $(sort -n yx*) >uniq_"$1" && rm -f yx*
(proof of concept; needs more polishing before being used in production).
What's going on here:
split splits the file in chunks of 10000 lines (configurable); the chunks are named x000000, x000001, ...
awk removes duplicates from each chunk, without messing with the line order; the resulting files are yx000000, yx000001, ... (since awk can't portably do changes in place)
cat $(sort -n yx*) | sort | uniq -d reassembles the chunks and finds a list of duplicates; because of the way the chunks were constructed, each duplicated line can appear at most once in each chunk
fgrep -x -n "$line" /dev/null yx* finds where each duplicated line lives; the result is a list of lines yx000005:23:some text
sort -n | sed 1d removes the first chunk from the list above (this is the first occurrence of the line, and it should be left alone)
IFS=: read -r file nr rest splits yx000005:23:some text into file=yx000005, nr=23, and the rest
sed -i -e ${nr}d "$file" removes line $nr from chunk $file
cat $(sort -n yx*) reassembles the chunks; they need to be sorted, to make sure they come in the right order.
This is probably not very fast, but I'd say it should work. Increasing the number of lines in each chunk from 10000 can speed things up, at the expense of using more memory. The operation is O(N^2) in the number of duplicate lines across chunks; with luck, this wouldn't be too large.
The above assumes GNU sed (for -i). It also assumes there are no files named x* or yx* in the current directory (that's the part that could use some cleanup, perhaps by moving the junk into a directory created by mktemp -d).
Edit: Second version, after feedback from #EtanReisner:
#! /bin/sh
usage ()
{
echo "Usage: ${0##*/} <file> [<lines>]" >&2
exit 1
}
if [ $# -lt 1 -o $# -gt 2 -o ! -f "$1" ]; then usage; fi
if [ "$2" ]; then
expr "$2" : '[1-9][0-9]*$' >/dev/null || usage
fi
tdir=$(mktemp -d -p "${TEMP:-.}" "${0##*/}_$$_XXXXXXXX") || exit 1
dupes=$(mktemp -p "${TEMP:-.}" "${0##*/}_$$_XXXXXXXX") || exit 1
trap 'rm -rf "$tdir" "$dupes"' EXIT HUP INT QUIT TERM
LC_ALL=C
export LC_ALL
split -l ${2:-10000} -d -a 6 "$1" "${tdir}/x"
ls -1 "$tdir" | while IFS= read -r x; do
awk '!x[$0]++' "${tdir}/${x}" >"${tdir}/y${x}" && \
rm -f "${tdir}/$x" || exit 1
done
find "$tdir" -type f -name 'yx*' | \
xargs -n 1 cat | \
sort | \
uniq -d >"$dupes" || exit 1
find "$tdir" -type f -name 'yx*' -exec fgrep -x -n -f "$dupes" /dev/null {} + | \
sed 's!.*/!!' | \
sort -t: -n -k 1.3,1 -k 2,2 | \
perl '
while(<STDIN>) {
chomp;
m/^(yx\d+):(\d+):(.*)$/o;
if ($dupes{$3}++)
{ push #{$del{$1}}, int($2) }
else
{ $del{$1} = [] }
}
undef %dupes;
chdir $ARGV[0];
for $fn (sort <"yx*">) {
open $fh, "<", $fn
or die qq(open $fn: $!);
$line = $idx = 0;
while(<$fh>) {
$line++;
if ($idx < #{$del{$fn}} and $line == $del{$fn}->[$idx])
{ $idx++ }
else
{ print }
}
close $fh
or die qq(close $fn: $!);
unlink $fn
or die qq(remove $fn: $!);
}
' "$tdir" >uniq_"$1" || exit 1
If there's a lot of duplication, one possibility is to split the file using split(1) into manageable pieces and using something conventional like sort/uniq to make a summary of unique lines. This will be shorter than the actual piece itself. After this, you can compare the pieces to arrive at an actual summary.
Maybe not the answer you've been looking for but here goes: use a bloom filter.
https://en.wikipedia.org/wiki/Bloom_filter This sort of problem is one of the main reasons they exist.

Fastest way to print a single line in a file

I have to fetch one specific line out of a big file (1500000 lines), multiple times in a loop over multiple files, I was asking my self what would be the best option (in terms of performance).
There are many ways to do this, i manly use these 2
cat ${file} | head -1
or
cat ${file} | sed -n '1p'
I could not find an answer to this do they both only fetch the first line or one of the two (or both) first open the whole file and then fetch the row 1?
Drop the useless use of cat and do:
$ sed -n '1{p;q}' file
This will quit the sed script after the line has been printed.
Benchmarking script:
#!/bin/bash
TIMEFORMAT='%3R'
n=25
heading=('head -1 file' 'sed -n 1p file' "sed -n '1{p;q} file" 'read line < file && echo $line')
# files upto a hundred million lines (if your on slow machine decrease!!)
for (( j=1; j<=100,000,000;j=j*10 ))
do
echo "Lines in file: $j"
# create file containing j lines
seq 1 $j > file
# initial read of file
cat file > /dev/null
for comm in {0..3}
do
avg=0
echo
echo ${heading[$comm]}
for (( i=1; i<=$n; i++ ))
do
case $comm in
0)
t=$( { time head -1 file > /dev/null; } 2>&1);;
1)
t=$( { time sed -n 1p file > /dev/null; } 2>&1);;
2)
t=$( { time sed '1{p;q}' file > /dev/null; } 2>&1);;
3)
t=$( { time read line < file && echo $line > /dev/null; } 2>&1);;
esac
avg=$avg+$t
done
echo "scale=3;($avg)/$n" | bc
done
done
Just save as benchmark.sh and run bash benchmark.sh.
Results:
head -1 file
.001
sed -n 1p file
.048
sed -n '1{p;q} file
.002
read line < file && echo $line
0
**Results from file with 1,000,000 lines.*
So the times for sed -n 1p will grow linearly with the length of the file but the timing for the other variations will be constant (and negligible) as they all quit after reading the first line:
Note: timings are different from original post due to being on a faster Linux box.
If you are really just getting the very first line and reading hundreds of files, then consider shell builtins instead of external external commands, use read which is a shell builtin for bash and ksh. This eliminates the overhead of process creation with awk, sed, head, etc.
The other issue is doing timed performance analysis on I/O. The first time you open and then read a file, file data is probably not cached in memory. However, if you try a second command on the same file again, the data as well as the inode have been cached, so the timed results are may be faster, pretty much regardless of the command you use. Plus, inodes can stay cached practically forever. They do on Solaris for example. Or anyway, several days.
For example, linux caches everything and the kitchen sink, which is a good performance attribute. But it makes benchmarking problematic if you are not aware of the issue.
All of this caching effect "interference" is both OS and hardware dependent.
So - pick one file, read it with a command. Now it is cached. Run the same test command several dozen times, this is sampling the effect of the command and child process creation, not your I/O hardware.
this is sed vs read for 10 iterations of getting the first line of the same file, after read the file once:
sed: sed '1{p;q}' uopgenl20121216.lis
real 0m0.917s
user 0m0.258s
sys 0m0.492s
read: read foo < uopgenl20121216.lis ; export foo; echo "$foo"
real 0m0.017s
user 0m0.000s
sys 0m0.015s
This is clearly contrived, but does show the difference between builtin performance vs using a command.
If you want to print only 1 line (say the 20th one) from a large file you could also do:
head -20 filename | tail -1
I did a "basic" test with bash and it seems to perform better than the sed -n '1{p;q} solution above.
Test takes a large file and prints a line from somewhere in the middle (at line 10000000), repeats 100 times, each time selecting the next line. So it selects line 10000000,10000001,10000002, ... and so on till 10000099
$wc -l english
36374448 english
$time for i in {0..99}; do j=$((i+10000000)); sed -n $j'{p;q}' english >/dev/null; done;
real 1m27.207s
user 1m20.712s
sys 0m6.284s
vs.
$time for i in {0..99}; do j=$((i+10000000)); head -$j english | tail -1 >/dev/null; done;
real 1m3.796s
user 0m59.356s
sys 0m32.376s
For printing a line out of multiple files
$wc -l english*
36374448 english
17797377 english.1024MB
3461885 english.200MB
57633710 total
$time for i in english*; do sed -n '10000000{p;q}' $i >/dev/null; done;
real 0m2.059s
user 0m1.904s
sys 0m0.144s
$time for i in english*; do head -10000000 $i | tail -1 >/dev/null; done;
real 0m1.535s
user 0m1.420s
sys 0m0.788s
How about avoiding pipes?
Both sed and head support the filename as an argument. In this way you avoid passing by cat. I didn't measure it, but head should be faster on larger files as it stops the computation after N lines (whereas sed goes through all of them, even if it doesn't print them - unless you specify the quit option as suggested above).
Examples:
sed -n '1{p;q}' /path/to/file
head -n 1 /path/to/file
Again, I didn't test the efficiency.
I have done extensive testing, and found that, if you want every line of a file:
while IFS=$'\n' read LINE; do
echo "$LINE"
done < your_input.txt
Is much much faster then any other (Bash based) method out there. All other methods (like sed) read the file each time, at least up to the matching line. If the file is 4 lines long, you will get: 1 -> 1,2 -> 1,2,3 -> 1,2,3,4 = 10 reads whereas the while loop just maintains a position cursor (based on IFS) so would only do 4 reads in total.
On a file with ~15k lines, the difference is phenomenal: ~25-28 seconds (sed based, extracting a specific line from each time) versus ~0-1 seconds (while...read based, reading through the file once)
The above example also shows how to set IFS in a better way to newline (with thanks to Peter from comments below), and this will hopefully fix some of the other issue seen when using while... read ... in Bash at times.
For the sake of completeness you can also use the basic linux command cut:
cut -d $'\n' -f <linenumber> <filename>

Grabbing every 4th file

I have 16,000 jpg's from a webcan screeb grabber that I let run for a year pointing into the back year. I want to find a way to grab every 4th image so that I can then put them into another directory so I can later turn them into a movie. Is there a simple bash script or other way under linux that I can do this.
They are named like so......
frame-44558.jpg
frame-44559.jpg
frame-44560.jpg
frame-44561.jpg
Thanks from a newb needing help.
Seems to have worked.
Couple of errors in my origonal post. There were actually 280,000 images and the naming was.
/home/baldy/Desktop/webcamimages/webcam_2007-05-29_163405.jpg
/home/baldy/Desktop/webcamimages/webcam_2007-05-29_163505.jpg
/home/baldy/Desktop/webcamimages/webcam_2007-05-29_163605.jpg
I ran.
cp $(ls | awk '{nr++; if (nr % 10 == 0) print $0}') ../newdirectory/
Which appears to have copied the images. 70-900 per day from the looks of it.
Now I'm running
mencoder mf://*.jpg -mf w=640:h=480:fps=30:type=jpg -ovc lavc -lavcopts vcodec=msmpeg4v2 -nosound -o ../output-msmpeg4v2.avi
I'll let you know how the movie works out.
UPDATE: Movie did not work.
Only has images from 2007 in it even though the directory has 2008 as well.
webcam_2008-02-17_101403.jpg webcam_2008-03-27_192205.jpg
webcam_2008-02-17_102403.jpg webcam_2008-03-27_193205.jpg
webcam_2008-02-17_103403.jpg webcam_2008-03-27_194205.jpg
webcam_2008-02-17_104403.jpg webcam_2008-03-27_195205.jpg
How can I modify my mencoder line so that it uses all the images?
One simple way is:
$ touch a b c d e f g h i j k l m n o p q r s t u v w x y z
$ mv $(ls | awk '{nr++; if (nr % 4 == 0) print $0}') destdir
Create a script move.sh which contains this:
#!/bin/sh
mv $4 ../newdirectory/
Make it executable and then do this in the folder:
ls *.jpg | xargs -n 4 ./move.sh
This takes the list of filenames, passes four at a time into move.sh, which then ignores the first three and moves the fourth into a new folder.
This will work even if the numbers are not exactly in sequence (e.g. if some frame numbers are missing, then using mod 4 arithmetic won't work).
As suggested, you should use
seq -f 'frame-%g.jpg' 1 4 number-of-frames
to generate the list of filenames since 'ls' will fail on 280k files. So the final solution would be something like:
for f in `seq -f 'frame-%g.jpg' 1 4 number-of-frames` ; do
mv $f destdir/
done
seq -f 'frame-%g.jpg' 1 4 number-of-frames
…will print the names of the files you need.
An easy way in perl (probably easily adaptable to bash) is to glob the filenames in an array then get the sequence number and remove those that are not divisible by 4
Something like this will print the files you need:
ls -1 /path/to/files/ | perl -e 'while (<STDIN>) {($seq)=/(\d*)\.jpg$/; print $_ if $seq && $seq % 4 ==0}'
You can replace the print by a move...
This will work if the files are numbered in sequence even if the number of digits is not constant like file_9.jpg followed by file_10.jpg )
Given masto's caveats about sorting:
ls | sed -n '1~4 p' | xargs -i mv {} ../destdir/
The thing I like about this solution is that everything's doing what it was designed to do, so it feels unixy to me.
Just iterate over a list of files:
files=( frame-*.jpg )
i=0
while [[ $i -lt ${#files} ]] ; do
cur_file=${files[$i]}
mungle_frame $cur_file
i=$( expr $i + 4 )
done
This is pretty cheesy, but it should get the job done. Assuming you're currently cd'd into the directory containing all of your files:
mkdir ../outdir
ls | sort -n | while read fname; do mv "$fname" ../outdir/; read; read; read; done
The sort -n is there assuming your filenames don't all have the same number of digits; otherwise ls will sort in lexical order where frame-123.jpg comes before frame-4.jpg and I don't think that's what you want.
Please be careful, back up your files before trying my solution, etc. I don't want to be responsible for you losing a year's worth of data.
Note that this solution does handle files with spaces in the name, unlike most of the others. I know that wasn't part of the sample filenames, but it's easy to write shell commands that don't handle spaces safely, so I wanted to do that in this example.
brace expansion {m..n..s} is more efficient than seq. AND it allows a bit of output formatting:
$ echo {0000..0010..2}
0000 0002 0004 0006 0008 0010
Postscript: In curl if you only want every fourth (nth) numbered images so you tell curl a step counter too. This example range goes from 0 to 100 with an increment of 4 (n):
curl -O "http://example.com/[0-100:4].png"

Resources