Grabbing every 4th file - bash

I have 16,000 jpgs from a webcam screen grabber that I let run for a year pointing into the back yard. I want to find a way to grab every 4th image so that I can put them into another directory and later turn them into a movie. Is there a simple bash script or other way under Linux that I can do this?
They are named like so:
frame-44558.jpg
frame-44559.jpg
frame-44560.jpg
frame-44561.jpg
Thanks from a newb needing help.
Seems to have worked.
A couple of errors in my original post: there were actually 280,000 images, and the naming was:
/home/baldy/Desktop/webcamimages/webcam_2007-05-29_163405.jpg
/home/baldy/Desktop/webcamimages/webcam_2007-05-29_163505.jpg
/home/baldy/Desktop/webcamimages/webcam_2007-05-29_163605.jpg
I ran:
cp $(ls | awk '{nr++; if (nr % 10 == 0) print $0}') ../newdirectory/
That appears to have copied the images; 70-900 per day, from the looks of it.
Now I'm running
mencoder mf://*.jpg -mf w=640:h=480:fps=30:type=jpg -ovc lavc -lavcopts vcodec=msmpeg4v2 -nosound -o ../output-msmpeg4v2.avi
I'll let you know how the movie works out.
UPDATE: the movie did not work.
It only has images from 2007 in it, even though the directory has 2008 as well:
webcam_2008-02-17_101403.jpg webcam_2008-03-27_192205.jpg
webcam_2008-02-17_102403.jpg webcam_2008-03-27_193205.jpg
webcam_2008-02-17_103403.jpg webcam_2008-03-27_194205.jpg
webcam_2008-02-17_104403.jpg webcam_2008-03-27_195205.jpg
How can I modify my mencoder line so that it uses all the images?
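One thing you could try (a sketch, not a tested recipe; the mf://@listfile form is documented in the MPlayer manual, but check that your mencoder build supports it) is to give mencoder an explicit, sorted list of frames instead of the *.jpg glob, so you can open the list first and confirm the 2008 images are actually in it:
printf '%s\n' *.jpg > list.txt
mencoder mf://@list.txt -mf w=640:h=480:fps=30:type=jpg -ovc lavc -lavcopts vcodec=msmpeg4v2 -nosound -o ../output-msmpeg4v2.avi
printf is a shell builtin, so it doesn't hit the command-line length limit even with tens of thousands of names, and glob expansion already sorts the names, which for webcam_YYYY-MM-DD_HHMMSS.jpg is chronological order.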

One simple way is:
$ touch a b c d e f g h i j k l m n o p q r s t u v w x y z
$ mv $(ls | awk '{nr++; if (nr % 4 == 0) print $0}') destdir
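As an aside, awk already keeps a record counter for you, so the bookkeeping can be shortened; the selection part of the one-liner above is equivalent to:
ls | awk 'NR % 4 == 0'
With 280,000 names, though, the $(...) substitution builds one very long mv command line, which is what the seq- and xargs-based answers below work around.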

Create a script move.sh which contains this:
#!/bin/sh
mv $4 ../newdirectory/
Make it executable and then do this in the folder:
ls *.jpg | xargs -n 4 ./move.sh
This takes the list of filenames, passes four at a time into move.sh, which then ignores the first three and moves the fourth into a new folder.
This will work even if the numbers are not exactly in sequence (if some frame numbers are missing, mod-4 arithmetic on the frame number itself won't pick an even spread).
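If you'd rather not create a helper script, the same trick can be done inline with sh -c; a sketch along the same lines (like the original, it assumes no whitespace or newlines in the filenames):
ls *.jpg | xargs -n 4 sh -c 'mv -- "$4" ../newdirectory/' sh
xargs passes four names per invocation, they become $1 through $4 inside the sh -c command, and only the fourth is moved. If the total count isn't a multiple of 4, the last batch has an empty $4 and mv complains once; that's harmless.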

As suggested, you should use
seq -f 'frame-%g.jpg' 1 4 number-of-frames
to generate the list of filenames, since the mv $(ls | ...) approach puts tens of thousands of names on a single command line, which can exceed the system's argument-length limit with 280k files. So the final solution would be something like:
for f in $(seq -f 'frame-%g.jpg' 1 4 number-of-frames) ; do
mv "$f" destdir/
done

seq -f 'frame-%g.jpg' 1 4 number-of-frames
…will print the names of the files you need.

An easy way in Perl (probably easily adaptable to bash) is to glob the filenames into an array, then get the sequence number and remove those that are not divisible by 4.
Something like this will print the files you need:
ls -1 /path/to/files/ | perl -e 'while (<STDIN>) {($seq)=/(\d*)\.jpg$/; print $_ if $seq && $seq % 4 ==0}'
You can replace the print by a move...
This will work if the files are numbered in sequence, even if the number of digits is not constant (e.g. file_9.jpg followed by file_10.jpg).
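For example, using the core File::Copy module and running from inside the image directory, the move version might look something like this (untested sketch; assumes ../newdirectory already exists):
ls | perl -MFile::Copy -ne 'chomp; ($seq) = /(\d+)\.jpg$/; move($_, "../newdirectory/$_") if $seq && $seq % 4 == 0'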

Given masto's caveats about sorting:
ls | sed -n '1~4 p' | xargs -i mv {} ../destdir/
The thing I like about this solution is that everything's doing what it was designed to do, so it feels unixy to me.
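One caveat: the first~step address form ('1~4') is a GNU sed extension. If your sed lacks it, a rough awk equivalent that picks the same lines (1st, 5th, 9th, ...) would be:
ls | awk 'NR % 4 == 1' | xargs -I{} mv {} ../destdir/
The usual caveats about whitespace in filenames apply here too.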

Just iterate over a list of files:
files=( frame-*.jpg )
i=0
while [[ $i -lt ${#files[@]} ]] ; do
cur_file=${files[$i]}
mungle_frame "$cur_file"
i=$(( i + 4 ))
done

This is pretty cheesy, but it should get the job done. Assuming you're currently cd'd into the directory containing all of your files:
mkdir ../outdir
ls | sort -t- -k2,2n | while read fname; do mv "$fname" ../outdir/; read; read; read; done
The sort -t- -k2,2n sorts numerically on the part after the dash (the frame number), in case your filenames don't all have the same number of digits; otherwise the listing comes out in lexical order, where frame-123.jpg comes before frame-4.jpg, and I don't think that's what you want.
Please be careful, back up your files before trying my solution, etc. I don't want to be responsible for you losing a year's worth of data.
Note that this solution does handle files with spaces in the name, unlike most of the others. I know that wasn't part of the sample filenames, but it's easy to write shell commands that don't handle spaces safely, so I wanted to do that in this example.

Brace expansion {m..n..s} is more efficient than seq (the step and zero-padding forms need bash 4 or newer), and it allows a bit of output formatting:
$ echo {0000..0010..2}
0000 0002 0004 0006 0008 0010
Postscript: curl can do this too. If you only want every fourth (nth) numbered image, you can give curl a step counter. This example range goes from 0 to 100 with an increment of 4 (n):
curl -O "http://example.com/[0-100:4].png"
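Applied to the frame files from the original question, a brace-expansion version might look like this sketch (assumes bash 4+, un-padded frame numbers, and that you replace 99999 with your real last frame number; the -e test skips any missing frames):
for f in frame-{44558..99999..4}.jpg; do
[ -e "$f" ] && mv -- "$f" ../newdirectory/
done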

Related

Using Bash to concatenate the first 10 bytes of many files

I am trying to write a simple Bash loop to concatenate the first 10 bytes of all the files in a directory.
So far, I have the code block:
for filename in /content/*.bin;
do
cat -- (`head --bytes 10 $filename`) > "file$i.combined"
done
However, the syntax is clearly incorrect here. I know the inner command:
head --bytes 10 $filename
...returns what I need; the first 10 bytes of the passed filename. And when I use:
cat -- $filename > "file$i.combined"
...the code works, only it concats the entire file contents.
How can I combine the two functions so that my loop concatenates the first 10 bytes of all the looped files?
The loop will do the concatenation for you (or rather, the outputs of the sequential executions of head are written in order to the standard output of the loop itself); you don't need to use cat.
for filename in /content/*.bin; do
head --bytes 10 "$filename"
done > "file$i.combined"
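The redirection on done opens file$i.combined once for the whole loop. An equivalent but noisier form appends inside the loop instead; shown here only to illustrate the difference (and you would need to make sure the output file starts out empty):
for filename in /content/*.bin; do
head --bytes 10 "$filename" >> "file$i.combined"   # re-opens the output file on every iteration
done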
With head from GNU coreutils:
head -qc 10 /content/*.bin > combined
For the very unlikely case you run into problems with ARG_MAX (either because of very long filenames, or a huge number of files) you can use
find /content -maxdepth 1 -name '*.bin' -exec head -qc 10 {} + > combined
You should use @chepner's answer, but to show how you could use cat...
Here we have a recursive function that rolls everything up into a single copy of cat.
catHeads() {
local first="$1"; shift || return
case $# in
0) head --bytes 10 "$first" ;;
1) cat <(head --bytes 10 "$first") <(head --bytes 10 "$1");;
*) cat <(head --bytes 10 "$first") <(catHeads "$@");;
esac
}
catHeads /content/*.bin >"file$i.combined"
Don't ever actually do this. Use chepner's answer (or Socowi's) instead.

How to use locale variable and 'delete last line' in sed

I am trying to delete the last three lines of a file in a shell bash script.
Since I am using local variables in combination with regex syntax in sed, the answer proposed in How to use sed to remove the last n lines of a file does not cover this case. The cases covered there deal with sed in a terminal and do not cover the syntax inside shell scripts, nor do they cover the use of variables in sed expressions.
The commands I have available are limited, since I am not on Linux but use MINGW64 for it.
sed does a great job so far, but deleting the last three lines gives me some headaches with regard to how to format the expression.
I use wc to find out how many lines the file has and then subtract three with expr.
n=$(wc -l < "$distribution_area")
rel=$(expr $n - 3)
The start point for deleting lines is defined by rel, but accessing the local variable happens through $, and unfortunately sed also uses $ to address the last line of the file. Hence,
sed -i "$rel,$d" "$distribution_area"
won't work, and whatever combination of quoting I try, e.g. '"'"$rel"'",$d', gives me sed: -e expression #1, char 1: unknown command: `"' or something similar.
Can somebody show me how to combine the variable with the $d regex syntax of sed?
sed -i "$rel,$d" "$distribution_area"
Here you're missing the variable name (n) for the second arg.
Consider the following example on a file called test that contains 1-10:
n=$(wc -l < test)
rel=$(($n - 3))
sed "$rel,$n d" test
Result:
1
2
3
4
5
6
To make sure the d will not interfere with the $n, you can add a space instead of escaping.
If you have a recent head available, I'd recommend something like:
head -n -3 test
Can somebody show me how to combine the variable with the $d regex syntax of sed?
$d expands to the variable d, so you have to escape it:
"$rel,\$d"
or:
"$rel"',$d'
But I would use:
head -n -3 "$distribution_area" > "$distribution_area".tmp
mv "$distribution_area".tmp "$distribution_area"
You can remove the last N lines using only pure Bash, without forking additional processes (such as sed). Such scripts look ugly, but they would work in any environment where only Bash runs and nothing else is available, no other binaries like sed, awk etc.
If the entire file fits in RAM, a straightforward solution is to split it by lines and print all but the N trailing ones:
delete_last_n_lines() {
local -ir n="$1"
local -a lines
readarray lines
((${#lines[@]} > n)) || return 0
printf '%s' "${lines[@]::${#lines[@]} - n}"
}
If the file does not fit in RAM, you can keep a FIFO buffer that stores N lines (N + 1 in the “implementation” below, but that’s just a technical detail), let the file (arbitrarily large) flow through the buffer and, after reaching the end of the file, not print out what remains in the buffer (the last N lines to remove).
delete_last_n_lines() {
local -ir n="$1 + 1"
local -a lines
local -i pos i
for ((i = 0; i < n; ++i)); do
IFS= read -r lines[i] || return 0
done
printf '%s\n' "${lines[pos]}"
while IFS= read -r lines[pos++]; do
((pos %= n))
printf '%s\n' "${lines[pos]}"
done
}
The following example gets 10 lines of input, 0 to 9, but prints out only 0 to 6, removing 7, 8 and 9 as desired:
printf '%s' {0..9}$'\n' | delete_last_n_lines 3
Last but not least, this simple hack lacks sed’s -i option to edit files in-place. That could be implemented (e.g.) using a temporary file to store the output and then renaming the temporary file to the original. (A more sophisticated approach would be needed to avoid storing the temporary copy altogether. I don’t think Bash exposes an interface like lseek() to read files “backwards”, so this cannot be done in Bash alone.)
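A minimal sketch of such a temp-file wrapper (delete_last_n_lines_in_place is a made-up name, and since it calls mktemp it is no longer bash-only; it also does no cleanup if the filter fails):
delete_last_n_lines_in_place() {
local n="$1" file="$2" tmp
tmp=$(mktemp) || return
delete_last_n_lines "$n" < "$file" > "$tmp" && mv -- "$tmp" "$file"
}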

How do I find duplicate files by comparing them by size (ie: not hashing) in bash

How do I find duplicate files by comparing them by size (ie: not hashing) in bash.
Testbed files:
-rw-r--r-- 1 usern users 68239 May 3 12:29 The W.pdf
-rw-r--r-- 1 usern users 68239 May 3 12:29 W.pdf
-rw-r--r-- 1 usern users 8 May 3 13:43 X.pdf
Yes, files can have spaces (Boo!).
I want to check files in the same directory, move the ones which match something else into 'these are probably duplicates' folder.
My probable use-case is going to have humans randomly mis-naming a smaller set of files (ie: not generating files of arbitrary length). It is fairly unlikely that two files will be the same size and yet be different files. Sure, as a backup I could hash and check two files of identical size. But mostly, it will be people taking a file and misnaming it / re-adding it to a pile, of which it is already there.
So, preferably a solution with widely installed tools (posix?). And I'm not supposed to parse the output of ls, so I need another way to get actual size (and not a du approximate).
"Vote to close!"
Hold up cowboy.
I bet you're going to suggest this (cool, you can google search):
https://unix.stackexchange.com/questions/71176/find-duplicate-files
No fdupes (nor jdupes, nor...), nor finddup, nor rmlint, nor fslint - I can't guarantee those on other systems (much less mine), and I don't want to be stuck as customer support dealing with installing them on random systems from now to eternity, nor even in getting emails about that sh...stuff and having to tell them to RTFM and figure it out. Plus, in reality, I should write my script to test functionality of what is installed, but, that's beyond the scope.
https://unix.stackexchange.com/questions/192701/how-to-remove-duplicate-files-using-bash
All these solutions want to start by hashing. Some cool ideas in some of these: hash just a chunk of both files, starting somewhere past the header, then only do full compare if those turn up matching. Good idea for double checking work, but would prefer to only do that on the very, very few that actually are duplicate. As, looking over the first several thousand of these by hand, not one duplicate has been even close to a different file.
https://unix.stackexchange.com/questions/277697/whats-the-quickest-way-to-find-duplicated-files
Proposed:
$ find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
Breaks for me:
find: unknown option -- n
usage: find [-dHhLXx] [-f path] path ... [expression]
uniq: unknown option -- w
usage: uniq [-ci] [-d | -u] [-f fields] [-s chars] [input_file [output_file]]
find: unknown option -- t
usage: find [-dHhLXx] [-f path] path ... [expression]
xargs: md5sum: No such file or directory
https://unix.stackexchange.com/questions/170693/compare-directory-trees-regarding-file-name-and-size-and-date
Haven't been able to figure out how rsync -nrvc --delete might work in the same directory, but there might be solution in there.
Well how about cmp? Yeah, that looks pretty good, actually!
cmp -z file1 file2
Bummer, my version of cmp does not include the -z size option.
However, I tried implementing it just for grins - and when it failed, looking at it I realized that I also need help constructing my loop logic. Removing things from my loops in the midst of processing them is probably a recipe for breakage, duh.
if [ ! -d ../Dupes/ ]; then
mkdir ../Dupes/ || exit 1 # Cuz no set -e, and trap not working
fi
for i in ./*
do
for j in ./*
do
if [[ "$i" != "$j" ]]; then # Yes, it will be identical to itself
if [[ $(cmp -s "$i" "$j") ]]; then
echo "null" # Cuz I can't use negative of the comparison?
else
mv -i "$i" ../Dupes/
fi
fi
done
done
https://unix.stackexchange.com/questions/367749/how-to-find-and-delete-duplicate-files-within-the-same-directory
Might have something I could use, but I'm not following what's going on in there.
https://superuser.com/questions/259148/bash-find-duplicate-files-mac-linux-compatible
If it were something that returns size, instead of md5, maybe one of the answers in here?
https://unix.stackexchange.com/questions/570305/what-is-the-most-efficient-way-to-find-duplicate-files
Didn't really get answered.
TIL: Sending errors from . scriptname will close my terminal instantly. Thanks, Google!
TIL: Sending errors from scripts executed via $PATH will close the terminal if shopt -s extdebug + trap checkcommand DEBUG are set in profile to try and catch rm -r * - but at least will respect my alias for exit
TIL: Backticks deprecated, use $(things) - Ugh, so much re-writing to do :P
TIL: How to catch non-ascii characters in filenames, without using basename
TIL: "${file##*/}"
TIL: file - yes, X.pdf is not a PDF.
On the matter of POSIX
I'm afraid you cannot get the actual file size (not the number of blocks allocated by the file) in a plain posix shell without using ls. All the solutions like du --apparent-size, find -printf %s, and stat are not posix.
However, as long as your filenames don't contain linebreaks (spaces are ok) you could create safe solutions relying on ls. Correctly handling filenames with linebreaks would require very non-posix tools (like GNU sort -z) anyway.
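For instance, such an ls-based size lookup could look like this sketch (POSIX ls -n implies the long format with numeric IDs, and the size is the fifth field; spaces in the name are fine, newlines are not):
f='The W.pdf'
size=$(ls -dn -- "$f" | awk '{print $5}')
echo "$size"   # 68239 for the testbed file above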
Bash+POSIX Approach Actually Comparing The Files
I would drop the approach to compare only the file sizes and use cmp instead. For huge directories the posix script will be slow no matter what you do. Also, I expect cmp to do some fail fast checks (like comparing the file sizes) before actually comparing the file contents. For common scenarios with only a few files speed shouldn't matter anyway as even the worst script will run fast enough.
The following script places each group of actual duplicates (at least two, but can be more) into its own subdirectory of dups/. The script should work with all filenames; spaces, special symbols, and even linebreaks are ok. Note that we are still using bash (which is not posix). We just assume that all tools (like mv, find, ...) are posix.
#! /usr/bin/env bash
files=()
for f in *; do [ -f "$f" ] && files+=("$f"); done
max=${#files[@]}
for (( i = 0; i < max; i++ )); do
sameAsFileI=()
for (( j = i + 1; j < max; j++ )); do
cmp -s "${files[i]}" "${files[j]}" &&
sameAsFileI+=("${files[j]}") &&
unset 'files[j]'
done
(( ${#sameAsFileI[@]} == 0 )) && continue
mkdir -p "dups/$i/"
mv "${files[i]}" "${sameAsFileI[@]}" "dups/$i/"
# no need to unset files[i] because loops won't visit this entry again
files=("${files[@]}") # un-sparsify array
max=${#files[@]}
done
Fairly Portable Non-POSIX Approach Using File Sizes Only
If you need a faster approach that only compares the file sizes I suggest to not use a nested loop. Loops in bash are slow already, but if you nest them you have quadratic time complexity. It is faster and easier to ...
print only the file sizes without file names
apply sort | uniq -d to retrieve duplicates in time O(n log n)
Move all files having one of the duplicated sizes to a directory
This solution is not strictly posix conformant. However, I tried to verify that the tools and options in this solution are supported by most implementations. Your find has to support the non-posix options -maxdepth and -printf with %s for the actual file size and %f for the file basename (%p for the full path would be acceptable too).
The following script places all files of the same size into the directory potential-dups/. If there are two files of size n and two files of size m, all four files end up in this single directory. The script should work with all file names except those with linebreaks (that is \n; \r should be fine though).
#! /usr/bin/env sh
all=$(find . -maxdepth 1 -type f -printf '%s %f\n' | sort)
dupRegex=$(printf %s\\n "$all" | cut -d' ' -f1 | uniq -d |
sed -e 's/[][\.|$(){}?+*^]/\\&/g' -e 's/^/^/' | tr '\n' '|' | sed 's/|$//')
[ -z "$dupRegex" ] && exit
mkdir -p potential-dups
printf %s\\n "$all" | grep -E "$dupRegex" | cut -d' ' -f2- |
sed 's/./\\&/' | xargs -I_ mv _ potential-dups
In case you wonder about some of the sed commands: They quote the file names such that spaces and special symbols are processed correctly by subsequent tools. sed 's/[][\.|$(){}?+*^]/\\&/g' is for turning raw strings into equivalent extended regular expressions (ERE) and sed 's/./\\&/' is for literal processing by xargs. See the posix documentation of xargs:
-I replstr [...] Any <blank>s at the beginning of each line shall be ignored.
[...]
Note that the quoting rules used by xargs are not the same as in the shell. [...] An easy rule that can be used to transform any string into a quoted form that xargs interprets correctly is to precede each character in the string with a backslash.

Command to list all file types and their average size in a directory

I am working on a specific project where I need to work out the make-up of a large extract of documents so that we have a baseline for performance testing.
Specifically, I need a command that can recursively go through a directory and, for each file type, inform me of the number of files of that type and their average size.
I've looked at solutions like:
Unix find average file size,
How can I recursively print a list of files with filenames shorter than 25 characters using a one-liner? and https://unix.stackexchange.com/questions/63370/compute-average-file-size, but nothing quite gets me to what I'm after.
This du and awk combination should work for you (note that du -a reports sizes in disk blocks, not bytes, so the averages will be in blocks too):
du -a mydir/ | awk -F'[.[:space:]]' '/\.[a-zA-Z0-9]+$/ { a[$NF]+=$1; b[$NF]++ }
END{for (i in a) print i, b[i], (a[i]/b[i])}'
To give you something to start with: the script below will get you a list of files and their sizes, line by line.
#!/usr/bin/env bash
DIR=ABC
cd "$DIR" || exit 1
find . -type f | while IFS= read -r line
do
# size=$(stat --format="%s" "$line") # For systems with the stat command
size=$(perl -e 'print -s $ARGV[0],"\n"' "$line") # @Mark Setchell provided the command, but I have no osx system to test it.
echo "$size $line"
done
Output sample
123 ./a.txt
23 ./fds/afdsf.jpg
Then it is your homework: with the above output, it should be easy to work out each file type and its average size.
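For example, if you save the loop's output to a file, say sizes.txt (the name is just an example), a sketch of the grouping step could be an awk stage like this; it keys on whatever follows the last dot of the filename and assumes the size comes first on each line, as above:
awk '{
name = $NF                      # last whitespace-separated chunk of the path
sub(/.*\//, "", name)           # strip leading directories
n = split(name, p, ".")
ext = (n > 1) ? p[n] : "(none)" # extension = text after the last dot, if any
sum[ext] += $1; cnt[ext]++      # $1 is the size column printed by the loop
} END {
for (e in cnt) printf "%s: %d files, avg %.1f\n", e, cnt[e], sum[e]/cnt[e]
}' sizes.txt
Filenames containing spaces still get the right extension (it sits at the end of the last chunk), but names containing newlines will break this, like most line-oriented approaches here.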
You can use "du" maybe:
du -a -c *.txt
Sample output:
104 M1.txt
8 in.txt
8 keys.txt
8 text.txt
8 wordle.txt
136 total
The output is in 512-byte blocks, but you can change it with "-k" or "-m".

print recursively the number of files in folders

I struggled for hours to get this ugly line to work
wcr() { find "$@" -type d | while read F; do find $F -maxdepth 0 && printf "%5d " $(ls $F | wc) && printf "$F"; done; echo; }
Here is the result
39 41 754 ../matlab.sh
1 1 19 ./matlab.sh./micmac
1 1 14 ./micmac
My first question is: how can I write it smarter?
Second question: I would like the names printed before the counts, but I don't know how to tabulate the outputs, so I cannot do better than this:
.
./matlab.sh 1 1 19
./matlab.sh./micmac 1 1 19
./micmac 1 1 14
I don't see what that find $F -maxdepth 0 is supposed to accomplish, so I would just strip it.
Also, if a filename contains a %, you are in trouble if you use it as the format string to printf, so I'd add an explicit format string. And I combined the two printfs. To switch the columns (see also below for more on this topic), just switch the arguments and adapt the format string accordingly.
You should use double quotes around variable references ("$F" instead of $F) to avoid problems with filenames containing spaces or other special characters.
Then, if a file starts with spaces, your read would skip those, rendering the resulting variable useless. To avoid that, set IFS to an empty string for the time of the read.
To get only the number of directory entries, you should use option -l for wc to only count the lines (and not also words and characters).
Use option --sort=none to ls to speed up things by avoiding useless sorting.
Use option -b to ls to escape newline characters in file names and thus avoid breaking of counting.
Indent your code properly if you want others to read it.
This is the result:
wcr() {
find "$@" -type d | while IFS='' read F
do
printf "%5d %s\n" "$(ls --sort=none -b "$F" | wc -l)" "$F"
done
echo
}
I'd object to switching the columns. The potentially widest column should be at the end (in this case the path to the file). Otherwise you will have to live with unreadable output. But if you really want to do this, you'd have to do two passes: One to determine the longest entry and a second to format the output accordingly.
for i in $(find . -type d); do
printf '%-10s %s\n' "$(ls "$i" | wc -l)" "$i"
done
You probably could pre-process the output and use column to make some fancier output with whatever order, but since the path can get big, doing this is probably simpler.
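A rough sketch of that column idea (names first, counts after), assuming the column utility is available and with the same limitation that directory names containing spaces will break the unquoted find loop:
for i in $(find . -type d); do
printf '%s %s\n' "$i" "$(ls "$i" | wc -l)"
done | column -t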
