How to sort images by aspect ratio - shell

I want to sort images by aspect ratio and then browse them with mpv. I found this snippet via Google:
identify * |
gawk '{split($3,sizes,"x"); print $1,sizes[1]/sizes[2]}' |
sed 's/\[.\]//' | sort -gk 2
This is the output:
28.webp 0.698404
1.webp 0.699544
27.webp 0.706956
10.webp 0.707061
25.webp 0.707061
9.webp 0.707061
2.webp 0.707241
22.webp 1.41431
23.webp 1.41431
24.webp 1.41431
Then I adapted it to fit my needs:
identify * |
gawk '{split($3,sizes,"x"); print $1,sizes[1]/sizes[2]}' |
sed 's/\[.\]//' | sort -gk 2 |
gawk '{print $1}' |
mpv --no-resume-playback --really-quiet --playlist=-
It works, but it isn't perfect. It can't handle filenames that contain spaces, and identify is much slower than exiftool, especially with WebP files. exiftool also has a -r option, so I would like to use it to produce this output instead, but I don't know how to process the output of exiftool -r -s -ImageSize. Can anyone help?

Using exiftool you could use
exiftool -p '$filename ${ImageSize;m/(\d+)x(\d+)/;$_=$1/$2}' /path/to/files | sort -gk 2
This will format the output the same as your example and I assume the same sort command will work with that. If not, then the sort part would need editing.
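To cope with filenames that contain spaces, one option is to have exiftool print the width and height first and the path last, then sort on a computed ratio and keep only the path. The following is a minimal sketch, not a tested replacement for the command above: it assumes exiftool's -p format string and its ImageWidth, ImageHeight, Directory and FileName tags, and the function name sort_by_aspect is made up for illustration.

```shell
# sort_by_aspect (hypothetical name): reads "WIDTH HEIGHT PATH" lines on stdin
# and prints the PATHs ordered by WIDTH/HEIGHT, narrowest first.
sort_by_aspect() {
  LC_ALL=C awk '$2 > 0 {
    path = $0
    sub(/^[0-9]+ +[0-9]+ +/, "", path)   # drop the two numbers, keep spaces in the path
    printf "%.6f\t%s\n", $1 / $2, path
  }' |
    LC_ALL=C sort -n -k1,1 |
    cut -f2-
}

# Intended use (needs exiftool and mpv, so it is left commented out):
# exiftool -q -r -p '$ImageWidth $ImageHeight $Directory/$FileName' /path/to/files |
#   sort_by_aspect |
#   mpv --no-resume-playback --really-quiet --playlist=-
```

Because the ratio and the path are separated by a tab and the path comes last, spaces inside filenames survive; paths containing tabs or newlines would still break it.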

With identify you can print the filename and aspect ratio directly, without any extra calculation:
identify -format '%f %[fx:w/h]\n' *.jpg | sort -n -k2,2
file1.jpg 1
file2.jpg 1.46789
file6.jpg 1.50282
file5.jpg 1.52
file7.jpg 1.77778
file3.jpg 1.90476
Regarding the performance of identify vs exiftool: identify makes fewer system calls, but exiftool spends less time in them (strace -c reports time spent in system calls, not total run time):
strace -c identify -format '%f %[fx:w/h]\n' *.jpg 2>&1 | grep -E 'syscall|total'
% time seconds usecs/call calls errors syscall
100.00 0.001256 867 43 total
strace -c exiftool -r -s -ImageSize *.jpg 2>&1 | grep -E 'syscall|total'
% time seconds usecs/call calls errors syscall
100.00 0.000582 1138 311 total


How to extract "Create Date" in a faster way than with "identify"

I wrote a short, ugly script that lists photos together with the date and time each one was taken.
identify -verbose *.JPG | grep "Image:\|CreateDate:" | sed ':a;N;$!ba;s/JPG\n/JPG/g' | sed 's[^ ]* \([^ ]*\)[^0-9]*\(.*\)$/\1 \2/'
The output looks like
photo1.JPG 2018-11-28T16:11:44.06
photo2.JPG 2018-11-28T16:11:48.32
photo3.JPG 2018-11-28T16:13:23.01
It works pretty well, but my last folder had 3000 images and the script ran for a few hours before completing the task. This is mostly because identify is very slow. Does anyone have an alternative method? Preferably (but not exclusively) using native tools, because it's a server and it is not easy to convince the admin to install new tools.
Lose the grep and sed and such and use -format. This took about 10 seconds for 500 jpgs:
$ for i in *jpg ; do identify -format '%f %[date:create]\n' "$i" ; done
image1.jpg 2018-01-19T04:53:59+02:00
image2.jpg 2018-01-19T04:53:59+02:00
If you want to modify the output, put the command after the done to avoid forking a process after each image, like:
$ for i in *jpg ; do identify -format '%f %[date:create]\n' "$i" ; done | awk '{gsub(/+.*/,"",$NF)}1'
image1.jpg 2018-01-19T04:53:59
image2.jpg 2018-01-19T04:53:59
Native tools? identify is the best tool for this job (I would call ImageMagick a native tool), and I don't think you'll find a faster method. Run it on the 3000 images in parallel and you should get a speedup of roughly n times, where n is the number of parallel jobs your machine can sustain.
find . -maxdepth 1 -name '*.JPG' |
xargs -P0 -n1 -- sh -c "
identify -verbose \"\$1\" |
grep 'Image:\|CreateDate:' |
sed ':a;N;\$!ba;s/JPG\n/JPG/g' |
sed 's[^ ]* \([^ ]*\)[^0-9]*\(.*\)$/\1 \2/'
" --
Or you can just use a bash loop: for f in *.JPG; do ( identify -verbose "$f" | .... ) & done.
Your seds look strange and output "unmatched ]" on my platform. I don't know what they are supposed to do, but I think cut -d: -f2 | tr -d '\n' would suffice. Grepping for the image name is also strange, since you already know the image name...
find . -maxdepth 1 -name '*.JPG' |
xargs -P0 -n1 -- sh -c "
echo \"\$1 \$(
identify -verbose \"\$1\" |
grep 'CreateDate:' |
tr -d '[:space:]' |
cut -d: -f2-
)\"
" --
This will work for filenames without any spaces in them. I think that is fine for you: your output is space-separated, so you are already assuming your filenames contain no special characters.
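If filenames with spaces do matter, the same parallel idea can be made NUL-safe. The sketch below is an illustration under assumptions, not the answer's exact method: it relies on the -print0, -0, -P and -I extensions of find and xargs (available in the GNU and BSD versions), and the function name run_per_file and the job count of four are arbitrary choices.

```shell
# run_per_file DIR CMD [ARGS...]: run "CMD ARGS... FILE" for every *.JPG directly
# inside DIR, up to four at a time, and print "FILE<TAB>OUTPUT" for each one.
# The function name and the job count are illustrative choices.
run_per_file() {
  dir=$1
  shift
  find "$dir" -maxdepth 1 -name '*.JPG' -print0 |
    xargs -0 -P 4 -I{} sh -c '
      file=$1
      shift
      printf "%s\t%s\n" "$file" "$("$@" "$file")"
    ' sh {} "$@"
}

# Example (needs ImageMagick, so it is left commented out):
# run_per_file . identify -format '%[date:create]'
```

The lines arrive in whatever order the jobs finish, so pipe the result through sort if a stable order is needed.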
jhead is a small, fast, stand-alone utility. For example:
jhead ~/sample/images/iPhoneSample.JPG
Sample output:
File name    : /Users/mark/sample/images/iPhoneSample.JPG
File size    : 2219100 bytes
File date    : 2013:03:09 08:59:50
Camera make  : Apple
Camera model : iPhone 4
Date/Time    : 2013:03:09 08:59:50
Resolution   : 2592 x 1936
Flash used   : No
Focal length : 3.8mm (35mm equivalent: 35mm)
Exposure time: 0.0011 s (1/914)
Aperture     : f/2.8
ISO equiv.   : 80
Whitebalance : Auto
Metering Mode: pattern
Exposure     : program (auto)
GPS Latitude : N 20d 50.66m 0s
GPS Longitude: E 107d 5.46m 0s
GPS Altitude : 1.13m
JPEG Quality : 96
I did 5,000 iPhone images like this in 0.13s on a MacBook Pro:
jhead *jpg | awk '/^File name/{f=substr($0,16)} /^Date\/Time/{print f,substr($0,16)}'
In case you are unfamiliar with awk, that says "Look out for lines starting with File name and if you see one, save characters 16 onwards as f, the filename. Look out for lines starting with Date/Time and if you see any, print the last filename you remembered and the 16th character of the current line onwards".
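To see the awk logic in isolation, here is a small sketch that runs the same program on canned text instead of real jhead output. The sample lines and the function name extract_dates are made up; the only assumption about jhead is the column layout, with values starting at character 16.

```shell
# extract_dates (hypothetical name): reads jhead-style output on stdin and
# prints "FILENAME DATE TIME" for each image.
extract_dates() {
  awk '/^File name/ { f = substr($0, 16) }            # remember the current filename
       /^Date\/Time/ { print f, substr($0, 16) }'     # print it with the capture date
}

# Canned input in the assumed column layout (values start at column 16).
sample='File name    : /tmp/a.jpg
File size    : 123 bytes
Date/Time    : 2013:03:09 08:59:50
File name    : /tmp/b.jpg
Date/Time    : 2014:01:02 03:04:05'

printf '%s\n' "$sample" | extract_dates
```

Images without a Date/Time line simply produce no output line, because the print only fires when that line is seen.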

Piping stdout to two different commands

I've been working on this all day and more or less got it to run, but I still need some help polishing my command.
Situation: I am using bedtools, which takes two tab-delimited files containing genomic intervals (one per line) plus some additional data in further columns. More precisely, I am running the window function, which outputs, for each interval in the "a" file, all the intervals in the "b" file that fall inside the window defined by the -l and -r parameters. A more precise explanation can be found here.
An example of the function, taken from their website:
$ cat A.bed
chr1 1000 2000
$ cat B.bed
chr1 500 800
chr1 10000 20000
$ bedtools window -a A.bed -b B.bed -l 200 -r 20000
chr1 1000 2000 chr1 10000 20000
$ bedtools window -a A.bed -b B.bed -l 300 -r 20000
chr1 1000 2000 chr1 500 800
chr1 1000 2000 chr1 10000 20000
Question: I want to use that stdout to do several things in one shot:
1) Count the number of lines in the original stdout: wc -l
2) Cut columns 4-6: cut -f 4-6
3) Sort the lines and keep only those that are not repeated: sort | uniq -u
4) Save to a file: tee file.bed
5) Count the number of lines of the new stdout, again with wc -l
So I have managed to get it to work more or less with this:
windowBed -a ARS_saccer3.bed -b ./Peaks/WTappeaks_-Mit_sorted.bed -r 0 -l 10000 | tee >(wc -l) >(cut -f 7-13 | sort | uniq -u | tee ./Window/windowBed_UP10.bed | wc -l)
This kind of works, because I get the output file correctly and the values show on screen, but like this:
juan#juan-VirtualBox:~/Desktop/sf_Biolinux_sf/IGV/Collisions$ 448
The first number is the second wc -l, and I don't understand why it shows first. Also, after the second number the cursor just sits there waiting for input instead of a new prompt appearing, so I assume something in the command line as written remains unfinished.
This is probably something very basic, but I would be very grateful to anyone willing to explain a little more about it.
For anyone willing to offer solutions, bear in mind that I would like to keep this pipe in one line, without the need to run additional sh or anything else.
When you create a "forked pipeline" like this, bash has to run the two halves of the fork concurrently, otherwise where would it buffer the stdout for the other half of the fork? So it is essentially like running both subshells in the background, which explains why you get the results in an order you did not expect (due to the concurrency) and why the output is dumped unceremoniously on top of your command prompt.
You can avoid both of these problems by writing the two outputs to separate temporary files, waiting for everything to finish, and then concatenating the temporary files in the order you expect, like this:
windowBed -a ARS_saccer3.bed -b ./Peaks/WTappeaks_-Mit_sorted.bed -r 0 -l 10000 | tee >(wc -l >tmp1) >(cut -f 7-13 | sort | uniq -u | tee ./Window/windowBed_UP10.bed | wc -l >tmp2)
cat tmp1 tmp2
rm tmp1 tmp2
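If the temporary files are unwanted, another option is to keep everything in one ordinary pipeline, so the shell waits for every stage before showing the prompt. This is a minimal sketch of that alternative rather than the approach above: awk stands in for the first tee and wc -l, passing lines through unchanged and reporting the total on stderr. The function name, the sample data and the cut column are placeholders.

```shell
# count_and_filter OUTFILE (hypothetical name): reads tab-separated lines on stdin,
# prints the total line count on stderr, writes the non-repeated values of
# column 2 to OUTFILE and prints how many there are on stdout.
count_and_filter() {
  awk '{ print } END { print NR > "/dev/stderr" }' |
    cut -f 2 |
    sort |
    uniq -u |
    tee "$1" |
    wc -l
}

# Stand-in for the windowBed output: five lines, one value repeated.
out_file=$(mktemp)
err_file=$(mktemp)
unique=$(printf 'x\ta\nx\tb\nx\tb\nx\tc\nx\td\n' | count_and_filter "$out_file" 2>"$err_file")
total=$(cat "$err_file")
echo "total=$total unique=$((unique))"
```

With the sample data this reports a total of 5 and 3 non-repeated values. On a terminal both numbers are simply printed, the total first in practice, since awk reaches its END block before the downstream stages can see end of input.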

Scripting: get number of root files in RAR archive

I'm trying to write a bash script that determines whether a RAR archive has more than one root file.
The unrar command provides the following type of output if I run it with the v option:
[...#... dir]$ unrar v my_archive.rar
UNRAR 4.20 freeware Copyright (c) 1993-2012 Alexander Roshal
Archive my_archive.rar
Size Packed Ratio Date Time Attr CRC Meth Ver
2208411 2037283 92% 08-08-08 08:08 .....A. 00000000 m3g 2.9
103 103 100% 08-08-08 08:08 .....A. 00000000 m0g 2.9
9911403 9003011 90% 08-08-08 08:08 .....A. 00000000 m3g 2.9
3 12119917 11040397 91%
and since RAR is proprietary I'm guessing this output is as close as I'll get.
If I can get just the file list part (the lines between ------), and then perhaps filter out all even lines or lines beginning with multiple spaces, then I could do num_root_files=$(list of files | cut -d'/' -f1 | uniq | wc -l) and see whether [ $num_root_files -gt 1 ].
How do I do this? Or is there a saner approach?
I have searched for and found ways to grep text between two words, but then I'd have to include those "words" in the command, and doing that with entire lines of dashes is just too ugly. I haven't been able to find any solutions for "grep text between lines beginning with".
What I need this for is to decide whether to create a new directory or not before extracting RAR archives.
The unrar program does provide the x option to extract with full path and e for extracting everything to the current path, but I don't see how that could be useful in this case.
SOLUTION using the accepted answer:
num_root_files=$(unrar v "$file" | sed -n '/^----/,/^----/{/^----/!p}' | grep -v '^ ' | cut -d'/' -f1 | uniq | wc -l)
which seems to be the same as the shorter:
num_root_files=$(unrar v "$file" | sed -n '/^----/,/^----/{/^----/!p}' | grep -v '^ ' | grep -c '^ *[^/]*$')
OR using 7z as mentioned in a comment below:
num_root_files=$(7z l -slt "$file" | grep -c 'Path = [^/]*$')
# check if value is gt 2 rather than gt 1 - the archive itself is also listed
Oh no... I didn't have a man page for unrar so I looked one up online, which seems to have lacked some options that I just discovered with unrar --help. Here's the real solution:
unrar vb "$file" | grep -c '^[^/]*$'
I haven't been able to find any solutions for "grep text between lines beginning with".
In order to get the lines between ----, you can say:
unrar v my_archive.rar | sed -n '/^----/,/^----/{/^----/!p}'
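The range expression can be tried without an archive at hand. The sketch below runs it on a made-up listing that only roughly imitates the shape of unrar v output (the real layout differs between unrar versions), and then applies the slash-free line count from the question to a list of bare names. Both function names are hypothetical.

```shell
# between_dashes: print only the lines between the first pair of "----" rulers.
between_dashes() {
  sed -n '/^----/,/^----/{ /^----/!p; }'
}

# count_root_entries: count lines that contain no "/" (entries at the archive root).
count_root_entries() {
  grep -c '^[^/]*$'
}

# Made-up listing in the rough shape of "unrar v" output.
listing='Archive my_archive.rar
--------
 readme.txt
    103 103 100%
 docs/manual.txt
    2208411 2037283 92%
--------
 2 2208514 2037386 92%'

printf '%s\n' "$listing" | between_dashes

# Bare names, as a listing without the size lines would give them.
printf '%s\n' 'readme.txt' 'docs' 'docs/manual.txt' | count_root_entries
```

The sed address pair selects everything from the first ruler to the next one, and the inner /^----/!p prints every selected line except the rulers themselves.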

MPlayer modify Status Line

I am using MPlayer on Ubuntu 13.04 in a theatre (a musical theatre, not a cinema) to play a video on a second screen (a DLP projector) that shows nothing but the video. The first screen shows a terminal with some information about the video (e.g. the times when I have to start and stop it).
The MPlayer status line is also shown there (STATUSLINE: A: 7.9 V: 7.9 A-V: 0.000 ct: 0.040 0/ 0 20% 1% 0.4% 0 0). I know what the variables stand for, but I want it to display other values.
The ideal status line would be:
Playing | 1m3s / 6m7s | Remaining 5m4s | CPU:1%
Is there any way to change it? If it can only be changed in the source, it would help to know at least which file to look in.
I would try to start with something like
mplayer test.mp3 | stdbuf -o0 tr '[:cntrl:]' '\n' | stdbuf -oL grep A: | stdbuf -oL tr ':' 'm' | stdbuf -oL tr -d '[()]' | awk '{ print "Playing | " $3 "/" $5 " | Remaining tbd " }'
Buffering gave me a hard time too when I tried to read the status line automatically. The only way I found is to split it into single lines first by turning all the [:cntrl:] characters into newlines. Don't use sed or bbe; they do not work unbuffered, even with stdbuf.
Good luck.
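The "Remaining tbd" part could be filled in by letting awk do the arithmetic. This is only a sketch under an assumption: it expects status lines of the form "A: POSITION (..) of LENGTH (..) CPU%", which is what MPlayer appears to print for audio-only files; video status lines (like the STATUSLINE in the question) carry no total length, so they are ignored here. The function name format_status is made up, and the CPU field is left out because its position is not confirmed.

```shell
# format_status (hypothetical name): reads status lines on stdin, one per line,
# and prints "Playing | <pos> / <len> | Remaining <left>" for each line that
# matches the assumed "A: POS (..) of LEN (..)" layout.
format_status() {
  awk '
    # Convert seconds to a string such as 1m3s.
    function mmss(t) { t = int(t); return int(t / 60) "m" (t % 60) "s" }
    $1 == "A:" && $4 == "of" {
      pos = $2; len = $5
      printf "Playing | %s / %s | Remaining %s\n", mmss(pos), mmss(len), mmss(len - pos)
    }'
}

# Canned status line in the assumed layout.
printf 'A:  63.0 (01:03.0) of 367.0 (06:07.0)  1.0%%\n' | format_status
```

In a live pipeline the status line still has to be split into separate lines first, as in the command above, and buffering between the stages may delay updates.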

Grepping a huge file (80GB) any way to speed it up?

grep -i -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql
This has been running for an hour on a fairly powerful linux server which is otherwise not overloaded.
Is there an alternative to grep? Can my syntax be improved (would egrep or fgrep be better)?
The file is actually in a directory that is shared via a mount with another server, but the actual disk space is local, so that shouldn't make any difference?
The grep process is using up to 93% CPU.
Here are a few options:
1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.
2) Use fgrep because you're searching for a fixed string, not a regular expression.
3) Remove the -i option, if you don't need it.
So your command becomes:
LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql
It will also be faster if you copy your file to RAM disk.
If you have a multicore CPU, I would really recommend GNU parallel. To grep a big file in parallel use:
< eightygigsfile.sql parallel --pipe grep -i -C 5 'db_pd.Clients'
Depending on your disks and CPUs it may be faster to read larger blocks:
< eightygigsfile.sql parallel --pipe --block 10M grep -i -C 5 'db_pd.Clients'
It's not entirely clear from your question, but other options for grep include:
Dropping the -i flag.
Using the -F flag for a fixed string
Disabling NLS with LANG=C
Setting a max number of matches with the -m flag.
Some trivial improvements:
Remove the -i option if you can; case-insensitive matching is quite slow.
Replace the . with \.
An unescaped dot is the regex symbol for matching any character, which is also slow.
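The effect of the unescaped dot is easy to check on two sample lines. A minimal sketch with made-up data: the second line contains db_pdxClients, which the regex form accepts and the fixed-string form (-F) rejects.

```shell
# Two sample lines: one real match and one that only matches because "." is a wildcard.
sample=$(printf '%s\n' 'INSERT INTO db_pd.Clients VALUES (1);' 'INSERT INTO db_pdxClients VALUES (2);')

regex_hits=$(printf '%s\n' "$sample" | LC_ALL=C grep -c 'db_pd.Clients')    # "." matches any character
fixed_hits=$(printf '%s\n' "$sample" | LC_ALL=C grep -c -F 'db_pd.Clients') # literal string only
echo "regex=$regex_hits fixed=$fixed_hits"
```

So besides any speed difference, -F (or an escaped dot) also avoids false positives.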
Two lines of attack:
Are you sure you need the -i, or can you get rid of it?
Do you have more cores to play with? grep is single-threaded, so you might want to start more of them at different offsets.
< eightygigsfile.sql parallel -k -j120% -n10 -m grep -F -i -C 5 'db_pd.Clients'
If you need to search for multiple strings, grep -f strings.txt saves a ton of time. The above is a translation of something that I am currently testing. The -j and -n option values shown seemed to work best for my use case. The -F option for grep also made a big difference.
Try ripgrep
It typically searches much faster than grep.
All the above answers were great. What really helped me on my 111GB file was LC_ALL=C fgrep -m <maxnum> fixed_string filename.
However, sometimes there may be 0 or more repeating patterns, in which case calculating the maxnum isn't possible. The workaround is to use the start and end patterns for the event(s) you are trying to process, and then work on the line numbers between them. Like so:
startline=$(grep -n -m 1 "$start_pattern" file | cut -d: -f1)
endline=$(grep -n -m 1 "$end_pattern" file | cut -d: -f1)
logs=$(tail -n +"$startline" file | head -n "$((endline - startline + 1))")
Then work on this subset of logs!
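The three commands above read the file up to three times. A single-pass variant is sketched below as an alternative, with two differences that should be kept in mind: it treats both patterns as fixed strings rather than regular expressions, and it only looks for the end pattern at or after the start line. The function name between_patterns is made up.

```shell
# between_patterns START END FILE (hypothetical name): print from the first line
# containing the string START through the first line at or after it containing
# the string END, then stop reading the file.
between_patterns() {
  awk -v s="$1" -v e="$2" '
    !inside && index($0, s) { inside = 1 }
    inside { print }
    inside && index($0, e) { exit }
  ' "$3"
}

# Small demonstration on a temporary file.
log_file=$(mktemp)
printf '%s\n' 'noise' 'BEGIN job 7' 'working' 'END job 7' 'more noise' > "$log_file"
between_patterns 'BEGIN job' 'END job' "$log_file"
```

The exit matters on very large files, because awk stops reading as soon as the end line has been printed. Note that awk -v interprets backslash escapes in the values, so strings containing backslashes need extra care.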
Hmm, what speeds do you need? I created a synthetic 77.6 GB file with nearly 525 million rows and plenty of Unicode:
rows = 524759550. | UTF8 chars = 54008311367. | bytes = 83332269969.
I then randomly selected rows at an average rate of 1 in 3^5 (using rand(), not just NR % 243) and placed the string db_pd.Clients at a random position in the middle of the existing text, giving 2.16 million rows where the regex pattern hits:
rows = 2160088. | UTF8 chars = 42286394. | bytes = 42286394.
% dtp; pvE0 < testfile_gigantic_001.txt|
mawk2 '
_^(_<_)<NF { print (__=NR-(_+=(_^=_<_)+(++_)))<!_\
?_~_:__,++__+_+_ }' FS='db_pd[.]Clients' OFS=','
in0: 77.6GiB 0:00:59 [1.31GiB/s] [1.31GiB/s] [===>] 100%
out9: 40.3MiB 0:00:59 [ 699KiB/s] [ 699KiB/s] [ <=> ]
mawk2 took just 59 seconds to extract the list of row ranges needed. From there it should be relatively trivial. Some of the ranges may overlap.
At throughput rates of 1.3GiB/s, as seen above calculated by pv, it might even be detrimental to use utils like parallel to split the tasks.
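For readers who would rather not decode the one-liner, the general idea of emitting a row range around each hit can be written in plain awk. The sketch below is not a transcription of the command above and makes no claim to match its exact output or speed; the function name match_ranges and the default context of 5 rows (mirroring -A 5 -B 5 from the question) are assumptions.

```shell
# match_ranges STRING [CONTEXT] (hypothetical name): for every input line that
# contains the fixed string, print "FIRST,LAST" row numbers covering CONTEXT
# rows on either side (default 5). FIRST is clamped to 1.
match_ranges() {
  awk -v s="$1" -v c="${2:-5}" '
    index($0, s) {
      first = NR - c
      if (first < 1) first = 1
      print first "," (NR + c)
    }'
}

# Demonstration: 20 rows with hits on rows 3 and 12.
seq 1 20 | sed -e '3s/$/ db_pd.Clients/' -e '12s/$/ db_pd.Clients/' | match_ranges 'db_pd.Clients'
```

Each printed pair has the form accepted by sed -n 'FIRST,LASTp', so the ranges can be used to pull the surrounding rows out in a second pass; neighbouring ranges may overlap.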
