Bash script: get sublist of files

I store the SQL script for a particular release in a subdirectory of 'scripts' named after the release version, e.g.
...
./scripts/1.8.3/script-1.8.3.sql
./scripts/1.8.4/script-1.8.4.sql
./scripts/1.8.4.1/script-1.8.4.1.sql
./scripts/1.8.4.2/script-1.8.4.2.sql
./scripts/1.8.4.3/script-1.8.4.3.sql
./scripts/1.9.0/script-1.9.0.sql
./scripts/1.9.1/script-1.9.1.sql
./scripts/1.9.2/script-1.9.2.sql
./scripts/1.9.3/script-1.9.3.sql
./scripts/1.9.4/script-1.9.4.sql
./scripts/1.9.5/script-1.9.5.sql
./scripts/1.9.6/script-1.9.6.sql
./scripts/1.9.6.1/script-1.9.6.1.sql
...
In a bash script, I need to get all the SQL files that apply beyond a certain version number. For example, if this version number is 1.9.4, I would like to get the list:
./scripts/1.9.4/script-1.9.4.sql
./scripts/1.9.5/script-1.9.5.sql
./scripts/1.9.6/script-1.9.6.sql
./scripts/1.9.6.1/script-1.9.6.1.sql
...
I know I can get the entire list of files ordered by release via
all_files=$(find . -name '*.sql' | sort)
But I'm not sure how I can filter this list to get all files "on or after" a particular version.

One approach is to encode each version as a single comparable number with awk:
echo 1.2.3 | awk -F'.' '{ ver=1000000*$1 + 1000*$2 + $3; if (ver > 1002001) print $0 }'
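A sketch applying the same encoding to the original file list (assuming the ./scripts/X.Y.Z/... layout from the question; a fourth version component, as in 1.9.6.1, is ignored for the comparison):
find . -name '*.sql' | awk -F'/' -v min=1009004 '{
    split($3, v, ".")                       # $3 is the version directory, e.g. 1.9.4
    ver = 1000000*v[1] + 1000*v[2] + v[3]   # encode major.minor.patch as one number
    if (ver >= min) print                   # min is 1.9.4 encoded the same way
}' | sort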

Brute force (matching patterns with regexps):
find . -name "*.sql" | egrep -v "1\.[0-8]|1\.9\.[0-3]"
Nicer way with sed:
% find . -name "*.sql" | sort -r | sed '/1\.9\.4/ {q}'
...
./scripts/1.9.6/script-1.9.6.sql
./scripts/1.9.6.1/script-1.9.6.1.sql
./scripts/1.9.5/script-1.9.5.sql
./scripts/1.9.4/script-1.9.4.sql
Explanation: sort in reverse, then use sed to stop processing the input the instant the version (1.9.4) is matched.

A twist, if the files happen to be created in version order, would be to select by modification time:
find . -name \*.sql -newer ./scripts/$VERSION/script-$VERSION.sql -print

Here's a generalized version based on Quassnoi's.
Assume
$ ls -1 releases/
program-0.0.1.app
program-0.0.10.app
program-0.0.9.app
program-0.3.1.app
program-3.3.1.app
program-3.30.1.app
program-3.9.1.app
(Notice 0.0.10 comes before 0.0.9, which is wrong). However, if we perform some "version math" we get the correct order:
$ ls -1 releases/ |
    sed 's/\(program-\)\(.*\)\(\.app\)/\2 \1\2\3/g' |   # prefix each line with the bare version
    awk -F '.' '{ ver=1000000*$1 + 1000*$2 + $3; printf "%010d %s\n", ver, $0 }' |   # zero-padded numeric sort key
    sort | awk '{print $3}'                             # sort on the key, then print only the filename
program-0.0.1.app
program-0.0.9.app
program-0.0.10.app
program-0.3.1.app
program-3.3.1.app
program-3.9.1.app
program-3.30.1.app
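If your sort comes from GNU coreutils, its -V (version sort) option gives the same ordering without the arithmetic:
$ ls -1 releases/ | sort -V
The original question's file list could likewise be piped through sort -V instead of plain sort.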

Related

Script for printing out file names and their number of appearance starting from a given folder

I need to write a shell script which, given a folder name as an argument, prints the names of the folders and files in it and how many times each name appears in the given folder.
Edit: I need to check only the names, without taking the file extensions into consideration.
#!/bin/bash
folder="$1"
for f in "$folder"
do
echo "$f"
done
And I would expect to see something like this (if I have 3 files with the same name and different extensions, like x.html, x.css, x.sh, and so on, in a directory called dir):
x
3 times
after executing the script with dir (the name of the directory) as a parameter.
The find command already does most of this for you.
find . -printf "%f\n" |
sort | uniq -c
This will not work correctly if you have files whose names contain a newline.
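A NUL-safe variant is possible with GNU tools, whose sort and uniq both accept NUL-separated input via -z (a sketch):
find . -printf "%f\0" | sort -z | uniq -zc | tr '\0' '\n'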
If your find doesn't support -printf, maybe try
find . -exec basename {} \; |
sort | uniq -c
To restrict to just file names or directory names, add -type f or -type d, respectively, before the action (-exec or -printf).
If you genuinely want to remove extensions, try
find .... whatever ... |
sed 's%\.[^./]*$%%' |
sort | uniq -c
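Putting those pieces together to get the exact output format from the question (a sketch; it assumes file names without whitespace):
find . -type f -printf "%f\n" |
sed 's%\.[^./]*$%%' |
sort | uniq -c |
awk '{ print $2; print $1 " times" }'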
You can try this:
#!/bin/bash
IFS=$'\n' array=($(ls))
iter=0
for file in "${array[@]}"; do
filename=$(basename -- "$file")
extension="${filename##*.}"
filename="${filename%.*}"
filenamearray[$iter]="$filename"
iter=$((iter+1))
done
for filename in "${filenamearray[@]}"; do
echo "$filename"
grep -o "$filename" <<< "${filenamearray[*]}" | wc -l
done
You can try with find and awk:
find . -type f -print0 |
awk '
BEGIN {
    FS = "/"                # fields are path components, so $NF is the file name
    RS = "\0"               # NUL-separated records (requires gawk), safe for odd names
}
{
    k = split( $NF , b , "." )              # split the file name on dots
    if ( k > 1 )
        sub( "\\." b[k] "$" , "" , $NF )    # strip the final ".extension"
    a[$NF]++                                # count each base name
}
END {
    for ( i in a ) {
        j = a[i] > 1 ? "s" : ""
        print i
        print a[i] " time" j
    }
}'
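For the asker's example (x.html, x.css, x.sh in one directory) this prints:
x
3 times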

bash script to list duplicate hash files [duplicate]

I want to create a bash script that searches a given directory for pictures to copy. The pictures have to have the name format IMG_\d\d\d\d.JPG. If a picture has a duplicate filename, copy it to /images/archives and append .JPG to the end of its name, so duplicates end in .JPG.JPG. There are also duplicate pictures, so I want to hash each picture and check whether it is a duplicate. If it is, do not copy it into /archives but store its file path in a file called output.txt.
I am struggling to get the duplicate hashes to display alongside the filenames. This is what I have so far:
if [ -d "$1" ]
then echo "using directory $1 as source"
else echo "Sorry, not a valid drive"
exit
fi
if [ -d "$2" ]
then echo "$2 target location already exists"
else mkdir -p "$2"
fi
cd "$1" || exit
myList=$(find . -mindepth 1 -type f -name "*MG_[0-9][0-9][0-9][0-9].JPG")
echo "$myList"
ImagesToCopy=$(find . -mindepth 1 -type f -name "*MG_[0-9][0-9][0-9][0-9].JPG" -exec md5sum {} \; | cut -f1 -d" " | sort | uniq)
echo "$ImagesToCopy"
This gives me a list of the files I need to copy and their hashes. On the command line, if I type the command:
# find . -mindepth 1 -type f -name "*MG_[0-9][0-9][0-9][0-9].JPG" -exec md5sum {} \; | sort | cut -f1 -d" "| uniq -d
I receive the results:
266ab54fd8a6dbc7ba61a0ee526763e5
88761da2c2a0e57d8aab5327a1bb82a9
cc640e50f69020dd5d2d4600e20524ac
This is the list of duplicate hashes for files I do not want to copy, but I also want to display the file paths and filenames alongside the hashes, like this:
# find . -mindepth 1 -type f -name "*MG_[0-9][0-9][0-9][0-9].JPG" -exec md5sum {} \; | sort -k1 | uniq -u
043007387f39f19b3418fcba67b8efda ./IMG_1597.JPG
05f0c10c49983f8cde37d65ee5790a9f ./images/IMG_2012/IMG_2102.JPG
077c22bed5e0d0fba9e666064105dc72 ./DCIM/IMG_0042.JPG
1a2764a21238aaa1e28ea6325cbf00c2 ./images/IMG_2012/IMG_1403.JPG
1e343279cd05e8dbf371331314e3a2f6 ./images/IMG_1959.JPG
2226e652bf5e3ca3fbc63f3ac169c58b ./images/IMG_0058.JPG
266ab54fd8a6dbc7ba61a0ee526763e5 ./images/IMG_0079.JPG
266ab54fd8a6dbc7ba61a0ee526763e5 ./images/IMG_2012/IMG_0079.JPG
2816dbcff1caf70aecdbeb934897fd6e ./images/IMG_1233.JPG
451110cc2aff1531e64f441d253b7fec ./DCIM/103canon/IMG_0039.JPG
45a00293c0837f10e9ec2bfd96edde9f ./DCIM/103canon/IMG_0097.JPG
486f9dd9ee20ba201f0fd9a23c8e7289 ./images/IMG_2013/IMG_0060.JPG
4c2054c57a2ca71d65f92caf49721b4e ./DCIM/IMG_1810.JPG
53313e144725be3993b1d208c7064ef6 ./IMG_2288.JPG
5ac56dcddd7e0fd464f9b243213770f5 ./images/IMG_2012/favs/IMG_0039.JPG
65b15ebd20655fae29f0d2cf98588fc3 ./DCIM/IMG_2564.JPG
88761da2c2a0e57d8aab5327a1bb82a9 ./images/IMG_2012/favs/IMG_1729.JPG
88761da2c2a0e57d8aab5327a1bb82a9 ./images/IMG_2013/IMG_1729.JPG
8fc75b0dd2806d5b4b2545aa89618eb6 ./DCIM/103canon/IMG_2317.JPG
971f0a4a064bb1a2517af6c058dc3eb3 ./images/IMG_2012/favs/IMG_2317.JPG
aad617065e46f97d97bd79d72708ec10 ./images/IMG_2013/IMG_1311.JPG
c937509b5deaaee62db0bf137bc77366 ./DCIM/IMG_1152.JPG
cc640e50f69020dd5d2d4600e20524ac ./images/IMG_2012/favs/IMG_2013.JPG
cc640e50f69020dd5d2d4600e20524ac ./images/IMG_2013/IMG_2013.JPG
d8edfcc3f9f322ae5193e14b5f645368 ./images/IMG_2012/favs/IMG_1060.JPG
dcc1da7daeb8507f798e4017149356c5 ./DCIM/103canon/IMG_1600.JPG
ded2f32c88796f40f080907d7402eb44 ./IMG_0085.JPG
Thanks in advance.
Let's suppose that you have the results of md5sum. For example:
$ cat file
266ab54fd8a6dbc7ba61a0ee526763e5 /path/to/file1a
88761da2c2a0e57d8aab5327a1bb82a9 /path/to/file2a
266ab54fd8a6dbc7ba61a0ee526763e5 /path/to/file1b
cc640e50f69020dd5d2d4600e20524ac /path/to/file3
88761da2c2a0e57d8aab5327a1bb82a9 /path/to/file2b
To remove duplicates from the list, use awk:
$ awk '!($1 in a){a[$1]; print}' file
266ab54fd8a6dbc7ba61a0ee526763e5 /path/to/file1a
88761da2c2a0e57d8aab5327a1bb82a9 /path/to/file2a
cc640e50f69020dd5d2d4600e20524ac /path/to/file3
This uses the array a to keep track of which md5 sums we have seen so far. For each line, if the md5 has not appeared before, !($1 in a), we mark that md5 as having been seen and print the line.
Alternative
A shorter version of the code is:
$ awk '!a[$1]++' file
266ab54fd8a6dbc7ba61a0ee526763e5 /path/to/file1a
88761da2c2a0e57d8aab5327a1bb82a9 /path/to/file2a
cc640e50f69020dd5d2d4600e20524ac /path/to/file3
This uses array a to count the number of times that md5sum $1 has appeared. If the count is initially zero, then the line is printed.
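To answer the original question (show only the duplicated hashes together with their paths), a two-pass awk over the same input works; file again stands for the saved md5sum output:
$ awk 'NR==FNR { seen[$1]++; next } seen[$1] > 1' file file
266ab54fd8a6dbc7ba61a0ee526763e5 /path/to/file1a
88761da2c2a0e57d8aab5327a1bb82a9 /path/to/file2a
266ab54fd8a6dbc7ba61a0ee526763e5 /path/to/file1b
88761da2c2a0e57d8aab5327a1bb82a9 /path/to/file2b
The first pass counts each hash; the second prints every line whose hash occurs more than once. With GNU uniq, sort file | uniq -w32 -D does the same by comparing only the first 32 characters (the length of an md5 hash).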

Search for numbers in files and replace the match with "result -1"

I want to subtract 1 from every number before the character '=' in a list of files. For example, a string in a file such as "sometext.10.moretext" should become "sometext.9.moretext".
I thought this might work:
grep -rl "[0-9]*" ./TEST | xargs sed -i "s/[0-9]*/&-1/g"
But it merely adds "-1" as a string after my numbers, so the result is "sometext.10-1.moretext". I'm not really experienced with bash (and am using it via Windows); is there a way to do this? PowerShell would also be an option.
edit
Input: some.text.10.text=some.other.text.10
Desired Output: some.text.9.text=some.other.text.10
Note: The actual number can be something from 1 to 9999.
File names have the following pattern: text#name#othername.config
You can use awk to achieve this. It assumes the fields are delimited by . and that a digit field never appears immediately before =; otherwise this will fail, since only . is used as the delimiter:
awk 'BEGIN{FS=OFS="."} { for(i=1; i<=NF; i++) { if($i~"=")break; if($i~/^[0-9]+$/){$i=$i-1} }}1'
Input :
sometext.10.moretext=meow.10.meow
Output:
sometext.9.moretext=meow.10.meow
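To apply that in place across the files under TEST (a sketch assuming GNU awk 4.1+ for its -i inplace extension):
find ./TEST -type f -print0 |
xargs -0 gawk -i inplace 'BEGIN{FS=OFS="."} { for(i=1; i<=NF; i++) { if($i~"=")break; if($i~/^[0-9]+$/){$i=$i-1} }}1'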
PowerShell
Get-ChildItem -Recurse | # List files
Select-String -Pattern '[0-9]' -List | # 'grep' files with numbers
Foreach-Object { # loop over those files
(Get-Content -LiteralPath $_.Path) | # read their content
ForEach-Object { # loop over the lines, do regex replace
[regex]::Replace($_, '[0-9]+', {param($match) ([int]$match.Value) - 1})
} | Set-Content -Path $_.Path -Encoding ASCII # output the lines
}
or in short form
gci -r|sls [0-9] -lis|% {(gc -lit $_.Path)|%{
[regex]::Replace($_,'[0-9]+',{"$args"-1})
}|sc $_.Path -enc ascii}
You can try (the e modifier, which executes the pattern space as a shell command, is a GNU sed extension):
echo "sometext.10.moretext=meow.10.meow" |
sed -r 's/([^0-9]*)([0-9]*)(.*)/echo "\1$((\2-1))\3"/e'
Or changing files under TEST (see EDIT)
sed -ri 's/([^0-9]*)([0-9]*)(.*)/echo "\1$((\2-1))\3"/e' $(find ./TEST -type f)
EDIT
The find command will cause problems when filenames with spaces or newlines are encountered. So you should change the approach:
(Do not use grep -rlz "[0-9]*" ./TEST, that failed earlier)
find TEST -type f -print0 | xargs -0 sed -ri 's/([^0-9]*)([0-9]+)(.*)/echo "\1$((\2-1))\3"/e'
A quick awk demonstration for the literal example (note the 10 is hard-coded, and sub() replaces only the first match):
echo sometext.10.moretext=meow.10.meow| awk '{sub(/10/,"9")}1'
sometext.9.moretext=meow.10.meow

find only the first file from many directories

I have a lot of directories:
13R
613
AB1
ACT
AMB
ANI
Each directory contains a lot of files:
20140828.13R.file.csv.gz
20140829.13R.file.csv.gz
20140830.13R.file.csv.gz
20140831.13R.file.csv.gz
20140901.13R.file.csv.gz
20131114.613.file.csv.gz
20131115.613.file.csv.gz
20131116.613.file.csv.gz
20131117.613.file.csv.gz
20141114.ab1.file.csv.gz
20141115.ab1.file.csv.gz
20141116.ab1.file.csv.gz
20141117.ab1.file.csv.gz
etc..
The goal is to get the first file from each directory.
The result I expect is:
13R|20140828
613|20131114
AB1|20141114
That is, the name of the directory, a pipe, and the date from the filename.
I guess I need find and head plus awk, but I can't make it work; I need your help.
Here is what I have tested:
for f in $(ls -1);do ls -1 $f/ | head -1;done
But the folder name is missing.
To be clear, by "first file" I mean the first file returned in alphabetical order within the folder.
Thanks.
You can do this with a Bash loop.
Given:
/tmp/test
/tmp/test/dir_1
/tmp/test/dir_1/file_1
/tmp/test/dir_1/file_2
/tmp/test/dir_1/file_3
/tmp/test/dir_2
/tmp/test/dir_2/file_1
/tmp/test/dir_2/file_2
/tmp/test/dir_2/file_3
/tmp/test/dir_3
/tmp/test/dir_3/file_1
/tmp/test/dir_3/file_2
/tmp/test/dir_3/file_3
/tmp/test/file_1
/tmp/test/file_2
/tmp/test/file_3
Just loop through the directories and form an array from a glob and grab the first one:
prefix="/tmp/test"
cd "$prefix"
for fn in dir_*; do
cd "$prefix"/"$fn"
arr=(*)
echo "$fn|${arr[0]}"
done
Prints:
dir_1|file_1
dir_2|file_1
dir_3|file_1
If your definition of 'first' is different from Bash's, just sort the array arr according to your definition before taking the first element.
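For the exact dir|date format asked for, a variation of the same idea might look like this (a sketch assuming the question's layout, with the date as the first dot-separated field of each file name):
for d in */; do
    files=( "$d"* )                          # glob expands in alphabetical order
    first=${files[0]##*/}                    # first file, directory prefix stripped
    printf '%s|%s\n' "${d%/}" "${first%%.*}"
done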
You can also do this with find and awk:
$ find /tmp/test -mindepth 2 -print0 | awk -v RS="\0" '{s=$0; sub(/[^/]+$/,"",s); if (s in paths) next; paths[s]; print $0}'
/tmp/test/dir_1/file_1
/tmp/test/dir_2/file_1
/tmp/test/dir_3/file_1
And insert a sort (or use gawk's built-in sorting) to order the output as desired; note that RS="\0" itself already requires gawk.
sort has a unique option, -u. Only the directory should be unique, so sort on the first /-separated field with -k1,1. This works because the glob expansion in printf "%s\n" */* is already sorted.
printf "%s\n" */* | sort -k1,1 -t/ -u | sed 's#\(.*\)/\([0-9]*\).*#\1|\2#'
You will need to change the sed command if the date field can be followed by another number.
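Traced on the question's data (assuming GNU sort, which here keeps the first line of each equal key):
$ printf "%s\n" */* | sort -k1,1 -t/ -u | sed 's#\(.*\)/\([0-9]*\).*#\1|\2#'
13R|20140828
613|20131114
AB1|20141114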
This works for me:
for dir in $(find "$FOLDER" -type d); do
    FILE=$(ls -1 -p "$dir" | grep -v / | head -n1)   # -p marks directories with a trailing /, grep -v / drops them
    if [ ! -z "$FILE" ]; then
        echo "$dir/$FILE"
    fi
done

If xargs is map, what is filter?

I think of xargs as the map function of the UNIX shell. What is the filter function?
EDIT: it looks like I'll have to be a bit more explicit.
Let's say I have to hand a program which accepts a single string as a parameter and returns with an exit code of 0 or 1. This program will act as a predicate over the strings that it accepts.
For example, I might decide to interpret the string parameter as a filepath, and define the predicate to be "does this file exist". In this case, the program could be test -f, which, given a string, exits with 0 if the file exists, and 1 otherwise.
I also have to hand a stream of strings. For example, I might have a file ~/paths containing
/etc/apache2/apache2.conf
/foo/bar/baz
/etc/hosts
Now, I want to create a new file, ~/existing_paths, containing only those paths that exist on my filesystem. In my case, that would be
/etc/apache2/apache2.conf
/etc/hosts
I want to do this by reading in the ~/paths file, filtering those lines by the predicate test -f, and writing the output to ~/existing_paths. By analogy with xargs, this would look like:
cat ~/paths | xfilter test -f > ~/existing_paths
It is the hypothesized program xfilter that I am looking for:
xfilter COMMAND [ARG]...
Which, for each line L of its standard input, will call COMMAND [ARG]... L, and if the exit code is 0, it prints L, else it prints nothing.
To be clear, I am not looking for:
a way to filter a list of filepaths by existence. That was a specific example.
how to write such a program. I can do that.
I am looking for either:
a pre-existing implementation, like xargs, or
a clear explanation of why this doesn't exist
If map is xargs, filter is... still xargs.
Example: list files in the current directory and filter out non-executable files:
ls | xargs -I{} sh -c "test -x '{}' && echo '{}'"
This could be made handy through a (non-production-ready) function:
xfilter() {
xargs -I{} sh -c "$* '{}' && echo '{}'"
}
ls | xfilter test -x
Alternatively, you could use a parallel filter implementation via GNU Parallel:
ls | parallel "test -x '{}' && echo '{}'"
So, you're looking for:
reduce( compare( filter( map( list() ) ) ) )
which can be rewritten as
list | map | filter | compare | reduce
The main power of bash is pipelining, so there is no need for a special filter and/or reduce command. In fact nearly all unix commands can act in one (or more) of these roles:
list
map
filter
reduce
Imagine:
find mydir -type f -print | xargs grep -H '^[0-9]*$' | cut -d: -f 2 | sort -nr | head -1
^------list+filter------^ ^--------map-----------^ ^--filter--^ ^compare^ ^reduce^
Creating a test case:
mkdir ./testcase
cd ./testcase || exit 1
for i in {1..10}
do
strings -1 < /dev/random | head -1000 > file.$i.txt
done
mkdir emptydir
You will get a directory named testcase containing 10 files and one directory:
emptydir file.1.txt file.10.txt file.2.txt file.3.txt file.4.txt file.5.txt file.6.txt file.7.txt file.8.txt file.9.txt
Each file contains 1000 lines of random strings; some lines contain only numbers.
now run the command
find testcase -type f -print | xargs grep -H '^[0-9]*$' | cut -d: -f 2 | sort -nr | head -1
and you will get the largest number-only line across all the files, e.g. 42. (Of course, this can be done more efficiently; this is only a demo.)
Decomposed:
The find testcase -type f -print prints every plain file, so: LIST (already reduced to files only). Output:
testcase/file.1.txt
testcase/file.10.txt
testcase/file.2.txt
testcase/file.3.txt
testcase/file.4.txt
testcase/file.5.txt
testcase/file.6.txt
testcase/file.7.txt
testcase/file.8.txt
testcase/file.9.txt
The xargs grep -H '^[0-9]*$', as MAP, runs a grep command for each file in the list. grep is usually used as a filter, e.g. command | grep, but here (with xargs) it maps the input (filenames) to lines containing only digits. Output, many lines like:
testcase/file.1.txt:1
testcase/file.1.txt:8
....
testcase/file.9.txt:4
testcase/file.9.txt:5
The structure of those lines is filename, colon, number. We want only the numbers, so we call a pure filter, cut -d: -f2, which strips the filenames from each line. It outputs many lines like:
1
8
...
4
5
Now the reduce (getting the largest number): sort -nr sorts all the numbers numerically in reverse (descending) order, so its output looks like:
42
18
9
9
...
0
0
and the head -1 prints the first line (the largest number).
Of course, you can write your own list/filter/map/reduce functions directly with bash programming constructs (loops, conditions and such), or you can employ a full-blown scripting language like perl, or special-purpose languages like awk, sed, or dc (RPN).
Having a special filter command such as:
list | filter_command cut -d: -f 2
is simply not needed, because you can directly use:
list | cut
You can have awk do the filter and reduce functions.
Filter (keep only the even-numbered lines):
awk 'NR % 2 == 0'
Reduce:
awk '{ p = p + $0 } END { print p }'
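For example, summing a stream with that reduce (input generated by seq):
$ seq 1 10 | awk '{ p = p + $0 } END { print p }'
55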
As a long-time functional programmer, I totally understand your question, and here is the answer: Bash/unix command pipelining isn't as clean as you'd hoped.
In the example above:
find mydir -type f -print | xargs grep -H '^[0-9]*$' | cut -d: -f 2 | sort -nr | head -1
^------list+filter------^ ^--------map-----------^ ^--filter--^ ^compare^ ^reduce^
a more pure form would look like:
find mydir | xargs -L 1 bash -c 'test -f $1 && echo $1' _ | xargs grep -H '^[0-9]*$' | cut -d: -f 2 | sort -nr | head -1
^---list--^^-------filter---------------------------------^^-----------map--------^^----map----^ ^reduce^
But, for example, grep also has a filtering capability: grep -q mypattern simply returns 0 if the input matches the pattern.
To get something closer to what you want, you would have to define a filter bash function and export it so that it can be called from xargs.
But then you run into some problems. For example, test has binary and unary operators. How will your filter function handle this? And what would you decide to output on true in those cases? Not insurmountable, but weird. Assuming only unary operations:
filter(){
    while read -r LINE || [[ -n "${LINE}" ]]; do
        # note: operators like > inside [[ ]] compare lexicographically,
        # which is why 10 is absent from the output below; use -gt for numbers
        eval "[[ ${LINE} $1 ]]" 2> /dev/null && echo "$LINE"
    done
}
so you could do something like
seq 1 10 | filter "> 4"
5
6
7
8
9
As I wrote this, I kinda liked it.
