How to grep a string within a file and index with the file name

I have a few thousand files each within their own folder and I want to link a specific string ID with the file name.
So for example here I have a couple of different folders :
Folder_1
file_abc.txt
Sample-001-abc
Folder_2
file_efg.txt
Sample-002-efg
Folder_3
file_hig.txt
Sample-003-hig
and I would like to grep based on the string "Sample" and link that string to the filename each is located in, to ensure I have the correct sample/filename order. Essentially, I would like to output a separate file that looks like this:
Filename_sample_linked.txt
file_abc.txt Sample-001-abc
file_efg.txt Sample-002-efg
file_hig.txt Sample-003-hig
...etc
I can grab all the IDs with the following:
find dir/*.txt -type f -exec grep -H 'Sample' {} + >> List_of_samples.txt
but am not quite sure how to link that with the file name that the string came from. Thanks in advance for your help!

Maybe sed.
If you actually meant for that filename to be line 1:
$: sed -E '1{s/.*/Filename_sample_linked.txt/;p;d}; /[.]txt/{N;s/\n/ /}; /^$/{N;d}' file
Filename_sample_linked.txt
file_abc.txt Sample-001-abc
file_efg.txt Sample-002-efg
file_hig.txt Sample-003-hig
Otherwise, replace 1{s/.*/Filename_sample_linked.txt/;p;d}; with just 1d; maybe.
I will assume you did NOT mean for the last line to literally be ...etc.
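If the Sample IDs are lines inside each .txt file, as the grep attempt in the question suggests, a grep-based variant is also possible. This is only a sketch: link_samples is a hypothetical helper name, and it assumes that no path contains a colon and that each ID sits on its own line.

```shell
# Hypothetical helper: print "<file name> <matching line>" for every .txt file
# below a root directory. Assumes paths contain no ':' characters.
link_samples() {
    find "${1:-.}" -type f -name '*.txt' -exec grep -H 'Sample' {} + |
        awk -F: '{
            n = split($1, parts, "/")                   # parts[n] is the bare file name
            print parts[n], substr($0, length($1) + 2)  # the text after the path and colon
        }' |
        sort
}
```

It could be called as link_samples dir > Filename_sample_linked.txt, where dir is the top-level folder.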

TL;DR:
generate a script via "meta programming"
run it
save the output
Sometimes, if you can solve a problem by doing simple steps repetitively, then all you need to do is figure out the steps and repeat them. The advantage of this approach is that it's not very sophisticated and relatively easy to do.
go to the folder
list the two files side by side
save that to a list
In shell-speak, that could look like this:
cd Folder_1; echo file_* Sample*; cd ..
cd Folder_3; echo file_* Sample*; cd ..
I can show you how to do this many times, but first let's have some fun.
setup.sh
#!/bin/bash
md () { mkdir -p "$1"; cd "$1"; pwd; }
ran3() { tr -dc '[[:alnum:]]' </dev/urandom | dd bs=${1:-3} count=1 status=none ; }
w80() { echo $(( 20 + 80 * RANDOM / 65536 )); }
w20() { echo $(( 3 + 20 * RANDOM / 65536 )); }
# Initial setup
md Folder_1
touch file_abc.txt
touch Sample-001-abc
cd ..
md Folder_2
touch file_efg.txt
touch Sample-002-efg
cd ..
md Folder_3
touch file_hig.txt
touch Sample-003-hig
cd ..
# More folders with 3-char comparison patterns
s=0
s=$( find -name Sample\* | wc -l )
(( s++ ))
max=$(( s + 20 ))
while [[ $s -lt $max ]]; do
md Folder_$s
str=`ran3`
printf -v ss '%03d' $s
touch file_$str.txt
touch Sample-$ss-$str
(( s++ ))
cd ..
done
# More folders with random names
max=$(( max + 20 ))
while [[ $s -lt $max ]]; do
fstr=`ran3 $(w80)`
md ${fstr}_$s
str=`ran3 $(w20)`
printf -v ss '%03d' $s
touch file_$str.txt
touch Sample-$ss-$str
(( s++ ))
cd ..
done
I ran that a few times and now I have:
./Folder_143/file_nhE.txt
./Folder_143/Sample-143-nhE
./yHb9aSWkRDXqQfITkrqplQQMH[cNdMTYt_144/file_W9eDdDIEN.txt
./yHb9aSWkRDXqQfITkrqplQQMH[cNdMTYt_144/Sample-144-W9eDdDIEN
./ClauM57QCXPCqLPBHUMERI6Vxc_145/file_vhTSZfTh.txt
./ClauM57QCXPCqLPBHUMERI6Vxc_145/Sample-145-vhTSZfTh
./kpXndK8eapnUJKf9XvFZgnY31kUVNmUkHDp1ey[Q3IsY53EOQ_146/file_b[rj3ft.txt
./kpXndK8eapnUJKf9XvFZgnY31kUVNmUkHDp1ey[Q3IsY53EOQ_146/Sample-146-b[rj3ft
./mx]Yrj5eEa4PlEL2snmYOttZwc]Vi4rSCJ_147/file_]OyWwrVd0RN.txt
./mx]Yrj5eEa4PlEL2snmYOttZwc]Vi4rSCJ_147/Sample-147-]OyWwrVd0RN
./Zpi1sip87UmM85gd[dh]9sQn5ZE5rjBGA[[9ae_148/file_uQPZU[Tn[.txt
step 1 (generate the meta script)
find -name Sample\* -type f \
-printf "cd $PWD"';s="%f"; cd %h ; echo file_${s##*-}.txt %f\n'
Here we use find to make a bunch of commands for us, that we can run. Mixing quotes like "ABC"'123' into a single string ABC123 is just a clever way to interpolate some variables, while passing others to -printf for later.
"cd $PWD"
Go to the base folder before each sub-folder.
';s="%f"; cd %h ;
End the cd-pwd and assign the basename to s.
Go to the dirname of each found Sample* file.
echo file_${s##*-}.txt %f\n'
Print the two files side by side.
${s##GLOB} removes the longest match of GLOB from the start of the string s. Here that strips everything up to and including the last dash.
The %f and %h macros are available when using find -printf, if that was unclear.
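A small sketch of the two tricks described above: adjacent quoted strings joining into one word, and ${s##*-} stripping everything through the last dash. The values of base and s are made-up examples, not taken from a real run.

```shell
base=/tmp/example            # hypothetical base folder
# The "..." part expands $base right away; the '...' part keeps ${s##*-} literal.
fmt="cd $base"';echo ${s##*-}'
echo "$fmt"

s=Sample-001-abc             # example name from the question
suffix=${s##*-}              # longest match of *- removed from the front
echo "file_${suffix}.txt $s"
```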
cd /home/jaroslav/tmp/so-link;s="Sample-138-Lj]"; cd ./Folder_138 ; echo file_${s##*-}.txt Sample-138-Lj]
cd /home/jaroslav/tmp/so-link;s="Sample-139-pad"; cd ./Folder_139 ; echo file_${s##*-}.txt Sample-139-pad
cd /home/jaroslav/tmp/so-link;s="Sample-140-ImN"; cd ./Folder_140 ; echo file_${s##*-}.txt Sample-140-ImN
cd /home/jaroslav/tmp/so-link;s="Sample-141-nxr"; cd ./Folder_141 ; echo file_${s##*-}.txt Sample-141-nxr
cd /home/jaroslav/tmp/so-link;s="Sample-142-4Di"; cd ./Folder_142 ; echo file_${s##*-}.txt Sample-142-4Di
cd /home/jaroslav/tmp/so-link;s="Sample-143-nhE"; cd ./Folder_143 ; echo file_${s##*-}.txt Sample-143-nhE
cd /home/jaroslav/tmp/so-link;s="Sample-144-W9eDdDIEN"; cd ./yHb9aSWkRDXqQfITkrqplQQMH[cNdMTYt_144 ; echo file_${s##*-}.txt Sample-144-W9eDdDIEN
cd /home/jaroslav/tmp/so-link;s="Sample-145-vhTSZfTh"; cd ./ClauM57QCXPCqLPBHUMERI6Vxc_145 ; echo file_${s##*-}.txt Sample-145-vhTSZfTh
cd /home/jaroslav/tmp/so-link;s="Sample-146-b[rj3ft"; cd ./kpXndK8eapnUJKf9XvFZgnY31kUVNmUkHDp1ey[Q3IsY53EOQ_146 ; echo file_${s##*-}.txt Sample-146-b[rj3ft
Keep in mind that this solution is fragile and relies on the regularity of the names of the input files / folders. If you are working with unpredictable data, you may need to think harder...
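For less regular names, one option is to skip the generated script and let find hand each path to a small inline shell, so nothing from a file name is ever parsed as code. This is a sketch, not a drop-in replacement: pair_samples is a hypothetical name, and it assumes (as above) that each Sample-NNN-xyz sits next to a file_xyz.txt.

```shell
# Hypothetical helper: print "file_xyz.txt Sample-NNN-xyz" for each Sample* file
# whose partner file exists in the same folder.
pair_samples() {
    find "${1:-.}" -type f -name 'Sample*' -exec sh -c '
        for path do
            s=${path##*/}               # basename, e.g. Sample-001-abc
            dir=${path%/*}              # containing folder
            f="file_${s##*-}.txt"       # expected partner name
            if [ -f "$dir/$f" ]; then
                printf "%s %s\n" "$f" "$s"
            else
                printf "missing partner for %s\n" "$path" >&2
            fi
        done
    ' sh {} +
}
```

Because the paths arrive as arguments and are always quoted, brackets and other odd characters in folder names do not need special handling.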
step 2 (run the meta script)
$ find -name Sample\* -type f \
-printf "cd $PWD"';s="%f"; cd %h ; echo file_${s##*-}.txt %f\n' \
| bash -x
cd /home/jaroslav/tmp/so-link;s="Sample-001-abc"; cd ./Folder_1 ; echo file_${s##*-}.txt Sample-001-abc
cd /home/jaroslav/tmp/so-link;s="Sample-002-efg"; cd ./Folder_2 ; echo file_${s##*-}.txt Sample-002-efg
cd /home/jaroslav/tmp/so-link;s="Sample-003-hig"; cd ./Folder_3 ; echo file_${s##*-}.txt Sample-003-hig
(...)
+ cd './v0a7F1K5P[fKL5NSYXaMZdFdGV7UCK_154'
+ echo file_R0IZENm2ni.txt Sample-154-R0IZENm2ni
file_R0IZENm2ni.txt Sample-154-R0IZENm2ni
+ cd /home/jaroslav/tmp/so-link
+ s=Sample-155-MEuAFsvztX
+ cd ./tKUlAFlPy2zq2xiZhruhS9U2VDnNQc7LiwYkUxL_155
+ echo file_MEuAFsvztX.txt Sample-155-MEuAFsvztX
file_MEuAFsvztX.txt Sample-155-MEuAFsvztX
+ cd /home/jaroslav/tmp/so-link
+ s='Sample-156-M7e]VOOHFz'
+ cd './3zVn7Z9ltN2MpmS[lo]6DCgv4RFEdX9XDoskFY0p_156'
+ echo 'file_M7e]VOOHFz.txt' 'Sample-156-M7e]VOOHFz'
file_M7e]VOOHFz.txt Sample-156-M7e]VOOHFz
(...)
step 3 (save the output)
$ find -name Sample\* -type f \
-printf "cd $PWD"';s="%f"; cd %h ; echo file_${s##*-}.txt %f\n' \
| bash | column -t \
| tee /tmp/list
(...)
file_ImN.txt Sample-140-ImN
file_nxr.txt Sample-141-nxr
file_4Di.txt Sample-142-4Di
file_nhE.txt Sample-143-nhE
file_W9eDdDIEN.txt Sample-144-W9eDdDIEN
file_vhTSZfTh.txt Sample-145-vhTSZfTh
file_b[rj3ft.txt Sample-146-b[rj3ft
file_]OyWwrVd0RN.txt Sample-147-]OyWwrVd0RN
(...)
From here you can sort the output or whatever else you need
$ cat /tmp/list | sed 's/\ \+/ /' | sort -b -k2.9,2.14n --debug
_________________________________________
file_M7e]VOOHFz.txt Sample-156-M7e]VOOHFz
___
_________________________________________
file_ftb8ifOfYt.txt Sample-157-ftb8ifOfYt
___
_________________________________________
file_tWOa]ZbTA.txt Sample-158-tWOa]ZbTA
___
_______________________________________
file_n4s2Znk.txt Sample-159-n4s2Znk
___
___________________________________
file_mjMox[P.txt Sample-160-mjMox[P
___
___________________________________
file_tQ3.txt Sample-161-tQ3
___
___________________________
file_G6VFcFHYH.txt Sample-162-G6VFcFHYH
___
_______________________________________
file_4mXCFTZaC.txt Sample-163-4mXCFTZaC
___
_______________________________________

Related

moving files to their respective folders using bash scripting

I have files in this format:
2022-03-5344-REQUEST.jpg
2022-03-5344-IMAGE.jpg
2022-03-5344-00imgtest.jpg
2022-03-5344-anotherone.JPG
2022-03-5343-kdijffj.JPG
2022-03-5343-zslkjfs.jpg
2022-03-5343-myimage-2010.jpg
2022-03-5343-anotherone.png
2022-03-5342-ebee5654.jpeg
2022-03-5342-dec.jpg
2022-03-5341-att.jpg
2022-03-5341-timephoto_december.jpeg
....
about 13k images like these.
I want to create folders like:
2022-03-5344/
2022-03-5343/
2022-03-5342/
2022-03-5341/
....
I started manually moving them like:
mkdir name
mv name-* name/
But of course I'm not gonna repeat this process for 13k files.
So I want to do this using bash scripting, and since I am new to bash, and I am working on a production environment, I want to play it safe, but it doesn't give me my results. This is what I did so far:
#!/bin/bash
name = $1
mkdir "$name"
mv "${name}-*" $name/
and all I can do is: ./move.sh name for every folder, I didn't know how to automate this using loops.
With bash and a regex. I assume that the files are all in the current directory.
for name in *; do
if [[ "$name" =~ (^....-..-....)- ]]; then
dir="${BASH_REMATCH[1]}"; # dir contains 2022-03-5344, e.g.
echo mkdir -p "$dir" || exit 1;
echo mv -v "$name" "$dir";
fi;
done
If output looks okay, remove both echo.
Try this
xargs -i sh -c 'mkdir -p {}; mv {}-* {}' < <(ls *-*-*-*|awk -F- -vOFS=- '{print $1,$2,$3}'|uniq)
Or:
find . -maxdepth 1 -type f -name "*-*-*-*" | \
awk -F- -vOFS=- '{print $1,$2,$3}' | \
sort -u | \
xargs -i sh -c 'mkdir -p {}; mv {}-* {}'
Or find with regex:
find . -maxdepth 1 -type f -regextype posix-extended -regex ".*/[0-9]{4}-[0-9]{2}-[0-9]{4}.*"
You could use awk
$ cat awk.script
/^[[:digit:]-]/ && ! a[$1]++ {
dir=$1
} /^[[:digit:]-]/ {
system("sudo mkdir " dir )
system("sudo mv " $0" "dir"/"$0)
}
To call the script and use for your purposes;
$ awk -F"-([0-9]+)?[[:alpha:]]+.*" -f awk.script <(ls)
You will see some errors such as:
mkdir: cannot create directory ‘2022-03-5341’: File exists
After the initial directory has been created, you can safely ignore these errors, since the directory now exists.
The content of each directory will now have the relevant files
$ ls 2022-03-5344
2022-03-5344-00imgtest.jpg 2022-03-5344-IMAGE.jpg 2022-03-5344-REQUEST.jpg 2022-03-5344-anotherone.JPG
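The "File exists" messages can be avoided with mkdir -p, and the directory prefix can also be cut out with parameter expansion alone. A hedged sketch, assuming every file to move starts with three dash-separated fields as in the question; group_by_prefix is a hypothetical name.

```shell
# Hypothetical helper: move each file in the current directory into a folder
# named after its first three dash-separated fields.
group_by_prefix() {
    local name rest dir
    for name in *-*-*-*; do
        [ -f "$name" ] || continue        # skip directories and an unmatched glob
        rest=${name#*-*-*-}               # text after the third dash
        dir=${name%"-$rest"}              # e.g. 2022-03-5344
        mkdir -p -- "$dir" && mv -- "$name" "$dir/"
    done
}
```

Run it from the directory that holds the images. The created folders contain only two dashes, so the glob does not pick them up on a second run.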

Sort files based on filename into folders and concat files within each folder based on folder name

Any help would be VERY appreciated! I have hundreds of video files named in the following format (see below). The first 4 characters are random, but there are always 4 of them. 3000 is always there.
Can someone please help me create folders based on the center of the filename (i.e. 000, 001, 002, 003 and so on)?
Then concatenate all the files in each of the folders using ffmpeg, in filename order (0000.ts, 0001.ts, 0002.ts and so on), into a file named 000merged.ts, 001merged.ts, 002merged.ts and so on...
This is close to what I need
find . -type f -name "*jpg" -maxdepth 1 -exec bash -c 'mkdir -p "${0%%_*}"' {} \; \
-exec bash -c 'mv "$0" "${0%%_*}"' {} \;
mkdir /tmp/test && cd $_ #or cd ~/Desktop
echo A > 1e98_3000_000_000_0000.ts #create some small test files
echo B > 1e98_3000_000_000_0001.ts
echo C > 1e98_3000_000_000_0002.ts
echo D > 1e98_3000_000_000_0003.ts
echo E > d82j_3000_001_000_0000.ts
echo F > d82j_3000_001_000_0001.ts
echo G > d82j_3000_001_000_0002.ts
echo H > d82j_3000_001_000_0003.ts
echo I > a03l_3000_002_000_0000.ts
echo J > a03l_3000_002_000_0001.ts
echo K > a03l_3000_002_000_0002.ts
echo L > a03l_3000_002_000_0003.ts
# mkdir and copy each *.ts into its dir plus rename file:
perl -E'/^...._3000_(...)_..._(....\.ts)$/&&qx(mkdir -p $1;cp -p $_ $1/$2)for@ARGV' *.ts
ls -rtl
find ??? -type f -ls
for dir in ???;do cat $dir/????.ts > $dir/${dir}merged.ts; done
ls -rtl */*merged.ts
Cleanup test:
rm -rf /tmp/test/??? #cleanup new dirs with files
rm -rf /tmp/test #cleanup all
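The same steps as the perl one-liner and the cat loop above can also be written in plain shell. This is a sketch under the naming pattern from the question; merge_groups is a hypothetical name, and it assumes every three-character directory in the current folder is one of the groups.

```shell
# Hypothetical helper: copy each .ts file into a folder named after its center
# field, then concatenate the copies per folder in filename order.
merge_groups() {
    local name rest group dir
    for name in ????_3000_???_???_????.ts; do
        [ -f "$name" ] || continue
        rest=${name#*_*_}          # drop the random prefix and 3000, e.g. 000_000_0000.ts
        group=${rest%%_*}          # center field, e.g. 000
        mkdir -p -- "$group" && cp -p -- "$name" "$group/${name##*_}"
    done
    for dir in ???; do
        [ -d "$dir" ] || continue
        cat "$dir"/????.ts > "$dir/${dir}merged.ts"   # the glob expands in sorted order
    done
}
```

The merged file name is longer than four characters before the extension, so it is not picked up by the ????.ts glob if the function is run again.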

How to enter only the newly created directories in a bash script

#!/bin/bash
# loop 1
for fname in *.xlsx *csv
do
dname=${fname%.*}
[[ -d $dname ]] || mkdir "$dname"
mv "$fname" "$dname"
done
# In this loop, for each gene, enter its folder and run getChromosomicPositions.sh to get
# the positions in the genome, and getHaplotypeStrings.sh to get the variants
# loop 2
for geni in */; do
cd $geni
z=$(tail -n 1 *.csv | tr ';' "\n" | wc -l) # count the columns in the csv file and pass the result to getChromosomicPositions
cd ..
cp getHGSVposHG19.sh $geni
cp getChromosomicPositions.sh $geni
cp getHaplotypeStrings.sh $geni
cd $geni
pippo=$(basename $(pwd))
export z
export pippo
./getChromosomicPositions.sh *.csv
export z
./getHaplotypeStrings.sh *.csv
cd ..
done
I have this script with two loops.
I want loop 2 to run the programs only in the new directories created in loop 1.
For example:
The main directory contains these files:
pippo.xlsx pippo.csv
caio.xlsx caio.csv
topolino(directory)
minny(directory)
paperino(directory)
Loop 1 creates the directories pippo and caio.
I want loop 2 to enter only the new directories (pippo, caio) and do all its work there, not in the old directories topolino, minny and paperino.
How can I do this?
You could create an array with the newly created directories, and then loop over this array:
#!/bin/bash
set -o errexit # Exit the script in case of error
# Pre-declare an empty array.
declare -a new_directories;
# Save other scripts path for later use.
getHGSVposHG19="${PWD}/getHGSVposHG19.sh"
getChromosomicPositions="${PWD}/getChromosomicPositions.sh"
getHaplotypeStrings="${PWD}/getHaplotypeStrings.sh"
# Create missing directories
for fname in *.xlsx *csv
do
dname=${fname%.*}
if [ ! -d "$dname" ]
then
mkdir "$dname";
new_directories+=("${dname}");
fi
mv "$fname" "$dname";
done
# Initialize only new directories
for geni in "${new_directories[@]}"
do
pushd "${geni}"; # Add to directory stack
export pippo="$(basename $(pwd))";
export z=$(tail -n 1 *.csv | tr ';' "\n" | wc -l)
"${getChromosomicPositions}" *.csv;
export z; # Is the previous script modifying z?
"${getHaplotypeStrings}" *.csv;
popd; #Remove from directory stack
done
I also took the liberty to clean your script a bit, since I can't run it as is, it might have some typos/bugs...
What is pushd and popd?
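On the pushd and popd builtins used in the script above: pushd changes into a directory and remembers the previous one on a stack, and popd returns to the directory on top of that stack. A minimal bash illustration, using a throwaway directory from mktemp (the variable names are only for this demo):

```shell
#!/usr/bin/env bash
start=$PWD
demo_dir=$(mktemp -d)          # hypothetical scratch directory
pushd "$demo_dir" >/dev/null   # enter it; the previous directory goes on the stack
inside=$PWD
popd >/dev/null                # pop the stack and return to the previous directory
after=$PWD
rmdir "$demo_dir"
echo "start=$start inside=$inside after=$after"
```

Both builtins print the directory stack by default, which is why the output is sent to /dev/null here.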

Need a bash scripts to move files to sub folders automatically

I have a folder with 320G of images. I want to move the images into 5 sub folders randomly (they just need to end up in 5 sub folders). But I know nothing about bash scripts. Could someone please help? Thanks!
You could move the files to different directories based on their first letter:
mv [A-Fa-f]* dir1
mv [F-Kf-k]* dir2
mv [^A-Ka-k]* dir3
Here is my take on this. In order to use it, place the script somewhere else (not in your folder) but run it from your folder. If you call your script file rmove.sh, you can place it in, say, ~/scripts/, then cd to your folder and run:
source ~/scripts/rmove.sh
#!/bin/bash
ndirs=$((`find -type d | wc -l` - 1))
for file in *; do
if [ -f "${file}" ]; then
rand=`dd if=/dev/random bs=1 count=1 2>/dev/null | hexdump -b | head -n1 | cut -d" " -f2`
rand=$((rand % ndirs))
i=0
for directory in `find -type d`; do
if [ "${directory}" = . ]; then
continue
fi
if [ $i -eq $rand ]; then
mv "${file}" "${directory}"
fi
i=$((i + 1))
done
fi
done
Here's my stab at the problem:
#!/usr/bin/env bash
sdprefix=subdir
dirs=5
# pre-create all possible sub dirs
for n in {1..5} ; do
mkdir -p "${sdprefix}$n"
done
fcount=$(find . -maxdepth 1 -type f | wc -l)
while IFS= read -r -d $'\0' file ; do
subdir="${sdprefix}"$(expr \( $RANDOM % $dirs \) + 1)
mv -f "$file" "$subdir"
done < <(find . -maxdepth 1 -type f -print0)
Works with huge numbers of files
Does not break if a file is not moveable
Creates subdirectories if necessary
Does not break on unusual file names
Relatively cheap
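Since the question only needs the files spread over five sub folders, a deterministic round-robin split is another option (this swaps the random choice for a counter). A sketch, where spread_files and the subdir prefix are hypothetical names and hidden files are not covered by the glob:

```shell
# Hypothetical helper: spread the regular files in the current directory
# over N sub folders in round-robin order.
spread_files() {
    local n prefix i f d
    n=${1:-5}             # number of sub folders (5, as in the question)
    prefix=${2:-subdir}   # hypothetical folder name prefix
    i=0
    for f in *; do
        [ -f "$f" ] || continue       # skip directories and an unmatched glob
        d="$prefix$(( i % n + 1 ))"
        mkdir -p -- "$d" && mv -- "$f" "$d/"
        i=$(( i + 1 ))
    done
}
```

Run it from inside the image folder, for example spread_files 5, and keep the script itself outside that folder so it is not moved along with the images.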
Any scripting language will do so I'll write in Python here:
#!/usr/bin/python
import os
import random
new_paths = ['/path1', '/path2', '/path3', '/path4', '/path5']
image_directory = '/path/to/images'
for file_path in os.listdir(image_directory):
full_path = os.path.abspath(os.path.join(image_directory, file_path))
random_subdir = random.choice(new_paths)
new_path = os.path.abspath(os.path.join(random_subdir, file_path))
os.rename(full_path, new_path)
mv `ls | while read x; do echo "\`expr $RANDOM % 1000\`:$x"; done \
| sort -n | sed 's/[0-9]*://' | head -1` ./DIRNAME
run it in your current image directory, this command will select one file at a time and move it to ./DIRNAME, iterate this command until there are no more files to move.
Pay attention that ` is backquotes and not just quotes characters.

Need a quick bash script

I have about 100 directories all in the same parent directory that adhere to the naming convention [sitename].com. I want to rename them all [sitename].subdomain.com.
Here's what I tried:
for FILE in `ls | sed 's/.com//' | xargs`;mv $FILE.com $FILE.subdomain.com;
But it fails miserably. Any ideas?
Use rename(1).
rename .com .subdomain.com *.com
And if you have a perl rename instead of the normal one, this works:
rename s/\\.com$/.subdomain.com/ *.com
Using bash:
for i in *.com
do
mv "$i" "${i%%.com}.subdomain.com"
done
The ${i%%.com} construct returns the value of i without the '.com' suffix.
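To see the suffix removal in isolation, here is a small dry-run sketch. subdomain_name is a hypothetical helper; for a fixed suffix like .com, the single % form (shortest match) gives the same result as %%.

```shell
# Hypothetical helper: compute the new name for one directory.
subdomain_name() {
    printf '%s\n' "${1%.com}.subdomain.com"   # % strips the matching suffix
}

# Dry run: print the planned renames for every *.com directory without moving anything.
for d in *.com; do
    [ -d "$d" ] || continue
    echo mv -- "$d" "$(subdomain_name "$d")"
done
```

Once the printed commands look right, the echo can be dropped to perform the renames.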
What about:
ls |
grep -Fv '.subdomain.com' |
while read FILE; do
f=`basename "$FILE" .com`
mv $f.com $f.subdomain.com
done
See: http://blog.ivandemarino.me/2010/09/30/Rename-Subdirectories-in-a-Tree-the-Bash-way
#!/bin/bash
# Simple Bash script to recursively rename Subdirectories in a Tree.
# Author: Ivan De Marino <ivan.demarino@betfair.com>
#
# Usage:
# rename_subdirs.sh <starting directory> <old dir name> <new dir name>
usage () {
echo "Simple Bash script to recursively rename Subdirectories in a Tree."
echo "Author: Ivan De Marino <ivan.demarino@betfair.com>"
echo
echo "Usage:"
echo " rename_subdirs.sh <starting directory> <old dir name> <new dir name>"
exit 1
}
[ "$#" -eq 3 ] || usage
recursive()
{
cd "$1"
for dir in *
do
if [ -d "$dir" ]; then
echo "Directory found: '$dir'"
( recursive "$dir" "$2" "$3" )
if [ "$dir" == "$2" ]; then
echo "Renaming '$2' in '$3'"
mv "$2" "$3"
fi;
fi;
done
}
recursive "$1" "$2" "$3"
find . -maxdepth 1 -type d -name '*.com' \
| while IFS= read -r site; do
mv "${site}" "${site%.com}.subdomain.com"
done
Try this:
for FILE in *.com; do
FNAME=$(echo "$FILE" | sed 's/\.com$//')
mv "$FILE" "$FNAME.subdomain.com"
done
