Splitting a directory of 14 million files into multiple subdirectories - bash

I have a directory called direct that contains 14 million files of the form file54.txt, where the number 54 could be any natural number between 1 and 14 million. Is there a way to split those files into, for example, 1000 sub-directories inside direct that together contain all 14 million files?

#!/bin/bash
# Files are numbered 1..14000000, so map file i to directory (i-1)/14000,
# which yields directories 0..999 with 14000 files each.
for (( i=1; i<=14000000; ++i )); do
    (( dirname=(i-1)/14000 ))
    if (( (i-1)%14000 == 0 )); then
        mkdir -p direct/$dirname
    fi
    mv direct/file$i.txt direct/$dirname/file$i.txt
done
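Note that this forks one mv process per file, 14 million times. As a rough sketch of a batched alternative (assuming GNU coreutils, since seq -f and mv -t are GNU extensions, and the same file1.txt..file14000000.txt naming as above):
#!/bin/bash
for (( d=0; d<1000; ++d )); do
    mkdir -p "direct/$d"
    start=$(( d*14000 + 1 ))
    end=$(( (d+1)*14000 ))
    # seq -f generates the 14000 names for this batch; xargs splits them
    # into as few mv invocations as the ARG_MAX limit allows.
    seq -f "direct/file%.0f.txt" "$start" "$end" | xargs mv -t "direct/$d"
done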

Related

How do I create 100 files with 1 random number in each of them and give them permissions based on the number

I need to create a bash script that creates 100 files with random numbers in them.
I tried to do it using:
for i in {1..100}; do $RANDOM > RANDOM.txt
I don't know if that's the correct way to do it.
And then I need to give the files read, write, and execute permissions based on the number inside each file. No idea how to do that.
I tried using:
if [ $i%2==0 ]
then
echo chmod +rw $RANDOM.txt
but that doesn't seem to work
Just got some feedback; it turns out I was doing everything wrong.
I had to create 100 files, 1.txt to 100.txt, so I used touch {1..100}.txt; now I need to put one random number in each of those files. Should I use echo or shuf to do this?
I think it'd be simplest to use chmod with octal permissions, like 0777 for rwxrwxrwx etc.
Example:
#!/bin/bash
for ((i=0; i<100; ++i)) {
    rnd=$RANDOM                      # pick a random number
    (( perm = rnd % 512 ))           # make it in the range [0, 512) (0000-0777 oct)
    printf -v octperm "%04o" $perm   # convert to octal
    file=$rnd.txt                    # not sure if you meant to name it after the number
    echo $rnd > $file
    chmod $octperm $file             # chmod with octal number
}
Excerpt of files:
-r-xrw--wx 1 ted users 5 15 dec 17.53 6515.txt
---x-wxrwx 1 ted users 5 15 dec 17.53 6751.txt
-rwx-w--w- 1 ted users 5 15 dec 17.53 8146.txt
-rw-r----- 1 ted users 5 15 dec 17.53 8608.txt
--w--w---x 1 ted users 5 15 dec 17.53 8849.txt
--wx----wx 1 ted users 5 15 dec 17.53 8899.txt
--wxrwx-wx 1 ted users 5 15 dec 17.53 8955.txt
-rw-r-xrw- 1 ted users 5 15 dec 17.53 9134.txt
...
If you want to take your current umask into consideration, you could do that too, by masking away the bits in the permission indicated by the umask.
#!/bin/bash
(( um = ~$(umask) ))   # bitwise negated umask
for ((i=0; i<100; ++i)) {
    rnd=$RANDOM
    (( perm = (rnd % 01000) & um ))   # [0000,0777] bitwise AND umask
    printf -v octperm "%04o" $perm
    file=$i.$rnd.txt                  # using $i. to make name unique
    echo $rnd > $file
    chmod $octperm $file
}
If your umask is currently 0022, the above example would not create any files writable by group and/or others, while the remaining permission bits would be random.
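As a quick illustration of that masking arithmetic, assuming a umask of 0022:
printf "%04o\n" $(( 0777 & ~0022 ))   # prints 0755: group/other write bits cleared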
First, you need to echo the random number, not use it as a command.
Second, if you want to use the same random number as the filename and content, you need to save it to a variable. Otherwise you'll get a different number each time you write $RANDOM.
Third, that's not how you do arithmetic and conditions inside [ ]; any shell scripting tutorial should show the correct way. You can also use a bash arithmetic expression with (( expression )).
#!/bin/bash
for i in {1..100}
do
    r=$RANDOM
    echo "$r" > "$r.txt"
    if (( i % 2 == 0 ))
    then
        chmod +rw "$r.txt"
    fi
done
From Ted Lyngmo's answer
With some bashisms, like using the integer variable attribute and avoiding forks...
declare -i um=" ~$(umask) " i rnd perm
for ((i=100; i--;)) {
    rnd=RANDOM
    perm=' ( rnd % 01000 ) & um '
    printf -v file 'file-%03d-%04X.txt' $i $rnd
    printf -v octperm "%04o" "$perm"
    echo "$rnd" > "$file"
    chmod "$octperm" "$file"
}
(The filename is built from the file number in decimal AND the random number in hexadecimal.)
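For example (values made up for illustration): the first iteration, i=99, with rnd=6699 would produce file-099-1A2B.txt, since 6699 decimal is 1A2B hexadecimal.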
About performance
Maybe a little quicker, because of avoiding forks and using integers.
(The for ((;;)) { ...; } syntax used here is not quicker, just different (shorter)...
In fact, for ((i=100;i--;)); do ...; done is (imperceptibly) slower than for i in {1..100}; do ...; done! I just wanted to use this unusual syntax for extreme bashism... ;)
Some comparison:
export TIMEFORMAT=$'(%U + %S) / \e[1m%R\e[0m : %P'
About forks, timing 1'000 variable assignments for formatting, using printf:
time for i in {1..1000};do var=$(printf "foo");done
(0.773 + 0.347) / 1.058 : 105.93
time for i in {1..1000};do printf -v var "foo";done
(0.006 + 0.000) / 0.006 : 99.80
From over a second to 6 milliseconds on my host!!! There's no discussion: forks (the $(printf ...) syntax) are to be avoided!!
About integer properties (using 100'000 binary operations):
declare -i intvar
time for i in {1..100000};do var=$(( ( 431214 % 01000 ) & -19 ));done
(0.272 + 0.005) / 0.278 : 99.46
time for i in {1..100000};do intvar=' ( 431214 % 01000 ) & -19 ';done
(0.209 + 0.000) / 0.209 : 99.87
From 0.28 seconds to 0.21 seconds; this is less significant, but still a gain.
About for i in {1..n} vs for ((i=n;i--;)) (now using 1'000'000 loops):
time for i in {1..1000000};do :;done
(1.600 + 0.000) / 1.602 : 99.86
time for ((i=1000000;i--;));do :;done
(1.880 + 0.001) / 1.882 : 99.95
But this is clearly less significant (though beware of memory consumption when using brace expansion, since the whole list is built up front).
With awk, you could try the following. This program also takes care of closing the files it opens (to avoid a "too many open files" error). Written and tested in GNU awk.
awk -v numberOfFiles="100" '
BEGIN{
    srand()
    for(i=1;i<=numberOfFiles;i++){
        out=(int(1 + rand() * 10000))
        print out > (out".txt")
        system("chmod +rw " out".txt")
        close(out".txt")
    }
}'
I have created an awk variable named numberOfFiles and set it to 100 (as per the need to create 100 files). You can set it to whatever you need; if the count changes in the future, nothing else in the program needs to change.
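For example, to create 250 files instead, only the variable assignment changes and the program body stays identical:
awk -v numberOfFiles="250" '... same BEGIN block as above ...'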

Sort text file in a shell script [duplicate]

This question already has answers here:
How to merge two files consistently line by line
(6 answers)
How can I extract a predetermined range of lines from a text file on Unix?
(28 answers)
Closed 2 years ago.
I've got a text file, and the contents and the counts vary (date, link and id can be anything). However, the counts of dates, links and ids are always the same (so n - n - n for some positive integer n). If k is the total number of lines, then lines n, n + k/3 and n + 2k/3 belong together.
As an example, I picked n = 5 entries per category, i.e. k = 15 lines. So lines (1 | 6 | 11), (2 | 7 | 12), (3 | 8 | 13), (4 | 9 | 14) and (5 | 10 | 15) belong together:
Today, 17:09
Yesterday, 09:44
08.09.2020
07.09.2020
06.09.2020
/s-show/Link111...
/s-show/Link211...
/s-show/Link311...
/s-show/Link411...
/s-show/Link511...
id="1222222222"
id="2222222222"
id="3222222222"
id="4222222222"
id="5222222222"
I would like to sort the text file as the following:
id="1222222222"Today, 17:09/s-show/Link111...
id="2222222222"Yesterday, 09:44/s-show/Link211
id="3222222222"08.09.2020/s-show/Link311
id="4222222222"07.09.2020/s-show/Link411
id="5222222222"06.09.2020/s-show/Link511
In a former question, I only had two categories (date and link) and was advised to do it like the following:
lc=$(wc -l <Textfile); paste -d '' <(head -n $((lc/2)) Textfile) <(tail -n $((lc/2)) Textfile)
However, here I have 3 categories, and the head and tail commands won't let me read only the lines in the middle.
How could this be solved?
Leveraging the techniques taught in How can I extract a predetermined range of lines from a text file on Unix? --
#!/usr/bin/env bash
input=$1
total_lines=$(wc -l <"$1")
sections=$2
lines_per_section=$(( total_lines / sections ))
if (( lines_per_section * sections != total_lines )); then
    echo "ERROR: ${total_lines} does not evenly divide into ${sections} sections" >&2
    exit 1
fi
start=0
ranges=( )
for (( i=0; i<sections; i++ )); do
    ranges+=( "$start:$(( start + lines_per_section ))" )
    (( start += lines_per_section ))
done
# Print lines ($1+1)..$2 of the input, then stop reading.
get_range() { sed -n "$(( $1 + 1 )),$(( $2 ))p;$(( $2 + 1 ))q" <"$input"; }
# Recursively paste the first range against the remaining ones.
consolidate_input() {
    if (( $# )); then
        current=$1; shift
        paste <(get_range "${current%:*}" "${current#*:}") <(consolidate_input "$@")
    fi
}
consolidate_input "${ranges[@]}"
But don't do that. Just put your three sections in three separate files, so you can use paste file1 file2 file3.
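A minimal sketch of that simpler route, assuming the combined file is named Textfile, its line count is divisible by 3, and split's default output names xaa/xab/xac:
k=$(wc -l < Textfile)
split -l "$(( k / 3 ))" Textfile   # -> xaa (dates), xab (links), xac (ids)
paste -d '' xac xaa xab            # id, then date, then link, as in the desired output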

How to nest loops correctly

I have 2 scripts, #1 and #2. Each works OK by itself. I want to read a 15-row file, row by row, and process it. Script #2 selects rows: row 0 is indicated as firstline=0, lastline=1; row 14 would be firstline=14, lastline=15. I see good results from echo. I want to do the same with script #1 but can't get my head around nesting correctly. Code below.
#!/bin/bash
# script 1
filename=slash
firstline=0
lastline=1
i=0
exec <${filename}
while read ; do
    i=$(( $i + 1 ))
    if [ "$i" -ge "${firstline}" ] ; then
        if [ "$i" -gt "${lastline}" ] ; then
            break
        else
            echo "${REPLY}" > slash1
            fold -w 21 -s slash1 > news1
            sleep 5
        fi
    fi
done
# script2
firstline=(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14)
lastline=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15)
for ((i=0; i<${#firstline[@]}; i++))
do
    echo ${firstline[$i]} ${lastline[$i]};
done
Your question is very unclear, but perhaps you are simply looking for some simple function calls:
#!/bin/bash
script_1() {
    filename=slash
    firstline=$1
    lastline=$2
    i=0
    exec <${filename}
    while read ; do
        i=$(( $i + 1 ))
        if [ "$i" -ge "${firstline}" ] ; then
            if [ "$i" -gt "${lastline}" ] ; then
                break
            else
                echo "${REPLY}" > slash1
                fold -w 21 -s slash1 > news1
                sleep 5
            fi
        fi
    done
}
# script2
firstline=(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14)
lastline=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15)
for ((i=0; i<${#firstline[@]}; i++))
do
    script_1 ${firstline[$i]} ${lastline[$i]};
done
Note that reading the file this way is extremely inefficient, and there are undoubtedly better ways to handle this, but I am trying to minimize the changes from your code.
Update: Based on your later comments, the following idiomatic Bash code that uses sed to extract the line of interest in each iteration solves your problem much more simply:
Note:
- If the input file does not change between loop iterations, and the input file is small enough (as it is in the case at hand), it's more efficient to buffer the file contents in a variable up front, as is demonstrated in the original answer below.
- As tripleee points out in a comment: If simply reading the input lines sequentially is sufficient (as opposed to extracting lines by specific line numbers), then a single, simple while read -r line; do ... # fold and output, then sleep ... done < "$filename" is enough (a sketch follows the code below).
# Determine the input filename.
filename='slash'
# Count its number of lines.
lineCount=$(wc -l < "$filename")
# Loop over the line numbers of the file.
for (( lineNum = 1; lineNum <= lineCount; ++lineNum )); do
    # Use `sed` to extract the line with the line number at hand,
    # reformat it, and output to the target file.
    fold -w 21 -s <(sed -n "$lineNum {p;q;}" "$filename") > 'news1'
    sleep 5
done
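For completeness, here is a sketch of that simple sequential variant (assuming the same filename, fold width, and output file as the loop above):
while read -r line; do
    # Reformat the current line and output to the target file.
    fold -w 21 -s <<<"$line" > 'news1'
    sleep 5
done < "$filename"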
A simplified version of what I think you're trying to achieve:
#!/bin/bash
# Split fields by newlines on input,
# and separate array items by newlines on output.
IFS=$'\n'
# Read all input lines up front, into array ${lines[@]}
# In terms of your code, you'd use
#   read -d '' -ra lines < "$filename"
read -d '' -ra lines <<<$'line 1\nline 2\nline 3\nline 4\nline 5\nline 6\nline 7\nline 8\nline 9\nline 10\nline 11\nline 12\nline 13\nline 14\nline 15'
# Define the arrays specifying the line ranges to select.
firstline=(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14)
lastline=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15)
# Loop over the ranges and select a range of lines in each iteration.
for ((i=0; i<${#firstline[@]}; i++)); do
    extractedLines="${lines[*]: ${firstline[i]}: 1 + ${lastline[i]} - ${firstline[i]}}"
    # Process the extracted lines.
    # In terms of your code, the `> slash1` and `fold ...` commands would go here.
    echo "$extractedLines"
    echo '------'
done
Note:
The name of the array variable filled with read -ra is lines; ${lines[@]} is Bash syntax for returning all array elements as separate words (${lines[*]} also refers to all elements, but with slightly different semantics), and this syntax is used in the comments to illustrate that lines is indeed an array variable. (Note that if you were to reference simply $lines, you'd implicitly get only the item with index 0, which is the same as ${lines[0]}.)
<<<$'line 1\n...' uses a here-string (<<<) to read an ad-hoc sample document (expressed as an ANSI C-quoted string ($'...')) in the interest of making my example code self-contained.
As stated in the comment, you'd read from $filename instead:
read -d '' -ra lines <"$filename"
extractedLines="${lines[*]: ${firstline[i]}: 1 + ${lastline[i]} - ${firstline[i]}}" extracts the lines of interest; ${firstline[i]} references the current element (index i) from array ${firstline[@]}. Since the last token in Bash's array-slicing syntax
(${lines[*]: <startIndex>: <elementCount>}) is the count of elements to return, we must perform a calculation to determine the count, which is what 1 + ${lastline[i]} - ${firstline[i]} does.
By virtue of using "${lines[*]...}" rather than "${lines[@]...}", the extracted array elements are joined by the first character in $IFS, which in our case is a newline ($'\n') (when extracting a single line, that doesn't really matter).
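A quick illustration of that slicing-and-joining behavior (made-up values, purely for demonstration):
lines=(alpha beta gamma delta)
IFS=$'\n'
echo "${lines[*]:1:2}"   # 2 elements starting at index 1, joined by newline
# prints:
# beta
# gamma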

How to sum a row of numbers from text file-- Bash Shell Scripting

I'm trying to write a bash script that calculates the average of numbers by rows and columns. An example of a text file that I'm reading in is:
1 2 3 4 5
4 6 7 8 0
There is an unknown number of rows and unknown number of columns. Currently, I'm just trying to sum each row with a while loop. The desired output is:
1 2 3 4 5 Sum = 15
4 6 7 8 0 Sum = 25
And so on and so forth with each row. Currently this is the code I have:
while read i
do
    echo "num: $i"
    (( sum=$sum+$i ))
    echo "sum: $sum"
done < $2
To call the program it's stats -r test_file ("-r" indicates rows; I haven't started columns quite yet). My current code actually just takes the first number of each column and adds them together, and then the rest of the numbers error out with a syntax error. It says the error comes from line 16, which is the (( sum=$sum+$i )) line, but I honestly can't figure out what the problem is. I should tell you I'm extremely new to bash scripting, and I have googled and searched high and low for the answer and can't find it. Any help is greatly appreciated.
You are reading the file line by line, and a whole line such as "1 2 3 4 5" is not a number, so adding it to sum is not a valid arithmetic operation. Try this:
while read i
do
    sum=0
    for num in $i
    do
        sum=$(($sum + $num))
    done
    echo "$i Sum: $sum"
done < $2
Just split out each number from the line with a for loop. I hope this helps.
Another non-bash way (con: OP asked for bash; pro: does not depend on bashisms, and works with floats).
awk '{c=0;for(i=1;i<=NF;++i){c+=$i};print $0, "Sum:", c}'
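For example, run against the sample file (assuming it is named test_file), it prints:
awk '{c=0;for(i=1;i<=NF;++i){c+=$i};print $0, "Sum:", c}' test_file
1 2 3 4 5 Sum: 15
4 6 7 8 0 Sum: 25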
Another way (not pure bash): use sed to turn each line's whitespace into + signs, and let bc evaluate the resulting expression:
while read line
do
    sum=$(sed 's/[ ]\+/+/g' <<< "$line" | bc -q)
    echo "$line Sum = $sum"
done < filename
Using the numsum -r util covers the row addition, but the output format needs a little glue, inefficiently paste-ing together a few utils:
paste "$2" \
      <(yes "Sum =" | head -$(wc -l < "$2") ) \
      <(numsum -r "$2")
Output:
1 2 3 4 5 Sum = 15
4 6 7 8 0 Sum = 25
Note -- to run the above line on a given file foo, first initialize $2 like so:
set -- "" foo
paste "$2" <(yes "Sum =" | head -$(wc -l < "$2") ) <(numsum -r "$2")

count number of patterns

I have a CSV file, and I want to remove the columns that have fewer than 5 different values. E.g., given
a b c;
1 1 1;
1 2 2;
1 3 4;
2 4 5;
1 6 7;
then I want to remove column a, since it has only two different values (1, 2). How can I do this?
A solution using arrays:
infile="infile.txt"
different=5
rows=0
while read -a line ; do
    data+=( ${line[@]/;/} )                   # remove all semicolons
    ((rows++))
done < "$infile"
cols=$(( ${#data[@]}/rows ))                  # calculate number of columns
result=()
for (( CNTR1=0; CNTR1<cols; CNTR1+=1 )); do
    cnt=()
    save=( ${data[CNTR1]} )                   # add column header
    for (( CNTR2=cols; CNTR2<${#data[@]}; CNTR2+=cols )); do
        cnt[${data[CNTR1+CNTR2]}]=1
        save+=( ${data[CNTR1+CNTR2]} )        # add column data
    done
    if [ ${#cnt[@]} -ge $different ] ; then   # keep column if it has at least $different values
        result+=( ${save[@]} )                # add column to the result
    fi
done
cols=$((${#result[@]}/rows))                  # recalculate number of columns
for (( CNTR1=0; CNTR1<rows; CNTR1+=1 )); do
    for (( CNTR2=0; CNTR2<${#result[@]}; CNTR2+=rows )); do
        printf " %s" "${result[CNTR1+CNTR2]}"
    done
    printf ";\n"
done
The output:
b c;
1 1;
2 2;
3 4;
4 5;
6 7;
I think that to solve this problem you can read the file to get the data (the numbers, which you can put in an array), then find the columns you want to remove, and finally write the result back to the file.
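A rough two-pass awk sketch of that idea (assumptions: space-separated columns with a trailing semicolon as in the example above, a hypothetical input file named infile.txt, and the file passed twice so distinct values can be counted before printing):
awk '
NR == FNR {                      # first pass: count distinct values per column
    if (FNR > 1) {
        gsub(/;/, "")            # drop the trailing semicolon
        for (i = 1; i <= NF; i++)
            if (!seen[i, $i]++) distinct[i]++
    }
    next
}
{                                # second pass: print only surviving columns
    gsub(/;/, "")
    out = ""
    for (i = 1; i <= NF; i++)
        if (distinct[i] >= 5) out = out " " $i
    print substr(out, 2) ";"
}' infile.txt infile.txt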
