Average number of rows in 10000 text files - bash

I have a set of 10000 text files (file1.txt, file2.txt, ... file10000.txt). Each one has a different number of rows. I'd like to know what the average number of rows is across these 10000 files, excluding the last row of each file. For example:
File1:
a
b
c
d
last
File2:
a
b
c
last
File3:
a
b
c
d
e
last
Here I should obtain 4 as the result. I tried with Python, but it takes too much time to read all the files. How could I do this with a shell script?

Here's one way:
$ touch file{1..3}.txt
(file1.txt is then given 1 line, file2.txt 2 lines, and so on.)
$ for i in {1..3}; do wc -l file${i}.txt; done | awk '{sum+=$1}END{print sum/NR}'
2
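Note that the loop above averages the full line counts. To exclude the last row of each file, as the question asks, a hedged variant (assuming every file has at least one line) subtracts one line per file before averaging:
for f in file*.txt; do wc -l < "$f"; done | awk '{sum += $1 - 1} END {print sum / NR}'
With the three example files (5, 4 and 6 lines), this prints 4.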

Related

How to take the average column count from multiple files using shell?

I have 3 files, each having some number of columns. I want to take the column count of each file, sum them, and divide by the total number of files, thus getting the average column count across multiple files.
Note: the column count of each file may not be the same each time, and the number of files can increase.
Kindly help.
eg -
file1 = 3 columns
file2 = 4 columns
file3 = 5 columns
sum(3+4+5)/3 (file count) = average column count for a directory containing multiple files.
You can use the code snippet below if you want. It is a bit elaborate, with comments explaining everything in detail.
The location contains 3 files:
#!/bin/bash

fileCount=`ls -lrt file*.txt | wc -l` ## take the count of the number of files in the location
columnCount=0 ## running total of column counts

for file in ./file*.txt; do ## loop over the files individually; "file" represents each file in turn
    temp=`awk -F"," '{print NF}' ${file} | uniq` ## store the column count in "temp"; assumes every row has the same number of fields
    columnCount=$(( columnCount + temp )) ## add the column count of each file to the running total
done

avgColumn=$(( columnCount / fileCount )) ## calculate the (integer) average
echo "Files in directory `pwd` have ${avgColumn} columns on average!" ## print the average
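If the files are comma-separated and every row in a file has the same number of fields, a more compact sketch (the file*.txt glob is an assumption carried over from the script above) does the whole thing in a single awk pass over the first line of each file:
awk -F, 'FNR==1 {sum += NF; files++} END {print sum / files}' file*.txt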

Compare files by position

I want to compare two files only by their first column.
My first file looks like this:
0009608a4138a8e7 hdisk26 altinst_rootvg
000f7d4a8234a675 hdisk12 vgdbf
000f7d4a8234d5c9 hdisk22 vgarcbkp
My second file looks like this:
000f7d4a8234a675 hdiskpower64 [Lun_vgdbf]
000f7d4a8234d5c9 hdiskpower61 [Lun_vgarcbkp]
This is the output I would like to generate:
0009608a4138a8e7 hdisk26 altinst_rootvg
000f7d4a8234a675 hdisk12 vgdbf hdiskpower64 [Lun_vgdbf]
000f7d4a8234d5c9 hdisk22 vgarcbkp hdiskpower61 [Lun_vgarcbkp]
I wonder why diff does not support positional compare.
Something like diff -y -p1-17 file1 file2. Any idea?
You can use join to produce your desired output:
join -a 1 file1 file2
The -a 1 option tells join to also output lines from the first file that have no correspondence in the second, so this assumes the first file contains every id that is present in the second.
It also relies on the files being sorted on their first field, which could be the case according to your sample data. If they are not, you will need to sort them beforehand (join will warn you about your files not being sorted); see the process-substitution sketch after the sample execution below.
Sample execution :
$ echo '1 a b
> 2 c d
> 3 e f' > test1
$ echo '2 9 8
> 3 7 6' > test2
$ join -a 1 test1 test2
1 a b
2 c d 9 8
3 e f 7 6
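If the files are not already sorted on their first field, a sketch using bash process substitution sorts them on the fly before joining:
join -a 1 <(sort -k1,1 file1) <(sort -k1,1 file2)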

Joining lines, modulo the number of records

Say my stream is x*N lines long, where x is the number of records and N is the number of columns per record, and is output column-wise. For example, x=2, N=3:
1
2
Alice
Bob
London
New York
How can I join every line, modulo the number of records, back into columns:
1 Alice London
2 Bob New York
If I use paste with -s, I get the transposed output. I could use split, with the -l option equal to x, then recombine the pieces afterwards with paste, but I'd like to do it within the stream without spitting out temporary files all over the place.
Is there an "easy" solution (i.e., rather than invoking something like awk)? I'm thinking there may be some magic join solution, but I can't see it...
EDIT Another example, when x=5 and N=3:
1
2
3
4
5
a
b
c
d
e
alpha
beta
gamma
delta
epsilon
Expected output:
1 a alpha
2 b beta
3 c gamma
4 d delta
5 e epsilon
You are looking for pr to "columnate" the stream:
pr -T -s$'\t' -3 <<'END_STREAM'
1
2
Alice
Bob
London
New York
END_STREAM
1 Alice London
2 Bob New York
pr is in coreutils.
Most systems should include a tool called pr, intended to print files. It's part of POSIX.1 so it's almost certainly on any system you'll use.
$ pr -3 -t < inp1
1 a alpha
2 b beta
3 c gamma
4 d delta
5 e epsilon
Or if you prefer,
$ pr -3 -t -s, < inp1
1,a,alpha
2,b,beta
3,c,gamma
4,d,delta
5,e,epsilon
or
$ pr -3 -t -w 20 < inp1
1 a alpha
2 b beta
3 c gamma
4 d delta
5 e epsilo
Check the POSIX.1 specification for standard usage information, or man pr for the specific options in your operating system.
In order to reliably process the input you need to either know the number of columns in the output file or the number of lines in the output file. If you just know the number of columns, you'd need to read the input file twice.
Hackish coreutils solution
# If you know the number of output columns (here ncols) but not the
# number of output lines, you can compute the latter with wc -l.
ncols=3
olines=$(( $(wc -l < file) / ncols ))
# Split the column-wise stream into one temporary piece per output column
split -l "${olines}" file FOO # FOO is a prefix. Choose a better one
paste FOO*
AWK solutions
If you know the number of output columns in advance you can use this awk script:
convert.awk:
BEGIN {
    # Split the file into one big record where fields are separated
    # by newlines
    RS=""
    FS="\n"
}
FNR==NR {
    # We are reading the file twice (see the invocation below).
    # On the first pass we only store the number of fields (lines)
    # in the variable n, because we need it when processing the file,
    # and skip to the next record so nothing is printed yet.
    n=NF
    next
}
{
    # n / c is the number of output lines
    # For every output line ...
    for (i=0; i<n/c; i++) {
        # ... print the columns belonging to it
        for (ii=1+i; ii<=NF; ii+=n/c) {
            printf "%s ", $ii
        }
        print "" # Adds a newline
    }
}
and call it like this:
awk -vc=3 -f convert.awk file file # Twice the same file
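With the script above (note the next that prevents printing during the first pass), and assuming the six-line example stream from the question is saved as file, the run looks like this:
$ awk -vc=3 -f convert.awk file file
1 Alice London
2 Bob New York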
If you know the number of output lines in advance you can use the following awk script:
convert.awk:
BEGIN {
    # Split the file into one big record where fields are separated
    # by newlines
    RS=""
    FS="\n"
}
{
    # x is the number of output lines and has been passed to the
    # script. For each output line ...
    for (i=0; i<x; i++) {
        # ... print the columns belonging to it
        for (ii=i+1; ii<=NF; ii+=x) {
            printf "%s ", $ii
        }
        print "" # Adds a newline
    }
}
And call it like this:
awk -vx=2 -f convert.awk file
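If paragraph mode feels opaque, a roughly equivalent one-liner sketch (x is again the number of records; a single space is used as the output separator) buffers all the lines and prints them row-wise in the END block:
awk -v x=2 '{a[NR]=$0} END {for (i=1; i<=x; i++) {line=a[i]; for (j=i+x; j<=NR; j+=x) line=line " " a[j]; print line}}' file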

How to read a file located in several folders and subfolders in Bash Shell

There are several files named TESTFILE located in directories ~/main1/sub1, ~/main1/sub2, ~/main1/sub3, ..., ~/main2/sub1, ~/main2/sub2, ..., ~/mainX/subY, where mainX is the main folder and subY are the subfolders inside the main folder. The TESTFILE file for each main folder-subfolder combination has the same pattern, but the data in each is unique.
Now here's what I want to do:
I want to read a specific number in the TESTFILE for each ~/mainX/subY.
I want to create a text file where every line has the following format [mainX][space][subY][space][value read from TESTFILE]
Some information about TESTFILE and the data I want to get:
It is an OSZICAR file from VASP, a DFT program
The number of lines in OSZICAR varies in different folder-subfolder combination
The information I want to get is always located in the last two lines of the file
The last two lines always look like this:
DAV: 2 -0.942521930239E+01 0.27889E-09 -0.79991E-13 864 0.312E-06
10 F= -.94252193E+01 E0= -.94252193E+01 d E =-.717252E-07
Or, in general, the pattern of the last two lines is:
DAV: a b c d e f
g F= h E0= i d E = j
where DAV:, F=, E0=, and d E = are the parts that do not change, and a, g, h, i, and j are the values I want to get.
Some information about main folder mainX and sub-folder subY:
The folders mainX and subY are all real numbers.
How I want the output to be:
Suppose mainX={0.12, 0.20, 0.34, 0.7} and subY={1.10, 2.30, 4.50, 1.00, 2.78}, and the last two lines of ~/0.12/1.10/OSZICAR is the example above, my output file should contain:
0.12 1.10 2 10 -.94252193E+01 -.94252193E+01 -.717252E-07
...
0.7 2.30 2 10 -.94252193E+01 -.94252193E+01 -.717252E-07
...
mainX subY a g h i j
How do I do this in the simplest way possible? I'm reading about grep, awk, and sed, and I'm very overwhelmed.
You could do this using some for loops in bash:
for m in ~/main*/; do
    main=$(basename "$m")
    for s in "$m"sub*/; do
        sub=$(basename "$s")
        num=$(tail -n2 "${s}TESTFILE" | awk -F'[ =]+' 'NR==1{s=$2;next}{print s,$1,$3,$5,$8}')
        echo "$main $sub $num"
    done
done > output_file
I have modified the command to extract the data from your file. It uses tail to read the last two lines of the file. The lines are passed to awk, where they are split into fields using any number of spaces and = signs together as the field separator. The second field from the first of the two lines is saved to the variable s. next skips to the next line, then the columns that you are interested in are printed.
Your question is not very clear - specifically on how to extract the value from TESTFILE, but this is something like what you want:
#!/bin/bash
for X in {1..100}; do
    for Y in {1..100}; do
        directory="main${X}/sub${Y}"
        echo "Checking $directory"
        if [ -f "${directory}/TESTFILE" ]; then
            something=$(grep something "${directory}/TESTFILE")
            echo "main${X} sub${Y} $something"
        fi
    done
done
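Since the folder names are described as real numbers (so a {1..100} loop will not match them), a hedged sketch that globs the two directory levels instead and reuses the field positions from the example lines above might look like this (the ~/X/Y/OSZICAR layout is an assumption):
for d in ~/*/*/; do
    X=$(basename "$(dirname "$d")")  # main folder name, e.g. 0.12
    Y=$(basename "$d")               # sub folder name, e.g. 1.10
    [ -f "${d}OSZICAR" ] || continue # skip folders without an OSZICAR
    vals=$(tail -n2 "${d}OSZICAR" | awk -F'[ =]+' 'NR==1{a=$2} NR==2{print a, $1, $3, $5, $8}')
    echo "$X $Y $vals"
done > output_file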

Bash - fill empty cell with following value in the column

I have a long tab-delimited file and I am trying to fill each empty cell with a value that comes later in the same column.
For instance, input.txt:
0
1
1.345 B
2
2.86 A
3
4
I would like an output such as:
0 B
1 B
1.345 B
2 A
2.86 A
3 B
4 B
I've been tinkering with code from other threads, like this awk solution, but the problem is that the value I want is not before the empty cell but after it, kind of like .FillUp in Excel.
Additional information:
input file may have different number of lines
"A" and "B" in input file may be at different rows and not evenly separated
second column may have only two values
last cell in second column may not have value
[EDIT] for the last two rows in input.txt, B is known to be in the second column, as all rows after 2.86 are not A.
Thanks in advance.
$ tac input.txt | awk -v V=B '{if ($2) V=$2; else $2=V; print}' | tac
0 B
1 B
1.345 B
2 A
2.86 A
3 B
4 B
tac (cat backwards) prints a file in reverse. Reverse the file, fill in the missing values, and then reverse it again.
Alternatively, you can process the file in a single pass as long as you know the value to fill in at the end of the file. This should be quite a bit faster than reversing the file twice:
awk 'BEGIN {fillvalue="B"} $2 {fillvalue=$2=="A"?"B":"A"} !$2 {$2=fillvalue} 1' input.txt
Note that this assumes knowledge about the nature of that second column being only 'A' or 'B' or blank.
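If you prefer to avoid tac entirely, here is a sketch that buffers the file inside awk and fills upwards in the END block (the trailing fill value B is an assumption taken from the question's edit):
awk '{rows[NR]=$0; vals[NR]=$2}
     END {
         fill="B"
         for (i=NR; i>=1; i--) {
             if (vals[i] != "") fill=vals[i] # remember the value found below
             else rows[i]=rows[i] " " fill   # fill the empty cell upwards
         }
         for (i=1; i<=NR; i++) print rows[i]
     }' input.txt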
