Using bash scripting to extract data to various files
awk '{for(i=1; i<=10; i++){if($1== 2**($i)){getline; print}}}' test.csv>> test/test_$i.csv
Description: I want to extract data to multiple files where column 1 of the input file holds sizes that are powers of 2. I want to extract the rows having the same size into a separate file for each size.
input file:
4 10.06 9.64 10.36 1000
8 10.16 9.79 10.48 1000
16 10.49 10.02 10.86 1000
32 10.54 10.13 10.91 1000
4 10.76 9.64 10.36 1000
8 10.90 9.79 10.48 1000
awk 'log($1)/log(2) == int(log($1)/log(2)) { out="pow-" $1; print >out }' file.in
This will, for the given data, create the files pow-N for N equal to 4, 8, 16 and 32.
It will skip lines whose first column is not a power of 2.
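If you're worried about floating-point rounding in the log()-based test, a bitwise check does the same job. This is only a sketch and relies on GNU awk's and() function, which plain POSIX awk does not have:
gawk '$1 > 0 && and($1, $1-1) == 0 { print > ("pow-" $1) }' file.in
A positive integer is a power of 2 exactly when ANDing it with itself minus one gives zero.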
Thanks for your help.
I figured out a possible solution:
for i in `seq 0 $numline`
do
    if [ -e "$InDir/$file" ]
    then
        # append rows whose 2nd column equals 2^i to a per-size output file
        awk -v itr="$i" '{if ($2 == 2**(itr)) {print $0}}' "$file" >> "$OutDir/$(awk "BEGIN{print (2 ** $i)}")mb_$file"
    fi
done
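For what it's worth, that loop re-reads $file once per pass; the same result can be produced in a single pass (a rough sketch reusing the $OutDir and $file names from above, and assuming column 2 holds the size as in the loop):
awk -v outdir="$OutDir" -v fname="$file" \
    'log($2)/log(2) == int(log($2)/log(2)) { print > (outdir "/" $2 "mb_" fname) }' "$file"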
Related
I have hundreds of thousands of files with several hundreds of thousands of lines in each of them.
2022-09-19/SALES_1.csv:CUST1,US,2022-09-19,43.31,17.56,47.1,154.48,154. 114
2022-09-20/SALES_2.csv:CUST2,NA,2022-09-20,12.4,16.08,48.08,18.9,15.9,3517
The lines may have a different number of fields. No matter how many fields are present, I want to extract just the last 7 fields.
I'm trying with cut & awk but have only been able to print a range of fields, not the last 'n' fields.
Could I please request guidance?
$ rev file | cut -d, -f1-7 | rev
will give the last 7 fields regardless of varying number of fields in each record.
Using any POSIX awk:
$ awk -F',' 'NF>7{sub("([^,]*,){"NF-7"}","")} 1' file
US,2022-09-19,43.31,17.56,47.1,154.48,154. 114
2022-09-20,12.4,16.08,48.08,18.9,15.9,3517
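An explicit loop does the same thing and may be easier to follow; this sketch assumes every line has at least 7 fields:
awk -F',' -v OFS=',' '{ out = $(NF-6); for (i = NF-5; i <= NF; i++) out = out OFS $i; print out }' file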
{m,g}awk ' BEGIN { _+=(_+=_^= FS = OFS = ",")+_
                   ___= "^[^"(__= "\5") ("]*")__

} NF<=_ || ($(NF-_) = __$(NF-_))^(sub(___,"")*!_)'
US,2022-09-19,43.31,17.56,47.1,154.48,154. 114
2022-09-20,12.4,16.08,48.08,18.9,15.9,3517
In pure Bash, without any external processes and/or pipes:
(IFS=,; while read -ra line; do printf '%s\n' "${line[*]: -7}"; done;) < file
Prints the last 7 fields:
sed -E 's/.*,((.*,){6}.*)/\1/' file
My operating system is Windows 10, and I use Bash on Windows for executing Linux commands. I have a file with 96 lines, plus multiple files that each cover three lines of that file, and I want to write the mean and standard deviation of each three-line group into a single file, line by line.
Single file
1 31.31
2 32.24
3 32.11
4 20.97
5 20.93
6 20.91
7 22.58
8 22.46
9 22.52
10 20.71
11 20.25
12 20.51
File 1
1 31.31
2 32.24
3 32.11
File 2
4 20.97
5 20.93
6 20.91
File 3
7 22.58
8 22.46
9 22.52
First of all, I tried to split the file into multiple files (in verbose mode) with
grep -i 'Sample' Sample3.txt | awk '{print $5, $6}' | sed 's/\,/\./g' >> Sample4.txt | split -l3 Sample4.txt --verbose
Can tcsh commands like foreach, together with awk, be used in Bash scripting? Can we do this on the single text file, or do we have to split that single file into multiple files?
for example output can be:
output.txt
mean stand.D.
31.88667 0.50362 ----- first three rows mean and sd
20.93667 0.030 ----- second three rows mean and sd
22.52 0.06 ----- third three rows mean and sd
etc etc etc
How about using this awk script?
BEGIN {
    avg = 0; j = 0
    fname = "file_output.txt"
    printf "mean\t stand.D\n" > fname
}
{
    avg = avg + $2          # accumulate the sum of the 2nd column
    values[j] = $2          # remember the value for the deviation pass
    j = j + 1
    if (NR % 3 == 0) {      # every third line: emit mean and SD of the group
        printf "%f\t", avg/3 > fname
        sd = 0
        for (k = 0; k < 3; k++) {
            sd = sd + (values[k] - avg/3) * (values[k] - avg/3)
        }
        printf "%f\n", sqrt(sd/3) > fname
        avg = 0; j = 0
    }
}
Output:
mean stand.D
31.8867 0.411204
20.9367 0.0249444
22.52 0.0489898
20.49 0.188326
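Note that sqrt(sd/3) is the population standard deviation. The expected values in the question (e.g. 0.50362 for the first group) look like the sample standard deviation, which divides by n-1 instead; if that is what's wanted, change the last printf to:
printf "%f\n", sqrt(sd/2) > fname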
"Bash script" (foo.sh):
#!/bin/bash
# data.txt is the single input file shown above; the awk program is saved as script.awk
awk -F " " -f script.awk data.txt
I have two files of equal length (i.e. no. of lines):
text.en
text.cs
I want to split the files into 12 parts and, as I iterate, incrementally add one more of the first ten parts to the training set.
Let's say the files contain 100 lines; I need some sort of loop that does:
#!/bin/bash
F1=text.en
F2=text.cs

for i in `seq 0 9`;
do
    split -n l/12 -d text.en
    cat x10 > dev.en
    cat x11 > test.en
    echo "" > train.en
    for j in `seq 0 $i`; do
        cat x0$j >> train.en
    done

    split -n l/12 -d text.cs
    cat x10 > dev.cs
    cat x11 > test.cs
    echo "" > train.cs
    for j in `seq 0 $i`; do
        cat x0$j >> train.cs
    done

    wc -l train.en train.cs
    echo "############"
done
[out]:
55632 train.en
55468 train.cs
111100 total
############
110703 train.en
110632 train.cs
221335 total
############
165795 train.en
165011 train.cs
330806 total
############
It's giving me unequal chunks between the files.
Also, when I use split, it's splitting into unequal chunks:
alvas@ubi:~/workspace/cvmt$ split -n l/12 -d text.en
alvas@ubi:~/workspace/cvmt$ wc -l x*
55631 x00
55071 x01
55092 x02
54350 x03
54570 x04
54114 x05
55061 x06
53432 x07
52685 x08
52443 x09
52074 x10
52082 x11
646605 total
I don't know the no. of lines of the file before hand, so I can't use the split -l option.
How do I split a file into equal size by no. of lines given that I don't know how many lines are there in the files beforehand? Should I do some sort of pre-calculation with wc -l?
How do I ensure that the split across two files are of equal size in for every chunk?
(Note that the solution needs to split the file at the end of the lines, i.e. don't split up any lines, just split the file by line).
It's not entirely clear what you're trying to achieve, but here are a few pointers:
split -n l/12 splits into 12 chunks of roughly equal byte size, not number of lines.
split -n r/12 will try to distribute the line count evenly, but if the chunk size is not a divisor of the total line count, you'll still get (slightly) varying line counts: the extra lines are distributed round-robin style.
E.g., with 100 input lines and 12 chunks, you'll get line counts of 9, 9, 9, 9, 8, 8, 8, 8, 8, 8, 8, 8: 100 / 12 = 8 (integer division), and 100 % 12 = 4, so all files get at least 8 lines, with the extra 4 lines distributed among the first 4 output files.
So, yes, if you want a fixed line count for all files (except for the last, if the chunk size is not a divisor), you must calculate the total line count up front, perform integer division to get the fixed line count, and use split -l with that count:
totalLines=$(wc -l < text.en)
linesPerFile=$(( totalLines / 12 ))
split -l "$linesPerFile" text.en # with 100 lines, yields 12 files with 8 lines and 1 with the remaining 4
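Since both of your files have the same number of lines, you can compute that count once and feed it to both split calls so the chunks stay aligned across the two files (a sketch; the en. / cs. prefixes are made up for illustration):
linesPerFile=$(( $(wc -l < text.en) / 12 ))
split -l "$linesPerFile" -d text.en en.
split -l "$linesPerFile" -d text.cs cs.
wc -l en.* cs.*   # matching en.NN / cs.NN pieces now have identical line counts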
Additional observations:
With a small, fixed iteration count, it is easier and more efficient to use brace expansion (e.g., for i in {0..9} rather than for i in `seq 0 9`).
If a variable must be used, or with larger numbers, use an arithmetic expression:
n=9; for (( i = 0; i <= $n; i++ )); do ...; done
While you cannot do cat x0{0..$i} directly (because Bash doesn't support variables in brace expansions), you can emulate it by combining seq -f and xargs:
You can replace
echo "" > train.en
for j in `seq 0 $i`; do
cat x0$j >> train.en
done
with the following:
seq -f 'x%02.f' 0 "$i" | xargs cat > train.en
Since you control the value of $i, you could even simplify to:
eval "cat x0{0..$i}" > train.en # !! Only do this if you trust $i to contain a number.
I have a tab-delimited file and I want to add a certain decimal number (1.5) to a running total each time the first column contains a number instead of a character, and print the result for every row from the first to the last.
I have example file which look like this:
It has 8 rows
1st-column 2nd-Column
a ship
1 name
b school
c book
2 blah
e blah
3 ...
9 ...
Now I want my script to read line by line and, whenever it finds a number in the first column, add 1.5 to the sum, giving me output like this:
0
1.5
1.5
1.5
3
3
4.5
6
my script is:
#!/bin/bash
for c in {1..8}
do
awk 'NR==$c { if (/^[0-9]/) sum+=1.5} END {print sum }' file
done
but I don't get any output!
Thanks for your help in advance.
The last item in your expected output appears to be incorrect. If it is, then you can do:
$ awk '$1~/^[[:digit:]]+$/{sum+=1.5}{print sum+0}' file
0
1.5
1.5
1.5
3
3
4.5
6
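The same one-liner spelled out with comments (identical logic, just easier to read):
awk '
$1 ~ /^[[:digit:]]+$/ { sum += 1.5 }  # first column is all digits: grow the running total
                      { print sum+0 } # sum+0 prints 0 rather than an empty string before the first match
' file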
use warnings;
use strict;
my $sum = 0;
while (<DATA>) {
    my $data = (split)[0]; # 1st column
    $sum += 1.5 if $data =~ /^\d+$/;
    print "$sum\n";
}
__DATA__
a ship
1 name
b school
c book
2 blah
e blah
3 ...
6 ...
Why not just use awk:
awk '{if (/^[0-9]+[[:blank:]]/) sum+=1.5} {print sum+0 }' file
Edited to simplify based on jaypal's answer, bound the number and work with tabs and spaces.
How about
perl -lane 'next unless $F[0]=~/^\d+$/; $c+=1.5; END{print $c}' file
Or
awk '$1~/^[0-9]+$/{c+=1.5}END{print c}' file
These only produce the final sum as your script would have done. If you want to show the numbers as they grow use:
perl -lane 'BEGIN{$c=0}$c+=1.5 if $F[0]=~/^\d+$/; print "$c"' file
Or
awk 'BEGIN{c=0}{if($1~/^[0-9]+$/){c+=1.5}{print c}}' file
I'm not sure whether you're multiplying the first field by 1.5, or adding 1.5 to a sum every time there's any number in $1 and ignoring the contents of the line otherwise. Here are both in awk, using your sample data as the contents of "file".
$ awk '$1~/^[0-9]+$/{val=$1*1.5}{print val+0}' file
0
1.5
1.5
1.5
3
3
4.5
9
$ awk '$1~/^[0-9]+$/{sum+=1.5}{print sum+0}' file
0
1.5
1.5
1.5
3
3
4.5
6
Or, here you go in ksh, assuming the data is on STDIN. (Note that bash's built-in arithmetic is integer-only, so a plain bash port would need an external tool such as awk or bc for the floating-point addition.)
#!/usr/bin/ksh
sum=0
while read a b
do
    [[ "$a" == +([0-9]) ]] && (( sum += 1.5 ))
    print $sum
done
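If you'd rather stay in bash, remember that its arithmetic is integer-only, so the floating-point addition has to be handed off to something else. A sketch that uses awk for just that step (the input file name file is assumed):
#!/usr/bin/env bash
sum=0
while read -r a b; do
    if [[ $a =~ ^[0-9]+$ ]]; then
        # bash cannot add 1.5 itself, so let awk do the one floating-point operation
        sum=$(awk -v s="$sum" 'BEGIN { printf "%g", s + 1.5 }')
    fi
    echo "$sum"
done < file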
This question already has answers here:
How can I split a large text file into smaller files with an equal number of lines?
(12 answers)
Closed 5 years ago.
I was wondering if it is possible to split a file into equal parts (edit: all equal except possibly the last), without breaking any lines? Using the split command in Unix, lines may be broken in half. Is there a way to, say, split a file into 5 equal parts, but have each part still consist only of whole lines (it's no problem if one of the files is a little larger or smaller)? I know I could just calculate the number of lines, but I have to do this for a lot of files in a bash script. Many thanks!
If you mean an equal number of lines, split has an option for this:
split --lines=75
If you need to know what that 75 should really be for N equal parts, it's:
lines_per_part = int((total_lines + N - 1) / N)
where total lines can be obtained with wc -l.
See the following script for an example:
#!/usr/bin/bash
# Configuration stuff
fspec=qq.c
num_files=6
# Work out lines per file.
total_lines=$(wc -l <${fspec})
((lines_per_file = (total_lines + num_files - 1) / num_files))
# Split the actual file, maintaining lines.
split --lines=${lines_per_file} ${fspec} xyzzy.
# Debug information
echo "Total lines = ${total_lines}"
echo "Lines per file = ${lines_per_file}"
wc -l xyzzy.*
This outputs:
Total lines = 70
Lines per file = 12
12 xyzzy.aa
12 xyzzy.ab
12 xyzzy.ac
12 xyzzy.ad
12 xyzzy.ae
10 xyzzy.af
70 total
More recent versions of split allow you to specify a number of CHUNKS with the -n/--number option. You can therefore use something like:
split --number=l/6 ${fspec} xyzzy.
(that's ell-slash-six, meaning lines, not one-slash-six).
That will give you roughly equal files in terms of size, with no mid-line splits.
I mention that last point because it doesn't give you roughly the same number of lines in each file, more the same number of characters.
So, if you have one 20-character line and 19 1-character lines (twenty lines in total) and split to five files, you most likely won't get four lines in every file.
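You can see that effect with a tiny made-up example (hypothetical file names; the exact distribution depends on your split version):
# one 20-character line followed by 19 one-character lines
{ printf '%020d\n' 0; for i in $(seq 19); do echo x; done; } > demo.txt
split -n l/5 demo.txt part.
wc -l part.*   # the line counts differ, because l/5 balances bytes, not lines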
The script isn't even necessary, split(1) supports the wanted feature out of the box:
split -l 75 auth.log auth.log.
The above command splits the file in chunks of 75 lines a piece, and outputs file on the form: auth.log.aa, auth.log.ab, ...
wc -l on the original file and output gives:
321 auth.log
75 auth.log.aa
75 auth.log.ab
75 auth.log.ac
75 auth.log.ad
21 auth.log.ae
642 total
A simple solution for a simple question:
split -n l/5 your_file.txt
no need for scripting here.
From the man page, CHUNKS may be:
l/N split into N files without splitting lines
Update
Not all Unix distributions include this flag. For example, it will not work on OS X. To use it there, you can consider replacing the Mac OS X utilities with the GNU core utilities.
split was updated in coreutils release 8.8 (announced 22 Dec 2010) with the --number option to generate a specific number of files. The option --number=l/n generates n files without splitting lines.
coreutils manual
I made a bash script that, given a number of parts as input, splits a file:
#!/bin/sh
parts_total="$2"
input="$1"
parts=$((parts_total))

for i in $(seq 0 $((parts_total-2))); do
    lines=$(wc -l "$input" | cut -f 1 -d" ")
    # n is rounded: 1.3 to 2, 1.6 to 2, 1 to 1
    n=$(awk -v lines=$lines -v parts=$parts 'BEGIN {
        n = lines/parts;
        rounded = sprintf("%.0f", n);
        if (n > rounded) {
            print rounded + 1;
        } else {
            print rounded;
        }
    }')
    head -$n "$input" > split${i}
    tail -$((lines-n)) "$input" > .tmp${i}
    input=".tmp${i}"
    parts=$((parts-1))
done
mv .tmp$((parts_total-2)) split$((parts_total-1))
rm .tmp*
I used the head and tail commands, storing the remainders in temporary files, to split the file.
#10 means 10 parts
sh mysplitXparts.sh input_file 10
Or with awk, where perc=0.1 is 10% => 10 parts, and 0.334 is roughly 3 parts:
awk -v size=$(wc -l < input) -v perc=0.1 '{
    nfile = int(NR/(size*perc));
    if (nfile >= 1/perc) {
        nfile--;
    }
    print > "split_"nfile
}' input
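If you'd rather give the number of parts directly instead of a percentage, the same idea can be driven by a part count (a small variant of the script above; parts=10 is just an example value):
parts=10
awk -v size=$(wc -l < input) -v parts="$parts" '{
    nfile = int(NR / (size / parts));
    if (nfile >= parts) {
        nfile--;    # keep the trailing remainder lines in the last part
    }
    print > ("split_" nfile)
}' input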
var dict = File.ReadLines("test.txt")
    .Where(line => !string.IsNullOrWhiteSpace(line))
    .Select(line => line.Split(new char[] { '=' }, 2, StringSplitOptions.None))
    .ToDictionary(parts => parts[0], parts => parts[1]);
or
line="to=xxx#gmail.com=yyy#yahoo.co.in";
string[] tokens = line.Split(new char[] { '=' }, 2, 0);
ans:
tokens[0]=to
token[1]=xxx#gmail.com=yyy#yahoo.co.in"