Paste column to existing file in a loop - bash

I am using the paste command in a bash loop to add new columns to a CSV file. I would like to reuse the CSV file. Currently I am using a temporary file to accomplish this:
while [ $i -le $max ]
do
# create text from grib2
wgrib2 -d 1.$(($i+1)) -no_header myGribFile.grb2 -text tmptxt.txt
#paste to temporary file
paste -d, existingfile.csv tmptxt.txt > tmpcsv.csv
#overwrite old csv with new csv
mv tmpcsv.csv existingfile.csv
((i++))
done
After adding some columns the copy gets slow, because the file keeps growing (each tmptxt.txt is about 2 MB, adding up to roughly 100 MB).
A tmptxt.txt is a plain text file with one column and one value per row:
1
2
3
.
.
The existingfile.csv would then be
1,1,x
2,2,y
3,3,z
.,.,.
.,.,.
Is there any way to use the paste command to add a column to an existing file? Or is there any other way?
Thanks

Would it be feasible to split the operation in two? One step to generate all the intermediate files, and another to generate the final output file. The idea is to avoid rereading and rewriting the final file over and over.
The changes to the script would be something like this:
while [ $i -le $max ]
do
n=$(printf "%05d" $i) # to preserve lexical order if $max > 9
# create text from grib2
wgrib2 -d 1.$(($i+1)) -no_header myGribFile.grb2 -text tmptxt$n.txt
((i++))
done
#make final file
paste -d, existingfile.csv tmptxt[0-9]*.txt > tmpcsv.csv
#overwrite old csv with new csv
mv tmpcsv.csv existingfile.csv

Assuming the number of lines output by the program is constant and equal to the number of lines in existingfile.csv (which should be the case, since you are using paste).
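As a small aside, if you want to guard against that assumption being violated, a quick sanity check before the final paste could look like this (a sketch; it assumes the tmptxt*.txt files produced by the loop above):
# Abort before pasting if any intermediate file has a different number of
# lines than existingfile.csv.
expected=$(wc -l < existingfile.csv)
for f in tmptxt[0-9]*.txt; do
    lines=$(wc -l < "$f")
    if [ "$lines" -ne "$expected" ]; then
        echo "line count mismatch in $f ($lines vs $expected)" >&2
        exit 1
    fi
done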
Disclaimer: I'm not exactly sure whether this speeds things up (it depends on whether the I/O redirection >> writes to the file exactly once or not). Anyway, give it a try and let me know.
So the basic idea is
append the output in one go after the loop is done (note the change: wgrib2 now prints to -, which is stdout)
use awk to fold each successive block of linenum rows (linenum being the number of lines in existingfile.csv) into extra columns appended to the first linenum rows
save to tempcsv.csv (because I can't find a way to save into the same file in place)
rename to / overwrite existingfile.csv
while [ $i -le $max ]; do
# create text from grib2
wgrib2 -d 1.$(($i+1)) -no_header myGribFile.grb2 -text -
((i++))
done >> existingfile.csv
awk -v linenum=4 '
{ array[FNR%linenum] = (FNR<=linenum ? $0 : array[FNR%linenum] "," $0) }
END { for (i=1; i<=linenum; i++) print array[i%linenum] }
' existingfile.csv > tempcsv.csv
mv tempcsv.csv existingfile.csv
If this works the way I imagine it does (internally), you should have 2 writes to existingfile.csv instead of $max writes. So hopefully this speeds things up.
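To see what the awk fold does on a small scale, here is a tiny self-contained demo with made-up data and linenum=3 (not part of the original answer):
# Scratch demo: 3 original rows plus two appended single-column blocks of
# 3 rows each fold back into 3 rows with extra columns.
printf '%s\n' a b c 10 11 12 20 21 22 > demo.csv
awk -v linenum=3 '
{ array[FNR%linenum] = (FNR<=linenum ? $0 : array[FNR%linenum] "," $0) }
END { for (i=1; i<=linenum; i++) print array[i%linenum] }
' demo.csv
# expected output:
# a,10,20
# b,11,21
# c,12,22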

Related

Paste hundreds of files with a specific pattern name in bash/awk/c

I have 500 files and I want to merge them by adding columns.
My first file
3
4
1
5
My second file
7
1
4
2
Output should look like
3 7
4 1
1 4
5 2
But I have 500 files (sum_1.txt, sum_501.txt, up to sum_249501.txt), so I need 500 columns, and it would be very tedious to write out 500 file names.
Is there an easier way to do this? I tried this, but instead of 500 columns it produces a lot of rows:
#!/bin/bash
file_name="sum"
tmp=$(mktemp) || exit 1
touch ${file_name}_calosc.txt
for first in {1..249501..500}
do
paste -d ${file_name}_calosc.txt ${file_name}_$first.txt >> ${file_name}_calosc.txt
done
Something like this (untested) should work regardless of how many files you have:
awk '
BEGIN {
for (i=1; i<=249501; i+=500) {
ARGV[ARGC++] = "sum_" i
}
}
{ vals[FNR] = (NR==FNR ? "" : vals[FNR] OFS) $0 }
END {
for (i=1; i<=FNR; i++) {
print vals[i]
}
}
'
It'd only fail if the total content of all the files was too big to fit in memory.
Your command says to paste two files together; to paste more files, give more files as arguments to paste.
You can paste a number of files together like
paste sum_{1..249501..500}.txt > sum_calosc.txt
but if the number of files is too large for paste, or the resulting command line is too long, you may still have to resort to temporary files.
Here's an attempt to paste 25 files at a time, then combine the resulting 20 files in a final big paste.
#!/bin/bash
d=$(mktemp -d -t pastemanyXXXXXXXXXXX) || exit
# Clean up when done
trap 'rm -rf "$d"; exit' ERR EXIT
for ((i=1; i<= 249501; i+=500*25)); do
printf -v dest "paste%06i.txt" "$i"
for ((j=1, k=i; j<=25 && k<=249501; j++, k+=500)); do
printf "sum_%i.txt\n" "$k"
done |
xargs paste >"$d/$dest"
done
paste "$d"/* >sum_calosc.txt
The function of xargs is to combine its inputs into a single command line (or more than one if it would otherwise be too long; here we are specifically trying to avoid that, because we want to control exactly how many files we pass to paste).
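As a small illustration of that xargs behaviour, with made-up file names:
# xargs gathers the newline-separated names from stdin and hands them all
# to one paste invocation (or several, if the command line would get too long).
printf '%s\n' col_a.txt col_b.txt col_c.txt | xargs paste
# roughly equivalent to: paste col_a.txt col_b.txt col_c.txt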

Split large csv file into multiple files and keep header in each part

How can I split a large CSV file (1 GB) into multiple files (say one part with 1000 rows, a 2nd part with 10000 rows, a 3rd part with 100000 rows, and so on) and preserve the header in each part?
How can I turn this
h1 h2
a aa
b bb
c cc
.
.
12483720 rows
into
h1 h2
a aa
b bb
.
.
.
1000 rows
And
h1 h2
x xx
y yy
.
.
.
10000 rows
Another awk. First some test records:
$ seq 1 1234567 > file
Then the awk:
$ awk 'NR==1{n=1000;h=$0}{print > n}NR==n+c{n*=10;c=NR-1;print h>n}' file
Explained:
$ awk '
NR==1 { # first record:
n=1000 # set first output file size and
h=$0 # store the header
}
{
print > n # output to file
}
NR==n+c { # once target NR has been reached. close(n) goes here if needed
n*=10 # grow target magnitude
c=NR-1 # set the correction factor.
print h > n # first the head
}' file
Count the records:
$ wc -l 1000*
1000 1000
10000 10000
100000 100000
1000000 1000000
123571 10000000
1234571 total
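If you want to confirm that each part received the header, peeking at the first line of every output file is enough (with this seq test data the "header" is simply the record 1):
# each output file should start with the stored header line
head -n 1 1000 10000 100000 1000000 10000000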
Here is a small adaptation of the solution from: Split CSV files into smaller files but keeping the headers?
awk -v l=1000 '(NR==1){header=$0;next}
(n==l || NR==2) {
c=sprintf("%0.5d",c+1);
close(file); file=FILENAME; sub(/csv$/,c".csv",file)
print header > file
n=0; if (NR>2) l*=10
}
{print $0 > file; n++}' file.csv
This works in the following way:
(NR==1){header=$0;next}: If the record/line is the first line, save that line as the header.
(n==l || NR==2){...}: Every time we have written the requested number of records/lines, or we are on the very first data line (NR==2) and no output file is open yet, we start writing to a new file and perform the following actions:
c=sprintf("%0.5d",c+1): increase the counter by one, and format it as 000xx
close(file): close the file you just wrote to.
file=FILENAME; sub(/csv$/,c".csv",file): define the new filename
print header > file: open the file and write the header to that file.
n=0: reset the current record count
if (NR>2) l*=10: increase the maximum record count for the next file (skipped when opening the very first output file)
{print $0 > file; n++}: write the entries to the file and increment the record count
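A quick way to verify the result might be to print the first line of the original file and of every part; they should all show the same header (a sketch, assuming the input was file.csv as above, so the parts are named file.00001.csv, file.00002.csv, and so on):
head -n 1 file.csv file.0*.csv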
Hacky, but it utilizes the split utility, which does most of the heavy lifting of splitting the files. Then, with the split files following a well-defined naming convention, I loop over the files without the header, concatenate the header with each file body into tmp.txt, and move that file back to the original filename.
# Use `split` utility to split the file csv, with 5000 lines per files,
# adding numerical suffixs, and adding additional suffix '.split' to help id
# files.
split -l 5000 -d --additional-suffix=.split repro-driver-table.csv
# This identifies all files that should NOT have headers
# ls -1 *.split | egrep -v -e 'x0+\.split'
# This identifies files that do have headers
# ls -1 *.split | egrep -e 'x0+\.split'
# Walk the files that do not have headers. For each one, cat the header from
# file with header, with rest of body, output to tmp.txt, then mv tmp.txt to
# original filename.
for f in $(ls -1 *.split | egrep -v -e 'x0+\.split'); do
cat <(head -1 $(ls -1 *.split | egrep -e 'x0+\.split')) $f > tmp.txt
mv tmp.txt $f
done
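As a quick sanity check afterwards (with -l 5000, every chunk except the first should now have 5001 lines, i.e. 5000 data rows plus the prepended header, while x00.split keeps its original 5000 lines, header included):
wc -l *.split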
Here's a first approach:
#!/bin/bash
head -1 "$1" > header
tail -n +2 "$1" | split - y
for f in y*; do
cp header "h$f"
cat "$f" >> "h$f"
done
rm -f header
rm -f y*
The following bash solution should work nicely:
IFS='' read -r header
for ((curr_file_max_rows=1000; 1; curr_file_max_rows*=10)) {
curr_file_name="file_with_${curr_file_max_rows}_rows"
echo "$header" > "$curr_file_name"
for ((curr_file_row_count=0; curr_file_row_count < curr_file_max_rows; curr_file_row_count++)) {
IFS='' read -r row || break 2
echo "$row" >> "$curr_file_name"
}
}
We have a first iteration level which produces the number of rows we're going to write for each successive file. It generates the file names and write the header to them. It is an infinite loop because we don't check how many lines the input has and therefore don't know beforehand how many files we're going to write to, so we'll have to break out of this loop to end it.
Inside this loop we iterate a second time, this time over the number of lines we're going to write to the current file. In this loop we try to read a line from the input. If that works, we write the line to the current output file; if it doesn't (we've reached the end of the input), we break out of both levels of the loop.
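A run could look something like this (a sketch; split_growing.sh is a hypothetical name for a file holding the loop above, and file.csv stands for the big input):
# The script reads the CSV on standard input and writes
# file_with_1000_rows, file_with_10000_rows, ... into the current directory.
bash split_growing.sh < file.csv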

Redirect data from several files into one master file [duplicate]

This question already has answers here:
How to paste columns from separate files using bash?
(4 answers)
Closed 3 years ago.
I have multiple data files and I want to redirect some information from these files to another master file.
First I create the column headers in the master file. Then I attempt to transfer data from other files to the master file under the correct columns.
Create column headers in master file:
awk '
BEGIN {OFS=" "; print "%eval_id", "SF1", "power"}
' > output.dat
First column in master file is for loop index (1, 2, 3 ...):
for i in {1..2}; do
echo "$i" >> output.dat
done
Second column in master file, SF1, (extract data from sf1.dat which is a single column file)
Third column in master file, power, (extract data from power.dat which is also a single column file)
Outcome in 3 column format:
%eval_id SF1 power
1 23 300
2 45 650
Simplest Way
#!/bin/bash
awk 'BEGIN {OFS=" "; print "%eval_id", "SF1", "power"}' > output.dat
index=0
while IFS= read -r sf1 && IFS= read -r power <&3; do
index=$((index + 1))
printf '%s\t%s\t%s\n' "$index" "$sf1" "$power" >> output.dat
done <sf1.dat 3<power.dat
Explanation: Read your files inside your loop and perform your operations line by line.
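Since the question was closed as a duplicate of a paste question, the same result can also be produced without a shell loop; a sketch using the same file names:
# Generate the index column on the fly with seq and glue the three columns
# together with paste, after printing the header.
{
    echo "%eval_id SF1 power"
    paste -d ' ' <(seq "$(wc -l < sf1.dat)") sf1.dat power.dat
} > output.dat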

While loop computed hash compare in bash?

I am trying to write a script to count the number of zero fill sectors for a dd image file. This is what I have so far, but it is throwing an error saying it cannot open file #hashvalue#. Is there a better way to do this or what am I missing? Thanks in advance.
count=1
zfcount=0
while read Stuff; do
count+=1
if [ $Stuff == "bf619eac0cdf3f68d496ea9344137e8b" ]; then
zfcount+=1
fi
echo $Stuff
done < "$(dd if=test.dd bs=512 2> /dev/null | md5sum | cut -d ' ' -f 1)"
echo "Total Sector Count Is: $count"
echo "Zero Fill Sector Count is: $zfcount"
Doing this in bash is going to be extremely slow -- on the order of 20 minutes for a 1GB file.
Use another language, like Python, which can do this in a few seconds (if storage can keep up):
python3 -c '
import sys
total = 0
zero = 0
with open(sys.argv[1], "rb") as f:
    while True:
        a = f.read(512)
        if not a:
            break
        total += 1
        if all(b == 0 for b in a):
            zero += 1
print("Total sectors: " + str(total))
print("Zeroed sectors: " + str(zero))
' yourfilehere
Your error message comes from this line:
done < "$(dd if=test.dd bs=512 2> /dev/null | md5sum | cut -d ' ' -f 1)"
What that does is read your entire test.dd, calculate the md5sum of that data, and parse out just the hash value; then, because it is wrapped in $( ... ), that hash value is substituted in place, so the line essentially ends up acting like this:
done < e6e8c42ec6d41563fc28e50080b73025
(except, of course, with a different hash). So your shell attempts to read from a file whose name is the hash of your test.dd image, can't find such a file, and complains.
Also, it appears that you are under the assumption that dd if=test.dd bs=512 ... will feed you 512-byte blocks one at a time to iterate over. This is not the case: dd reads the file in bs-sized blocks and writes it in blocks of the same size, but it does not insert a separator or synchronize in any way with whatever is on the other side of its pipeline.
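For completeness, a version that really does hash one 512-byte sector at a time could look like the sketch below. It will still be very slow, as the first answer warns; it assumes GNU stat for the file size and reuses the zero-sector hash from the question:
#!/bin/bash
zerohash=bf619eac0cdf3f68d496ea9344137e8b   # hash the question uses for a 512-byte zero sector
count=0
zfcount=0
sectors=$(( $(stat -c %s test.dd) / 512 ))
for ((s=0; s<sectors; s++)); do
    # read exactly one sector and hash it
    hash=$(dd if=test.dd bs=512 skip="$s" count=1 2>/dev/null | md5sum | cut -d ' ' -f 1)
    count=$((count + 1))
    if [ "$hash" = "$zerohash" ]; then
        zfcount=$((zfcount + 1))
    fi
done
echo "Total Sector Count Is: $count"
echo "Zero Fill Sector Count is: $zfcount"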

Getting specific lines of a file

I have a file with 25 million rows, and I want to extract 10 million specific lines from it.
I have the indices of these lines in another file. How can I do it efficiently?
Assuming that the list of lines is in a file list-of-lines and the data is in data-file, and that the numbers in list-of-lines are in ascending order, then you could write:
current=0
while read wanted
do
while ((current < wanted))
do
if read -u 3 line
then ((current++))
else break 2
fi
done
echo "$line"
done < list-of-lines 3< data-file
This uses the Bash extension that allows you to specify which file descriptor read should read from (read -u 3 to read from file descriptor 3). The list of line numbers to be printed is read from standard input; the data file is read from file descriptor 3. This makes one pass through each of the two files, which is within a constant factor of optimal.
If the list-of-lines is not sorted, replace the last line with the following, which uses the Bash extension called process substitution:
done < <(sort -n list-of-lines) 3< data-file
Assume that the file containing line indices is called "no.txt" and the data file is "input.txt".
awk '{printf "%08d\n", $1}' no.txt > no.1.txt
nl -n rz -w 8 input.txt | join - no.1.txt | cut -d " " -f1 --complement > output.txt
The output.txt will have the wanted lines. I am not sure whether this is efficient enough, but it seems to be faster than this script (https://stackoverflow.com/a/22926494/3264368) in my environment.
Some explanations:
The 1st command preprocesses the indices file so that the numbers are right-adjusted with leading zeroes and a width of 8 (since the number of rows in input.txt is known to be 25M).
The 2nd command prints the rows with line numbers in exactly the same format as the preprocessed index file, then joins the two to get the wanted rows (cut removes the line numbers).
Since you said the file with the line numbers you're looking for is sorted, you can loop through the two files in awk:
awk 'BEGIN{getline nl < "line_numbers.txt"} NR == nl {print; getline nl < "line_numbers.txt"}' big_file.txt
This will read each line in each file precisely once.
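A tiny self-contained check of that behaviour (made-up data, not from the original answer):
# Five data lines, pick lines 2 and 4.
seq 10 10 50 > big_file.txt
printf '2\n4\n' > line_numbers.txt
awk 'BEGIN{getline nl < "line_numbers.txt"} NR == nl {print; getline nl < "line_numbers.txt"}' big_file.txt
# prints 20 and 40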
If your index file is index.txt and your data file is data.txt, then you can do it using sed as follows:
#!/bin/bash
while read line_no
do
sed "${line_no}q;d" data.txt
done < index.txt
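If the loop turns out to be too slow (it rescans data.txt once per index), one possible single-pass variant builds a sed script from the index file; a sketch with the same hypothetical file names:
# Turn every index N into the sed command "Np", then print all wanted
# lines in one pass over data.txt.
sed 's/$/p/' index.txt > print_lines.sed
sed -n -f print_lines.sed data.txt > output.txt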
You could run a loop that reads the 25-million-line file and, when the loop counter reaches a line number that you want, writes that line out. For example:
String line = "";
int count = 0;
while ((line = br.readLine()) != null)
{
    count++;
    if (count == indice)
    {
        System.out.println(line); // or write to a file
    }
}
