How to process and save data in chunks using awk? - bash

Let's say I'm opening a large (several GB) file that I cannot read in entirely at once.
If it's a csv file, in pandas we would use:
for chunk in pd.read_csv('path/filename', chunksize=10**7):
    # save chunk to disk
Or we could do something similar by reading the file line by line:
import pandas as pd
with open(fn) as file:
    for line in file:
        # save line to disk, e.g. df = pd.concat([df, line_data]), then save the df
How does one "chunk" data with an awk script? Awk will parse/process text into whatever format you desire, but I don't know how to "chunk" with it. One can write a script script1.awk and then process the data, but that processes the entire file at once.
Related question, with a more concrete example: How to preprocess and load a "big data" tsv file into a python dataframe?

awk reads a single record (chunk) at a time by design. By default a record is a line of data, but you can specify a record using the RS (record separator) variable. Each code block is conditionally executed on the current record before the next is read:
$ awk '/pattern/{print "MATCHED", $0 > "output"}' file
The above script will read a line at a time from the input file and, if that line matches pattern, it will save the line to the file output, prepended with MATCHED, before reading the next line.
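If by "chunking" you mean writing the input back out in pieces of a fixed number of records, one rough sketch (the chunk size, output names, and input file name below are placeholders, not anything from the question) is to switch output files every N records:
# start a new output file every 10 million lines; close the previous one to avoid running out of file descriptors
awk 'NR % 10000000 == 1 { if (out) close(out); out = sprintf("chunk_%03d.csv", ++n) } { print > out }' bigfile.csv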

Related

awk command does not halt on windows for merging large csv files

I am executing the following awk command on Windows 10.
awk "(NR == 1) || (FNR > 1)" *.csv > bigMergeFile.csv
I want to merge all csv files into a single file named bigMergeFile.csv using only the header of the first file.
I successfully tested the code on small files (4 files, each containing 5 columns and 4 rows). However, the code does not halt when I run it on large files (10 files, each with 8k rows and 32k columns, approximate size 1 GB). It only stops when the hard drive runs out of space, at which point the resultant output file bigMergeFile.csv is 30 GB. The combined size of all input csv files is 9.5 GB.
I have tested the code on Mac OS and it works fine. Help will be appreciated.
My guess: bigMergeFile.csv ends in .csv so it's one of the input files your script is running on and it's growing as your script appends to it. It's like you wrote a loop like:
while ! end-of-file do
    read line from start of file
    write line to end of file
done
Also, since you're basically doing a concatenation, not a merge, set FS="^$" so awk won't waste time attempting to split fields you don't need anyway.
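A minimal sketch of the combined fix, assuming a bash-like shell (cmd.exe quoting differs) and that writing the merged file one directory up is acceptable, so it never matches the *.csv glob:
# the output lives outside the glob, so awk never reads its own growing output
awk 'BEGIN{FS="^$"} (NR == 1) || (FNR > 1)' *.csv > ../bigMergeFile.csv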

How to use file chunks based on characters instead of lines for grep?

I am trying to parse log files of the form below:
---
metadata1=2
data1=2,data3=5
END
---
metadata2=1
metadata1=4
data9=2,data3=2, data0=4
END
Each section between the --- and END is an entry. I want to select the entire entry that contains a field such as data1. I was able to solve it with the following command, but it is painfully slow.
pcregrep -M '(?s)[\-].*data1.*END' temp.txt
What am I doing wrong here?
Parsing this file with pcregrep might be challenging. pcregrep does not have the ability to break the file into logical records, so the pattern as specified will try to find matches by combining multiple records together, sometimes including unmatched records in the output.
For example, if the input is "--- data=a END --- data1=a END", the above command will select both records, as it will form a match between the initial '---' and the trailing 'END'.
For this kind of input, consider using AWK. It has the ability to read input with a custom record separator (RS), which makes it easy to split the input into records and apply the pattern. If you prefer, you can use Perl or Python.
Using awk's RS to create "records", it is possible to apply the pattern test to every record:
awk -v RS='END\n' '/data1/ { print $0 }' < log1
awk -v RS='END\n' '/data1/ { print NR, $0 }' < log1
The second command includes the record number in the output, if that is useful.
While AWK is not as fast as pcregrep, in this case it will not have trouble processing a large input set.
I would use awk:
awk 'BEGIN{RS=ORS="END\n"}/\ydata1/' file
Explanation:
awk works based on input records. By default such a record is a line of input, but this behaviour can be changed by setting the record separator (and output record separator for the output).
By setting them to END\n, we can search whole records of your input.
The regular expression /\ydata1/ searches those records for the presence of the term data1; the \y matches a word boundary, to prevent it from matching metadata1.
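Note that \y is a GNU awk extension (and a multi-character RS is itself an extension beyond POSIX awk); as an assumption, a rough POSIX-ERE stand-in for the word boundary would be:
awk -v RS='END\n' -v ORS='END\n' '/(^|[^[:alnum:]_])data1([^[:alnum:]_]|$)/' file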

Create CSV from specific columns in another CSV using shell scripting

I have a CSV file with several thousand lines, and I need to take some of the columns in that file to create another CSV file to use for import to a database.
I'm not in shape with shell scripting anymore; can anyone point me in the right direction?
I have a bash script to read the source file but when I try to print the columns I want to a new file it just doesn't work.
while IFS=, read symbol tr_ven tr_date sec_type sec_name name
do
    echo "$name,$name,$symbol" >> output.csv
done < test.csv
Above is the code I have. Out of the 6 columns in the original file, I want to build a CSV with "column6, column6, column1".
The test CSV file is like this:
Symbol,Trading Venue,Trading Date,Security Type,Security Name,Company Name
AAAIF,Grey Market,22/01/2015,Fund,,Alternative Investment Trust
AAALF,Grey Market,22/01/2015,Ordinary Shares,,Aareal Bank AG
AAARF,Grey Market,22/01/2015,Ordinary Shares,,Aluar Aluminio Argentino S.A.I.C.
What am I doing wrong with my script? Or, is there an easier - and faster - way of doing this?
Edit
These are the real headers:
Symbol,US Trading Venue,Trading Date,OTC Tier,Caveat Emptor,Security Type,Security Class,Security Name,REG_SHO,Rule_3210,Country of Domicile,Company Name
I'm trying to get the last column, which is number 12, but it always comes up empty.
The snippet looks and works fine to me; maybe you have some weird characters in the file, or it is coming from a DOS environment (use dos2unix to "clean" it!). Also, you can make use of read -r to prevent strange behaviour with backslashes.
But let's see how awk can solve this even faster:
awk 'BEGIN{FS=OFS=","} {print $6,$6,$1}' test.csv >> output.csv
Explanation
BEGIN{FS=OFS=","} this sets the input and output field separators to the comma. Alternatively, you can say -F=",", -F, or pass it as a variable with -v FS=",". The same applies for OFS.
{print $6,$6,$1} prints the 6th field twice and then the 1st one. Note that using print, every comma-separated parameter that you give will be printed with the OFS that was previously set. Here, with a comma.
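Regarding the edit with the 12-column headers: a reasonable guess is that the file has DOS line endings, so the last field carries a trailing carriage return and appears empty. A hedged variation of the same one-liner (field 12 being Company Name is taken from the edited headers) that strips the CR first:
awk 'BEGIN{FS=OFS=","} {sub(/\r$/, ""); print $12, $12, $1}' test.csv > output.csv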

Splitting a file non-equally in bash

I have a file in csv format. I know the positions where I want to chip off a chunk from the file and write it as a new csv file.
The split command splits a file into equal-sized chunks. I wonder if there is an efficient way (the file is huge) to split the file into chunks of different sizes?
I assume you want to split the file at a newline character. If this is the case you can use the head and tail commands to grab a number of lines from the beginning and from the end of your file, respectively.
If you want to copy a range of lines from within the file you can use sed, e.g.
sed -e 1,Nd -e Mq file
where N should be replaced with the line number of the line preceding the first line to display and M should be the line number of the last line to display.
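As a concrete (made-up) example, extracting lines 101 through 200 of big.csv as one chunk, with head and tail handling the first chunk and the remainder:
# lines 101-200: delete lines 1-100, quit after printing line 200
sed -e '1,100d' -e '200q' big.csv > chunk2.csv
# first 100 lines, and everything from line 101 onwards
head -n 100 big.csv > chunk1.csv
tail -n +101 big.csv > rest.csv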

method for merging two files, opinion needed

Problem: I have two folders (one is the Delta folder, where the files get updated, and the other is the Original folder, where the original files exist). Every time a file updates in the Delta folder I need to merge the file from the Original folder with the updated file from the Delta folder.
Note: Though the file names in the Delta folder and Original folder are unique, the content in the files may be different. For example:
$ cat Delta_Folder/1.properties
account.org.com.email=New-Email
account.value.range=True
$ cat Original_Folder/1.properties
account.org.com.email=Old-Email
account.value.range=False
range.list.type=String
currency.country=Sweden
Now, I need to merge Delta_Folder/1.properties with Original_Folder/1.properties so that my updated Original_Folder/1.properties will be:
account.org.com.email=New-Email
account.value.range=True
range.list.type=String
currency.country=Sweden
The solution I opted for is:
Find all *.properties files in the Delta folder and save the list to a temp file (delta-files.txt).
Find all *.properties files in the Original folder and save the list to a temp file (original-files.txt).
Then I need to get the list of unique files present in both folders and put those in a loop.
Then I need to loop over each file to read each line from a property file (1.properties).
Then I need to read each line (delta-line="account.org.com.email=New-Email") from a property file in the Delta folder and split the line on the delimiter "=" into two string variables
(delta-line-string1=account.org.com.email; delta-line-string2=New-Email;).
Then I need to read each line (orig-line="account.org.com.email=Old-Email") from a property file in the Original folder and split the line on the delimiter "=" into two string variables
(orig-line-string1=account.org.com.email; orig-line-string2=Old-Email;).
If delta-line-string1 == orig-line-string1, then update $orig-line with $delta-line,
i.e.:
if account.org.com.email == account.org.com.email then replace
account.org.com.email=Old-Email in original folder/1.properties with
account.org.com.email=New-Email
Once the loop finishes all lines in a file, it goes to the next file. The loop continues until it finishes all the unique files in a folder.
For looping I used for loops, for splitting lines I used awk, and for replacing content I used sed.
Overall it's working fine, but it's taking a long time (4 minutes) to finish each file, because it goes into three loops for every line, splitting the line, finding the variable in the other file, and replacing the line.
I'm wondering if there is any way I can reduce the loops so that the script executes faster.
With paste and awk:
File 2:
$ cat /tmp/l2
account.org.com.email=Old-Email
account.value.range=False
currency.country=Sweden
range.list.type=String
File 1:
$ cat /tmp/l1
account.org.com.email=New-Email
account.value.range=True
The command + output:
paste /tmp/l2 /tmp/l1 | awk '{print $NF}'
account.org.com.email=New-Email
account.value.range=True
currency.country=Sweden
range.list.type=String
Or with a single awk command if sorting is not important:
awk -F'=' '{arr[$1]=$2}END{for (x in arr) {print x"="arr[x]}}' /tmp/l2 /tmp/l1
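A rough sketch of wrapping that single awk command in a loop over the Delta folder (it assumes matching file names in both folders and accepts that the key order is not preserved):
for f in Delta_Folder/*.properties; do
    name=$(basename "$f")
    # original file first, delta second, so the delta values win
    awk -F'=' '{arr[$1]=$2} END{for (x in arr) print x"="arr[x]}' \
        "Original_Folder/$name" "$f" > "Original_Folder/$name.tmp" &&
        mv "Original_Folder/$name.tmp" "Original_Folder/$name"
done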
I think your two main options are:
Completely reimplement this in a more featureful language, like perl.
While reading the delta file, build up a sed script. For each line of the delta file, you want a sed instruction similar to:
s/account.org.com.email=.*$/account.org.com.email=value_from_delta_file/g
That way you don't loop through the original files a bunch of extra times. Don't forget to escape & / and \ as mentioned in this answer.
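A minimal sketch of generating that sed script from a delta file (file names here are hypothetical, and it assumes keys and values contain no unescaped sed metacharacters beyond the escaping noted above):
# turn every key=value line in the delta file into a sed substitution command
sed 's|^\([^=]*\)=\(.*\)$|s/^\1=.*$/\1=\2/|' Delta_Folder/1.properties > update.sed
# apply all substitutions to the original file in a single pass
sed -f update.sed Original_Folder/1.properties > merged.properties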
Is using a database at all an option here?
Then you would only have to write code for extracting data from the Delta files (assuming that can't be replaced by a database connection).
It just seems like this is going to keep getting more complicated and slower as time goes on.
