Merging CSVs into one sees exponentially bigger size - bash

I have 600 CSV files of ~1 MB each, for a total of roughly 600 MB. I want to put all of them into a sqlite3 db. So my first step would be to merge them into one big CSV (of ~600 MB, right?) before importing it into the db.
However, when I run the following bash commands (to merge all files, keeping one header):
cat file-chunk0001.csv | head -n1 > file.csv
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> file.csv; done
The resulting file.csv reaches a size of 38 GB, at which point the process stops because I have no space left on the device.
So my question is: why would the merged file be more than 50 times bigger than expected? And what can I do to put these files into a sqlite3 db of a reasonable size?

I guess my first question is: if you know how to do a for loop, why do you need to merge all the files into a single CSV file? Can't you just load them one after the other?
But your immediate problem is a feedback loop: your wildcard (*.csv) includes the file you're writing to, so at some point the loop reads file.csv while appending to it, and it keeps growing until the disk is full. You could put your output file in a different directory, or make sure your file glob cannot match the output file (for f in file-*.csv, maybe).
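A hedged sketch of both suggestions, assuming the input chunks are all named file-chunk*.csv and, for the direct import, a sqlite3 CLI of 3.32 or newer (needed for .import --csv --skip); the database and table names are only examples:
# fixed merge: the glob matches only the input chunks, and the output name cannot match it
head -n 1 file-chunk0001.csv > merged.out
for f in file-chunk*.csv; do
  tail -n +2 "$f" >> merged.out
done
# or skip the merge and import each chunk straight into sqlite3 (filenames assumed to contain no spaces)
sqlite3 mydb.sqlite ".import --csv file-chunk0001.csv mytable"   # first chunk: its header row creates the table
for f in file-chunk*.csv; do
  [ "$f" = file-chunk0001.csv ] && continue
  sqlite3 mydb.sqlite ".import --csv --skip 1 $f mytable"        # later chunks: drop their header row
done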

Related

awk command does not halt on Windows when merging large CSV files

I am executing the following awk command on Windows 10.
awk "(NR == 1) || (FNR > 1)" *.csv > bigMergeFile.csv
I want to merge all csv files into a single file named bigMergeFile.csv using only the header of the first file.
I successfully tested the code on small files (4 files, each containing 5 columns and 4 rows). However, the code does not halt when I run it on large files (10 files, each with 8k rows and 32k columns, approximately 1 GB per file). It only stops when the hard drive runs out of space. At that point, the resulting output file bigMergeFile.csv is 30 GB, while the combined size of all input CSV files is 9.5 GB.
I have tested the code on Mac OS and it works fine. Help will be appreciated.
My guess: bigMergeFile.csv ends in .csv, so it's one of the input files your script is running on, and it keeps growing as your script appends to it. It's as if you wrote a loop like:
while ! end-of-file; do
    read line from start of file
    write line to end of file
done
Separately, since you're basically doing a concat, not a merge, set FS = "^$" so awk won't waste time attempting to split fields you don't need anyway.
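A hedged sketch of the combined fix: write the output under a name the *.csv wildcard cannot match, and pass the no-split field separator on the command line (bigMergeFile.out is just an example name):
awk -F "^$" "(NR == 1) || (FNR > 1)" *.csv > bigMergeFile.out
Redirecting into another directory (e.g. > ..\bigMergeFile.csv) works just as well, as long as the wildcard never sees the file awk is writing.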

Insert bytes into file using shell

I would like to use a Linux shell (bash, zsh, etc.) to insert a set of known bytes into a file at a certain position. Similar questions have been asked, but they modify the bytes of a file in place; they don't address inserting new bytes at particular positions.
For example, if my file has a sequence of bytes like \x32\x33\x35 I might want to insert \x34 at position 2 so that this byte sequence in the file becomes \x32\x33\x34\x35.
You can achieve this using head, tail and printf together. For example, to insert \x34 at position 2 in file:
{ head -c 2 file; printf '\x34'; tail -c +3 file; } > new_file
For POSIX compliance, \064 (the octal representation of \x34) can be used.
To make this change in-place, just move new_file to file.
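Putting the two together, a small sketch of the same idea using the octal escape plus the in-place move:
{ head -c 2 file; printf '\064'; tail -c +3 file; } > new_file && mv new_file file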
No matter which tool(s) you use, this operation rewrites the entire file, so it will be expensive for huge files.

How to split an mbox file into n-MB big chunks using the terminal?

So I've read through this question on SO but it does not quite help me. I want to import a Gmail-generated mbox file into another webmail service, but the problem is that it only allows files of up to 40 MB per import.
So I somehow have to split the mbox file into files of at most 40 MB and import them one after another. How would you do this?
My initial thought was to use the other script (formail) to save each mail as a single file and afterwards run a script to combine them into files of up to 40 MB, but I still wouldn't know how to do this using the terminal.
I also looked at the split command, but I'm afraid it would cut off mails.
Thanks for any help!
I just improved a script from Mark Sechell's answer. As we can see, that script splits the mbox file based on the number of emails per chunk. This improved script splits the mbox file based on a defined maximum size for each chunk.
So, if you have a size limitation on uploading or importing the mbox file, you can try the script below to split the mbox file into chunks of a specified size.
Save the script below to a text file, e.g. mboxsplit.txt, in the directory that contains the mbox file (e.g. named mbox):
BEGIN { chunk=0; filesize=0; }
/^From / {
    if (filesize >= 40000000) {   # max file size per chunk, in bytes
        close("chunk_" chunk ".txt");
        filesize = 0;
        chunk++;
    }
}
{ filesize += length() }
{ print > ("chunk_" chunk ".txt") }
Then run this line in that directory (the one containing mboxsplit.txt and the mbox file):
awk -f mboxsplit.txt mbox
Please note:
A chunk may end up larger than the defined size: the size is only checked at the next message boundary, after the last email has already been written to the current chunk.
It will not split an email body.
One chunk may contain only one email if that email is larger than the specified chunk size.
I suggest specifying a chunk size somewhat lower than the maximum upload/import size.
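After the split finishes, a quick sanity check on the result (the chunk_N.txt names come from the script above):
ls -l chunk_*.txt               # confirm every chunk stays under the 40 MB limit
grep -c '^From ' chunk_*.txt    # number of messages per chunk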
If your mbox is in standard format, each message will begin with From and a space:
From someone@somewhere.com
So, you could COPY YOUR MBOX TO A TEMPORARY DIRECTORY and try using awk to process it, on a message-by-message basis, only splitting at the start of any message. Let's say we went for 1,000 messages per output file:
awk 'BEGIN{chunk=0} /^From /{msgs++;if(msgs==1000){msgs=0;chunk++}}{print > "chunk_" chunk ".txt"}' mbox
then you will get output files called chunk_0.txt through chunk_n.txt, each containing up to 1,000 messages.
If you are unfortunate enough to be on Windows (which is incapable of understanding single quotes), you will need to save the following in a file called awk.txt
BEGIN{chunk=0} /^From /{msgs++;if(msgs==1000){msgs=0;chunk++}}{print > "chunk_" chunk ".txt"}
and then type
awk -f awk.txt mbox
formail is perfectly suited for this task. You may look at formail's +skip and -total options:
Options
...
+skip
Skip the first skip messages while splitting.
-total
Output at most total messages while splitting.
Depending on the size of your mailbox and mails, you may try
formail -100 -s <google.mbox >import-01.mbox
formail +100 -100 -s <google.mbox >import-02.mbox
formail +200 -100 -s <google.mbox >import-03.mbox
etc.
The parts need not be of equal size, of course. If there's one large e-mail, you may have only formail +100 -60 -s <google.mbox >import-02.mbox, or if there are many small messages, maybe formail +100 -500 -s <google.mbox >import-02.mbox.
To find a reasonable initial number of mails per chunk, try the following and check the byte count (the last column of wc's output):
formail -100 -s <google.mbox | wc
formail -500 -s <google.mbox | wc
formail -1000 -s <google.mbox | wc
You may need to experiment a bit in order to accommodate your mailbox and message sizes. On the other hand, since this seems to be a one-time task, you may not want to spend too much time on this.
My initial thought was to use the other script (formail) to save each mail as a single file and afterwards run a script to combine them into files of up to 40 MB, but I still wouldn't know how to do this using the terminal.
If I understand you correctly, you want to split the files up, then combine them into a big file before importing them. That sounds like what split and cat were meant to do. split divides a file according to your size specification, whether by lines or by bytes, and adds a suffix to the pieces to keep them in order. You then use cat to put the pieces back together:
$ split -b40m -a5 mbox mbox.   # this makes mbox.aaaaa, mbox.aaaab, etc.
Once you get the files on the other system:
$ cat mbox.* > mbox
However, you wouldn't do this in your case: since you are going to import each piece into the new mail system one at a time, you need the split points to fall between messages, and split cuts purely by size, so messages would end up split across files.

Alter the first 20 lines of a gzipped file without rewriting the entire file in BASH

I have 305 files. Each is ~10M lines. I only need to alter the first 20 lines of each file.
Specifically, I need to add # as the first character of the first 18 lines, delete the 19th line (or, more safely, delete all lines that are completely blank), and replace > with # on the 20th line.
The remaining ~10M lines don't need to change at all.
If the files were not gzipped, I could do something like:
while read -r F; do
    for i in $(seq 1 100); do
        awk '{gsub(/#/,"##"); print $0}' "$F"
        awk more commands
        awk more commands
    done
done < "$FNAMES"
but what is really throwing a wrench into this is the fact that the files are all gzipped. Is there any way to efficiently alter these 20 lines without decompressing and/or rewriting the whole file?
No, it is not possible. With adaptive compression schemes (such as the Lempel-Ziv system gzip uses), it adjusts the encoding based on what it sees as it goes through the file. This means that the way the end of the file gets compressed (and hence decompressed) depends on the beginning of the file. If you change just the beginning of the (compressed) file, you'll change how the end gets decompressed, essentially corrupting the file.
So decompressing, modifying, and recompressing is the only way to do it.
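Since a full rewrite is unavoidable, the usual pattern is to stream it: decompress, edit the header lines on the fly, and recompress, so the uncompressed data never has to be stored on disk. A hedged sketch, assuming GNU sed and gzip and the edits described above (comment out the first 18 lines, drop blank lines in the header, replace the first > on line 20 with #):
set -o pipefail                            # so a failed zcat can't silently produce a truncated file
for f in *.gz; do
  zcat "$f" \
    | sed -e '1,20{/^$/d}' -e '1,18s/^/#/' -e '20s/>/#/' \
    | gzip > "$f.tmp" && mv "$f.tmp" "$f"
done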

method for merging two files, opinion needed

Problem: I have two folders (one is the Delta folder, where files get updated, and the other is the Original folder, where the original files live). Every time a file is updated in the Delta folder, I need to merge the file from the Original folder with the updated file from the Delta folder.
Note: the file names in the Delta folder and the Original folder match, but the content of the files may be different. For example:
$ cat Delta_Folder/1.properties
account.org.com.email=New-Email
account.value.range=True
$ cat Original_Folder/1.properties
account.org.com.email=Old-Email
account.value.range=False
range.list.type=String
currency.country=Sweden
Now I need to merge Delta_Folder/1.properties into Original_Folder/1.properties, so my updated Original_Folder/1.properties will be:
account.org.com.email=New-Email
account.value.range=True
range.list.type=String
currency.country=Sweden
The solution I opted for is:
Find all *.properties files in the Delta folder and save the list to a temp file (delta-files.txt).
Find all *.properties files in the Original folder and save the list to a temp file (original-files.txt).
Then get the list of files that appear in both folders and loop over those files.
Then, for each file, loop over every line of the property file (1.properties).
Read each line (delta-line="account.org.com.email=New-Email") from the Delta folder's property file and split it on the "=" delimiter into two string variables
(delta-line-string1=account.org.com.email; delta-line-string2=New-Email).
Read each line (orig-line="account.org.com.email=Old-Email") from the Original folder's property file and split it on the "=" delimiter into two string variables
(orig-line-string1=account.org.com.email; orig-line-string2=Old-Email).
If delta-line-string1 == orig-line-string1, then replace $orig-line with $delta-line,
i.e.:
if account.org.com.email == account.org.com.email then replace
account.org.com.email=Old-Email in Original_Folder/1.properties with
account.org.com.email=New-Email
Once the loop finishes all lines in a file, it moves on to the next file, and continues until all common files are processed.
For looping I used for loops, for splitting lines I used awk, and for replacing content I used sed.
Overall it works, but it takes a long time (about 4 minutes per file), because for every line it goes through three loops, splits the line, finds the key in the other file and replaces the line.
I'm wondering if there is any way to reduce the loops so that the script executes faster. (A rough sketch of the current approach is shown below.)
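Roughly, the described approach looks like this (file and variable names are invented for illustration, and the escaping issues mentioned in the answers below are ignored); the complete sed pass over the original file for every single delta line is what makes it slow:
while read -r deltafile; do
  origfile="Original_Folder/$(basename "$deltafile")"
  while IFS= read -r delta_line; do
    key=$(printf '%s\n' "$delta_line" | awk -F'=' '{print $1}')
    # one full sed pass over the original file per delta line
    sed -i "s|^$key=.*|$delta_line|" "$origfile"
  done < "$deltafile"
done < delta-files.txt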
With paste and awk:
File 2:
$ cat /tmp/l2
account.org.com.email=Old-Email
account.value.range=False
currency.country=Sweden
range.list.type=String
File 1:
$ cat /tmp/l1
account.org.com.email=New-Email
account.value.range=True
The command and its output:
paste /tmp/l2 /tmp/l1 | awk '{print $NF}'
account.org.com.email=New-Email
account.value.range=True
currency.country=Sweden
range.list.type=String
Or, with a single awk command, if sorting is not important:
awk -F'=' '{arr[$1]=$2}END{for (x in arr) {print x"="arr[x]}}' /tmp/l2 /tmp/l1
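To apply that idea across the two folders from the question (folder names assumed from the question), one awk call per file pair replaces all of the inner loops; storing the whole line per key also keeps values that themselves contain = intact:
for delta in Delta_Folder/*.properties; do
  orig="Original_Folder/$(basename "$delta")"
  [ -f "$orig" ] || continue
  # the last definition of a key wins, so list the original file first and the delta file second
  awk -F'=' '{arr[$1]=$0} END{for (k in arr) print arr[k]}' "$orig" "$delta" > "$orig.tmp" \
    && mv "$orig.tmp" "$orig"
done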
I think your two main options are:
Completely reimplement this in a more featureful language, like perl.
While reading the delta file, build up a sed script. For each line of the delta file, you want a sed instruction similar to:
s/account.org.com.email=.*$/account.org.com.email=value_from_delta_file/g
That way you don't loop through the original files a bunch of extra times. Don't forget to escape &, / and \ as mentioned in this answer.
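A hedged sketch of that idea for one file pair (folder names taken from the question): generate the sed script from the delta file, then apply it to the original in a single pass. The values are escaped as the linked answer suggests; keys are used unescaped on the match side, which is fine for dotted property names:
delta=Delta_Folder/1.properties
orig=Original_Folder/1.properties
script=$(mktemp)
while IFS='=' read -r key value; do
  esc=$(printf '%s' "$value" | sed 's/[\/&]/\\&/g')   # escape \, / and & for the replacement side
  printf 's/^%s=.*$/%s=%s/\n' "$key" "$key" "$esc" >> "$script"
done < "$delta"
sed -f "$script" "$orig" > "$orig.new" && mv "$orig.new" "$orig"
rm -f "$script"
One caveat: a key that exists only in the delta file is not appended by this substitution-only approach (in the question's example every delta key already exists in the original file).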
Is using a database at all an option here?
Then you would only have to write code for extracting data from the Delta files (assuming that can't be replaced by a database connection).
It just seems like this is going to keep getting more complicated and slower as time goes on.
