How do I use the shell to split a large file into pieces? [duplicate] - shell

This question already has answers here:
Split one file into multiple files based on delimiter
(12 answers)
Closed 8 years ago.
I have a very large file that looks like this:
//abc/file1.js
some javascript code
//abc/file2.js
some javascript code
//abc/file3.js
some javascript code
Here I want to split this large file into pieces and store the pieces in file1.js, file2.js, etc.

You can do this with awk. Print out each input line, but to a file name that changes whenever the input line indicates that a new file starts.
awk '
/^\/\/abc\// { filename = $1; sub(/.*\//, "", filename); next; }
filename { print >filename }
'
Remove the call to next if you want the header lines to be included, e.g. to have //abc/file1.js as the first line of file1.js. You may want to tweak the code that recognizes header lines depending on your requirements. Text prior to the first header line will not be printed anywhere; change filename { … } to 1 { … } if you want to print it to standard output.
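For example, a variant that keeps each header line as the first line of its output file could look like this (a sketch; largefile stands for your input file):
awk '
/^\/\/abc\// { filename = $1; sub(/.*\//, "", filename) }  # no "next": the header line falls through
filename { print >filename }                               # header and code lines go to file1.js, file2.js, ...
' largefile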

Try csplit -k -f file - '/^\/\//' '{1000}' < largefile.
Adjust 1000 to a suitable number. If there are n files in largefile, use n-1 instead of 1000.
If you're using GNU csplit, you can simply use * instead of 1000.
If there are many files in largefile, you'll also need to use -n 4 or some higher value.
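Putting those pieces together with GNU csplit, the command would look something like this:
csplit -k -n 4 -f file - '/^\/\//' '{*}' < largefile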

If you can edit the file, and you know exactly where you want to split (at specific points rather than at some byte offset), then just copy each piece into a new file and save those new pieces, and the existing file, under the names you want, using the editor itself.

Related

How to delete a series of positions within a file based on a list of numbers with Bash

I'm pretty new to Bash scripting and I have a problem to solve. I have a file that looks like this:
>atac
ATTGGCAATTAAATTCTTTT
>lipa
ATTACCAAGTAAATTCTTTT
.
.
.
where all the even lines have the same length but can have different characters, and I need to remove, from each even line, a series of positions listed in a .txt file. The .txt file contains only a list of numbers, one per line, corresponding to the positions to be removed; it looks like this:
3
5
8
10
11
In the expected output every even line must still have the same length as the others, but with the positions listed in the .txt file deleted from each of them.
Any suggestions?
If the "position" in the txt file indicates always the index of the original string, this awk-oneliner will help you:
awk 'NR==FNR{a[$0];next}FNR%2==0{for(x in a)$x=""}7' your.txt FS="" OFS="" file
>atac
ATGCATAATTCTTTT
>lipa
ATACAGAATTCTTTT
We mark (as "-") the deleted char so that you can verify if the result is correct:
awk 'NR==FNR{a[$0];next}FNR%2==0{for(x in a)$x="-"}7' txt FS="" OFS="" file
>atac
AT-G-CA-T--AATTCTTTT
>lipa
AT-A-CA-G--AATTCTTTT
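For reference, here is the same one-liner spread out with comments (gawk splits every character into its own field when FS is empty, and OFS="" glues the remaining characters back together):
awk '
NR == FNR    { a[$0]; next }           # first file: remember each position to delete
FNR % 2 == 0 { for (x in a) $x = "" }  # even lines: empty those character positions
7                                      # always true: print the (rebuilt) line
' your.txt FS="" OFS="" file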

Replace a whole line using sed [duplicate]

This question already has answers here:
Difference between single and double quotes in Bash
(7 answers)
Closed 4 years ago.
I am very new to all of this and have used this website to help me find the answers I'm looking for.
I want to replace a line in multiple files across multiple directories, but I have struggled to do this.
I have created multiple directories 'path_{0..30}'; each directory has the same 'input' file and another file 'opt_path_rx_00i.xyz', where i corresponds to the directory that the file is in (i = {0..30}).
I need to be able to change one of the lines (line 7) in the input file, so that it changes with the directory that the input file is in (path_{0..30}). The line is:
pathfile opt_path_rx_00i.xyz
Where i corresponds to the directory that the file is in (i = {0..30}).
However, I'm struggling to do this using sed. I manage to change the line for each input file in the respective directories, but I'm unable to ensure that the number i changes with the directory. Instead, the input file in each directory just changes line 7 to:
pathfile opt_path_rx_00i.xyz
where i, in this case, is the literal letter i and not the numbers {0..30}.
I'll show what I've done below to make this clearer.
for i in {0..30}
do
sed -i '7s/.*/pathfile-opt_path_rx_00$i.xyz/' path_$i/input
done
What I want to happen is, for example in directory path_3, line 7 in the input file will be:
pathfile opt_path_rx_003.xyz
Any help would be much appreciated.
Can you try with double quotes?
for i in {0..30}; do
sed -i "7s/.*/pathfile-opt_path_rx_00$i.xyz/" "path_$i/input"
done
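The reason is that the shell expands $i only inside double quotes; single quotes keep it literal. A quick way to see the difference:
i=3
echo 'opt_path_rx_00$i.xyz'   # single quotes: opt_path_rx_00$i.xyz
echo "opt_path_rx_00$i.xyz"   # double quotes: opt_path_rx_003.xyz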

Delete lines in a file based on first row

I am trying to work on a whole series of txt files (actually .out files, but they behave like space-delimited txt files). I want to delete certain lines in each file, based on how their values compare against a column named in the first row.
So for example:
ID VAR1 VAR2
1 8 9
2 4 1
3 3 2
I want to delete all the lines with VAR1 < 0.5.
I found a way to do this manually in Excel, but with 350+ files this is going to be a long night; there are surely more effective ways to do this. I have already worked on this set of files in the terminal (OS X).
This is a typical job for awk, the venerable language for file manipulation.
What awk does is match each line in a file against a condition and provide an action for it. It also allows for easy, elementary parsing of line columns. In this case, you want to test whether the second column is less than 0.5, and if so not print that line. Otherwise, print the line (in effect this removes the lines for which the variable is less than 0.5).
Your variable is in column 2, which in awk is referred to as $2. Each full line is referred to by the variable $0.
So you would do something like this:
{
    if ($2 < 0.5) {
        # nothing to do: drop lines whose second column is below 0.5
    } else {
        print $0
    }
}
Or something like that; I haven't used awk for a while. The above code is an awk script. Apply it to your file and redirect the output to a new file (which will have all the lines with a second column below 0.5 removed).
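A condensed way to run the same filter over all 350+ files at once might look like this (a sketch, assuming the files end in .out, the value is in the second column, and the header row should be kept; the .filtered.out names are just an example):
# keep the header row plus every line whose second column is at least 0.5
for f in *.out; do
    awk 'NR == 1 || $2 >= 0.5' "$f" > "${f%.out}.filtered.out"
done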

Split text file into multiple files

I have a large text file containing 1000 abstracts, with an empty line between each abstract. I want to split this file into 1000 text files.
My file looks like this:
16503654 Three-dimensional structure of neuropeptide k bound to dodecylphosphocholine micelles. Neuropeptide K (NPK), an N-terminally extended form of neurokinin A (NKA), represents the most potent and longest lasting vasodepressor and cardiomodulatory tachykinin reported thus far.
16504520 Computer-aided analysis of the interactions of glutamine synthetase with its inhibitors. Mechanism of inhibition of glutamine synthetase (EC 6.3.1.2; GS) by phosphinothricin and its analogues was studied in some detail using molecular modeling methods.
You can use split and set "NUMBER lines per output file" to 2. Each file would have one text line and one empty line.
split -l 2 file
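If you prefer numbered output files, GNU split can do that as well (abstracts.txt stands for your input file):
split -l 2 -d -a 4 abstracts.txt abstract_   # creates abstract_0000, abstract_0001, ...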
Something like this:
awk 'NF{print > $1;close($1);}' file
This will create 1000 files, with each filename being the abstract number. The awk code writes each record to a file whose name is taken from the first field ($1). This is done only if the number of fields is greater than zero (NF).
You could always use the csplit command. This is a file splitter, but based on a regex.
Something along the lines of:
csplit -ks -n 4 -f /tmp/files INPUTFILENAMEGOESHERE '/^$/' '{*}'
It is untested and may need a little tweaking though.

Method for merging two files, opinion needed

Problem: I have two folders (one is the Delta folder, where the files get updated, and the other is the Original folder, where the original files exist). Every time a file updates in the Delta folder, I need to merge the file from the Original folder with the updated file from the Delta folder.
Note: although the file names in the Delta folder and the Original folder are the same, the content of the files may differ. For example:
$ cat Delta_Folder/1.properties
account.org.com.email=New-Email
account.value.range=True
$ cat Original_Folder/1.properties
account.org.com.email=Old-Email
account.value.range=False
range.list.type=String
currency.country=Sweden
Now, I need to merge Delta_Folder/1.properties with Original_Folder/1.properties so, my updated Original_Folder/1.properties will be:
account.org.com.email=New-Email
account.value.range=True
range.list.type=String
currency.country=Sweden
The solution I opted for is:
Find all *.properties files in the Delta folder and save the list to a temp file (delta-files.txt).
Find all *.properties files in the Original folder and save the list to a temp file (original-files.txt).
Then I get the list of file names common to both folders and put those in a loop.
Then I loop over each file to read every line from a property file (1.properties).
Then I read each line (delta-line="account.org.com.email=New-Email") from a property file in the Delta folder and split the line on the delimiter "=" into two string variables
(delta-line-string1=account.org.com.email; delta-line-string2=New-Email;).
Then I read each line (orig-line="account.org.com.email=Old-Email") from a property file in the Original folder and split the line on the delimiter "=" into two string variables
(orig-line-string1=account.org.com.email; orig-line-string2=Old-Email;).
if delta-line-string1 == orig-line-string1 then update $orig-line with $delta-line
i.e:
if account.org.com.email == account.org.com.email then replace
account.org.com.email=Old-Email in original folder/1.properties with
account.org.com.email=New-Email
Once the loop has handled all the lines in a file, it moves on to the next file. The loop continues until it has processed all the matching files in the folder.
For looping I used for loops, for splitting lines I used awk, and for replacing content I used sed.
Overall it works fine, but it takes quite a long time (4 minutes) to finish each file, because it goes through three loops for every line, splitting the line, finding the variable in the other file, and replacing the line.
I'm wondering if there is any way I can reduce the loops so that the script executes faster.
With paste and awk:
File 2:
$ cat /tmp/l2
account.org.com.email=Old-Email
account.value.range=False
currency.country=Sweden
range.list.type=String
File 1:
$ cat /tmp/l1
account.org.com.email=New-Email
account.value.range=True
The command + output:
paste /tmp/l2 /tmp/l1 | awk '{print $NF}'
account.org.com.email=New-Email
account.value.range=True
currency.country=Sweden
range.list.type=String
Or with a single awk command if sorting is not important:
awk -F'=' '{arr[$1]=$2}END{for (x in arr) {print x"="arr[x]}}' /tmp/l2 /tmp/l1
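To apply that to every matching pair of files, a loop along these lines should work (a sketch using the folder names from the question; it assumes values contain no extra = signs, and the .new name is just an intermediate):
for delta in Delta_Folder/*.properties; do
    orig="Original_Folder/$(basename "$delta")"
    [ -f "$orig" ] || continue                  # skip files that only exist in the Delta folder
    awk -F'=' '{arr[$1]=$2} END{for (x in arr) print x"="arr[x]}' "$orig" "$delta" > "$orig.new" &&
        mv "$orig.new" "$orig"
done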
I think your two main options are:
Completely reimplement this in a more featureful language, like perl.
While reading the delta file, build up a sed script. For each line of the delta file, you want a sed instruction similar to:
s/account.org.com.email=.*$/account.org.com.email=value_from_delta_file/g
That way you don't loop through the original files a bunch of extra times. Don't forget to escape &, / and \ as mentioned in this answer.
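A rough sketch of that idea, reusing the example file names from the question (it only escapes &, / and \ in the values, assumes keys contain nothing more exotic than dots, and the .new output name is made up here):
sed_script=$(mktemp)
while IFS='=' read -r key value; do
    esc=$(printf '%s' "$value" | sed 's/[&/\\]/\\&/g')          # escape &, / and \ in the replacement text
    printf 's/^%s=.*$/%s=%s/\n' "$key" "$key" "$esc" >> "$sed_script"
done < Delta_Folder/1.properties
sed -f "$sed_script" Original_Folder/1.properties > Original_Folder/1.properties.new
rm -f "$sed_script"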
Is using a database at all an option here?
Then you would only have to write code for extracting data from the Delta files (assuming that can't be replaced by a database connection).
It just seems like this is going to keep getting more complicated and slower as time goes on.
