Use awk to separate text file into multiple files - shell

I've read a couple of other questions about this, but none of them seem to be working. I'm currently trying to split something like file A.txt using the delimiter "STOPHERE".
This is the code:
#!/bin/bash
awk 'BEGIN {
    RS = "STOPHERE"
    file = 0
}
{
    file++
    print $0 > ("sepf" file)
}' A.txt
File A.txt:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa lwdjnuqqfqaaaaaaaaaa qlknfqek fkgnl efekfnwegelflfne
ldnwefne f STOPHEREsdfnkjnf nnnnnnnnnnnnnnnnnnnnnnnasd fefffffffffffffflllo
aldn3orn STOPHERE
fknjke bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbowqff STOPHERE i
asfjfenf STOPHERE
Into these:
sepf1:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa lwdjnuqqfqaaaaaaaaaa qlknfqek fkgnl efekfnwegelflfne
ldnwefne f
sepf2:
sdfnkjnf nnnnnnnnnnnnnnnnnnnnnnnasd fefffffffffffffflllo
aldn3orn
sepf3:
#line starts here
fknjke bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbowqff
sepf4:
i
asfjfenf
So basically, the formatting has to stay exactly the same between the STOPHERE markers.
But for some reason, this is the kind of output I'm getting in some of the files:
Eg: sepf2
TOPHEREsdfnkjnf nnnnnnnnnnnnnnnnnnnnnnnasd fefffffffffffffflllo
aldn3orn
Any ideas as to why the "TOPHERE" remains??

GNU awk allows RS to be a regex, so you can use a multi-character string as the record separator. With a non-GNU awk, typically only the first character of RS is honored, so your input was being split on every "S" and the literal "TOPHERE" was left at the start of the next record. Your code can also be simplified, since awk gives an unset variable a default value of 0.
So this will generate a separate file for each record:
awk -v RS="STOPHERE" '{print $0 > ("sepf" ++file)}' A.txt
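If the input produces many records, it is also worth closing each output file as you go, otherwise a large run can hit the "too many open files" limit. A sketch (still GNU awk, because of the multi-character RS):
awk -v RS="STOPHERE" '{ print > ("sepf" ++file); close("sepf" file) }' A.txt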

Related

Script to find and print common IDs between two files works but is not optimal

I have code that works and does what I want, but it is extremely slow: it takes 1 or 2 days depending on the size of the input files. I know there are alternatives that can be almost instant, and that my code is slow because it runs grep once per ID. I wrote another script in Python that works as intended and is almost instant, but it does not print everything I need.
What I need is the common IDs between two files, and I want the whole line to be printed. My Python script does not do that, while the bash one does, but it is far too slow.
This is my code in bash:
awk '{print $2}' file1.bim > sites.txt
for snp in `cat sites.txt`
do
grep -w $snp file2.bim >> file1_2_shared.txt
done
This is my code in python:
#!/usr/bin/env python3
import sys

argv1 = sys.argv[1]  # argv1 is the first .bim file
argv2 = sys.argv[2]  # argv2 is the second .bim file
argv3 = sys.argv[3]  # argv3 is the output .txt file name

def printcommonSNPs(inputbim1, inputbim2, outputtxt):
    bim1 = open(inputbim1, "r")
    bim2 = open(inputbim2, "r")
    output = open(outputtxt, "w")

    snps1 = []
    line1 = bim1.readline()
    line1 = line1.split()
    snps1.append(line1[1])
    for line1 in bim1:
        line1 = line1.split()
        snps1.append(line1[1])
    bim1.close()

    snps2 = []
    line2 = bim2.readline()
    line2 = line2.split()
    snps2.append(line2[1])
    for line2 in bim2:
        line2 = line2.split()
        snps2.append(line2[1])
    bim2.close()

    common = list(set(snps1).intersection(snps2))
    for SNP in common:
        print(SNP, file=output)

printcommonSNPs(argv1, argv2, argv3)
My .bim input files are made this way:
1 1:891021 0 891021 G A
1 1:903426 0 903426 T C
1 1:949654 0 949654 A G
I would appreciate suggestions on what I could do to make it quick in bash (I suspect I can use an awk script; I tried awk 'FNR==NR {map[$2]=$2; next} {print $2, map[$2]}' file1.bim file2.bim > Roma_sets_shared_sites.txt, but it simply prints every line, so it's not doing what I need), or on how I could get python3 to print the whole line.
It looks as if the problem can be solved like this:
grep -w -f <(awk '{ print $2 }' file1.bim) file2.bim
The identifiers (field $2) from file1.bim are to be treated as patterns to grep for in file2.bim. GNU grep takes a -f file argument which gives a list of patterns, one per line. We use <() process substitution in place of a file. It looks as if the -w option individually applies to the -f patterns.
This won't have the same output as your shell script if there are duplicate IDs in file1.bim: a pattern that occurs more than once behaves the same as a single instance. And of course the order is different. Grepping the entire second file for one identifier, then the next, and so on produces the matches in a different order; if that order has to be reproduced, it will take extra work.
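A single awk command can also do the whole job, which is close to what you already tried: load the IDs of file1.bim into an array on the first pass, then print every whole line of file2.bim whose second field is in the array. A sketch (note it compares field 2 exactly, rather than grepping for the ID anywhere in the line):
awk 'FNR==NR { ids[$2]; next } $2 in ids' file1.bim file2.bim > file1_2_shared.txt
Your original attempt printed every line because the action {print $2, map[$2]} ran unconditionally; using $2 in ids as the pattern makes the printing conditional.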

Need help splitting a file with blank lines (BASH)

I have a file containing several thousand lines. The file format is similar to this:
1
H
H 13.1641870 7.1039560 -5.9652740

3
O2H2
H 15.5567440 5.6184980 -4.5255100
H 15.8907030 4.2338600 -5.4917990
O 15.5020000 6.4310000 -7.0960000
O 13.7940000 5.5570000 -8.1620000

2
CH
H 13.0960830 7.7155820 -3.5224750
C 11.0480000 7.4400000 -5.5080000
.
.
.
What I want is to split the full file into several files, putting into each file all the information between empty lines. The problem is that the blank lines do not follow a pattern: some blocks have 1 line and others have 10.
Could someone tell me how to separate the file using the blank lines as separators?
Using awk (GNU awk, since RS here is a multi-character regex), with the data in a file called mainfile:
awk 'BEGIN { RS = "\n\n+" } { print $0 >> ("file" NR ".txt") }' mainfile
Set the record separator to a run of two or more line feeds (i.e., at least one blank line), then print each record to a file named after the record number, i.e. file1.txt etc.
Would you please try the following:
awk -v RS="" '{print > "file" ++i ".txt"; close("file" i ".txt")}' input.txt
If the awk variable RS is set to the null string, then records are separated by blank lines.
It is recommended to close each file to avoid the "too many open files" error.
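As a quick check, assuming the sample above is saved as input.txt, the second block lands in file2.txt:
$ awk -v RS="" '{ print > ("file" ++i ".txt"); close("file" i ".txt") }' input.txt
$ cat file2.txt
3
O2H2
H 15.5567440 5.6184980 -4.5255100
H 15.8907030 4.2338600 -5.4917990
O 15.5020000 6.4310000 -7.0960000
O 13.7940000 5.5570000 -8.1620000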

AWK post-processing of multi-column data

I am working with a set of txt files, each containing multi-column information on one line. Within my bash script I use the following AWK expression to take the filename of each txt file as well as the number from its 5th column, and save them in 2-column format in a results CSV file (piped to SED, which removes the file's path and extension from the final CSV):
awk -F', *' '{if(FNR==2) printf("%s| %s \n", FILENAME,$5) }' ${tmp}/*.txt | sed 's|/Users/gleb/Desktop/scripts/clusterizator/tmp/||; s|\.txt||' >> ${home}/"${experiment}".csv
obtaining something (for 5 txt filles) like this as CSV:
lig177_cl_5.2| -0.1400
lig331_cl_3.5| -8.0000
lig394_cl_1.9| -4.3600
lig420_cl_3.8| -5.5200
lig550_cl_2.0| -4.3200
How would it be possible to modify my AWK expression to exclude "_cl_x.x" from the name of each txt file, and to add the name of the CSV as a comment on the first line of the resulting CSV file:
# results.CSV
lig177| -0.1400
lig331| -8.0000
lig394| -4.3600
lig420| -5.5200
lig550| -4.3200
Based on the rest of the pipe, I think you want to do something like this and get rid of the sed invocation:
awk -F', *' 'FNR==2 {f=FILENAME;
      sub(/.*\//, "", f);
      sub(/_.*/, "", f);
      printf("%s| %s\n", f, $5) }' "${tmp}"/*.txt >> "${home}/${experiment}.csv"
this will convert
/Users/gleb/Desktop/scripts/clusterizator/tmp/lig177_cl_5.2.txt
to
lig177
The pattern replacement is generic:
/path/to/the/file/filename_otherstringshere...
will be reduced to filename alone: everything up to the last / character is removed first, then everything from the first _ character on. This relies on the greedy matching of regex patterns.
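You can see the two substitutions at work on a single path:
$ echo '/Users/gleb/Desktop/scripts/clusterizator/tmp/lig177_cl_5.2.txt' | awk '{ sub(/.*\//, ""); sub(/_.*/, ""); print }'
lig177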
For the output filename, it's easier to write the comment line before the awk call, since it's only one line:
$ echo "# ${experiment}.csv" > "${home}/${experiment}.csv"
$ awk ... >> "${home}/${experiment}.csv"
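Alternatively, the comment line can come from the awk program itself via a BEGIN block, so a single command writes the whole file (a sketch using the same variables as above):
awk -F', *' -v out="${experiment}.csv" '
  BEGIN { print "# " out }
  FNR==2 { f=FILENAME; sub(/.*\//,"",f); sub(/_.*/,"",f); printf("%s| %s\n", f, $5) }
' "${tmp}"/*.txt > "${home}/${experiment}.csv"
Note the single > here: the BEGIN line starts the file, so there is nothing to append to.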

How to collate multiple files in AWK?

I am trying to collate a series of .csv log files that are named by date (e.g., 2019-02-24.csv). There are a bunch of them, so I'm trying to script the process. I've crafted an AWK script that combines individual files:
awk ' FNR==1 { while (/"_time",PIN,FULLNAME,OFFICE,Acronym,Name/) getline; } 1 { print } ' 2019-01-01.csv >> usage_history.csv
But I am failing when I try to string the AWK commands together with a control loop in BASH:
for i in {01..28}; do echo "awk ' FNR==1 { while (/\"_time\",PIN,FULLNAME,OFFCODE,Acronym,Name/) getline; } 1 { print } ' 2019-01-$i.csv >> user_history.csv"; done
When I run this, it prints out the correct commands to the command line, but the awk scripts are not executed (they only get printed). If I run it without echo, I get errors telling me that the file doesn't exist; though all files are present:
bash: awk ' FNR==1 { while (/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/) getline; } 1 { print } ' 2019-01-01.csv >> user_history.csv: No such file or directory
What am I missing in my loop?
Here is a condensed sample of the command and the error messages:
$ for i in {01..02}; do "awk ' FNR==1 { while (/\"_time\",PIN,FULLNAME,OFFCODE,Acronym,Name/) getline; } 1 { print } ' 2019-01-$i.csv >> user_history.csv"; done
bash: awk ' FNR==1 { while (/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/) getline; } 1 { print } ' 2019-01-01.csv >> user_history.csv: No such file or directory
bash: awk ' FNR==1 { while (/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/) getline; } 1 { print } ' 2019-01-02.csv >> user_history.csv: No such file or directory
Could you please try the following.
awk '!/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/' 2019-01-[0-9]*.csv >> user_history.csv
Here are the reasons for this approach:
1- Using a for loop and calling awk once per file is overkill. awk can read multiple files itself, so we should make use of that.
2- As for the getline part you tried: to skip lines containing a string, simply negate a match with !/string_to_be_skipped/, which selects only those lines NOT containing that string.
3- When passing the files to the single awk command I used 2019-01-[0-9]*.csv, because you haven't said whether a file exists for every day. With an explicit list, any missing file causes an error. For example, say I use the following awk command after intentionally removing the file 2019-01-02.csv:
awk '........' 2019-01-{01..29}.csv
awk: cannot open 2019-01-02.csv (No such file or directory)
To avoid this kind of situation I used 2019-01-[0-9]*.csv: the glob only expands to files that actually exist, so awk will not complain that some file is missing.
Try this:
for i in {01..28}; do awk '!/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/' 2019-01-$i.csv >>user_history.csv;done
The commands after do should not be quoted.
And what you were doing essentially amounts to ignoring the header lines.
The {print} after 1 is unnecessary: a lone 1 already implies {print}; the 1 just supplies a true condition.
-- When there's only an expression and no action block, the implied action is {print}.
-- And a bare regexp is equivalent to $0 ~ /regex/; here I negated it.
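A quick illustration of the negated-regexp pattern, dropping any line that contains skip:
$ printf 'keep\nskip this\nkeep too\n' | awk '!/skip/'
keep
keep too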
If there's no other command inside the loop, you can simplify the loop with one awk command:
awk '!/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/' 2019-01-{01..28}.csv >>user_history.csv
But this one will throw an error and stop executing if one of the files does not exist.
Another way is:
awk '!/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/' 2019-01-[0-3][0-9].csv >>user_history.csv
This one only matches existing filenames instead of looping over a fixed list.
It won't stop executing or throw an error, so if a file is missing you won't know, and it will match extra files if they exist.
For example, it will read 2019-01-34.csv if it exists.
So if you want the warnings (warnings won't affect the results) but don't want the command to stop, use the for loop version above.
Pitfalls:
[0-3][1-9] won't match 10, 20 and 30, but will match 32 to 39.
[0-9]* will also match longer numbers, and files are processed in string order, so e.g. 20 to 29 sort before 3.
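If you want explicit control over the date range without any of these glob pitfalls, a small existence test inside the loop also works (a sketch using the same header pattern):
for i in {01..28}; do
  f="2019-01-$i.csv"
  [ -e "$f" ] || continue    # silently skip dates with no log file
  awk '!/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/' "$f"
done >> user_history.csv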
Thanks to @Tiw and @RavinderSingh13 for their guidance. Here is the final awk script that works well for my case, where I have daily files from multiple days, months, and years (only 2018 and 2019 in this case). Note the month part must be [0-1][0-9] rather than [0-1][0-2], which would miss months 03 to 09:
awk '!/"_time",PIN,FULLNAME,OFFCODE,Acronym,Name/' 201[8-9]-[0-1][0-9]-[0-3][0-9].csv >> user_history.csv

Update version number in property file using bash

I am new to bash scripting and I need help with awk. The thing is that I have a property file with a version inside, and I want to update it.
version=1.1.1.0
and I use awk to do that
file="version.properties"
awk -F'["]' -v OFS='"' '/version=/{
split($4,a,".");
$4=a[1]"."a[2]"."a[3]"."a[4]+1
}
;1' $file > newFile && mv newFile $file
but I am getting a strange result: version="1.1.1.0""...1
Could someone please help me with this?
You mentioned in your comment you want to update the file in place. You can do that in a one-liner with perl:
perl -pe '/^version=/ and s/(\d+\.\d+\.\d+\.)(\d+)/$1 . ($2+1)/e' -i version.properties
Explanation
-e is followed by a script to run. With -p and -i, the effect is to run that script on each line, and modify the file in place if the script changes anything.
The script itself, broken down for explanation, is:
/^version=/ and # Do the following on lines starting with `version=`
s/ # Make a replacement on those lines
(\d+\.\d+\.\d+\.)(\d+)/ # Match x.y.z.w, and set $1 = `x.y.z.` and $2 = `w`
$1 . ($2+1)/ # Replace x.y.z.w with a copy of $1, followed by w+1
e # This tells Perl the replacement is Perl code rather
# than a text string.
Example run
$ cat foo.txt
version=1.1.1.2
$ perl -pe '/^version=/ and s/(\d+\.\d+\.\d+\.)(\d+)/$1 . ($2+1)/e' -i foo.txt
$ cat foo.txt
version=1.1.1.3
This is not the best way, but here's one fix.
Test case
I am assuming the input file has at least one line that is exactly version=1.1.1.0.
$ awk -F'["]' -v OFS='"' '/version=/{
> split($4,a,".");
> $4=a[1]"."a[2]"."a[3]"."a[4]+1
> }
> ;1' <<<'version=1.1.1.0'
Output:
version=1.1.1.0"""...1
The """ is because you are assigning to field 4 ($4). When you do that, awk adds field separators (OFS) between fields 1 and 2, 2 and 3, and 3 and 4. Three OFS => """, in your example.
Minimal change
$ awk -F'["]' -v OFS='"' '/version=/{
split($1,a,".");
$1=a[1]"."a[2]"."a[3]"."a[4]+1;
print
}
' <<<'version=1.1.1.0'
version=1.1.1.1
Two changes:
Change $4 to $1
Since the input field separator (-F) is ["], $4 is whatever would be after the third " (if there were any in the input). Therefore, split($4, ...) splits an empty field. The contents of the line, before the first " (if any), are in $1.
print at the end instead of ;1
The 1 after the closing curly brace is the next condition, and there is no action specified. The default action is to print the current line, as modified, so the 1 triggers printing. Instead, just print within your action when you are done processing. That way your action is self-contained. (Of course, if you needed to do other processing, you might want to print later, after that processing.)
You can use the = as the delimiter, like this:
awk -F= -v v=1.0.1 '$1=="version"{printf "version=\"%s\"\n", v; next} 1' file.properties
The trailing 1 prints every other line of the properties file unchanged.
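If you'd rather keep the increment behavior from the question instead of passing a fixed version, an all-awk variant might look like this (a sketch, assuming an unquoted version=x.y.z.w line as shown above):
awk -F= '$1 == "version" {
    n = split($2, a, ".")    # a[1..n] holds the version components
    a[n]++                   # bump the last component
    v = a[1]; for (i = 2; i <= n; i++) v = v "." a[i]
    $0 = "version=" v
} 1' version.properties > newFile && mv newFile version.properties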
