Merge two rows of a file - shell

I have an input file with a large amount of data in the pattern below; part of it is shown here:
Data1
C
In;
CP
In;
D
In;
Q
Out;
Data2
CP
In;
D
In;
Q
Out;
Data3
CP
In;
CPN
In;
D
In;
QN
Out;
I want my output to be:
Data1(C,CP,D,Q)
In C;
In CP;
In D;
Out Q;
Data2 (CP,D,Q)
In CP;
In D;
Out Q;
Data3 (CP,CPN,D,QN)
In CP;
In CPN;
In D;
Out QN;
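For reference, here is one sketch that reproduces the multi-block output asked for above. It assumes lines ending in ";" are direction lines, that a bare line arriving while another bare line is still pending marks a block header (e.g. Data1), and that every block ends with a direction line. The file name Input_file is hypothetical.

```shell
# Recreate the sample from the question
cat > Input_file <<'EOF'
Data1
C
In;
CP
In;
D
In;
Q
Out;
Data2
CP
In;
D
In;
Q
Out;
Data3
CP
In;
CPN
In;
D
In;
QN
Out;
EOF

awk '
function flush(   i) {                  # emit the buffered block, if any
    if (hdr == "") return
    printf "%s(%s)\n", hdr, names       # e.g. Data1(C,CP,D,Q)
    for (i = 1; i <= n; i++) print body[i]
    hdr = names = ""; n = 0
}
/;$/ {                                  # direction line: "In;" or "Out;"
    dir = $0; sub(/;$/, "", dir)
    body[++n] = dir " " pend ";"        # swap into "In C;" form
    names = (names == "" ? pend : names "," pend)
    pend = ""
    next
}
{                                       # bare line: becomes the pending name;
    if (pend != "") {                   # two bare lines in a row means the
        flush(); hdr = pend             # first one was a new block header
    }
    pend = $0
}
END { flush() }
' Input_file | tee merged.out
```

This prints Data1(C,CP,D,Q) followed by In C;, In CP;, In D;, Out Q;, and likewise for Data2 and Data3 (with the question's inconsistent spacing normalized).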
I tried the code given in the comment section below, but I am getting an error. Corrections are welcome.

A variation on @EdMorton's suggestion, fixing the desired order of fields:
$ awk 'FNR==1{print;next}!(NR%2){a=$0; next} {printf "%s %s%s%s", $1,a,FS,ORS}' FS=';' file
Data
In A1;
In A2;
Out Z;

$ awk 'NR%2{print sep $0; sep=OFS; next} {printf "%s", $0}' file
Data
A1 In;
A2 In;
Z Out;

Could you please try the following, written and tested with the shown samples.
awk '
FNR==1{
print
next
}
val{
sub(/;$/,OFS val"&")
print
val=""
next
}
{
val=$0
}
END{
if(val!=""){
print val
}
}' Input_file
Issues in the OP's attempt:
1st: The awk code should be wrapped in single quotes, so "$1=="A1" should be changed to '$1=="A1".
2nd: The condition logic also looks wrong to me: if we look only for A1 and In specifically, then other lines like Z and Out will be missed, hence the approach above.
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==1{ ##Checking condition if this is first line then do following.
print ##Printing current line here.
next ##next will skip all further statements from here.
}
val{ ##Checking condition if val is NOT NULL then do following.
sub(/;$/,OFS val"&") ##Substituting the last semicolon with OFS, val and the semicolon here.
print ##printing current line here.
val="" ##Nullify val here.
next ##next will skip all further statements from here.
}
{
val=$0 ##Setting current line value to val here.
}
END{ ##Starting END block of this program from here.
if(val!=""){ ##Checking condition if val is NOT NULL then do following.
print val ##Printing val here.
}
}' Input_file ##Mentioning Input_file name here.

And a variation on @vgersh99's, but setting FS twice: at the beginning and at the end.
awk -v FS='\n' -v RS= '
{
gsub(";", " ");
r2= $3 $2;
r3= $5 $4;
r4= $7 $6;
}
{print $1}
{print r2 FS}
{print r3 FS}
{print r4 FS}' FS=';' file
Data
In A1;
In A2;
Out Z;
Does it give you an error?


awk to get value for a column of next line and add it to the current line in shellscript

I have a csv file, let's say lines:
cat lines
1:abc
6:def
17:ghi
21:tyu
I wanted to achieve something like this
1:6:abc
6:17:def
17:21:ghi
21::tyu
Tried the below code, but it didn't work:
awk 'BEGIN{FS=OFS=":"}NR>1{nln=$1;cl=$2}NR>0{print $1,nln,$2}' lines
1::abc
6:6:def
17:17:ghi
21:21:tyu
Can you please help?
Here is a potential AWK solution:
cat lines
1:abc
6:def
17:ghi
21:tyu
awk -F":" '{num[NR]=$1; letters[NR]=$2}; END{for(i=1;i<=NR;i++) print num[i] ":" num[i + 1] ":" letters[i]}' lines
1:6:abc
6:17:def
17:21:ghi
21::tyu
Formatted:
awk '
BEGIN {FS=OFS=":"}
{
num[NR] = $1;
letters[NR] = $2
}
END {for (i = 1; i <= NR; i++)
print num[i], num[i + 1], letters[i]
}
' lines
1:6:abc
6:17:def
17:21:ghi
21::tyu
Basically this is your solution, but I switched the order of the code blocks and added the END block to output the last record; you were close:
awk 'BEGIN{FS=OFS=":"}FNR>1{print p,$1,q}{p=$1;q=$2}END{print p,"",q}' file
Explained:
$ awk 'BEGIN {
FS=OFS=":" # delims
}
FNR>1 { # all but the first record
print p,$1,q # output $1 and $2 from the previous round
}
{
p=$1 # store for the next round
q=$2
}
END { # gotta output the last record in the END
print p,"",q # "" feels like cheating
}' file
Output:
1:6:abc
6:17:def
17:21:ghi
21::tyu
1st solution: Here is a tac + awk + tac solution. Written and tested with the shown samples only.
tac Input_file |
awk '
BEGIN{
FS=OFS=":"
}
{
prev=(prev?$2=prev OFS $2:$2=OFS $2)
}
{
prev=$1
}
1
' | tac
Explanation: Adding detailed explanation for above code.
tac Input_file | ##Printing lines from bottom to top of Input_file.
awk ' ##Getting input from previous command as input to awk.
BEGIN{ ##Starting BEGIN section from here.
FS=OFS=":" ##Setting FS and OFS as colon here.
}
{
prev=(prev?$2=prev OFS $2:$2=OFS $2) ##Creating prev if previous NOT NULL then add its value prior to $2 with prev OFS else add OFS $2 in it.
}
{
prev=$1 ##Setting prev to $1 value here.
}
1 ##printing current line here.
' | tac ##Sending awk output to tac to make it in actual sequence.
2nd solution: An awk-only solution that passes Input_file to it twice.
awk '
BEGIN{
FS=OFS=":"
}
FNR==NR{
if(FNR>1){
arr[FNR-1]=$1
}
next
}
{
$2=(FNR in arr)?(arr[FNR] OFS $2):OFS $2
}
1
' Input_file Input_file
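A quick sanity check of this two-pass version on the shown sample (hypothetical file name lines):

```shell
# Recreate the sample input
printf '1:abc\n6:def\n17:ghi\n21:tyu\n' > lines

# First pass records the next line's key; second pass prepends it as field 2
awk '
BEGIN{ FS=OFS=":" }
FNR==NR{ if(FNR>1) arr[FNR-1]=$1; next }
{ $2=(FNR in arr)?(arr[FNR] OFS $2):OFS $2 }
1
' lines lines | tee pairs.out
```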

Need help with a Unix script to read data from a specific position and use the extracted value in a query

Input file:
ADSDWETTYT017775227ACG
ADSDWETTYT029635225HCG
ADSDWETTYC018525223JCG
ADSDWETTYC987415221ACG
ADSDWETTCC891235219ACG
ADSDWETTTT074565217ACG
ADSDWETTYT567895213ACG
ADSDWETTYH037535215ACG
ADSDWETTYC051595211ACG
ADSDWETTYT052465209ACG
ADSDWETTYT067595207ACG
ADSDWETTYT077515205ACG
I need to check whether the 10th position in each line contains/starts with T; if it starts with "T" then I need to take the 4 characters starting at position 16.
From the above file I am expecting the below output:
'5227','5225','5217','5213','5209','5207','5205'
This result should be assigned to some constant (result below) and used in the query WHERE clause like below:
result=$(awk '
BEGIN{
conf="" };
{ if(substr($0,10,1)=="T"){
conf=substr($0,16,4);
{NT==1?s="'\''"conf:s=s"'\'','\''"conf}
}
}
END {print s"'\''"}
' $INPUT_FILE_PATH)
db2 "EXPORT TO ${OUTPUT_FILE} OF DEL select STATUS FROM TRAN where TN_NR in (${result})"
I need some help to enhance the awk command and to pass the constant into the query WHERE clause. Kindly help.
With your shown samples and attempts, please try the following awk code.
awk -v s1="'" 'BEGIN{OFS=", "} substr($0,10,1)=="T"{val=(val?val OFS:"") (s1 substr($0,16,4) s1)} END{print val}' Input_file
A non-one-liner form of the above code:
awk -v s1="'" '
BEGIN{ OFS=", " }
substr($0,10,1)=="T"{
val=(val?val OFS:"") (s1 substr($0,16,4) s1)
}
END{
print val
}
' Input_file
To save the output of this code into a shell variable, try the following:
value=$(awk -v s1="'" 'BEGIN{OFS=", "} substr($0,10,1)=="T"{val=(val?val OFS:"") (s1 substr($0,16,4) s1)} END{print val}' Input_file)
Explanation: Adding detailed explanation for above code.
awk -v s1="'" ' ##Starting awk program from here setting s1 to ' here.
BEGIN{ OFS=", " } ##Setting OFS as comma space here.
substr($0,10,1)=="T"{ ##Checking condition if 10th character is T then do following.
val=(val?val OFS:"") (s1 substr($0,16,4) s1) ##Creating val which has values from current line as per OP requirement.
}
END{ ##Starting END block of this program from here.
print val ##Printing val here.
}
' Input_file ##Mentioning Input_file name here.
I'd use sed for this:
sed -En '/^.{9}T/ s/^.{15}(....).*/\1/p' file
And then to get your exact output, pipe that into
... | sed "s/.*/'&'/" | paste -sd,
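Putting both sed stages and paste together on the question's sample (a sketch; the explicit '-' operand makes paste read stdin portably):

```shell
# Recreate the sample input
cat > Input_file <<'EOF'
ADSDWETTYT017775227ACG
ADSDWETTYT029635225HCG
ADSDWETTYC018525223JCG
ADSDWETTYC987415221ACG
ADSDWETTCC891235219ACG
ADSDWETTTT074565217ACG
ADSDWETTYT567895213ACG
ADSDWETTYH037535215ACG
ADSDWETTYC051595211ACG
ADSDWETTYT052465209ACG
ADSDWETTYT067595207ACG
ADSDWETTYT077515205ACG
EOF

# keep lines whose 10th char is T; extract positions 16-19; quote; join on ,
result=$(sed -En '/^.{9}T/ s/^.{15}(....).*/\1/p' Input_file |
         sed "s/.*/'&'/" |
         paste -sd, -)
echo "$result" | tee result.out
```

This echoes '5227','5225','5217','5213','5209','5207','5205', ready to interpolate into the WHERE clause.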
I'd use perl over awk here for its better arrays (in particular, joining one into a string). Something like:
perl -nE "push @n, substr(\$_, 15, 4) if /^.{9}T/;
END { say join(',', map { \"'\$_'\" } @n) }" "$INPUT_FILE_PATH"

print columns based on column name

Let's say I have a file test.txt that contains
a,b,c,d,e
1,2,3,4,5
6,7,8,9,10
I want to print out columns based on matching column names, either from another text file or from an array. So for example if I was given
arr=(a b c)
I want my output to then be
a,b,c
1,2,3
6,7,8
How can I do this with bash utilities/awk/sed? My actual text file is 3GB (and the line I want to match column values on is actually line 3), so efficient solutions are appreciated. This is what I have so far:
for j in "${arr[@]}"; do awk -F ',' -v a=$j '{ for(i=1;i<=NF;i++) {if($i==a) {print $i}}}' test.txt; done
but the output I get is
a
b
c
which is not only missing the other rows, but also prints each column name on its own line.
With your shown samples, please try the following. The code reads 2 files: another_file.txt (which has a b c as per the samples) and the actual input file named test.txt (which has all the values in it).
awk '
FNR==NR{
for(i=1;i<=NF;i++){
arr[$i]
}
next
}
FNR==1{
for(i=1;i<=NF;i++){
if($i in arr){
valArr[i]
header=(header?header OFS:"")$i
}
}
print header
next
}
{
val=""
for(i=1;i<=NF;i++){
if(i in valArr){
val=(val?val OFS:"")$i
}
}
print val
}
' another_file.txt FS="," OFS="," test.txt
Output will be as follows:
a,b,c
1,2,3
6,7,8
Explanation: Adding detailed explanation for above solution.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE while reading another_text file here.
for(i=1;i<=NF;i++){ ##Traversing through all fields of current line.
arr[$i] ##Creating arr with index of current field value.
}
next ##next will skip all statements from here.
}
FNR==1{ ##Checking if this is 1st line for test.txt file.
for(i=1;i<=NF;i++){ ##Traversing through all fields of current line.
if($i in arr){ ##If current field values comes in arr then do following.
valArr[i] ##Creating valArr which has index of current field number.
header=(header?header OFS:"")$i ##Creating header which has each field value in it.
}
}
print header ##Printing header here.
next ##next will skip all statements from here.
}
{
val="" ##Nullifying val here.
for(i=1;i<=NF;i++){ ##Traversing through all fields of current line.
if(i in valArr){ ##Checking if i is present in valArr then do following.
val=(val?val OFS:"")$i ##Creating val which has current field value.
}
}
print val ##printing val here.
}
' another_file.txt FS="," OFS="," test.txt ##Mentioning Input_file names here.
Here is how you can do this with a single-pass awk command:
arr=(a c e)
awk -v cols="${arr[*]}" 'BEGIN {FS=OFS=","; n=split(cols, tmp, / /); for (i=1; i<=n; ++i) hdr[tmp[i]]} NR==1 {for (i=1; i<=NF; ++i) if ($i in hdr) hnum[i]} {for (i=1; i<=NF; ++i) if (i in hnum) {printf "%s%s", (f ? OFS : ""), $i; f=1} f=0; print ""}' file
a,c,e
1,3,5
6,8,10
A more readable form:
awk -v cols="${arr[*]}" '
BEGIN {
FS = OFS = ","
n = split(cols, tmp, / /)
for (i=1; i<=n; ++i)
hdr[tmp[i]]
}
NR == 1 {
for (i=1; i<=NF; ++i)
if ($i in hdr)
hnum[i]
}
{
for (i=1; i<=NF; ++i)
if (i in hnum) {
printf "%s%s", (f ? OFS : ""), $i
f = 1
}
f = 0
print ""
}' file
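As a usage check, the same command with the question's original arr=(a b c) selects the first three columns. A sketch ("${arr[*]}" with arr=(a b c) expands to "a b c", inlined here so the snippet is array-free and sh-compatible):

```shell
# Recreate test.txt from the question
printf 'a,b,c,d,e\n1,2,3,4,5\n6,7,8,9,10\n' > test.txt

awk -v cols="a b c" '
BEGIN {
    FS = OFS = ","
    n = split(cols, tmp, / /)
    for (i=1; i<=n; ++i) hdr[tmp[i]]   # wanted column names
}
NR == 1 {
    for (i=1; i<=NF; ++i)
        if ($i in hdr) hnum[i]         # wanted column numbers
}
{
    for (i=1; i<=NF; ++i)
        if (i in hnum) {
            printf "%s%s", (f ? OFS : ""), $i
            f = 1
        }
    f = 0
    print ""
}' test.txt | tee sel.out
```

This yields the desired a,b,c / 1,2,3 / 6,7,8 output.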

Match columns between files and generate file with combination of data in terminal/powershell/command line Bash

I have two .txt files of different lengths and would like to do the following:
If a value in column 1 of file 1 is present in column 1 of file 2, print column 2 of file 2 followed by the corresponding whole line from file 1.
I have tried permutations of awk, but am so far unsuccessful!
Thank you!
File 1:
MARKERNAME EA NEA BETA SE
10:1000706 T C -0.021786390809225 0.519667838651725
1:715265 G C 0.0310128798578049 0.0403763946716293
10:1002042 CCTT C 0.0337857775471699 0.0403300629299562
File 2:
CHR:BP SNP CHR BP GENPOS ALLELE1 ALLELE0 A1FREQ INFO
1:715265 rs12184267 1 715265 0.0039411 G C 0.964671
1:715367 rs12184277 1 715367 0.00394384 A G 0.964588
Desired File 3:
SNP MARKERNAME EA NEA BETA SE
rs12184267 1:715265 G C 0.0310128798578049 0.0403763946716293
Attempted:
awk -F'|' 'NR==FNR { a[$1]=1; next } ($1 in a) { print $3, $0 }' file1 file2
awk 'NR==FNR{A[$1]=$2;next}$0 in A{$0=A[$0]}1' file1 file2
With your shown samples, could you please try the following.
awk '
FNR==1{
if(++count==1){ col=$0 }
else{ print $2,col }
next
}
FNR==NR{
arr[$1]=$0
next
}
($1 in arr){
print $2,arr[$1]
}
' file1 file2
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==1{ ##Checking condition if this is first line of file(s).
if(++count==1){ col=$0 } ##Checking if count is 1 then set col as current line.
else{ print $2,col } ##Checking if above is not true then print 2nd field and col here.
next ##next will skip all further statements from here.
}
FNR==NR{ ##This will be TRUE when file1 is being read.
arr[$1]=$0 ##Creating arr with 1st field index and value is current line.
next ##next will skip all further statements from here.
}
($1 in arr){ ##Checking condition if 1st field present in arr then do following.
print $2,arr[$1] ##Printing 2nd field, arr value here.
}
' file1 file2 ##Mentioning Input_files name here.
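A reproducible run of the answer above on the shown samples (hypothetical file names file1 and file2):

```shell
# Recreate the sample inputs
cat > file1 <<'EOF'
MARKERNAME EA NEA BETA SE
10:1000706 T C -0.021786390809225 0.519667838651725
1:715265 G C 0.0310128798578049 0.0403763946716293
10:1002042 CCTT C 0.0337857775471699 0.0403300629299562
EOF

cat > file2 <<'EOF'
CHR:BP SNP CHR BP GENPOS ALLELE1 ALLELE0 A1FREQ INFO
1:715265 rs12184267 1 715265 0.0039411 G C 0.964671
1:715367 rs12184277 1 715367 0.00394384 A G 0.964588
EOF

awk '
FNR==1{
  if(++count==1){ col=$0 }
  else{ print $2,col }
  next
}
FNR==NR{
  arr[$1]=$0
  next
}
($1 in arr){
  print $2,arr[$1]
}
' file1 file2 | tee matched.out
```

Only 1:715265 appears in both files, so exactly one data line follows the merged header.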

How to run a bash script in a loop

I wrote a bash script to pull substrings from two input files and save them to an output file. The files look like this:
input file 1
>genotype1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
input file 2
gene1 10 20
gene2 40 50
genen x y
My script:
>output_file
cat input_file2 | while read row; do
echo $row > temp
geneName=`awk '{print $1}' temp`
startPos=`awk '{print $2}' temp`
endPos=`awk '{print $3}' temp`
length=$(expr $endPos - $startPos)
for i in temp; do
echo ">${geneName}" >> genes_fasta
awk -v S=$startPos -v L=$length '{print substr($0,S,L)}' input_file1 >> output file
done
done
How can I make it work in a loop for more than one string in input file 1?
The new input file looks like this:
>genotype1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>genotype2
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
>genotypen...
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn...
I would like to have a different output file for every genotype, with the genotype name as the file name.
Thank you!
If I'm understanding correctly, would you try the following:
awk '
FNR==NR {
name[NR] = $1
start[NR] = $2
len[NR] = $3 - $2
count = NR
next
}
/^>/ {
sub(/^>/,"")
genotype=$0
next
}
{
for (i = 1; i <= count; i++) {
print ">" name[i] > genotype
print substr($0, start[i], len[i]) >> genotype
}
close(genotype)
}' input_file2 input_file1
input_file1:
>genotype1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>genotype2
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
>genotype3
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Input_file2:
gene1 10 20
gene2 40 50
gene3 20 25
[Results]
genotype1:
>gene1
aaaaaaaaaa
>gene2
aaaaaaaaaa
>gene3
aaaaa
genotype2:
>gene1
bbbbbbbbbb
>gene2
bbbbbbbbbb
>gene3
bbbbb
genotype3:
>gene1
nnnnnnnnnn
>gene2
nnnnnnnnnn
>gene3
nnnnn
[EDIT]
If you want to store the output files to a different directory,
please try the following instead:
dir="./outdir" # directory name to store the output files
# you can modify the name as you want
mkdir -p "$dir"
awk -v dir="$dir" '
FNR==NR {
name[NR] = $1
start[NR] = $2
len[NR] = $3 - $2
count = NR
next
}
/^>/ {
sub(/^>/,"")
genotype=$0
next
}
{
for (i = 1; i <= count; i++) {
print ">" name[i] > dir"/"genotype
print substr($0, start[i], len[i]) >> dir"/"genotype
}
close(dir"/"genotype)
}' input_file2 input_file1
The first two lines are executed in bash to define and mkdir the destination directory.
Then the directory name is passed to awk via the -v option.
Hope this helps.
Could you please try the following, where I am assuming that your Input_file1's lines starting with > should be compared with the 1st column of Input_file2 (since the samples are confusing, this is based on the OP's attempt).
awk '
FNR==NR{
start_point[$1]=$2
end_point[$1]=$3
next
}
/^>/{
sub(/^>/,"")
val=$0
next
}
{
print val ORS substr($0,start_point[val],end_point[val])
val=""
}
' Input_file2 Input_file1
Explanation: Adding explanation for above code.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first Input_file named Input_file2 is being read.
start_point[$1]=$2 ##Creating an array named start_point with index $1 of current line and its value is $2.
end_point[$1]=$3 ##Creating an array named end_point with index $1 of current line and its value is $3.
next ##next will skip all further statements from here.
}
/^>/{ ##Checking condition if a line starts from > then do following.
sub(/^>/,"") ##Substituting starting > with NULL.
val=$0 ##Creating a variable val whose value is $0.
next ##next will skip all further statements from here.
}
{
print val ORS substr($0,start_point[val],end_point[val]) ##Printing val newline(ORS) and sub-string of current line whose start value is value of start_point[val] and end point is value of end_point[val].
val="" ##Nullifying variable val here.
}
' Input_file2 Input_file1 ##Mentioning Input_file names here.
