How to substring a particular length and append that substring text to end of line in unix - bash

Question Edited
Sincere Apologies for editing the question!!
I want to substring using start position and end position for 2 strings "20200224" and "LN". And append that result substring text to the end of line.
For Example,
EDITED INPUT TEXT
2020-02-25
07:24|/prd/data_fabric/prd_dfab_open/acct/process_date=20200224/data_src=ACB/source_country_code=LN/ACB_ACCT_HK_LN_01-part-0.orc
2020-02-25
07:24|/prd/data_fabric/prd_dfab_open/acct/process_date=20200224/data_src=ACB/source_country_code=LN/ACB_ACCT_HK_LN_01-part-1.orc
i want to substring "20200224" / "20200225" which is of start_position=20 and end_position=27 and append the same in the end of each line as below,
2020-02-25
07:24|/prd/data_fabric/prd_dfab_open/acct/process_date=20200224/data_src=ACB/source_country_code=LN/ACB_ACCT_HK_LN_01-part-0.orc|20200224|LN
2020-02-25
07:24|/prd/data_fabric/prd_dfab_open/acct/process_date=20200224/data_src=ACB/source_country_code
=LN/ACB_ACCT_HK_LN_01-part-1.orc|20200224|LN
Like this more lines are there in the file.
I would like to search based on 2 set of strings "process_date=" and "source_country_code=" and want to take the values between " and / which is 20200224 and LN. Append the same in end of line with pipe | delimiter

Use of awk:
line='/acct/process_date=20200224/data_src=ACB/source_country_code=LN/ACB_ACCT_HK_LN_01-part-1.orc'
Above can be achieved by reading file line by line.
var=`echo $line| tr '=' ' '| awk '{ print $2 }' | tr '/' ' ' | awk '{ print $1}'`
Reading anything after = then reading anything before /
echo $line$var

To match the 20th to 27th characters from the line, and paste it at the end of the line:
paste -d'|' input_file <(cut -c20-27 input_file)
/acct/process_date=20200224/data_src=ACB/source_country_code=LN/ACB_ACCT_HK_LN_01-part-1.orc|20200224
/acct/process_date=20200225/data_src=ACB/source_country_code=MO/ACB_ACCT_HK_MO_01-part-0.orc|20200225
To match the first number in the line, and paste it at the end of the line:
sed -i 's/\([0-9]\+\)\(.*\)/\1\2|\1/' input_file
/acct/process_date=20200224/data_src=ACB/source_country_code=LN/ACB_ACCT_HK_LN_01-part-1.orc|20200224
/acct/process_date=20200225/data_src=ACB/source_country_code=MO/ACB_ACCT_HK_MO_01-part-0.orc|20200225

Related

Extract String before bracket and create new line

I have data in below format
ABC-ERW 12344 ZYX 12345
FFANKN 2345 QW [123457, 89053]
FAFDJ-ER 1234 MNO [6532, 789, 234578]
I want to create the data in below format using sed or awk.
ABC-ERW 12344 ZYX 12345
FFANKN 2345 QW 123457
FFANKN 2345 QW 89053
FAFDJ-ER 1234 MNO 6532
FAFDJ-ER 1234 MNO 789
FAFDJ-ER 1234 MNO 234578
I can extract the data before bracket but I don't know how to concatenate the same with data from bracket repeatedly.
My Effort :--
# !/bin/bash
while IFS= read -r line
do
echo "$line"
cnt=`echo $line | grep -o "\[" | wc -l`
if [ $cnt -gt 0 ]
then
startstr=`echo $line | awk -F[ '{print $1}'`
echo $startstr
intrstr=`echo $line | cut -d "[" -f2 | cut -d "]" -f1`
echo $intrstr
else
echo "$line" >> newfile.txt
fi
done < 1.txt
I am able to get the first part and also keep the rows not having "[" in new file but I dont know how to get the values in "[" and pass it at end as number of variables in "[" keep changing randomly.
Regards
With your shown samples, please try following awkcode.
awk '
match($0,/\[[^]]*\]$/){
num=split(substr($0,RSTART+1,RLENGTH-2),arr,", ")
for(i=1;i<=num;i++){
print substr($0,1,RSTART-1) arr[i]
}
next
}
1
' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
match($0,/\[[^]]*\]$/){ ##Using match function to match from [ till ] at the end of line.
num=split(substr($0,RSTART+1,RLENGTH-2),arr,", ") ##Splitting matched values by regex above and passing into array named arr with delimiters comma and space.
for(i=1;i<=num;i++){ ##Running for loop till value of num.
print substr($0,1,RSTART-1) arr[i] ##printing sub string before matched along with element of arr with index of i.
}
next ##next will skip all further statements from here.
}
1 ##1 will print current line.
' Input_file ##Mentioning Input_file name here.
Suggesting simple awk script:
awk 'NR==1{print}{for (i=2;i<NF;i++)print $1, $i}' FS="( \\\[)|(, )|(\\\]$)" input.1.txt
Explanation:
FS="( \\\[)|(, )|(\\\]$)" Set awk field seperator to be either [ , ]EOL
This will make the interesting fields $2 ---> $FN to be appended to $1
NR==1{print} print first line only as it is.
{for (i=2;i<NF;i++)print $1, $i} for 2nd line on, print: field $1 appended by current field.
This might work for you (GNU sed):
sed -E '/(.*)\[([^,]*), /{s//\1\2\n\1[/;P;D};s/[][]//g' file
Match the string up to the opening square bracket and also the string after before the comma and space.
Replace the entire match by the leading and trailing matching strings, followed be a newline and the leading matching string.
Print/delete the first line and repeat.
The last line of any repeat above will fail because there is not trailing comma space, in which case the opening and closing square brackets should also be removed.
Alternative:
sed -E ':a;s/([^\n]*)\[([^,]*), /\1\2\n\1[/;ta;s/[][]//g' file

Bash + sed/awk/cut to delete nth character

I trying to delete 6,7 and 8th character for each line.
Below is the file containing text format.
Actual output..
#cat test
18:40:12,172.16.70.217,UP
18:42:15,172.16.70.218,DOWN
Expecting below, after formatting.
#cat test
18:40,172.16.70.217,UP
18:42,172.16.70.218,DOWN
Even I tried with below , no luck
#awk -F ":" '{print $1":"$2","$3}' test
18:40,12,172.16.70.217,UP
#sed 's/^\(.\{7\}\).\(.*\)/\1\2/' test { Here I can remove only one character }
18:40:1,172.16.70.217,UP
Even with cut also failed
#cut -d ":" -f1,2,3 test
18:40:12,172.16.70.217,UP
Need to delete character in each line like 6th , 7th , 8th
Suggestion please
With GNU cut you can use the --complement switch to remove characters 6 to 8:
cut --complement -c6-8 file
Otherwise, you can just select the rest of the characters yourself:
cut -c1-5,9- file
i.e. characters 1 to 5, then 9 to the end of each line.
With awk you could use substrings:
awk '{ print substr($0, 1, 5) substr($0, 9) }' file
Or you could write a regular expression, but the result will be more complex.
For example, to remove the last three characters from the first comma-separated field:
awk -F, -v OFS=, '{ sub(/...$/, "", $1) } 1' file
Or, using sed with a capture group:
sed -E 's/(.{5}).{3}/\1/' file
Capture the first 5 characters and use them in the replacement, dropping the next 3.
it's a structured text, why count the chars if you can describe them?
$ awk '{sub(":..,",",")}1' file
18:40,172.16.70.217,UP
18:42,172.16.70.218,DOWN
remove the seconds.
The solutions below are generic and assume no knowledge of any format. They just delete character 6,7 and 8 of any line.
sed:
sed 's/.//8;s/.//7;s/.//6' <file> # from high to low
sed 's/.//6;s/.//6;s/.//6' <file> # from low to high (subtract 1)
sed 's/\(.....\).../\1/' <file>
sed 's/\(.{5}\).../\1/' <file>
s/BRE/replacement/n :: substitute nth occurrence of BRE with replacement
awk:
awk 'BEGIN{OFS=FS=""}{$6=$7=$8="";print $0}' <file>
awk -F "" '{OFS=$6=$7=$8="";print}' <file>
awk -F "" '{OFS=$6=$7=$8=""}1' <file>
This is 3 times the same, removing the field separator FS let awk assume a field to be a character. We empty field 6,7 and 8, and reprint the line with an output field separator OFS which is empty.
cut:
cut -c -5,9- <file>
cut --complement -c 6-8 <file>
Just for fun, perl, where you can assign to a substring
perl -pe 'substr($_,5,3)=""' file
With awk :
echo "18:40:12,172.16.70.217,UP" | awk '{ $0 = ( substr($0,1,5) substr($0,9) ) ; print $0}'
Regards!
If you are running on bash, you can use the string manipulation functionality of it instead of having to call awk, sed, cut or whatever binary:
while read STRING
do
echo ${STRING:0:5}${STRING:9}
done < myfile.txt
${STRING:0:5} represents the first five characters of your string, ${STRING:9} represents the 9th character and all remaining characters until the end of the line. This way you cut out characters 6,7 and 8 ...

Extract the last three columns from a text file with awk

I have a .txt file like this:
ENST00000000442 64073050 64074640 64073208 64074651 ESRRA
ENST00000000233 127228399 127228552 ARF5
ENST00000003100 91763679 91763844 CYP51A1
I want to get only the last 3 columns of each line.
as you see some times there are some empty lines between 2 lines which must be ignored. here is the output that I want to make:
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
awk  '/a/ {print $1- "\t" $-2 "\t" $-3}'  file.txt.
it does not return what I want. do you know how to correct the command?
Following awk may help you in same.
awk 'NF{print $(NF-2),$(NF-1),$NF}' OFS="\t" Input_file
Output will be as follows.
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
EDIT: Adding explanation of command too now.(NOTE this following command is for only explanation purposes one should run above command only to get the results)
awk 'NF ###Checking here condition NF(where NF is a out of the box variable for awk which tells number of fields in a line of a Input_file which is being read).
###So checking here if a line is NOT NULL or having number of fields value, if yes then do following.
{
print $(NF-2),$(NF-1),$NF###Printing values of $(NF-2) which means 3rd last field from current line then $(NF-1) 2nd last field from line and $NF means last field of current line.
}
' OFS="\t" Input_file ###Setting OFS(output field separator) as TAB here and mentioning the Input_file here.
You can use sed too
sed -E '/^$/d;s/.*\t(([^\t]*[\t|$]){2})/\1/' infile
With some piping:
$ cat file | tr -s '\n' | rev | cut -f 1-3 | rev
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
First, cat the file to tr to squeeze out repeted \ns to get rid of empty lines. Then reverse the lines, cut the first three fields and reverse again. You could replace the useless cat with the first rev.

print 1st string of a line if last 5 strings match input

I have a requirement to print the first string of a line if last 5 strings match specific input.
Example: Specified input is 2
India;1;2;3;4;5;6
Japan;1;2;2;2;2;2
China;2;2;2;2
England;2;2;2;2;2
Expected Output:
Japan
England
As you can see, China is excluded as it doesn't meet the requirement (last 5 digits have to be matched with the input).
grep ';2;2;2;2;2$' file | cut -d';' -f1
$ in a regex stands for "end of line", so grep will print all the lines that end in the given string
-d';' tells cut to delimit columns by semicolons
-f1 outputs the first column
You could use awk:
awk -F';' -v v="2" -v count=5 '
{
c=0;
for(i=2;i<=NF;i++){
if($i == v) c++
if(c>=count){print $1;next}
}
}' file
where
v is the value to match
count is the maximum number of value to print the wanted string
the for loop is parsing all fields delimited with a ; in order to find a match
This script doesn't need the 5 values 2 to be consecutive.
With sed:
sed -n 's/^\([^;]*\).*;2;2;2;2;2$/\1/p' file
It captures and output non ; first characters in lines ending with ;2;2;2;2;2
It can be shortened with GNU sed to:
sed -nE 's/^([^;]*).*(;2){5}$/\1/p' file
awk -F\; '/;2;2;2;2;2$/{print $1}' file
Japan
England

Edit data removing line breaks and putting everything in a row

Hi I'm new in shell scripting and I have been unable to do this:
My data looks like this (much bigger actually):
>SampleName_ZN189A
01000001000000000000100011100000000111000000001000
00110000100000000000010000000000001100000010000000
00110000000000001110000010010011111000000100010000
00000110000001000000010100000000010000001000001110
0011
>SampleName_ZN189B
00110000001101000001011100000000000000000000010001
00010000000000000010010000000000100100000001000000
00000000000000000000000010000000000010111010000000
01000110000000110000001010010000001111110101000000
1100
Note: After every 50 characters there is a line break, but sometimes less when the data finishes and there's a new sample name
I would like that after every 50 characters, the line break would be removed, so my data would look like this:
>SampleName_ZN189A
0100000100000000000010001110000000011100000000100000110000100000000000010000000000001100000010000000...
>SampleName_ZN189B
0011000000110100000101110000000000000000000001000100010000000000000010010000000000100100000001000000...
I tried using tr but I got an error:
tr '\n' '' < my_file
tr: empty string2
Thanks in advance
tr with "-d" deletes specified character
$ cat input.txt
00110000001101000001011100000000000000000000010001
00010000000000000010010000000000100100000001000000
00000000000000000000000010000000000010111010000000
01000110000000110000001010010000001111110101000000
1100
$ cat input.txt | tr -d "\n"
001100000011010000010111000000000000000000000100010001000000000000001001000000000010010000000100000000000000000000000000000010000000000010111010000000010001100000001100000010100100000011111101010000001100
You can use this awk:
awk '/^ *>/{if (s) print s; print; s="";next} {s=s $0;next} END {print s}' file
>SampleName_ZN189A
010000010000000000001000111000000001110000000010000011000010000000000001000000000000110000001000000000110000000000001110000010010011111000000100010000000001100000010000000101000000000100000010000011100011
>SampleName_ZN189B
001100000011010000010111000000000000000000000100010001000000000000001001000000000010010000000100000000000000000000000000000010000000000010111010000000010001100000001100000010100100000011111101010000001100
Using awk
awk '/>/{print (NR==1)?$0:RS $0;next}{printf $0}' file
if you don't care of the result which has additional new line on first line, here is shorter one
awk '{printf (/>/?RS $0 RS:$0)}' file
This might work for you (GNU sed):
sed '/^\s*>/!{H;$!d};x;s/\n\s*//2gp;x;h;d' file
Build up the record in the hold space and when encountering the start of the next record or the end-of-file remove the newlines and print out.
you can use this sed,
sed '/^>Sample/!{ :loop; N; /\n>Sample/{n}; s/\n//; b loop; }' file.txt
Try this
cat SampleName_ZN189A | tr -d '\r'
# tr -d deletes the given/specified character from the input
Using simple awk, Same will be achievable.
awk 'BEGIN{ORS=""} {print}' SampleName_ZN189A #Output doesn't contains an carriage return
at the end, If u want an line break at the end this works.
awk 'BEGIN{ORS=""} {print}END{print "\r"}' SampleName_ZN189A
# select the correct line break charachter (i.e) \r (or) \n (\r\n) depends upon the file format.

Resources