Regex for printing pattern from string - shell

I have a file with the content below and need to split it into two files.
o/p1 should contain everything inside the first pair of parentheses (), with the backticks (`) removed and only columns 1 and 2 printed.
o/p2 should contain the LOCATION line together with its value.
$ cat dt.txt
CREATE EXTERNAL TABLE `rte.fteff_ft`(
`dt` date,
`wk_id` int,
`yq_id` int(10,00),
`te_ind` string,
`yw_dt` date,
`em_dt` date comment dfdsf sdfsdf)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0007'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://dfdf/data/ffff/ODE/TdddfT/'
TBLPROPERTIES (
'last_modified_by'='asdas',
'last_modified_time'='1639551681',
'numFiles'='1',
'totalSize'='2848434',
'transient_lastDdlTime'='1639551681')
I need the output from the above in two files.
o/p1: a.txt
dt date,
wk_id int,
yq_id int(10,00),
te_ind string,
yw_dt date,
em_dt date
o/p2: b.txt
LOCATION
'hdfs://dfdf/data/ffff/ODE/TdddfT/'

First, use sed to operate on the range of lines from the one matching 'CREATE EXTERNAL' to the one matching 'ROW FORMAT DELIMITED': delete everything outside that range, then delete the two boundary lines themselves. Then remove the grave accents (backticks) and keep only the first 2 words of each line.
sed -E '/CREATE EXTERNAL/,/ROW FORMAT DELIMITED/!d;//d;s/`//g; s/(([^ ]+ ){2}).*/\1/' dt.txt > a.txt
EDIT: To remove the commas at the end of each line, add another command, s/,$// . Make sure to anchor the comma to the end of the line, otherwise the substitution removes the comma inside the int declaration instead.
sed -E '/CREATE EXTERNAL/,/ROW FORMAT DELIMITED/!d;//d;s/`//g;s/,$//; s/(([^ ]+ ){2}).*/\1/' dt.txt > a.txt
Second, use the -A option to grep to match the word 'LOCATION' on a line by itself plus the following 1 line.
grep -A 1 '^LOCATION$' dt.txt > b.txt
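Both steps can be tried end to end on a trimmed copy of the sample. This is only a sketch: the scratch directory, the shortened dt.txt, and the final s/ $// (which strips the trailing space the two-word substitution leaves on the em_dt line) are assumptions added here, not part of the answer above.

```shell
#!/bin/sh
# Scratch directory and trimmed sample are assumptions for illustration.
workdir=$(mktemp -d)
cat > "$workdir/dt.txt" <<'EOF'
CREATE EXTERNAL TABLE `rte.fteff_ft`(
`dt` date,
`yq_id` int(10,00),
`em_dt` date comment dfdsf sdfsdf)
ROW FORMAT DELIMITED
LOCATION
'hdfs://dfdf/data/ffff/ODE/TdddfT/'
TBLPROPERTIES (
'numFiles'='1')
EOF

# Column list: keep the CREATE..ROW range, drop its two boundary lines,
# strip backticks and trailing commas, keep two words, trim a trailing space.
sed -E '/CREATE EXTERNAL/,/ROW FORMAT DELIMITED/!d;//d;s/`//g;s/,$//;s/(([^ ]+ ){2}).*/\1/;s/ $//' \
    "$workdir/dt.txt" > "$workdir/a.txt"

# LOCATION line plus the one line that follows it.
grep -A 1 '^LOCATION$' "$workdir/dt.txt" > "$workdir/b.txt"

cat "$workdir/a.txt" "$workdir/b.txt"
```

The heredoc delimiter is quoted ('EOF') so the shell does not try to run the backticks in the sample as command substitutions.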

Unix sed command - global replacement is not working

I have a scenario where we want to collapse repeated double quotes inside the data into a single double quote. Because the input is comma-delimited and every column is enclosed in double quotes "", this runs into the problem explained below:
The sample data looks like this:
"int","","123","abd"""sf123","top"
So, the output would be:
"int","","123","abd"sf123","top"
I tried the approach below, but only the first occurrence is replaced and I am not sure why:
sed -ie 's/,"",/,"NULL",/g;s/""/"/g;s/,"NULL",/,"",/g' inputfile.txt
replacing all ---> from ,"", to ,"NULL",
replacing all multiple occurrences of ---> from """ or "" or """" to " (single occurrence)
replacing 1 step changes back to original ---> from ,"NULL", to ,"",
But, only first occurrence is getting changed and remaining looks same as below:
If input is :
"int","","","123","abd"""sf123","top"
the output is coming as:
"int","","NULL","123","abd"sf123","top"
But, the output should be:
"int","","","123","abd"sf123","top"
You may try this perl with a lookahead:
perl -pe 's/("")+(?=")//g' file
"int","","123","abd"sf123","top"
"int","","","123","abd"sf123","top"
"123"abcs"
Where input is:
cat file
"int","","123","abd"""sf123","top"
"int","","","123","abd"""sf123","top"
"123"""""abcs"
Breakup:
("")+: Match 1+ pairs of double quotes
(?="): If those pairs are followed by a single "
Using sed
$ sed -E 's/(,"",)?"+(",)?/\1"\2/g' input_file
"int","","123","abd"sf123","top"
"int","","NULL","123","abd"sf123","top"
"int","","","123","abd"sf123","top"
With your shown samples, please try the following awk code. Written and tested in GNU awk; it should work in any version of awk.
awk '
BEGIN{ FS=OFS="," }
{
for(i=1;i<=NF;i++){
if($i!~/^""$/){
gsub(/"+/,"\"",$i)
}
}
}
1
' Input_file
Explanation: set the field separator and the output field separator to , for every line of Input_file. Then loop over the fields of each line; if a field is not an empty quoted field (""), globally replace every run of 1 or more " with a single ". Finally, print the line.
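The explanation above can be checked against the sample rows with a small wrapper. This is a sketch only: squeeze_quotes is a hypothetical helper name, and the awk program inside it is the one from the answer with indentation and a comment added.

```shell
#!/bin/sh
# squeeze_quotes is a hypothetical helper name wrapping the awk program above.
squeeze_quotes() {
    awk '
    BEGIN { FS = OFS = "," }
    {
        for (i = 1; i <= NF; i++) {
            # Leave empty quoted fields ("") alone; squeeze every other run of quotes.
            if ($i !~ /^""$/) {
                gsub(/"+/, "\"", $i)
            }
        }
    }
    1
    '
}

printf '%s\n' \
    '"int","","123","abd"""sf123","top"' \
    '"int","","","123","abd"""sf123","top"' \
    '"123"""""abcs"' | squeeze_quotes
```

Because the record is split on every comma, this assumes no field contains a comma inside its quotes.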
With sed you could repeat 1 or more times sets of "" using a group followed by matching a single "
Then in the replacement use a single "
sed -E 's/("")+"/"/g' file
For this content
$ cat file
"int","","123","abd"""sf123","top"
"int","","","123","abd"""sf123","top"
"123"""""abcs"
The output is
"int","","123","abd"sf123","top"
"int","","","123","abd"sf123","top"
"123"abcs"
sed 's#"""#"#' file
That works. I will demonstrate another method though, which you may also find useful in other situations.
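#!/bin/sh -x
# ed script: on line 4 (the 4th CSV field) replace """ with "
cat > ed1 <<EOF
4s/"""/"/
wq
EOF
cp file stack
# one field per line
tr ',' '\n' < stack > f2
ed -s f2 < ed1
# join the fields back together with commas (no trailing comma)
paste -sd, f2 > stack
rm -v ./f2
rm -v ./ed1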
The point of this is that if you have a big CSV record all on one line and you want to edit a specific field whose number you know, you can convert all the commas to newlines and use the field number as a line number to substitute, append after it, or insert before it with ed, and then convert back to CSV.
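The same "edit field N" idea can also be sketched without temporary files by letting awk address the field directly. This is an alternative added for illustration, not the method above: fix_field is a hypothetical helper name, and its argument is the 1-based field number.

```shell
#!/bin/sh
# fix_field is a hypothetical helper: $1 is the 1-based field number to edit.
fix_field() {
    # sub() changes only field n; awk then rebuilds the record using OFS.
    awk -F, -v OFS=, -v n="$1" '{ sub(/"""/, "\"", $n) } 1'
}

printf '%s\n' '"int","","123","abd"""sf123","top"' | fix_field 4
```

Like the tr-based version, this assumes that commas never appear inside a quoted field.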

sed replace string with pipe and stars

I have the following string:
|**barak**.version|2001.0132012031539|
in file text.txt.
I would like to replace it with the following:
|**barak**.version|2001.01.2012031541|
So I run:
sed -i "s/\|\*\*$module\*\*.version\|2001.0132012031539/|**$module**.version|$version/" text.txt
but the result is a duplicate instead of replacing:
|**barak**.version|2001.01.2012031541|**barak**.version|2001.0132012031539|
What am I doing wrong?
Here is the value for module and version:
$ echo $module
barak
$ echo $version
2001.01.2012031541
Assumptions:
lines of interest start and end with a pipe (|) and have one more pipe somewhere in the middle of the data
search is based solely on the value of ${module} existing between the 1st/2nd pipes in the data
we don't know what else may be between the 1st/2nd pipes
the version number is the only thing between the 2nd/3rd pipes
we don't know the version number that we'll be replacing
Sample data:
$ module='barak'
$ version='2001.01.2012031541'
$ cat text.txt
**barak**.version|2001.0132012031539| <<<=== leave this one alone
|**apple**.version|2001.0132012031539|
|**barak**.version|2001.0132012031539| <<<=== replace this one
|**chuck**.version|2001.0132012031539|
|**barak**.peanuts|2001.0132012031539| <<<=== replace this one
One sed solution with -Extended regex support enabled and making use of a capture group:
$ sed -E "s/^(\|[^|]*${module}[^|]*).*/\1|${version}|/" text.txt
Where:
\| - the escaped pipe at the start of the pattern matches a literal pipe (with -E an unescaped | would mean alternation); the pipes inside the bracket expressions [^|] and in the replacement are already literal
^(\|[^|]*${module}[^|]*) - first capture group that starts at the beginning of the line, starts with a pipe, then some number of non-pipe characters, then the search pattern (${module}), then more non-pipe characters (continues up to next pipe character)
.* - matches rest of the line (which we're going to discard)
\1|${version}| - replace line with our first capture group, then a pipe, then the new replacement value (${version}), then the final pipe
The above generates:
**barak**.version|2001.0132012031539|
|**apple**.version|2001.0132012031539|
|**barak**.version|2001.01.2012031541| <<<=== replaced
|**chuck**.version|2001.0132012031539|
|**barak**.peanuts|2001.01.2012031541| <<<=== replaced
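A minimal way to try the command above on a few of the sample lines, without the trailing annotations. set_version is a hypothetical wrapper name used only for this sketch.

```shell
#!/bin/sh
module='barak'
version='2001.01.2012031541'

# set_version is a hypothetical wrapper around the sed command above.
set_version() {
    sed -E "s/^(\|[^|]*${module}[^|]*).*/\1|${version}|/"
}

printf '%s\n' \
    '**barak**.version|2001.0132012031539|' \
    '|**apple**.version|2001.0132012031539|' \
    '|**barak**.peanuts|2001.0132012031539|' | set_version
```

Note that ${module} and ${version} are spliced into the sed script unescaped, so values containing /, & or regex metacharacters would need escaping first.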
An awk alternative using GNU awk:
awk -v mod="$module" -v vers="$version" -F \| '{ OFS=FS;split($2,map,".");inmod=substr(map[1],3,length(map[1])-4);if (inmod==mod) { $3=vers } }1' file
Pass two variables, mod and vers, to awk from $module and $version. Set the field delimiter to |. Split the second field into the array map using the split function with . as the delimiter. Then strip the leading and trailing "**" from the first element of the array with substr to expose the module name as inmod. Compare this to mod and, if they match, set the 3rd field to vers. The trailing 1 is shorthand for printing every line.
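The pipe is only special when you're using extended regular expressions: sed -E. In basic regular expressions a bare | is a literal pipe, while GNU sed treats the escaped form \| as alternation. Your pattern therefore contains an empty alternative that matches at the very start of the line, so the replacement is inserted there instead of replacing anything.
There's no reason to use extended regexes here; stick with basic regex and leave the pipes unescaped: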
sed "
# for lines matching module.version
/|\*\*$module\*\*.version|/ {
# replace the version
s/|2001.0132012031539|/|$version|/
}
" text.txt
or as an unreadable one-liner
sed "/|\*\*$module\*\*.version|/ s/|2001.0132012031539|/|$version|/" text.txt
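If the old version number is not known in advance, the same basic-regex approach can match whatever sits between the last two pipes. This is a sketch under assumptions: bump_version is a hypothetical name, the version is assumed to be the last pipe-delimited field, and the line is assumed to end with a pipe.

```shell
#!/bin/sh
module='barak'
version='2001.01.2012031541'

# bump_version is a hypothetical name. It rewrites the last |...| field,
# but only on the line holding $module's ".version" entry.
bump_version() {
    sed "/^|\*\*$module\*\*\.version|/ s/|[^|]*|\$/|$version|/"
}

printf '%s\n' \
    '|**barak**.version|2001.0132012031539|' \
    '|**barak**.peanuts|2001.0132012031539|' | bump_version
```

Inside the double quotes, \$ is needed so that sed, not the shell, sees the end-of-line anchor.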

How to remove part of the middle of a line/string by matching two known patterns in front and behind variable text to be removed

How to remove part of the middle of a line/string by matching two known patterns, one in front of text to be removed and one behind the text to be removed?
I have a Linux text file with thousands of one-line, comma-delimited records. Unfortunately, the records are not all in the same format. Each line may have as many as four comma-delimited fields, of which only the first and last are constant; the two middle fields may or may not be present.
Examples of existing line (record) formats. The data is messy, but the first field is always present, as is the last field, which starts with the word ADDED.
FNAME LNAME, SOME COMMENT, JOINED DATE, ADDED TO DB DATE
FNAME LNAME, ADDED TO DB DATE
FNAME LNAME, SOME COMMENT, ADDED TO DB DATE
FNAME LNAME, JOINED DATE, ADDED TO DB DATE
The objective is to keep field one including its comma, throw away everything between that comma and the word "ADDED", keep "ADDED" and everything that follows it to the end of the line, and leave a single space between the first comma and the word ADDED.
For each line in the file, parse from the start of the line to the first comma (keep this).
Parse the rest of the line up to the space before the word "ADDED" and throw it away.
Keep everything from the space before the word “ADDED” to end of line and concatenate the first part and last part to form one record per line with two fields separated by a comma and a space.
(if record is already in desired format, change nothing)
Final file to look like:
FNAME LNAME, ADDED TO DB DATE
or
Fred Flintstone, ADDED on January 1st 2015 By Barney Rubble
Thanks!
If you don't care about blank lines:
awk '{print $1,$NF}' FS=, OFS=, input
(Blank lines will be output as a single comma)
If you want to just skip blank lines, use:
awk 'NF>1{print $1,$NF}' FS=, OFS=, input
If you want to keep them:
awk '{printf( "%s%s\n", $1, NF>1 ? ","$NF : "")}' FS=, OFS=, input
Note that this will not ensure a single space after the comma, but will retain the spacing of the final column in the original file (that is, if there are 3 spaces after the final comma in the original, you'll get 3 in the output). It's not clear to me from the description, but that seems like desirable behavior.
A Perl solution
perl -ne 'print join ", ", (split /,\s*/)[0,-1]' myfile
or
perl -pe 's/,.*(?=,)//' myfile
Both of those solutions work fine for me with the data you have given, but you may like to try
perl -pe 's/,.*(?=,\s*ADDED)//' myfile
You can use backreference:
sed 's/\(^[^,]*,\).* ADDED/\1 ADDED/' file
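The backreference approach can be tried on one long and one already-short record. This sketch moves the ^ anchor outside the group, which is an assumption made here for portability, and keep_name_and_added is a hypothetical helper name.

```shell
#!/bin/sh
# keep_name_and_added is a hypothetical helper name for this sketch.
keep_name_and_added() {
    # \1 is the name plus its comma; everything up to " ADDED" is dropped.
    sed 's/^\([^,]*,\).* ADDED/\1 ADDED/'
}

printf '%s\n' \
    'Fred Flintstone, likes bowling, joined 2014, ADDED on January 1st 2015 By Barney Rubble' \
    'FNAME LNAME, ADDED TO DB DATE' | keep_name_and_added
```

Because .* is greedy, the match runs to the last " ADDED" on the line, so a final field that itself contains " ADDED" a second time would lose its leading part.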
one more approach with awk could help here.
awk -F, '{val=$1;sub(/.*,/,",");print val $0}' Input_file
Here the field separator is a comma. The first field is saved in the variable val, everything up to and including the last comma is replaced with a single comma, and then val is printed followed by the edited line.
Using perl
#!/usr/bin/perl
use strict;
use warnings;
open my $fh, "<", "file.txt" or die "$!: couldn't open file\n";
while(<$fh>) {
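chomp; # remove the trailing newline so the output is not double-spaced
my @arr = split(/,\s*/); # split on each comma plus any spaces after it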
my $text = $arr[0] . ", " . $arr[$#arr];
print "$text\n";
}

How to use sed to insert a line before each line in a file with the original line's content surrounding by a string?

I am trying to use sed (GNU sed version 4.2.1) to insert a line before each line in a file with that line's content surrounding by a string.
Input:
truncate table ALPHA;
truncate table BETA;
delete from TABLE_CHARLIE where ID=1;
Expected Result:
SELECT 'truncate table ALPHA;' from dual;
truncate table ALPHA;
SELECT 'truncate table BETA;' from dual;
truncate table BETA;
SELECT 'delete from TABLE_CHARLIE where ID=1;' from dual;
delete from TABLE_CHARLIE where ID=1;
I have tried to make use of the ampersand (&) special character, but this does not seem to work. If I put anything after the ampersand on the replacement string, the output is not correct.
Attempt 1:
sed -e "s/\(.*\)/SELECT '&\n&/g" input.txt
output:
SELECT 'truncate table ALPHA;
truncate table ALPHA;
SELECT 'truncate table BETA;
truncate table BETA;
SELECT 'delete from TABLE_CHARLIE where ID=1;
delete from TABLE_CHARLIE where ID=1;
With the preceding code, I get the SELECT ' as expected, but once I attempt to add ' from dual; to the right side of string, things get out of whack.
Attempt 2:
sed -e "s/\(.*\)/SELECT '&' from dual;\n&/g" input.txt
output:
' from dual;cate table ALPHA;
truncate table ALPHA;
' from dual;cate table BETA;
truncate table BETA;
SELECT 'delete from TABLE_CHARLIE where ID=1;' from dual;
You can take advantage of the hold space to temporarily store the original line.
sed "h;s/.*/SELECT '&' from dual;/;p;g" input.txt
or more readably:
sed "
h
s/.*/SELECT '&' from dual;/
p
g" input.txt
Here's a breakdown of the command.
First, each line of the input is placed in the pattern space.
The h command copies the contents of the pattern space to the hold space.
The s command performs a substitution on the pattern space. The & represents whatever was matched. This command leaves the hold space unaffected.
The p command outputs the contents of the pattern space to standard output.
The g command copies the contents of the hold space to the pattern space.
By default, the contents of the pattern space are written to standard output before reading the next input line.
As Glenn Jackman points out, you can replace p;g with G. This builds up a two-line value in the pattern space that is then printed, rather than print two separate pattern spaces.
sed "h;s/.*/SELECT '&' from dual;/;G" input.txt
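A quick check of the hold-space approach on two of the sample lines, using the G form. wrap_in_select is a hypothetical helper name used only for this sketch.

```shell
#!/bin/sh
# wrap_in_select is a hypothetical helper name for this sketch.
wrap_in_select() {
    # h: save the line; s: build the SELECT line; G: append the saved line.
    sed "h;s/.*/SELECT '&' from dual;/;G"
}

printf '%s\n' 'truncate table ALPHA;' 'delete from TABLE_CHARLIE where ID=1;' | wrap_in_select
```

Each input line comes out twice: first wrapped in SELECT '...' from dual; and then unchanged.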
Also, you can add comments to the sed command so that you can understand what the line noise does later :), if this is in a script.
sed "
# The input line is first copied to the pattern space
h # Copy the pattern space to the hold space
s/.*/SELECT '&' from dual;/ # Modify the pattern space
p # Print the (modified) pattern space
g # Copy the hold space to the pattern space
# The output of the pattern space (the original input line) is now printed
" input.txt
If you're looking for an alternative to sed, these work:
awk '{printf "SELECT '\''%s'\'' from dual;\n%s\n", $0, $0}' file
perl -lpe "print qq{SELECT '\$_' from dual;}" file
Your second attempt works on both the 4.2.1 and 4.2.2 versions of sed. I got the same invalid output when I saved your input file with Windows line endings (carriage return plus line feed).
Use this command on your input file before running your sed command:
tr -d '\15\32' < winfile.txt > unixfile.txt
Or as you suggest, simply by using the dos2unix utility.
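The effect of the line endings can be reproduced with a small test file. This sketch assumes GNU sed (as in the question), since it relies on \n in the replacement producing a newline; it strips only the carriage returns with tr -d '\r', and the temporary file name comes from mktemp.

```shell
#!/bin/sh
# Simulate a Windows-style file: every line ends in CR LF.
crlf_input=$(mktemp)
printf 'truncate table ALPHA;\r\ntruncate table BETA;\r\n' > "$crlf_input"

# Strip the carriage returns, then run the second attempt from the question.
fixed=$(tr -d '\r' < "$crlf_input" | sed -e "s/\(.*\)/SELECT '&' from dual;\n&/g")
printf '%s\n' "$fixed"
rm -f "$crlf_input"
```

Without the tr step, each & would carry a trailing carriage return, which sends the terminal cursor back to the start of the line and lets ' from dual; overwrite the beginning of the SELECT text.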
Here's how to do it with awk:
awk -v PRE="SELECT '" -v SU="' from dual;" '{print PRE $0 SU; print}' input.txt

Print, modify, print again Bash variable

I am looping over a CSV file. Each line of the file is formatted something like this (it's Open Street Maps data):
planet_85.287_27.665_51a5fb91,AcDbEntity:AcDbPolyline,{ [name] Purano
Bus Park-Thimi [type] route [route] microbus [ref] 10 } { [Id] 13.0
[Srid] 3857 [FieldsTableId]
This follows the format:
Layer,SubClasses,ExtendedEntity,Linetype,EntityHandle,Text
I want to add a new column for Name. I can find the name in a line by cutting off everything before [name] and everything from the next [ onward. This code successfully creates a newline-delimited file of all of the names (which I open as a CSV and then copy-paste into the original file as a new column).
cat /path/to/myfile.csv | while read line
do
if [[ ${line} == *"name"* ]]
then
printf "$(echo $line | LC_ALL=C sed 's/^.*name\]//g'| LC_ALL=C cut -f1 -d'[') \n"
else
printf "\n"
fi
done >/path/to/newrow.csv
This system is clearly suboptimal - I would far prefer to print the entire final row. But when I replace that printf line with this:
printf "$line,$(echo $line | LC_ALL=C sed 's/^.*name\]//g'| LC_ALL=C cut -f1 -d'[') \n"
It prints the line but not the name. I've tried printing them in separate print statements, printing the line and then echoing the name, saving the name in a variable and then printing, and a number of other techniques, and each time I either a) only print the line, or b) print the name on a new line, which breaks the CSV format.
What am I doing wrong? How can I print the full original line with the name appended as a new column at the end?
NOTE: I am running this in Terminal on macOS Sierra on a MacBook Pro 15" Retina.
If I understand correctly, you want to extract the name between [name] and [type], and append as the new last CSV column. You can do that using capture groups:
sed -e 's/.*\[name\] \(.*\) \[type\].*/&,\1/' < input
Notice the \(.*\) in the middle. That captures the text between [name] and [type].
In the replacement string, & stands for the matched string, which is the entire line, as the pattern starts and ends with .*.
Next the , is a literal comma, and \1 stands for the content of the first capture group, the part matched within \(...\).
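The substitution can be tried on one line that has a [name] tag and one that does not. The sample records are shortened, made-up variants of the line in the question, and append_name is a hypothetical helper name.

```shell
#!/bin/sh
# append_name is a hypothetical helper name for this sketch.
append_name() {
    # & is the whole matched line; \1 is the text between [name] and [type].
    sed -e 's/.*\[name\] \(.*\) \[type\].*/&,\1/'
}

printf '%s\n' \
    'planet_85,AcDbEntity:AcDbPolyline,{ [name] Purano Bus Park-Thimi [type] route [ref] 10 }' \
    'planet_86,AcDbEntity:AcDbPolyline,{ [type] route [ref] 11 }' | append_name
```

A line without [name] does not match and is printed unchanged, so it gets no trailing comma; if every row must have the same number of columns, that case needs separate handling.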
