How to remove part of the middle of a line/string by matching two known patterns in front and behind variable text to be removed - bash

How to remove part of the middle of a line/string by matching two known patterns, one in front of text to be removed and one behind the text to be removed?
I have a Linux text file with thousands of one-line, comma-delimited records. Unfortunately, not all records are in the same format. Each line may have as many as four comma-delimited fields, of which only the first and last are always present; the two middle fields may or may not be present.
Examples of the existing line (record) formats are below. The data is messy, but the first field is always present, as is the last field, which starts with the word ADDED.
FNAME LNAME, SOME COMMENT, JOINED DATE, ADDED TO DB DATE
FNAME LNAME, ADDED TO DB DATE
FNAME LNAME, SOME COMMENT, ADDED TO DB DATE
FNAME LNAME, JOINED DATE, ADDED TO DB DATE
The objective is to keep field one including the comma, throw away everything between the first comma and the word "ADDED", keep the word "ADDED" and everything that follows it to the end of the line, and insert a space between the first comma and the word ADDED.
For each line in the file, parse from the start of the line to the first comma (keep this).
Parse the rest of the line up to the space before the word "ADDED" and throw it away.
Keep everything from the space before the word "ADDED" to the end of the line, and concatenate the first part and the last part to form one record per line with two fields separated by a comma and a space.
(If a record is already in the desired format, change nothing.)
Final file to look like:
FNAME LNAME, ADDED TO DB DATE
or
Fred Flintstone, ADDED on January 1st 2015 By Barney Rubble
Thanks!

If you don't care about blank lines:
awk '{print $1,$NF}' FS=, OFS=, input
(Blank lines will be output as a single comma)
If you want to just skip blank lines, use:
awk 'NF>1{print $1,$NF}' FS=, OFS=, input
If you want to keep them:
awk '{printf( "%s%s\n", $1, NF>1 ? ","$NF : "")}' FS=, OFS=, input
Note that this will not ensure a single space after the comma, but will retain the spacing as in the final column of the original file. (that is, if there are 3 spaces after the final column in the original, you'll get 3 in the output). It's not clear to me from the description, but that seems like desirable behavior.
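As a quick sanity check, here is the first variant run over two of the sample formats fed inline (`-F, -v OFS=,` is equivalent to the trailing `FS=, OFS=,` assignments):

```shell
printf '%s\n' \
  'FNAME LNAME, SOME COMMENT, JOINED DATE, ADDED TO DB DATE' \
  'FNAME LNAME, ADDED TO DB DATE' |
awk -F, -v OFS=, '{print $1, $NF}'
# FNAME LNAME, ADDED TO DB DATE
# FNAME LNAME, ADDED TO DB DATE
```

The space after the comma in the output is simply the original leading space of the last field, as noted above.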

A Perl solution
perl -ne 'print join ", ", (split /,\s*/)[0,-1]' myfile
or
perl -pe 's/,.*(?=,)//' myfile
Both of those solutions work fine for me with the data you have given, but you may like to try
perl -pe 's/,.*(?=,\s*ADDED)//' myfile

You can use backreference:
sed 's/\(^[^,]*,\).* ADDED/\1 ADDED/' file
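For example, against both a messy and an already-clean line; already-clean lines pass through unchanged because the pattern still matches with nothing between the first comma and " ADDED":

```shell
printf '%s\n' \
  'FNAME LNAME, SOME COMMENT, JOINED DATE, ADDED TO DB DATE' \
  'FNAME LNAME, ADDED TO DB DATE' |
sed 's/\(^[^,]*,\).* ADDED/\1 ADDED/'
# FNAME LNAME, ADDED TO DB DATE
# FNAME LNAME, ADDED TO DB DATE
```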

One more approach with awk could help here.
awk -F, '{val=$1;sub(/FNAME.*\,/,",");print val $0}' Input_file
Here the field separator is set to a comma, the first field is saved in a variable named val, everything from FNAME through the last comma is then substituted with a single comma in the current line, and finally the saved val and the edited line are printed together. (Note this matches the literal text FNAME, so it only works on data shaped exactly like the sample.)

Using perl
#!/usr/bin/perl
use strict;
use warnings;

open my $fh, "<", "file.txt" or die "couldn't open file: $!\n";
while (<$fh>) {
    chomp;                              # drop the trailing newline
    my @arr  = split /,\s*/;            # split on comma plus optional spaces
    my $text = $arr[0] . ", " . $arr[-1];
    print "$text\n";
}
close $fh;

Related

Prepending letter to field value

I have a file 0.txt containing the following values, each record wrapped in parentheses:
(bread,milk,),
(rice,brand B,),
(pan,eggs,Brandc,),
I'm looking on SO and elsewhere for how to prepend the letter x to the beginning of each value between commas, so that my output file becomes (using bash on Unix):
(xbread,xmilk,),
(xrice,xbrand B,),
(xpan,xeggs,xBrandc,),
The only thing I've really tried (though not enough) is:
awk '{gsub(/,/,",x");print}' 0.txt
For all intents and purposes, the prefix should not be applied after the trailing commas at the end of each line.
With awk
awk 'BEGIN{FS=OFS=","}{$1="(x"substr($1,2);for(i=2;i<=NF-2;i++){$i="x"$i}}1'
Explanation:
# Before you start, set the input and output delimiter
BEGIN{
FS=OFS=","
}
# The first field is special, the x has to be inserted
# after the opening (
$1="(x"substr($1,2)
# Prepend 'x' to fields 2 through NF-2 (the ')' and the empty
# field after the trailing comma are skipped)
for(i=2;i<=NF-2;i++){
$i="x"$i
}
# 1 is always true. awk will print in that case
1
The trick is to anchor the regexp so that it matches the whole comma-terminated substring you want to work with, not just the comma, while excluding the syntax characters ( and ).
awk '{ gsub(/[^,()]+,/, "x&") } 1' 0.txt
sed -r 's/([^,()]+,)/x\1/g' 0.txt
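As a sketch, running the awk form over the sample file's contents (fed inline here; the sed form with -r requires GNU sed but produces the same output):

```shell
printf '%s\n' '(bread,milk,),' '(rice,brand B,),' '(pan,eggs,Brandc,),' |
awk '{ gsub(/[^,()]+,/, "x&") } 1'
# (xbread,xmilk,),
# (xrice,xbrand B,),
# (xpan,xeggs,xBrandc,),
```

Because `[^,()]+` cannot match the parentheses or an empty string, the `),` at the end of each line is left untouched.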

How to print certain fields in a column if one of the fields is less than a certain value?

I have a .txt file that contains data about 100 colleges in the format
{COLLEGE NAME} {CITY, STATE} {RANK} {TUITION} {IN STATE TUITION} {ENROLLMENT}
For example here are two lines
YeshivaUniversity "New York, NY" 66 "$40,670 " "2,744"
FordhamUniversity "New York, NY" 60 "$47,317 " "8,855"
There are 98 more lines, and the output should return all the colleges with tuition less than $30,000.
Assuming that the field separator is space, how could I print the {COLLEGE NAME} {CITY, STATE} {TUITION} of colleges with {TUITION} less than $30,000? Is it possible to do with awk or sort?
I have tried some combinations of awk and the operators <=, but I get an error every time. For example
$ awk -F" " '{print $1, $2, $4<=30000}' data1a.txt
gives me a syntax error.
Using GNU awk, since it's got FPAT:
$ gawk '
BEGIN {
FPAT="([^ ]*)|(\"[^\"]+\")"
}
{
tuition=$4 # separate 4th column for cleaning
gsub(/[^0-9]/,"",tuition) # clean non-digits off
if(tuition<30000) # compare
print # and output
}'
Output for the sample data is empty, as both tuitions are above $30,000.
(Next time, please post sample data that contains both positive and negative cases.)
Also, it was mentioned in the comments: the data is delimited by a single space, yet you have a space in the name of a university. That wasn't the case anymore when I saw your question, but it could be tackled by counting the fields from the end, i.e. $4 would become $(NF-1).
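Here is a sketch of that count-from-the-end idea for names containing spaces. It requires gawk (for FPAT); the "Cheap College" line is invented to provide a positive case, and `[^ ]+` is used instead of `[^ ]*` so the pattern cannot match an empty string:

```shell
printf '%s\n' \
  'Yeshiva University "New York, NY" 66 "$40,670 " "2,744"' \
  'Cheap College "Town, ST" 99 "$25,000 " "1,000"' |
gawk 'BEGIN { FPAT = "([^ ]+)|(\"[^\"]+\")" }
{
  tuition = $(NF-1)               # tuition is now the second-to-last field
  gsub(/[^0-9]/, "", tuition)    # strip $, commas, quotes, and spaces
  if (tuition + 0 < 30000)       # force a numeric comparison
    print
}'
# Cheap College "Town, ST" 99 "$25,000 " "1,000"
```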

remove special character in a csv unix and fix the new line

Below is my sample data in the csv .
20160711,"M","N1","F","S","A","good data with.....some special character and space
space ..
....","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"
In the above, one field contains good data along with junk characters, and the line has been split onto new lines.
I want to remove the special characters (because of them and the extra whitespace, the record was wrapped onto the following lines) and merge the split lines back into a single line.
Currently I am using something like the following, which is taking a lot of time:
tr -cd '\11\12\15\40-\176' < MY_FILE.csv | gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' > MY_FILE.csv.tmp
attached a screenshot of original data in the file .
You could use
tr -c '[:print:]\r\n' ' ' <bad.csv >better.csv
to get rid of the non-printable chars…
sed '/[^"]$/ { N ; s/\n// }' better.csv | sed '/[^"]$/ { N ; s/\n// }' >even_better.csv
would cover most cases (i.e. would fail to trap an extra line break just after a random quote)
– Samson Scharfrichter
One problem that you will likely have with a traditional unix tool like awk is that while it supports field separators, it does not support quote+comma-style CSV formatting like the one in your screenshot or sample data. Awk can separate fields in a record using a field separator, but it has no concept of quote armour around your fields, so embedded commas are also considered field separators.
If you're comfortable with that because none of your plaintext data includes commas, and none of your "non-printable" data includes commas by accident, then you can just consider the quotes to be part of the field. They're printable characters, after all.
If you want to join your multi-line records into a single line and strip any non-printable characters, the following awk one-liner might do:
awk -F, 'NF<10{$0=last $0;last=$0} NF<10{next} {last="";sub(/[^[:print:]]/,"")} 1' inputfile
Note that this works except in cases where the line break is between the last comma and the content of the last field because from awk's perspective an empty field is valid and there's no need to join. If this logic doesn't match your data, you get another fun programming task as a result. :)
Let's break out the awk script and see what it does.
awk -F, '                  # Set comma as the field separator...
NF<10 {                    # For any line that has fewer than 10 fields...
$0=last $0                 # Prepend the previously saved partial line,
last=$0                    # and save the newly joined line for the next round.
}
NF<10 {                    # If we still have fewer than 10 fields,
next                       # stop here and read the next input line.
}
{
last=""                    # Clear the saved partial line, then
sub(/[^[:print:]]/,"")     # substitute an empty string for a non-printable.
}
1' inputfile               # 1 is always true, so the current line is printed.
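As a rough check with the sample data (the non-printable characters cannot be reproduced here, so this only demonstrates the line-joining; note that the fragments are concatenated with no separator between them):

```shell
printf '%s\n' \
  '20160711,"M","N1","F","S","A","good data with space' \
  'space ..' \
  '....","M","072","00126"' \
  '20160711,"M","N1","F","S","A","R","M","072","00126"' |
awk -F, 'NF<10{$0=last $0;last=$0} NF<10{next} {last="";sub(/[^[:print:]]/,"")} 1'
# 20160711,"M","N1","F","S","A","good data with spacespace ......","M","072","00126"
# 20160711,"M","N1","F","S","A","R","M","072","00126"
```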

Why do I get weird output in printf in awk for $0?

The input is following
Title: Aoo Boo
Author: First Last
I am trying to output
Aoo Boo, First Last, "
by using awk like this
awk 'BEGIN { FS="[:[:space:]]+" }
/Title/ { sub(/^Title: /,""); t = $0; } # save title
/Author/{ sub(/^Author: /,""); printf "%s,%s,\"\n", t, $0}
' t.txt
But the output is something like ,"irst Last. Basically, it prints everything from the beginning of the line.
But if I change $0 to $2, the output is as expected, which is Boo,Last,"
Why is this incorrect? What is the right way to do it?
You need to get rid of the Windows line endings in your text file if you want to use Unix utilities.
If you're lucky, you'll find you have the dos2unix program installed, and you'll only need to do this:
dos2unix t.txt
If not, you could do it with tr:
tr -d '\r' < t.txt > new_t.txt
For reference, what is going on is that Windows files have \r\n at the end of every line (actually, a CR control code followed by an NL control code). On Linux, lines end with just the \n, so the \r becomes part of the data; when you print it, the terminal interprets it as a "carriage return", which moves the cursor to the beginning of the current line rather than advancing to the next line. Since the value of t ends with a \r, the following text overwrites the printed value of t.
It works with $2 because you've reassigned FS to include [:space:]; that definition of field separators is more generous than the awk default, since it includes \r and \f, neither of which are default field separators. Consequently, $2 does not contain the \r, but $0 does.
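Putting the fix together, assuming the two-line input file from the question (its contents inlined here), stripping the CRs first makes the original awk behave:

```shell
printf 'Title: Aoo Boo\r\nAuthor: First Last\r\n' |
tr -d '\r' |
awk 'BEGIN { FS="[:[:space:]]+" }
     /Title/  { sub(/^Title: /,"");  t = $0 }         # save title
     /Author/ { sub(/^Author: /,""); printf "%s,%s,\"\n", t, $0 }'
# Aoo Boo,First Last,"
```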
This assumes there are no colons in titles or names...
awk -F': *' '
$1=="Title" {
sub(/[^[:print:]]/,"");
t=$2;
}
$1=="Author" {
sub(/[^[:print:]]/,"");
printf("%s, %s\n", t, $2);
}
' inputfile.txt
This works by finding the title and storing it in a variable, then finding the author and using that as a trigger to print everything according to your format. You can alter the format as you see fit.
It may break if there are extra colons on the line, as the colon is being used to split fields. It may also break if your input doesn't match your example.
Perhaps the most important thing in this example is the sub(...) function, which strips off non-printable characters like the carriage return that rici noticed you have. The regular expression [^[:print:]] matches any character that is not printable, such as the carriage return. This script will substitute such characters into oblivion if they're there, but should do no harm if they are not.

How to combine two lines that share the same keyword?

Let's say I have a file looking somewhat like this:
X NeedThis1 KEYWORD
.
.
NeedThis2 X KEYWORD
And I need to combine the two lines into one like this:
NeedThis2 NeedThis1 KEYWORD
This needs to be done for every pair of lines in the file that share the same KEYWORD, but it must not combine two lines that look like this (an X in the same position on both lines):
X NeedThis1 KEYWORD
X NeedThis2 KEYWORD
I am considering myself bash-noob so any advice if it can be done with something like awk or sed would be appreciated.
awk '
{if ($1 == "X") end[$3] = $2; else start[$3] = $1}
END {for (kw in start) if (kw in end) print start[kw], end[kw], kw}
' file
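A quick check of this array-based version against the example lines from the question:

```shell
printf '%s\n' 'X NeedThis1 KEYWORD' 'NeedThis2 X KEYWORD' |
awk '
{if ($1 == "X") end[$3] = $2; else start[$3] = $1}
END {for (kw in start) if (kw in end) print start[kw], end[kw], kw}'
# NeedThis2 NeedThis1 KEYWORD
```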
Try this:
awk '
$1=="X" {key = $NF; value = $2; next}
$2=="X" && $NF==key {print $1, value, key}' file
Explanation:
When a line's first field is X, store the last field as key and the second field as value.
Look for a following line whose second field is X and whose last field matches the previously stored key.
When found, print the first field of the current line, the stored value, and the key, which yields the requested NeedThis2 NeedThis1 KEYWORD order.
This will most definitely break if your data does not match the sample you have shown (if it has more spaces or fields in between), so feel free to adjust as per your needs.
I won't give you the full answer, but if you have some way to identify "KEYWORD" (not in your problem statement), then use a BASH associative array:
declare -A keys
while IFS= read -u3 -r line
do
set -- $line
eval keyword=\$$#
keys[$keyword]+=${line%$keyword}
done
you'll certainly have to do some more fiddling, but your problem statement is incomplete and some of the work needs to be an exercise for the reader.
