How to read and replace special characters in a fixed-length file using a shell script

I have a fixed-length file in which some records contain special characters such as Еӏєпа.
I'm able to select the records containing special characters.
I want to read two columns from those records and replace each with a '*' padded with blanks.
Sample data:
1234562013-09-01 01:05:30Еӏєпа Нцвѡі A other
5657812011-05-05 02:34:56abu jaya B other
Specifically, the 3rd and 4th columns, which contain the special characters, should each be replaced with a single '*' padded with blanks to preserve the field length.
I need a result like the one below:
1234562013-09-01 01:05:30* * A2013-09-01 02:03:40other
5657812011-05-05 02:34:56abu jaya B2013-09-01 07:06:10other
I tried the following commands:
sed -r "s/^(.{56}).{510}/\1$PAD/g;s/^(.{511}).{1023}/\1$PAD/g" errorline.txt
cut -c 57-568
Could someone help me out with this?

I would go with awk, something like:
awk '/[LIST_OF_SPECIAL_CHARS]/ {
    l = $0
    # For the 3rd column.
    # NOTE: the * must be padded if you have a fixed-length file.
    # This can be done with spaces and/or (s)printf; read the docs.
    if (substr($0, FROM, NUM_OF_CHARS) ~ /[LIST_OF_SPECIAL_CHARS]/) {
        l = substr(l, 1, START_OF_3RD_COL_MINUS_1) "*" substr(l, START_OF_4TH_COL)
    }
    # For the 4th column; the same padding note applies.
    if (substr($0, START_OF_4TH_COL, NUM_OF_CHARS) ~ /[LIST_OF_SPECIAL_CHARS]/) {
        l = substr(l, 1, START_OF_4TH_COL_MINUS_1) "*" substr(l, END_OF_4TH_COL_PLUS_1)
    }
    # Print this line, then skip to the next record.
    print l
    next
}
{ # prints all other records
    print
}' INPUTFILE
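Filled in against the sample data, a minimal sketch could look like the following. The column positions (3rd column at characters 26-35, 4th at 36-45) and the Cyrillic character class are assumptions to adapt to your real layout; with multibyte data like Еӏєпа you'll also want GNU awk in a UTF-8 locale so that substr() counts characters rather than bytes:
awk '
/[Ѐ-ӿ]/ {
    l = $0
    if (substr(l, 26, 10) ~ /[Ѐ-ӿ]/)                          # 3rd column
        l = substr(l, 1, 25) sprintf("%-10s", "*") substr(l, 36)
    if (substr(l, 36, 10) ~ /[Ѐ-ӿ]/)                          # 4th column
        l = substr(l, 1, 35) sprintf("%-10s", "*") substr(l, 46)
    print l
    next
}
{ print }                                                     # all other records
' errorline.txt
Here sprintf("%-10s", "*") produces the '*' left-justified and blank-padded to the assumed 10-character field width, so each record keeps its fixed length.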

sed "/.\{56\}.*[^a-zA-Z0-9 ].*.\{7\}/ s/\(.\{56\}\).\{20\}\(.\{7\}\)/\1* * \2/"errorline.txt
where:
56 is the length of the first part of your line, which doesn't contain special characters;
20 is the length of the second part, which may contain special characters;
7 is the length of the last part, the end of your string;
"* * " is the string that will replace your special-character section.
Adapt those values to your line structure.
This sed reads the whole file and rewrites only the lines containing special characters.

Related

Search for strings in a file that contain only specific symbols and numbers in Unix through the grep command

I want to find lines in a file that contain only "+", numbers, and letters. If any other character appears in a string, the complete string should be discarded.
A1264
13255
1255+*
*6_54
54789+
The output should be:
A1264
13255
54789+
Records 3 and 4 should not appear, as they also contain other characters.
You can try something like:
grep -E '^[a-zA-Z0-9+]+$'
This will accept only the letters a to z (lower- and uppercase), digits, and the + sign. (Inside a bracket expression the + needs no escaping; a backslash there would itself be matched as a literal character.)
If you need other symbols, you can edit the character class:
# grep -E '^[a-fA-F0-9©]+$' a1
A1264
13255
54789©
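The ^ and $ anchors are what make this work: without them, grep would print any line containing at least one allowed character anywhere, so records 3 and 4 would slip through. With the anchors, every character on the line must belong to the class. Assuming the five sample records are in file:
$ grep -E '^[a-zA-Z0-9+]+$' file
A1264
13255
54789+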

How to replace a string like "[1.0 - 4.0]" with a numeric value using awk or sed?

I have a CSV file that I am piping through a set of awk/sed commands.
Some lines in the CSV file look like this:
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,"[1.1 - 3.0]","[0.384 - 0.768]"
where the 8th and 9th columns are strings representing numeric ranges.
How can I use awk or sed to replace those fields with a numeric value? Either the beginning of the range, or the end of the range?
So this line would end up as
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,1.1,0.384
or
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,3.0,0.768
I got as far as removing the brackets but past that I'm stuck. I considered splitting on the " - ", but many lines in my file have a regular numeric value, not a range, in those last two columns, and that makes things messy (I don't want to end up with some lines having a different number of columns).
Here is a sed command that will take each range and break it up into two fields. It looks for strings like "[A - B]" and converts them to A,B. It can easily be modified to use just one of the values by changing the \1,\2 portion. The regular expression assumes that all numbers have at least one digit on either side of a required decimal place. So, 1, .5, and 3. would not be valid. If you need that, the regex can be made more accommodating.
$ cat file
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,"[1.1 - 3.0]","[0.384 - 0.768]"
$ sed -Ee 's|"\[([0-9]+\.[0-9]+) - ([0-9]+\.[0-9]+)\]"|\1,\2|g' file
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,1.1,3.0,0.384,0.768
Since your data is field-based, awk is the logical choice.
Note that while awk generally isn't aware of double-quoted fields, that is not a problem here, because the double-quoted fields do not have embedded , instances.
#!/usr/bin/env bash
useStart1=1 # set to `0` to use the *end* of the *penultimate* field's range instead.
useStart2=1 # set to `0` to use the *end* of the *last* field's range instead.
awk -v useStart1=$useStart1 -v useStart2=$useStart2 '
BEGIN { FS=OFS="," }
{
split($(NF-1), tokens1, /[][" -]+/)
split($NF, tokens2, /[][" -]+/)
$(NF-1) = useStart1 ? tokens1[2] : tokens1[3]
$NF = useStart2 ? tokens2[2] : tokens2[3]
print
}
' <<'EOF'
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,"[1.1 - 3.0]","[0.384 - 0.768]"
EOF
The code above yields:
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,1.1,0.384
Modifying the values of $useStart1 and $useStart2 yields the appropriate variations.
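For instance, setting useStart1=0 and useStart2=0 selects the end of each range instead:
10368,"Verizon DSL",DSL,NY,NORTHEAST,-5,-4,3.0,0.768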

How can I retrieve matching records from the file format below in bash

XYZNA0000778800Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
XYZNA0000778900Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
I have the above file format, from which I want to find a matching record. For example: match a number (7789) on the line starting with XYZ; once matched, look for a matching number (7345) in the lines below starting with 1, until a line starting with 9 is reached; then retrieve the entire matching line. How can I accomplish this using a shell script, awk, sed, or any combination?
Expected Output:
XYZNA0000778900Z
17345000012300324000000004000000000000000
With sed one can do:
$ sed -n '/^XYZ.*7789/,/^9$/{/^1.*7345/p}' file
17345000012300324000000004000000000000000
Breakdown:
sed -n ' '     # -n disables automatic printing
/^XYZ.*7789/,  # Match a line starting with XYZ
               # and containing 7789, ...
/^9$/ { }      # ... up to a line that is exactly 9
/^1.*7345/p    # Inside that range, print a line
               # starting with 1 and
               # containing 7345
range { stuff } will execute stuff while the current line is inside range; here the range starts at /^XYZ.*7789/ and ends with /^9$/.
.* will match anything but newlines zero or more times.
If you want to print the whole block matching the conditions, one can use:
$ sed -n '/^XYZ.*7789/{:s;N;/\n9$/!bs;/\n1.*7345/p}' file
XYZNA0000778900Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
This works by reading the lines between ^XYZ.*7789 and ^9$ into the pattern
space, and then printing the whole thing if ^1.*7345 can be matched:
sed -n ' ' # -n disables printing
/^XYZ.*7789/{ } # Match line starting
# with XYZ that also contains 7789
:s; # Define label s
N; # Append next line to pattern space
/\n9$/!bs; # Goto s unless \n9$ matches
/\n1.*7345/p # Print whole pattern space
# if \n1.*7345 matches
I'd use awk:
awk -v rid=7789 -v fid=7345 -v RS='\n9\n' -F '\n' 'index($1, rid) { for(i = 2; i <= NF; ++i) { if(index($i, fid)) { print $i; next } } }' filename
This works as follows:
-v RS='\n9\n' is the meat of the whole thing. Awk separates its input into records (by default lines). This sets the record separator to \n9\n, which means that records are separated by lines with a single 9 on them. These records are further separated into fields, and
-F '\n' tells awk that fields in a record are separated by newlines, so that each line in a record becomes a field.
-v rid=7789 -v fid=7345 sets two awk variables, rid and fid (meant by me as record identifier and field identifier, respectively; the names are arbitrary), to your search strings. You could encode these in the awk script directly, but this way makes it easier and safer to replace the values with those of shell variables (which I expect you'll want to do).
Then the code:
index($1, rid) { # In records whose first field contains rid
for(i = 2; i <= NF; ++i) { # Walk through the fields from the second
if(index($i, fid)) { # When you find one that contains fid
print $i # Print it,
next # and continue with the next record.
} # Remove the "next" line if you want all matching
} # fields.
}
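For example, driven by shell variables as suggested above (the output is the expected record from the question):
$ rid=7789 fid=7345
$ awk -v rid="$rid" -v fid="$fid" -v RS='\n9\n' -F '\n' \
    'index($1, rid) { for(i = 2; i <= NF; ++i) { if(index($i, fid)) { print $i; next } } }' filename
17345000012300324000000004000000000000000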
Note that multi-character record separators are not strictly required by POSIX awk, and I'm not certain whether BSD awk accepts them. Both GNU awk and mawk do, though.
EDIT: Misread question the first time around.
An extendable awk script can be:
$ awk '/^9$/{s=0} s&&/7345/; /^XYZ/&&/7789/{s=1} ' file
Set the flag s when a line starts with XYZ and contains 7789; reset it when a line is just 9; print when the flag is set and the line contains the pattern 7345.
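Spelled out with comments, the same one-liner reads:
awk '
  /^9$/            { s = 0 }   # end of block: reset the flag
  s && /7345/                  # flag set and 7345 present: print (default action)
  /^XYZ/ && /7789/ { s = 1 }   # matching header: set the flag
' file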
This might work for you (GNU sed):
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^XYZ[^\n]*7789/!b;/7345/p' file
Use the option -n for the grep-like nature of sed. Gather up records beginning with XYZ and ending in 9. Reject any records which do not have 7789 in the header. Print any remaining records that contain 7345.
If 7345 will always follow the header, this could be shortened to:
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^XYZ[^\n]*7789.*7345/p' file
If all records are well-formed (begin XYZ and end in 9) then use:
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^[^\n]*7789.*7345/p' file

Reformatting a text file from rows to columns

I have multiple files in a directory that I need to reformat, putting the output in one file. The file structure is:
========================================================
Daily KPIs - DATE: 24/04/2013
========================================================
--------------------------------------------------------
Number of des = 5270
--------------------------------------------------------
Number of users = 210
--------------------------------------------------------
Number of active = 520
--------------------------------------------------------
Total non = 713
--------------------------------------------------------
========================================================
I need the output format to be:
Date,Numberofdes,Numberofusers,Numberofactive,Totalnon
24042013,5270,210,520,713
The directory has around 1500 files with the same format, and I'm using CentOS 7.
Thanks
First we need a function to join the elements of an array into a string (cf. Join elements of an array?):
function join_array()
{
    local IFS=$1
    shift
    echo "$*"
}
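For example:
$ join_array , 24042013 5270 210 520 713
24042013,5270,210,520,713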
Then we can cycle over the files and convert each one into a comma-separated list (assuming the original files have names ending in .txt):
for f in *.txt
do
    sed -n 's/[^:=]\+[:=] *\(.*\)/\1/p' < "$f" | {
        mapfile -t fields
        join_array , "${fields[@]}"
    }
done
Here, the sed command looks inside each input file for lines that:
begin with a substring that contains neither a : nor a = character (the [^:=]\+ part);
then follow a : or a = and an arbitrary number of spaces (the [:=] * part);
finally, end with an arbitrary substring (the \(.*\) part).
The last substring is then captured and printed instead of the original line. Any other line in the input files is discarded.
After that, the output of sed is read by mapfile into the indexed array variable fields (the -t ensures that trailing newlines from each line read are discarded) and finally the lines are joined thanks to our previously-defined join_array method.
The reason whereby we need to wrap mapfile inside a subshell is explained here: readarray (or pipe) issue.
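Putting it together, a sketch of the full conversion might look like this (the header line and the output name all_kpis.csv are illustrative; note also that the captured date keeps its slashes, so tr -d '/' is added to match the requested 24042013 format):
#!/usr/bin/env bash
# join_array as defined above
{
    echo "Date,Numberofdes,Numberofusers,Numberofactive,Totalnon"
    for f in *.txt
    do
        sed -n 's/[^:=]\+[:=] *\(.*\)/\1/p' < "$f" | tr -d '/' | {
            mapfile -t fields
            join_array , "${fields[@]}"
        }
    done
} > all_kpis.csv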

Remove special characters in a CSV in Unix and fix the broken lines

Below is my sample data from the CSV:
20160711,"M","N1","F","S","A","good data with.....some special character and space
space ..
....","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"
In the above, one field contains good data along with junk data, and the line is split across new lines.
I want to remove the special characters (due to these special characters and spaces, the line was pushed onto the next lines) and merge the split line back into a single line.
Currently I am using something like the following, which takes a lot of time:
tr -cd '\11\12\15\40-\176' < MY_FILE.csv | gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' > MY_FILE.csv.tmp
I've attached a screenshot of the original data in the file.
You could use
tr -c '[:print:]\r\n' ' ' <bad.csv >better.csv
to get rid of the non-printable chars…
sed '/[^"]$/ { N ; s/\n// }' better.csv | sed '/[^"]$/ { N ; s/\n// }' >even_better.csv
would cover most cases (though it would fail to trap an extra line break coming just after a quote)
– Samson Scharfrichter
One problem that you will likely have with a traditional unix tool like awk is that while it supports field separators, it does not support quote+comma-style CSV formatting like the one in your screenshot or sample data. Awk can separate fields in a record using a field separator, but it has no concept of quote armour around your fields, so embedded commas are also considered field separators.
If you're comfortable with that because none of your plaintext data includes commas, and none of your "non-printable" data includes commas by accident, then you can just consider the quotes to be part of the field. They're printable characters, after all.
If you want to join your multi-line records into a single line and strip any non-printable characters, the following awk one-liner might do:
awk -F, 'NF<10{$0=last $0;last=$0} NF<10{next} {last="";gsub(/[^[:print:]]/,"")} 1' inputfile
Note that this works except in cases where the line break falls between the last comma and the content of the last field, because from awk's perspective an empty field is valid and there's no need to join. If this logic doesn't match your data, you get another fun programming task as a result. :)
Let's break out the awk script and see what it does.
awk -F, '                    # Set comma as the field separator...
NF<10 {                      # For any line that has fewer than 10 fields...
    $0 = last $0             # insert the last "saved" line here,
    last = $0                # and save the newly joined line for the next round.
}
NF<10 {                      # If we still have fewer than 10 fields,
    next                     # skip to the next input line.
}
{
    last = ""                # Clear the saved line, then
    gsub(/[^[:print:]]/, "") # substitute an empty string for all non-printables,
}
1' inputfile                 # and print the current line.
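Run against the sample data above (saved as MY_FILE.csv), the split record comes back as one line; note that $0=last $0 concatenates the pieces directly, with no separator inserted where the breaks were:
$ awk -F, 'NF<10{$0=last $0;last=$0} NF<10{next} {last="";gsub(/[^[:print:]]/,"")} 1' MY_FILE.csv
20160711,"M","N1","F","S","A","good data with.....some special character and spacespace ......","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"
20160711,"M","N1","F","S","A","R","M","072","00126"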
