remove/replace unprintable characters from txt file using shell script - shell

I am trying to remove newline characters from within quotes in a file.
I am able to achieve that using the code below:
awk -F"\"" '!length($NF){print;next}{printf("%s ", $0)}' filename.txt > filenamenew.txt
Note that I am creating a new file, filenamenew.txt. Is this avoidable? Can I run the command in place? The reason I ask is that the files are huge.
My file is pipe-delimited.
Sample input file:
"id"|"name"
"1"|"john
doe"
"2"|"second
name
in the list"
Using the above code I get the following output:
"id"|"name"
"1"|"john doe"
"2"|"second name in the list"
But I have huge files, and I see that some of the lines have a ^M character between the quotes. For example, a second sample input file:
"id"|"name"
"1"|"john
doe"
"^M2"|"second^M^M
name
in the list"
Output using the above code:
"id"|"name"
"1"|"john doe"
name in the list"
So basically, if there is a ^M in the line, that string is not being printed. I read online that ^M is equal to \r, so I used
tr -d '\r' < filename.txt
I also tried
awk -F"|" '{sub(/^M/,"")}1' filename.txt
but neither removed those characters (^M).
A little background on why I am doing this:
I am extracting data from a relational table and loading it into a flat file, then checking whether the counts between the table and the file match. But since there are \n characters inside some columns, count(*) on the table vs wc -l on the file does not match.
Final resolution:
I don't want to delete these unprintable characters in the long run, but want to replace them with some character or value (so that the counts between the table and the file match). Then, when loading the data back into the table, I want to replace the value I added, effectively a placeholder, with the \n or ^M that was originally present, so that there is no tampering with the data on my side.
Any suggestions are appreciated.
Thanks.
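A sketch of that placeholder idea (not from the original thread; the <CR> and <NL> tokens are arbitrary markers assumed not to occur in the real data, and the ^M characters are assumed to sit inside the quoted fields, as in the sample):
# Outbound: mark carriage returns, then join quoted-field continuation
# lines, marking each joined newline with a placeholder token.
awk -F'"' '
    { gsub(/\r/, "<CR>") }        # mark embedded carriage returns
    !length($NF) { print; next }  # line ends with a quote: record complete
    { printf("%s<NL>", $0) }      # otherwise join, marking the newline
' filename.txt > filenamenew.txt
# Inbound: restore the original control characters before reloading.
awk '{ gsub(/<CR>/, "\r"); gsub(/<NL>/, "\n") } 1' filenamenew.txt > restored.txt
On editing in place: standard awk cannot do it; GNU awk 4.1+ has -i inplace, but it still writes a temporary copy behind the scenes, so it does not save disk space on huge files.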

Related

Single file containing file names and scores | text processing

I have a folder called files that has 100 files, each one with a single value inside, such as: 0.974323
This is my code to generate those files and store the single value inside:
DIR="/home/XX/folder"
INPUT_DIR="/home/XX/folder/eval"
OUTPUT_DIR="/home/XX/folder/files"
for i in $INPUT_DIR/*
do
groovy $DIR/calculate.groovy $i > $OUTPUT_DIR/${i##*/}_rates.txt
done
That will generate 100 files inside /home/XX/folder/files, but what I want is one single file where each line has two columns separated by a tab, containing the score and the name of the file (which is $i):
the score \t name of the file
So, the output will be:
0.9363728 \t resultFile.txt
0.37229 \t outFile.txt
And so on. Any help with that, please?
Assuming your Groovy program outputs just the score, try something like
#!/bin/sh
# ^ use a valid shebang
# Don't use uppercase for variables
dir="/home/XX/folder"
input_dir="/home/XX/folder/eval"
output_dir="/home/XX/folder/files"
# Always use double quotes around file names
for i in "$input_dir"/*
do
groovy "$dir/calculate.groovy" "$i" |
sed "s%\$%\t$i%"  # append a tab and the file name: "score<TAB>name"
done >"$output_dir"/tabbed_file.txt
The sed script assumes that the file names do not contain percent signs, and that your sed recognizes \t as a tab (some variants will think it's just a regular t with a gratuitous backslash; replace it with a literal tab, or try ctrl-v tab to enter a literal tab at the prompt in many shells).
A much better fix is probably to change your Groovy program so that it accepts an arbitrary number of files as command-line arguments, and includes the file name in the output (perhaps as an option).
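If your sed is one of the variants that treats \t as a plain t, here is a sketch that sidesteps sed entirely (same assumed paths; the shell's printf emits the tab itself):
#!/bin/sh
dir="/home/XX/folder"
input_dir="/home/XX/folder/eval"
output_dir="/home/XX/folder/files"
for i in "$input_dir"/*
do
    # capture the score, then print "score<TAB>file name" directly
    score=$(groovy "$dir/calculate.groovy" "$i")
    printf '%s\t%s\n' "$score" "$i"
done >"$output_dir"/tabbed_file.txt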

AIX script for file information

I have a file, on an AIX server, with multiple record entries in the below format:
Name(ABC XYZ) Gender(Male)
AGE(26) BDay(1990-12-09)
My problem is that I want to extract the name and the birthday from the file for all the records. I am trying to list them like below:
ABC XYZ 1990-12-09
Can someone please help me with the scripting?
Something like this maybe:
awk -F"[()]" '/Name/ && /Gender/{name=$2} /BDay/{print name,$4}' file.txt
That says... "treat opening and closing parentheses as field separators. If you see a line go by that contains Name and Gender, save the second field in the variable name. If you see a line go by that contains BDay, print out the last name you saw and also the fourth field on the current line."
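For example, with the two sample lines from the question saved in file.txt:
$ awk -F"[()]" '/Name/ && /Gender/{name=$2} /BDay/{print name,$4}' file.txt
ABC XYZ 1990-12-09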

Programmatically delete all text between 2 characters in osx terminal

I have a thousand txt files:
1.txt
2.txt
3.txt
in each file, several times, I have tags among my text:
{somethinghere...blablabla} then the text I want to keep, then again {somethinghere...blablabla}
I'm not very practical with the mac osx command line; can someone help me write a command that opens each file, parses it, and deletes all text enclosed between two braces?
To be clear:
First of all I need to open each file, then parse the text. When the loop finds a "{" it starts deleting until it finds a "}". When done parsing, it saves and closes the file. That's what I need to do.
$ sed -i.bak -e 's#{[^}]*}##g' *.txt
-i.bak makes a backup copy of each modified file. If you don't want backups, on OS X use -i '' (a separate empty argument; on Linux the empty quotes are not necessary)
in substitutions, the delimiter can be a character other than /; here I chose #, so: s#<REGEX>#<REPLACEMENT># (the basic form for substitutions is s///)
In the regex, we search for a literal {, then anything that is not a } with [^}]; * means zero or more occurrences. Last, we search for the closing } and replace the matching part with nothing, so it deletes whatever matched
the g modifier at the end means not just the first match but all of them
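Applied to the sample text from the question (note that sed works line by line, so a { and its matching } must sit on the same line):
$ cat 1.txt
{somethinghere...blablabla} the text I want to keep {somethinghere...blablabla}
$ sed -i.bak -e 's#{[^}]*}##g' 1.txt
$ cat 1.txt
 the text I want to keep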

Reading data from file to execute Shell Script

I have a 'testfiles' file that has a list of files.
Example:
Tc1
Tc2
Calling the above file in a script:
test=`cat testfiles`
for ts in $test
do
feed.sh $ts >>results
done
This script runs fine when there is only one test file in 'testfiles', but when there are multiple files it fails with 'file not found'.
Let me know if this is the correct approach.
You'll have to read the files one by one: since you are taking test=`cat testfiles`, you end up searching for a file named 'Tc1 Tc2', which does not exist. So use the cut command with " " as the delimiter and read the files one by one in a loop, or you can also use the sed command to separate the file names.
Your approach should work if the filenames have no spaces or other tricky characters. An approach that handles spaces in file names successfully is:
while IFS= read -r ts
do
feed.sh "$ts" >>results
done <testfiles
If your file names have newline characters in them, then the above won't work and you would need to create testfiles with the names separated by a null character in place of a newline.
Let's consider the original code. When bash substitutes for $test in the for statement, all the file names appear on the same line and bash will perform word splitting which will make a mess of any file names containing white space. The same happens on the line feed.sh $ts. Since $ts is not quoted, it will also undergo word splitting.
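A sketch of that null-separated variant (bash-specific, since it relies on read -d ''; it assumes testfiles was written with NUL-terminated names, e.g. by find ... -print0):
#!/bin/bash
# Read NUL-delimited file names, so even names containing newlines survive.
while IFS= read -r -d '' ts
do
    feed.sh "$ts" >>results
done <testfiles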

Excel saves tab delimited files without newline (UNIX/Mac os X)

This is a common issue for me and my solution is a bit brash, so I'm looking for a quick fix and an explanation of the problem.
The problem is that when I save a spreadsheet in Excel (Mac 2011) as a tab delimited file, it seems to work perfectly fine, until I try to parse the file line by line using Perl. For some reason it slurps the whole document as one line.
My brutish solution is to open the file in a web browser and copy and paste the information into a tab delimited file in TextEdit (I never use rich text format). I tried introducing a newline at the end of the file before doing this fix, and it does not resolve the issue.
What's going on here? An explanation would be appreciated.
~Thanks!~
The problem is the actual character codes that define new lines on different systems. Windows systems commonly use a CarriageReturn+LineFeed (CRLF), *NIX systems use only a LineFeed (LF), and classic Mac software, including Excel for Mac, typically writes a bare CarriageReturn (CR), which is why Perl, expecting LF, sees the whole file as one line.
These characters can be represented in RegEx as \r\n, \n, or \r (respectively).
Sometimes, to work through a text file, you need to parse the newline characters. Try this for DOS-to-UNIX in perl:
perl -pi -e 's/\r\n/\n/g' input.file
or, for UNIX-to-DOS using sed:
$ sed 's/$'"/`echo \\\r`/" input.txt > output.txt
or, for DOS-to-UNIX using sed (enter the ^M as a literal carriage return by typing Ctrl-V then Ctrl-M):
$ sed 's/^M$//' input.txt > output.txt
Found a pretty simple solution to this: copy the data from Excel to the clipboard and paste it into a Google spreadsheet, then download the spreadsheet as a 'tab-separated values .tsv' file. This gets around the problem, and you have tab delimiters with an end of line for each line.
Yet another solution ...
for a tab-delimited file, save the document as a 'Windows Formatted Text (.txt)' file type
for a comma-separated file, save the document as a 'Windows Comma Separated (.csv)' file type
Perl has a useful regex pattern \R which will match any common line ending. It actually matches any vertical whitespace -- the same as \v -- or the CR LF combination, so it's the same as \r\n|\v.
This is useful here because you can slurp your entire file into a single scalar and then split /\R/, which will give you a list of file records, already chomped (if you want to keep the line terminators you can split /\R\K/ instead).
Another option is the PerlIO::eol module. It provides a new Perl IO layer that will normalize line endings no matter what the contents of the file are
Once you have loaded the module with use PerlIO::eol you can use it in an open statement
open my $fh, '<:eol(LF)', 'myfile.tsv' or die $!;
or you can use the open pragma to set it as the default layer for all input file handles
use open IN => ':raw:eol(LF)';
which will work fine with an input file from any platform
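Putting \R to work as a one-off filter from the shell (a sketch; the file names are assumed, \R needs Perl 5.10 or later, and -0777 slurps the whole file, which is fine for typical spreadsheet exports):
# Normalize every line ending (CRLF, bare CR, or LF) to LF
perl -0777 -pe 's/\R/\n/g' excel_export.txt > unix.txt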
