Removing non-displaying characters from a file - bash

$ cat weirdo
Lunch now?
$ cat weirdo | grep Lunch
$ vi weirdo
^@L^@u^@n^@c^@h^@ ^@n^@o^@w^@?^@
I have some files that contain text with some non-printing characters like ^@, which cause my greps to fail (as above).
How can I get my grep to work? Is there some way that does not require altering the files?

It looks like your file is encoded in UTF-16 rather than an 8-bit character set. The '^@' is vi's notation for ASCII NUL '\0', which usually spoils string matching.
One technique for loss-less handling of this is to use a filter to convert UTF-16 to UTF-8 and then run grep on the output - hypothetically, if the command were 'utf16-utf8', you'd write:
utf16-utf8 weirdo | grep Lunch
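In practice, iconv(1) can play the role of that hypothetical filter; assuming the file is little-endian UTF-16 (use UTF-16BE in -f if it turns out to be big-endian):
iconv -f UTF-16LE -t UTF-8 weirdo | grep Lunch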
As an appallingly crude approximation to 'utf16-utf8', you could consider:
tr -d '\0' < weirdo | grep Lunch
This deletes ASCII NUL characters from the input file and lets grep operate on the 'cleaned up' output. In theory, it might give you false positives; in practice, it probably won't.
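If you want to confirm what the non-printing bytes actually are before stripping them, od(1) will dump them byte by byte:
od -c weirdo | head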

The tr command is made for that:
cat weirdo | tr -cd '[:print:]\r\n\t' | grep Lunch

You may have some success with the strings(1) tool, as in:
strings file | grep Lunch
See man strings for more details.
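Since the data appears to be UTF-16, GNU strings can also scan for 16-bit characters directly; -e l selects 16-bit little-endian encoding (-e b would select big-endian):
strings -e l weirdo | grep Lunch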

You can try:
awk '{ gsub(/[^[:print:]]/, "") } 1' file
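and pipe the cleaned output straight to grep:
awk '{ gsub(/[^[:print:]]/, "") } 1' weirdo | grep Lunch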

Related

Bash 'cut' command for Mac

I want to cut each line at the delimiter ":". The input file is in the following format:
data1:data2
data11:data22
...
I have a Linux command:
cat merged.txt | cut -f1 -d ":" > output.txt
On the Mac terminal it gives an error:
cut: stdin: Illegal byte sequence
What is the correct way to do it on a Mac terminal?
Your input file (merged.txt) probably contains bytes/byte sequences that are not valid in your current locale. For example, your locale might specify UTF-8 character encoding, but the file may be in some other encoding that cannot be parsed as valid UTF-8. If this is the problem, you can work around it by telling cut to assume the "C" locale, which basically tells it to process the input as a stream of bytes without paying attention to encoding.
BTW, cat file | is what's commonly referred to as a Useless Use of Cat (UUOC) -- you can just use a standard input redirect < file instead, which is cleaner and more efficient. Thus, my version of your command would be:
LC_ALL=C cut -f1 -d ":" < merged.txt > output.txt
Note that since the LC_ALL=C assignment is a prefix to the cut command, it only applies to that one command and won't mess up other operations that should assume UTF-8 (or whatever your normal locale is).
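If you'd rather fix the data than the locale, file(1) can guess the file's encoding and iconv(1) can convert it; the ISO-8859-1 source encoding below is only an assumption for illustration:
file merged.txt
iconv -f ISO-8859-1 -t UTF-8 merged.txt | cut -f1 -d ":" > output.txt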
Your cut command works for me on my Mac; you can try awk for the same result:
awk -F: '{print $1}' merged.txt
data1
data11

Grep multiple strings from text file

Okay, so I have a text file containing multiple strings, example of this -
Hello123
Halo123
Gracias
Thank you
...
I want grep to use these strings to find lines with matching strings/keywords from other files within a directory
example of text files being grepped -
123-example-Halo123
321-example-Gracias-com-no
321-example-match
so in this instance the output should be
123-example-Halo123
321-example-Gracias-com-no
With GNU grep:
grep -f file1 file2
-f FILE: Obtain patterns from FILE, one per line.
Output:
123-example-Halo123
321-example-Gracias-com-no
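Note that grep treats each line of the pattern file as a regular expression; if your search strings may contain regex metacharacters, add -F to match them as fixed strings:
grep -Ff file1 file2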
You should probably look at the manpage for grep to get a better understanding of the options supported by the grep utility. However, there are a number of ways to achieve what you're trying to accomplish. Here's one approach:
grep -e "Hello123" -e "Halo123" -e "Gracias" -e "Thank you" list_of_files_to_search
However, since your search strings are already in a separate file, you would probably want to use this approach:
grep -f patternFile list_of_files_to_search
I can think of two possible solutions for your question:
Use multiple regular expressions - a regular expression for each word you want to find, for example:
grep -e Hello123 -e Halo123 file_to_search.txt
Use a single regular expression with an "or" operator. Using Perl regular expressions, it will look like the following:
grep -P "Hello123|Halo123" file_to_search.txt
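Note that -P relies on GNU grep being built with PCRE support; for a plain alternation like this, POSIX ERE via -E works just as well:
grep -E "Hello123|Halo123" file_to_search.txt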
EDIT:
As you mentioned in your comment, you want to take the list of words to find from a file and search a whole directory.
You can manipulate the words-to-find file to look like -e flags concatenation:
cat words_to_find.txt | sed 's/^/-e "/;s/$/"/' | tr '\n' ' '
This will return something like -e "Hello123" -e "Halo123" -e "Gracias" -e "Thank you", which you can then pass to grep using xargs:
cat words_to_find.txt | sed 's/^/-e "/;s/$/"/' | tr '\n' ' ' | xargs grep dir_to_search/*
As you can see, the last command also searches in all of the files in the directory (this works because GNU grep accepts the -e options even after the file names, and xargs appends them at the end).
SECOND EDIT: as PesaThe mentioned, the following command does this in a much simpler and more elegant way:
grep -f words_to_find.txt dir_to_search/*

Grep/Sed/Awk Options

How can you use grep, sed, or awk to parse out a substring of dynamic length? Here are some examples:
I need to parse out everything except for the "XXXXX.WAV" in these strings, but the strings are not a set length.
Sometimes it's like this:
{"filename": "/assets/JFM/imaging/19001.WAV"},
{"filename": "/assets/JFM/imaging/19307.WAV"},
{"filename": "/assets/JFM/imaging/19002.WAV"}
And sometimes like this:
{"filename": "/assets/JFM/LN_405999/101.WAV"},
{"filename": "/assets/JFM/LN_405999/102.WAV"},
{"filename": "/assets/JFM/LN_405999/103.WAV"}
Is there a good dynamic way to parse out just the XXXXX.WAV part? Maybe start at the last "/" and parse until the closing quote?
Edit:
Expected output like this:
19001.WAV
19307.WAV
19002.WAV
Or:
101.WAV
102.WAV
103.WAV
Just use grep as proposed in comments:
grep -o '[^/]\{1,\}\.WAV' yourfile
If the WAV file name always consists of digits, this is more explicit (same result):
grep -o '[0-9]\{1,\}\.WAV'
Assuming there are [ and ] lines at the beginning and end of your file, it looks like your input is JSON, in which case I would recommend installing and using jq rather than text-based utilities, and doing something like this:
jq -r '.[]|.filename|split("/")[-1]'
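For example, with the snippets above wrapped in a JSON array in a file (files.json is a hypothetical name):
jq -r '.[]|.filename|split("/")[-1]' files.json
19001.WAV
19307.WAV
19002.WAV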
But failing that, any of the tools listed will work just fine.
grep -o '[^/]*\.WAV'
or
sed -ne 's,.*/\([^/]*\.WAV\).*$,\1,p'
or
awk -F'"' '/WAV/ {split($4,a,"/"); print a[length(a)]}'
In each case there are a variety of other possible solutions as well.
Or with sed
$ sed 's,.*/,,; s,".*,,' x
101.WAV
102.WAV
103.WAV
Explanation:
s,.*/,, - delete everything up to and including the rightmost /
s,".*,, - delete everything starting with the leftmost " to the end of the line
Another awk:
awk -F'[/"]' '{print $(NF-1)}' file
19001.WAV
19307.WAV
19002.WAV
Try this -
awk -F'[{":}/]' '{print $(NF-2)}' f
19001.WAV
19307.WAV
19002.WAV
OR
egrep -o '[[:digit:]]{5}.WAV' f
19001.WAV
19307.WAV
19002.WAV
OR
egrep -o '[[:digit:]]{5}.[[:alpha:]]{3}' f
19001.WAV
19307.WAV
19002.WAV
You can easily change the digit and character counts in the egrep patterns to suit each example, but the awk version works fine for both cases as-is.
All of the programs you listed use regexes to parse the names, so I will show you an example using grep, probably the most basic tool for this case.
There are a couple of options, depending on the exact way you define the XXX part before the ".wav".
Option 1, as you pointed out, is to grab just the file name, i.e., everything after the last slash:
grep -hoi "[^/]\+\.WAV"
This reads as "any character besides slash" ([^/]) repeated at least once (\+), followed by a literal .WAV (\.WAV).
Option 2 would be to only grab the digits before the extension:
grep -hoi "[[:digit:]]\+\.WAV"
OR
grep -hoi "[0-9]\+\.WAV"
These read as "digits" ([[:digit:]] and [0-9] mean the same thing) repeated at least once (\+), followed by a literal .WAV (\.WAV).
In all cases, I recommend using the flags -h, -o, -i, which I have concatenated into a single option -hoi. -h suppresses the file name from the output. -o makes grep only output the portion that matches. -i makes the match case insensitive, so should your extension ever change to .wav instead of .WAV, you'll be fine.
Also, in all cases, the input is up to you. You can pipe it in from another program, which will look like
program | grep -hoi "[^/]\+\.WAV"
You can get it from a file using stdin redirection:
grep -hoi "[^/]\+\.WAV" < somefile.txt
Or you can just pass the filename to grep:
grep -hoi "[^/]\+\.WAV" somefile.txt
awk -F/ '{print substr($5,1,7)}' file
101.WAV
102.WAV
103.WAV

Reading files get stuck in bash

I am running this command on my log files,
grep "." file | tr '|' '\n' | sed -r "s/(.{3}).*?\.cpp/\1TRY/g" | tr '\n''|'
It runs as expected, i.e. for words with a .cpp extension it keeps the first three letters and appends TRY.
So if the input is: abcdef.cpp
the output is: abcTRY
(keeping words without the extension as they are)
But it stops running (gets stuck) after some time; any suggestions on what might be the problem?
Remove the non-greedy quantifier. sed's regular expressions do not support Perl's non-greedy *? syntax, which is the likely cause of the hang; an anchored, negated character class achieves the intended effect:
sed -r "s/^(.{3})[^.]*\.cpp/\1TRY/"
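For example, on a hypothetical input line:
echo 'abcdef.cpp|xyz|ghijk.cpp' | tr '|' '\n' | sed -r "s/^(.{3})[^.]*\.cpp/\1TRY/" | tr '\n' '|'
abcTRY|xyz|ghiTRY|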

Use lines in a file as filenames for grep?

I have a file which contains filenames (and the full path to them) and I want to search for a word within all of them.
some pseudo-code to explain:
grep keyword <all files specified in files.txt>
or
cat files.txt > grep keyword
cat files.txt | grep keyword
the problem is that I can only get grep to search the filenames, not the contents of the actual files.
cat files.txt | xargs grep keyword
or
grep keyword `cat files.txt`
or (equivalent to previous but harder to mis-read)
grep keyword $(cat files.txt)
should do the trick.
Pitfalls:
If files.txt contains file names with spaces, either solution will malfunction, because "This is a filename.txt" will be interpreted as four files, "This", "is", "a", and "filename.txt". A good reason why you shouldn't have spaces in your filenames, ever.
There are ways around this, but none of them is trivial. (find ... -print0 / xargs -0 is one of them; see also the sketch below.)
The second (cat) version can result in a very long command line (which might fail when exceeding the limits of your environment). The first (xargs) version handles long input automatically; xargs offers several options to control the details.
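One relatively painless workaround, assuming GNU xargs: -d '\n' treats each input line as a single argument, so spaces in names are preserved (only names that themselves contain newlines still break):
xargs -d '\n' grep keyword < files.txt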
Both of the answers from DevSolar work (tested on Linux Ubuntu), but the xargs version is preferable if there may be many files, since it will avoid running into command line length limits.
so:
cat files.txt | xargs grep keyword
is the way to go
tr '\n' '\0' <files.txt | LANG=C xargs -r0 grep -F keyword
tr delimits the names with the NUL character so that spaces are not significant (note the corresponding -0 option to xargs).
xargs -r starts a single grep process for a "large" number of files, but does not start any grep process if there are no files.
LANG=C means use quick routines for matching, rather than slow locale ones
grep -F means use quick string matching rather than slow regular expression matching
bash, ksh & zsh version:
grep keyword $(<files.txt)
It's been a long time since I last created a bash shell script, but you could store the result of the first grep (the one finding all filenames) in an array and iterate over it, issuing even more grep commands.
A good starting point would be the bash scripting guide.
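A minimal sketch of that loop idea, reading files.txt line by line (which also copes with spaces in names):
while IFS= read -r f; do
    grep keyword "$f"
done < files.txt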
