Bash 'cut' command for Mac - bash

I want to cut each line at the delimiter ":". The input file is in the following format:
data1:data2
data11:data22
...
I have a linux command
cat merged.txt | cut -f1 -d ":" > output.txt
On mac terminal it gives an error:
cut: stdin: Illegal byte sequence
what is the correct way to do it on a mac terminal?

Your input file (merged.txt) probably contains bytes or byte sequences that are not valid in your current locale. For example, your locale might specify UTF-8 character encoding, but the file might be in some other encoding and so cannot be parsed as valid UTF-8. If this is the problem, you can work around it by telling cut to assume the "C" locale, which basically tells it to process the input as a stream of bytes without paying attention to encoding.
BTW, cat file | is what's commonly referred to as a Useless Use of Cat (UUOC) -- you can just use a standard input redirect < file instead, which is cleaner and more efficient. Thus, my version of your command would be:
LC_ALL=C cut -f1 -d ":" < merged.txt > output.txt
Note that since the LC_ALL=C assignment is a prefix to the cut command, it only applies to that one command and won't mess up other operations that should assume UTF-8 (or whatever your normal locale is).
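To see the workaround in isolation, here is a minimal sketch. It assumes a throwaway sample file (created with mktemp, not your merged.txt) that contains the byte 0xE9, which is "é" in ISO-8859-1 but not a valid UTF-8 sequence on its own; the variable names are made up for the example.

```shell
# Build a sample file containing a byte (octal 351 = 0xE9) that is not valid UTF-8.
sample=$(mktemp)
printf 'data1:caf\351\ndata11:data22\n' > "$sample"

# The LC_ALL=C prefix applies to this single cut invocation only,
# so cut treats the input as plain bytes.
first_fields=$(LC_ALL=C cut -d ':' -f 1 < "$sample")
printf '%s\n' "$first_fields"

rm -f "$sample"
```

On a system where plain cut fails with "Illegal byte sequence", dropping the LC_ALL=C prefix from the line above should reproduce the error.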

Your cut command works for me on my Mac. You can also try awk for the same result:
awk -F: '{print $1}' merged.txt
data1
data11

Related

Bash: how to get the complete substring of a match in a string?

I have a TXT file, which is shipped from a Windows machine and is encoded in ISO-8859-1. My Qt application is supposed to read this file but QString supports only UTF-8 (I want to avoid working with QByteArray). I've been struggling to find a way to do that in Qt so I decided to write a small script that does the conversion for me. I have no problem writing it for exactly my case but I would like to make it more general - for all ISO-8859 encodings.
So far I have the following:
#!/usr/bin/env bash
output=$(file -i $1)
# If the output contains any sort of ISO-8859 substring
if echo "$output" | grep -qi "ISO-8859"; then
# Retrieve actual encoding
encoding=...
# run iconv to convert
iconv -f $encoding $1 -t UTF-8 -o $1
else
echo "Text file not encoded in ISO-8859"
fi
The part that I'm struggling with is how to get the complete substring that has been successfully matched in the grep command.
Let's say I have the file helloworld.txt and it's encoded in ISO-8859-15. In this case
$~: ./fixEncodingToUtf8 helloworld.txt
stations.txt: text/plain; charset=iso-8859-15
will be the output in the terminal. Internally the grep finds the iso-8859 (since I use the -i flag it processes the input in a case-insensitive way). At this point the script needs to "extract" the whole substring namely not just iso-8859 but iso-8859-15 and store it inside the encoding variable to use it later with iconv (which is case insensitive (phew!) when it comes to the name of the encodings).
NOTE: The script above can be extended even further by simply retrieving the value that follows charset and using it for the encoding. However this has one huge flaw - what if the input file has an encoding that has a larger character set than UTF-8 (simple example: UTF-16 and UTF-32)?
Or using bash features like below
$ str="stations.txt: text/plain; charset=iso-8859-15"
$ echo "${str#*=}"
iso-8859-15
To save in variable
$ myvar="${str#*=}"
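As a rough sketch, the same expansion can be wrapped in a small function that also checks for the ISO-8859 family. The function name iso8859_charset is hypothetical, and the input is assumed to look like the file -i output shown in the question.

```shell
# Hypothetical helper: print the charset from a line shaped like `file -i` output,
# but only when it names an ISO-8859 variant; otherwise return non-zero.
iso8859_charset() {
    # Strip the shortest prefix ending in "charset=", leaving the encoding name.
    charset="${1#*charset=}"
    case "$charset" in
        [Ii][Ss][Oo]-8859*) printf '%s\n' "$charset" ;;
        *) return 1 ;;
    esac
}

encoding=$(iso8859_charset "stations.txt: text/plain; charset=iso-8859-15")
echo "$encoding"
```

The case pattern matches case-insensitively by spelling out both letter cases, so it does not rely on newer bash features that the default bash on a Mac may lack.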
You can use cut or awk to get at this:
awk:
encoding=$(echo $output | awk -F"=" '{print $2}')
cut:
encoding=$(echo $output | cut -d"=" -f2)
I think you could just feed this over to your iconv command directly and reduce your script to:
iconv -f "$(file -i "$1" | cut -d"=" -f2)" -t UTF-8 "$1"
Well, in this case it is rather pointless…
$ file --brief --mime-encoding "$1"
iso-8859-15
file manual
-b, --brief
Do not prepend filenames to output lines (brief mode).
...
--mime-type, --mime-encoding
Like -i, but print only the specified element(s).

Getting rid of some special symbol while reading from a file

I am writing a small script which is getting some configuration options from a settings file with a certain format (option=value or option=value1 value2 ...).
settings-file:
SomeOption=asdf
IFS=HDMI1 HDMI2 VGA1 DP1
SomeOtherOption=ghjk
Script:
for VALUE in $(cat settings | grep IFS | sed 's/.*=\(.*\)/\1/'); do
echo "$VALUE"x
done
Now I get the following output:
HDMI1x
HDMI2x
VGA1x
xP1
Expected output:
HDMI1x
HDMI2x
VGA1x
DP1x
I obviously can't use the data like this since the last read entry is mangled up somehow. What is going on and how do I stop this from happening?
Regards
Generally you can use awk like this:
awk -F'[= ]' '$1=="IFS"{for(i=2;i<=NF;i++)print $i"x"}' settings
-F'[= ]' splits the line by = or space. The awk program then checks whether the first field, the variable name, equals IFS, and if so iterates through fields 2 to the end and prints them.
However, in comments you said that the file is using Windows line endings. In this case you need to pre-process the file before using awk. You can use tr to remove the carriage return symbols:
tr -d '\r' < settings | awk -F'[= ]' '$1=="IFS"{for(i=2;i<=NF;i++)print $i"x"}'
The reason is likely that your settings file uses DOS line endings.
Once you've fixed that (with dos2unix for example), your loop can also be modified to the following, removing two utility invocations:
for value in $( sed -n -e 's/^IFS.*=\(.*\)/\1/p' settings ); do
echo "$value"x
done
Or you can do it all in one go, removing the need to modify the settings file at all:
tr -d '\r' <settings |
for value in $( sed -n -e 's/^IFS.*=\(.*\)/\1/p' ); do
echo "$value"x
done
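The mangled xP1 is what a terminal shows for DP1 followed by a carriage return and then x: the carriage return moves the cursor back to the start of the line, and the x overwrites the D. A small sketch that reproduces the situation with a temporary CRLF file (the file is generated here just for illustration, it is not your settings file):

```shell
# Recreate the settings file with DOS (CRLF) line endings.
settings=$(mktemp)
printf 'SomeOption=asdf\r\nIFS=HDMI1 HDMI2 VGA1 DP1\r\nSomeOtherOption=ghjk\r\n' > "$settings"

# Delete the carriage returns first, then keep only the value part of the IFS line.
values=$(tr -d '\r' < "$settings" | sed -n 's/^IFS=//p')

# Unquoted $values is split on whitespace on purpose, one word per iteration.
result=$(for value in $values; do echo "${value}x"; done)
echo "$result"

rm -f "$settings"
```

To confirm that a file really has DOS line endings before touching it, od -c settings shows each carriage return as \r.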

Counting commas in a line in bash

Sometimes I receive a CSV file which has a carriage return inside a cell. This is not an acceptable format to a program that will use it as input.
In order to detect if an input line is split, I determined that a bad line would not have the expected number of commas in it. Is there a bash or other common unix command line tool that would allow me to count the commas in the line? If necessary, I can write a Python or Perl program to do it, but if possible, I'd like to add a line or two to an existing bash script to cause it to fail if the comma count is wrong. Any ideas?
Strip everything but the commas, and then count number of characters left:
$ echo foo,bar,baz | tr -cd , | wc -c
2
To count the number of times a comma appears, you can use something like awk:
string=(line of input from CSV file)
echo "$string" | awk -F "," '{print NF-1}'
But this really isn't sufficient to determine whether a field has carriage returns in it. Fields can have commas inside as long as they're surrounded by quotes.
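If the simple count is good enough for your data, a sketch of the "fail if the comma count is wrong" check could look like this. The function name check_comma_count and the sample file are made up for the example, and it shares the limitation above: commas inside quoted fields are counted too.

```shell
# Hypothetical helper: succeed only if every line of FILE contains exactly EXPECTED commas.
# Usage: check_comma_count FILE EXPECTED
check_comma_count() {
    awk -v expected="$2" '
        {
            # gsub returns the number of replacements, i.e. the number of commas.
            n = gsub(/,/, ",")
            if (n != expected) {
                printf "line %d: found %d commas, expected %d\n", NR, n, expected
                bad = 1
            }
        }
        END { exit bad + 0 }
    ' "$1"
}

# Example: the second record was split in two by a stray line break.
sample=$(mktemp)
printf 'a,b,c\nd,e\nf\n' > "$sample"
if check_comma_count "$sample" 2; then
    status=ok
else
    status=bad
fi
echo "$status"
rm -f "$sample"
```

In an existing script, a line such as check_comma_count input.csv 2 || exit 1 would then abort on the first malformed file.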
What worked for me better than the other solutions was this. If test.txt has:
foo,bar,baz
baz,foo,foobar,bar
Then cat test.txt | xargs -I % sh -c 'echo % | tr -cd , | wc -c' produces
2
3
This works very well for streaming sources, or tailing logs, etc.
In pure Bash:
while IFS=, read -ra array
do
echo "$((${#array[@]} - 1))"
done < inputfile
or
while read -r line
do
count=${line//[^,]}
echo "${#count}"
done < inputfile
Try Perl:
$ perl -ne 'print 0+@{[/,/g]},"\n"'
a
0
a,a
1
a,a,a,a,a
4
Depending on what you are trying to do with the CSV data, it may be helpful to use a wrapper script like csvquote to temporarily replace the problematic newlines (and commas) inside quoted fields, then restore them. For instance:
csvquote inputfile.csv | wc -l
and
csvquote inputfile.csv | cut -d, -f1 | csvquote -u
may be the sort of thing you're looking for. See https://github.com/dbro/csvquote for the code and more information.
An example Python command you could run (since Python is installed on most modern systems) is:
python -c "import pathlib; print({l.count(',') for l in pathlib.Path('my_file.csv').read_text().splitlines()})"
This counts the number of commas per line, then makes a set from them (so if your lines all have the same number of commas in, you'll get a set with just that number in).
Just remove all of the carriage returns:
tr -d "\r" < old_file > new_file

Awk replace a column with its hash value

How can I replace a column with its hash value (like MD5) in awk or sed?
The original file is super huge, so I need this to be really efficient.
So, you don't really want to be doing this with awk. Any of the popular high-level scripting languages -- Perl, Python, Ruby, etc. -- would do this in a way that was simpler and more robust. Having said that, something like this will work.
Given input like this:
this is a test
(E.g., a row with four columns), we can replace a given column with its md5 checksum like this:
awk '{
tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
tmp | getline cksum
$2=cksum
print
}' < sample
This relies on GNU awk (you'll probably have this by default on a Linux system), and it uses openssl to generate the md5 checksum. We first build a shell command line in tmp to pass the selected column to the md5 command. Then we pipe the output into the cksum variable, and replace column 2 with the checksum. Given the sample input above, the output of this awk script would be:
this 7e1b6dbfa824d5d114e96981cededd00 a test
I copy pasted larsks's response, but I have added the close line, to avoid the problem indicated in this post: gawk / awk: piping date to getline *sometimes* won't work
awk '{
tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
tmp | getline cksum
close(tmp)
$2=cksum
print
}' < sample
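Since the file is huge, one possible refinement is to remember checksums that have already been computed, so a process is only forked once per distinct value. The sketch below is a variation on the script above, not a drop-in from either answer: the function name hash_second_column is made up, it assumes md5sum is available (on a Mac you would substitute openssl md5 or md5), and it assumes the column values contain no shell metacharacters, because the value is pasted into a shell command line.

```shell
# Hypothetical helper: replace column 2 of whitespace-separated input with its MD5 sum.
# Assumes md5sum exists and that values contain no quotes, $ or backticks.
hash_second_column() {
    awk '
    {
        if (!($2 in cache)) {
            # Hash the value plus a trailing newline, like the echo-based version.
            cmd = "printf \"%s\\n\" \"" $2 "\" | md5sum"
            cmd | getline line
            close(cmd)
            # md5sum prints "<hash>  -"; keep only the hash.
            split(line, parts, " ")
            cache[$2] = parts[1]
        }
        $2 = cache[$2]
        print
    }'
}

hashed=$(printf 'this is a test\nthat is another row\n' | hash_second_column)
echo "$hashed"
```

The cache only helps when values repeat; if every value is unique, this still forks once per line and the Perl approach further down is likely the better fit.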
This might work using Bash/GNU sed:
<<<"this is a test" sed -r 's/(\S+\s)(\S+)(.*)/echo "\1 $(md5sum <<<"\2") \3"/e;s/ - //'
this 7e1b6dbfa824d5d114e96981cededd00 a test
or a mostly sed solution:
<<<"this is a test" sed -r 'h;s/^\S+\s(\S+).*/md5sum <<<"\1"/e;G;s/^(\S+).*\n(\S+)\s\S+\s(.*)/\2 \1 \3/'
this 7e1b6dbfa824d5d114e96981cededd00 a test
Both commands replace "is" in "this is a test" with its md5sum.
Explanation:
In the first: identify the columns and use back references as parameters in the Bash command, which is substituted and evaluated; then make a cosmetic change to drop the file name marker (here "-", for standard input) that md5sum prints.
In the second: similar to the first, but stash the input string in the hold space (h); then, after evaluating the md5sum command, append the hold space to the pattern space (which now holds the md5sum result) with G, and rearrange the pieces with a substitution.
You can also do that with perl :
echo "aze qsd wxc" | perl -MDigest::MD5 -ne 'print "$1 ".Digest::MD5::md5_hex($2)." $3" if /([^ ]+) ([^ ]+) ([^ ]+)/'
aze 511e33b4b0fe4bf75aa3bbac63311e5a wxc
If you want to obfuscate a large amount of data, it might be faster than sed and awk, which need to fork an md5sum process for each line.
You might have a better time with read than awk, though I haven't done any benchmarking.
the input (scratch001.txt):
foo|bar|foobar|baz|bang|bazbang
baz|bang|bazbang|foo|bar|foobar
transformed using read:
while IFS="|" read -r one fish twofish red fishy bluefishy; do
twofish=`echo -n $twofish | md5sum | tr -d " -"`
echo "$one|$fish|$twofish|$red|$fishy|$bluefishy"
done < scratch001.txt
produces the output:
foo|bar|3858f62230ac3c915f300c664312c63f|baz|bang|bazbang
baz|bang|19e737ea1f14d36fc0a85fbe0c3e76f9|foo|bar|foobar

Removing non-displaying characters from a file

$ cat weirdo
Lunch now?
$ cat weirdo | grep Lunch
$ vi weirdo
^@L^@u^@n^@c^@h^@ ^@n^@o^@w^@?^@
I have some files that contain text with some non-printing characters like ^@ which cause my greps to fail (as above).
How can I get my grep to work? Is there some way that does not require altering the files?
It looks like your file is encoded in UTF-16 rather than an 8-bit character set. The '^@' is a notation for ASCII NUL '\0', which usually spoils string matching.
One technique for loss-less handling of this would be to use a filter to convert UTF-16 to UTF-8, and then using grep on the output - hypothetically, if the command was 'utf16-utf8', you'd write:
utf16-utf8 weirdo | grep Lunch
As an appallingly crude approximation to 'utf16-utf8', you could consider:
tr -d '\0' < weirdo | grep Lunch
This deletes ASCII NUL characters from the input file and lets grep operate on the 'cleaned up' output. In theory, it might give you false positives; in practice, it probably won't.
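In practice, iconv can play the role of the hypothetical 'utf16-utf8' filter. A minimal sketch, assuming the file is UTF-16 big-endian without a byte order mark (which is what a NUL before each letter suggests); the sample file is generated here and stands in for weirdo:

```shell
# "Lunch now?" plus newline as UTF-16BE without a BOM:
# every character is preceded by a NUL byte.
weirdo=$(mktemp)
printf '\000L\000u\000n\000c\000h\000 \000n\000o\000w\000?\000\n' > "$weirdo"

# Convert to UTF-8 on the fly; the file itself is not modified.
match=$(iconv -f UTF-16BE -t UTF-8 < "$weirdo" | grep Lunch)
echo "$match"

rm -f "$weirdo"
```

If the file starts with a byte order mark, -f UTF-16 lets iconv pick the byte order itself; for little-endian data without one, use -f UTF-16LE.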
The tr command is made for that:
cat weirdo | tr -cd '[:print:]\r\n\t' | grep Lunch
You may have some success with the strings(1) tool like in:
strings file | grep Lunch
See man strings for more details.
you can try
awk '{gsub(/[^[:print:]]/,"") }1' file
