how to read one line to calculate the md5 - bash

I am using Linux bash version 4.1.2
I have a tab-delimited input_file having 5 fields and I want to calculate the MD5 for each line and put the md5sum at the end of each line.
The expected output_file should therefore have 6 fields on each line.
Here is my code:
cat input_file | while read ONELINE
do
    THEMD5=`echo "$ONELINE" | md5sum | awk '{print $1}'`
    echo -e "${ONELINE}\t${THEMD5}"
done > output_file
The code works well most of the time.
However, if ONELINE ends with one or two tabs, the trailing tab(s) disappear!
As a result, the output_file will sometimes contain lines of 4 or 5 fields, due to the missing tab(s).
I have tried adding IFS= or IFS='' or IFS=$'\n' or IFS=$'\012' to the while statement, but that still does not solve the problem.
Please help.
Alvin SIU

The following is quite certainly correct, if you want the trailing newline included in each md5sum (as your original code does):
while IFS= read -r line; do
    read sum _ < <(printf '%s\n' "$line" | md5sum -)
    printf '%s\t%s\n' "$line" "$sum"
done <input_file
Notes:
Characters inside IFS are stripped by read; setting IFS= is sufficient to prevent this effect.
Without the -r argument, read also interprets backslash literals, stripping them.
Using echo -e is dangerous: It interprets escape sequences inside your line, rather than emitting them as literals.
Using all-uppercase variable names is bad form. See the relevant spec (particularly the fourth paragraph), keeping in mind that shell variables and environment variables share a namespace.
Using echo in general is bad form when dealing with uncontrolled data (particularly including data which can contain backslash literals). See the relevant POSIX spec, particularly the APPLICATION USAGE and RATIONALE sections.
If you want to print the lines in a way that makes hidden characters visible, consider using '%q\t%s\n' instead of '%s\t%s\n' as a format string.
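To see the first two notes in action, here is a minimal bash sketch. The sample line and the helper name stored_length are made up for illustration; it compares how much of a line read keeps with and without IFS= and -r:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one line from stdin in the given mode and print
# how many characters ended up in the variable.
stored_length() {
    local line
    case $1 in
        default) read line ;;          # default IFS, backslashes interpreted
        raw)     IFS= read -r line ;;  # nothing stripped, nothing interpreted
    esac
    printf '%s\n' "${#line}"
}

# Sample line: a, TAB, a literal backslash followed by n, then two trailing TABs.
sample=$'a\t\\n\t\t'

printf '%s\n' "$sample" | stored_length default   # prints 3: backslash and trailing tabs lost
printf '%s\n' "$sample" | stored_length raw       # prints 6: the line is kept intact
```

The trailing tabs are exactly what went missing in the question's output_file.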

Related

Why does "cut" command skip first line in this "while read line" loop?

I'm writing a bash script, and I need to take the second field of every line in a file, and save them in another file. I know there are many possible ways to do this, BUT, I tried first using while read line; do, and I got stuck. Now, I really want to know what is happening.
For example, input file would be:
line1 11111
line2 222222
line3 333
line4 4444
(The field separator is "\t").
This is what I was doing:
inputfile=$1
cat $"inputfile" | while read -r line
do
cut -f2 >> results_file
done
The problem is, the output would be:
222222
333
4444
(skipping the first line)
I've already tested hundreds of modifications, and tried to use other commands instead of cut (like sed, grep...). I would appreciate some help, or someone pointing me in the right direction.
Thank you very much!
You are not using the variable $line set by read. Try instead:
inputfile=$1
cat "$inputfile" | while read -r line
do
    echo "$line" | cut -f2 >> results_file
done
In your original code, the while loop actually runs only once, not four times; try putting echo 'Hello!' in the loop of your original code and you will see the message only once. Without the echo "$line" | part, cut -f2 reads from the same stdin as the loop and consumes the rest of the pipe.
That is, your while loop first consumes the first line of the stdin and puts this line in the variable $line, leaving the next three lines for later use. But $line is never used. Instead, the remaining three lines are consumed by the command cut.
All commands within a command group are within the scope of any redirections applied to a command group (or any compound command):
— https://mywiki.wooledge.org/BashGuide/CompoundCommands
The pipe operator creates a subshell environment for each command.
— https://mywiki.wooledge.org/BashGuide/InputAndOutput
We can interpret the quotes as "the stdin to your while loop (i.e., the output of cat "$inputfile") is accessed by cut, unless you sever its access by creating a new subshell e.g., by another pipe echo "$line" | ...."
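A small bash sketch makes the shared stdin visible. The function names and the three-line sample input are invented for this illustration:

```shell
#!/usr/bin/env bash
# Hypothetical demo: the loop body shares the loop's stdin, so cut reads
# whatever read has not consumed yet.
demo_shared_stdin() {
    printf 'line1\t1\nline2\t2\nline3\t3\n' |
    while read -r line; do
        echo "iteration"
        cut -f2          # no input given, so it drains the rest of the pipe
    done
}

# Same loop, but cut is fed only the current line.
demo_private_stdin() {
    printf 'line1\t1\nline2\t2\nline3\t3\n' |
    while read -r line; do
        printf '%s\n' "$line" | cut -f2
    done
}

demo_shared_stdin    # prints: iteration, 2, 3 (one per line)
demo_private_stdin   # prints: 1, 2, 3 (one per line)
```

In the first function the word "iteration" appears once, which matches the 'Hello!' experiment suggested above.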
By the way, you can just use cut -f2 "$inputfile" >> results_file without the while loop.
With respect to your comment asking whether this means using "\t at the end" as a separator: no. You're confusing what was suggested, $'\t', with '\t$'. $'\t' means "the literal tab character generated from the escape sequence \t".
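The difference is easy to check in bash. In this sketch the variable names and the sample record are assumptions chosen only for illustration:

```shell
#!/usr/bin/env bash
tab=$'\t'        # ANSI-C quoting: one character, an actual tab
not_tab='\t$'    # plain single quotes: backslash, t, dollar sign

printf '%s\n' "${#tab}"       # prints 1
printf '%s\n' "${#not_tab}"   # prints 3

# Splitting a made-up record on the real tab character:
IFS=$'\t' read -r key url <<< $'home\thttp://example.com/'
printf '%s\n' "$key"          # prints home
```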
You also said in your comment that your real second fields are URLs to be fetched with curl. You shouldn't be using a UUOC (useless use of cat) and cut anyway; here's how to really do this:
while IFS=$'\t' read -r key url; do
    val=$(curl "$url" | whatever)
    printf '%s\t%s\n' "$key" "$val"
done < "$inputfile" > results_file
Replace whatever with whatever command you use to produce the output you want from the curl output.

Using sed with a regex to replace strings

I want to replace some string which contain specific words with another word.
Here is my code
#!/bin/bash
arr='foo/foo/baz foo/bar/baz foo/baz/baz';
for i in ${arr[@]}; do
    echo $i | sed -e 's|foo/(bar\|baz)/baz|test|g'
done
Result
foo/foo/baz
foo/bar/baz
foo/baz/baz
Expected
foo/foo/baz
foo/test/baz
foo/test/baz
There are several things you can improve. Your reason for using the alternate delimiter '|' in the sed substitution expression (to avoid the "picket fence" appearance of \/\/\/) complicates the use of '|' as the OR (alternation) regex component. Choose an alternative delimiter that does not also serve as part of the regular expression; '#' works fine.
Next, there is no reason to loop. Simply use a here string to redirect the contents of arr to sed, and place it all in a command substitution with the "%s\n" format specifier to provide the newline-separated output. (That's a mouthful, but it is actually nothing more than this:)
arr='foo/foo/baz foo/bar/baz foo/baz/baz'
printf "%s\n" $(sed 's#/\(bar\|baz\)/#/test/#g' <<< $arr)
Example Use/Output
To test it out, just select the expressions above and middle-mouse paste the selection into your terminal, e.g.
$ arr='foo/foo/baz foo/bar/baz foo/baz/baz'
> printf "%s\n" $(sed 's#/\(bar\|baz\)/#/test/#g' <<< $arr)
foo/foo/baz
foo/test/baz
foo/test/baz
Look things over and let me know if you have further questions.
How about something like this:
sed -e 's/\(bar\|baz\)\//test\//g'
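One more point about the code in the question: in sed's default basic regular expressions an unescaped ( or ) is a literal character, which is why the answers above write \( and \). As a sketch of an alternative, assuming your sed accepts -E for extended regular expressions (GNU and BSD sed both do), the grouping and alternation need no backslashes at all. The helper name replace_middle is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace a middle path component of bar or baz with test.
replace_middle() {
    sed -E 's#/(bar|baz)/#/test/#'
}

printf '%s\n' foo/foo/baz foo/bar/baz foo/baz/baz | replace_middle
# prints:
# foo/foo/baz
# foo/test/baz
# foo/test/baz
```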

shell: prefixing output with spaces with paste

Often one needs to prefix shell output with 4 spaces to turn it into a valid markdown code block, e.g. when posting a question or answer here on Stack Overflow.
It's actually quite easy to do with sed:
some_command | sed -e 's/^/    /'
But I'd like to do it with paste if possible. Because paste takes 2 files as input, all I came up with was this:
some_command | paste 4_space_file -
where 4_space_file is actually a file whose whole content was 4 spaces.
Is there a neater way to achieve this with paste without having an actual file on the hard drive?
Literal Answers Using Paste
First, to answer your literal question:
some_command | paste <(printf '    \n') -
...yields the same output as passing paste the name of a file with a single line having four spaces and a newline as its content. However, the output from paste in this case is not four-character indents for each line; the first line has four spaces and a tab prepended, subsequent lines are prefixed with only a tab.
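A quick way to confirm this is to run the command on two lines of made-up input and make the whitespace visible. The function name demo_paste is hypothetical, and bash is assumed for the process substitution:

```shell
#!/usr/bin/env bash
# Hypothetical demo of the paste command above with two lines of input.
demo_paste() {
    printf 'first\nsecond\n' | paste <(printf '    \n') -
}

# sed -n l prints tabs as \t and marks each line end with $.
demo_paste | sed -n l
```

Only the first output line carries the four spaces; every line gets the tab that paste uses as its default delimiter.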
If you wanted to generate an input of the appropriate length while still using paste, then you'd end up with something uglier. Say (with bash 4.0 or newer):
ls | {
    mapfile -t lines # read output from ls into an array
    # our answer, here, is to move to three spaces in the input, and use paste -d' ' to
    # ...add a fourth space during processing.
    paste -d' ' \
        <(yes '   ' | head -n "${#lines[@]}") \
        <(printf '%s\n' "${lines[@]}")
}
<() is process substitution syntax, which expands to a filename which, when read from, will yield the output from the code contained.
Better Answers
For a native bash approach, you might also consider defining a function:
indent4() { while IFS= read -r line; do printf '    %s\n' "$line"; done; }
...for later use:
some_command | indent4
Unlike paste, this actually inserts exactly four spaces (with no intervening tab) on every line, for the exact number of lines in your input (no need to synthesize the correct length).
Also consider awk:
awk '{ print "    " $0; }'

"filename too long" bash mv command old files

#! /bin/sh -
cd /PHOTAN || exit
fn=$(ls -t | tail -n -30)
mv -f -- "${fn}" /old
All I want to do is keep the most recent 30 files, but I can't get past the mv "File name too long" problem.
Please help.
The notation "${fn}" passes all the file names to mv as a single argument string, separated by newlines. Just this once, assuming you don't have to worry about file names with spaces in them, you need:
mv -f -- ${fn} /old
If you have file names with spaces in them, then you've got problems starting with parsing the output of the ls command.
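The effect of the quotes can be seen by counting arguments. In this sketch count_args is a hypothetical helper and the three file names stand in for the output of ls:

```shell
#!/usr/bin/env bash
# Hypothetical helper: report how many arguments it received.
count_args() { printf '%s\n' "$#"; }

fn=$'a.jpg\nb.jpg\nc.jpg'   # stand-in for the output of ls

count_args "${fn}"   # prints 1: one long "file name" containing newlines
count_args ${fn}     # prints 3: split on whitespace into separate names
```

With thirty real names joined into one argument, that single "name" easily exceeds the file system's limit, hence "File name too long".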
But what if you do have to worry about spaces in your filenames?
Then, as I stated, you have major problems, starting with the issues of parsing the output of ls.
$ echo > 'a b'
$ echo > ' c d '
$
Two nice file names with spaces in them. They cause merry hell. I'm about to assume you're on Linux or something similar enough. You need to use bash arrays, the stat command, printf, sort -z, sed -z. Or you should simply outlaw filenames with spaces; it is probably easier.
names=( * )
The array names contains each file name as a separate array element, leading and trailing and embedded blanks all handled correctly.
names=( * )
for file in "${names[@]}"
do printf "%s\0" "$(stat -c '%Y' "$file") $file"
done |
sort -nzr |
sed -nze '1,30s/^[0-9][0-9]* //p' |
tr '\0' '\n'
The for loop evaluates the modification time of each file separately, and combines the modification time, a space, and the file name into a single string followed by a null byte to mark the end of the string. The sort command sorts the 'lines' numerically, assuming the lines are terminated by null bytes because of the -z option, and places the most recent file names first. The sed command prints the first 30 'lines' (file names) only; the tr command replaces null bytes with newlines (but in doing so, loses the identity of file name boundaries).
The code works even with file names containing newlines, but only on systems where sed and sort support the (non-standard) -z option to process null-terminated input 'lines' — that means systems using GNU sed and sort (even BSD sed as found on Mac OS X does not, though the Mac OS X sort is GNU sort and does support -z).
Ugh! The shell was designed for spaces to appear between and not within file names.
As noted by BroSlow in a comment, if you assume 'no newlines in filenames', then the code can be simpler and more nearly portable — but it is still tricky:
ls -t |
tail -30 |
{
    list=()
    while IFS='' read -r file
    do list+=( "$file" )
    done
    mv -f -- "${list[@]}" /old
}
The IFS='' is needed so that leading and trailing spaces in filenames are preserved (and tabs, too).
I note in passing that the Korn shell would not require the braces but Bash does.
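If the goal is the one stated in the question, keeping the newest files and moving everything else, a sketch along the following lines may help. It is not taken from the answers above: archive_old and its three parameters are hypothetical, it assumes bash, it assumes no newlines in file names, and it assumes the destination directory is not inside the source directory:

```shell
#!/usr/bin/env bash
# Hypothetical helper: move everything except the newest $keep regular files
# from directory $src into directory $dest.
# Assumes no newlines in file names and that $dest is outside $src.
archive_old() {
    local src=$1 dest=$2 keep=$3 file
    local -a old=()
    # Process substitution keeps the loop in the current shell, so the
    # array is still populated after the loop ends.
    while IFS= read -r file; do
        [[ -f $src/$file ]] && old+=( "$src/$file" )
    done < <(ls -t -- "$src" | tail -n +"$((keep + 1))")
    # Only call mv when there is something to move.
    if (( ${#old[@]} > 0 )); then
        mv -f -- "${old[@]}" "$dest"/
    fi
}
```

A call such as archive_old /PHOTAN /old 30 would then leave the 30 most recently modified files in place. Because ls -t lists the newest files first, tail -n +31 selects everything after the first 30 names.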

How can I read words (instead of lines) from a file?

I've read this question about how to read n characters from a text file using bash. I would like to know how to read a word at a time from a file that looks like:
example text
example1 text1
example2 text2
example3 text3
Can anyone explain that to me, or show me an easy example?
Thanks!
The read command by default reads whole lines. So the solution is probably to read the whole line and then split it on whitespace with e.g. for:
#!/bin/sh
while read line; do
    for word in $line; do
        echo "word = '$word'"
    done
done <"myfile.txt"
The way to do this with standard input is by passing the -a flag to read:
read -a words
echo "${words[@]}"
This will read your entire line into an indexed array variable, in this case named words. You can then perform any array operations you like on words with shell parameter expansions.
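As a minimal sketch of those array operations (the sample line is made up, and a here string stands in for standard input):

```shell
#!/usr/bin/env bash
# Split one made-up line into an array of words.
read -r -a words <<< 'example1 text1'

printf '%s\n' "${#words[@]}"   # number of words: prints 2
printf '%s\n' "${words[1]}"    # second word: prints text1

# Visit the words one at a time.
for word in "${words[@]}"; do
    printf 'word = %s\n' "$word"
done
```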
For file-oriented operations, current versions of Bash also support the mapfile built-in. For example:
mapfile < /etc/passwd
echo ${MAPFILE[0]}
Either way, arrays are the way to go. It's worth your time to familiarize yourself with Bash array syntax to make the most of this feature.
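Note that mapfile stores one line per array element, not one word. A sketch that combines it with read -a to get at the words, using two made-up input lines and assuming bash 4.0 or newer:

```shell
#!/usr/bin/env bash
# mapfile reads lines, not words; -t drops each line's trailing newline.
mapfile -t lines < <(printf 'example1 text1\nexample2 text2\n')

for line in "${lines[@]}"; do
    read -r -a fields <<< "$line"     # split this line into words
    printf '%s\n' "${fields[0]}"      # first word of the line
done
# prints:
# example1
# example2
```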
Ordinarily, you should read from a file using a while read -r line loop. To do this and parse the words on the lines requires nesting a for loop inside the while loop.
Here is a technique that works without requiring nested loops:
for word in $(<inputfile)
do
    echo "$word"
done
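One caveat with this technique: the unquoted $(<inputfile) is subject to pathname expansion as well as word splitting, so a word such as * is replaced by the names of the files in the current directory. A sketch of one way around that, where print_words is a hypothetical helper and bash is assumed:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each word of the file named by $1.
# The function body runs in a subshell ( ... ), so set -f does not leak out.
print_words() (
    set -f                       # turn off pathname (glob) expansion
    for word in $(<"$1"); do
        printf '%s\n' "$word"
    done
)
```

Calling print_words inputfile then prints one word per line, with any * left as a literal asterisk.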
In the context given, where the number of words is known:
while read -r word1 word2 _; do
    echo "Read a line with word1 of $word1 and word2 of $word2"
done
If you want to read each line into an array, read -a will put the first word into element 0 of your array, the second into element 1, etc:
while read -r -a words; do
    echo "First word is ${words[0]}; second word is ${words[1]}"
    declare -p words # print the whole array
done
In bash, just use space as delimiter (read -d ' '). This method requires some preprocessing to translate newlines into spaces (using tr) and to merge several spaces into a single one (using sed):
{
    tr '\n' ' ' | sed 's/  */ /g' | while read -d ' ' WORD
    do
        echo -n "<${WORD}> "
    done
    echo
} << EOF
Here you have some words, including * wildcards
that don't get expanded,
multiple spaces between words,
and lines with spaces at the begining.
EOF
The main advantage of this method is that you don't need to worry about the array syntax and just work as with a for loop, but without wildcard expansion.
I came across this question and the proposed answers, but I don't see this simple possible solution listed:
for word in `cat inputfile`
do
    echo $word
done
This can be done using AWK too:
awk '{for(i=1;i<=NF;i++) {print $i}}' text_file
You can combine xargs, which reads words delimited by spaces or newlines, with echo to print one word per line:
<some-file xargs -n1 echo
some-command | xargs -n1 echo
That also works well for large or slow streams of data because it does not need to read the whole input at once.
I've used this to read one table name at a time from SQLite, which prints table names in a column layout:
sqlite3 db.sqlite .tables | xargs -n1 echo | while read table; do echo "1 table: $table"; done
