'Read' command stripping '\n' string - bash

I want to extract data from a file which looks like this:
BK20120802130531:/home/michael/Scripts/usb_backup.sh
BK20120802130531:/home/michael/Scripts/yad_0.17.1.1-1_i386.deb
BK20120802130731:/home/michael/Scripts/gbk.sh
BK20120802130131:/home/michael/Scripts/alt-notify-send.sh
BK20120802130131:/home/michael/Scripts/bk.bak
BK20120802130131:/home/michael/Scripts/bk.sh
BK20120802130131:/home/michael/Scripts/demande_password.sh
The idea is to show on the screen (without creating a temporary file and without modifying the original file) the following:
alt-notify-send.sh
/home/michael/Scripts
bk.bak
/home/michael/Scripts
bk.sh
/home/michael/Scripts
demande_password.sh
/home/michael/Scripts
gbk.sh
/home/michael/Scripts
usb_backup.sh
/home/michael/Scripts
yad_0.17.1.1-1_i386.deb
/home/michael/Scripts
To sum up:
1. Strip the characters before ':'
2. Put the filenames before their corresponding directory
3. Sort the filenames in alphabetical order
4. Print a newline between each filename and its corresponding directory
I succeeded in doing all this, but there is still an ugly thing in my code concerning point #4:
cut -f 2 -d ':' $big_file | \
sort -u | \
while read file ; do
    echo "$(basename "$file")zipzapzupzop$(dirname "$file")"    # <-- ugly thing #1
done | \
sort -dfb | \
while read line ; do
    echo $line
done | \
sed 's/zipzapzupzop/\n/'    # <-- ugly thing #2
At the beginning, I had written:
echo "$(basename "$file")\n$(dirname "$file")"
in place of ugly thing #1, in order to be able to do
echo -e "$line"
in the second while loop. However, the read command strips the '\n' string each time, so that I obtain
alt-notify-send.shn/home/michael/Scripts
bk.bakn/home/michael/Scripts
bk.shn/home/michael/Scripts
demande_password.shn/home/michael/Scripts
gbk.shn/home/michael/Scripts
usb_backup.shn/home/michael/Scripts
yad_0.17.1.1-1_i386.debn/home/michael/Scripts
I tried to protect the '\' character with another '\', but the result is the same.
man read
is of no help either. So, is there a proper way to do this?

read is a shell builtin, and man read may be giving you the docs for the (mostly unrelated) syscall.
read -r will prevent read from processing \ sequences.
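A quick demonstration of the difference (printf is used here so the backslash reaches read intact):
$ printf '%s\n' 'foo\nbar' | while read line; do echo "$line"; done
foonbar
$ printf '%s\n' 'foo\nbar' | while read -r line; do echo "$line"; done
foo\nbar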
The whole thing could have been done with a single awk script though:
awk '
    {
        start = index($0, ":") + 1
        end = match($0, "[^/]*$")
        out[NR] = substr($0, end) "\n" substr($0, start, end - start - 1)
    }
    END {
        asort(out)    # note: asort() is specific to GNU awk (gawk)
        for (i = 1; i <= NR; i++)
            print out[i]
    }'

If you don't need to handle spaces in filenames, you can do this:
cat $bigfile | sed 's/.*://' | while read file; do
    echo "$(basename $file) $(dirname $file)"
done | sort | awk '{print $1"\n"$2}'

You can do it with the following pipeline (should be on one line, I've split it and added comments for readability):
| sed -e 's/^[^:]*://' # Remove from start of line to first ':'
-e 's?/\([^/]*$\)? \1?' # Replace final '/' with a space
| sort -k2 # Sort on column 2 (filename)
| awk '{print $2"\n"$1}' # Reverse fields
See the following transcript:
$ echo 'BK20120802130531:/home/michael/Scripts/usb_backup.sh
BK20120802130531:/home/michael/Scripts/yad_0.17.1.1-1_i386.deb
BK20120802130731:/home/michael/Scripts/gbk.sh
BK20120802130131:/home/michael/Scripts/alt-notify-send.sh
BK20120802130131:/home/michael/Scripts/bk.bak
BK20120802130131:/home/michael/Scripts/bk.sh
BK20120802130131:/home/michael/Scripts/demande_password.sh' \
| sed -e 's/^[^:]*://' -e 's?/\([^/]*$\)? \1?' \
| sort -k2 \
| awk '{print $2"\n"$1}'
alt-notify-send.sh
/home/michael/Scripts
bk.bak
/home/michael/Scripts
bk.sh
/home/michael/Scripts
demande_password.sh
/home/michael/Scripts
gbk.sh
/home/michael/Scripts
usb_backup.sh
/home/michael/Scripts
yad_0.17.1.1-1_i386.deb
/home/michael/Scripts
Just keep in mind that sort may not work as expected with lines containing spaces.
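For instance, with a hypothetical entry whose filename contains a space, the fields get split at the wrong place:
$ echo 'BK20120802130131:/home/michael/Scripts/my file.sh' \
| sed -e 's/^[^:]*://' -e 's?/\([^/]*$\)? \1?' \
| awk '{print $2"\n"$1}'
my
/home/michael/Scripts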

Assuming you do not have hash characters (#) in your filenames, you could use this coreutils pipeline:
cut -d: -f2- infile \
| sed -r 's,(.*)/([^/]*)$,\2#\1,' \
| sort -t'#' \
| tr '#' '\n'
cut removes the first part.
sed splits the path, swaps filename and directory, and delimits them with a #.
sort sorts the hash-delimited text.
tr finally replaces the hash with a newline.
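With the sample data, the stream between sed and sort looks like this (first three lines shown), which is why a plain sort is enough:
usb_backup.sh#/home/michael/Scripts
yad_0.17.1.1-1_i386.deb#/home/michael/Scripts
gbk.sh#/home/michael/Scripts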
If you know the number of path elements, you can use the simpler version:
cut -d: -f2- infile \
| sort -t/ -k4,4 \
| sed -r 's,(.*)/([^/]*)$,\2\n\1,'


bash check for words in first file not contained in second file

I have a txt file containing multiple lines of text, for example:
This is a
file containing several
lines of text.
Now I have another file containing just words, like so:
this
contains
containing
text
Now I want to output the words which are in file 1, but not in file 2. I have tried the following:
cat file_1.txt | xargs -n1 | tr -d '[:punct:]' | sort | uniq | comm -i23 - file_2.txt
xargs -n1 to put each space-separated substring on its own line.
tr -d '[:punct:]' to remove punctuation.
sort and uniq to make a sorted file to use with comm, which is used with the -i flag to make it case insensitive.
But somehow this doesn't work. I've looked around online and found similar questions; however, I wasn't able to figure out what I was doing wrong. Most answers to those questions were working with two files that were already sorted and stripped of newlines, spaces, and punctuation, while my file_1 may contain any of those.
Desired output:
is
a
file
several
lines
of
paste + grep approach:
grep -Eiv "($(paste -sd'|' <file2.txt))" <(grep -wo '\w*' file1.txt)
The output:
is
a
file
several
lines
of
I would try something more direct:
for A in `cat file1 | tr -d '[:punct:]'`; do grep -wq $A file2 || echo $A; done
Flags used for grep: -q for quiet (we don't need the output), -w for word match.
One in awk:
$ awk -F"[^A-Za-z]+" ' # anything but a letter is a field delimiter
NR==FNR { # process the word list
a[tolower($0)]
next
}
{
for(i=1;i<=NF;i++) # loop all fields
if(!(tolower($i) in a)) # if word was not in the word list
print $i # print it. duplicates are printed also.
}' another_file txt_file
Output:
is
a
file
several
lines
of
grep:
$ grep -vwi -f another_file <(cat txt_file | tr -s -c '[a-zA-Z]' '\n')
is
a
file
several
lines
of
This pipeline will take the original file, replace spaces with newlines, convert to lowercase, then use grep to filter (-v) full words (-w) case insensitive (-i) using the lines in the given file (-f file2):
cat file1 | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | grep -vwif file2

How to use awk to select text from a file starting from a line number until a certain string

I have a file that I want to read starting from a certain line number until a certain string. I already used
awk "NR>=$LINE && NR<=$((LINE + 121)) {print}" db_000022_model1.dlg
to read from a specific line until an incremented line number, but now I need to make it stop by itself at a certain string so that I can use it on other files. The end of the section I want looks like this:
DOCKED: ENDBRANCH 7 22
DOCKED: TORSDOF 3
DOCKED: TER
DOCKED: ENDMDL
I want it to stop after it reaches
DOCKED: ENDMDL
#!/bin/bash
# This script is for extracting the pdb files from a sorted list of scored
# ligands
mkdir top_poses
for d in $(head -20 summary_2.0.sort | cut -d, -f1 | cut -d/ -f1)
do
    cd "$d" || continue
    # find the cluster with the highest population within the dlg
    RUN=$(grep '###*' "$d.dlg" | sort -k10 -r | head -1 | cut -d\| -f3 | sed 's/ //g')
    LINE=$(grep -ni "BEGINNING GENETIC ALGORITHM DOCKING $RUN of 100" "$d.dlg" | cut -d: -f1)
    echo "$LINE"
    # extract the best pose and correct the format
    awk -v line="$((LINE + 14))" "NR>=line; /DOCKED: ENDMDL/{exit}" "$d.dlg" | sed 's/^........//' > "$d.pdbqt"
    # convert the pdbqt file into pdb
    #obabel -ipdbqt $d.pdbqt -opdb -O../top_poses/$d.pdb
    cd ..
done
When I run
awk -v line="$((LINE + 14))" "NR>=line; /DOCKED: ENDMDL/{exit}" "$d.dlg" | sed 's/^........//' > "$d.pdbqt"
just like that in the shell terminal, it works. But in the script it outputs an empty file.
Depending on your requirements for handling DOCKED: ENDMDL occurring before your target line:
awk -v line="$LINE" 'NR>=line; /DOCKED: ENDMDL/{exit}' db_000022_model1.dlg
or:
awk -v line="$LINE" 'NR>=line{print; if (/DOCKED: ENDMDL/) exit}' db_000022_model1.dlg
The first version exits as soon as the marker appears anywhere in the file, even before $LINE; the second only starts checking for it once printing has begun.

Find unique words

Suppose there is a file.txt in which the following text is written:
ABC/xyz
ABC/xyz/rst
EFG/ghi
I need to write a shell script that extracts the unique words before the first /.
So as output, I want ABC and EFG written to one file.
You can extract the first word with cut (slash as delimiter), then pipe to sort with the -u (for "unique") option:
$ cut -d '/' -f 1 file.txt | sort -u
ABC
EFG
To get the output into a file, just redirect by appending > filename to the command. (Or pipe to tee filename to see the output and get it in a file.)
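For example:
cut -d '/' -f 1 file.txt | sort -u > outfile.txt         # write to a file
cut -d '/' -f 1 file.txt | sort -u | tee outfile.txt     # write and also display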
Try this:
cat file.txt | tr -s "/" ' ' | awk -F " " '{print $1}' | sort | uniq > outfile.txt
Another interesting variation:
awk -F'/' '{print $1 |" sort -u" }' file.txt > outfile.txt
Not that it matters here, but being able to pipe and redirect within awk can be very handy.
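For instance, a minimal sketch of redirecting inside awk (this only demonstrates the syntax; unlike the version above, it does no sorting or deduplication):
awk -F'/' '{print $1 > "outfile.txt"}' file.txt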
Another easy way:
cut -d"/" -f1 file.txt | uniq > out.txt
(Note that uniq only removes adjacent duplicates, so this relies on identical prefixes being grouped together, as they are in the sample input.)
You can use a mix of cut and sort like so:
cut -d '/' -f 1 file.txt | sort -u > newfile.txt
cut grabs everything before the first slash /, sort -u sorts the result and removes any duplicate strings, and the output is redirected into newfile.txt.

Word frequency tally script is too slow

Background
I created a script to count the frequency of words in a plain text file. The script performs the following steps:
Count the frequency of words from a corpus.
Retain each word in the corpus found in a dictionary.
Create a comma-separated file of the frequencies.
The script is at: http://pastebin.com/VAZdeKXs
#!/bin/bash
# Create a tally of all the words in the corpus.
#
echo Creating tally of word frequencies...
sed -e 's/ /\n/g' -e 's/[^a-zA-Z\n]//g' corpus.txt | \
    tr [:upper:] [:lower:] | \
    sort | \
    uniq -c | \
    sort -rn > frequency.txt

echo Creating corpus lexicon...
rm -f corpus-lexicon.txt
for i in $(awk '{if( $2 ) print $2}' frequency.txt); do
    grep -m 1 ^$i\$ dictionary.txt >> corpus-lexicon.txt;
done

echo Creating lexicon...
rm -f lexicon.txt
for i in $(cat corpus-lexicon.txt); do
    egrep -m 1 "^[0-9 ]* $i\$" frequency.txt | \
        awk '{print $2, $1}' | \
        tr ' ' ',' >> lexicon.txt;
done
Problem
The following lines continually cycle through the dictionary to match words:
for i in $(awk '{if( $2 ) print $2}' frequency.txt); do
    grep -m 1 ^$i\$ dictionary.txt >> corpus-lexicon.txt;
done
It works, but it is slow because it scans the entire dictionary once for every single word it found, in order to discard the words that are not in the dictionary. (The -m 1 parameter stops the scan as soon as a match is found.)
Question
How would you optimize the script so that the dictionary is not scanned from start to finish for every single word? The majority of the words will not be in the dictionary.
Thank you!
You can use grep -f to search for all of the words in one pass over frequency.txt:
awk '{print $2}' frequency.txt | grep -Fxf dictionary.txt > corpus-lexicon.txt
-F to search for fixed strings.
-x to match whole lines only.
-f to read the search patterns from dictionary.txt
In fact, you could even combine this with the second loop and eliminate the intermediate corpus-lexicon.txt file. The two for loops can be replaced by a single grep:
grep -Fwf dictionary.txt frequency.txt | awk '{print $2 "," $1}'
Notice that I changed -x to -w: the lines of frequency.txt also carry the count, so a whole-line match would never succeed; -w matches each dictionary word as a whole word within the line.
This is typically one of those scripts that you'd write in Perl for speed. But if, like me, you hate write-only programming languages, you can do it all in Awk:
awk '
    BEGIN {
        while ((getline < "dictionary.txt") > 0)
            dict[$1] = 1
    }
    ($2 && $2 in dict) { print $2 }
' < frequency.txt > corpus-lexicon.txt
No need for the rm -f corpus-lexicon.txt in this version.
Use a real programming language. All of the app start-ups and file scans are killing you. For instance, here's an example I just whipped up in Python (minimizing lines of code):
# Python 2 (note: use counts.items() on Python 3)
import sys, re
words = re.findall(r'(\w+)', open(sys.argv[1]).read())
counts = {}
for word in words:
    counts[word] = counts.setdefault(word, 0) + 1
open(sys.argv[2], 'w').write("\n".join([w + ',' + str(c) for (w, c) in counts.iteritems()]))
Testing against a large text file I had sitting around (1.4 MB, 80,000 words according to wc), this completes in under a second (18k unique words) on a five-year-old PowerMac.

Only get hash value using md5sum (without filename)

I use md5sum to generate a hash value for a file.
But I only need to receive the hash value, not the file name.
md5=`md5sum ${my_iso_file}`
echo ${md5}
Output:
3abb17b66815bc7946cefe727737d295 ./iso/somefile.iso
How can I 'strip' the file name and only retain the value?
A simple array assignment works... Note that the first element of a Bash array can be addressed by just the name without the [0] index, i.e., $md5 contains only the 32-character hash.
md5=($(md5sum file))
echo $md5
# 53c8fdfcbb60cf8e1a1ee90601cc8fe2
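If you ever need the file name as well, it is in the second element:
echo ${md5[1]}
# ./iso/somefile.iso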
Using AWK:
md5=`md5sum ${my_iso_file} | awk '{ print $1 }'`
You can use cut to split the line on spaces and return only the first such field:
md5=$(md5sum "$my_iso_file" | cut -d ' ' -f 1)
On Mac OS X:
md5 -q file
md5="$(md5sum "${my_iso_file}")"
md5="${md5%% *}" # remove the first space and everything after it
echo "${md5}"
Another way is to do:
md5sum filename | cut -f 1 -d " "
cut splits the line on each space and returns only the first field.
By leaning on head:
md5_for_file=`md5sum ${my_iso_file}|head -c 32`
One way:
set -- $(md5sum $file)
md5=$1
Another way:
md5=$(md5sum $file | while read sum file; do echo $sum; done)
Another way:
md5=$(set -- $(md5sum $file); echo $1)
(Do not try that with backticks unless you're very brave and very good with backslashes.)
The advantage of these solutions over other solutions is that they only invoke md5sum and the shell, rather than other programs such as awk or sed. Whether that actually matters is then a separate question; you'd probably be hard pressed to notice the difference.
If you need to print it and don't need a newline, you can use:
printf $(md5sum filename)
md5=$(md5sum < $file | tr -d ' -')
md5=`md5sum ${my_iso_file} | cut -b-32`
md5sum puts a backslash before the hash if there is a backslash in the file name, so the first 32 characters, or anything before the first space, may not be a proper hash.
This will not happen when using standard input (the file name will be just -), so pixelbeat's answer will work, but many others will require adding something like | tail -c 32.
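For example, with GNU coreutils and a hypothetical empty file whose name contains a backslash:
$ touch 'back\slash.iso'
$ md5sum 'back\slash.iso'
\d41d8cd98f00b204e9800998ecf8e42e  back\\slash.iso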
If you're concerned about screwy filenames:
md5sum < "${file_name}" | awk NF=1
f244e67ca3e71fff91cdf9b8bd3aa7a5
Other, messier ways to deal with this:
md5sum "${file_name}" | awk NF=NF OFS= FS=' .*$'
or
| awk '_{ exit }++_' RS=' '
f244e67ca3e71fff91cdf9b8bd3aa7a5
To do it entirely inside awk:
mawk 'BEGIN {
__ = ARGV[ --ARGC ]
_ = sprintf("%c",(_+=(_^=_<_)+_)^_+_*++_)
RS = FS
gsub(_,"&\\\\&",__)
( _=" md5sum < "((_)(__)_) ) | getline
print $(_*close(_)) }' "${file_name}"
f244e67ca3e71fff91cdf9b8bd3aa7a5
Well, I had the same problem today, but I was trying to get the file's MD5 hash when running the find command.
I took the most upvoted answer and wrapped it in a function called md5 to run in the find command. The mission for me was to calculate the hash for all files in a folder and output it as hash:filename.
md5() { md5sum $1 | awk '{ printf "%s",$1 }'; }
export -f md5
find -type f -exec bash -c 'md5 "$0"' {} \; -exec echo -n ':' \; -print
So, I got some pieces from here and also from "'find -exec' a shell function in Linux".
For the sake of completeness, a way with sed using a regular expression and a capture group:
md5=$(md5sum "${my_iso_file}" | sed -r 's:\\*([^ ]*).*:\1:')
The regular expression captures everything in a group until a space is reached. Since sed replaces only the part of the line that the expression matches, the pattern has to consume the whole line so that only the captured group remains.
(More about sed and capture groups here: How can I output only captured groups with sed?)
As the sed delimiter I use colons because they don't appear in these file paths, so I don't have to escape the slashes in the filepath.
Another way:
md5=$(md5sum ${my_iso_file} | sed 's/ .*//')
md5=$(md5sum < index.html | head -c -4)
