How to add more than one delimiter in Bash - bash

I'm trying to figure out how to set two delimiters: one is newline and another one is space so when I read numbers from file that is filled like this
1 2 3
4
5
6
I get one number by order like 1,2,3,4,5,6. I'm using read command to read numbers. Thanks!

Your question is pretty unclear but based on the sample and expected output the trivial solution would be to use tr:
$ cat file | tr '[ \n]' ,
1,2,3,4,5,,6,
but there is a space after 5 so you need to use -s to squeeze the repeats:
$ cat file | tr -s '[ \n]' ,
1,2,3,4,5,6,
which still leaves you with a nasty trailing comma and a missing newline from the end. That can be handled with sed or awk. (with Sapphire and Steel narrator's voice) Awk has been assigned:
$ cat file | tr -s '[ \n]' , | awk 'sub(/,$/,"")' # fails if output is just 0
1,2,3,4,5,6 # add ||1 or ||$0!="" to fix
Wait. Since we started awk why bother with the tr at all:
$ awk '{
gsub(/ +/,",",p) # replace space runs with a single comma
printf "%s",p
p=(p~/,$/||NR==1?"":",") $0 # 5 followed by space leaves a comma in the end so...
}
END {
print p
}' file
1,2,3,4,5,6
Well that turned out looking complicated and I just now noticed you did mention using the read command so maybe I'm way off with my solutions and I should've used bash scripting from the beginning:
s="" # using this neat trick I just learned here ;D
while read line # read a full line
do
for word in $line # read word by word
do
echo -n $s$word # output separator and the word
s=, # now set the separator
done
done < file
echo # newline to follow
1,2,3,4,5,6
Yes, it is Saturday evening and I have no life.

Related

read values of txt file from bash [duplicate]

This question already has answers here:
How to grep for contents after pattern?
(8 answers)
Closed 5 years ago.
I'm trying to read values from a text file.
I have test1.txt which looks like
sub1 1 2 3
sub8 4 5 6
I want to obtain values '1 2 3' when I specify 'sub1'.
The closest I get is:
subj="sub1"
grep "$subj" test1.txt
But the answer is:
sub8 4 5 6
I've read that grep gives you the next line to the match, so I've tried to change the text file to the following:
test2.txt looks like:
sub1
1 2 3
sub8
4 5 6
However, when I type
grep "$subj" test2.txt
The answer is:
sub1
It should be something super simple but I've tried awk, seg, grep,egrep, cat and none is working...I've also read some posts somehow related but none was really helpful
Awk works: awk '$1 == "'"$subj"'" { print $2, $3, $4 }' test1.txt
The command outputs fields two, three, and four for all lines in test1.txt where the first field is $subj (i.e.: the contents of the variable named subj).
With your original text file format:
target=sub1
while IFS=$' \t\n' read -r key values; do
if [[ $key = "$target" ]]; then
echo "Found values: $values"
fi
done <test1.txt
This requires no external tools, using only functionality built into bash itself. See BashFAQ #1.
As has come up during debugging in comments, if you have a traditional Apple-format text file (CR newlines only), then you might want something more like:
target=sub1
while IFS=$' \t\n' read -r -d $'\r' key values || [[ $key ]]; do
if [[ $key = "$target" ]]; then
echo "Found values: $values"
fi
done <test1.txt
Alternately, using awk (for a standard UNIX text file):
target="sub1"
awk -v target="$target" '$1 == target { $1 = ""; print; }' <test1.txt
...or, for a file with CR-only newlines:
target="sub1"
tr '\r' '\n' <test1.txt | awk -v target="$target" '$1 == target { $1 = ""; print; }'
This version will be slower if the text file being read is small (since awk, like any other external tool, takes time to start up); but faster if it's large (since awk's operation is much faster than that of bash's built-ins once it's done starting up).
grep "sub1" test1.txt | cut -c6-
or
grep -A 1 "sub1" test2.txt | tail -n 1
You doing it right, but it seems like test1.txt has a wrong value in it.
with grep foo you get all lines with foo in it. use grep -m1 foo to find the first line with foo in it only.
then you can use cut -d" " -f2- to get all the values behind foo, while seperated by empty spaces.
In the end the command would look like this ...
$ subj="sub1"
$ grep -m1 "$subj" test1.txt | cut -d" " -f2-
But this doenst explain why you could not find sub1 in the first place.
Did you read the proper file ?
There's a bunch of ways to do this (and shorter/more efficient answers than what I'm giving you), but I'm assuming you're a beginner at bash, and therefore I'll give you something that's easy to understand:
egrep "^$subj\>" file.txt | sed "s/^\S*\>\s*//"
or
egrep "^$subj\>" file.txt | sed "s/^[^[:blank:]]*\>[[:blank:]]*//"
The first part, egrep, will search for you subject at the beginning of the line in file.txt (that's what the ^ symbol does in the grep string). It also is looking for a whole word (the \> is looking for an end of word boundary -- that way sub1 doesn't match sub12 in the file.) Notice you have to use egrep to get the \>, as grep by default doesn't recognize that escape sequence. Once done finding the lines, egrep then passes it's output to sed, which will strip the first word and trailing whitespace off of each line. Again, the ^ symbol in the sed command, specifies it should only match at the beginning of the line. The \S* tells it to read as many non-whitespace characters as it can. Then the \s* tells sed to gobble up as many whitespace as it can. sed then replaces everything it matched with nothing, leaving the other stuff behind.
BTW, there's a help page in Stack overflow that tells you how to format your questions (I'm guessing that was the reason you got a downvote).
-------------- EDIT ---------
As pointed out, if you are on a Mac or something like that you have to use [:alnum:] instead of \S, and [:blank:] instead of \s in your sed expression (as these are portable to all platforms)
awk '/sub1/{ print $2,$3,$4 }' file
1 2 3
What happens? After regexp /sub1/ the three following fields are printed.
Any drawbacks? It affects the space.
Sed also works: sed -n -e 's/^'"$subj"' *//p' file1.txt
It outputs all lines matching $subj at the beginning of a line after having removed the matching word and the spaces following. If TABs are used the spaces should be replaced by something like [[:space:]].

Optimize shell script for multiple sed replacements

I have a file containing a list of replacement pairs (about 100 of them) which are used by sed to replace strings in files.
The pairs go like:
old|new
tobereplaced|replacement
(stuffiwant).*(too)|\1\2
and my current code is:
cat replacement_list | while read i
do
old=$(echo "$i" | awk -F'|' '{print $1}') #due to the need for extended regex
new=$(echo "$i" | awk -F'|' '{print $2}')
sed -r "s/`echo "$old"`/`echo "$new"`/g" -i file
done
I cannot help but think that there is a more optimal way of performing the replacements. I tried turning the loop around to run through lines of the file first but that turned out to be much more expensive.
Are there any other ways of speeding up this script?
EDIT
Thanks for all the quick responses. Let me try out the various suggestions before choosing an answer.
One thing to clear up: I also need subexpressions/groups functionality. For example, one replacement I might need is:
([0-9])U|\10 #the extra brackets and escapes were required for my original code
Some details on the improvements (to be updated):
Method: processing time
Original script: 0.85s
cut instead of awk: 0.71s
anubhava's method: 0.18s
chthonicdaemon's method: 0.01s
You can use sed to produce correctly -formatted sed input:
sed -e 's/^/s|/; s/$/|g/' replacement_list | sed -r -f - file
I recently benchmarked various string replacement methods, among them a custom program, sed -e, perl -lnpe and an probably not that widely known MySQL command line utility, replace. replace being optimized for string replacements was almost an order of magnitude faster than sed. The results looked something like this (slowest first):
custom program > sed > LANG=C sed > perl > LANG=C perl > replace
If you want performance, use replace. To have it available on your system, you'll need to install some MySQL distribution, though.
From replace.c:
Replace strings in textfile
This program replaces strings in files or from stdin to stdout. It accepts a list of from-string/to-string pairs and replaces each occurrence of a from-string with the corresponding to-string. The first occurrence of a found string is matched. If there is more than one possibility for the string to replace, longer matches are preferred before shorter matches.
...
The programs make a DFA-state-machine of the strings and the speed isn't dependent on the count of replace-strings (only of the number of replaces). A line is assumed ending with \n or \0. There are no limit exept memory on length of strings.
More on sed. You can utilize multiple cores with sed, by splitting your replacements into #cpus groups and then pipe them through sed commands, something like this:
$ sed -e 's/A/B/g; ...' file.txt | \
sed -e 's/B/C/g; ...' | \
sed -e 's/C/D/g; ...' | \
sed -e 's/D/E/g; ...' > out
Also, if you use sed or perl and your system has an UTF-8 setup, then it also boosts performance to place a LANG=C in front of the commands:
$ LANG=C sed ...
You can cut down unnecessary awk invocations and use BASH to break name-value pairs:
while IFS='|' read -r old new; do
# echo "$old :: $new"
sed -i "s~$old~$new~g" file
done < replacement_list
IFS='|' will give enable read to populate name-value in 2 different shell variables old and new.
This is assuming ~ is not present in your name-value pairs. If that is not the case then feel free to use an alternate sed delimiter.
Here is what I would try:
store your sed search-replace pair in a Bash array like ;
build your sed command based on this array using parameter expansion
run command.
patterns=(
old new
tobereplaced replacement
)
pattern_count=${#patterns[*]} # number of pattern
sedArgs=() # will hold the list of sed arguments
for (( i=0 ; i<$pattern_count ; i=i+2 )); do # don't need to loop on the replacement…
search=${patterns[i]};
replace=${patterns[i+1]}; # … here we got the replacement part
sedArgs+=" -e s/$search/$replace/g"
done
sed ${sedArgs[#]} file
This result in this command:
sed -e s/old/new/g -e s/tobereplaced/replacement/g file
You can try this.
pattern=''
cat replacement_list | while read i
do
old=$(echo "$i" | awk -F'|' '{print $1}') #due to the need for extended regex
new=$(echo "$i" | awk -F'|' '{print $2}')
pattern=${pattern}"s/${old}/${new}/g;"
done
sed -r ${pattern} -i file
This will run the sed command only once on the file with all the replacements. You may also want to replace awk with cut. cut may be more optimized then awk, though I am not sure about that.
old=`echo $i | cut -d"|" -f1`
new=`echo $i | cut -d"|" -f2`
You might want to do the whole thing in awk:
awk -F\| 'NR==FNR{old[++n]=$1;new[n]=$2;next}{for(i=1;i<=n;++i)gsub(old[i],new[i])}1' replacement_list file
Build up a list of old and new words from the first file. The next ensures that the rest of the script isn't run on the first file. For the second file, loop through the list of replacements and perform them each one by one. The 1 at the end means that the line is printed.
{ cat replacement_list;echo "-End-"; cat YourFile; } | sed -n '1,/-End-/ s/$/³/;1h;1!H;$ {g
t again
:again
/^-End-³\n/ {s///;b done
}
s/^\([^|]*\)|\([^³]*\)³\(\n\)\(.*\)\1/\1|\2³\3\4\2/
t again
s/^[^³]*³\n//
t again
:done
p
}'
More for fun to code via sed. Try maybe for a time perfomance because this start only 1 sed that is recursif.
for posix sed (so --posix with GNU sed)
explaination
copy replacement list in front of file content with a delimiter (for line with ³ and for list with -End-) for an easier sed handling (hard to use \n in class character in posix sed.
place all line in buffer (add the delimiter of line for replacement list and -End- before)
if this is -End-³, remove the line and go to final print
replace each first pattern (group 1) found in text by second patttern (group 2)
if found, restart (t again)
remove first line
restart process (t again). T is needed because b does not reset the test and next t is always true.
Thanks to #miku above;
I have a 100MB file with a list of 80k replacement-strings.
I tried various combinations of sed's sequentially or parallel, but didn't see throughputs getting shorter than about a 20-hour runtime.
Instead I put my list into a sequence of scripts like "cat in | replace aold anew bold bnew cold cnew ... > out ; rm in ; mv out in".
I randomly picked 1000 replacements per file, so it all went like this:
# first, split my replace-list into manageable chunks (89 files in this case)
split -a 4 -l 1000 80kReplacePairs rep_
# next, make a 'replace' script out of each chunk
for F in rep_* ; do \
echo "create and make executable a scriptfile" ; \
echo '#!/bin/sh' > run_$F.sh ; chmod +x run_$F.sh ; \
echo "for each chunk-file line, strip line-ends," ; \
echo "then with sed, turn '{long list}' into 'cat in | {long list}' > out" ; \
cat $F | tr '\n' ' ' | sed 's/^/cat in | replace /;s/$/ > out/' >> run_$F.sh ;
echo "and append commands to switch in and out files, for next script" ; \
echo -e " && \\\\ \nrm in && mv out in\n" >> run_$F.sh ; \
done
# put all the replace-scripts in sequence into a main script
ls ./run_rep_aa* > allrun.sh
# make it executable
chmod +x allrun.sh
# run it
nohup ./allrun.sh &
.. which ran in under 5 mins, a lot less than 20 hours !
Looking back, I could have used more pairs per script, by finding how many lines would make up the limit.
xargs --show-limits </dev/null 2>&1 | grep --color=always "actually use:"
Maximum length of command we could actually use: 2090490
So just under 2MB; how many pairs would that be for my script ?
head -c 2090490 80kReplacePairs | wc -l
76923
So it seems I could have used 2 * 40000-line chunks
to expand on chthonicdaemon's solution
live demo
#! /bin/sh
# build regex from text file
REGEX_FILE=some-patch.regex.diff
# test
# set these with "export key=val"
SOME_VAR_NAME=hello
ANOTHER_VAR_NAME=world
escape_b() {
echo "$1" | sed 's,/,\\/,g'
}
regex="$(
(echo; cat "$REGEX_FILE"; echo) \
| perl -p -0 -e '
s/\n#[^\n]*/\n/g;
s/\(\(SOME_VAR_NAME\)\)/'"$(escape_b "$SOME_VAR_NAME")"'/g;
s/\(\(ANOTHER_VAR_NAME\)\)/'"$(escape_b "$ANOTHER_VAR_NAME")"'/g;
s/([^\n])\//\1\\\//g;
s/\n-([^\n]+)\n\+([^\n]*)(?:\n\/([^\n]+))?\n/s\/\1\/\2\/\3;\n/g;
'
)"
echo "regex:"; echo "$regex" # debug
exec perl -00 -p -i -e "$regex" "$#"
prefixing lines with -+/ allows empty "plus" values, and protects leading whitespace from buggy text editors
sample input: some-patch.regex.diff
# file format is similar to diff/patch
# this is a comment
# replace all "a/a" with "b/b"
-a/a
+b/b
/g
-a1|a2
+b1|b2
/sg
# this is another comment
-(a1).*(a2)
+b\1b\2b
-a\na\na
+b
-a1-((SOME_VAR_NAME))-a2
+b1-((ANOTHER_VAR_NAME))-b2
sample output
s/a\/a/b\/b/g;
s/a1|a2/b1|b2/;;
s/(a1).*(a2)/b\1b\2b/;
s/a\na\na/b/;
s/a1-hello-a2/b1-world-b2/;
this regex format is compatible with sed and perl
since miku mentioned mysql replace:
replacing fixed strings with regex is non-trivial,
since you must escape all regex chars,
but you also must handle backslash escapes ...
naive escaper:
echo '\(\n' | perl -p -e 's/([.+*?()\[\]])/\\\1/g'
\\(\n

printing first word in every line of a txt file unix bash

So I'm trying to print the first word in each line of a txt file. The words are separated by one blank.
cut -c 1 txt file
Thats the code I have so far but it only prints the first character of each line.
Thanks
To print a whole word, you want -f 1, not -c 1. And since the default field delimiter is TAB rather than SPACE, you need to use the -d option.
cut -d' ' -f1 filename
To print the last two words not possible with cut, AFAIK, because it can only count from the beginning of the line. Use awk instead:
awk '{print $(NF-1), $NF;}' filename
you can try
awk '{print $1}' your_file
read word _ < file
echo "$word"
What's nice about this solution is it doesn't read beyond the first line of the file. Even awk, which has some very clean, terse syntax, has to be explicitly told to stop reading past the first line. read just reads one line at a time. Plus it's a bash builtin (and a builtin in many shells), so you don't need a new process to run.
If you want to print the first word in each line:
while read word _; do printf '%s\n' "$word"; done < file
But if the file is large then awk or cut will win out for reading every line.
You can use:
cut -d\ -f1 file
Where:
-d is the delimiter (here using \ for a space)
-f is the field selector
Notice that there is a space after the \.
-c is for characters, you want -f for fields, and -d to indicate your separator of space instead of the default tab:
cut -d " " -f 1 file

Counting commas in a line in bash

Sometimes I receive a CSV file which has a carriage return inside a cell. This is not an acceptable format to a program that will use it as input.
In order to detect if an input line is split, I determined that a bad line would not have the expected number of commas in it. Is there a bash or other common unix command line tool that would allow me to count the commas in the line? If necessary, I can write a Python or Perl program to do it, but if possible, I'd like to add a line or two to an existing bash script to cause it to fail if the comma count is wrong. Any ideas?
Strip everything but the commas, and then count number of characters left:
$ echo foo,bar,baz | tr -cd , | wc -c
2
To count the number of times a comma appears, you can use something like awk:
string=(line of input from CSV file)
echo "$string" | awk -F "," '{print NF-1}'
But this really isn't sufficient to determine whether a field has carriage returns in it. Fields can have commas inside as long as they're surrounded by quotes.
What worked for me better than the other solutions was this. If test.txt has:
foo,bar,baz
baz,foo,foobar,bar
Then cat test.txt | xargs -I % sh -c 'echo % | tr -cd , | wc -c' produces
2
3
This works very well for streaming sources, or tailing logs, etc.
In pure Bash:
while IFS=, read -ra array
do
echo "$((${#array[#]} - 1))"
done < inputfile
or
while read -r line
do
count=${line//[^,]}
echo "${#count}"
done < inputfile
Try Perl:
$ perl -ne 'print 0+#{[/,/g]},"\n"'
a
0
a,a
1
a,a,a,a,a
4
Depending on what you are trying to do with the CSV data, it may be helpful to use a wrapper script like csvquote to temporarily replace the problematic newlines (and commas) inside quoted fields, then restore them. For instance:
csvquote inputfile.csv | wc -l
and
csvquote inputfile.csv | cut -d, -f1 | csvquote -u
may be the sort of thing you're looking for. See [https://github.com/dbro/csvquote][1] for the code and more information
An example Python command you could run (since it's going to be installed on most modern shells) is:
python -c "import pathlib; print({l.count(',') for l in pathlib.Path('my_file.csv').read_text().splitlines()})"
This counts the number of commas per line, then makes a set from them (so if your lines all have the same number of commas in, you'll get a set with just that number in).
Just remove all of the carriage returns:
tr -d "\r" old_file > new_file

Grep penultimate line

Like the title says, how can I filter with grep (or similar bash tool) the line-before-the-last-line of a (variable length) file?
That is, show everything EXCEPT the penultimate line.
Thanks
You can use a combination of head and tail like this for example:
$ cat input
one
two
three
HIDDEN
four
$ head -n -2 input ; tail -n 1 input
one
two
three
four
From the coreutils head documentation:
‘-n k’
‘--lines=k’
Output the first k lines. However, if k starts with a ‘-’, print all but the last k lines of each file. Size multiplier suffixes are the same as with the -c option.
So the head -n -2 part strips all but the last two lines of its input.
This is unfortunately not portable. (POSIX does not allow negative values in the -n parameter.)
grep is the wrong tool for this. You can wing it with something like
# Get line count
count=$(wc -l <file)
# Subtract one
penultimate=$(expr $count - 1)
# Delete that line, i.e. print all other lines.
# This doesn't modify the file, just prints
# the requested lines to standard output.
sed "${penultimate}d" file
Bash has built-in arithmetic operators which are more elegant than expr; but expr is portable to other shells.
You could also do this in pure sed but I don't want to think about it. In Perl or awk, it would be easy to print the previous line and then at EOF print the final line.
Edit: I thought about sed after all.
sed -n '$!x;1!p' file
In more detail; unless we are at the last line ($), exchange the pattern space and the hold space (remember the current line; retrieve the previous line, if any). Then, unless this is the first line, print whatever is now in the pattern space (the previous line, except when we are on the last line).
awk oneliner: (test with seq 10):
kent$ seq 10|awk '{a[NR]=$0}END{for(i=1;i<=NR;i++)if(i!=NR-1)print a[i]}'
1
2
3
4
5
6
7
8
10
Using ed:
printf '%s\n' H '$-1d' wq | ed -s file # in-place file edit
printf '%s\n' H '$-1d' ',p' wq | ed -s file # write to stdout

Resources