Can I use sed to generate UUIDs inline instead of echo? - bash

This bash snippet works to add a UUID and a tab character (\t) at the start of each line of a file.
while read; do
  echo "$(uuidgen | tr 'A-Z' 'a-z')\t$REPLY"
done < sourcefile.tsv > temp_destination.tsv
(Note: the pipe to tr converts the UUIDs to lowercase, since the macOS version of uuidgen outputs them in uppercase.)
Although that performs well for smaller files, it doesn't seem efficient.
sed -i '' "s/^/$(uuidgen | tr 'A-Z' 'a-z')\t/" sourcefile.tsv
Again, this is macOS (BSD) sed, so the '' after the -i flag is required since I don't want a backup file.
I think sed would perform better, but it seems the UUID generation still has to happen in some sort of loop.
I'm just looking to make this faster and/or more efficient. It works, but it's pretty slow on a 20,000-line file, and all my other attempts have stumped me.
EDIT: I tested my bash script just outputting UUIDs in a while loop, without any of the other subprocesses. With my configuration I can generate about 250-300 per second, so updating a 20,000-line file will take a minimum of about 72 seconds just because of the weak link of UUID generation. As described below, using Perl or Python will likely be faster.
EDIT 2: This little Python script kills the bash script. The snippet only does part of what I need, but just for comparison: it generated about 200,000 UUIDs per second, or 1,000,000 in 5 seconds, compared to the 250-300 per second in the bash subprocess. Wow, what a difference.
#!/usr/bin/env python3
# this generates 1,000,000 UUIDs in about 5 seconds
import uuid
import sys

sys.stdout = open('lots-of-uuid.txt', 'w')
i = 1
while i <= 1000000:
    print(uuid.uuid4())
    i += 1
sys.stdout.close()
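For completeness, here is a hedged sketch (not from the thread) of how that Python speed could be applied to the original task: generate one UUID per input line and let paste glue them on with a tab. uuid.uuid4() already prints lowercase, so no tr is needed; the file names are the ones from the question.

# Sketch: one lowercase UUID per line of sourcefile.tsv, pasted on with a tab.
# Note the file is read twice: once via stdin to get a UUID per line, once by paste.
python3 -c 'import sys, uuid
for _ in sys.stdin:
    print(uuid.uuid4())' < sourcefile.tsv |
paste - sourcefile.tsv > temp_destination.tsv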

Did you try something like this:
{
  uuidgen | tr 'A-Z' 'a-z'
  printf '\t'
  cat 'sourcefile.tsv'
} > temp_destination.tsv
You may think it is not much different from your "read" version, but it is:
You don't capture the result of uuidgen
cat will probably perform faster than read + $REPLY

Try this out:
while read; do printf "%s\t%s\n" $(uuidgen) "$REPLY"; done < input.tsv > output.tsv
No monkeying around with building strings.
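A hedged variant of the same one-liner (not from the original answer) that quotes the command substitution, adds -r so read doesn't mangle backslashes, and restores the lowercase conversion from the question:

while read -r; do
  printf '%s\t%s\n' "$(uuidgen | tr 'A-Z' 'a-z')" "$REPLY"
done < input.tsv > output.tsv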

Using sed
$ sed -i '' 's/.*/printf &#;\Luuidgen/e;s/\([^#]*\)#\(.*\)/\2\t\1/' sourcefile.tsv

This might work for you (GNU sed):
sed -i 'h;s/.*/uuidgen/e;s/.*/\L&/;G;s/\n/\t/' file
Make a copy of the current line.
Replace the current line by an evaluated uuidgen command and convert the result to lowercase.
Append the copy and replace the newline by a tab.
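Since the question is about macOS: the e flag and \L are GNU extensions, so this presumably needs GNU sed installed there (commonly available as gsed, e.g. from Homebrew's gnu-sed package). A sketch using the question's file name:

gsed -i 'h;s/.*/uuidgen/e;s/.*/\L&/;G;s/\n/\t/' sourcefile.tsv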

Related

Sed through files without using for loop?

I have a small script which basically generates a menu of all the scripts in my ~/scripts folder and, next to each of them, displays a sentence describing it, that sentence being the commented-out third line within the script. I then plan to pipe this into fzf or dmenu to select one and start editing it or whatever.
1 #!/bin/bash
2
3 # a script to do
So it would look something like this
foo.sh a script to do X
bar.sh a script to do Y
Currently I have it run a for loop over all the files in the scripts folder and then run sed -n 3p on all of them.
for i in $(ls -1 ~/scripts); do
  echo -n "$i"
  sed -n 3p "~/scripts/$i"
  echo
done | column -t -s '#' | ...
I was wondering if there is a more efficient way of doing this that does not involve a for loop and uses only sed. Any help will be appreciated. Thanks!
Instead of a loop over parsed ls output plus a sed call per file, you may try this awk command:
awk 'FNR == 3 {
f = FILENAME; sub(/^.*\//, "", f); print f, $0; nextfile
}' ~/scripts/* | column -t -s '#' | ...
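The same awk command spread out with comments, in case the one-liner is hard to follow (behaviour should be identical):

awk 'FNR == 3 {                # only the 3rd line of each file
  f = FILENAME
  sub(/^.*\//, "", f)          # strip the leading directory from the path
  print f, $0                  # file name, then the commented description line
  nextfile                     # jump straight to the next file
}' ~/scripts/* | column -t -s '#'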
Yes, there is a more efficient way, but no, it doesn't only use sed. This is probably a silly optimization for your use case, but it may be worthwhile nonetheless.
The inefficiency is that you're using ls to read the directory and then parsing its output. For large directories, that causes a lot of overhead for keeping that list in memory even though you only traverse it once. Also, it's not done correctly; consider filenames with special characters that the shell interprets.
The more efficient way is to use find in combination with its -exec option, which starts a second program with each found file in turn.
BTW: If you didn't rely on line numbers but maybe a tag to mark the description, you could also use grep -r, which avoids an additional process per file altogether.
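A sketch of the find + -exec idea described above (not from the original answer; it assumes the same ~/scripts layout and the '#' column trick from the question):

find ~/scripts -maxdepth 1 -type f -exec sh -c '
  for f; do
    printf "%s" "${f##*/}"     # basename, no newline
    sed -n 3p "$f"             # 3rd line, e.g. "# a script to do X"
  done' sh {} + | column -t -s '#'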
This might work for you (GNU sed):
sed -sn '1h;3{H;g;s/\n/ /p}' ~/scripts/*
Use the -s option to reset the line number addresses for each file.
Copy line 1 to the hold space.
Append line 3 to the hold space.
Swap the hold space for the pattern space.
Replace the newline with a space and print the result.
All files in the directory ~/scripts will be processed.
N.B. You may wish to replace the space delimiter by a tab or pipe the results to the column command.
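Following the N.B., a sketch that uses a tab as the delimiter and pipes into column (GNU sed's \t in the replacement is assumed):

sed -sn '1h;3{H;g;s/\n/\t/p}' ~/scripts/* | column -t -s $'\t'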

number and string manipulation in bash

I wrote a little bash script (my first) that does the following:
sed -e 's/Alpha=0/Alpha=x/' -e 's/Beta=0/Beta=y/' <file.pov >tmpfile.pov
povray Width=480 Height=360 +Itmpfile.pov +Ofile_x_y.png
everything works as intended, but now I would like to pack these two lines in a loop for x=0:30:180 and y=0:30:90 (edit: I mean all possible combinations of x in {0,30,60,90,120,150,180} and y in {0,30,60,90}).
So for example for x=60 and y=30 the code should behave like this:
sed -e 's/Alpha=0/Alpha=60/' -e 's/Beta=0/Beta=30/' <file.pov >tmpfile.pov
povray Width=480 Height=360 +Itmpfile.pov +Ofile_60_30.png
I am aware it should not be too hard, but for some reason I just couldn't work it out by myself.
Sorry to bother you with my newbie questions!
Use seq with for loops:
for x in `seq 0 30 180`; do
  for y in `seq 0 30 90`; do
    sed -e 's/Alpha=0/Alpha='$x'/' -e 's/Beta=0/Beta='$y'/' <file.pov >tmpfile.pov
    fn="file_${x}_${y}.png"
    povray Width=480 Height=360 +Itmpfile.pov +O${fn}
  done
done
seq is a great tool even if you don't need the numbers:
for i in `seq 10`; do
  call_me_ten_times
done
Here is a tricky bash hack, but it helps to reduce the two nested for loops to a single while loop:
while read x y; do
  sed -e 's/Alpha=0/Alpha='"$x"'/' -e 's/Beta=0/Beta='"$y"'/' <file.pov >"tmpfile${x}_$y.pov"
done < <(echo {0..180..30}' '{0..90..30} | tr ' ' '\n' | paste - -)
The process substitution at the end generates all combinations of the ranges you specified.
One remaining problem is the output file name: your original code overwrites tmpfile.pov every time, so I think it is better to make the file name change as you go, as done above.
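If the povray call is folded into the same loop, a sketch might look like this (untested; file names loosely follow the question):

while read -r x y; do
  sed -e 's/Alpha=0/Alpha='"$x"'/' -e 's/Beta=0/Beta='"$y"'/' <file.pov >"tmpfile_${x}_${y}.pov"
  povray Width=480 Height=360 +I"tmpfile_${x}_${y}.pov" +O"file_${x}_${y}.png"
done < <(echo {0..180..30}' '{0..90..30} | tr ' ' '\n' | paste - -)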

Doubts about bash script efficiency

I have to accomplish a relatively simple task. Basically I have an enormous number of files with the following format:
"2014-01-27","07:20:38","data","data","data"
Basically I would like to extract the first 2 fields, convert them into a unix epoch date, add 6 hours to it (due to a timezone difference), and replace the first 2 original columns with the resulting milliseconds (unix epoch since 1970-01-01, converted to milliseconds).
I have written a script that works fine, but it is very, very slow. I need to run this over 150 files with a total line count of more than 5,000,000, and I was wondering if you had any advice about how I could make it faster. Here it is:
#!/bin/bash
function format()
{
  while read line; do
    entire_date=$(echo ${line} | cut -d"," -f1-2);
    trimmed_date=$(echo ${entire_date} | sed 's/"//g;s/,/ /g');
    seconds=$(date -d "${trimmed_date} + 6 hours" +%s);
    millis=$((${seconds} * 1000));
    echo ${line} | sed "s/$entire_date/\"$millis\"/g" >> "output"
  done < $*
}
format $*
You are spawning a significant number of processes for each input line. At a quick glance, probably half of those could easily be factored away, but I would definitely recommend a switch to Perl or Python instead.
perl -MDate::Parse -pe 'die "$0:$ARGV:$.: Unexpected input $_"
unless s/(?<=^")([^"]+)","([^"]+)(?=")/ (str2time("$1 $2")+6*3600)*1000 /e'
I'd like to recommend Text::CSV but I do not have it installed here, and if you have requirements to not touch the fields after the second at all, it might not be what you need anyway. This is quick and dirty but probably also much simpler than a "proper" CSV solution.
The real meat is the str2time function from Date::Parse, which I imagine will be a lot quicker than repeatedly calling date (ISTR it does some memoization internally so it can do nearby dates quickly). The regex replaces the first two fields with the output; note the /e flag which allows Perl code to be evaluated in the replacement part. The (?<=^") and (?=") zero-width assertions require these matches to be present but do not include them in the substitution operation. (I originally substituted the enclosing double quotes, but with this change, they are retained, as apparently you want to keep them.)
Change the die to a warn if you want the script to continue in spite of errors (maybe redirect standard error to a file then!)
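Since the question mentions roughly 150 input files, note that the one-liner can be pointed at all of them in a single perl invocation; a sketch (the glob and output file name are assumptions):

perl -MDate::Parse -pe 'die "$0:$ARGV:$.: Unexpected input $_"
  unless s/(?<=^")([^"]+)","([^"]+)(?=")/ (str2time("$1 $2")+6*3600)*1000 /e' input_*.csv > output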
I have tried to avoid external commands (except date) to save time. Tests show that it is 4 times faster than your code. (Okay, tripleee's perl solution is 40 times faster than mine!)
#! /bin/bash
function format()
{
  while IFS=, read date0 date1 datas; do
    date0="${date0//\"/}"
    date1="${date1//\"/}"
    seconds=$(date -d "$date0 $date1 + 6 hours" +%s)
    echo "\"${seconds}000\",$datas"
  done
}
output="output.txt"
# Process each file in argument
for file ; do
  format < "$file"
done >| "$output"
exit 0
Using awk's built-in mktime function (tested), it is faster than perl.
awk '{t=$2 " " $4;gsub(/[-:]/," ",t);printf "\"%s\",%s\n",(mktime(t)+6*3600)*1000,substr($0,25)}' FS=\" OFS=\" file
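For readers who want to pick the one-liner apart, here it is spread over several lines with comments (behaviour should be identical):

awk '{
  t = $2 " " $4                  # with FS="\"", $2 is the date and $4 is the time
  gsub(/[-:]/, " ", t)           # "2014-01-27 07:20:38" -> "2014 01 27 07 20 38" for mktime
  printf "\"%s\",%s\n", (mktime(t) + 6*3600) * 1000, substr($0, 25)
                                 # epoch milliseconds in quotes, then the line from the 3rd field on
}' FS=\" OFS=\" file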
Here is the test result.
$ wc -l file
1244 file
$ time awk '{t=$2 " " $4;gsub(/[-:]/," ",t);printf "\"%s\",%s\n",(mktime(t)+6*3600)*1000,substr($0,25)}' FS=\" OFS=\" file > /dev/null
real 0m0.172s
user 0m0.140s
sys 0m0.046s
$ time perl -MDate::Parse -pe 'die "$0:$ARGV:$.: Unexpected input $_"
unless s/(?<=^")([^"]+)","([^"]+)(?=")/ (str2time("$1 $2")+6*3600)*1000 /e' file > /dev/null
real 0m0.328s
user 0m0.218s
sys 0m0.124s

Read the n-th line of multiple files into a single output

I have some dump files called dump_mydump_0.cfg, dump_mydump_250.cfg, ..., all the way up to dump_mydump_40000.cfg. From each dump file, I'd like to take the 16th line and put them all into one single file.
I'm using sed, but I came across some syntax errors. Here's what I have so far:
for lineNo in 16 ;
for fileNo in 0,40000 ; do
sed -n "${lineNo}{p;q;}" dump_mydump_file${lineNo}.cfg >> data.txt
done
Considering your files are named with intervals of 250, you should get it working using:
for lineNo in 16; do
  for fileNo in {0..40000..250}; do
    sed -n "${lineNo}{p;q;}" dump_mydump_file${fileNo}.cfg >> data.txt
  done
done
Note both the bash syntax corrections (do, done, and {0..40000..250}) and the input file name, which should depend on ${fileNo} instead of ${lineNo}.
Alternatively, with (GNU) awk:
awk "FNR==16{print;nextfile}" dump_mydump_{0..40000..250}.cfg > data.txt
(I used the filenames as shown in the OP as opposed to the ones which would have been generated by the bash for loop, if corrected to work. But you can edit as needed.)
The advantage is that you don't need the for loop, and you don't need to spawn 160 processes. But it's not a huge advantage.
This might work for you (GNU sed):
sed -ns '16wdata.txt' dump_mydump_{0..40000..250}.cfg
The -s option treats the files as separate streams, so the address 16 matches the 16th line of every file, and the w command writes each matched line to data.txt.

How to print the number of characters in each line of a text file

I would like to print the number of characters in each line of a text file using a unix command. I know it is simple with PowerShell:
gc abc.txt | % {$_.length}
but I need a unix command.
Use Awk.
awk '{ print length }' abc.txt
while IFS= read -r line; do echo ${#line}; done < abc.txt
It is POSIX, so it should work everywhere.
Edit: Added -r as suggested by William.
Edit: Beware of Unicode handling. Bash and zsh, with a correctly set locale, will show the number of codepoints, but dash will show bytes, so you have to check what your shell does. And then there are many other possible definitions of length in Unicode anyway, so it depends on what you actually want.
Edit: Prefix with IFS= to avoid losing leading and trailing spaces.
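To illustrate the codepoints-versus-bytes caveat (a sketch; it assumes an en_US.UTF-8 locale is available):

printf 'héllo\n' > abc.txt     # é is 2 bytes in UTF-8: 5 codepoints, 6 bytes
LC_ALL=en_US.UTF-8 bash -c 'while IFS= read -r line; do echo ${#line}; done < abc.txt'   # prints 5
LC_ALL=C bash -c 'while IFS= read -r line; do echo ${#line}; done < abc.txt'             # prints 6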
Here is an example using xargs:
$ xargs -d '\n' -I% sh -c 'echo % | wc -c' < file
I've tried the other answers listed above, but they are very far from decent solutions when dealing with large files -- especially once a single line's size occupies more than ~1/4 of available RAM.
Both bash and awk slurp the entire line, even though for this problem it's not needed. Bash will error out once a line is too long, even if you have enough memory.
I've implemented an extremely simple, fairly unoptimized python script that, when tested with large files (~4 GB per line), doesn't slurp, and it is by far a better solution than those given.
If this is time critical code for production, you can rewrite the ideas in C or perform better optimizations on the read call (instead of only reading a single byte at a time), after testing that this is indeed a bottleneck.
Code assumes newline is a linefeed character, which is a good assumption for Unix, but YMMV on Mac OS/Windows. Be sure the file ends with a linefeed to ensure the last line character count isn't overlooked.
from sys import stdin, exit

counter = 0
while True:
    byte = stdin.buffer.read(1)
    counter += 1
    if not byte:
        exit()
    if byte == b'\x0a':
        print(counter - 1)
        counter = 0
Try this:
while read -r line
do
  echo -n "$line" | wc -m
done < abc.txt
