How to quickly concatenate large numbers of strings - bash

Background: I am running unit tests, and one of them requires calling a PSQL function with a large number of URLs (i.e. 2000+). This is extremely slow, as shown in this Minimal Working Example (MWE).
MWE:
#!/bin/bash
# Generate random 4096 character alphanumeric
# string (upper and lowercase)
URL="http://www.$(cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w $((4096-15)) | head -n 1).com"
# Create a comma separated list of 2000 URLs
for i in $(seq 2000)
do
URLS="$URLS,$URL"
done
We call it and measure the run time like so:
$ time ./generate_urls.sh
real 1m30.681s
user 1m14.648s
sys 0m16.000s
Question: Is there a faster, more efficient way to achieve this same result?

Instead of concatenating over and over, just print them all and store the result.
URLS=$(
    for i in $(seq 2000); do
        printf '%s,' "$URL"
    done
)
echo "${URLS%,}" # Remove the final comma.
Takes less than 2 secs on my machine. Even when I move the URL generation inside the loop, it takes just about 8 secs.

If you always want 2000 URLs then this code is much faster than the code in the question:
# Create a comma separated list of 2000 (5*5*5*4*4) URLs
urls=$url,$url,$url,$url,$url # x 5
urls=$urls,$urls,$urls,$urls,$urls # x 5
urls=$urls,$urls,$urls,$urls,$urls # x 5
urls=$urls,$urls,$urls,$urls # x 4
urls=$urls,$urls,$urls,$urls # x 4
See Correct Bash and shell script variable capitalization for an explanation of why I changed the variable names to lowercase.
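For reference, here is how the whole script from the question might look with the repeated-concatenation trick in place. This is only a sketch: the URL generation is copied from the question, and the final echo of the string length is just a sanity check I added.
#!/bin/bash
# Generate the same random 4096 character URL as in the question.
url="http://www.$(cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w $((4096-15)) | head -n 1).com"
# Build 2000 copies with 5 assignments instead of 2000 appends
# (5*5*5*4*4 = 2000).
urls=$url,$url,$url,$url,$url       # 5 copies
urls=$urls,$urls,$urls,$urls,$urls  # 25 copies
urls=$urls,$urls,$urls,$urls,$urls  # 125 copies
urls=$urls,$urls,$urls,$urls        # 500 copies
urls=$urls,$urls,$urls,$urls        # 2000 copies
echo "${#urls}"  # sanity check: length of the final list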

Explanation to an assignment

I am NOT looking for an answer to this problem. I am having trouble understanding what I should be trying to accomplish in this assignment. I welcome pseudocode or hints if you would like, but what I really need is an explanation of what I need to be making and what the output should be/look like. Please don't write out a lot of code, though; I would like to try that on my own.
(()) = notes from me
The assignment is:
a program (prog.exe) ((we are given this program)) that reads 2 integers (m, n) and 1 double (a) from an input data file named input.in. For example, the given sample input.in file contains the values
5 7 1.23456789012345
When you run ./prog.exe, the output is a long column of floating-point numbers.
In addition to the program, there is a file called ain.in that contains a long column of double-precision values.
Copy prog.exe and ain.in to your working directory.
Write a bash script that does the following:
- Runs ./prog.exe for all combinations of
-- m = 0, 1, ..., 10
-- n = 0, 1, ..., 5
-- a = every value in the file ain.in
- This is essentially a triple nested loop over m, n and the ain.in values
- For each combination of m, n and ain.in value above:
-- generate the appropriate input file input.in
-- run the program and redirect the output to some temporary output file
-- extract the 37th and 51st values from this temporary output file and store these in a file called average.in
- When the 3 nested loops terminate, the average.in file should contain a long list of floating-point values
- Your script should return the average of the values contained in average.in
HINTS: seq, awk, and output redirection will be useful here
Thank you to whoever took the time to even read through this.
This is my second bash coding assignment and I'm still trying to get a grasp on it, so a better explanation would be very helpful. Thanks again!
This is one way of generating all input combinations without explicit loops:
join -j9 <(join -j9 <(seq 0 10) <(seq 0 5)) ain.in | cut -d' ' -f2-
The idea is to write a bash script that will test prog.exe with a variety of input conditions. This means recreating input.in and running prog.exe many times. Each time you run prog.exe, input.in should contain a different set of three numbers, e.g.,
First run:
0 0 <first line of ain.in>
Second run:
0 0 <second line of ain.in>
...
Last run:
10 5 <last line of ain.in>
You can use seq and for loops to accomplish this.
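A bare skeleton of that triple loop could look something like the following; the file names come from the assignment, and writing "m n a" on a single line into input.in is an assumption based on the sample shown above.
for m in $(seq 0 10); do
    for n in $(seq 0 5); do
        while read -r a; do
            echo "$m $n $a" > input.in   # regenerate the input file
            # ... run ./prog.exe and collect its output here ...
        done < ain.in
    done
done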
Then, you need to systematically save the output of each run, e.g.,
./prog.exe > tmp.out
# extract lines 37 and 51 and append them to average.in
sed -n '37p; 51p; 51q' tmp.out >> average.in
Finally, after testing all the combinations, use awk to compute the average of all the lines in average.in.
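That last step can be a single awk call, for example:
awk '{ sum += $1; n++ } END { if (n) print sum / n }' average.in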
One-liner inspired by @karakfa:
join -j9 <(join -j9 <(seq 0 10) <(seq 0 5)) ain.in | cut -d' ' -f2- |
sed "s/.*/echo & >input.in;./prog.exe>tmp.out; sed -n '37p;51p;51q' tmp.out/" |
sh | awk '{sum+=$1; n++} END {print sum/n}'

bash: split ascii file into n parts; iterate over ONLY those files

I have an ASCII file of a few thousand lines, processed one line at a time by a bash script. Because the processing is embarrassingly parallel, I'd like to split the file into parts of roughly the same size, preserving line breaks, one part per CPU core. Unfortunately the file suffixes made by split -n r/numberOfCores aren't easily iterated over.
split --numeric-suffixes=1 -n r/42 ... makes files foo.01, foo.02, ..., foo.42, which can be iterated over with for i in `seq -w 1 42` (because -w adds a leading zero). But if the 42 changes to something smaller than 10, the files still have the leading zero but the seq output doesn't, so it fails. This concern is valid because nowadays some PCs have fewer than 10 cores and some have more than 10. A ghastly workaround:
[[ $numOfCores < 10 ]] && optionForSeq="" || optionForSeq="-w"
The naive solution for f in foo.* is risky: the wildcard might match files other than the ones that split made.
An ugly way to make the suffixes seq-friendly, but with the same risk:
split -n r/numOfCores infile foo.
for i in `seq 1 $numOfCores`; do
mv `ls foo.* | head -1` newPrefix.$i
done
for i in `seq 1 $numOfCores`; do
... newPrefix.$i ...
done
Is there a cleaner, more robust way of splitting the file into n parts (where n, 1<=n<=64, isn't known until runtime) and then iterating over those parts? Should I split only into a freshly created directory?
(Edit: To clarify "if the 42 changes to something smaller than 10," the same code should work on a PC with 8 cores and on another PC with 42 cores.)
A seq-based solution is clunky. A wildcard-based solution is risky. Is there an alternative to split? (csplit with line numbers would be even clunkier.) A gawk one-liner?
How about using a format string with seq?
$ seq -f '%02g' 1 4
01
02
03
04
$ seq -f '%02g' 1 12
01
02
03
...
09
10
11
12
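That makes the loop independent of whether the core count has one digit or two. A possible sketch, assuming foo. as the split prefix and process_part as a stand-in for whatever you run on each chunk:
split -n r/"$numOfCores" --numeric-suffixes=1 infile foo.
for i in $(seq -f '%02g' 1 "$numOfCores"); do
    process_part "foo.$i" &
done
wait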
With GNU bash 4:
Use printf to format your numbers:
for ((i=1;i<=4;i++)); do printf -v num "%02d" $i; echo "$num"; done
Output:
01
02
03
04
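Applied to the question, you could build the list of part names once and then loop over it; again the foo. prefix is an assumption:
parts=()
for ((i=1; i<=numOfCores; i++)); do
    printf -v suffix "%02d" "$i"
    parts+=( "foo.$suffix" )
done
printf '%s\n' "${parts[@]}"   # or iterate over "${parts[@]}" directly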
Are you sure this is not a job for GNU Parallel?
cat file | parallel --pipe -N1 myscript_that_reads_one_line_from_stdin
This way you do not need to have the temporary files at all.
If your script can read more than one line (so it is in practice a UNIX filter), then this should be very close to optimal:
parallel --pipepart -k --roundrobin -a file myscript_that_reads_from_stdin
It will spawn one job per core and split the file into one part per core on the fly. If some lines are harder to process than others (i.e. you can get "stuck" for a while on a single line), then this solution might be better:
parallel --pipepart -k -a file myscript_that_reads_from_stdin
It will spawn one job per core and split the file into 10 parts per core on the fly, thus running on average 10 jobs per core in total.
GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.
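For example, a for loop over the split files and a rough GNU Parallel equivalent (process is a placeholder for your own command):
# serial:
for f in foo.*; do process "$f"; done
# parallel, one job per CPU core by default:
parallel process ::: foo.*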
If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU; the problem is that some jobs take longer than others, so some CPUs end up sitting idle. GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time.
Installation
If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README
Learn more
See more examples: http://www.gnu.org/software/parallel/man.html
Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html
Sign up for the email list to get support: https://lists.gnu.org/mailman/listinfo/parallel
Get the filenames with ls and then use a regex:
for n in $(ls foo.* | grep "^foo\.[0-9][0-9]*$"); do
    # ... process "$n" here ...
done
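Alternatively, following the question's own idea of splitting into a freshly created directory, a plain wildcard becomes safe because nothing else can match there. A sketch, with process_part again standing in for the real work:
dir=$(mktemp -d) || exit 1
split -n r/"$numOfCores" infile "$dir/part."
for f in "$dir"/part.*; do
    process_part "$f" &
done
wait
rm -r "$dir"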

Given the number of tokens, how to find corresponding closest number of lines within a file?

Given a huge file which contains text (one sentence per line), the task is to extract N tokens (for example, 100 million tokens out of 3 billion). Since I cannot break a sentence into parts, I need to find the closest number of lines that contains the given number of tokens.
I tried following code:
perl -p -e 's/\n/ #/g' huge_file | cut -d' ' -f1-100000000 | grep -o ' #' | wc -w
which replaces the newline symbol with ' #' (basically joining all sentences into a single line) and counts the number of ' #' symbols, which should correspond to the number of sentences (huge_file doesn't contain the '#' symbol). However, grep cannot process such a large line and gives a 'grep: memory exhausted' error. Is there any other efficient way of accomplishing this task that would also work for very large files?
I had a bit of a hard time understanding what you're asking. But I think you're tackling it very badly. Running perl as a super-sed, then cut, then grep then wc is horribly inefficient.
If I understand correctly, you want as many lines as it takes to get at least 100M words.
Why not instead:
#!/usr/bin/env perl
use strict;
use warnings;

my $wordcount = 0;
# use the 'magic' filehandle - reads piped input or a file named on the
# command line ('myscript.pl somefilename') - just like sed/grep
while (<>) {
    # split on whitespace and count the number of fields - i.e. words.
    $wordcount += scalar split;
    # chomp; here if you don't want the line feed
    # print the current line
    print;
    # bail out once our word count is above a certain number.
    last if $wordcount >= 100_000_000;
    # NB: $. is the line number, if you wanted to stop after a certain
    # number of lines instead.
}
# we already printed the content with line feeds intact;
# this prints the precise count of words we've printed.
print $wordcount, " words printed\n";
This will iterate your file, and as soon as you've seen 100M words, it'll bail out - meaning you no longer have to read the whole file, nor do you need to invoke a daisy chain of commands.
This would be a one-liner if you're really insistent:
perl -p -e '$wordcount += scalar split; last if $wordcount > 100_000_000;'
Again - I couldn't quite tell what the significance of the line feeds and # symbols was, so I haven't done anything with them. But s/\n/ #/ works fine in the above code block, as does chomp; to remove trailing line feeds if that's what you're after.
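If you would rather not use Perl at all, the same streaming, bail-out-early idea fits in an awk one-liner; this sketch assumes whitespace-separated tokens and 100M as the target:
awk '{ print; count += NF; if (count >= 100000000) exit }' huge_file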

Reverse lines in a file two by two

I'm trying to reverse the lines in a file, but I want to do it two lines at a time.
For the following input:
1
2
3
4
…
97
98
I would like the following output:
97
98
…
3
4
1
2
I found lots of ways to reverse a file line by line (especially on this topic: How can I reverse the order of lines in a file?).
tac. The simplest. Doesn't seem to have an option for what I want, even though I tried to play around with the -r and -s options.
tail -r. Not POSIX compliant, and my version doesn't seem to be able to do it anyway.
That leaves three sed formulas, and I think a little modification would do the trick. But I don't even understand what they're doing, so I'm stuck here.
sed '1!G;h;$!d'
sed -n '1!G;h;$p'
sed 'x;1!H;$!d;x'
Any help would be appreciated. I'll try to understand these formulas and to answer this question by myself.
Okay, I'll bite. In pure sed, we'll have to build the complete output in the hold buffer before printing it (because the lines we want to print first are the ones we see last). A basic template can look like this:
sed 'N;G;h;$!d' filename # Incomplete!
That is:
N # fetch another line, append it to the one we already have in the pattern
# space
G # append the hold buffer to the pattern space.
h # save the result of that to the hold buffer
$!d # and unless the end of the input was reached, start over with the next
# line.
The hold buffer always contains the reversed version of the input processed so far, and the code takes two lines and glues them to the top of that. In the end, it is printed.
This has two problems:
If the number of input lines is odd, it prints only the last line of the file, and
we get a superfluous empty line at the end of the output.
The first is because N bails out if no more lines exist in the input, which happens with an odd number of input lines; we can solve the problem by executing it conditionally, only when the end of the input has not yet been reached. Just like the $!d above, this is done with $!N, where $ is the end-of-input condition and ! inverts it.
The second is because at the very beginning, the hold buffer contains an empty line that G appends to the pattern space when the code is run for the very first time. Since with $!N we don't know whether at that point the line counter is 1 or 2, we should inhibit it conditionally on both. This can be done with 1,2!G, where 1,2 is a range spanning from line 1 to line 2, so that 1,2!G will run G only if the line counter is not between 1 and 2.
The whole script then becomes
sed '$!N;1,2!G;h;$!d' filename
Another approach is to combine sed with tac, such as
tac filename | sed -r 'N; s/(.*)\n(.*)/\2\n\1/' # requires GNU sed
That is not the shortest possible way to use sed here (you could also use tac filename | sed -n 'h;$!{n;G;};p'), but perhaps easier to understand: Every time a new line is processed, N fetches another line, and the s command swaps them. Because tac feeds us the lines in reverse, this restores pairs of lines to their original order.
The key difference to the first approach is the behavior for an odd number of lines: with the second approach, the first line of the file will be alone without a partner, whereas with the first it'll be the last.
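For example, on a five-line input the two variants should behave roughly like this (with GNU sed; the output was traced by hand, so treat it as illustrative):
$ printf '%s\n' 1 2 3 4 5 | sed '$!N;1,2!G;h;$!d'
5
3
4
1
2
$ printf '%s\n' 1 2 3 4 5 | tac | sed -r 'N; s/(.*)\n(.*)/\2\n\1/'
4
5
2
3
1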
I would go with this:
tac file | while read -r a && read -r b; do echo "$b"; echo "$a"; done
Here is an awk you can use:
cat file
1
2
3
4
5
6
7
8
awk '{a[NR]=$0} END {for (i=NR;i>=1;i-=2) print a[i-1]"\n"a[i]}' file
7
8
5
6
3
4
1
2
It stores all lines in an array a, then prints them out in reverse, two by two.

using xargs as an argument for cut

Say I have a file a.txt containing a word, followed by a number, followed by a newline, on each line:
and 3
now 2
for 2
something 7
completely 8
different 6
I need to select the nth char from every word (specified by the number next to the word)
cat a.txt | cut -d' ' -f2 | xargs -i -n1 cut a.txt -c {}
I tried this command, which selects the numbers and uses xargs to put them into the -c option of cut, but the cut command gets executed on every line instead of looping over a.txt (which is what I had expected to happen). How can I resolve this problem?
EDIT: Since it seems to be unclear, I want to select a character from a word. The position of the character I need to select is given next to the word. For example:
and 3 will give me d. I want to do this for the entire file, which will then form a word :)
A pure shell solution:
$ while read word num; do echo ${word:$((num-1)):1}; done < a.txt
d
o
o
i
e
r
This is using a classic while ...; do ...; done shell loop and the read builtin. The general format is
while read variable1 variable2 ... variableN; do something; done < input_file
This will iterate over each line of your input file, splitting it into as many variables as you've given. By default, it will split at whitespace, but you can change that by changing the $IFS variable. If you give a single variable, the entire line will be saved; if you give more, it will populate as many variables as you give it and save the rest in the last one.
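For instance, with more fields than variables, the extra words end up in the last variable:
$ echo "one two three four" | while read first rest; do echo "first=[$first] rest=[$rest]"; done
first=[one] rest=[two three four]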
In this particular loop, we're reading the word into $word and the number into $num. Once we have the word, we can use the shell's string manipulation capabilities to extract a substring. The general format is
${string:start:length}
So, ${string:0:2} would extract the first two characters from the variable $string. Here, the variable is $word, the start is the number minus one (this starts counting at 0) and the length is one. The result is the single letter at the position given by the number.
I would suggest that you use awk:
awk '{print substr($1,$2,1)}' file
substr takes a substring of the first field starting from the number contained in the second field and of length 1.
Testing it out (using the original input from your question):
$ cat file
and 3
now 2
for 2
something 7
completely 8
different 6
$ awk '{print substr($1,$2,1)}' file
d
o
o
i
e
r
