I created a command that works, but not exactly as I want, so I would like to modify it to produce the right output.
My command:
awk '{print $1}' | sort | uniq -c | sort -nr
Output of my command:
Input file:
Output I need (hashtags instead of numbers): (8): ######## (2): ##

cat | sort | uniq -c | awk 'ht="#"{for(i=1;i<$1;i++){ht=ht"#"} str=sprintf("%s (%d): %s", $2,$1, ht); print str}'
expecting file with content like:

Using xargs with sh and printf. Comments in between the lines. Live version at tutorialspoint.
# sorry cat
cat <<EOF |
# for each 2 arguments
xargs -n2 sh -c '
# format the output as "$2 ($1): "
printf "%s (%s): " "$2" "$1"
# repeat the character `#` $1 times
seq "$1" | xargs printf "#%.0s"
# lastly a newline
printf "\n"
' --
I think we could shorten that a bit with:
xargs -n2 sh -c 'printf "%s (%s): %s\n" "$2" "$1" $(printf "#%.0s" $(seq $1))' --
or maybe just echo, if the input is sufficiently safe:
xargs -n2 sh -c 'echo "$2 ($1): $(printf "#%.0s" $(seq $1))"' --
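The printf "#%.0s" trick works because printf re-applies its format once per argument, and %.0s prints zero characters of each one. A minimal sketch of that idea as a helper (the name repeat_hash is made up for illustration); it also guards the count 0, where seq prints nothing and printf would otherwise still emit the format once:

```shell
# repeat_hash N: print "#" N times (hypothetical helper, not from the answer above).
repeat_hash() {
    # With no arguments printf would still print the format once, so stop early for 0.
    [ "$1" -gt 0 ] || return 0
    # seq emits 1..N; each number consumes one "%.0s" and contributes nothing but the "#".
    printf '#%.0s' $(seq "$1")
}

repeat_hash 8
printf '\n'
```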

You can extend your pipeline with another awk at the end, or you can use a single awk for the whole thing (note that for (i in a) visits the keys in an unspecified order):
awk '{a[$1]++}
END { for(i in a) {
printf "%s (%d): ", i, a[i]
for(j=0;j<a[i];++j) printf "#"; printf "\n"
}
}' file
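If the descending order of the original pipeline should be kept, one option is to leave the counting to sort and uniq and let a final awk only draw the bars. This is a sketch under that assumption; hash_histogram is a hypothetical name, and it expects names without whitespace in the first column:

```shell
# hash_histogram: read lines on stdin, count the first field,
# print "name (count): ###..." sorted by count, highest first.
hash_histogram() {
    awk '{ print $1 }' | sort | uniq -c | sort -nr |
    awk '{
        bar = ""
        for (i = 0; i < $1; i++) bar = bar "#"   # one "#" per occurrence
        printf "%s (%d): %s\n", $2, $1, bar
    }'
}

printf '%s\n' 'a x' 'b y' 'a z' 'a w' | hash_histogram
```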


Passing Arguments to GNU parallel

I'm trying to use awk and GNU parallel to filter files based on the values in column 1 and column 2 and dump the result into a single .csv.gz file. Thanks to the answer here, I managed to write a script that does the job in parallel.
doit() {
    pigz -dc "$1" | awk -F, '$1>0.5 && $2<1.5'
}
export -f doit
find "$1" -name '*.csv.gz' | parallel doit | pigz > output.csv.gz
and then run the script in the terminal.
./ /path/to/files
I'm wondering how I can pass 0.5 and 1.5 as arguments to the script:
./ /path/to/files 0.5 1.5
This may be an easier, or more explicit, way of passing variables and parameters around:
# Pick up second and third parameters, defaulting to 0.5 and 1.5 if unspecified
dir=$1
a=${2:-0.5}
b=${3:-1.5}
doit() {
    file=$1; a=$2; b=$3
    echo "File: $file, a=$a, b=$b"
    cat "$1" | awk -F, -v a="$a" -v b="$b" '$1>a && $2<b'
}
export -f doit
find "$dir" -name '*.tst' | parallel doit {} "$a" "$b"
Alternatively, splice the shell arguments directly into the awk program:
doit() {
    # $1 $2 $3 are arguments to doit
    # '$1' and '$2' are variables in awk
    pigz -dc $1 | awk -F, '$1>'$2' && $2<'$3
}
export -f doit
find $1 -name '*.csv.gz' | parallel doit {} $2 $3 | pigz > output.csv.gz
Call as:
paste <(seq 10 | shuf) <(seq 10 | shuf) | gzip > h.csv.gz
./ . 5 6
zcat output.csv.gz
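The threshold handling can be checked on its own, without parallel or pigz, before wiring it into the script. A minimal sketch, assuming the same comma-separated layout; filter_rows is a hypothetical name, and the defaults mirror the 0.5 and 1.5 from the question:

```shell
# filter_rows LO HI: keep CSV rows where column 1 > LO and column 2 < HI.
filter_rows() {
    lo=${1:-0.5}   # default lower bound if no first argument is given
    hi=${2:-1.5}   # default upper bound if no second argument is given
    # -v hands the shell values to awk as variables, so no quote splicing is needed.
    awk -F, -v lo="$lo" -v hi="$hi" '$1 > lo && $2 < hi'
}

printf '0.2,1.0\n0.7,1.2\n0.9,2.0\n' | filter_rows 0.5 1.5
```

Only the middle row passes both conditions here. Passing values with -v also avoids the case where an argument containing spaces or quotes breaks the spliced awk program.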

Convert substring through command

Basically, how do I make a string substitution in which the substituted string is transformed by an external command?
For example, given the line 5aaecdab287c90c50da70455de03fd1e ./2015/01/26/GOPR0083.MP4, how to pipe the second part of the line (./2015/01/26/GOPR0083.MP4) to command xargs stat -c %.6Y and then replace it with the result so that we end up with 5aaecdab287c90c50da70455de03fd1e 1422296624.010000?
This can be done with a script, however a one-liner would be nice.
hashtime() {
    while read longhex fname; do
        echo "$longhex $(stat -c %.6Y "$fname")"
    done
}
if [ $# -ne 1 ]; then
    echo Usage: ${0##*/} infile 1>&2
    exit 1
fi
hashtime < $1
exit 0
# one liner
awk 'BEGIN { args="stat -c %.6Y " } { printf "%s ", $1; cmd=args $2; system(cmd); }' infile
A one-liner using GNU sed, which will process the whole file:
sed -E "s/([[:xdigit:]]+) +(.*)/stat -c '\1 %.6Y' '\2'/e" file
or, using plain bash
while read -r hash pathname; do stat -c "$hash %.6Y" "$pathname"; done < file
It's typical to use awk, sed, or cut to reformat input. For example:
line="5aaecdab287c90c50da70455de03fd1e ./2015/01/26/GOPR0083.MP4"
echo "$line" |
cut -d' ' -f2- |
xargs stat -c %.6Y
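For input where some paths contain spaces or no longer exist, a loop that checks the exit status of stat may be more forgiving. This is only a sketch, assuming GNU stat as in the answers above; add_mtime is a hypothetical name, and keeping the original line when stat fails is my own choice, not something the question asked for:

```shell
# add_mtime: read "HASH PATH" lines on stdin and print "HASH MTIME".
add_mtime() {
    # read puts everything after the first word into path, so spaces in paths survive.
    while read -r hash path; do
        if mtime=$(stat -c %.6Y -- "$path" 2>/dev/null); then
            printf '%s %s\n' "$hash" "$mtime"
        else
            # stat failed (e.g. missing file): pass the line through unchanged.
            printf '%s %s\n' "$hash" "$path"
        fi
    done
}
```

Call it as add_mtime < infile.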

In bash printf format, how to use the same value in multiple placeholders

I want to put the number of lines of a file in two placeholders of a printf string:
"%s lines: %s\n"
 ^         ^
 |here and |here
So I get it with two calls to wc -l:
$ do_stuff() {
    printf "%s ## lines: %5s\n" \
        `cat $1 | wc -l` \
        `cat $1 | wc -l`;
}
$ do_stuff ./lpm/
426 ## lines: 426
it works!
Is there a way, like in Python, to give the value to the string only once?
In [1]: '{0} line {0}'.format(426)
Out[1]: '426 line 426'
Using brace expansion:
printf '%s lines: %s\n' "$(wc -l <"$1")"{,}
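As far as I know, brace expansion duplicates the word before the command substitution runs, so wc is still executed twice with that form. If the file should be read only once, storing the count in a variable is the plain alternative. A minimal sketch, reusing the do_stuff name from the question; the arithmetic expansion is there only to strip the padding that some wc implementations add:

```shell
# do_stuff FILE: count the lines once and use the value in both placeholders.
do_stuff() {
    # $(( ... )) normalizes the number (drops any leading whitespace from wc).
    n=$(( $(wc -l < "$1") ))
    printf '%s ## lines: %5s\n' "$n" "$n"
}
```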

Splitting out a large file

I would like to process a 200 GB file with lines like the following:
{"captureTime": "1534303617.738","ua": "..."}
The objective is to split this file into multiple files grouped by hours.
Here is my basic script:
echo "Splitting files"
echo "Total lines"
sed -n '$=' $1
echo "First Date"
head -n1 $1 | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'
echo "Last Date"
tail -n1 $1 | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'
while read p; do
date=$(echo "$p" | sed 's/{"captureTime": "//' | sed 's/","ua":.*//' | xargs -i date -d '@{}' '+%Y%m%d%H')
echo $p >> split.$date
done <$1
Some facts:
80 000 000 lines to process
jq doesn't work well since some JSON lines are invalid.
Could you help me to optimize this bash script?
Thank you
This awk solution might come to your rescue:
awk -F'"' '{file=strftime("%Y%m%d%H",$4); print >> file; close(file) }' $1
It essentially replaces your while-loop.
Furthermore, you can replace the complete script with:
# Start AWK file
BEGIN { FS = "\"" }
(NR == 1) { tmin = tmax = $4 }
($4 > tmax) { tmax = $4 }
($4 < tmin) { tmin = $4 }
{ file="split."strftime("%Y%m%d%H",$4); print >> file; close(file) }
END {
    print "Total lines processed: ", NR
    print "First date: "strftime("%Y%m%d%H",tmin)
    print "Last date: "strftime("%Y%m%d%H",tmax)
}
Which you then can run as:
awk -f <awk_file.awk> <jq-file>
Note: the usage of strftime indicates that you need to use GNU awk.
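Opening and closing the output file for every one of 80 000 000 lines may itself become a cost. If the input is mostly ordered by time, a variant that closes a file only when the hour changes could help; this is an untested-at-scale sketch, assuming GNU awk is installed as gawk, and split_by_hour is a hypothetical name:

```shell
# split_by_hour FILE: append each line to split.YYYYMMDDHH in the current directory.
split_by_hour() {
    gawk -F'"' '
    {
        file = "split." strftime("%Y%m%d%H", $4)
        if (file != prev) {              # hour changed since the previous line
            if (prev != "") close(prev)  # release the old file handle
            prev = file
        }
        print >> file                    # file stays open while the hour is the same
    }' "$1"
}
```

The hour is computed in the local time zone, as in the awk answer above, and out-of-order lines still end up in the right file because of the >> append.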
You can start optimizing by replacing this:
sed 's/{"captureTime": "//' | sed 's/","ua":.*//'
with this:
sed -nE 's/(\{"captureTime": ")([0-9\.]+)(.*)/\2/p'
-n suppresses automatic printing of the pattern space
-E enables extended regular expressions in the script

Bash: "xargs cat", adding newlines after each file

I'm using a few commands to cat a few files, like this:
cat somefile | grep example | awk -F '"' '{ print $2 }' | xargs cat
It nearly works, but my issue is that I'd like to add a newline after each file.
Can this be done in a one liner?
(surely I can create a new script or a function that does cat and then echo -n but I was wondering if this could be solved in another way)
cat somefile | grep example | awk -F '"' '{ print $2 }' | while IFS= read -r file; do cat "$file"; echo; done
Using GNU Parallel it may be even faster (depending on your system):
cat somefile | grep example | awk -F '"' '{ print $2 }' | parallel "cat {}; echo"
awk -F '"' '/example/{ system("cat " $2); printf "\n" }' somefile
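If the same thing is needed in more than one place, the loop can be wrapped in a small function. A minimal sketch (cat_with_newlines is a made-up name) that reads one file name per line and tolerates spaces in the names:

```shell
# cat_with_newlines: read file names on stdin, print each file followed by an empty line.
cat_with_newlines() {
    # IFS= and -r keep leading blanks and backslashes in the names intact.
    while IFS= read -r file; do
        cat -- "$file"
        echo
    done
}
```

It would be used as: grep example somefile | awk -F '"' '{ print $2 }' | cat_with_newlines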
