Selective output from SoX stats option? - bash

I'm using Kubuntu to run SoX. I have the following code to get info from sound files:
for file in *.mp3; do echo -e '\n--------------------\n'$file'\n'; sox $file -n stats; done > stats.txt 2>&1 | tail -1
It produces output that looks like this:
--------------------
soundfile_name.mp3
DC offset -0.000287
Min level -0.585483
Max level 0.572299
Pk lev dB -4.65
RMS lev dB -19.55
RMS Pk dB -12.98
RMS Tr dB -78.44
Crest factor 5.56
Flat factor 0.00
Pk count 2
Bit-depth 29/29
Num samples 628k
Length s 14.237
Scale max 1.000000
Window s 0.050
Could someone amend the command to limit the output so that it looks like this?
--------------------
soundfile_name.mp3
Pk lev dB -4.65
RMS lev dB -19.55
RMS Pk dB -12.98
RMS Tr dB -78.44
thanks

Given that the lines of interest have the word "dB" in common, you can filter the SoX output with grep -w dB. Note that the stats effect writes its report to stderr, so redirect it into the pipe with 2>&1:
for file in *.mp3; do echo -e '\n--------------------\n'"$file"'\n'; sox "$file" -n stats 2>&1 | grep -w dB; done > stats.txt
Resulting content of stats.txt:
--------------------
soundfile_name.mp3
Pk lev dB -4.65
RMS lev dB -19.55
RMS Pk dB -12.98
RMS Tr dB -78.44
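If you ever need a subset of stats whose names don't all contain "dB", a small awk variation (a sketch, not part of the original answer) can match on the stat names instead:
for file in *.mp3; do
    echo -e '\n--------------------\n'"$file"'\n'
    # stats reports on stderr, so merge it into the pipe before filtering
    sox "$file" -n stats 2>&1 |
        awk '/^Pk lev dB/ || /^RMS lev dB/ || /^RMS Pk dB/ || /^RMS Tr dB/'
done > stats.txt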

Related

curl output to csv table

I have a bash script that generates output which is currently saved to a .txt file. I'm trying to put these data points into a CSV table instead. Can you help me with this?
The sample output looks like this:
****** A Day at the Races ******
* 19371937
* PassedPassed
* 1h 51m
IMDb RATING
7.5/10
14K
****** The King and the Chorus Girl ******
* 19371937
* ApprovedApproved
* 1h 34m
IMDb RATING
6.2/10
376
****** Room Service ******
* 19381938
* ApprovedApproved
* 1h 18m
IMDb RATING
6.6/10
5.2K
****** At the Circus ******
* 19391939
* PassedPassed
* 1h 27m
IMDb RATING
6.8/10
6K
I'm trying to change this into a CSV with the following columns: movie title, year of release, notes, run time, IMDb rating, and number of reviews.
For example, for the first data point above, the CSV row should look like:
Movie title: 'A day at the races'
Year of release: 1937
Notes: Passed
Run time: 1h 51m
IMDB rating: 7.5/10
Number of reviews: 14k
The code used for generating the above output:
#!/bin/bash
# fullname="USER INPUT"
read -p "Enter fullname: " fullname
if [ "$fullname" = "Charlie Chaplin" ]; then
    code="nm0000122"
else
    code="nm0000050"
fi
rm -f imdb_links.txt
curl "https://www.imdb.com/name/$code/#actor" |
    grep -Eo 'href="/title/[^"]*' |
    sed 's#^href="#https://www.imdb.com#g' |
    sort -u |
    while read link; do
        # uncomment the next line to save links into file:
        #echo "$link" >>imdb_links.txt
        curl "$link" |
            html2text -utf8 |
            sed -n '/Sign_In/,/YOUR RATING/ p' |
            sed -n '$d; /^\*\{6\}.*\*\{6\}$/,$ p'
    done >imdb_all.txt
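As a rough sketch (not from the thread), the blocks in imdb_all.txt could be post-processed into a CSV with awk, assuming every block follows exactly the sample layout above (title line, three "* " lines, "IMDb RATING", the rating, then the vote count):
awk '
BEGIN { print "title,year,notes,runtime,rating,reviews" }
function emit() {
    if (title != "")
        printf "\"%s\",%s,%s,%s,%s,%s\n", title, year, notes, runtime, rating, reviews
}
/^\*+ .* \*+$/ {                        # "****** Title ******"
    emit()                              # flush the previous block
    title = $0
    gsub(/^\*+ | \*+$/, "", title)
    n = 0; year = notes = runtime = rating = reviews = ""
    next
}
/^\* / {
    line = substr($0, 3); n++
    if (n == 1) year = substr(line, 1, 4)                        # "19371937" -> 1937
    else if (n == 2) notes = substr(line, 1, length(line) / 2)   # "PassedPassed" -> Passed
    else if (n == 3) runtime = line
    next
}
/^IMDb RATING$/ { state = "rating"; next }
state == "rating"  { rating = $0; state = "reviews"; next }
state == "reviews" { reviews = $0; state = ""; next }
END { emit() }
' imdb_all.txt > imdb_all.csv
For the sample above, this produces one row per movie, e.g. "A Day at the Races",1937,Passed,1h 51m,7.5/10,14K.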

How to capture row count of a table in a variable in Unix

hive -e "select count (*) from table where year=2019 and month=04 and day=15"
This command gives me the result 15, in the format below:
+----+
| a  |
+----+
| 15 |
+----+
How do I get just the value 15 instead of the format above?
The code below should be helpful for you:
a=$(hive -e "select count (*) from table where year=2019 and month=04 and day=15")
echo $a
hive -e "select count (*) from table where year=2019 and month=04 and day=15" | grep -o '[0-9]*'
The -o switch makes grep output only the part of the input that actually matches the pattern.
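To capture the number in a shell variable, one hedged combination of the two suggestions above (assuming the query returns a single numeric value and hive's log messages go to stderr):
row_count=$(hive -e "select count (*) from table where year=2019 and month=04 and day=15" | grep -Eo '[0-9]+' | tail -1)
echo "$row_count"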

Dropping hive partition based on certain condition in runtime

I have a table in hive built using the following command:
create table t1 (x int, y int, s string) partitioned by (wk int) stored as sequencefile;
The table has the data below:
select * from t1;
+-------+-------+-------+--------+--+
| t1.x  | t1.y  | t1.s  | t1.wk  |
+-------+-------+-------+--------+--+
| 1     | 2     | abc   | 10     |
| 4     | 5     | xyz   | 11     |
| 7     | 8     | pqr   | 12     |
+-------+-------+-------+--------+--+
Now the requirement is to drop the oldest partition when the partition count is >= 2.
Can this be handled in HQL or through a shell script, and how?
Consider that I will be using the database name as a variable, e.g. hive -e 'use "$dbname"; show partitions t1'.
If your partitions are ordered by date, you could write a shell script that uses hive -e 'SHOW PARTITIONS t1' to get all partitions; for your example it will return:
wk=10
wk=11
wk=12
Then you can issue hive -e 'ALTER TABLE t1 DROP PARTITION (wk=10)' to remove the first (oldest) partition.
So something like:
db=mydb
if (( $(hive -e "use $db; SHOW PARTITIONS t1" | grep wk | wc -l) < 2 )); then
    exit
fi
partition=$(hive -e "use $db; SHOW PARTITIONS t1" | grep wk | head -1)
hive -e "use $db; ALTER TABLE t1 DROP PARTITION ($partition)"
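A hedged variant (not from the answer) that takes the database name as a script argument, as the question mentions, and sorts the partitions numerically so head -1 really is the oldest week:
dbname="$1"
partitions=$(hive -e "use $dbname; SHOW PARTITIONS t1" | grep '^wk=')
count=$(echo "$partitions" | grep -c '^wk=')
if (( count >= 2 )); then
    # sort on the numeric part after "wk=" so wk=9 sorts before wk=10
    oldest=$(echo "$partitions" | sort -t= -k2 -n | head -1)
    hive -e "use $dbname; ALTER TABLE t1 DROP PARTITION ($oldest)"
fi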

same dataset different prediction results

I have a very simple dataset, see below (let's call it a.vw):
-1 |a 1 |b c57
1 |a 2 |b c3
There are 2 namespaces (a and b), and after reading the wiki I know that vw will automatically build the real feature names, like a^1 or b^c57.
However, before I knew it, I actually made a vw file like this (call it b.vw):
-1 |a a_1 |b b_c57
1 |a a_2 |b b_c3
As you can see, I just added a prefix to each feature manually.
Now I train models on both files with the same configuration, like this:
cat a.vw | vw --loss_function logistic --passes 1 --hash all -f a.model --invert_hash a.readable --random_seed 1
cat b.vw | vw --loss_function logistic --passes 1 --hash all -f b.model --invert_hash b.readable --random_seed 1
Then I checked the readable model files; they have exactly the same weight for each feature, see below:
$ cat a.readable
Version 8.2.1
Id
Min label:-50
Max label:50
bits:18
lda:0
0 ngram:
0 skip:
options:
Checksum: 295637807
:0
Constant:116060:-0.0539969
a^1:112195:-0.235305
a^2:1080:0.243315
b^c3:46188:0.243315
b^c57:166454:-0.235305
$ cat b.readable
Version 8.2.1
Id
Min label:-50
Max label:50
bits:18
lda:0
0 ngram:
0 skip:
options:
Checksum: 295637807
:0
Constant:116060:-0.0539969
a^a_1:252326:-0.235305
a^a_2:85600:0.243315
b^b_c3:166594:0.243315
b^b_c57:227001:-0.235305
Finally, I made predictions with each model on its own dataset, like this:
$ cat a.vw | vw -t -i a.model -p a.pred --link logistic --quiet
$ cat b.vw | vw -t -i b.model -p b.pred --link logistic --quiet
Now here comes the problem: a.pred holds very different results from b.pred, see below:
$ cat a.pred
0.428175
0.547189
$ cat b.pred
0.371776
0.606502
WHY? Does it mean we have to add prefixes to features manually?
If you try cat a.vw | vw -t -i a.model -p a.pred --link logistic --quiet --hash all you'll get:
$ cat a.pred
0.371776
0.606502
It seems the --hash argument value isn't stored in the model file, so you need to specify it at test time as well. It doesn't matter for b.vw, as it has no purely numeric feature names, but it comes into play with a.vw. I'm not sure whether it's a bug, but you may want to report it.
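In other words (a usage note, not from the original answer), repeating the test step with --hash all for both models keeps the invocations consistent; per the explanation above, the extra flag only matters for a.vw and is harmless for b.vw:
$ cat a.vw | vw -t -i a.model -p a.pred --link logistic --quiet --hash all
$ cat b.vw | vw -t -i b.model -p b.pred --link logistic --quiet --hash all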

List of last generated file on each day from 7 days list

I've a list of files in the following format:
Group_2012_01_06_041505.csv
Region_2012_01_06_041508.csv
Region_2012_01_06_070007.csv
XXXX_YYYY_MM_DD_HHMMSS.csv
What is the best way to compile a list of the last generated file for each group on each of the last 7 days?
A version that worked on HP-UX:
for d in 6 5 4 3 2 1 0
do
    DATES[d]=$(perl -e "use POSIX; print strftime '%Y_%m_%d', localtime time-86400*$d;")
done
for group in `ls *.csv | cut -d_ -f1 | sort -u`
do
    CSV_FILES=$working_dir/*.csv
    if [ ! -f $CSV_FILES ]; then
        break   # if no file exists do not attempt processing
    fi
    for d in "${DATES[@]}"
    do
        file_nm=$(ls ${group}_$d* 2>/dev/null | sort -r | head -1)
        if [ "$file_nm" != "" ]
        then
            :   # process "$file_nm" here
        fi
    done
done
You can explicitly iterate over the group/time combinations:
for d in {1..6}
do
    DATES[d]=`gdate +"%Y_%m_%d" -d "$d day ago"`
done
for group in `ls *csv | cut -d_ -f1 | sort -u`
do
    for d in "${DATES[@]}"
    do
        echo "$group $d: " `ls ${group}_$d* 2>/dev/null | sort -r | head -1`
    done
done
Which outputs the following for your example data set:
Group 2012_01_06: Group_2012_01_06_041505.csv
Group 2012_01_05:
Group 2012_01_04:
Group 2012_01_03:
Group 2012_01_02:
Group 2012_01_01:
Region 2012_01_06: Region_2012_01_06_070007.csv
Region 2012_01_05:
Region 2012_01_04:
Region 2012_01_03:
Region 2012_01_02:
Region 2012_01_01:
XXXX 2012_01_06:
XXXX 2012_01_05:
XXXX 2012_01_04:
XXXX 2012_01_03:
XXXX 2012_01_02:
XXXX 2012_01_01:
Note that Region_2012_01_06_041508.csv is not shown for Region 2012_01_06, as it is older than Region_2012_01_06_070007.csv.
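If you only want lines for days that actually have a file, a small variation (a sketch, assuming the DATES array built above) would be:
for group in $(ls *.csv | cut -d_ -f1 | sort -u)
do
    for d in "${DATES[@]}"
    do
        # newest file for this group and day, empty if none exists
        newest=$(ls ${group}_$d* 2>/dev/null | sort -r | head -1)
        [ -n "$newest" ] && echo "$group $d: $newest"
    done
done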
