curl output to csv table - bash

I have a bash script that generates output which is currently saved in a .txt file. I'm trying to place these data points in a CSV table instead. Can you help me with this?
The sample output looks like this:
****** A Day at the Races ******
* 19371937
* PassedPassed
* 1h 51m
IMDb RATING
7.5/10
14K
****** The King and the Chorus Girl ******
* 19371937
* ApprovedApproved
* 1h 34m
IMDb RATING
6.2/10
376
****** Room Service ******
* 19381938
* ApprovedApproved
* 1h 18m
IMDb RATING
6.6/10
5.2K
****** At the Circus ******
* 19391939
* PassedPassed
* 1h 27m
IMDb RATING
6.8/10
6K
I'm trying to change this into a CSV that contains movie title, year of release, notes, run time, IMDb rating and number of reviews as columns.
For example, for the first data point above, the CSV row should look like:
Movie title: 'A Day at the Races'
Year of release: 1937
Notes: Passed
Run time: 1h 51m
IMDB rating: 7.5/10
Number of reviews: 14K
The code used for generating the above output:
#!/bin/bash
# fullname="USER INPUT"
read -p "Enter fullname: " fullname
if [ "$fullname" = "Charlie Chaplin" ]; then
    code="nm0000122"
else
    code="nm0000050"
fi
rm -f imdb_links.txt
curl "https://www.imdb.com/name/$code/#actor" |
    grep -Eo 'href="/title/[^"]*' |
    sed 's#^href="#https://www.imdb.com#g' |
    sort -u |
    while read link; do
        # uncomment the next line to save links into file:
        #echo "$link" >>imdb_links.txt
        curl "$link" |
            html2text -utf8 |
            sed -n '/Sign_In/,/YOUR RATING/ p' |
            sed -n '$d; /^\*\{6\}.*\*\{6\}$/,$ p'
    done >imdb_all.txt
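One way to produce that CSV is to post-process imdb_all.txt with awk, keyed on the '****** Title ******' lines. The doubled year and notes ("19371937", "PassedPassed") are html2text artifacts, so the sketch below keeps only the first half of each. This is a sketch against the sample output shown above, not tested against live IMDb pages; the column names and the output file imdb_all.csv are my own choices, and titles containing commas would need extra quoting:
awk '
    BEGIN { OFS = ","; print "Title,Year,Notes,Runtime,Rating,Reviews" }
    function flush() { if (title != "") print title, year, notes, runtime, rating, reviews }
    /^\*\*\*\*\*\* .* \*\*\*\*\*\*$/ {           # "****** Title ******" starts a record
        flush()
        title = $0
        sub(/^\*+ /, "", title); sub(/ \*+$/, "", title)
        detail = 0
    }
    /^\* / {                                     # the three "* ..." detail lines, in order
        val = substr($0, 3)
        detail++
        if (detail == 1) year = substr(val, 1, 4)                    # "19371937" -> "1937"
        else if (detail == 2) notes = substr(val, 1, length(val)/2)  # "PassedPassed" -> "Passed"
        else if (detail == 3) runtime = val
    }
    /^IMDb RATING$/ { getline rating; getline reviews }              # next two lines
    END { flush() }
' imdb_all.txt > imdb_all.csv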

Related

How to replace a missing value with 0?

So I have a subject folder, located in /home/subject, which contains subjects such as:
Geography
Math
These subjects are files.
And each one of them contains the name of a student with his mark.
So for example it will be,
For Geography
Matthew 15
Elena 14
And Math :
Matthew 10
Elena 19
I also have a student folder, located in /home/student, which is empty for now.
And the purpose of this folder is to put inside of it :
The name of the student as the name of the file;
The marks of all subjects this student received.
So here is my code :
rm /home/student/*
for subjectFile in /home/subject/*; do
    awk -v subject="$(basename "$subjectFile")" '{ print subject, $2 >> "/home/student/" $1 }' "$subjectFile"
done
This loop iterates over all the subject files, inside of the subject folder.
The $subjectFile value is something like :
/home/subject/Math
/home/subject/Geograpy
/home/subject/(MySubject)
Etc.. => Depending on what subjects are available.
I then, get the basename of each of these subject files :
Math
Geography
(...)
And then print the second column, the mark, into the file named after the student, which I get from the first column of the subject file: so in this example, for the subject Geography, I'll get Matthew.
I also didn't want to keep appending results forever, but rather overwrite the previous results each time I run this script, so I added rm /home/student/* to erase any student files before appending.
This works great.
But then I have a request,
How can I make it write 0 as the subject mark for a student when that mark is undefined?
That is, when a student did not receive any mark for a specific subject while others did?
As for example :
For Geography
Matthew 15
And Math :
Matthew 10
Elena 19
So for Elena,
This shall create an Elena file, inside of the student folder with :
Geography 0
Math 19
Simple distribution using pure bash
Let's try it. This will create a little tree in /tmp:
cd /tmp && tar -zxvf <(base64 -d <<eof
H4sIAFUFUFwAA+3XXU6EMBQF4D6ziu7A/l2uLsBHE7eAYwUMghlKnNm9EKfEEHUyBpg4nu+lJJDQ
5HBK226KpqmuxJJUj4mGUTOpz2MktHXG9RdEWiitrWUhadFZHXRtyLZSiidflbsfnjt2/49qP/Jv
u4dnvwnLfAcn5K9JuT5/w0zIfw2T/F+yUMz+jiHg1Lnv87c09j9l0+dvU3JCqtln8oV/nv9dFkLh
36RWyW3l60zqm+S+KCspNSfnnhwsbtJ/X+dV2c68BBzvfzr2nw33/XeWUvR/DWP/WcYFgOICcI0F
4OJN+p/7Jt9mr8V+znec8P8/7P8cK4v+r2HsP8X6u1h/Qv0vX+x/6B59ff7znyFNw/5fGZz/AQAA
AAAAAAAAAAB+7R1PsalnACgAAA==
eof
)
This will create (and print in the terminal, because of the -v flag on the tar command):
school/
school/subject/
school/subject/math
school/subject/english
school/subject/geography
school/student/
Quick overview:
cd /tmp/school/subject/
grep . * | sort -t: -k2
will render:
geography:Elena 14
english:Elena 15
math:Elena 19
math:Matthew 10
geography:Matthew 15
english:Matthew 17
geography:Phil 15
math:Phil 17
english:Phil 18
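(If you'd rather not feed an opaque blob to tar, the same tree can be recreated directly; this sketch rebuilds it from the overview above, the order of lines within each file being irrelevant here:)
mkdir -p /tmp/school/subject /tmp/school/student
cd /tmp/school
printf '%s\n' 'Matthew 10' 'Elena 19' 'Phil 17' > subject/math
printf '%s\n' 'Matthew 17' 'Elena 15' 'Phil 18' > subject/english
printf '%s\n' 'Matthew 15' 'Elena 14' 'Phil 15' > subject/geography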
Building student stat files:
cd /tmp/school
rm student/*
for file in subject/*; do
    subj=${file##*/}
    while read student note; do
        echo >> student/${student,,} ${subj^} $note
    done < $file
done
Note: this uses ${VARNAME^} to uppercase the first character and ${VARNAME,,} to lowercase the whole string, so file names are all lowercase and subjects become capitalized in the student files.
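For example, at an interactive prompt:
s=phil;    echo "${s^}"     # -> Phil
s=MATTHEW; echo "${s,,}"    # -> matthew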
Now:
ls -l student
total 12
-rw-r--r-- 1 user user 32 jan 29 08:57 elena
-rw-r--r-- 1 user user 32 jan 29 08:57 matthew
-rw-r--r-- 1 user user 32 jan 29 08:57 phil
and
cat student/phil
English 18
Geography 15
Math 17
Now, searching for missing marks:
for file in student/*; do
    for subj in subject/*; do
        subj=${subj##*/}
        grep -q "^${subj^} " $file || echo ${subj^} 0 >> $file
    done
done
This can be tested as follows (this will randomly delete zero or one mark line from each file):
for file in subject/*; do
    ((val = 1 + (RANDOM % 4)))
    ((val < 4)) && sed -i ${val}d $file
done
Then run:
cd /tmp/school
rm student/*
for file in subject/*; do
    subj=${file##*/}
    while read student note; do
        echo >> student/${student,,} ${subj^} $note
    done < $file
done
for file in student/*; do
    for subj in subject/*; do
        subj=${subj##*/}
        grep -q "^${subj^} " $file || echo ${subj^} 0 >> $file
    done
done
Ok, now:
grep ' 0$' student/*
student/matthew:Geography 0
Note: as $RANDOM is used, the result may differ in your tests ;-)
Another approach: two steps again, but the first step builds the student list, and the second writes the student files, adding 0 marks immediately:
cd /tmp/school
rm student/*
declare -A students
for file in subject/*; do                  # first pass: collect all student names
    while read student mark; do
        [ "$student" ] && students[$student]=
    done < $file
done
for file in subject/*; do
    class=(${!students[@]})                # start with the full class
    while read student mark; do
        subj=${file##*/}
        echo >> student/${student,,} ${subj^} $mark
        class=(${class[@]/$student})       # drop students who already have a mark
    done < $file
    for student in ${class[@]}; do         # whoever is left gets a 0
        echo >> student/${student,,} ${subj^} 0
    done
done
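The class=(${class[@]/$student}) line deserves a word: the pattern substitution empties the matching element, and the unquoted re-assignment word-splits the result, which drops the now-empty element. A standalone demo:
class=(Matthew Elena Phil)
student=Elena
class=(${class[@]/$student})   # "Elena" becomes empty and vanishes on re-split
echo "${class[@]}"             # -> Matthew Phil
Note this removes the substring from every element, so it would also mangle a name that contains another (removing Phil would turn a hypothetical Philippa into ippa).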
Statistic tool
For fun, with a lot of bashisms and without creating any files, here is a little dump tool:
#!/bin/bash
declare -A students
declare subjects=() sublen=0 stdlen=0
for file in subject/* ;do                                   # read all subject files
    subj=${file##*/}
    subjects+=($subj)                                       # add subject to array
    sublen=$(( ${#subj} > sublen ? ${#subj} : sublen ))     # max subject string length
    declare -A mark_$subj                                   # create subject's associative array
    while read student mark; do
        stdlen=$(( ${#student} > stdlen ? ${#student} : stdlen ))
        [ "$student" ] && {                                 # skip empty lines
            ((students[$student]++))                        # count student's marks
            printf -v mark_$subj[$student] "%d" $mark       # store student's mark
        }
    done < $file
done
printf -v formatstr %${#subjects[@]}s                       # one space per subject
formatstr="%-${stdlen}s %2s ${formatstr// / %${sublen}s}"   # format string for all subjects
printf -v headline "$formatstr" Student Qt "${subjects[@]}"
echo "$headline"                                            # print head line
echo "${headline//[^ ]/-}"                                  # underline head line
for student in ${!students[@]}; do                          # now one line per student...
    marks=()                                                # clear marks
    for subject in ${subjects[@]}; do
        eval "marks+=(\${mark_$subject[\$student]:-0})"     # add subject mark or 0
    done
    printf "$formatstr\n" $student ${students[$student]} ${marks[@]}
done
This may print out something like:
Student Qt english geography math
------- -- ------- --------- ----
Phil 2 18 15 0
Matthew 3 17 15 10
Elena 2 0 14 19
Note: this script was built for bash v4.4.12 and tested under bash v5.0.
More
You could download a bigger demo script: scholl-averages-demo.sh. Still pure bash without forks, but with:
average by student, average by subject and overall average, in pseudo float
subjects and students sorted alphabetically
UTF-8 support in student names
Student Qt art biology english geography history math Average
------- -- --- ------- ------- --------- ------- ---- -------
Elena 5 12 0 15 14 17 19 12.83
Iñacio 6 12 15 19 18 12 14 15.00
Matthew 5 19 18 17 15 17 0 14.33
Phil 5 15 19 18 0 13 17 13.67
Renée 6 14 19 18 17 18 15 16.83
Theresa 5 17 14 0 12 17 18 13.00
William 6 17 17 15 15 13 14 15.17
------- -- --- ------- ------- --------- ------- ---- -------
Avgs 7 15.14 14.57 14.57 13.00 15.28 13.86 14.40
I would do it something like this. First, make empty files for every student:
cat /home/subject/* | cut -d' ' -f1 | sort -u | while read student_name; do > "/home/student/$student_name"; done
Then I would go through each one and add the marks:
for student in $(ls /home/student); do
    for file in /home/subject/*; do
        subject=$(basename "$file")
        mark=$(grep -E "^$student [0-9]+" "$file" | cut -d' ' -f2)
        if [ -z "$mark" ]; then
            echo "$subject 0" >> "/home/student/$student"
        else
            echo "$subject $mark" >> "/home/student/$student"
        fi
    done
done
something like that anyway

I have two files and need to take a record count and checksum from one and compare them with the other

I have two files. I need to take the record count and checksum from sg_fx_cur_mapping_20170221.csv and compare them with the values stored in sg_fx_cur_mapping_20170221.tok.
1st file
head -10 sg_fx_cur_mapping_20170221.csv
UNIQUE IDENTIFIER AC CODE LONGNAME RISK FACTOR IDENTIFIER INSTRUMENT TYPE QUOTED CURRENCY BASE CURRENCY GLOBAL RATE LOCALE MXG_CURRENCY MXG_PIPSIZE MXG_LOCALE
SC.1000010374 FX_AED*USD_SPOT_GBL FX_AED*USD_SPOT_GBL FX_SPOT AED USD 1 UK USD-AED UK
SC.1000010375 FX_AMD*USD_SPOT_GBL FX_AMD*USD_SPOT_GBL FX_SPOT AMD USD 1 UK
SC.1000010376 FX_ANG*USD_SPOT_GBL FX_ANG*USD_SPOT_GBL FX_SPOT ANG USD 1 UK USD-ANG UK
SC.1000010376 FX_ANG*USD_SPOT_GBL FX_ANG*USD_SPOT_GBL FX_SPOT ANG USD 1 UK USD-ANG SG
SC.1000010376 FX_ANG*USD_SPOT_GBL FX_ANG*USD_SPOT_GBL FX_SPOT ANG USD 1 UK USD-ANG US
SC.1000010377 FX_AOA*USD_SPOT_GBL FX_AOA*USD_SPOT_GBL FX_SPOT AOA USD 1 UK USD-AOA UK
SC.1000010377 FX_AOA*USD_SPOT_GBL FX_AOA*USD_SPOT_GBL FX_SPOT AOA USD 1 UK USD-AOA SG
SC.1000010378 FX_ARS*USD_SPOT_GBL FX_ARS*USD_SPOT_GBL FX_SPOT ARS USD 1 UK USD-ARS UK
SC.1000010380 FX_BBD*USD_SPOT_GBL FX_BBD*USD_SPOT_GBL FX_SPOT BBD USD 1 UK USD-BBD UK
2nd file
cat sg_fx_cur_mapping_20170221.tok
CHECKSUM|0b4e6c5935c39ae311dd477e216892d5
RECORDCOUNT|00000000681
Since we don't have many clues (which checksum algorithm is being used?), here is one option:
> cat checker.sh
#!/bin/bash
echo "CHECKSUM|"$(md5sum "$1" | cut -d' ' -f1) > /tmp/$$
echo "RECORDCOUNT|"$(wc -l "$1" | cut -d' ' -f1) >> /tmp/$$
if [ $(comm -1 -2 <(sort /tmp/$$) <(sort "$2") | wc -l) -eq 2 ]; then
    echo "Files are equal"
else
    echo "Files are different"
fi
rm /tmp/$$
exit 0
And use it this way:
> checker.sh sg_fx_cur_mapping_20170221.csv sg_fx_cur_mapping_20170221.tok
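One caveat: the .tok sample zero-pads the record count (RECORDCOUNT|00000000681), so a bare wc -l output will never match it. Assuming the padding is always 11 digits, as in the sample, a printf variant of the second line would make the values line up:
printf 'RECORDCOUNT|%011d\n' "$(wc -l < "$1")" >> /tmp/$$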

shell script inserting "$" into a formatted column and adding new column

Hi guys, pardon my bad English. I managed to display my data nicely and neatly using the column program in the code below. But how do I add a "$" in the price column? And secondly, how do I add a new total sum column (Price * Sold) and display it with "$"?
(echo "Title:Author:Price:Quantity:Sold" && cat BookDB.txt) | column -s: -t
Output:
Title                     Author               Price   Quantity  Sold
The Godfather             Mario Puzo           21.50   50        20
The Hobbit                J.R.R Tolkien        40.50   50        10
Romeo and Juliet          William Shakespeare  102.80  200       100
The Chronicles of Narnia  C.S.Lewis            35.90   80        15
Lord of the Flies         William Golding      29.80   125       25
Memories of a Geisha      Arthur Golden        35.99   120       50
I guess you could do it with awk (line break added before && for readability):
(echo "Title:Author:Price:Quantity:Sold:Calculated"
&& awk -F: '{printf ("%s:%s:$%.2f:%d:%d:$%.2f\n",$1,$2,$3,$4,$5,$3*$5)}' BookDB.txt) | column -s: -t
The %.2f format keeps the cents (a plain %d would truncate 21.50 to 21) and puts a "$" on the calculated column as well.
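With the sample data, the first row of output would come out something like:
Title          Author      Price   Quantity  Sold  Calculated
The Godfather  Mario Puzo  $21.50  50        20    $430.00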

List of last generated file on each day from 7 days list

I have a list of files in the following format:
Group_2012_01_06_041505.csv
Region_2012_01_06_041508.csv
Region_2012_01_06_070007.csv
XXXX_YYYY_MM_DD_HHMMSS.csv
What is the best way to compile a list of the last generated file for each day per group, from the last 7 days?
A version that worked on HP-UX:
for d in 6 5 4 3 2 1 0; do
    DATES[d]=$(perl -e "use POSIX; print strftime '%Y_%m_%d', localtime time-86400*$d;")
done
for group in $(ls *.csv | cut -d_ -f1 | sort -u); do
    CSV_FILES=$working_dir/*.csv
    if [ ! -f $CSV_FILES ]; then
        break    # if no file exists, do not attempt processing
    fi
    for d in "${DATES[@]}"; do
        file_nm=$(ls ${group}_$d* 2>/dev/null | sort -r | head -1)
        if [ "$file_nm" != "" ]; then
            :    # Process file here
        fi
    done
done
You can explicitly iterate over the group/date combinations:
for d in {1..6}; do
    DATES[d]=$(gdate +"%Y_%m_%d" -d "$d day ago")
done
for group in $(ls *.csv | cut -d_ -f1 | sort -u); do
    for d in "${DATES[@]}"; do
        echo "$group $d: " $(ls ${group}_$d* 2>/dev/null | sort -r | head -1)
    done
done
Which outputs the following for your example data set:
Group 2012_01_06: Group_2012_01_06_041505.csv
Group 2012_01_05:
Group 2012_01_04:
Group 2012_01_03:
Group 2012_01_02:
Group 2012_01_01:
Region 2012_01_06: Region_2012_01_06_070007.csv
Region 2012_01_05:
Region 2012_01_04:
Region 2012_01_03:
Region 2012_01_02:
Region 2012_01_01:
XXXX 2012_01_06:
XXXX 2012_01_05:
XXXX 2012_01_04:
XXXX 2012_01_03:
XXXX 2012_01_02:
XXXX 2012_01_01:
Note that Region_2012_01_06_041508.csv is not shown for Region 2012_01_06, as it is older than Region_2012_01_06_070007.csv.
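An alternative that avoids one ls call per group/date pair: since the HHMMSS timestamp in XXXX_YYYY_MM_DD_HHMMSS.csv sorts lexically, a single pass over a sorted listing can keep the newest file per group per day. A sketch, assuming group names contain no underscore:
ls *.csv | sort | awk -F_ '
    { last[$1 "_" $2 "_" $3 "_" $4] = $0 }   # key: group + date; later (newer) files overwrite
    END { for (k in last) print last[k] }
' | sort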

Bash and awk: converting a field from 12 hour to 24 hour clock time

I have a large txt file, space delimited, which I split into 18 smaller files (each with its own number of columns). The split is based on a delimiter, i.e. whenever the timestamp hits midnight. So effectively, I'll end up with 18 files of the form (note: ignore the dashes and pipes, I've used them to improve readability):
file1
time ----------- valueA - valueB
12:00:00 AM | 54.13 | 239.12
12:00:01 AM | 51.83 | 119.93
..
file18
time ---------- valueA - valueB - valueC - valueD
12:00:00 AM | 54.92 | 239.12 | 231.23 | 882.12
12:00:01 AM | 23.92 | 121.92 | 201.23 | 892.12
..
Once I split the file, I then perform some processing on each of the files using awk, so in short there are two stages: the 'split stage' and the 'processing stage'.
Unfortunately, the timestamps contained in the large txt file come in one of two formats: either the desirable 24-hour format "00:00:01" or the undesirable 12-hour format "12:00:01 AM".
As a result, I'm trying to convert all timestamps to 24-hour format and I'm not sure how to do this. I'm also not sure whether to attempt this at the split stage using bash or at the processing stage using awk. I know that the following command converts 12-hour to 24-hour:
date --date="12:00:01 AM" +%T
However, I'm not sure how to incorporate this into my shell script, where I'm using 'while read line' at the split stage, or whether I should do the time conversion in awk (if possible?) at the processing stage.
See the test below; is it helpful for you?
kent$ echo "12:00:00 AM | 54.92 | 239.12 | 231.23 | 882.12" \
      | awk -F'|' 'BEGIN{OFS="|"} {("date --date=\"" $1 "\" +%T") | getline $1; print}'
Output:
00:00:00| 54.92 | 239.12 | 231.23 | 882.12
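One caution about the getline form: it forks date once per input line and leaves each command pipe open (strictly, it should be close()d in a loop), which matters for the large files described. A pure-awk sketch that does the conversion arithmetically instead, assuming the timestamp is always field 1 of a '|'-delimited line as in file18:
awk -F'|' 'BEGIN{OFS="|"}
{
    ts = $1
    gsub(/^[ \t]+|[ \t]+$/, "", ts)     # trim surrounding whitespace
    n = split(ts, t, /[: ]/)            # t[1]=hh t[2]=mm t[3]=ss t[4]=AM|PM
    if (n == 4) {                       # 12-hour format; 24-hour lines pass through
        h = t[1] % 12                   # "12 AM" wraps to 0
        if (t[4] == "PM") h += 12
        $1 = sprintf("%02d:%s:%s ", h, t[2], t[3])
    }
    print
}' file18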
