How to count the number of reviews in my files in a folder and then sort by highest to lowest? - bash

thanks anyone who has helped me so far, here is my problem: I have a folder which contains 825 files. Within these files are reviews of a hotel. An example name of one of these files is hotel_72572.dat and this file basically contains the following:
<Overall Rating>4
<Avg. Price>$173
<URL>http://www.tripadvisor.com/ShowUserReviews-g60878-d72572-r23327047-Best_Western_Pioneer_Square_Hotel-Seattle_Washington.html
<Author>everywhereman2
<Content>Old seattle...
<Date>Jan 6, 2009
<img src="http://cdn.tripadvisor.com/img2/new.gif" alt="New"/>
<No. Reader>-1
<No. Helpful>-1
<Overall>5
<Value>5
<Rooms>5
<Location>5
<Cleanliness>5
<Check in / front desk>5
<Service>5
<Business service>5
<Author> //repeats the fields again, each cluster of fields is a review
The fields (line 6 - <Business service>) are then repeated by n times where n is the number of reviews in the file. I thought that by counting the number of times "Author" appears per file would achieve this but perhaps there is a better solution?
I am trying to write a script that will called countreviews.sh that will count the number of reviews per file in my folder (the folder name is reviews_folder) and then sort the number per file from highest to lowest. An example output will be:
hotel_72572 45
hotel_72579 33
hotel_73727 17
where the prefix is the name of the file and the number is the number of reviews per file. My script must take the folder name as an argument. For example I would type ./countreviews.sh reviews_folder and would get my output.
I have received lots of help over the past few days with many different suggestions but none of them have achieved what I am trying to do (my fault due to poor explanations), I hope this finally explains it clearly enough. Thanks again anyone who has helped me over the past few days and for any help I get for this question.

grep -c Author hotel_*.dat | sort -t : -k2nr | sed 's/\.dat:/ /'
Output (e.g.):
hotel_72572 45
hotel_72579 33
hotel_73727 17
Update
#!/bin/bash
cd "$1" || exit 1
grep -c Author hotel_*.dat | sort -t : -k2nr | sed 's/\.dat:/ /'

Related

FreeBSD script to show active connections and append number remote file

I am using NetScaler FreeBSD, which recognizes many of the UNIX like commands, grep, awk, crontab… etc.
I run the following command to get the number of connected users that we have on the system
#> nsconmsg -g aaa_cur_ica_conn -d stats
OUTPUT (numbered lines):
Line1: Displaying current counter value information
Line2: NetScaler V20 Performance Data
Line3: NetScaler NS11.1: Build 63.9.nc, Date: Oct 11 2019, 06:17:35
Line4:
Line5: reltime:mili second between two records Sun Jun 28 23:12:15 2020
Line6: Index reltime counter-value symbol-name&device-no
Line7: 1 2675410 605 aaa_cur_ica_conn
…
…
From above output - I only need the number of connected users (represented in Line 7, 3rd column (605 to be precise), along with the Hostname and Time (of the running script)
Now, to extract this important 3rd column number i.e. 605, along with the hostname, and time of data collected - I wrote the following script:
printf '%s - %s - %s\n' "$(hostname)" "$(date '+%H:%M')" "$(nsconmsg -g aaa_cur_ica_conn -d stats | grep aaa_cur_ica_conn | awk '{print $3}')"
The result is perfect, showing hostname, time, and the number of connected users as follows:
Hostname - 09:00 – 605
Now can anyone please shed light on how I can:
Run this script every day - 5am to 5pm (12hours)?
Each time scripts runs - append a file on a remote Unix share with the output?
I appreciate this might be a bit if a challenge... however would be grateful for any bash scripts wizards out there that can create magic!
Thanks in advance!
I would suggest a quick look into the FreeBSD Handbook or For People New to Both FreeBSD and UNIX® so that you could get familiar with the operating system and tools that could help you achieve better what you want.
For example, there is a utility/command named cron
The software utility cron is a time-based job scheduler in Unix-like computer operating systems.
For example, to run something all days between 5am to 5pm every minute, you could use something like:
* 05-17 * * * command
Try more options here: https://crontab.guru/#*_05-17_*_*_*.
There are more tools for scheduling commands, for example at (https://en.wikipedia.org/wiki/At_(command)) but this something you need to evaluate and read more about it.
Now regarding the command, you are using to get the "number of connected users", you could avoid the grep and just used awk for example:
awk '/aaa_cur_ica_conn/ {print $3}'
This will print only column 3 if line contains aaa_cur_ica_conn, but as before I invite you to read more about the topic so that you could bet a better overview and better understand the commands.
Last but not least, check this link How do I ask a good question? the better you could format, and elaborate your question the easy for others to give an answer.

Faster way of Appending/combining thousands (42000) of netCDF files in NCO

I seem to be having trouble properly combining thousands of netCDF files (42000+) (3gb in size, for this particular folder/variable). The main variable that i want to combine has a structure of (6, 127, 118) i.e (time,lat,lon)
Im appending each file 1 by 1 since the number of files is too long.
I have tried:
for i in input_source/**/**/*.nc; do ncrcat -A -h append_output.nc $i append_output.nc ; done
but this method seems to be really slow (order of kb/s and seems to be getting slower as more files are appended) and is also giving a warning:
ncrcat: WARNING Intra-file non-monotonicity. Record coordinate "forecast_period" does not monotonically increase between (input file file1.nc record indices: 17, 18) (output file file1.nc record indices 17, 18) record coordinate values 6.000000, 1.000000
that basically just increases the variable "forecast_period" 1-6 n-times. n = 42000files. i.e. [1,2,3,4,5,6,1,2,3,4,5,6......n]
And despite this warning i can still open the file and ncrcat does what its supposed to, it is just slow, at-least for this particular method
I have also tried adding in the option:
--no_tmp_fl
but this gives an eror:
ERROR: nco__open() unable to open file "append_output.nc"
full error attached below
If it helps, im using wsl and ubuntu in windows 10.
Im new to bash and any comments would be much appreciated.
Either of these commands should work:
ncrcat --no_tmp_fl -h *.nc
or
ls input_source/**/**/*.nc | ncrcat --no_tmp_fl -h append_output.nc
Your original command is slow because you open and close the output files N times. These commands open it once, fill-it up, then close it.
I would use CDO for this task. Given the huge number of files it is recommended to first sort them on time (assuming you want to merge them along the time axis). After that, you can use
cdo cat *.nc outfile

Getting log entry "disk online" from system log

When a disk inserted to my cluster, i wanna know that.
So i need to listen /var/adm/messages and when i catch !NEW! "online" line i must write it to a different log file.
When disk goes online I get this kind of log entries:
Dec 8 10:10:46 SMNODE01 genunix: [ID 408114 kern.info] /scsi_vhci/disk#g5000c50095f92a8f (sd69) online
Tail works without -F option. But i need -F option :/
tail messages | grep 408114 | grep '/scsi_vhci/disk#'| egrep -wi --color 'online'
I have 3 uniform words for grep.
1- The id "408114" is unique for online status.
2- /scsi_vhci/disk#
3- online
P.S: Sorry for my english :)
For grep AND use .*:
$ grep 408114.*/scsi_vhci/disk#.*online test
Dec 8 10:10:46 SMNODE01 genunix: [ID 408114 kern.info] /scsi_vhci/disk#g5000c50095f92a8f (sd69) online
Next time don't edit the question completely but ask another question.

How to resume reading a file?

I'm trying to find the best and most efficient way to resume reading a file from a given point.
The given file is being written frequently (this is a log file).
This file is rotated on a daily basis.
In the log file I'm looking for a pattern 'slow transaction'. End of such lines have a number into parentheses. I want to have the sum of the numbers.
Example of log line:
Jun 24 2015 10:00:00 slow transaction (5)
Jun 24 2015 10:00:06 slow transaction (1)
This is easy part that I could do with awk command to get total of 6 with above example.
Now my challenge is that I want to get the values from this file on a regular basis. I've an external system that polls a custom OID using SNMP. When hitting this OID the Linux host runs a couple of basic commands.
I want this SNMP polling event to get the number of events since the last polling only. I don't want to have the total every time, just the total of the newly added lines.
Just to mention that only bash can be used, or basic commands such as awk sed tail etc. No perl or advanced programming language.
I hope my description will be clear enough. Apologizes if this is duplicate. I did some researches before posting but did not find something that precisely correspond to my need.
Thank you for any assistance
In addition to the methods in the comment link, you can also simply use dd and stat to read the logfile size, save it and sleep 300 then check the logfile size again. If the filesize has changed, then skip over the old information with dd and read the new information only.
Note: you can add a test to handle the case where the logfile is deleted and then restarted with 0 size (e.g. if $((newsize < size)) then read all.
Here is a short example with 5 minute intervals:
#!/bin/bash
lfn=${1:-/path/to/logfile}
size=$(stat -c "%s" "$lfn") ## save original log size
while :; do
newsize=$(stat -c "%s" "$lfn") ## get new log size
if ((size != newsize)); then ## if change, use new info
## use dd to skip over existing text to new text
newtext=$(dd if="$lfn" bs="$size" skip=1 2>/dev/null)
## process newtext however you need
printf "\nnewtext:\n\n%s\n" "$newtext"
size=$((newsize)); ## update size to newsize
fi
sleep 300
done

How to get access_log summary by goaccess starting from certain date?

Currently I keep 6 weeks of apache access_log. If I generate a access summary at month end:
cat /var/log/httpd/access_log* | goaccess --output-format=csv
the summary will include some access data from previous month.
How can I skip logs of previous month and summarise from first day of month?
p.s. the data-format is: %d/%b/%Y
You can trade the Useless Use of cat for a useful grep.
grep -n $(date +'[0-3][0-9]/%b/%Y') /var/log/httpd/access_log* |
goaccess --output-format=csv
If the logs are by date, it would be a lot more economical to skip the logs which you know are too old or too new, i.e. modify the wildcard argument so you only match the files you really want (or run something like find -mtime -30 to at least narrow the set to a few files).
(The cat is useless because, if goaccess is at all correctly written, it should be able to handle
goaccess --output-format=csv /var/log/httpd/access_log*
just fine.)

Resources