Unix script is sorting the input - bash

I am having some trouble here with my homework assignment. Maybe you can advise what to read or which commands I could use in order to create the following:
Create a shell script test that will act as follows:
The script will display the following message on the terminal screen:
Enter file names (wild cards OK)
The script will read the list of names.
For each file on the list that is a proper file, display a table giving the ten most frequently used words in the file, sorted with the most frequent first. Include the count.
Repeat steps 1-3 over and over until the user indicates end-of-file. This is done by entering the single character Ctrl-d as a file name.
Here is what I have so far:
#!/bin/bash
echo 'Enter file names (wild cards OK)'
read input_source
if test -f "$input_source"
then

I usually ignore homework questions that don't show some progress and an effort to learn something - but you're so beautifully cheeky that I'll make an exception.
Here is what you want:
while read -ep 'Files?> ' files
do
    for file in $files
    do
        echo "== word counts for $file =="
        tr -cs '[:alnum:]' '\n' < "$file" | sort | uniq -c | sort -nr | head
    done
done
And now, at least try to understand what the above is doing...
PS: voting to close...
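One detail the assignment asks for that the loop above skips is the "proper file" check; a small guard inside the inner loop would handle names that don't match a regular file (just a sketch - the skip message wording is made up):
[ -f "$file" ] || { echo "skipping $file (not a regular file)"; continue; }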

How to find the ten most frequently used words in a file
Assumptions:
The files given have one word per line.
The files are not huge, so efficiency isn't a primary concern.
You can use sort and uniq -c to count how many times each word occurs, sort numerically so the most frequent words come last, use tail to cut off all but the last ten, and a final reverse-numeric sort to put them in descending order.
sort "$afile" | uniq -c | sort -n | tail | sort -rn

Some tips:
You have access to the complete bash manual: it's daunting at first, but it's an invaluable reference -- http://www.gnu.org/software/bash/manual/bashref.html
You can get help about bash builtins at the command line: try help read
the read command can handle printing the prompt with the -p option (see previous tip)
you'll accomplish the last step with a while loop:
while read -p "the prompt" filenames; do
    # ...
done
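Putting those tips together, a minimal sketch of the whole script might look something like this (the prompt text follows the assignment, the variable names are placeholders, and the tr/sort/uniq pipeline is the same idea as above, not the only way to do it):
#!/bin/bash
# Keep prompting until the user presses Ctrl-d (read then returns non-zero and the loop ends)
while read -p 'Enter file names (wild cards OK) ' names
do
    # Let the shell expand any wildcards the user typed
    for f in $names
    do
        # Only handle proper (regular) files
        [ -f "$f" ] || continue
        echo "== ten most frequent words in $f =="
        tr -cs '[:alnum:]' '\n' < "$f" | sort | uniq -c | sort -nr | head
    done
done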


Sed through files without using for loop?

I have a small script which basically generates a menu of all the scripts in my ~/scripts folder and, next to each of them, displays a sentence describing it - that sentence being the commented-out third line within the script. I then plan to pipe this into fzf or dmenu to select one and start editing it or whatever.
1 #!/bin/bash
2
3 # a script to do
So it would look something like this
foo.sh a script to do X
bar.sh a script to do Y
Currently I have it run a for loop over all the files in the scripts folder and then run sed -n 3p on all of them.
for i in $(ls -1 ~/scripts); do
    echo -n "$i"
    sed -n 3p ~/scripts/"$i"
    echo
done | column -t -s '#' | ...
I was wondering if there is a more efficient way of doing this that does not involve a for loop and only uses sed. Any help will be appreciated. Thanks!
Instead of a loop that is parsing ls output + sed, you may try this awk command:
awk 'FNR == 3 {
    f = FILENAME; sub(/^.*\//, "", f); print f, $0; nextfile
}' ~/scripts/* | column -t -s '#' | ...
Yes, there is a more efficient way, but no, it doesn't only use sed. This is probably a needless optimization for your use case, but it may be worthwhile nonetheless.
The inefficiency is that you're using ls to read the directory and then parsing its output. For large directories, that causes a lot of overhead for keeping that list in memory even though you only traverse it once. Also, it's not done correctly; consider filenames with special characters that the shell interprets.
The more efficient way is to use find in combination with its -exec option, which starts a second program with each found file in turn.
BTW: If you didn't rely on line numbers but instead used a tag to mark the description, you could also use grep -r, which avoids an additional process per file altogether.
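As a rough sketch of that find approach (GNU find is assumed for -printf, and every script is assumed to have at least three lines; the '#' delimiter and the column call mirror the loop from the question):
find ~/scripts -maxdepth 1 -type f -printf '%f' -exec sed -n '3{p;q}' {} \; | column -t -s '#'
For each file, -printf '%f' prints the bare filename without a newline, so the third line that sed prints ends up on the same output row, much like the echo -n in the original loop. The q stops sed from reading the rest of each file.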
This might work for you (GNU sed):
sed -sn '1h;3{H;g;s/\n/ /p}' ~/scripts/*
Use the -s option to reset the line number addresses for each file.
Copy line 1 to the hold space.
Append line 3 to the hold space.
Swap the hold space for the pattern space.
Replace the newline with a space and print the result.
All files in the directory ~/scripts will be processed.
N.B. You may wish to replace the space delimiter by a tab or pipe the results to the column command.
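If you want the file's name rather than its first line in the left-hand column, newer GNU sed (4.2 or later) can print the current input file name with the F command; a rough sketch, with paste joining each name/description pair onto one row:
sed -sn '3{F;p}' ~/scripts/* | paste -d' ' - - | column -t -s '#'
Note that F prints the path exactly as it was given on the command line, not just the bare file name.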

How to use grep/awk/sed to print until a certain character?

I am a complete beginner at shell scripting and I am trying to iterate through a set of JSON files and extract a certain field from each. Each JSON file has a "country":"xxx" field. In each JSON file there are 10k occurrences of the field with the same country name, so I need only the first occurrence, and I can do that using "-m 1".
I tried to use grep for this but could not figure out how to extract the whole field, including the country name, from each file at the first occurrence.
for FILE in *.json;
do
    grep -o -a -m 1 -h -r '"country":"' $FILE;
done
I tried adding another pipe with the pattern below, but it did not work:
| egrep -o '^[^"]+'
Actual Output:
"country":"
"country":"
"country":"
Desired Output:
"country:"romania"
"country:"united kingdom"
"country:"tajikistan"
but I need the whole thing. Any help would be great. Thanks
There is one general answer to the question "I only want the first occurrence", and that answer is:
... | head -n 1
This means: whatever you do, take the head (the first lines); the -n switch lets you say how many lines you want (one in this case).
The same can be done for the last occurrence(s), but then you use tail instead of head (it also takes the -n switch).
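Applied to the loop in the question, that might look something like this (just a sketch; the [^"]* part assumes the country value itself contains no escaped quotes):
for FILE in *.json
do
    grep -o -m 1 '"country":"[^"]*"' "$FILE" | head -n 1
done
The head -n 1 still matters because -m 1 only stops grep after the first matching line, while -o can print several matches from that one line.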
After trying many things, I found the pattern I was looking for.
grep -Po '"country":.*?[^\\]",' $FILE | head -n 1;

How to create argument variable in bash script

I am trying to write a script such that I can identify the number of characters in the n-th largest file in a sub-directory.
I was trying to pass n and the name of the sub-directory to the script as arguments like $1 and $2.
Current directory: Greetings
Sub-directories: language_files, others
Sub-directories of language_files: English, German, French
Files: Goodmorning.csv, Goodafternoon.csv, Goodevening.csv ….
I would be at the directory "Greetings"; when I indicate a subdirectory (English, German, French), it should show the n-th largest file in that subdirectory and calculate its number of characters as well.
For instance, if I am trying to figure out the number of characters in the 2nd largest file in English, I did:
langs=$1
n=$2
for langs in language_files/;
Do count=$(find language_files/$1 name "*.csv" | wc -m | head -n -1 | sort -n -r | sed -n $2(p))
Done | echo "The file has $count bytes!"
The result I wanted was:
$ ./script1.sh English 2
The file has 1100 bytes!
The main problem behind all of this is that I don't understand how variables and looping work in a bash script.
No need for looping:
find language_files/"$1" -name "*.csv" | xargs wc -m | sort -nr | sed -n "$2{p;q}"
For byte counting you should use -c, since -m is for character counting (they may give the same result for you).
You don't use the loop variable in the script anyway.
Bash loops are interesting. You are encouraged to learn more about them when you have some time. However, this particular problem might not need a loop. Set lang (you can call it langs if you prefer) and n appropriately, and then try this:
count=$(stat -c'%s %n' language_files/$lang/* | sort -nr | head -n$n | tail -n1 | sed -re 's/^[[:space:]]*([[:digit:]]+).*/\1/')
That should give you the $count you need. Then you can echo it however you like.
EXPLANATION
If you wish to learn how it works:
The stat command outputs various statistics about the named file (or files), in this case %s, the file's size, and %n, the file's name.
The head and tail commands output, respectively, the first and last several lines of their input. Together, they select a specific line.
The sed command extracts just the size from the start of that line. (You can use cut instead, if you prefer.)
If you wish to be cleverer, then you can optimize as @karafka has done.
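To connect this back to the question's positional arguments, a minimal sketch of the whole script might be (the *.csv glob and GNU stat's -c format are assumptions carried over from the answers above; cut replaces the sed step, as suggested):
#!/bin/bash
# Usage, as in the question: ./script1.sh English 2
lang=$1
n=$2
# "size name" per CSV, biggest first; keep the n-th line, then keep only the size
count=$(stat -c'%s %n' language_files/"$lang"/*.csv | sort -nr | head -n "$n" | tail -n 1 | cut -d' ' -f1)
echo "The file has $count bytes!"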

piping output to uniq or sort -u not returning expected result

I have tens of thousands of files that have names with similar, often repeating prefixes. I want to loop through all filenames and get a list of unique prefixes.
AB-61-GA_0001c.txt
AB-61-GA_aseguh.xml
AM-81-BU_0678.mp4
AM-81-BU_ochyu.doc
AM-92-LA_gatyt.csv
I want to end up with output:
AB-61-GA
AM-81-BU
AM-92-LA
For that I've put together the following shell script
#!/bin/bash
for i in *.*
do
    UNIQUEOBJECT=$(echo "$i" | cut -d '_' -f 1 | sort -u)
    echo "$UNIQUEOBJECT"
done
For some reason I end up with the list of prefixes (everything before the underscore) with identical prefixes still repeating. Obviously this is just a lack of understanding of bash scripting on my part but what am I doing wrong?
Thanks
The problem is that your for loop sends one filename at a time into the pipeline, so you sort and uniq a single filename.
You could do something like (syntax may not be quite right as I don't have a Linux box for testing at the moment)
#!/bin/bash
UNIQUEOBJECT=$(for i in *.*
do
echo "$i"
done | cut -d '_' -f 1 | sort -u)
echo "$UNIQUEOBJECT"
You need to generate the whole list before you sort it. Your original was sorting each filename on its own, so there was nothing to de-duplicate.
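A shorter variant, if you don't need the loop at all, prints every filename on its own line and feeds the whole list through the pipeline in one go (printf is used here because it is safer than echo for odd filenames; the *.* glob is kept from the question):
printf '%s\n' *.* | cut -d '_' -f 1 | sort -u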

Get first N chars and sort them

I have a requirement where I need to fetch the first four characters from each line of a file and sort them.
I tried the way below, but it does not sort the characters within each line:
cut -c1-4 simple_file.txt | sort -n
Output using the above:
appl
bana
uoia
Expected output:
alpp
aabn
aiou
sort isn't the right tool for the job in this case, as it is used to sort lines of input, not the characters within each line.
I know you didn't tag the question with perl but here's one way you could do it:
perl -F'' -lane 'print(join "", sort @F[0..3])' file
This uses the -a switch to auto-split each line of input on the delimiter specified by -F (in this case, an empty string, so each character is its own element in the array @F). It then sorts the first 4 characters of the array using the standard string comparison order. The result is joined together on an empty string.
Try defining two helper functions:
explodeword () {
    test -z "$1" && return
    echo ${1:0:1}
    explodeword ${1:1}
}
sortword () {
    echo $(explodeword $1 | sort) | tr -d ' '
}
Then
cut -c1-4 simple_file.txt | while read -r word; do sortword $word; done
will do what you want.
The sort command is used to sort files line by line; it's not designed to sort the contents of a line. It's not impossible to make sort do what you want, but it would be a bit messy and probably inefficient.
I'd probably do this in Python, but since you might not have Python, here's a short GNU awk (gawk) command that does what you want (splitting on an empty separator and asort are GNU extensions).
awk '{split(substr($0,1,4),a,"");n=asort(a);s="";for(i=1;i<=n;i++)s=s a[i];print s}'
Just put the name of the file (or files) that you want to process at the end of the command line.
Here's some data I used to test the command:
this
is a
simple
test file
a
of
apple
banana
cat
uoiea
bye
And here's the output
hist
ais
imps
estt
a
fo
alpp
aabn
act
eiou
bey
Here's an ugly Python one-liner; it would look a bit nicer as a proper script rather than as a Bash command line:
python -c "import sys;print('\n'.join([''.join(sorted(s[:4])) for s in open(sys.argv[1]).read().splitlines()]))"
In contrast to the awk version, this command can only process a single file, and it reads the whole file into RAM to process it, rather than processing it line by line.
