How to calculate a file count in tcsh? - tcsh

I want to add a new file based on how many similar files already exist, in tcsh. Here is my code.
set aNum = (`ls -ld a* | wc -l` + 1)
echo "aNum is $aNum"
Since I already have a file matching a*, I expect aNum to be set to 2, but it is actually set to "1+1". How do I get the script to calculate the value instead of building a string?
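One way to make tcsh evaluate the expression (a minimal sketch, using the shell's @ arithmetic operator) would be:
set count = `ls -ld a* | wc -l`   # number of existing a* entries
@ aNum = $count + 1               # @ evaluates its right-hand side as arithmetic
echo "aNum is $aNum"
Unlike set, which treats the right-hand side as a plain string, the @ command evaluates it as an arithmetic expression.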

Related

shell script to create multiple files, incrementing from last file upon next execution

I'm trying to create a shell script that will create multiple files (or a batch of files) of a specified amount. When the amount is reached, the script stops. When the script is re-executed, the files pick up from the last file created. So if the script creates files 1-10 on the first run, the next execution should create 11-20, and so on.
#!/bin/bash
NAME=XXXX
valid=true
NUMBER=1
while [ $NUMBER -le 5 ];
do
    touch $NAME$NUMBER
    ((NUMBER++))
    echo $NUMBER + "batch created"
    if [ $NUMBER == 5 ];
    then
        break
    fi
    touch $NAME$NUMBER
    ((NUMBER+5))
    echo "batch complete"
done
Based on my comment above and your description, you can write a script that will create 10 numbered files (by default) each time it is run, starting with the next available number. As mentioned, rather than just using a raw, unpadded number, it's better for general sorting and listing to use zero-padded numbers, e.g. 001, 002, ...
If you just use 1, 2, ... then you end up with odd sorting when you reach each power of 10. Consider the first 12 files numbered 1...12 without padding. A general listing sort would produce:
file1
file11
file12
file2
file3
file4
...
Here 11 and 12 sort before 2. Adding leading zeros with printf -v avoids the problem.
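For example, printf -v stores the padded number directly in a shell variable:
printf -v num "%03d" 7    ## num now contains "007"
echo "file$num"           ## prints file007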
Taking that into account, and allowing the user to change the prefix (first part of the file name) by giving it as an argument, and also change the number of new files to create by passing the count as the 2nd argument, you could do something like:
#!/bin/bash

prefix="${1:-file_}"     ## beginning of filename
number=1                 ## start number to look for
ext="txt"                ## file extension to add
newcount="${2:-10}"      ## count of new files to create

printf -v num "%03d" "$number"      ## create 3-digit start number
fname="$prefix$num.$ext"            ## form first filename

while [ -e "$fname" ]; do           ## while filename exists
    number=$((number + 1))          ## increment number
    printf -v num "%03d" "$number"  ## form 3-digit number
    fname="$prefix$num.$ext"        ## form filename
done

while ((newcount--)); do            ## loop newcount times
    touch "$fname"                  ## create filename
    ((! newcount)) && break         ## newcount 0, break (optional)
    number=$((number + 1))          ## increment number
    printf -v num "%03d" "$number"  ## form 3-digit number
    fname="$prefix$num.$ext"        ## form filename
done
Running the script without arguments will create the first 10 files, file_001.txt - file_010.txt. Run a second time, it would create 10 more files file_011.txt to file_020.txt.
To create a new group of 5 files with the prefix of list_, you would do:
bash scriptname list_ 5
Which would result in the 5 files list_001.txt to list_005.txt. Running again with the same options would create list_006.txt to list_010.txt.
Since the scheme above with 3 digits is limited to 1000 files max (if you include 000), there isn't a big need to get the number from the last file written (bash can count to 1000 quite fast). However, if you used 7 digits, for up to 10 million files, then you would want to parse the last number with ls -1 | tail -n 1 (or version sort and choose the last file). Something like the following would do:
number=$(ls -1 "$prefix"* | tail -n 1 | grep -o '[1-9][0-9]*')
(note: that is ls -(one) not ls -(ell))
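A rough sketch of that version-sort variant (assuming GNU ls and grep, and the prefix/ext variables from the script above):
last=$(ls -v "$prefix"*."$ext" 2>/dev/null | tail -n 1)    ## highest-numbered existing file, if any
if [ -n "$last" ]; then
    num=$(grep -o '[0-9]\+' <<< "$last" | tail -n 1)       ## digits portion of that filename
    number=$(( 10#$num + 1 ))                              ## force base 10 so leading zeros are safe
fi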
Let me know if that is what you are looking for.

How to compare multiple extension-less files in Bash

I'm new to bash shell scripting.
How can I compare the outputs of 8 extension-less files containing only binary values (0 or 1), all of the same length?
To clarify things, this is what I've done so far.
for d in */; do
    find . -name base -execdir sh -c 'cat {} >> out' \;
done
I've found all the files named base located in the sub-folders, and read and concatenated those binary files into an out file.
Now I have 8 out files (8 parent folders) that I need to compare.
I've tried both "diff" and "cmp" - but they both work only with 2 files.
In the end, I need to check whether there is any difference between these 8 binary files and then export the results in hex format: for example, an out file that is all '1' maps to F and one that is all '0' maps to 0, so the final result should look like FFFF 0000 (the first 4 files all '1', the last 4 all '0').
What is the best option to do so? - Hope that I've managed to clarify my case.
Thanks a lot for the help.
Let me assume:
We have 8 (presumably binary) files, say dir1/out.txt, dir2/out.txt, ..., dir8/out.txt.
We want to compare these files and identify which are identical and which are not.
Then how about these steps:
Generate hash values of the files with e.g. sha256sum.
Compare the hash values and divide the files into groups based on them.
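As a quick check along those lines (assuming GNU uniq, whose -w option limits the comparison to the first N characters, i.e. the 64-character hash field), you can count how many files share each hash:
sha256sum dir*/out.txt | sort | uniq -c -w64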
I have created 8 test files; of those, dir1/out.txt, dir2/out.txt and dir4/out.txt are identical, dir3/out.txt and dir7/out.txt are identical, and the others differ.
Then the hash values will look like:
sha256sum dir*/out.txt
298497ad818c3d927498537ed5ab4f9ae663747b6d00ec9a5d0bd9e30a6b714b dir1/out.txt
298497ad818c3d927498537ed5ab4f9ae663747b6d00ec9a5d0bd9e30a6b714b dir2/out.txt
e962879ef251f2117460cf0d5ce714e36a9ab79f2548c48e2121b4e573cf179b dir3/out.txt
298497ad818c3d927498537ed5ab4f9ae663747b6d00ec9a5d0bd9e30a6b714b dir4/out.txt
f45151f5253c62de69c95935f083b5649876fdb661412d4f32065a7b018bf68b dir5/out.txt
bdc26931acfb734b142a8d675f205becf27560dc461f501822de13274fe6fc8a dir6/out.txt
e962879ef251f2117460cf0d5ce714e36a9ab79f2548c48e2121b4e573cf179b dir7/out.txt
11a77c3d96c06974b53d7f40a577e6813739eb5c811b2a86f59038ea90add772 dir8/out.txt
To summarize the result, let me replace the hash values with a group id, assigning the same number to identical files in order of occurrence.
Here's the script:
sha256sum dir*/out.txt | awk '{if (!gid[$1]) gid[$1] = ++n; print $2 " " gid[$1]}'
The output:
dir1/out.txt 1
dir2/out.txt 1
dir3/out.txt 2
dir4/out.txt 1
dir5/out.txt 3
dir6/out.txt 4
dir7/out.txt 2
dir8/out.txt 5
where the second field shows the group id to indicate which files are identical.
Note that the group id does not represent the content of each file (as in "if 2 of the out.txt files are all '1' = F, and if all '0' = 0"), because I have no idea what the files look like. If the OP can provide example files, I could be of more help.
BTW, I'm still in doubt whether the files are binary in the ordinary sense, because the OP mentions that "it's simply a file that contains 0 or 1 in its value when I open it". It sounds to me like the files are composed of ASCII "0"s and "1"s. My script above should work for both binary and text files anyway.
[Update]
According to the OP's information, here's a solution for the specific case:
#!/bin/bash
for f in dir*/out.txt; do
    if [[ $(uniq "$f" | wc -l) = 1 ]]; then
        echo -n "$(head -1 "$f" | tr 1 F)"
    else
        echo -n "-"
    fi
done
echo
It digests the contents of each file into one of: 0 for all 0s, F for all 1s, or - for a mixture (a possible error).
For instance, if dir{1..4}/out.txt are all 0's, dir5/out.txt is a mixture, and dir{6..8}/out.txt are all 1's, then the output will look like:
0000-FFF
I hope it will meet the OP's requirements.
If you are looking for records that are unique across your list of files:
sort $path/$files | uniq -u > /tmp/output.txt   # uniq -u needs sorted input to spot unique lines
grep -f /tmp/output.txt $path/$files

Script to pick random directory in bash

I have a directory full of directories containing exam subjects I would like to work on randomly to simulate the real exam.
They are classified by difficulty level:
0-0, 0-1 .. 1-0, 1-1 .. 2-0, 2-1 ..
I am trying to write a shell script allowing me to pick one subject (directory) randomly based on the parameter I pass when executing the script (0, 1, 2 ..).
I can't quite figure it out; here is my progress so far:
ls | find . -name "1$~" | sort -r | head -n 1
What am I missing here?
There's no need for any external commands (ls, find, sort, head) for this at all:
#!/usr/bin/env bash
set -o nullglob # make globs expand to nothing, not themselves, when no matches found
dirs=( "$1"*/ ) # list directories starting with $1 into an array
# Validate that our glob actually had at least one match
(( ${#dirs[@]} )) || { printf 'No directories start with %q at all\n' "$1" >&2; exit 1; }
idx=$(( RANDOM % ${#dirs[@]} )) # pick a random index into our array
echo "${dirs[$idx]}" # and look up what's at that index

Creating histograms in bash

EDIT
I read the question that this is supposed to be a duplicate of (this one). I don't agree. In that question the aim is to get the frequencies of individual numbers in the column. However if I apply that solution to my problem, I'm still left with my initial problem of grouping the frequencies of the numbers in a particular range into the final histogram. i.e. if that solution tells me that the frequency of 0.45 is 2 and 0.44 is 1 (for my input data), I'm still left with the problem of grouping those two frequencies into a total of 3 for the range 0.4-0.5.
END EDIT
QUESTION-
I have a long column of data with values between 0 and 1.
This will be of the type-
0.34
0.45
0.44
0.12
0.45
0.98
.
.
.
A long column of decimal values with repetitions allowed.
I'm trying to change it into a histogram sort of output such as (for the input shown above)-
0.0-0.1 0
0.1-0.2 1
0.2-0.3 0
0.3-0.4 1
0.4-0.5 3
0.5-0.6 0
0.6-0.7 0
0.7-0.8 0
0.8-0.9 0
0.9-1.0 1
Basically the first column has the lower and upper bounds of each range and the second column has the number of entries in that range.
I wrote it (badly) as-
for i in $(seq 0 0.1 0.9)
do
    awk -v var=$i '{if ($1 > var && $1 < var+0.1 ) print $1}' input | wc -l;
done
Which basically does a wc -l of the entries it finds in each range.
Output formatting is not a part of the problem. If I simply get the frequencies corresponding to the different bins, that will be good enough. Also please note that the bin size should be a variable, as in my proposed solution.
I already read this answer and want to avoid the loop. I'm sure there's a much much faster way in awk that bypasses the for loop. Can you help me out here?
Following the same algorithm as in my previous answer, I wrote a script in awk which is extremely fast.
The script is the following:
#!/usr/bin/awk -f
BEGIN {
    bin_width = 0.1;
}
{
    bin = int(($1 - 0.0001) / bin_width);
    if (bin in hist) {
        hist[bin] += 1
    } else {
        hist[bin] = 1
    }
}
END {
    for (h in hist)
        printf " * > %2.2f -> %i \n", h*bin_width, hist[h]
}
The bin_width is the width of each channel. To use the script just copy it in a file, make it executable (with chmod +x <namefile>) and run it with ./<namefile> <name_of_data_file>.
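For instance, with the sample column from the question saved in a file called input (filename assumed) and the script saved as histogram.awk, a run would print something like the following; note that awk's for..in loop does not guarantee the order of the lines:
./histogram.awk input
 * > 0.10 -> 1
 * > 0.30 -> 1
 * > 0.40 -> 3
 * > 0.90 -> 1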
For this specific problem, I would drop the last digit, then count occurrences of sorted data:
cut -b1-3 input | sort | uniq -c
which gives, on the specified input set:
1 0.1
1 0.3
3 0.4
1 0.9
Output formatting can be done by piping through this awk command:
| awk 'BEGIN{r=0.0}
{while($2>r){printf "%1.1f-%1.1f %3d\n",r,r+0.1,0;r=r+.1}
printf "%1.1f-%1.1f %3d\n",$2,$2+0.1,$1;r=$2+0.1}
END{while(r<0.9){printf "%1.1f-%1.1f %3d\n",r,r+0.1,0;r=r+.1}}'
The only loop in this algorithm is over the lines of the file.
This is an example of how to do what you asked in bash. Bash is probably not the best language for this since it is slow at math; I use bc, but you can use awk if you prefer.
How the algorithm works
Imagine you have many bins: each bin corresponds to an interval and is characterized by a width (CHANNEL_DIM) and a position. Taken together, the bins must cover the entire interval your data fall in. Dividing a value by the bin width gives the position of its bin, so you just add 1 to that bin's count.
#!/bin/bash
# This is the input: you can use $1 and $2 to read input as cmd line argument
FILE='bash_hist_test.dat'
CHANNEL_NUMBER=9  # They are actually 10: 0 is already a channel

# Check the max and the min to define the dimension of the channels:
MAX=`sort -n $FILE | tail -n 1`
MIN=`sort -rn $FILE | tail -n 1`

# Define the channel width
CHANNEL_DIM_LONG=`echo "($MAX-$MIN)/($CHANNEL_NUMBER)" | bc -l`
CHANNEL_DIM=`printf '%2.2f' $CHANNEL_DIM_LONG`
# Probably printf is not the best function in this context because
#+the result could be system dependent.

# Determine the channel for a given number
# Usage: find_channel <number_to_histogram> <width_of_histogram_channel>
function find_channel(){
    NUMBER=$1
    CHANNEL_DIM=$2
    # The channel is found by dividing the value by the channel width and
    #+rounding it.
    RESULT_LONG=`echo $NUMBER/$CHANNEL_DIM | bc -l`
    RESULT=`printf '%.0f' $RESULT_LONG`
    echo $RESULT
}

# Read the file and do the computation
while IFS='' read -r line || [[ -n "$line" ]]; do
    CHANNEL=`find_channel $line $CHANNEL_DIM`
    [[ -z ${HIST[$CHANNEL]} ]] && HIST[$CHANNEL]=0
    let HIST[$CHANNEL]+=1
done < $FILE

counter=0
for i in ${HIST[*]}; do
    CHANNEL_START=`echo "$CHANNEL_DIM * $counter - .04" | bc -l`
    CHANNEL_END=`echo " $CHANNEL_DIM * $counter + .05" | bc`
    printf '%+2.1f : %2.1f => %i\n' $CHANNEL_START $CHANNEL_END $i
    let counter+=1
done
Hope this helps. Comment if you have other questions.

Join in unix when field is numeric in a huge file

So I have two files, File A and File B. File A is huge (>60 GB): it has 16 fields per line (a mix of numeric and string values), is "|"-separated, and has over 600,000,000 lines. Field 3 in this file is the ID; it is a numeric field with varying lengths (e.g., one person's ID can be 1 and someone else's can be 100).
File B just has a bunch of IDs (~1,000,000) and I want to extract all the rows from File A that have an ID that is in File B. I have started doing this on Linux with the following code:
sort -k3,3 -t'|' FileA.txt > FileASorted.txt
sort -k1,1 -t'|' FileB.txt > FileBSorted.txt
join -1 3 -2 1 -t'|' FileASorted.txt FileBSorted.txt > merged.txt
The problem I have is that merged.txt is empty (when I know for a fact there are at least 10 matches)... I have googled this and it seems like the issue is that the join field (the ID) is numeric. Some people propose padding the field with zeros, but 1) I'm not entirely sure how to do this, and 2) it seems very slow/time-inefficient.
Any other ideas out there? Or help on how to add the padding of 0s only to the relevant field.
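For the padding idea specifically, here is a hedged awk sketch (the 12-digit width is an arbitrary assumption; pick one at least as wide as the longest ID):
awk -F'|' 'BEGIN{OFS="|"} {$3 = sprintf("%012d", $3); print}' FileA.txt > FileAPadded.txt   # pad field 3 only
awk '{printf "%012d\n", $1}' FileB.txt > FileBPadded.txt                                    # pad the ID list the same way
After that, the same sort and join commands from the question should line up, since both key fields are now fixed-width strings.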
I would first sort file b using the unique flag (-u):
sort -u file.b > sortedfile.b
Then loop through sortedfile.b and grep file.a for each entry. In zsh I would do:
foreach C (`cat sortedfile.b`)
    grep $C file.a > /dev/null
    if [ $? -eq 0 ]; then
        echo $C >> res.txt
    fi
end
This redirects the output from grep to /dev/null, tests whether there was a match ($? -eq 0), and appends (>>) the matching ID to res.txt; a single > would overwrite the file. I'm a bit rusty at zsh now, so there might be a typo. You may be using bash, which has a different loop syntax.
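For reference, a rough bash equivalent of the loop above (same filenames assumed; like the zsh version, it only records the matching IDs, not the full rows from file.a):
while read -r C; do
    if grep -q "$C" file.a; then    # -q: quiet, exit status only
        echo "$C" >> res.txt
    fi
done < sortedfile.b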
