How can I split a file with a custom suffix? - shell

I have a file with 100s of filenames:
ax
bx
cx
...
...
...
112z
I want to split this file into files with 10 filenames each.
split -a 2 -d l 10 MASTERLIST
TRIAL 2 works: split -a 2 -d -l 10 MASTERLIST LIST_
But I want the numbering of files from 01 instead of 00. How can I do this? I know I have to use this:
-d, --numeric-suffixes[=FROM] use numeric suffixes instead of alphabetic.
FROM changes the start value (default 0).
But I am not sure how to use the FROM syntax.
Link: http://man7.org/linux/man-pages/man1/split.1.html

split -a 2 --numeric-suffixes=1 -l 10 MASTERLIST LIST_
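A quick way to check the FROM syntax is to run it on a throwaway list. This is a minimal sketch, assuming a GNU split recent enough to accept --numeric-suffixes=FROM; the scratch directory and the generated names are made up for the demo.

```shell
# Scratch directory and a demo MASTERLIST of 25 names (both hypothetical).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 25 | sed 's/^/name/' > MASTERLIST

# The start value must be attached to the long option with '='.
# Writing "-d 1" instead would leave the 1 as a separate operand.
split -a 2 --numeric-suffixes=1 -l 10 MASTERLIST LIST_

# Lists the pieces; numbering should begin at LIST_01 rather than LIST_00.
ls LIST_*
```

With 25 input lines and -l 10 this should leave three pieces, the last one holding the remaining 5 lines.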

Related

GNU Parallel job number leading zeros with multiple expansions

I am using GNU Parallel to create a file of python jobs. I was looking to have the file look like such:
job_num, args
001, -a 1 -b 2
002, -a 1 -b 4
003, -a 2 -b 2
004, -a 2 -b 4
The idea being that each group of args can be configured at file generation while having leading zero job numbers.
Here is one thing I tried:
parallel --rpl '{0#} $_=sprintf("%02d",$job->seq())' echo {0#}, -a {1} -b {2} ::: 1 2 ::: 2 4
This results in:
01 01, -a 1 -b 2
02 02, -a 1 -b 4
03 03, -a 2 -b 2
04 04, -a 2 -b 4
The result that I did not expect was the double set of job numbers (with 3 expansions, it results in 3 job numbers). Any thoughts on how to make what I am trying work?
Tested on versions 20170322 & 20210222.
Related posts (tried the contents of each):
GNU Parallel with sequence number `{#}` and `-n` option
Linux shell script to add leading zeros to file names
Something like this:
parallel --rpl '{0#} 1 $f=1+int((log(total_jobs())/log(10))); $_=sprintf("%0${f}d",seq())' echo '{0#}, -a {1} -b {2}' ::: {1..9} ::: {1..12}
1+int((log(total_jobs())/log(10))) computes the width of total_jobs() in digits.
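The same width idea can be tried without GNU Parallel. The sketch below is plain shell, not the answer's method: it counts the jobs first and uses the string length of that count as the padding width instead of a logarithm. The variable names and the two argument lists are assumptions that mirror the question's "::: 1 2 ::: 2 4".

```shell
# Hypothetical argument lists, mirroring "::: 1 2 ::: 2 4" from the question.
a_values="1 2"
b_values="2 4"

# Count the jobs first so the padding width matches the largest job number.
total=0
for a in $a_values; do
  for b in $b_values; do
    total=$((total + 1))
  done
done
width=${#total}   # number of decimal digits in $total

# Emit one zero-padded job line per argument combination.
job=0
jobs_csv=$(
  for a in $a_values; do
    for b in $b_values; do
      job=$((job + 1))
      printf "%0${width}d, -a %s -b %s\n" "$job" "$a" "$b"
    done
  done
)
echo "$jobs_csv"
```

With only 4 jobs the width is 1, so no zeros are added; with 10 or more jobs the numbers would be padded to two digits, and so on.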

How to compare multiple extension-less files in Bash

I'm new to bash shell scripting.
How can I compare 8 outputs of extension-less files (with only binary values) - same length of values, 0 or 1.
To clarify things, this is what I've done so far.
for d in */; do
find . -name base -execdir sh -c 'cat {} >> out' \;
done
I've found all the files located in the sub-folders, then read and concatenated all the binary files into an out file.
Now I have 8 out files (8 parent folders) that I need to compare with each other.
I've tried both "diff" and "cmp" - but they both work only with 2 files.
In the end, I need to check and verify whether there is a difference between these 8 binary files and eventually export the results and represent them in HEX format - example: if 2 of the out files are all '1' = F , and if all '0' = 0 . Hence, the final result should be for example: FFFF 0000 (the first 4 files are all '1', the last 4 files are all '0').
What is the best option to do so? - Hope that I've managed to clarify my case.
Thanks a lot for the help.
Let me assume:
We have 8 (presumably binary) files, say: dir1/out.txt, dir2/out.txt, ..
dir8/out.txt.
We want to compare among these files and identify which files are identical
and which are not.
Then how about the steps:
To generate hash values of the files with e.g. sha256sum.
To compare the hash values and divide into groups based on the hash values.
I have created 8 test files, of which dir1/out.txt, dir2/out.txt and dir4/out.txt
are identical, dir3/out.txt and dir7/out.txt are identical, and the others
differ.
Then the hash values will look like:
sha256sum dir*/out.txt
298497ad818c3d927498537ed5ab4f9ae663747b6d00ec9a5d0bd9e30a6b714b dir1/out.txt
298497ad818c3d927498537ed5ab4f9ae663747b6d00ec9a5d0bd9e30a6b714b dir2/out.txt
e962879ef251f2117460cf0d5ce714e36a9ab79f2548c48e2121b4e573cf179b dir3/out.txt
298497ad818c3d927498537ed5ab4f9ae663747b6d00ec9a5d0bd9e30a6b714b dir4/out.txt
f45151f5253c62de69c95935f083b5649876fdb661412d4f32065a7b018bf68b dir5/out.txt
bdc26931acfb734b142a8d675f205becf27560dc461f501822de13274fe6fc8a dir6/out.txt
e962879ef251f2117460cf0d5ce714e36a9ab79f2548c48e2121b4e573cf179b dir7/out.txt
11a77c3d96c06974b53d7f40a577e6813739eb5c811b2a86f59038ea90add772 dir8/out.txt
To summarize the result, let me replace the hash values with a group id, giving
identical files the same number, assigned in order of first occurrence.
Here's the script:
sha256sum dir*/out.txt | awk '{if (!gid[$1]) gid[$1] = ++n; print $2 " " gid[$1]}'
The output:
dir1/out.txt 1
dir2/out.txt 1
dir3/out.txt 2
dir4/out.txt 1
dir5/out.txt 3
dir6/out.txt 4
dir7/out.txt 2
dir8/out.txt 5
where the second field shows the group id to indicate which files are identical.
Note that the group id does not represent the content of each file, as in:
if 2 of the out.txt files are all '1' = F , and if all '0' = 0,
because I have no idea what the files look like. If the OP can provide
example files, I could be of more help.
BTW I still doubt that the files are binary in the ordinary sense, because
the OP mentions that "it's simply a file that contains 0 or 1 in its
value when I open it". It sounds to me as if the files are composed of
ASCII "0"s and "1"s. My script above should work for both binary files
and text files anyway.
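Since the OP tried cmp, which only takes two files, another option is to compare every file against one reference file in a loop. This is only a sketch under assumptions: the three demo directories and their contents are invented, and dir1/out.txt is arbitrarily chosen as the reference.

```shell
# Demo data in a scratch directory (directory and file names are hypothetical).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
for i in 1 2 3; do mkdir "dir$i"; done
printf '1\n1\n' > dir1/out.txt
printf '1\n1\n' > dir2/out.txt
printf '0\n0\n' > dir3/out.txt

# Compare every file with the reference; cmp -s prints nothing and
# reports the result only through its exit status.
ref=dir1/out.txt
report=$(
  for f in dir*/out.txt; do
    if cmp -s "$ref" "$f"; then
      echo "$f same"
    else
      echo "$f differs"
    fi
  done
)
echo "$report"
```

Unlike the hash grouping, this only tells which files match the reference; files that differ from it may or may not match each other.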
[Update]
According to the OP's information, here's a solution for the specific case:
#!/bin/bash
for f in dir*/out.txt; do
if [[ $(uniq "$f" | wc -l) = 1 ]]; then
echo -n "$(head -1 "$f" | tr 1 F)"
else
echo -n "-"
fi
done
echo
It digests the contents of each file to either of: 0 for all 0's, F for all 1's or - for the mixture case (possible error).
For instance, if dir{1..4}/out.txt are all 0's, dir5/out.txt is a mixture, and dir{6..8}/out.txt are all 1's, then the output will look like:
0000-FFF
I hope it will meet the OP's requirements.
If you are looking for records that are unique across your list of files (uniq only compares adjacent lines, so sort first):
cat $path/$files | sort | uniq -u > /tmp/output.txt
grep -f /tmp/output.txt $path/$files

Bash script cut at specific ranges

I have a log file with plenty of collected logs. I already made a grep command with a regex that outputs the line numbers of the lines that match it.
This is the grep command I'm using to output the matched lines:
grep -n -E 'START_REGEX|END_REGEX' Example.log | cut -d ':' -f 1 > ranges.txt
The regex is an alternation: it can match either the beginning of a specific log or its end, thus the output is something like:
12
45
128
136
...
The idea is to use this as a source of ranges to make specific cut on the log file from first number to the second and save them on another file.
The ranges are made by couples of the output, according to the example the first range is 12,45 and the second 128,136.
I expect to see in the final file all the text from line 12 to 45 and then from 128 to 136.
The problem I'm facing is that the sed command seems to work with only one range at a time.
sed -E -iTMP "${START_RANGE},${END_RANGE}!d;${END_RANGE}q" "$FILE_NAME"
Is there any way (maybe with awk) to do that in just one "cycle"?
Constraints: I can only use supported bash commands.
You can use an awk statement, too:
awk '(NR>=12 && NR<=45) || (NR>=128 && NR<=136)' file
where NR is a special variable in Awk which keeps track of the line number as it processes the file.
An example,
seq 1 10 > file
cat file
1
2
3
4
5
6
7
8
9
10
awk '(NR>=1 && NR<=3) || (NR>=8 && NR<=10)' file
1
2
3
8
9
10
You can also avoid hard-coding the line numbers by using the -v variable option,
awk -v start1=1 -v end1=3 -v start2=8 -v end2=10 '(NR>=start1 && NR<=end1) || (NR>=start2 && NR<=end2)' file
1
2
3
8
9
10
With sed you can do multiple ranges of lines like so:
sed -n '12,45p;128,136p' file
This would output lines 12-45, then 128-136.
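To avoid typing the ranges by hand, the numbers in ranges.txt can be turned into a sed script. The sketch below assumes ranges.txt holds an even number of lines (start, end, start, end, ...); the file names ranges.sed and extracted.log and the demo data are made up.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 20 > Example.log                 # stand-in log: line N contains N
printf '%s\n' 3 5 9 10 > ranges.txt    # stand-in for the grep | cut output

# Join the numbers two per line ("3,5", "9,10") and append sed's p command.
paste -d, - - < ranges.txt | sed 's/$/p/' > ranges.sed

# One pass over the log, printing only the listed ranges.
sed -n -f ranges.sed Example.log > extracted.log
cat extracted.log
```

Here ranges.sed ends up containing 3,5p and 9,10p, so the log is read once and both ranges land in extracted.log in order.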

How to use "cmp" to compare two binaries and find all the byte offsets where they differ?

I would love some help with a Bash script loop that will show all the differences between two binary files, using just
cmp file1 file2
It only shows the first change. I would like to use cmp because it gives an offset and a line number for each change, but if you think there's a better command I'm open to it :) Thanks.
I think cmp -l file1 file2 might do what you want. From the manpage:
-l --verbose
Output byte numbers and values of all differing bytes.
The output is a table of the offset, the byte value in file1 and the value in file2 for all differing bytes. It looks like this:
4531 66 63
4532 63 65
4533 64 67
4580 72 40
4581 40 55
[...]
So the first difference is at offset 4531 (offsets are decimal and 1-based), where file1's byte value is octal 66 and file2's is octal 63.
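If hexadecimal output is easier to match against a hex dump, the columns of cmp -l can be converted in a small loop. This is a sketch, not part of cmp itself: the two demo files are invented, and the choice of 0-based hex offsets is an assumption about what is convenient.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'abcd' > file1
printf 'abXd' > file2

# cmp -l prints: 1-based decimal offset, octal byte of file1, octal byte of file2.
# cmp exits non-zero when the files differ, which is expected here.
# A leading 0 makes printf read the byte values as octal; offsets become 0-based hex.
hexdiff=$(
  { cmp -l file1 file2 || true; } | while read -r offset byte1 byte2; do
    printf '0x%08x %02x %02x\n' "$((offset - 1))" "0$byte1" "0$byte2"
  done
)
echo "$hexdiff"
```

Each output line then reads: offset, byte in file1, byte in file2, all in hexadecimal.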
Method that works for single byte addition/deletion
diff <(od -An -tx1 -w1 -v file1) \
<(od -An -tx1 -w1 -v file2)
Generate a test case with a single removal of byte 64:
for i in `seq 128`; do printf "%02x" "$i"; done | xxd -r -p > file1
for i in `seq 128`; do if [ "$i" -ne 64 ]; then printf "%02x" $i; fi; done | xxd -r -p > file2
Output:
64d63
< 40
If you also want to see the ASCII version of the character:
bdiff() (
f() (
od -An -tx1c -w1 -v "$1" | paste -d '' - -
)
diff <(f "$1") <(f "$2")
)
bdiff file1 file2
Output:
64d63
< 40 @
Tested on Ubuntu 16.04.
I prefer od over xxd because:
it is POSIX, xxd is not (comes with Vim)
has the -An to remove the address column without awk.
Command explanation:
-An removes the address column. This is important otherwise all lines would differ after a byte addition / removal.
-w1 puts one byte per line, so that diff can consume it. It is crucial to have one byte per line, or else every line after a deletion would become out of phase and differ. Unfortunately, this is not POSIX, but present in GNU.
-tx1 is the representation you want, change to any possible value, as long as you keep 1 byte per line.
-v prevents asterisk repetition abbreviation * which might interfere with the diff
paste -d '' - - joins every two lines. We need it because the hex and ASCII go into separate adjacent lines. Taken from: Concatenating every other line with the next
we use parentheses () to define bdiff instead of {} to limit the scope of the inner function f, see also: How to define a function inside another function in Bash?
See also:
https://superuser.com/questions/125376/how-do-i-compare-binary-files-in-linux
https://unix.stackexchange.com/questions/59849/diff-binary-files-of-different-sizes
The more efficient workaround I've found is to translate binary files to some form of text using od.
Then any flavour of diff works fine.

How to split a file into equal parts, without breaking individual lines? [duplicate]

This question already has answers here:
How can I split a large text file into smaller files with an equal number of lines?
(12 answers)
Closed 5 years ago.
I was wondering if it was possible to split a file into equal parts (edit: = all equal except for the last), without breaking the line? Using the split command in Unix, lines may be broken in half. Is there a way to, say, split up a file in 5 equal parts, but have it still only consist of whole lines (it's no problem if one of the files is a little larger or smaller)? I know I could just calculate the number of lines, but I have to do this for a lot of files in a bash script. Many thanks!
If you mean an equal number of lines, split has an option for this:
split --lines=75
If you need to know what that 75 should really be for N equal parts, it's:
lines_per_part = int((total_lines + N - 1) / N)
where total_lines can be obtained with wc -l.
See the following script for an example:
#!/usr/bin/bash
# Configuration stuff
fspec=qq.c
num_files=6
# Work out lines per file.
total_lines=$(wc -l <${fspec})
((lines_per_file = (total_lines + num_files - 1) / num_files))
# Split the actual file, maintaining lines.
split --lines=${lines_per_file} ${fspec} xyzzy.
# Debug information
echo "Total lines = ${total_lines}"
echo "Lines per file = ${lines_per_file}"
wc -l xyzzy.*
This outputs:
Total lines = 70
Lines per file = 12
12 xyzzy.aa
12 xyzzy.ab
12 xyzzy.ac
12 xyzzy.ad
12 xyzzy.ae
10 xyzzy.af
70 total
More recent versions of split allow you to specify a number of CHUNKS with the -n/--number option. You can therefore use something like:
split --number=l/6 ${fspec} xyzzy.
(that's ell-slash-six, meaning lines, not one-slash-six).
That will give you roughly equal files in terms of size, with no mid-line splits.
I mention that last point because it doesn't give you roughly the same number of lines in each file, more the same number of characters.
So, if you have one 20-character line and 19 1-character lines (twenty lines in total) and split to five files, you most likely won't get four lines in every file.
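That example is easy to try out. The sketch below assumes a GNU split that supports --number=l/N; the file names sample.txt, part. and rejoined.txt are made up. It prints the line count of each part and then checks that the parts still concatenate back to the original.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# One 20-character line followed by 19 one-character lines (20 lines in total).
{
  printf 'aaaaaaaaaaaaaaaaaaaa\n'
  for i in $(seq 19); do printf 'b\n'; done
} > sample.txt

split --number=l/5 sample.txt part.

# Line counts per part; they need not be 4 each.
wc -l part.*

# The parts still join back into the original, so no line was cut in half.
cat part.* > rejoined.txt
cmp sample.txt rejoined.txt && echo "no lines were broken"
```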
The script isn't even necessary, split(1) supports the wanted feature out of the box:
split -l 75 auth.log auth.log.
The above command splits the file into chunks of 75 lines apiece, and writes files of the form: auth.log.aa, auth.log.ab, ...
wc -l on the original file and output gives:
321 auth.log
75 auth.log.aa
75 auth.log.ab
75 auth.log.ac
75 auth.log.ad
21 auth.log.ae
642 total
A simple solution for a simple question:
split -n l/5 your_file.txt
no need for scripting here.
From the man file, CHUNKS may be:
l/N split into N files without splitting lines
Update
Not all Unix distributions include this flag. For example, it will not work with the stock split on OS X. To use it there, you can consider replacing the Mac OS X utilities with GNU core utilities.
split was updated in coreutils release 8.8 (announced 22 Dec 2010) with the --number option to generate a specific number of files. The option --number=l/n generates n files without splitting lines.
coreutils manual
I made a shell script that, given a number of parts as input, splits a file:
#!/bin/sh
parts_total="$2";
input="$1";
parts=$((parts_total))
for i in $(seq 0 $((parts_total-2))); do
lines=$(wc -l < "$input")
#n is lines/parts rounded up: 1.3 to 2, 1.6 to 2, 1 to 1
n=$(awk -v lines="$lines" -v parts="$parts" 'BEGIN {
n = lines/parts;
# int() keeps the comparison numeric; sprintf() would return a string
rounded = int(n);
if(n>rounded){
print rounded + 1;
}else{
print rounded;
}
}');
head -n "$n" "$input" > split${i}
tail -n "$((lines-n))" "$input" > .tmp${i}
input=".tmp${i}"
parts=$((parts-1));
done
mv .tmp$((parts_total-2)) split$((parts_total-1))
rm .tmp*
I used the head and tail commands, storing the remainder in temporary files, to split the file.
#10 means 10 parts
sh mysplitXparts.sh input_file 10
or with awk, where 0.1 is 10% => 10 parts, or 0.334 is 3 parts
awk -v size=$(wc -l < input) -v perc=0.1 '{
nfile = int(NR/(size*perc));
if(nfile >= 1/perc){
nfile--;
}
print > ("split_" nfile)
}' input
var dict = File.ReadLines("test.txt")
.Where(line => !string.IsNullOrWhiteSpace(line))
.Select(line => line.Split(new char[] { '=' }, 2, 0))
.ToDictionary(parts => parts[0], parts => parts[1]);
or
line="to=xxx@gmail.com=yyy@yahoo.co.in";
string[] tokens = line.Split(new char[] { '=' }, 2, 0);
ans:
tokens[0]=to
tokens[1]=xxx@gmail.com=yyy@yahoo.co.in
