Comparing filenames using the same case conversion rules as a given filesystem - winapi

I need to compare filenames in order to check whether they are equivalent on a given file system.
For example, on a standard Windows NTFS volume the following filenames are equivalent:
TEST.TXT <--> Test.txt
but the following filenames are not:
HÉLLO.TXT <--> Héllo.txt
Is there a Win32 function that allows checking whether two filenames are equivalent?

Many functions exist; all you need is a case-insensitive Unicode string comparison:
lstrcmpiW, _wcsicmp, RtlEqualUnicodeString, ...

Related

Folder listing with gsutil with condition

I have this: gsutil ls -d gs://mystorage/*123*,
which gives me all files matching the pattern "123".
I wonder if I could do this with a condition like >123 and <127, to grab all files whose names contain 124, 125 or 126.
Besides *, gsutil supports other special wildcard names.
You can use these special wildcards to match the names of your files, but keep in mind that you are working with strings and characters rather than numbers, so the solution is not entirely straightforward. Here is a guide to regular expressions that explains, in a general way, how to work with digits.
For your specific question, you would end up with something like:
gsutil ls -d gs://mystorage/*12[456]*
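gsutil's bracket wildcards should also accept character ranges, so (for the same bucket) the match can be written as a range or widened if the numbers grow:
gsutil ls -d gs://mystorage/*12[4-6]*
gsutil ls -d gs://mystorage/*12[0-9]*
The first form is equivalent to [456]; the second would grab anything containing 120 through 129.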

Sed replace unusual file extension arising from gmv

As a result of using gmv on a large nested directory to flatten it, I have a number of duplicate files separated out and given the extensions "._1_", "._2_", etc. ( .... "._n_" )
eg "a.pdf.\_1\_"
i.e. it is
a(dot)pdf(dot)(back slash)1(back slash)
as opposed to
a(dot)pdf(dot)1
which I want to reduce back to "a.pdf"
I tried something like
sed -i .bak "s|.\_1\_||" *
which is usually reliable and doesn't require escape characters. However it's giving me
"error: illegal byte sequence"
Grateful for help fixing this. This is in the Mac OS X terminal. Ideally I'd like a generic solution to fix the ._*_ forms where the * varies from 1 to 9.
There are two challenges here.
How to deal with the duplicate basenames. The suffixes '1', '2', ... were most likely added to designate different sections of a single file (perhaps different pages of a PDF, etc.). Performing a rename that strips the suffix may cause some important files to disappear.
How to deal with the "error: illegal byte sequence" which indicate that some special characters (unicode) are part of the file name. Usually ASCII characters with value >= \0xc0, which can not be decoded according to the current local. The fact that the file names are escaped (as per OP "a.pdf.\_1\_" may hint at additional characters, not displayed (assuming this was not added by the OP).
The proposed solution is to rename the files, placing the 'sequence' part that makes each file unique BEFORE the extension, allowing the extension to be used to determine the file type.
a.pdf.1 => a.1.pdf
The rename command to perform this task is:
rename 's/(.*)\.pdf\.(_.*_)/$1.$2.pdf/' *.pdf._*_
Adjust the file name list as needed, and use -n to verify before running.
rename -n 's/\._1_//' *.*_1_
works (remove the -n once tested).
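For the generic 1-to-9 case the question asks about, the same idea can be extended with a character class; this is an untested sketch along the same lines, so keep -n for a dry run first and remember the warning above that stripping the suffix can make duplicate names collide:
rename -n 's/\._[1-9]_$//' *._[1-9]_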

Slice keywords from log text files

I have a big log file with lines such as
[2016-06-03T10:03:12] No data: TW.WA2
,
[2016-06-03T11:03:02] wrong overlaps: XW.W12.HHZ.2007.289
and as
[2016-06-03T14:05:26] failed to correct YP.CT02.HHZ.2012.334 because No matching response.
Each line consists of a timestamp, a reason for the logging and a keyword composed of some substrings connected by dots (TW.WA2, XW.W12.HHZ.2007.289 and YP.CT02.HHZ.2012.334 in above examples).
The format of the keywords of a specific type is fixed (the substrings are joined by a fixed number of dots).
The substrings are composed of letters and digits (0-5 characters; not all substrings can be empty, and generally at most one is, e.g. XW.WTA12..2007.289).
I want to
extract the keywords
save the different types of keywords, uniqued, to separate files
So far I have tried grep, but it only does the classification.
grep "wrong overlaps" logfile > wrong_overlaps
grep "failed to correct" logfile > no_resp
grep "No data" logfile > no_data
In no_data, the contents are expected to look like
AW.AA1
TW.WA2
TW.WA3
...
In no_resp, the contents are expected to look like
XP..HHZ.2002.334
YP.CT01.HHZ.2012.330
YP.CT02.HHZ.2012.334
...
However, the simple grep commands above save the full lines. I guess I need a regex to extract the keywords?
Assuming a keyword is defined as letters and digits joined by periods, the following regex will match all keywords:
% grep -oE '\w+(\.\w+)+' data
TW.WA2
XW.W12.HHZ.2007.289
YP.CT02.HHZ.2012.334
-o will print only the matches, and -E enables Extended Regular Expressions.
This will, however, not make it possible to split the matches into multiple files, e.g. creating a file wrong_overlaps that contains only the keywords from lines with wrong overlaps.
You can use -P to enable Perl Compatible Regular Expressions which support lookbehinds:
% grep -oP '(?<=wrong overlaps: )\w+(\.\w+)+' data
XW.W12.HHZ.2007.289
But note that PCRE doesn't support variable-length lookbehinds, so you will need to type out the full preceding pattern, e.g.:
something test string: ABC.DEF
ABC.DEF can be extracted with:
(?<=test string: )\w+(\.\w+)+
But not
(?<=test string)\w+(\.\w+)+

bash cat all files that contains a certain string in file name

In bash, how would you cat all files in a directory that contain a certain string in their filename? For example I have files named:
test001.csv
test002.csv
test003.csv
result001.csv
result002.csv
I want to cat together all the .csv files that contain the string test in the filename, and all the .csv files that contain the string result in the filename.
Just:
cat *test*.csv
cat *result*.csv
For all files with test (or, for the second command, result) in their name.
The shell itself can easily find all files matching a simple wildcard.
cat *test*.csv >testresult
You want to take care so that the output file's name does not match the wildcard. (It's technically harmless, but good practice.)
The shell will expand the wildcard in alphabetical order. Most shells will obey your locale, so the definition of "alphabetical order" may depend on current locale settings.
Here's a very simple way:
cat `find . -name "*test*.csv"`
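Note that command substitution splits on whitespace, so this breaks if any filenames contain spaces; a safer variant that should otherwise behave the same (including recursing into subdirectories) is to let find invoke cat directly:
find . -name "*test*.csv" -exec cat {} +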

Split text file into multiple files

I have a large text file containing 1000 abstracts with an empty line between each abstract. I want to split this file into 1000 text files.
My file looks like
16503654 Three-dimensional structure of neuropeptide k bound to dodecylphosphocholine micelles. Neuropeptide K (NPK), an N-terminally extended form of neurokinin A (NKA), represents the most potent and longest lasting vasodepressor and cardiomodulatory tachykinin reported thus far.
16504520 Computer-aided analysis of the interactions of glutamine synthetase with its inhibitors. Mechanism of inhibition of glutamine synthetase (EC 6.3.1.2; GS) by phosphinothricin and its analogues was studied in some detail using molecular modeling methods.
You can use split and set the number of lines per output file to 2. Each file would then have one text line and one empty line.
split -l 2 file
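If your split is GNU split, you can also choose a prefix and numeric suffixes so the 1000 output names are easier to work with, e.g.:
split -l 2 -d -a 4 file abstract_
which would produce abstract_0000, abstract_0001, and so on.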
Something like this:
awk 'NF{print > $1;close($1);}' file
This will create 1000 files, with each filename being the abstract number. The awk code writes each record to a file whose name is taken from the 1st field ($1). This is done only if the number of fields is greater than 0 (NF).
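For the two sample lines shown in the question, for example, running it should leave one file per abstract ID next to the input:
% awk 'NF{print > $1;close($1);}' file
% ls
16503654  16504520  file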
You could always use the csplit command. This is a file splitter but based on a regex.
Something along the lines of:
csplit -ks -f /tmp/files INPUTFILENAMEGOESHERE '/^$/'
It is untested and may need a little tweaking though.
CSPLIT
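One tweak that is likely needed: with a single pattern csplit splits only once, so with GNU csplit you can add a repeat argument to split at every empty line:
csplit -ks -f /tmp/files INPUTFILENAMEGOESHERE '/^$/' '{*}'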
