Sorting large files on Windows

On a Windows Server 2008 machine, the more and sort command below is being used to sort a large CSV file (20 MB) by the first column, but the command is still running after 20 minutes. What is the best way to sort large CSV files in Windows?
more input.csv +1 | sort > sortedInput.csv

If I had to bet, your file has more than 65535 lines and the more command is waiting for you to press a key (more pauses after every 65535 lines).
Without more information on the characteristics of the .csv file, this can be used as a starting point:
@echo off
setlocal enableextensions disabledelayedexpansion
< input.csv (
set /p header=
setlocal enabledelayedexpansion
echo(!header!
endlocal
findstr "^" | sort
) > output.csv
This will:
open the input file for reading
read the first line (max 1021 characters)
output the first line (delayed expansion is needed)
read the rest of the file (findstr) and pipe the data to the sort command
send everything to the output file
Please note that both set /p and findstr have several limitations that could make this approach fail.
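If the limits of set /p and findstr get in the way, the same steps (keep the header, sort the rest by the first column) can be sketched outside of batch. This is only a sketch, assuming Python is installed on the server; the function name and the UTF-8 encoding are my own choices, not something from the question.

```python
import csv


def sort_csv_by_first_column(src_path, dst_path):
    """Copy the header line unchanged, then write the other rows sorted by column 1."""
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader, None)            # first line stays on top
        rows = [row for row in reader if row]  # skip completely empty lines

    # Case-sensitive string comparison on the first field (an assumption;
    # use key=lambda row: int(row[0]) if the column is numeric).
    rows.sort(key=lambda row: row[0])

    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        if header is not None:
            writer.writerow(header)
        writer.writerows(rows)
```

Calling sort_csv_by_first_column("input.csv", "output.csv") loads the whole file into memory, which should be acceptable for a file of about 20 MB.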

sort input.csv > sorted.csv

Related

batch files to create and delete text to windows hosts file with task scheduler

The goal is to use the Task Scheduler to block Facebook from a designated time, to increase productivity and reduce distractions. I deleted Facebook for seven months. I don't want to use third-party software. Please help me. Also, if you have an easier method or better code, please show it.
Here is what I have.
I have this batch file, blockfacebook.bat, executed at a particular time, which works:
echo 0.0.0.0 www.facebook.com >> c:\windows\system32\drivers\etc\hosts
However, the next batch file, unblockfacebook.bat, completely empties the hosts file:
type c:\windows\system32\drivers\etc\hosts | findstr /v facebook > c:\windows\system32\drivers\etc\hosts
Could anyone tell me what I am doing wrong with unblockfacebook.bat?
Thank you,
AEGIS

@echo off
setlocal enableextensions disabledelayedexpansion
set "file=c:\windows\system32\drivers\etc\hosts"
for /f "tokens=* delims=0123456789" %%a in (
'findstr /n /i /v /c:"facebook" "%file%" ^& type nul ^> "%file%"'
) do (
set "line=%%a"
setlocal enabledelayedexpansion
>>"%file%" echo(!line:~1!
endlocal
)
endlocal
This code uses findstr to filter the input lines and number the output (numbering the lines ensures that empty lines in the input file are also read, processed and written to the output), and then empties the input file. Since for /f has cached the output of findstr in memory, and findstr has finished processing the data, there is no problem in doing so.
Then for starts to process the input (the output of findstr), which is a list of numbered lines. To remove the numbers, the digits are used as delimiters in the for command. As the lines start with delimiters, these are skipped until a non-delimiter character is found: the colon that separates the number from the line content (the colon is not used as a delimiter, to avoid problems with lines that start with a colon, for example some IPv6 addresses).
Now, with the numbers removed from the start, all that needs to be done is to remove the first character and append the lines to the input/output file (which was emptied).
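The underlying idea (read everything first, only then overwrite the file) can also be sketched in a few lines outside of batch. This is a sketch assuming Python is available; remove_matching_lines is a hypothetical helper name, and writing to the real hosts file would still need an elevated prompt.

```python
def remove_matching_lines(path, needle):
    """Rewrite `path` in place without the lines containing `needle` (case-insensitive).

    Returns the number of lines that were removed.
    """
    # Read everything before writing: opening the file for output truncates it,
    # which is what empties the hosts file when the same file is both the
    # input and the redirection target of one command line.
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()

    kept = [line for line in lines if needle.lower() not in line.lower()]

    with open(path, "w", encoding="utf-8") as f:
        f.writelines(kept)
    return len(lines) - len(kept)
```

For example, remove_matching_lines(r"c:\windows\system32\drivers\etc\hosts", "facebook") would drop the blocking entry and leave the other lines, including empty ones, untouched.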

type c:\windows\system32\drivers\etc\hosts | findstr /v facebook > c:\users\aegis\desktop\grr.txt
type c:\users\aegis\desktop\grr.txt > c:\windows\system32\drivers\etc\hosts
That's the solution I figured out after reading this, although more effective or alternative methods are always appreciated.

Split text file into multiple files using windows batch scripting

I need to split one text file into multiple files using a Windows batch script. Could anybody enlighten me?
sample text file:
abc1-10
abc1-11
abc1-12
xyz2-01
xyz2-02
xyz3-01
xyz3-02
In this case, it has to be split into 3 files: the first one contains the abc1-xx lines, the second one contains the xyz2-xx lines, and the xyz3-xx lines go to the last one.

You could use a batch file, but why not just use FINDSTR command?
findstr /R "^abc1-" sample.txt > file1.txt
findstr /R "^xyz2-" sample.txt > file2.txt
findstr /R "^xyz3-" sample.txt > file3.txt

Use the Cygwin command split.
Sample:
Split a file every 500 lines:
split -l 500 [filename.ext]
For more: split --help

This may help - it will split the text into separate files of
abc1.txt
xyz2.txt
xyz3.txt
@echo off
for /f "tokens=1,* delims=-" %%a in ('type "file.txt"') do (
>>"%%a.txt" echo(%%a-%%b
)
pause
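The same grouping by the text before the first "-" can be sketched like this, assuming Python is available. split_by_prefix and its parameters are names made up for the illustration, and the sketch overwrites existing output files instead of appending as the batch version does.

```python
import os


def split_by_prefix(src_path, out_dir, sep="-"):
    """Write each line to <prefix>.txt, where prefix is the text before the first `sep`.

    Returns the sorted list of prefixes that were found.
    """
    handles = {}
    try:
        with open(src_path, encoding="utf-8") as src:
            for line in src:
                line = line.rstrip("\n")
                if not line:
                    continue  # ignore empty lines
                prefix = line.split(sep, 1)[0]
                if prefix not in handles:
                    target = os.path.join(out_dir, prefix + ".txt")
                    handles[prefix] = open(target, "w", encoding="utf-8")
                handles[prefix].write(line + "\n")
    finally:
        # Close every output file, even if reading failed part-way.
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

With the sample data, split_by_prefix("file.txt", ".") would create abc1.txt, xyz2.txt and xyz3.txt in the current directory. It assumes every prefix is a valid file name.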

Need batch file help to pull data from a text file, separated by a bunch of |||

I'm running Windows Vista and I need help writing a batch program to pull some data out of a text file, so that I can generate another text file with the data in a specific format.
The problem is that the fields in the text file are separated by runs of | characters (||||), and I don't know how to remove them or read the data between them.
Here's a sample of the data file:
MMM|^~\&|SSS||||20130813084347||RUR|14864-1W2220300-9|P^|2.3.0|||NE|ER
PID|||2013-1W2220300|-|LASTNAME^FIRSTNAME^||19971101|F|||||(416)222-3888||||||X2861673469 HY
and here's the output file that I need to generate.
currentdate currenttime REC F M 1W2220300 2861673469 HY LASTNAME FIRSTNAME 14864 13-AUG-2013 08:43:47
The "REC" "F" and "M" variables don't change, but the others do. Any help would be greatly appreciated, as my knowledge of batch files is pretty limited.

A solution using sed for Windows:
sed "y/|/ /" file

@ECHO OFF
SETLOCAL
(
FOR /f "delims=" %%a IN (
q18241872.txt
) DO (
SET "var=%%a"
SETLOCAL enabledelayedexpansion
SET var=!var:^|=^| !
ECHO !var!
ENDLOCAL
)
)>newfile.txt
GOTO :EOF
This should produce newfile.txt from your original file which I've arbitrarily named q18241872.txt.
The new file should have each | separated by a space (nothing magical about space, could be any character within reason - # perhaps...) and you would then be able to process further using FOR/F "tokens=...delims=|" ... but since you give no clue as to the actual structure of your source nor the processing required to produce your desired output, that's where I'll have to leave it...
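For the field extraction itself, here is a rough sketch of how the two sample lines could be taken apart, assuming Python is an option. Which position holds which value is inferred from the two sample lines only, so the indexes in the usage note are assumptions, and split_fields and format_timestamp are names invented for the illustration.

```python
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def split_fields(line):
    """Split one record on '|'; empty fields are kept so positions stay stable."""
    return line.rstrip("\r\n").split("|")


def format_timestamp(stamp):
    """Turn 'YYYYMMDDhhmmss' into 'DD-MON-YYYY hh:mm:ss'."""
    year, month, day = stamp[0:4], int(stamp[4:6]), stamp[6:8]
    return "%s-%s-%s %s:%s:%s" % (
        day, MONTHS[month - 1], year, stamp[8:10], stamp[10:12], stamp[12:14]
    )
```

On the first sample line, split_fields(line)[6] is the timestamp and split_fields(line)[9].split("-")[0] is the leading number of the tenth field. On the PID line, field 5 can be split again on "^" to separate the last and first name.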

Using Windows/DOS shell/batch commands, how do I take a file and only keep unique lines?

Say I have a file like:
apple
pear
lemon
lemon
pear
orange
lemon
How do I make it so that I only keep the unique lines, so I get:
apple
pear
lemon
orange
I can either modify the original file or create a new one.
I'm thinking there's a way to scan the original file a line at a time, check whether or not the line exists in the new file, and then append if it doesn't. I'm not dealing with really large files here.

@echo off
setlocal disabledelayedexpansion
set "prev="
for /f "delims=" %%F in ('sort uniqinput.txt') do (
set "curr=%%F"
setlocal enabledelayedexpansion
if "!prev!" neq "!curr!" echo !curr!
endlocal
set "prev=%%F"
)
What it does: it sorts the input first, then goes through it sequentially and outputs a line only if it is different from the previous one. It could have been even simpler if not for the need to handle special characters (that's what those setlocal/endlocal are for).
It just echoes the lines to stdout. If you want to write to a file (assuming you named your batch file myUniq.bat), run myUniq >>output.txt
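Note that sorting first changes the order of the lines, while the expected output in the question keeps the order of first appearance. A minimal sketch that keeps that order, assuming Python is available (unique_lines is a hypothetical name):

```python
def unique_lines(lines):
    """Yield each line the first time it appears, preserving the original order."""
    seen = set()
    for line in lines:
        key = line.rstrip("\r\n")  # compare without the line ending
        if key not in seen:
            seen.add(key)
            yield key
```

It can be fed an open file directly, for example "\n".join(unique_lines(open("uniqinput.txt"))), since iterating over a file yields its lines.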

Run PowerShell from the command prompt.
Assuming the items are in a file called fruits.txt, the following will put the unique lines in uniques.txt:
type fruits.txt | Sort-Object -unique | Out-File uniques.txt

In Windows 10 sort.exe has a hidden flag called /unique that you can use
C:\Users>sort fruits.txt
apple
lemon
lemon
lemon
orange
pear
pear
C:\Users>sort /unique fruits.txt
apple
lemon
orange
pear

There's no easy way to do that from the command line without an additional program.
uniq will do what you want.
Or you can download CoreUtils for Windows to get GNU tools. Then you can just use sort -u to get what you want.
Either one of those should be callable from a batch file.
Personally though, if you need to do a lot of text manipulation like that, I think you'd be better off getting Cygwin. Then you'd have easy access to sort, sed, awk, vim, etc.

The SORT command in Windows 10 does have an undocumented switch to remove duplicate lines.
SORT /UNIQ File.txt /O Fileout.TXT
But for a more bulletproof option with a pure batch file, you could use the following.
@echo off
setlocal disableDelayedExpansion
set "file=MyFileName.txt"
set "sorted=%file%.sorted"
set "deduped=%file%.deduped"
::Define a variable containing a linefeed character
set LF=^


::The 2 blank lines above are critical, do not remove
sort "%file%" >"%sorted%"
>"%deduped%" (
set "prev="
for /f usebackq^ eol^=^%LF%%LF%^ delims^= %%A in ("%sorted%") do (
set "ln=%%A"
setlocal enableDelayedExpansion
if /i "!ln!" neq "!prev!" (
endlocal
(echo %%A)
set "prev=%%A"
) else endlocal
)
)
>nul move /y "%deduped%" "%file%"
del "%sorted%"

I also used Powershell from the command prompt, in the directory in which my text file is located, and then I used the cat command, the sort command, and Get-Unique cmdlet, as mentioned at http://blogs.technet.com/b/heyscriptingguy/archive/2012/01/15/use-powershell-to-choose-unique-objects-from-a-sorted-list.aspx.
It looked like this:
PS C:\Users\username\Documents\VDI> cat .\cde-smb-incxxxxxxxx.txt | sort | Get-Unique > .\cde-smb-incxxxxxxx-sorted.txt

Use GNU sort utility:
sort -u file.txt
If you're on Windows and using Git, then sort and many more useful utilities are already here:
C:\Program Files\Git\usr\bin\
Just add this path to your %PATH% environment variable.

You can use SORT command
eg
SORT test.txt > Sorted.txt

How to find the number of occurrences of a string in file using windows command line?

I have a huge file with e-mail addresses and I would like to count how many of them are in this file. How can I do that using the Windows command line?
I have tried this, but it just prints the matching lines (by the way, all e-mails are contained in one line):
findstr /c:"@" mail.txt

Using what you have, you could pipe the results through find. I've seen something like this used from time to time.
findstr /c:"@" mail.txt | find /c /v "GarbageStringDefNotInYourResults"
So you are counting the lines produced by your findstr command that do not contain the garbage string. It is kind of a hack, but it could work for you. Alternatively, just use find /c on the string you do care about. Lastly, you mentioned one address per line, so in this case the above works, but with multiple addresses per line it breaks.

Why not simply use this (it determines the number of lines containing at least one @ character):
find /C "@" "mail.txt"
Example output:
---------- MAIL.TXT: 96
To avoid the file name in the output, change it to this:
find /C "@" < "mail.txt"
Example output:
96
To capture the resulting number and store it in a variable, use this (change %N to %%N in a batch file):
set "NUM=0"
for /F %N in ('find /C "@" ^< "mail.txt"') do set "NUM=%N"
echo %NUM%

Using grep for Windows
Very simple solution:
grep -o "@" mail.txt | grep -c .
Remember the dot at the end of the line!
Here is a slightly more understandable way:
grep -o "@" mail.txt | grep -c "@"
The first grep selects only the "@" strings and puts each on a new line.
The second grep counts the lines (or the lines with @).
The grep utility can be easily installed from the grep-for-Windows page. It is a very small and safe text filter. grep is one of the most useful Unix/Linux commands, and I use it daily on both Linux and Windows.
The Windows findstr is good, but it does not have the features grep has.
Installing grep on Windows is one of the best decisions you can make if you like the CLI or batch scripts.
Download and Installation
Download latest version from the project page https://sourceforge.net/projects/grep-for-windows/. Direct link to file is https://sourceforge.net/projects/grep-for-windows/files/grep-3.5_win32.zip/download.
Unzip the ZIP archive. A file is inside.
Put the grep.exe file in the C:\Windows directory or in another directory on the system path (listed by the command echo %PATH%).
That is all.
Test if grep is working:
Open command line window (cmd)
Run the command grep --help
Uninstallation
Delete the grep.exe file from folder where you have placed it.

Maybe it's a little late, but the following script worked for me (the source file contained quote characters, which is why I used the 'usebackq' parameter).
The caret sign (^) acts as the escape character in the Windows batch scripting language.
@setlocal enableextensions enabledelayedexpansion
SET TOTAL=0
FOR /F "usebackq tokens=*" %%I IN (file.txt) do (
SET LN=%%I
FOR %%J IN ("!LN!") do (
FOR /F %%K IN ('ECHO %%J ^| FIND /I /C "searchPhrase"') DO (
@SET /A TOTAL=!TOTAL!+%%K
)
)
)
ECHO Number of occurrences is !TOTAL!

I found this on the net. See if it works:
findstr /R /N "^.*certainString.*$" file.txt | find /c "#"

I would install the unix tools on your system (handy in any case :-), then it's really simple - look e.g. here:
Count the number of occurrences of a string using sed?
(Using awk:
awk '$1 ~ /title/ {++c} END {print c}' FS=: myFile.txt
).
You can get the Windows unix tools here:
http://unxutils.sourceforge.net/

OK - way late to the table, but... it seems many respondents missed the original spec that all email addresses occur on 1 line. This means that unless you introduce a CRLF with each occurrence of the @ symbol, your suggestions to use variants of FINDSTR /c will not help.
Among the Unix tools for DOS is the very powerful SED.exe. Google it. It rocks RegEx. Here's a suggestion:
find "@" datafile.txt | find "@" | sed "s/@/@\n/g" | find /n "@" | SED "s/\[\(.*\)\].*/Set \/a NumFound=\1/">CountChars.bat
Explanation: (assuming the file with the data is named "Datafile.txt")
1) The 1st FIND includes 3 lines of header info, which throws off a line-count approach, so pipe the results to a 2nd (identical) find to strip off the unwanted header info.
2) Pipe the above results to SED, which will search for each "@" character and replace it with itself + "\n" (which is a "new line", aka a CRLF), which gets each "@" on its own line in the output stream...
3) When you pipe the above output from SED into the FIND /n command, you'll be adding line numbers to the beginning of each line. Now, all you have to do is isolate the numeric portion of each line and preface it with "SET /a" to convert each line into a batch statement that (increasingly with each line) sets the variable equal to that line's number.
4) isolate each line's numeric part and preface the isolated number per the above via:
| SED "s/\[\(.*\)\].*/Set \/a NumFound=\1/"
In the above snippet, you're piping the previous command's output to SED, which uses this syntax "s/WhatToLookFor/WhatToReplaceItWith/", to do these steps:
a) look for a "[" (which must be "escaped" by prefacing it with "\")
b) begin saving (or "tokenizing") what follows, up to the closing "]"
--> in other words it ignores the brackets but stores the number
--> the ".*" that follows the bracket wildcards whatever follows the "]"
c) the stuff between the \( and the \) is "tokenized", which means it can be referred-to later, in the "WhatToReplaceItWith" section. The first stuff that's tokenized is referred to via "\1" then second as "\2", etc.
So... we're ignoring the [ and the ] and we're saving the number that lies between the brackets and IGNORING all the wild-carded remainder of each line... thus we're replacing the line with the literal string:
Set /a NumFound= + the saved, or "tokenized" number, i.e.
...the first line will read: Set /a NumFound=1
...& the next line reads: Set /a NumFound=2 etc. etc.
Thus, if you have 1,283 email addresses, your results will have 1,283 lines.
The last one executed = the one that matters.
If you use the ">" character to redirect all of the above output to a batch file, i.e.:
> CountChars.bat
...then just call that batch file & you'll have a DOS environment variable named "NumFound" with your answer.
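To see the bracket capture in isolation, here is the same substitution sketched with Python's re module (an assumption for illustration only, since the answer itself uses sed). Python writes the capture group as ( ... ) where sed's basic syntax writes \( ... \); to_set_statement is a made-up name.

```python
import re

# \[ and \] match the literal brackets, (.*) captures what lies between them,
# and the trailing .* swallows the rest of the line.
PATTERN = re.compile(r"\[(.*)\].*")


def to_set_statement(numbered_line):
    """Rewrite '[N]rest of line' as 'Set /a NumFound=N', like the sed substitution."""
    # count=1 mirrors sed's s/// without the g flag: only the first match is replaced.
    return PATTERN.sub(r"Set /a NumFound=\1", numbered_line, count=1)
```

A line such as "[2]bob@example.com," (a hypothetical input in the shape that find /n produces) becomes "Set /a NumFound=2".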

This is how I do it, using an AND condition with FINDSTR (to count number of errors in a log file):
SET COUNT=0
FOR /F "tokens=4*" %%a IN ('TYPE "soapui.log" ^| FINDSTR.exe /I /R^
/C:"Assertion" ^| FINDSTR.exe /I /R /C:"has status VALID"') DO (
REM counts number of lines containing both "Assertion" and "has status VALID"
SET /A COUNT+=1
)
SET /A PASSNUM=%COUNT%
NOTE: This counts "number of lines containing string match" rather than "number of total occurrences in file".
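The difference between the two counts is easy to show in a short sketch, assuming Python is available (count_matches is a hypothetical helper, not part of any of the tools above):

```python
def count_matches(text, needle):
    """Return (number of lines containing needle, total occurrences of needle)."""
    lines_with_match = sum(1 for line in text.splitlines() if needle in line)
    total_occurrences = text.count(needle)  # non-overlapping occurrences
    return lines_with_match, total_occurrences
```

When every address sits on one long line, the first number is 1 regardless of how many addresses there are, while the second number is the one the question about e-mail addresses is after. The text can be read with open("mail.txt").read(), assuming the file fits in memory.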

Use this:
type file.txt | find /i "@" /c
