VBScript to read text files and extract data

I need a script to count the occurrences of note-ref_ and #ref_ in all HTML files.
My folder structure is
D:\Count_Test
which contains many folders and sub-folders. Each sub-folder has a ref.html and a text.html file that contain the note-ref_ and #ref_ text (apart from these, the sub-folders also contain other files such as xml, txt and images, and a css sub-folder).
I need to count, for every single file, how many times note-ref_ and #ref_ appear, and the results need to be captured in a .csv file.
Can anybody help me with a solution to extract these counts into a CSV file?

Suggestions:
Use the Scripting.FileSystemObject (FSO) to walk through the files and subfolders to identify the scope of your actions. Alternatively, you could capture the output of DIR /s /b D:\Count_Test\*.html.
Once you know the list of files you'll need to open, read each of them using the OpenTextFile method of the FSO and loop through each line. When you find what you're looking for, increment some sort of counter - perhaps in an array.
Finally, once you've finished collecting the data, you can output your results by once again calling OpenTextFile, but this time opening your CSV file location and writing the data you've collected in the appropriate format.
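A minimal sketch of that approach (the root folder and the counts.csv output path are assumptions; note that OpenTextFile reads ANSI text, so UTF-8 files with non-ASCII bytes may need ADODB.Stream instead):

Option Explicit

Dim fso, csv
Set fso = CreateObject("Scripting.FileSystemObject")
Set csv = fso.OpenTextFile("D:\Count_Test\counts.csv", 2, True) ' 2 = ForWriting, create if missing
csv.WriteLine "File,note-ref_,#ref_"
WalkFolder fso.GetFolder("D:\Count_Test")
csv.Close

Sub WalkFolder(folder)
    Dim f, sf, text
    For Each f In folder.Files
        If LCase(fso.GetExtensionName(f.Name)) = "html" Then
            text = ReadFile(f.Path)
            csv.WriteLine """" & f.Path & """," & _
                CountOccurrences(text, "note-ref_") & "," & _
                CountOccurrences(text, "#ref_")
        End If
    Next
    For Each sf In folder.SubFolders
        WalkFolder sf   ' recurse into every sub-folder
    Next
End Sub

Function ReadFile(path)
    Dim ts
    Set ts = fso.OpenTextFile(path, 1) ' 1 = ForReading
    If ts.AtEndOfStream Then ReadFile = "" Else ReadFile = ts.ReadAll
    ts.Close
End Function

Function CountOccurrences(text, pattern)
    ' Strip every occurrence and compare lengths to get the count.
    CountOccurrences = (Len(text) - Len(Replace(text, pattern, ""))) \ Len(pattern)
End Function

Save this as, say, count_refs.vbs and run it with cscript; each row of counts.csv then holds the file path and the two counts.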

Related

Extracting .jpeg files from subfolders and putting them in another folder using SSIS

I have a folder with around 400 subfolders, each containing ONE .jpeg file. I need to get all the pictures into 1 new folder using SSIS. Everything is on my local machine (no connecting through different servers or DBs), just subfolders to one folder, so that I can pull out those images without going into each subfolder one by one.
I would create 3 variables, all of type String: CurrentFile, FolderBase, FolderOutput.
FolderBase is where we start searching, i.e. C:\ssisdata.
FolderOutput is where we are going to move any .jpg files that we find rooted under FolderBase.
Use a Foreach File Enumerator (sample: How to import text files with the same name and schema but different directories into database?) configured to process subfolders looking for *.jpg. Map the first element on the Variable tab to our CurrentFile. Map the Enumerator to start in FolderBase. For extra flexibility, create an additional variable to hold the file mask *.jpg.
Run the package. It should quickly zip through all the folders, finding the files but doing nothing with them yet.
Drag and drop a File System Task into the Foreach Enumerator. Make it a Move file operation (or maybe it's Rename file). Use a variable for the source and destination: the Source will be CurrentFile and the destination will be FolderOutput.
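As an aside, if SSIS is not a hard requirement, the same consolidation can be sketched as one PowerShell pipeline (paths are placeholders; note Move-Item will error if two subfolders contain identically named files):

# Recursively find every .jpg under the base folder and move it into one target folder
Get-ChildItem -Path C:\ssisdata -Filter *.jpg -Recurse | Move-Item -Destination C:\ssisdata\output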

How can I find duplicate file names in Windows?

I am organizing a large Windows folder with many subfolders (which have subfolders of their own, etc.), in which files have been saved multiple times in different locations. Can anyone figure out how to identify all files with duplicate names across multiple directories? Some approaches I am thinking about include:
A command, or series of commands, that could be run from the command line (cmd). Perhaps DIR could be a start...
Possibly a tool that comes with Windows
Possibly a way to specify in search to find duplicate filenames
NOT a separate downloadable tool (those could carry unwanted security risks).
I would like to be able to know the directory paths and filename to the duplicate file(s).
Not yet a full solution, but I think I am on the right track; further comments would be appreciated:
From CMD (start, type cmd):
DIR "C:\mypath" /S > filemap.txt
This should generate a recursive list of files within the directories.
TODO: Find a way to put the filenames on the left side of the list (see the batch sketch at the end of this answer)
From outside cmd:
Open filemap.txt
Copy and paste the results into Excel
From Excel:
Sort the data
In the next column, add logic to check whether the current filename equals the previous one
Filter on that column to identify all duplicates
To see where the duplicates are located:
Search filemap.txt for the duplicate filenames identified above and note their directory location.
Note: I plan to update this as I get further along, or if a better solution is found.
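Following up on the TODO above, here is one way to get the filename on the left (a sketch for a .bat file; at an interactive prompt use a single % instead of %%, and C:\mypath is a placeholder):

@echo off
rem List every file under C:\mypath as "name | full path" so that
rem sorting the output puts duplicate filenames on adjacent lines.
(for /r "C:\mypath" %%F in (*) do @echo %%~nxF ^| %%F) > filemap.txt
sort filemap.txt /O filemap_sorted.txt

Scanning filemap_sorted.txt (or pasting it into Excel as above) then shows duplicate names next to each other, each with its directory path on the same line.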

Issue when copying two text files using the copy command

I am having an issue when attempting to copy two files into a separate file using the Windows Copy command. When I open the first text file in Notepad, I see the data in the file formatted as expected.
File #1
0900|Y3RN|19944|12/OCT/2016|2|2|1600|C||||||0|0|||Replace
0900|Y3RN|19944|13/OCT/2016|2|2|2000|C||||||0|0|||Replace
0900|Y3RN|19944|14/OCT/2016|2|2|600|C||||||0|0|||Replace
However, in the second file, which has the same fields, the format of the data in Notepad is different.
File #2
0901|ECQQ|339489|18/OCT/2016|2|2|25|C||||||0|0|||Replace0901|ECQQ|339489|19/OCT/2016|2|2|180|C||||||0|0|||Replace0901|EK1P|339489|04/OCT/2016|2|2|100|C||||||0|0|||Replace
Supposedly the same process is generating the two files on my customer's system of record. If I open each file separately in TextPad, the two files have the same format as File #1 above.
When I use the Copy command, the resulting file looks as below when viewed in Notepad.
0900|Y3RN|19944|28/OCT/2016|2|2|1400|C||||||0|0|||Replace
0900|Y3RN|19944|31/OCT/2016|2|2|1400|C||||||0|0|||Replace
0900|Y6CJ|19944|10/OCT/2016|2|2|200|C||||||0|0|||Replace0901|ECQQ|339489|18/OCT/2016|2|2|25|C||||||0|0|||Replace0901|ECQQ|339489|19/OCT/2016|2|2|180|C||||||0|0|||Replace0901|EK1P|339489|04/OCT/2016|2|2|100|C||||||0|0|||Replace
However, when viewing the resulting file in TextPad, the format is correct.
There has to be something missing in the format of File #2, but since I do not have access or visibility into how these files are being generated, my hands are tied.
Is there a way to convert File #2 so that it is formatted exactly like File #1?
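The symptoms described (lines run together in Notepad but display correctly in TextPad) are the classic sign that File #2 uses Unix-style LF line endings while File #1 uses Windows CRLF; classic Notepad only breaks lines on CRLF, whereas TextPad handles both. One way to normalize File #2 before concatenating, assuming PowerShell is available (file names are placeholders):

rem Rewrite File #2 with CRLF endings: Get-Content splits on LF or CRLF,
rem and Set-Content writes each line back out with CRLF.
powershell -Command "Get-Content 'file2.txt' | Set-Content 'file2_crlf.txt'"
rem Concatenate as before; /b avoids the stray end-of-file character
rem that ASCII-mode copy can append.
copy /b file1.txt+file2_crlf.txt combined.txt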

Process 100K image files with bash

Here is the script to optimize jpg images: https://github.com/kormoc/imgopt/blob/master/imgopt
There is a CMS with image files (not mine).
I assume there is a complicated structure of subdirectories, and the script just recursively finds all image files in a given folder.
The question is: how do I mark already-processed files so that on the next run the script won't touch them and just skips them?
I don't know when the guys might want to add new files and process them. Also, I think renaming is not a good choice either.
I was thinking about a hash table or associative array that would be filled from a txt file at startup. But is it OK to have a 100K-item array in bash? That seems complicated for a script.
Any other ideas about optimization are also welcome.
I think the easiest thing to do is just to output a marker file with a similar name for each processed image file.
For example, after image1.jpg is processed, it would have an empty companion file with a similar name, e.g. .image1.jpg.processed.
Then when your script runs, it just checks, for the current image NAME.EXT, whether a file .NAME.EXT.processed exists. If the file doesn't exist, then you know it needs to be processed. No memory issues and no hashtable needed, granted you will have 100K extra empty files.
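A short sketch of that check, assuming the imgopt script from the linked repo is on the PATH:

#!/usr/bin/env bash
# Optimize every .jpg under the given root, skipping any file that
# already has a ".NAME.EXT.processed" marker sitting next to it.
root="${1:-.}"
find "$root" -type f -iname '*.jpg' -print0 |
while IFS= read -r -d '' img; do
    marker="$(dirname "$img")/.$(basename "$img").processed"
    [ -e "$marker" ] && continue        # done on a previous run
    imgopt "$img" && touch "$marker"    # mark only if optimization succeeded
done

Because the marker lives next to the image, newly added files are picked up automatically on the next run, and there is no in-memory table to maintain.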

Pig load UDF for loading files from several sub-directories

I want to write a custom load UDF in Pig for loading files from a directory structure.
The directory structure is like an email directory. It has a root directory called maildir. Inside this we have the sub-directories of individual mail holders. Inside every mail account holder's directory are several sub-directories like inbox, sent, trash, etc.
eg: maildir/mailholdername1/inbox/1.txt
maildir/mailholdername2/sent/1.txt
I want to read only the inbox files from all mailholdername sub-directories.
I am not able to understand:
what should be passed to the load UDF as a parameter
how the entire directory structure should be parsed so that only the respective inbox files are read.
I want to process one file at a time, perform some data extraction, and load it as one record. Hence, if there are 10 files, I get a relation having 10 records.
Further, I want to do some operations on these inbox files and extract some data.
Because you have a defined folder structure that doesn't have variable depth, I think it's as simple as passing the following pattern as your input path:
A = LOAD 'maildir/*/inbox/1.txt' USING PigStorage('\t') AS (f1, f2, f3);
You probably don't need to create your own UDF for this; the built-in PigStorage loader should be able to handle the files, assuming they are in some delimited format (the above example assumes 3 fields, tab delimited).
If there are multiple txt files in each inbox, use *.txt rather than 1.txt. Finally, if the maildir root directory is not in your user's home directory, you should use the absolute path to the folder, say /data/maildir/*/inbox/*.txt.
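Putting those variants together (the field names and the /data prefix are assumptions for illustration):

-- All inbox files for every mail holder, absolute-path variant
A = LOAD '/data/maildir/*/inbox/*.txt' USING PigStorage('\t') AS (f1, f2, f3);
-- PigStorage yields one tuple per line, so if each inbox file holds a
-- single delimited line, the relation gets exactly one record per file.
DUMP A;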
