How to ETL multiple files using Scriptella?

I have multiple log files (1.csv, 2.csv and 3.csv) generated by a log report.
I want to read those files and parse them concurrently using Scriptella.

Scriptella does not provide parallel job execution out of the box. Instead, you should use a job scheduler provided by the operating system or programming environment (e.g., run multiple ETL files by submitting jobs to an ExecutorService).
Here is a working example to import a single file specified as a system property:
ETL file:
<!DOCTYPE etl SYSTEM "http://scriptella.javaforge.com/dtd/etl.dtd">
<etl>
    <connection id="in" driver="csv" url="$input"/>
    <connection id="out" driver="text"/>
    <query connection-id="in">
        <script connection-id="out">
            Importing: $1, $2
        </script>
    </query>
</etl>
Java code to run files in parallel:
//Imports 3 CSV files in parallel using a fixed thread pool
import java.io.File;
import java.net.MalformedURLException;
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import scriptella.execution.EtlExecutor;
import scriptella.execution.EtlExecutorException;
import scriptella.execution.ExecutionStatistics;

public class ParallelCsvTest {
    public static void main(String[] args) throws EtlExecutorException, MalformedURLException, InterruptedException {
        final ExecutorService service = Executors.newFixedThreadPool(3);
        for (int i = 1; i <= 3; i++) {
            // Pass the file name as a parameter to the ETL file, e.g. input<i>.csv
            final Map<String, ?> map = Collections.singletonMap("input", "input" + i + ".csv");
            EtlExecutor executor = EtlExecutor.newExecutor(new File("parallel.csv.etl.xml").toURI().toURL(), map);
            service.submit((Callable<ExecutionStatistics>) executor);
        }
        service.shutdown();
        service.awaitTermination(10, TimeUnit.SECONDS);
    }
}
To run this example, create 3 CSV files (input1.csv, input2.csv and input3.csv) and put them in the current working directory. Example of a CSV file:
Level, Message
INFO,Process 1 started
INFO,Process 1 stopped

Related

How to create multiple json files from multiple cucumber tests

The below cucumber runner class generates a JSON file. This JSON is then used to generate a cucumber report.
I have since added a new .feature file to my resources.
Both sets of tests in the feature files pass, but the problem is that a second JSON file is not being generated, so my second set of results is not being recorded.
@RunWith(Cucumber.class)
@CucumberOptions(
        plugin = {"progress",
                "html:build/report/cucumber/html",
                "junit:build/report/cucumber/junit/cucumber.xml",
                "json:build/report/cucumber/json/cucumber.json"
        },
        glue = {"com.commercial.tests"},
        features = {"src/test/resources/templates"},
        tags = {"@BR000, @BR002a, @BR002b, @BR003, @BR004, @BR004b, @BR005, @BR006, @BR007, @BR008", "not @wip"}
)
public class QARunner {

    public static void main(String[] args) {
        // TODO Auto-generated method stub
        System.exit(0);
    }
}
Above, I specify that cucumber.json should be created, but how do I specify a second JSON file for the second .feature?
Problem solved: the above code works as-is; the single JSON file contains the results of both feature files (one entry per feature).
So there is no need to add a second JSON file to the annotation at the top of the class.
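For completeness, if a separate JSON file per feature were ever required, one option is a second runner class that points at its own feature file and its own json: plugin path. This is only a sketch: the class name, feature file name and output path below are made up, and depending on your cucumber-jvm version the imports may be io.cucumber.junit.* instead of cucumber.api.*.
import org.junit.runner.RunWith;

import cucumber.api.CucumberOptions;
import cucumber.api.junit.Cucumber;

@RunWith(Cucumber.class)
@CucumberOptions(
        plugin = {"json:build/report/cucumber/json/second-feature.json"}, // separate JSON output
        glue = {"com.commercial.tests"},
        features = {"src/test/resources/templates/second.feature"}        // hypothetical feature file
)
public class SecondFeatureRunner {
}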

Reading and writing multiple files simultaneously using Spring batch

We are developing an application which will read multiple files and write multiple files, i.e. one output file per input file (the name of the output file must be the same as that of the input file).
MultiResourceItemReader can read multiple files, but not simultaneously, which is a performance bottleneck for us. Spring Batch provides multithreading support for this, but then many threads may read the same file and try to write it. Since the output file name must be the same as the input file name, we can't use that option either.
Now I am looking for one more possibility: can I create 'n' threads to read and write 'n' files? I am not sure how to integrate this logic with the Spring Batch framework.
Thanks in advance for any help.
Since MultiResourceItemReader doesn't meet your performance needs, you may take a closer look at parallel processing, which you already mentioned is a desirable option. I don't think many threads will read the same file and try to write it when running multi-threaded, if configured correctly.
Rather than taking the typical chunk-oriented approach, you could create a tasklet-oriented step that is partitioned (multi-threaded). The tasklet class would be the main driver, delegating calls to a reader and a writer.
The general flow would be something like this:
Retrieve the names of all the files that need to be read in/written out (via some service class) and save them to the execution context within an implementation of Partitioner.
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class FilePartitioner implements Partitioner {

    private ServiceClass service;
    private String directory; // the directory you'll be targeting, injected into this class

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        // pseudo-ish code: the injected service returns the files to process for the target directory
        Map<String, Path> filesToProcess = this.service.getFilesToProcess(directory);
        Map<String, ExecutionContext> execCtxs = new HashMap<>();
        for (Entry<String, Path> entry : filesToProcess.entrySet()) {
            ExecutionContext execCtx = new ExecutionContext();
            execCtx.put("file", entry.getValue());
            execCtxs.put(entry.getKey(), execCtx);
        }
        return execCtxs;
    }

    // injected
    public void setServiceClass(ServiceClass service) {
        this.service = service;
    }
}
a. For the .getFilesToProcess() method you just need something that returns all of the files in the designated directory because you need to eventually know what is to be read and the name of the file that is to be written. Obviously there are several ways to go about this, such as...
public Map<String, Path> getFilesToProcess(String directory) {
    Map<String, Path> filesToProcess = new HashMap<String, Path>();
    File directoryFile = new File(directory); // where directory is where you intend to read from
    this.generateFileList(filesToProcess, directoryFile, directory);
    return filesToProcess;
}

private void generateFileList(Map<String, Path> fileList, File node, String directory) {
    // traverse the directory and add files to the file list
    if (node.isFile()) {
        String file = node.getAbsoluteFile().toString().substring(directory.length() + 1, node.toString().length());
        fileList.put(file, node.toPath());
    }
    if (node.isDirectory()) {
        String[] files = node.list();
        for (String filename : files) {
            this.generateFileList(fileList, new File(node, filename), directory);
        }
    }
}
You'll need to create a tasklet, which will pull file names from the execution context and pass them to some injected class that will read in the file and write it out (custom ItemReaders and ItemWriters may be necessary).
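A minimal sketch of such a tasklet, assuming the "file" key written by the partitioner above; FileProcessingService is a hypothetical collaborator standing in for your read/write logic:
import java.nio.file.Path;

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.repeat.RepeatStatus;

public class FileCopyTasklet implements Tasklet {

    // Hypothetical collaborator that reads one input file and writes the matching output file
    interface FileProcessingService {
        void process(Path file) throws Exception;
    }

    private FileProcessingService service; // injected

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        // Each partition's step execution context carries the "file" entry set by the Partitioner
        ExecutionContext stepCtx = chunkContext.getStepContext().getStepExecution().getExecutionContext();
        Path file = (Path) stepCtx.get("file");
        service.process(file);
        return RepeatStatus.FINISHED;
    }

    public void setFileProcessingService(FileProcessingService service) {
        this.service = service;
    }
}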
The rest of the work would be in configuration, which should be fairly straightforward. It is in the configuration of the Partitioner where you can set your grid size, which could even be done dynamically using SpEL if you really intend to create n threads for n files. I would bet a fixed number of threads running across n files would show significant improvement in performance, but you'll be able to determine that for yourself.
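As a rough illustration, a Java-config version of the partitioned step could look like the sketch below. The bean names, the gridSize of 4 and the SimpleAsyncTaskExecutor are assumptions for the example, not requirements; the same wiring can be done in XML with SpEL as mentioned above.
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class FileCopyStepConfig {

    @Bean
    public Step masterStep(StepBuilderFactory steps, Partitioner filePartitioner, Step workerStep) {
        return steps.get("masterStep")
                .partitioner("workerStep", filePartitioner) // one execution context per file
                .step(workerStep)
                .gridSize(4)                                // number of concurrent partitions
                .taskExecutor(new SimpleAsyncTaskExecutor())
                .build();
    }

    @Bean
    public Step workerStep(StepBuilderFactory steps, Tasklet fileCopyTasklet) {
        return steps.get("workerStep")
                .tasklet(fileCopyTasklet)
                .build();
    }
}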
Hope this helps.

Pig: Perform task on completion of UDF

In Hadoop I have a Reducer that looks like this, which transforms data from a prior mapper into a series of files whose type is not InputFormat-compatible.
private LocalDatabase ld;

protected void setup(Context context) {
    ld = new LocalDatabase("localFilePath");
}

protected void reduce(BytesWritable key, Iterable<Text> values, Context context) {
    for (Text value : values) {
        ld.addValue(key, value);
    }
}

protected void cleanup(Context context) {
    saveLocalDatabaseInHDFS(ld);
}
I am rewriting my application in Pig and can't figure out how this would be done in a Pig UDF, as there is no cleanup function or anything else to indicate when the UDF has finished running. How can this be done in Pig?
I would say you'd need to write a StoreFunc UDF, wrapping your own custom OutputFormat - then you'd have the ability to close out in the OutputFormat's RecordWriter.close() method.
This will create a database in HDFS for each reducer, however, so if you want everything in a single file you'd need to run with a single reducer or run a secondary step to merge the databases together.
http://pig.apache.org/docs/r0.10.0/udf.html#load-store-functions
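A rough sketch of what that close-out hook looks like (this is not the poster's code; the in-memory list below just stands in for the LocalDatabase from the question, and the StoreFunc itself would return this OutputFormat from its getOutputFormat() method):
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.pig.data.Tuple;

// Custom OutputFormat whose RecordWriter.close() plays the role of Reducer.cleanup():
// it runs once per task, after the last record has been written.
public class LocalDatabaseOutputFormat extends FileOutputFormat<Object, Tuple> {

    @Override
    public RecordWriter<Object, Tuple> getRecordWriter(TaskAttemptContext context) {
        return new RecordWriter<Object, Tuple>() {
            private final List<Tuple> buffered = new ArrayList<Tuple>();

            @Override
            public void write(Object key, Tuple value) {
                buffered.add(value); // accumulate, like ld.addValue(...) in the original reducer
            }

            @Override
            public void close(TaskAttemptContext ctx) throws IOException {
                // Per-task cleanup hook: this is where something like
                // saveLocalDatabaseInHDFS(...) from the question would run.
            }
        };
    }
}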
If you want something to run at the end of your UDF, use the finish() call. This will be called after all records have been processed by your UDF. It will be called once per mapper or reducer, the same as the cleanup call in your reducer.
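For illustration, a minimal EvalFunc sketch along those lines (the in-memory buffer and the /tmp output path are placeholders for whatever LocalDatabase and saveLocalDatabaseInHDFS do in the question):
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Accumulates records in exec() and flushes them in finish(), which Pig calls
// once per map/reduce task after the last record, much like Reducer.cleanup().
public class BufferingUdf extends EvalFunc<String> {

    private final List<String> buffer = new ArrayList<String>();

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        String value = String.valueOf(input.get(0));
        buffer.add(value); // stand-in for ld.addValue(...) from the question
        return value;      // pass the value through unchanged
    }

    @Override
    public void finish() {
        // Cleanup hook: in the question this is where saveLocalDatabaseInHDFS(ld) would run.
        try (PrintWriter out = new PrintWriter("/tmp/udf-buffer-" + System.nanoTime() + ".txt")) {
            for (String line : buffer) {
                out.println(line);
            }
        } catch (FileNotFoundException e) {
            throw new RuntimeException("Could not flush buffered records", e);
        }
    }
}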

Duplicate the behaviour of a data driven test

Right now, if you have a test that looks like this:
[TestMethod]
[DeploymentItem("DataSource.csv")]
[DataSource(
    Microsoft.VisualStudio.TestTools.DataSource.CSV,
    "DataSource.csv",
    "DataSource#csv",
    DataAccessMethod.Sequential)]
public void TestSomething()
{
    string data = TestContext.DataRow["ColumnHeader"].ToString();
    /*
        do something with the data
    */
}
You'll get as many test runs as you have data values when you execute this test.
What I'd like to do is duplicate this kind of behaviour in code while still having a data source. For instance: let's say that I want to run this test against multiple deployed versions of a web service (this is a functional test, so nothing is being mocked - i.e. it could very well be a CodedUI test against a web site deployed to multiple hosts).
[TestMethod]
[DeploymentItem("DataSource.csv")]
[DataSource(
    Microsoft.VisualStudio.TestTools.DataSource.CSV,
    "DataSource.csv",
    "DataSource#csv",
    DataAccessMethod.Sequential)]
public void TestSomething()
{
    var svc = helper.GetService(/* external file - NOT a datasource */);
    string data = TestContext.DataRow["ColumnHeader"].ToString();
    /*
        do something with the data
    */
}
Now, if I have 2 deployment locations listed in the external file, and 2 values in the datasource for the testmethod, I should get 4 tests.
You might be asking why I don't just add the values to the datasource. The data in the external file will be pulled in via the deployment items in the .testsettings for the test run, because they can and will be defined differently for each person running the tests, and I don't want to force a rebuild of the test code in order to run the tests or explode the number of data files for tests. Each test might/should be able to specify which locations it would like to test against (the types are known at compile time, not the physical locations).
Likewise, creating a test for each deployment location isn't possible because the deployment locations can and will be dynamic in location, and in quantity.
Can anyone point me to some info that might help me solve this problem of mine?
UPDATE! This works for Visual Studio 2010 but does not seem to work on 2012 and 2013.
I had a similar problem where I had a bunch of files I wanted to use as test data in a data-driven test. I solved it by generating a CSV file before executing the data-driven test. The generation occurs in a static method decorated with the ClassInitialize attribute.
I guess you could do something similar and merge your current data source with your "external file", outputting a new CSV data source that your data-driven test uses.
public TestContext TestContext { get; set; }

const string NameColumn = "NAME";
const string BaseResourceName = "MyAssembly.UnitTests.Regression.Source";

[ClassInitialize()]
public static void Initialize(TestContext context)
{
    var path = Path.Combine(context.TestDeploymentDir, "TestCases.csv");
    using (var writer = new StreamWriter(path, false))
    {
        // Write column headers
        writer.WriteLine(NameColumn);

        string[] resourceNames = typeof(RegressionTests).Assembly.GetManifestResourceNames();
        foreach (string resourceName in resourceNames)
        {
            if (resourceName.StartsWith(BaseResourceName))
            {
                writer.WriteLine(resourceName);
            }
        }
    }
}

[TestMethod]
[DataSource("Microsoft.VisualStudio.TestTools.DataSource.CSV", "|DataDirectory|\\TestCases.csv", "TestCases#csv", DataAccessMethod.Random)]
public void RegressionTest()
{
    var resourceName = TestContext.DataRow[NameColumn].ToString();
    // Get testdata from resource and perform test.
}

Where does hadoop mapreduce framework send my System.out.print() statements ? (stdout)

I want to debug a MapReduce script, and without going into much trouble I tried to put some print statements in my program. But I can't seem to find them in any of the logs.
Actually, stdout only shows the System.out.println() output of the non-MapReduce classes.
The System.out.println() output for the map and reduce phases can be seen in the logs. An easy way to access the logs is:
http://localhost:50030/jobtracker.jsp -> click on the completed job -> click on the map or reduce task -> click on the task number -> task logs -> stdout logs.
Hope this helps
Another way is through the terminal:
1) Go into your Hadoop installation directory, then into "logs/userlogs".
2) Open your job_id directory.
3) Check directories with _m_ if you want the mapper output, or _r_ if you're looking for reducers.
Example: in Hadoop 0.20.2:
> ls ~/hadoop-0.20.2/logs/userlogs/attempt_201209031127_0002_m_000000_0/
log.index stderr stdout syslog
The above means:
Hadoop installation: ~/hadoop-0.20.2
job_id: job_201209031127_0002
_m_: map task, "map number": _000000_
4) Open stdout if you used "System.out.println", or stderr if you wrote to "System.err".
P.S. Other Hadoop versions might have a slightly different hierarchy, but they should all be under $HADOOP_INSTALLATION/logs/userlogs.
On a Hadoop cluster with yarn, you can fetch the logs, including stdout, with:
yarn logs -applicationId application_1383601692319_0008
For some reason, I've found this to be more complete than what I see in the web interface. The web interface did not list the output of System.out.println() for me.
To get your stdout and log messages on the console, you can use the Apache Commons Logging framework in your mapper and reducer:
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<.., .., .., ..> {

    public static final Log log = LogFactory.getLog(MyMapper.class);

    public void map(.. key, .. value, Context context) throws IOException, InterruptedException {
        // Goes to the task's stdout file
        System.out.println("Map key " + key);
        // Goes to the task's syslog file
        log.info("Map key " + key);
        if (log.isDebugEnabled()) {
            log.debug("Map key " + key);
        }
        context.write(key, value);
    }
}
After most of the options above did not work for me, I realized that on my single node cluster, I can use this simple method:
static private PrintStream console_log;
static private boolean node_was_initialized = false;

private static void logPrint(String line) {
    if (!node_was_initialized) {
        try {
            console_log = new PrintStream(new FileOutputStream("/tmp/my_mapred_log.txt", true));
        } catch (FileNotFoundException e) {
            return;
        }
        node_was_initialized = true;
    }
    console_log.println(line);
}
Which, for example, can be used like:
public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    logPrint("map input: key-" + key.toString() + ", value-" + value.toString());
    // actual impl of 'map'...
}
After that, the prints can be viewed with: cat /tmp/my_mapred_log.txt.
To get rid of prints from prior Hadoop runs, you can simply use rm /tmp/my_mapred_log.txt before running Hadoop again.
Notes:
The solution by Rajkumar Singh is likely better if you have the time to download and integrate a new library.
This could work for multi-node clusters if you have a way to access "/tmp/my_mapred_log.txt" on each worker node machine.
If for some strange reason you already have a file named "/tmp/my_mapred_log.txt", consider changing the name (just make sure to give an absolute path).
