YCSB load/run on Elasticsearch

I'm trying to run the benchmark software YCSB against Elasticsearch.
The problem I'm having is that after the load, the data seems to get removed during cleanup.
I'm struggling to understand what is supposed to happen.
If I comment out the cleanup, it still fails because it cannot find the index during the "run" phase.
Can someone please explain what is supposed to happen in YCSB? I would expect it to:
1. load phase: load, say, 1,000,000 records
2. run phase: query the records loaded during the "load" phase
Thanks,

Okay, by running the Couchbase binding in YCSB I have discovered that the data shouldn't be removed.
Looking at cleanup() in ElasticSearchClient, I see no reason why the files would be deleted:
@Override
public void cleanup() throws DBException {
    if (!node.isClosed()) {
        client.close();
        node.stop();
        node.close();
    }
}
The init() is as follows; is there any reason this would not persist the data on the filesystem?
public void init() throws DBException {
    // initialize the Elasticsearch client
    Properties props = getProperties();
    this.indexKey = props.getProperty("es.index.key", DEFAULT_INDEX_KEY);
    String clusterName = props.getProperty("cluster.name", DEFAULT_CLUSTER_NAME);
    Boolean newdb = Boolean.parseBoolean(props.getProperty("elasticsearch.newdb", "false"));
    Builder settings = settingsBuilder()
            .put("node.local", "true")
            .put("path.data", System.getProperty("java.io.tmpdir") + "/esdata")
            .put("discovery.zen.ping.multicast.enabled", "false")
            .put("index.mapping._id.indexed", "true")
            .put("index.gateway.type", "none")
            .put("gateway.type", "none")
            .put("index.number_of_shards", "1")
            .put("index.number_of_replicas", "0");
    // if the properties file contains user-defined Elasticsearch properties,
    // add them to the settings (they will overwrite the defaults).
    settings.put(props);
    System.out.println("ElasticSearch starting node = " + settings.get("cluster.name"));
    System.out.println("ElasticSearch node data path = " + settings.get("path.data"));
    node = nodeBuilder().clusterName(clusterName).settings(settings).node();
    node.start();
    client = node.client();
    if (newdb) {
        client.admin().indices().prepareDelete(indexKey).execute().actionGet();
        client.admin().indices().prepareCreate(indexKey).execute().actionGet();
    } else {
        boolean exists = client.admin().indices().exists(Requests.indicesExistsRequest(indexKey)).actionGet().isExists();
        if (!exists) {
            client.admin().indices().prepareCreate(indexKey).execute().actionGet();
        }
    }
}
Thanks,

Okay, what I am finding is as follows (any help from ElasticSearch users would be much appreciated, because I'm obviously doing something wrong).
Even when the load shuts down leaving the data behind, the "run" phase still cannot find the data on startup:
ElasticSearch node data path = C:\Users\Pl_2\AppData\Local\Temp\/esdata
org.elasticsearch.action.NoShardAvailableActionException: [es.ycsb][0] No shard available for [[es.ycsb][usertable][user4283669858964623926]: routing [null]]
    at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.perform(TransportShardSingleOperationAction.java:140)
    at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.start(TransportShardSingleOperationAction.java:125)
    at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:72)
    at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:47)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:61)
    at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:83)

The GitHub README has been updated.
It looks like you need to specify the data location using:
-p path.home=<path to folder to persist data>
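Both the load and the run phase then have to point at the same directory. A sketch of the two invocations, assuming the standard YCSB launcher, the elasticsearch binding and workloada (the data directory is illustrative):

bin/ycsb load elasticsearch -P workloads/workloada -p path.home=/data/es-ycsb
bin/ycsb run elasticsearch -P workloads/workloada -p path.home=/data/es-ycsb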

Related

How to extract and manipulate data within a Nifi processor

I'm trying to write a custom Nifi processor which will take in the contents of the incoming flow file, perform some math operations on it, then write the results into an outgoing flow file. Is there a way to dump the contents of the incoming flow file into a string or something? I've been searching for a while now and it doesn't seem that simple. If anyone could point me toward a good tutorial that deals with doing something like that it would be greatly appreciated.
The Apache NiFi Developer Guide documents the process of creating a custom processor very well. In your specific case, I would start with the Component Lifecycle section and the Enrich/Modify Content pattern. Any other processor which does similar work (like ReplaceText or Base64EncodeContent) would be good examples to learn from; all of the source code is available on GitHub.
Essentially you need to implement the #onTrigger() method in your processor class, read the flowfile content and parse it into your expected format, perform your operations, and then re-populate the resulting flowfile content. Your source code will look something like this:
@Override
public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
    FlowFile flowFile = session.get();
    if (flowFile == null) {
        return;
    }
    final ComponentLog logger = getLogger();
    AtomicBoolean error = new AtomicBoolean();
    AtomicReference<String> result = new AtomicReference<>(null);
    // This uses a lambda function in place of a callback for InputStreamCallback#process()
    session.read(flowFile, in -> {
        long start = System.nanoTime();
        // Read the flowfile content into a String
        // TODO: May need to buffer this if the content is large
        try {
            final String contents = IOUtils.toString(in, StandardCharsets.UTF_8);
            result.set(new MyMathOperationService().performSomeOperation(contents));
            long stop = System.nanoTime();
            if (getLogger().isDebugEnabled()) {
                final long durationNanos = stop - start;
                DecimalFormat df = new DecimalFormat("#.###");
                getLogger().debug("Performed operation in " + durationNanos + " nanoseconds (" + df.format(durationNanos / 1_000_000_000.0) + " seconds).");
            }
        } catch (Exception e) {
            error.set(true);
            getLogger().error(e.getMessage() + " Routing to failure.", e);
        }
    });
    if (error.get()) {
        session.transfer(flowFile, REL_FAILURE);
    } else {
        // Again, a lambda takes the place of StreamCallback#process()
        FlowFile updatedFlowFile = session.write(flowFile, (in, out) -> {
            final String resultString = result.get();
            final byte[] resultBytes = resultString.getBytes(StandardCharsets.UTF_8);
            // TODO: This can use a while loop for performance
            out.write(resultBytes, 0, resultBytes.length);
            out.flush();
        });
        session.transfer(updatedFlowFile, REL_SUCCESS);
    }
}
Daggett is right that the ExecuteScript processor is a good place to start, because it shortens the development lifecycle (no building NARs, deploying, and restarting NiFi to use it); once you have the correct behavior, you can easily copy/paste it into the generated processor skeleton and deploy it once.

Hazelcast Near Cache: Evict if changed on different node

I am using Spring + Hazelcast 3.8.2 and have configured a map like this using the Spring configuration:
<hz:map name="test.*" backup-count="1"
        max-size="0" eviction-percentage="30" read-backup-data="true"
        time-to-live-seconds="900"
        eviction-policy="NONE" merge-policy="com.hazelcast.map.merge.PassThroughMergePolicy">
    <hz:near-cache max-idle-seconds="300"
                   time-to-live-seconds="0"
                   max-size="0" />
</hz:map>
I've got two clients connected (both on the same machine [test env], using different ports).
When I change a value in the map on one client, the other client still has the old value until it gets evicted from the near cache once the idle time expires.
I found a similar issue here: Hazelcast near-cache eviction doesn't work
But I'm unsure if this is really the same issue; at least it is mentioned there that this was a bug in version 3.7, and we are using 3.8.2.
Is this correct behaviour or am I doing something wrong? I know that there is a property invalidate-on-change, but it defaults to true, so I don't expect to have to set it.
I also tried setting read-backup-data to false, but that doesn't help.
Thanks for your support
Christian
I found the solution myself.
The issue is that Hazelcast sends the invalidations in batches by default, so it waits a few seconds before the invalidations are sent out to all other nodes.
You can find more information about this here: http://docs.hazelcast.org/docs/3.8/manual/html-single/index.html#near-cache-invalidation
So I had to set the property hazelcast.map.invalidation.batch.enabled to false, which sends invalidations to all nodes immediately. But as mentioned in the documentation, this should only be used when not too many put/remove/... operations are expected, as it makes the event system very busy.
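For reference, the property can also be set per instance on the Hazelcast Config instead of as a JVM-wide system property; a minimal sketch:

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

Config config = new Config();
// send near-cache invalidations immediately instead of batching them
config.setProperty("hazelcast.map.invalidation.batch.enabled", "false");
HazelcastInstance instance = Hazelcast.newHazelcastInstance(config);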
Nevertheless, even with this property set, it is not guaranteed that every node invalidates its near-cache entries immediately. When reading the value on the other node right after a change, sometimes it is already updated and sometimes it is not.
Here is the JUnit test I built for this:
@Test
public void testWithInvalidationBatchEnabled() throws Exception {
    System.setProperty("hazelcast.map.invalidation.batch.enabled", "true");
    doTest();
}

@Test
public void testWithoutInvalidationBatchEnabled() throws Exception {
    System.setProperty("hazelcast.map.invalidation.batch.enabled", "false");
    doTest();
}

@After
public void shutdownNodes() {
    Hazelcast.shutdownAll();
}
protected void doTest() throws Exception {
    // first config for normal cluster member
    Config c1 = new Config();
    c1.getNetworkConfig().setPort(5709);
    // second config for super client
    Config c2 = new Config();
    c2.getNetworkConfig().setPort(5710);
    // map config is the same for both nodes
    MapConfig testMapCfg = new MapConfig("test");
    NearCacheConfig ncc = new NearCacheConfig();
    ncc.setTimeToLiveSeconds(10);
    testMapCfg.setNearCacheConfig(ncc);
    c1.addMapConfig(testMapCfg);
    c2.addMapConfig(testMapCfg);
    // start instances
    HazelcastInstance h1 = Hazelcast.newHazelcastInstance(c1);
    HazelcastInstance h2 = Hazelcast.newHazelcastInstance(c2);
    IMap<Object, Object> mapH1 = h1.getMap("test");
    IMap<Object, Object> mapH2 = h2.getMap("test");
    // initial filling
    mapH1.put("a", -1);
    assertEquals(mapH1.get("a"), -1);
    assertEquals(mapH2.get("a"), -1);
    int updatedH1 = 0, updatedH2 = 0, runs = 0;
    for (int i = 0; i < 5; i++) {
        mapH1.put("a", i);
        // without this short sleep sometimes the nearcache is updated in time, sometimes not
        Thread.sleep(100);
        runs++;
        if (mapH1.get("a").equals(i)) {
            updatedH1++;
        }
        if (mapH2.get("a").equals(i)) {
            updatedH2++;
        }
    }
    assertEquals(runs, updatedH1);
    assertEquals(runs, updatedH2);
}
testWithInvalidationBatchEnabled finishes successfully only sometimes; testWithoutInvalidationBatchEnabled always finishes successfully.

Error setting linkConfig.connectionString on Sqoop 1.99.4

I followed https://sqoop.apache.org/docs/1.99.4/RESTAPI.html to try out Sqoop2, but I am getting the error "Exception in thread "main" org.apache.sqoop.common.SqoopException: MODEL_011:Input do not exist - Input name: linkConfig.connectionString" on the line linkConfig.getStringInput("linkConfig.connectionString").setValue("jdbc:mysql://localhost/my");
I tested Sqoop2, MySQL, the database, etc. from the terminal and they work fine. Please help. Thanks in advance.
Here is the code I am trying:
import org.apache.sqoop.client.SqoopClient;
import org.apache.sqoop.model.MLink;
import org.apache.sqoop.model.MLinkConfig;
import org.apache.sqoop.validation.Status;

public class Sqoop2 {
    public static void main(String[] args) {
        // Initialization SqoopClient
        String url = "http://<myip>:12000/sqoop/";
        SqoopClient client = new SqoopClient(url);
        // create a placeholder for link
        long connectorId = 1;
        MLink link = client.createLink(connectorId);
        link.setName("Vampire");
        link.setCreationUser("Buffy");
        MLinkConfig linkConfig = link.getConnectorLinkConfig();
        // fill in the link config values
        linkConfig.getStringInput("linkConfig.connectionString").setValue("jdbc:mysql://<myip>/<dbname>");
        linkConfig.getStringInput("linkConfig.jdbcDriver").setValue("com.mysql.jdbc.Driver");
        linkConfig.getStringInput("linkConfig.username").setValue("root");
        linkConfig.getStringInput("linkConfig.password").setValue("root");
        // save the link object that was filled
        Status status = client.saveLink(link);
        if (status.canProceed()) {
            System.out.println("Created Link with Link Id : " + link.getPersistenceId());
        } else {
            System.out.println("Something went wrong creating the link");
        }
    }
}
I faced the same issue. As per the documentation, the generic-jdbc connector has id = 1 and the hdfs connector has id = 2, but after we upgraded to 5.3.2 the ids were swapped.
Don't hard-code the connector ids (as the documentation does). Use client.getConnectors() or the show connector --all shell command to list the existing connectors and find the connector id you need. There is currently an issue logged for this: https://issues.apache.org/jira/browse/SQOOP-1965.
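For example, the connector id can be resolved at runtime instead; a minimal sketch assuming the Sqoop 1.99.4 client API, where the generic JDBC connector's unique name is generic-jdbc-connector:

import org.apache.sqoop.model.MConnector;

// Look up the connector by its unique name instead of hard-coding the id.
long connectorId = -1;
for (MConnector connector : client.getConnectors()) {
    if ("generic-jdbc-connector".equals(connector.getUniqueName())) {
        connectorId = connector.getPersistenceId();
        break;
    }
}
MLink link = client.createLink(connectorId);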
It looks like connector 1 already exists. Can you try with another id?

Setting Time To Live (TTL) from Java - sample requested

EDIT:
This is basically what I want to do, only in Java
Using ElasticSearch, we add documents to an index by passing IndexRequest items to a BulkRequestBuilder.
I would like the documents to be dropped from the index after some time has passed (time to live/TTL).
This can be done either by setting a default for the index, or on a per-document basis. Either approach is fine by me.
The code below is an attempt to do it per document. It does not work; I think that's because TTL is not enabled for the index. Either show me what Java code I need to add to enable TTL so the code below works, or show me different code that enables TTL and sets a default TTL value for the index in Java. I know how to do it from the REST API, but I need to do it from Java code, if at all possible.
logger.debug("Indexing record ({}): {}", id, map);
final IndexRequest indexRequest = new IndexRequest(_indexName, _documentType, id);
final long debug = indexRequest.ttl();
if (_ttl > 0) {
    indexRequest.ttl(_ttl);
    System.out.println("Setting TTL to " + _ttl);
    System.out.println("IndexRequest now has ttl of " + indexRequest.ttl());
}
indexRequest.source(map);
indexRequest.operationThreaded(false);
bulkRequestBuilder.add(indexRequest);
}
// execute and block until done.
BulkResponse response;
try {
    response = bulkRequestBuilder.execute().actionGet();
Later I check in my unit test by polling this method, but the document count never goes down.
public long getDocumentCount() throws Exception {
    Client client = getClient();
    try {
        client.admin().indices().refresh(new RefreshRequest(INDEX_NAME)).actionGet();
        ActionFuture<CountResponse> response = client.count(new CountRequest(INDEX_NAME).types(DOCUMENT_TYPE));
        CountResponse countResponse = response.get();
        return countResponse.getCount();
    } finally {
        client.close();
    }
}
After a LONG day of googling and writing test programs, I came up with a working example of how to use TTL and basic index/object creation from the Java API. Frankly, most of the examples in the docs are trivial; some JavaDoc and end-to-end examples would go a LONG way to help those of us using the non-REST interfaces.
Ah well.
Code here: Adding mapping to a type from Java - how do I do it?
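Broadly, the missing piece was enabling _ttl in the type's mapping before indexing. A rough sketch along those lines (not necessarily the linked code; it assumes a pre-2.0 Elasticsearch Java API where _ttl is still supported, and the index/type names and default value are illustrative):

// Enable _ttl with a default on the type's mapping (pre-2.0 Java API).
client.admin().indices().preparePutMapping("myindex")
        .setType("mytype")
        .setSource(XContentFactory.jsonBuilder()
                .startObject()
                    .startObject("mytype")
                        .startObject("_ttl")
                            .field("enabled", true)
                            .field("default", "60s")
                        .endObject()
                    .endObject()
                .endObject())
        .execute().actionGet();

With such a mapping in place, per-document indexRequest.ttl(...) values are honored; expired documents are purged periodically in the background, so the document count drops only after the purge runs.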

TFS build duration report by agent

I'm trying to build a report to show the relative efficiency of my various build agents and having trouble getting the info I need out of the tool.
What I'd like to have is a simple grid with the following columns:
Build Number
Build Definition
Build Agent
Build Status
Build Start Time
Build Duration
Which would let me do something like chart the duration of successful builds of a given build definition on agent1 against the same build definition on agent2 through agentN.
How would I go about this?
My initial intention was to point you to the TFS OLAP cube and describe how you could retrieve what you were after. Then I realized that the cube does not provide the information on which agent built which build. Then I thought it would be simple to write a small TFS console app that prints the info you're after:
using System;
using Microsoft.TeamFoundation.Build.Client;
using Microsoft.TeamFoundation.Client;

namespace BuildDetails
{
    class Program
    {
        static void Main()
        {
            TfsTeamProjectCollection teamProjectCollection = TfsTeamProjectCollectionFactory.GetTeamProjectCollection(new Uri("http://TFS:8080/tfs/CoLLeCtIoNNaMe"));
            var buildService = (IBuildServer)teamProjectCollection.GetService(typeof(IBuildServer));
            IBuildDefinition buildDefinition = buildService.GetBuildDefinition("TeamProjectName", "BuildDefinitionName");
            IBuildDetail[] buildDetails = buildService.QueryBuilds(buildDefinition);
            foreach (var buildDetail in buildDetails)
            {
                Console.Write(buildDetail.BuildNumber + "\t");
                Console.Write(buildDefinition.Name + "\t");
                Console.Write(buildDetail.BuildAgent.Name + "\t");
                Console.Write(buildDetail.Status + "\t");
                Console.Write(buildDetail.StartTime + "\t");
                Console.WriteLine((buildDetail.FinishTime - buildDetail.StartTime).Minutes);
            }
        }
    }
}
This won't compile, though: the Console.Write(buildDetail.BuildAgent.Name + "\t") line fails, because IBuildDetail does not expose the build agent directly.
Eventually I dove into the IBuildInformationNode[] and got the build agent as follows:
IBuildInformation buildInformation = buildDetail.Information;
IBuildInformationNode[] buildInformationNodes = buildInformation.Nodes;
string agentName;
try
{
    agentName = buildInformationNodes[0].Children.Nodes[3].Fields["ReservedAgentName"];
}
catch
{
    agentName = "Couldn't determine BuildAgent";
}
Console.Write(agentName + "\t");
The try-catch is necessary so you can deal with builds that failed/stopped before agent selection. If you use this latter part as a substitute for the failing Console.Write(buildDetail.BuildAgent.Name + "\t"); you should end up with a console app whose output can be piped into a *.CSV file and then imported into Excel.
The following code should help in getting the Build Agent Name for the given build detail.
private string GetBuildAgentName(IBuildDetail build)
{
    var buildInformationNodes = build.Information.GetNodesByType("AgentScopeActivityTracking", true);
    if (buildInformationNodes != null)
    {
        var node = buildInformationNodes.Find(s => s.Fields.ContainsKey(InformationFields.ReservedAgentName));
        return node != null ? node.Fields[InformationFields.ReservedAgentName] : string.Empty;
    }
    return string.Empty;
}
Make sure that you have refreshed the build information in the build detail object. You can do so by calling the following code on your build detail object before getting the build agents:
string[] refreshAllDetails = { "*" };
build.Refresh(refreshAllDetails, QueryOptions.Agents);
Hope it helps :)
The build agent information isn't always in the same place.
I found it for a build I was looking at in buildInformationNodes[1].Children.Nodes[2].Fields["ReservedAgentName"]. The following seems to work for me (so far).
private static string GetAgentName(IBuildDetail buildDetail)
{
    string agentName = "Unknown";
    bool fAgentFound = false;
    try
    {
        foreach (IBuildInformationNode node in buildDetail.Information.Nodes)
        {
            foreach (IBuildInformationNode childNode in node.Children.Nodes)
            {
                if (childNode.Fields.ContainsKey("ReservedAgentName"))
                {
                    agentName = childNode.Fields["ReservedAgentName"];
                    // stop searching once the agent name has been found
                    fAgentFound = true;
                    break;
                }
            }
            if (fAgentFound) break;
        }
    }
    catch (Exception ex)
    {
        // change to your own routine as needed
        DumpException(ex);
    }
    return agentName;
}
