Chapter 20. Performance

Esper has been highly optimized to handle very high throughput streams with very little latency between event receipt and output result posting. It is also possible to use Esper on a soft-real-time or hard-real-time JVM to maximize predictability even further.

This section describes performance best practices and explains how to assess Esper performance by using our provided performance kit.

20.1. Performance Results

For a complete understanding of those results, consult the next sections.

Esper exceeds over 500 000 event/s on a dual CPU 2GHz Intel based hardware,
with engine latency below 3 microseconds average (below 10us with more than 
99% predictability) on a VWAP benchmark with 1000 statements registered in the system 
- this tops at 70 Mbit/s at 85% CPU usage.

Esper also demonstrates linear scalability from 100 000 to 500 000 event/s on this 
hardware, with consistent results accross different statements.

Other tests demonstrate equivalent performance results
(straight through processing, match all, match none, no statement registered,
VWAP with time based window or length based windows).
                
Tests on a laptop demonstrated about 5x time less performance - that is 
between 70 000 event/s and 200 000 event/s - which still gives room for easy 
testing on small configuration.

20.2. Performance Tips

public class SampleClassThreading {

    public static void main(String[] args) throws InterruptedException {

        int numEvents = 1000000;
        int numThreads = 3;

        Configuration config = new Configuration();
        config.getEngineDefaults().getThreading()
          .setListenerDispatchPreserveOrder(false);
        config.getEngineDefaults().getThreading()
          .setInternalTimerEnabled(false);   // remove thread that handles time advancing
        EPServiceProvider engine = EPServiceProviderManager
          .getDefaultProvider(config);
        engine.getEPAdministrator().getConfiguration().addEventType(MyEvent.class);

        engine.getEPAdministrator().createEPL(
          "create context MyContext coalesce by consistent_hash_crc32(id) " +
          "from MyEvent granularity 64 preallocate");
        String epl = "context MyContext select count(*) from MyEvent group by id";
        EPStatement stmt = engine.getEPAdministrator().createEPL(epl);
        stmt.setSubscriber(new MySubscriber());

        Thread[] threads = new Thread[numThreads];
        CountDownLatch latch = new CountDownLatch(numThreads);

        int eventsPerThreads = numEvents / numThreads;
        for (int i = 0; i < numThreads; i++) {
            threads[i] = new Thread(
              new MyRunnable(latch, eventsPerThreads, engine.getEPRuntime()));
        }
        long startTime = System.currentTimeMillis();
        for (int i = 0; i < numThreads; i++) {
            threads[i].start();
        }

        latch.await(10, TimeUnit.MINUTES);
        if (latch.getCount() > 0) {
            throw new RuntimeException("Failed to complete in 10 minute");
        }
        long delta = System.currentTimeMillis() - startTime;
        System.out.println("Took " + delta + " millis");
    }

    public static class MySubscriber {
        public void update(Object[] args) {
        }
    }

    public static class MyRunnable implements Runnable {
        private final CountDownLatch latch;
        private final int numEvents;
        private final EPRuntime runtime;

        public MyRunnable(CountDownLatch latch, int numEvents, EPRuntime runtime) {
            this.latch = latch;
            this.numEvents = numEvents;
            this.runtime = runtime;
        }

        public void run() {
            Random r = new Random();
            for (int i = 0; i < numEvents; i++) {
                runtime.sendEvent(new MyEvent(r.nextInt(512)));
            }
            latch.countDown();
        }
    }

    public static class MyEvent {
        private final int id;

        public MyEvent(int id) {
            this.id = id;
        }

        public int getId() {
            return id;
        }
    }
}

We recommend using Java threads as above, or a blocking queue and thread pool with sendEvent() or alternatively we recommend configuring inbound threading if your application does not already employ threading. Esper provides the configuration option to use engine-level queues and threadpools for inbound, outbound and internal executions. See Section 14.7.1, “Advanced Threading” for more information.

We recommend the outbound threading if your listeners are blocking. For outbound threading also see the section below on tuning and disabling listener delivery guarantees.

If enabling advanced threading options keep in mind that the engine will maintain a queue and thread pool. There is additional overhead associated with entering work units into the queue, maintaining the queue and the hand-off between threads. The Java blocking queues are not necessarily fast on all JVM. It is not necessarily true that your application will perform better with any of the advanced threading options.

We found scalability better on Linux systems and running Java with -server and pinning threads to exclusive CPUs and after making sure CPUs are available on your system.

We recommend looking at LMAX Disruptor, an inter-thread messaging library.

20.2.4. Select the underlying event rather than individual fields

By selecting the underlying event in the select-clause we can reduce load on the engine, since the engine does not need to generate a new output event for each input event.

For example, the following statement returns the underlying event to update listeners:

// Better performance
select * from RFIDEvent

In comparison, the next statement selects individual properties. This statement requires the engine to generate an output event that contains exactly the required properties:

// Less good performance
select assetId, zone, xlocation, ylocation from RFIDEvent

20.2.5. Prefer stream-level filtering over where-clause filtering

Esper stream-level filtering is very well optimized, while filtering via the where-clause post any data windows is not optimized.

The same is true for named windows. If your application is only interested in a subset of named window data and such filters are not correlated to arriving events, place the filters into parenthesis after the named window name.

20.2.5.1. Examples without named windows

Consider the example below, which performs stream-level filtering:

// Better performance : stream-level filtering
select * from MarketData(ticker = 'GOOG')

The example below is the equivalent (same semantics) statement and performs post-data-window filtering without a data window. The engine does not optimize statements that filter in the where-clause for the reason that data window views are generally present.

// Less good performance : post-data-window filtering
select * from Market where ticker = 'GOOG'

Thus this optimization technique applies to statements without any data window.

When a data window is used, the semantics change. Let's look at an example to better understand the difference: In the next statement only GOOG market events enter the length window:

select avg(price) from MarketData(ticker = 'GOOG').win:length(100)

The above statement computes the average price of GOOG market data events for the last 100 GOOG market data events.

Compare the filter position to a filter in the where clause. The following statement is NOT equivalent as all events enter the data window (not just GOOG events):

select avg(price) from Market.win:length(100) where ticker = 'GOOG'

The statement above computes the average price of all market data events for the last 100 market data events, and outputs results only for GOOG.

20.2.5.2. Examples using named windows

The next two example EPL queries put the account number filter criteria directly into parenthesis following the named window name:

// Better performance : stream-level filtering
select * from WithdrawalNamedWindow(accountNumber = '123')

// Better performance : example with subquery
select *, (select * from LoginSucceededWindow(accountNumber = '123'))
from WithdrawalNamedWindow(accountNumber = '123')

20.2.5.3. Common computations in where-clauses

If you have a number of queries performing a given computation on incoming events, consider moving the computation from the where-clause to a plug-in user-defined function that is listed as part of stream-level filter criteria. The engine optimizes evaluation of user-defined functions in filters such that an incoming event can undergo the computation just once even in the presence of N queries.

// Prefer stream-level filtering with a user-defined function
select * from MarketData(vstCompare(*))

// Less preferable when there are N similar queries: 
// Move the computation in the where-clause to the "vstCompare" function.
select * from MarketData where (VST * RT) – (VST / RT) > 1

20.2.6. Reduce the use of arithmetic in expressions

Esper does not yet attempt to pre-evaluate arithmetic expressions that produce constant results.

Therefore, a filter expression as below is optimized:

// Better performance : no arithmetic
select * from MarketData(price>40)

While the engine cannot currently optimize this expression:

// Less good performance : with arithmetic
select * from MarketData(price+10>50)

20.2.7. Remove Unneccessary Constructs

If your statement uses order by to order output events, consider removing order by unless your application does indeed require the events it receives to be ordered.

If your statement specifies group by but does not use aggregation functions, consider removing group by.

If your statement specifies group by but the filter criteria only allows one group, consider removing group by:

// Prefer:
select * from MarketData(symbol = 'GE') having sum(price) > 1000

// Don't use this since the filter specifies a single symbol:
select * from MarketData(symbol = 'GE') group by symbol having sum(price) > 1000

If your statement specifies the grouped data window std:groupwin but the window being grouped retains the same set of events regardless of grouping, remove std:groupwin:

// Prefer:
create window MarketDataWindow.win:keepall() as MarketDataEventType

// Don't use this, since keeping all events 
// or keeping all events per symbol is the same thing:
create window MarketDataWindow.std:groupwin(symbol).win:keepall() as MarketDataEventType

// Don't use this, since keeping the last 1-minute of events 
// or keeping 1-minute of events per symbol is the same thing:
create window MarketDataWindow.std:groupwin(symbol).win:time(1 min) as MarketDataEventType

It is not necessary to specify a data window for each stream.

// Prefer:
select * from MarketDataWindow

// Don't have a data window if just listening to events, prefer the above
select * from MarketDataWindow.std:lastevent()

If your statement specifies unique data window but the filter criteria only allows one unique criteria, consider removing the unique data window:

// Prefer:
select * from MarketDataWindow(symbol = 'GE').std:lastevent()

// Don't have a unique-key data window if your filter specifies a single value
select * from MarketDataWindow(symbol = 'GE').std:unique(symbol)

20.2.8. End Pattern Sub-Expressions

In patterns, the every keyword in conjunction with followed by (->) starts a new sub-expression per match.

For example, the following pattern starts a sub-expression looking for a B event for every A event that arrives.

every A -> B

Determine under what conditions a subexpression should end so the engine can stop looking for a B event. Here are a few generic examples:

every A -> (B and not C)
every A -> B where timer:within(1 sec)

20.2.9. Consider using EventPropertyGetter for fast access to event properties

The EventPropertyGetter interface is useful for obtaining an event property value without property name table lookup given an EventBean instance that is of the same event type that the property getter was obtained from.

When compiling a statement, the EPStatement instance lets us know the EventType via the getEventType() method. From the EventType we can obtain EventPropertyGetter instances for named event properties.

To demonstrate, consider the following simple statement:

select symbol, avg(price) from Market group by symbol

After compiling the statement, obtain the EventType and pass the type to the listener:

EPStatement stmt = epService.getEPAdministrator().createEPL(stmtText);
MyGetterUpdateListener listener = new MyGetterUpdateListener(stmt.getEventType());

The listener can use the type to obtain fast getters for property values of events for the same type:

public class MyGetterUpdateListener implements StatementAwareUpdateListener {
    private final EventPropertyGetter symbolGetter;
    private final EventPropertyGetter avgPriceGetter;

    public MyGetterUpdateListener(EventType eventType) {
        symbolGetter = eventType.getGetter("symbol");
        avgPriceGetter = eventType.getGetter("avg(price)");
    }

Last, the update method can invoke the getters to obtain event property values:

    public void update(EventBean[] eventBeans, EventBean[] oldBeans, EPStatement epStatement, EPServiceProvider epServiceProvider) {
        String symbol = (String) symbolGetter.get(eventBeans[0]);
        long volume = (Long) volumeGetter.get(eventBeans[0]);
        // some more logic here
    }

20.2.10. Consider casting the underlying event

When an application requires the value of most or all event properties, it can often be best to simply select the underlying event via wildcard and cast the received events.

Let's look at the sample statement:

select * from MarketData(symbol regexp 'E[a-z]')

An update listener to the statement may want to cast the received events to the expected underlying event class:

    public void update(EventBean[] eventBeans, EventBean[] eventBeans) {
        MarketData md = (MarketData) eventBeans[0].getUnderlying();
        // some more logic here
    }

20.2.11. Turn off logging and audit

Since Esper 1.10, even if you don't have a log4j configuration file in place, Esper will make sure to minimize execution path logging overhead. For prior versions, and to reduce logging overhead overall, we recommend the "WARN" log level or the "INFO" log level.

Please see the log4j configuration file in "etc/infoonly_log4j.xml" for example log4j settings.

Esper provides the @Audit annotation for statements. For performance testing and production deployment, we recommend removing @Audit.

20.2.12. Disable view sharing

By default, Esper compares streams and views in use with existing statement's streams and views, and then reuses views to efficiently share resources between statements. The benefit is reduced resources usage, however the potential cost is that in multithreaded applications a shared view may mean excessive locking of multiple processing threads.

Consider disabling view sharing for better threading performance if your application overall uses fewer statements and statements have very similar streams, filters and views.

View sharing can be disabled via XML configuration or API, and the next code snippet shows how, using the API:

Configuration config = new Configuration();
config.getEngineDefaults().getViewResources().setShareViews(false);

20.2.13. Tune or disable delivery order guarantees

If your application is not a multithreaded application, or you application is not sensitive to the order of delivery of result events to your application listeners, then consider disabling the delivery order guarantees the engine makes towards ordered delivery of results to listeners:

Configuration config = new Configuration();
config.getEngineDefaults().getThreading().setListenerDispatchPreserveOrder(false);

If your application is not a multithreaded application, or your application uses the insert into clause to make results of one statement available for further consuming statements but does not require ordered delivery of results from producing statements to consuming statements, you may disable delivery order guarantees between statements:

Configuration config = new Configuration();
config.getEngineDefaults().getThreading().setInsertIntoDispatchPreserveOrder(false);

Additional configuration options are available and described in the configuration section that specify timeout values and spin or thread context switching.

Esper logging will log the following informational message when guaranteed delivery order to listeners is enabled and spin lock times exceed the default or configured timeout : Spin wait timeout exceeded in listener dispatch. The respective message for delivery from insert into statements to consuming statements is Spin wait timeout exceeded in insert-into dispatch.

If your application sees messages that spin lock times are exceeded, your application has several options: First, disabling preserve order is an option. Second, ensure your listener does not perform (long-running) blocking operations before returning, for example by performing output event processing in a separate thread. Third, change the timeout value to a larger number to block longer without logging the message.

20.2.14. Use a Subscriber Object to Receive Events

The subscriber object is a technique to receive result data that has performance advantages over the UpdateListener interface. Please refer to Section 14.3.3, “Setting a Subscriber Object”.

20.2.15. Consider Data Flows

Data flows offer a high-performance means to execute EPL select statements and use other built-in data flow operators. The data flow Emitter operator allows sending underlying event objects directly into a data flow. Thereby the engine does not need to wrap each underlying event into a EventBean instance and the engine does not need to match events to statements. Instead, the underling event directly applies to only that data flow instance that your application submits the event to, and no other continuous query statements or data flows see the same event.

Data flows are described in Chapter 13, EPL Reference: Data Flow.

20.2.16. High-Arrival-Rate Streams and Single Statements

A context partition is associated with certain context partition state that consists of current aggregation values, partial pattern matches, data windows or other view state depending on whether your statement uses such constructs. When an engine receives events it updates context partition state under locking such that context partition state remains consistent under concurrent multi-threaded access.

For high-volume streams, the locking required to protected context partition state may slow down or introduce blocking for very high arrival rates of events that apply to the very same context partition and its state.

Your first choice should be to utilize a context that allows for multiple context partitions, such as the hash segmented context. The hash segmented context usually performs better compared to the keyed segmented context since in the keyed segmented context the engine must check whether a partition exists or must be created for a given key.

Your second choice is to split the statement into multiple statements that each perform part of the intended function or that each look for a certain subset of the high-arrival-rate stream. There is very little cost in terms of memory or CPU resources per statement, the engine can handle larger number of statements usually as efficiently as single statements.

For example, consider the following statement:

// less effective in a highly threaded environment 
select venue, ccyPair, side, sum(qty)
from CumulativePrice
where side='O'
group by venue, ccyPair, side

The engine protects state of each context partition by a separate lock for each context partition, as discussed in the API section. In highly threaded applications threads may block on a specific context partition. You would therefore want to use multiple context partitions.

Consider creating a keyed segmented context, for example:

create context MyContext partition by venue, ccyPair, side from CumulativePrice(side='O')

The keyed segmented context instructs the engine to employ a context partition per venue, ccyPair, side key combination. As locking is on the level of context partition, the locks taken by the engine are very fine grained allowing for highly concurrent processing.

The new statement that refers to the context as created above is:

context MyContext select venue, ccyPair, side, sum(qty) from CumulativePrice

For testing purposes or if your application controls concurrency, you may disable context partition locking, see Section 15.4.23.3, “Disable Locking”.

20.2.17. Subqueries versus Joins And Where-clause And Data Windows

When joining streams the engine builds a product of the joined data windows based on the where clause. It analyzes the where clause at time of statement compilation and builds the appropriate indexes and query strategy. Avoid using expressions in the join where clause that require evaluation, such as user-defined functions or arithmatic expressions.

When joining streams and not providing a where clause, consider using the std:unique data window or std:lastevent data window to join only the last event or the last event per unique key(s) of each stream.

The sample query below can produce up to 5,000 rows when both data windows are filled and an event arrives for either stream:

// Avoid joins between streams with data windows without where-clause
select * from StreamA.win:length(100), StreamB.win:length(50)

Consider using a subquery, consider using separate statements with insert-into and consider providing a where clause to limit the product of rows.

Below examples show different approaches, that are not semantically equivalent, assuming that an MyEvent is defined with the properties symbol and value:

// Replace the following statement as it may not perform well
select a.symbol, avg(a.value), avg(b.value) 
from MyEvent.win:length(100) a, MyEvent.win:length(50) b

// Join with where-clause
select a.symbol, avg(a.value), avg(b.value) 
from MyEvent.win:length(100) a, MyEvent.win:length(50) b 
where a.symbol = b.symbol

// Unidirectional join with where-clause
select a.symbol, avg(b.value) 
from MyEvent unidirectional, MyEvent.win:length(50) b 
where a.symbol = b.symbol

// Subquery
select 
  (select avg(value) from MyEvent.win:length(100)) as avgA, 
  (select avg(value) from MyEvent.win:length(50)) as avgB,
  a.symbol
from MyEvent

// Since streams cost almost nothing, use insert-into to populate and a unidirectional join 
insert into StreamAvgA select symbol, avg(value) as avgA from MyEvent.win:length(100)
insert into StreamAvgB select symbol, avg(value) as avgB from MyEvent.win:length(50)
select a.symbol, avgA, avgB from StreamAvgA unidirectional, StreamAvgB.std:unique(symbol) b
where a.symbol = b.symbol

A join is multidirectionally evaluated: When an event of any of the streams participating in the join arrive, the join gets evaluated, unless using the unidirectional keyword. Consider using a subquery instead when evaluation only needs to take place when a certain event arrives:

// Rewrite this join since we don't need to join when a LoginSucceededWindow arrives
// Also rewrite because the account number always is the value 123.
select * from LoginSucceededWindow as l, WithdrawalWindow as w
where w.accountNumber = '123' and w.accountNumber = l.accountNumber

// Rewritten as a subquery, 
select *, (select * from LoginSucceededWindow where accountNumber=’123’) 
from WithdrawalWindow(accountNumber=’123’) as w

20.2.18. Patterns and Pattern Sub-Expression Instances

The every and repeat operators in patterns control the number of sub-expressions that are active. Each sub-expression can consume memory as it may retain, depending on the use of tags in the pattern, the matching events. A large number of active sub-expressions can reduce performance or lead to out-of-memory errors.

During the design of the pattern statement consider the use of timer:within to reduce the amount of time a sub-expression lives, or consider the not operator to end a sub-expression.

The examples herein assume an AEvent and a BEvent event type that have an id property that may correlate between arriving events of the two event types.

In the following sample pattern the engine starts, for each arriving AEvent, a new pattern sub-expression looking for a matching BEvent. Since the AEvent is tagged with a the engine retains each AEvent until a match is found for delivery to listeners or subscribers:

every a=AEvent -> b=BEvent(b.id = a.id)

One way to end a sub-expression is to attach a time how long it may be active.

The next statement ends sub-expressions looking for a matching BEvent 10 seconds after arrival of the AEvent event that started the sub-expression:

every a=AEvent -> (b=BEvent(b.id = a.id) where timer:within(10 sec))

A second way to end a sub-expression is to use the not operator. You can use the not operator together with the and operator to end a sub-expression when a certain event arrives.

The next statement ends sub-expressions looking for a matching BEvent when, in the order of arrival, the next BEvent that arrives after the AEvent event that started the sub-expression does not match the id of the AEvent:

every a=AEvent -> (b=BEvent(b.id = a.id) and not BEvent(b.id != a.id))

The every-distinct operator can be used to keep one sub-expression alive per one or more keys. The next pattern demonstrates an alternative to every-distinct. It ends sub-expressions looking for a matching BEvent when an AEvent arrives that matches the id of the AEvent that started the sub-expression:

every a=AEvent -> (b=BEvent(b.id = a.id) and not AEvent(b.id = a.id))

20.2.19. Pattern Sub-Expression Instance Versus Data Window Use

For some use cases you can either specify one or more data windows as the solution, or you can specify a pattern that also solves your use case.

For patterns, you should understand that the engine employs a dynamic state machine. For data windows, the engine employs a delta network and collections. Generally you may find patterns that require a large number of sub-expression instances to consume more memory and more CPU then data windows.

For example, consider the following EPL statement that filters out duplicate transaction ids that occur within 20 seconds of each other:

select * from TxnEvent.std:firstunique(transactionId).win:time(20 sec)

You could also address this solution using a pattern:

select * from pattern [every-distinct(a.transactionId) a=TxnEvent where timer:within(20 sec)]

If you have a fairly large number of different transaction ids to track, you may find the pattern to perform less well then the data window solution as the pattern asks the engine to manage a pattern sub-expression per transaction id. The data window solution asks the engine to manage expiry, which can give better performance in many cases.

20.2.20. The Keep-All Data Window

The std:keepall data window is a data window that retains all arriving events. The data window can be useful during the development phase and to implement a custom expiry policy using on-delete and named windows. Care should be taken to timely remove from the keep-all data window however. Use on-select or on-demand queries to count the number of rows currently held by a named window with keep-all expiry policy.

20.2.21. Statement Design for Reduced Memory Consumption - Diagnosing OutOfMemoryError

This section describes common sources of out-of-memory problems.

If using the keep-all data window please consider the information above. If using pattern statements please consider pattern sub-expression instantiation and lifetime as discussed prior to this section.

When using the group-by clause or std:groupwin grouped data windows please consider the hints as described below. Make sure your grouping criteria are fields that don't have an unlimited number of possible values or specify hints otherwise.

The std:unique unique data window can also be a source for error. If your uniqueness criteria include a field which is never unique the memory use of the data window can grow, unless your application deletes events.

When using the every-distinct pattern construct parameterized by distinct value expressions that generate an unlimited number of distinct values, consider specifying a time period as part of the parameters to indicate to the pattern engine how long a distinct value should be considered.

In a match-recognize pattern consider limiting the number of optional events if optional events are part of the data reported in the measures clause. Also when using the partition clause, if your partitioning criteria include a field which is never unique the memory use of the match-recognize pattern engine can grow.

A further source of memory use is when your application creates new statements but fails to destroy created statements when they are no longer needed.

In your application design you may also want to be conscious when the application listener or subscriber objects retain output data.

An engine instance, uniquely identified by an engine URI is a relatively heavyweight object. Optimally your application allocates only one or a few engine instances per JVM. A statement instance is associated to one engine instance, is uniquely identified by a statement name and is a medium weight object. We have seen applications allocate 100,000 statements easily. A statement's context partition instance is associated to one statement, is uniquely identified by a context partition id and is a light weight object. We have seen applications allocate 5000 context partitions for 100 statements easily, i.e. 5,000,000 context partitions. An aggregation row, data window row, pattern etc. is associated to a statement context partition and is a very lightweight object itself.

The prev, prevwindow and prevtail functions access a data window directly. The engine does not need to maintain a separate data structure and grouping is based on the use of the std:groupwin grouped data window. Compare this to the use of event aggregation functions such as first, window and last which group according to the group by clause. If your statement utilizes both together consider reformulating to use prev instead.

20.2.22. Performance, JVM, OS and hardware

Performance will also depend on your JVM (Sun HotSpot, BEA JRockit, IBM J9), your operating system and your hardware. A JVM performance index such as specJBB at spec.org can be used. For memory intensive statement, you may want to consider 64bit architecture that can address more than 2GB or 3GB of memory, although a 64bit JVM usually comes with a slow performance penalty due to more complex pointer address management.

The choice of JVM, OS and hardware depends on a number of factors and therefore a definite suggestion is hard to make. The choice depends on the number of statements, and number of threads. A larger number of threads would benefit of more CPU and cores. If you have very low latency requirements, you should consider getting more GHz per core, and possibly soft real-time JVM to enforce GC determinism at the JVM level, or even consider dedicated hardware such as Azul. If your statements utilize large data windows, more RAM and heap space will be utilized hence you should clearly plan and account for that and possibly consider 64bit architectures or consider EsperHA.

The number and type of statements is a factor that cannot be generically accounted for. The benchmark kit can help test out some requirements and establish baselines, and for more complex use cases a simulation or proof of concept would certainly works best. EsperTech' experts can be available to help write interfaces in a consulting relationship.

20.2.23. Consider using Hints

The @Hint annotation provides a single keyword or a comma-separated list of keywords that provide instructions to the engine towards statement execution that affect runtime performance and memory-use of statements. Also see Section 5.2.7.8, “@Hint”.

The hints for use with joins, outer joins, unidirectional joins, relational and non-relational join query planning are described in Section 5.12.5, “Hints Related to Joins”.

The hint for use with group by to specify how state for groups is reclaimed is described in Section 5.6.2.1, “Hints Pertaining to Group-By” and Section 12.3.2, “Grouped Data Window (std:groupwin)”.

The hint for use with group by to specify aggregation state reclaim for unbound streams and timestamp groups is described in Section 5.6.2.1, “Hints Pertaining to Group-By”.

The hint for use with match_recognize to specify iterate-only is described in Section 7.4.6, “Eliminating Duplicate Matches”.

To tune subquery performance when your subquery selects from a named window, consider the hints discussed in Section 5.11.8, “Hints Related to Subqueries”.

The @NoLock hint to remove context partition locking (also read caution note) is described at Section 14.7, “Engine Threading and Concurrency”.

The hint for influencing query planning is elaborated in Section 20.2.33, “Query Planning Index Hints” and query planning is described in Section 20.2.32, “Notes on Query Planning”.

20.2.24. Optimizing Stream Filter Expressions

Assume your EPL statement invokes a static method in the stream filter as the below statement shows as an example:

select * from MyEvent(MyHelperLibrary.filter(field1, field2, field3, field4*field5))

As a result of starting above statement, the engine must evaluate each MyEvent event invoking the MyHelperLibrary.filter method and passing certain event properties. The same applies to pattern filters that specify functions to evaluate.

If possible, consider moving some of the checking performed by the function back into the filter or consider splitting the function into a two parts separated by and conjunction. In general for all expressions, the engine evaluates expressions left of the and first and can skip evaluation of the further expressions in the conjunction in the case when the first expression returns false. In addition the engine can build a reverse index for fields provided in stream or pattern filters.

For example, the below statement could be faster to evaluate:

select * from MyEvent(field1="value" and 
  MyHelperLibrary.filter(field1, field2, field3, field4*field5))

20.2.25. Statement and Engine Metric Reporting

You can use statement and engine metric reporting as described in Section 14.15, “Engine and Statement Metrics Reporting” to monitor performance or identify slow statements.

20.2.26. Expression Evaluation Order and Early Exit

The term "early exit" or "short-circuit evaluation" refers to when the engine can evaluate an expression without a complete evaluation of all sub-expressions.

Consider an expression such as follows:

where expr1 and expr2 and expr3

If expr1 is false the engine does not need to evaluate expr2 and expr3. Therefore when using the AND logical operator consider reordering expressions placing the most-selective expression first and less selective expressions thereafter.

The same is true for the OR logical operator: If expr1 is true the engine does not need to evaluate expr2 and expr3. Therefore when using the OR logical operator consider reordering expressions placing the least-selective expression first and more selective expressions thereafter.

The order of expressions (here: expr1, expr2 and expr3) does not make a difference for the join and subquery query planner.

Note that the engine does not guarantee short-circuit evaluation in all cases. The engine may rewrite the where-clause or filter conditions into another order of evaluation so that it can perform index or reverse index lookups.

20.2.27. Large Number of Threads

When using a large number of threads with the engine, such as more then 100 threads, we provide a setting in the configuration that instructs the engine to reduce the use of thread-local variables. Please see Section 15.4.23, “Engine Settings related to Execution of Statements” for more information.

20.2.28. Context Partition Related Information

As the engine locks on the level of context partition, high concurrency under threading can be achieved by using context partitions.

Generally context partitions require more memory then the more fine-grained grouping that can be achieved by group by or std:groupwin.

20.2.29. Prefer Constant Variables over Non-Constant Variables

The create-variable syntax as well as the APIs can identify a variable as a constant value. When a variable's value is not intended to change it is best to declare the variable as constant.

For example, consider the following two EPL statements that each declares a variable. The first statement declares a constant variable and the second statement declares a non-constant variable:

// declare a constant variable
create constant variable CONST_DEPARTMENT = 'PURCHASING'

// declare a non-constant variable
create variable VAR_DEPARTMENT = 'SALES'

When your application creates a statement that has filters for events according to variable values, the engine internally inspects such expressions and performs filter optimizations for constant variables that are more effective in evaluation.

For example, consider the following two EPL statements that each look for events related to persons that belong to a given department:

// perfer the constant
select * from PersonEvent(department=CONST_DEPARTMENT)

// less efficient
select * from PersonEvent(department=VAR_DEPARTMENT)

The engine can more efficiently evaluate the expression using a variable declared as constant. The same observation can be made for subquery and join query planning.

20.2.30. Prefer Object-array Events

Object-array events offer the best read access performance for access to event property values. In addition, object-array events use much less memory then Map-type events. They also offer the best write access performance.

A comparison of different event representations is in Section 2.13, “Comparing Event Representations”.

First, we recommend that your application sends object-array events into the engine, instead of Map-type events. See Section 2.7, “Object-array (Object[]) Events” for more information.

Second, we recommend that your application sets the engine-wide configuration of the default event representation to object array, as described in Section 15.4.11.1, “Default Event Representation”. Alternatively you can use the @EventRepresentation(array=true) annotation with individual statements.

20.2.31. Composite or Compound Keys

If your uniqueness, grouping, sorting or partitioning keys are composite keys or compound keys, this section may apply. A composite key is a key that consists of 2 or more properties or expressions.

In the example below the firstName and lastName expressions are part of a composite key:

... group by firstName, lastName
...std:unique(firstName, lastName)...
...order by firstName, lastName

Note

The example above is not a comprehensive discussion where composite or compound keys may be used in EPL. Other places where composite keys may apply are patterns, partitioned contexts and grouped data windows (we may have missed one).

You application could change the EPL to instead refer to a single value fullName:

... group by fullName
...std:unique(fullName)...
...order by fullName

The advantage in using a single expression as the uniqueness, grouping and sorting key is that the engine does not need to compute multiple expressions and retain a separate data structure in memory that represents the composite key, resulting in reduced memory use and increased throughput.

20.2.32. Notes on Query Planning

Query planning takes place for subqueries, joins (any type), named window on-actions (on-select, on-merge, on-insert, on-update, on-select) and fire-and-forget queries. Query planning affects query execution speed. Enable query plan logging to output query plan information.

For query planning, the engine draws information from:

The where-clauses, if any are specified. Where-clauses correlate streams, patterns, named windows etc. with more streams, patterns and named windows and are thus the main source of information for query planning.
The data window(s) declared on streams and named windows. The std:unique and the std:firstunique data window instruct the engine to retain the last event per unique criteria.
For named windows, the explicit indexes created via create unique index or create index.
For named windows, the previously created implicit indexes. The engine can create implicit indexes automatically if explicit indexes do not match correlation requirements.
Any hints specified for the statement in question and including hints specified during the creation of named windows with create window.

The engine prefers unique indexes over non-unique indexes.

20.2.33. Query Planning Index Hints

Currently index hints are only supported for the following types of statements:

Named window on-action statements (on-select, on-merge, on-insert, on-update, on-select).
Statements that have subselects against named windows that have index sharing enabled (the default is disabled).
Fire-and-forget queries.

For the above statements, you may dictate to the engine which explicit index (created via create index syntax) to use.

Specify the name of the explicit index in parentheses following @Hint and the index literal.

The following example instructs the engine to use the UserProfileIndex if possible:

@Hint('index(UserProfileIndex)')

Add the literal bust to instruct the engine to use the index, or if the engine cannot use the index fail query planning with an exception and therefore fail statement creation.

The following example instructs the engine to use the UserProfileIndex if possible or fail with an exception if the index cannot be used:

@Hint('index(UserProfileIndex, bust)')

Multiple indexes can be listed separated by comma (,).

The next example instructs the engine to consider the UserProfileIndex and the SessionIndex or fail with an exception if either index cannot be used:

@Hint('index(UserProfileIndex, SessionIndex, bust)')

The literal explicit can be added to instruct the engine to use only explicitly created indexes.

The final example instructs the engine to consider any explicitly create index or fail with an exception if any of the explicitly created indexes cannot be used:

@Hint('index(explicit, bust)')

20.2.34. Measuring Throughput

We recommend using System.nanoTime() to measure elapsed time when processing a batch of, for example, 1000 events.

Note that System.nanoTime() provides nanosecond precision, but not necessarily nanosecond resolution.

Therefore don't try to measure the time spent by the engine processing a single event: The resolution of System.nanoTime() is not sufficient. Also, there are reports that System.nanoTime() can be actually go "backwards" and may not always behave as expected under threading. Please check your JVM platform documentation.

In the default configuration, the best way to measure performance is to take nano time, send a large number of events, for example 10.000 events, and take nano time again reporting on the difference between the two numbers.

If your configuration has inbound threading or other threading options set, you should either monitor the queue depth to determine performance, or disable threading options when measuring performance, or have your application use multiple threads to send events instead.

20.2.35. Do not create the same EPL Statement X times

It is vastly more efficient to create an EPL statement once and attach multiple listeners, then to create the same EPL statement X times.

Since your goal will be to make all test code as realistic, real-world and production-like as possible, we recommend against production code or test code creating the same EPL statement multiple times. Instead consider creating the same EPL statement once and attaching multiple listeners. Certain important optimizations that the engine can perform when EPL statements realistically differ, may not take place. The engine also does not try to detect duplicate EPL statements, since that can easily be done by your application using public APIs.

20.2.36. Comparing Single-Threaded and Multi-Threaded Performance

The Java Virtual Machine optimizes locks such that the time to obtain a read lock, for example, differs widely between single-threaded and multi-threaded applications. We compared code that obtains an unfair ReentrantReadWriteLock read lock 100 million times, without any writer. We measured 3 seconds for a single-threaded application and 15 seconds for an application with 2 threads. It can therefore not be expected that scaling from single-threaded to 2 threads will always double performance. There is a base cost for multiple threads to coordinate.

20.2.37. Incremental Versus Recomputed Aggregation for Named Window Events

Whether aggregations of named window rows are computed incrementally or are recomputed from scratch depends on the type of query.

When the engine computes aggregation values incrementally, meaning it continuously updates the aggregation value as events enter and leave a named window, it means that the engine internally subscribes to named window updates and applies these updates as they occur. For some applications this is the desired behavior.

For some applications re-computing aggregation values from scratch when a certain condition occurs, for example when a triggering event arrives or time passes, is beneficial. Re-computing an aggregation can be less expensive if the number of rows to consider is small and/or when the triggering event or time condition triggers infrequently.

The next paragraph assumes that a named window has been created to hold some historical financial data per symbol and minute:

create window HistoricalWindow.win:keepall() as (symbol string, int minute, double price)

insert into HistoricalWindow select symbol, minute, price from HistoricalTick

For statements that simply select from a named window (excludes on-select) the engine computes aggregation values incrementally, continuously updating the aggregation, as events enter and leave the named window.

For example, the below statement updates the total price incrementally as events enter and leave the named window. If events in the named window already exist at the time the statement gets created, the total price gets pre-computed once when the statement gets created and incrementally updated when events enter and leave the named window:

select sum(price) from HistoricalWindow(symbol='GE')

The same is true for uncorrelated subqueries. For statements that sub-select from a named window, the engine computes aggregation values incrementally, continuously updating the aggregation, as events enter and leave the named window. This is only true for uncorrelated subqueries that don't have a where-clause.

// Output GE symbol total price, incrementally computed
// Outputs every 15 minutes on the hour.
select (sum(price) from HistoricalWindow(symbol='GE')) 
from pattern [every timer:at(0, 15, 30, 45), *, *, *, *, 0)]

If instead your application uses on-select or a correlated subquery, the engine recomputes aggregation values from scratch every time the triggering event fires.

For example, the below statement does not incrementally compute the total price (use a plain select or subselect as above instead). Instead the engine computes the total price from scratch based on the where-clause and matching rows:

// Output GE symbol total price (recomputed from scratch) every 15 minutes on the hour
on pattern [every timer:at(0, 15, 30, 45), *, *, *, *, 0)]
select sum(price) from HistoricalWindow where symbol='GE'

Unidirectional joins against named windows also do not incrementally compute aggregation values.

Joins and outer joins, that are not unidirectional, compute aggregation values incrementally.

20.2.38. When Does Memory Get Released

Java Virtual Machines (JVMs) release memory only when a garbage collection occurs. Depending on your JVM settings a garbage collection can occur frequently or infrequently and may consider all or only parts of heap memory.

Esper is optimized towards latency and throughput. Esper does not force garbage collection or interfere with garbage collection. For performance-sensitive code areas, Esper utilizes thread-local buffers such as arrays or ringbuffers that can retain small amounts of recently processed state. Esper does not try to clean such buffers after every event for performance reasons. It does clean such buffers when destroying the engine and stopping or destroying statements. It is therefore normal to see a small non-increasing amount of memory to be retained after processing events that the garbage collector may not free immediately.

20.3. Using the performance kit

20.3.1. How to use the performance kit

The benchmark application is basically an Esper event server build with Esper that listens to remote clients over TCP. Remote clients send MarketData(ticker, price, volume) streams to the event server. The Esper event server is started with 1000 statements of one single kind (unless otherwise written), with one statement per ticker symbol, unless the statement kind does not depend on the symbol. The statement prototype is provided along the results with a '$' instead of the actual ticker symbol value. The Esper event server is entirely multithreaded and can leverage the full power of 32bit or 64bit underlying hardware multi-processor multi-core architecture.

The kit also prints out when starting up the event size and the theoretical maximal throughput you can get on a 100 Mbit/s and 1 Gbit/s network. Keep in mind a 100 Mbit/s network will be overloaded at about 400 000 event/s when using our kit despite the small size of events.

Results are posted on our Wiki page at Performance Wiki. Reported results do not represent best ever obtained results. Reported results may help you better compare Esper to other solutions (for latency, throughput and CPU utilization) and also assess your target hardware and JVMs.

The Esper event server, client and statement prototypes are provided in the source repository esper/trunk/examples/benchmark/. Refer to http://xircles.codehaus.org/projects/esper/repo for source access.

A built is provided for convenience (without sources) as an attachment to the Wiki page at Performance Wiki. It contains Ant script to start client, server in simulation mode and server. For real measurement we advise to start from a shell script (because Ant is pipelining stdout/stderr when you invoke a JVM from Ant - which is costly). Sample scripts are provided for you to edit and customize.

If you use the kit you should:

Choose the statement you want to benchmark, add it to etc/statements.properties under your own KEY and use the -mode KEY when you start the Esper event server.
Prepare your runServer.sh/runServer.cmd and runClient.sh/runclient.cmd scripts. You'll need to drop required jar libraries in lib/ , make sure the classpath is configured in those script to include build and etc . The required libraries are Esper (any compatible version, we have tested started with Esper 1.7.0) and its dependencies as in the sample below (with Esper 2.1) :
```
# classpath on Unix/Linux (on one single line)
etc:build:lib/esper-4.11.0.jar:lib/commons-logging-1.1.1.jar:lib/cglib-nodep-2.2.jar
   :lib/antlr-runtime-3.2.jar:lib/log4j-1.2.16.jar
@rem  classpath on Windows (on one single line)
etc;build;lib\esper-4.11.0.jar;lib\commons-logging-1.1.1.jar;lib\cglib-nodep-2.2.jar
   ;lib\antlr-runtime-3.2.jar;lib\log4j-1.2.16.jar
```
Note that ./etc and ./build have to be in the classpath. At that stage you should also start to set min and max JVM heap. A good start is 1GB as in -Xms1g -Xmx1g
Write the statement you want to benchmark given that client will send a stream MarketData(String ticker, int volume, double price), add it to etc/statements.properties under your own KEY and use the -mode KEY when you start the Esper event server. Use '$' in the statement to create a prototype. For every symbol, a statement will get registered with all '$' replaced by the actual symbol value (f.e. 'GOOG')
Ensure client and server are using the same -Desper.benchmark.symbol=1000 value. This sets the number of symbol to use (thus may set the number of statement if you are using a statement prototype, and governs how MarketData event are represented over the network. Basically all events will have the same size over the network to ensure predictability and will be ranging between S0AA and S999A if you use 1000 as a value here (prefix with S and padded with A up to a fixed length string. Volume and price attributes will be randomized.
By default the benchmark registers a subscriber to the statement(s). Use -Desper.benchmark.ul to use an UpdateListener instead. Note that the subscriber contains suitable update(..) methods for the default proposed statement in the etc/statements.properties file but might not be suitable if you change statements due to the strong binding with statement results. Refer to Section 14.3.2, “Receiving Statement Results”.
Establish a performance baseline in simulation mode (without clients). Use the -rate 1x5000 option to simulate one client (one thread) sending 5000 evt/s. You can ramp up both the number of client simulated thread and their emission rate to maximize CPU utilization. The right number should mimic the client emission rate you will use in the client/server benchmark and should thus be consistent with what your client machine and network will be able to send. On small hardware, having a lot of thread with slow rate will not help getting high throughput in this simulation mode.

Do performance runs with client/server mode. Remove the -rate NxM option from the runServer script or Ant task. Start the server with -help to display the possible server options (listen port, statistics, fan out options etc). On the remote machine, start one or more client. Use -help to display the possible client options (remote port, host, emission rate). The client will output the actual number of event it is sending to the server. If the server gets overloaded (or if you turned on -queue options on the server) the client will likely not be able to reach its target rate.

Usually you will get better performance by using server side -queue -1 option so as to have each client connection handled by a single thread pipeline. If you change to 0 or more, there will be intermediate structures to pass the event stream in an asynchronous fashion. This will increase context switching, although if you are using many clients, or are using the -sleep xxx (xxx in milliseconds) to simulate a listener delay you may get better performance.

The most important server side option is -stat xxx (xxx in seconds) to print out throughput and latency statistics aggregated over the last xxx seconds (and reset every time). It will produce both internal Esper latency (in nanosecond) and also end to end latency (in millisecond, including network time). If you are measuring end to end latency you should make sure your server and client machine(s) are having the same time with f.e. ntpd with a good enough precision. The stat format is like:

---Stats - engine (unit: ns)
  Avg: 2528 #4101107
        0 <    5000:  97.01%  97.01% #3978672
     5000 <   10000:   2.60%  99.62% #106669
    10000 <   15000:   0.35%  99.97% #14337
    15000 <   20000:   0.02%  99.99% #971
    20000 <   25000:   0.00%  99.99% #177
    25000 <   50000:   0.00% 100.00% #89
    50000 <  100000:   0.00% 100.00% #41
   100000 <  500000:   0.00% 100.00% #120
   500000 < 1000000:   0.00% 100.00% #2
  1000000 < 2500000:   0.00% 100.00% #7
  2500000 < 5000000:   0.00% 100.00% #5
  5000000 <    more:   0.00% 100.00% #18
---Stats - endToEnd (unit: ms)
  Avg: -2704829444341073400 #4101609
        0 <       1:  75.01%  75.01% #3076609
        1 <       5:   0.00%  75.01% #0
        5 <      10:   0.00%  75.01% #0
       10 <      50:   0.00%  75.01% #0
       50 <     100:   0.00%  75.01% #0
      100 <     250:   0.00%  75.01% #0
      250 <     500:   0.00%  75.01% #0
      500 <    1000:   0.00%  75.01% #0
     1000 <    more:  24.99% 100.00% #1025000
Throughput 412503 (active 0 pending 0 cnx 4)

This one reads as:

"Throughput is 412 503 event/s with 4 client connected. No -queue options 
was used thus no event is pending at the time the statistics are printed. 
Esper latency average is at 2528 ns (that is 2.5 us) for 4 101 107 events 
(which means we have 10 seconds stats here). Less than 10us latency 
was achieved for 106 669 events that is 99.62%. Latency between 5us 
and 10us was achieved for those 2.60% of all the events in the interval."

"End to end latency was ... in this case likely due to client clock difference
we ended up with unusable end to end statistics."

Consider the second output paragraph on end-to-end latency:

---Stats - endToEnd (unit: ms)
  Avg: 15 #863396
        0 <       1:   0.75%   0.75% #6434
        1 <       5:   0.99%   1.74% #8552
        5 <      10:   2.12%   3.85% #18269
       10 <      50:  91.27%  95.13% #788062
       50 <     100:   0.10%  95.22% #827
      100 <     250:   4.36%  99.58% #37634
      250 <     500:   0.42% 100.00% #3618
      500 <    1000:   0.00% 100.00% #0
     1000 <    more:   0.00% 100.00% #0

This would read:

"End to end latency average is at 15 milliseconds for the 863 396 events 
considered for this statistic report. 95.13% ie 788 062 events were handled 
(end to end) below 50ms, and 91.27% were handled between 10ms and 50ms."

20.3.2. How we use the performance kit

We use the performance kit to track performance progress across Esper versions, as well as to implement optimizations. You can track our work on the Wiki at http://docs.codehaus.org/display/ESPER/Home.