esper.codehaus.org and espertech.comDocumentation
Using match recognize patterns are defined in the familiar syntax of regular expressions.
The match recognize syntax presents an alternative way to specify pattern detection as compared to the EPL pattern language described in the previous chapter. A comparison of match recognize and EPL patterns is below.
The match recognize syntax is a proposal for incorporation into the SQL standard. It is thus subject to change as the standard evolves and finalizes (it has not finalized yet). Please consult row-pattern-recogniton-11-public for further information.
You may be familiar with regular expressions in the context of finding text of interest in a string, such as particular characters, words, or patterns of characters. Instead of matching characters, match recognize matches sequences of events of interest.
Esper can apply match-recognize patterns in real-time upon arrival of new events in a stream of events (also termed incrementally, streaming or continuous). Esper can also match patterns on-demand via the iterator
pull-API, if specifying a named window or data window on a stream.
This section compares pattern detection via match recognize and via the EPL pattern language.
Table 7.1. Comparison Match Recognize to EPL Patterns
Category | EPL Patterns | Match Recognize |
---|---|---|
Purpose | Pattern detection in sequences of events. | Same. |
Standards | Not standardized, similar to Rapide pattern language. | Proposal for incorporation into the SQL standard. |
Real-time Processing | Yes. | Yes. |
On-Demand query via Iterator | No. | Yes. |
Language | Nestable expressions consisting of boolean AND , OR , NOT and time or arrival-based constructs such as -> (followed-by), timer:within and timer:interval . | Regular expression consisting of variables each representing conditions on events. |
Event Types | An EPL pattern may react to multiple different types of events. | The input is a single type of event (unless used with variant streams). |
Data Window Interaction | Disconnected, i.e. an event leaving a data window does not change pattern state. | Connected, i.e. an event leaving a data window removes the event from match selection. |
Semantic Evaluation | Truth-value based: A EPL pattern such as (A and B) can fire when a single event arrives that satisfies both A and B conditions. | Sequence-based: A regular expression (A B) requires at least two events to match. |
Time Relationship Between Events | The timer:within , timer:interval and NOT operator can expressively search for absence of events or other more complex timing relationships. | Some support for detecting absence of events using the interval clause. |
Extensibility | Custom pattern objects, user-defined functions. | User-defined functions, custom aggregation functions. |
Memory Use | Likely between 500 bytes to 2k per open sequence, depends on pattern. | Likely between 100 bytes to 1k per open sequence, depends on pattern. |
The synopsis is as follows:
match_recognize ( [ partition by partition_expression [, partition_expression] [,...] ] measures measure_expressionas
col_name [, measure_expressionas
col_name ] [,...] [ all matches ] [ after match skip (past last row | to next row | to current row) ] pattern ( variable_regular_expr [, variable_regular_expr] [,...] ) [ interval time_period ] define variable as variable_condition [, variable as variable_condition] [,...] )
The match_recognize
keyword starts the match recognize definition and occurs right after the from
clause in an EPL select
statement. It is followed by parenthesis that surround the match recognize definition.
Partition by
is optional and may be used to specify that events are to be partitioned by one or more event properties or expressions. If there is no Partition by
then all rows of the table constitute a single partition. The regular expression applies to events in the same partition and not across partitions.
The measures
clause defines columns that contain expressions over the pattern variables. The expressions can reference partition columns, singleton variables, aggregates as well as indexed properties on the group variables. Each measure_expression expression must be followed by the as
keyword and a col_name column name.
The all matches
keywords are optional and instructs the engine to find all possible matches. By default matches are ranked and the engine returns a single match following an algorithm to eliminate duplicate matches, as described below. When specifying all matches
, matches may overlap and may start at the same row.
The after match skip
keywords are optional and serve to determine the resumption point of pattern matching after a match has been found. By default the behavior is after match skip past last row
. This means that after eliminating duplicate matches, the logic skips to resume pattern matching at the next event after the last event of the current match.
The pattern
component is used to specify a regular expression. The regular expression is built from variable names, and may use the operators such as *
, +
, ?
, *?
, +?
, ??
quantifiers and |
alteration (concatenation is indicated by the absence of any operator sign between two successive items in a pattern).
With the optional interval
keyword and time period you can control how long the engine should wait for further events to arrive that may be part of a matching event sequence, before indicating a match (or matches) (not applicable to on-demand pattern matching).
Define
is a mandatory component, used to specify the boolean condition that defines a variable name that is declared in the pattern. A variable name does not require a definition and if there is no definition, the default is a predicate that is always true. Such a variable name can be used to match any row.
For illustration, the examples in this chapter use the TemperatureSensorEvent
event. Each event has 3 properties: the id
property is a unique event id, the device
is a sensor device number and the temp
property is a temperature reading. An event described as "id=E1, device=1, temp=100"
is a TemperatureSensorEvent
event with id "E1" for device 1 with a reading of 100.
This example statement looks for two TemperatureSensorEvent
events from the same device, directly following each other, that indicate a jump in temperature of 10 or more between the two events:
select * from TemperatureSensorEvent match_recognize ( partition by device measures A.id as a_id, B.id as b_id, A.temp as a_temp, B.temp as b_temp pattern (A B) define B as Math.abs(B.temp - A.temp) >= 10 )
The partition by
ensures that the regular expression applies to sequences of events from the same device.
The measures
clause provides a list of properties or expressions to be selected from matching events. Each property name must be prefixed by the variable name.
In the pattern
component the statement declares two variables: A
and B
. As a matter of convention, variable names are uppercase characters.
The define
clause specifies no condition for variable A. This means that A defaults to true and any event matches A. Therefore, the pattern can start at any event.
The pattern A B
indicates to look for a pattern in which an event that fulfills the condition for variable A is immediately followed by an event that fulfills the condition for variable B.
A pattern thus defines the sequence (or sequences) of conditions that must be met for the pattern to fire.
Below table is an example sequence of events and output of the pattern:
Table 7.2. Example
Arrival Time | Tuple | Output Event (if any) |
---|---|---|
1000 | id=E1, device=1, temp=50 | |
2000 | id=E2, device=1, temp=55 | |
3000 | id=E3, device=1, temp=60 | |
4000 | id=E4, device=1, temp=70 | a_id = E3, b_id = E4, a_temp = 60, b_temp = 70 |
5000 | id=E5, device=1, temp=85 | |
6000 | id=E6, device=1, temp=85 | |
7000 | id=E7, device=2, temp=100 |
At time 4000 when event with id E4
(or event E4 or just E4 for short) arrives the pattern matches and produces an output event. Matching then skips past the last event of the current match (E4) and begins at event E5 (the default skip clause is past last row). Therefore events E4 and E5 do not constitute a match.
At time 3000, events E1 and E3 do not constitute a match as E3 does not immediately follow E, since there is E2 in between.
At time 7000, event E7 does not constitute a match as it is from device 2 and thereby not in the same partition as prior events.
The pattern
specifies the pattern to be recognized in the ordered sequence of events in a partition using regular expression syntax. Each variable name in a pattern corresponds to a boolean condition, which is specified later
using the define
component of the syntax. Thus the pattern
can be regarded as implicitly declaring one or more variable names; the definition of those variable names appears later in the
syntax. If a variable is not defined the variable defaults to true.
It is permitted for a variable name to occur more than once in a pattern, for example pattern (A B A)
.
The operators at the top of this table take precedence over operators lower on the table.
Table 7.3. Match Recognize Regular Expression Operator Precedence
Precedence | Operator | Description | Example |
---|---|---|---|
1 | Grouping | () | (A B) |
2 | Single-character duplication | * + ? | A* B+ C? |
3 | Concatenation | (no operator) | A B |
4 | Alternation | | | A | B |
If you are not sure about the precedence, please consider placing parenthesis ()
around your groups. Parenthesis can also help make
expressions easier to read and understand.
The concatenation is indicated by the absence of any operator sign between two successive items in a pattern.
In below pattern the two items A and B have no operator between them. The pattern matches for any event immediately followed by an event from the same device that indicates a jump in temperature over 10:
select * from TemperatureSensorEvent match_recognize ( partition by device measures A.id as a_id, B.id as b_id, A.temp as a_temp, B.temp as b_temp pattern (A B) define B as Math.abs(B.temp - A.temp) >= 10 )
Please see the Section 7.3.1, “Syntax Example” for a sample event sequence.
The alternation operator is a vertical bar ( |
).
The alternation operator has the lowest precedence of all operators. It tells the engine to match either everything to the left of the vertical bar, or everything to the right of the vertical bar. If you want to limit the reach of the alternation, you will need to use round brackets for grouping.
This example pattern looks for a sequence of an event with a temperature over 50 followed immediately by either an event with a temperature less then 45 or an event that indicates the temperature jumped by 10 (all for the same device):
select * from TemperatureSensorEvent match_recognize ( partition by device measures A.id as a_id, B.id as b_id, C.id as c.id pattern (A (B | C)) define A as A.temp >= 50, B as B.temp <= 45, C as Math.abs(C.temp - A.temp) >= 10)
Below table is an example sequence of events and output of the pattern:
Table 7.4. Example
Arrival Time | Tuple | Output Event (if any) |
---|---|---|
1000 | id=E1, device=1, temp=50 | |
2000 | id=E2, device=1, temp=45 | a_id=E1, b_id=E2, c_id=null |
3000 | id=E3, device=1, temp=46 | |
4000 | id=E4, device=1, temp=48 | |
5000 | id=E5, device=1, temp=50 | |
6000 | id=E6, device=1, temp=60 | a_id = E5, b_id = null, c_id=E6 |
Quantifiers are postfix operators with the following choices:
Table 7.5. Quantifiers
Quantifier | Meaning |
---|---|
* | Zero or more matches (greedy). |
+ | One or more matches (greedy). |
? | Zero or one match (greedy). |
*? | Zero or more matches (reluctant). |
+? | One or more matches (reluctant). |
?? | Zero or one match (reluctant). |
A singleton variable is a variable in a pattern that does not have a quantifier or has a zero-or-one quantifier (?
or ??
) and occurs only once in the pattern (except with alteration).
In the measures
clause a singleton variable can be selected as:
variableName.propertyName
Variables with a zero-or-more or one-or-more quantifier, or variables that occur multiple places in a pattern (except when using alteration), may match multiple events and are group variables.
In the measures
clause a group variable must be selected either by providing an index or via any of the aggregation functions, such as first
, last
, count
and sum
:
variableName[index].propertyName
last(variableName.propertyName)
Please find examples of singleton and group variables and example measures
clauses below.
For group variables all existing aggregation functions can be used and in addition the following aggregation functions may be used:
Table 7.6. Syntax and results of aggregate functions
Aggregate Function | Result |
---|---|
first([all|distinct] expression) |
Returns the first value. |
last([all|distinct] expression) |
Returns the last value. |
The execution of match recognize is continuous and real-time by default. This means that every arriving event, or batch of events if using batching, evaluates against the pattern and matches are immediately indicated. Elimination of duplicate matches occurs between all matches of the arriving events (or batch of events) at a given time.
As an alternative, and if your application does not require continuous pattern evaluation, you may use the iterator
API to perform on-demand matching of the pattern. For the purpose of indicating to the engine to not
generate continuous results, specify the @Hint('iterate_only')
hint.
When using one-or-more, zero-or-more or zero-or-one quantifiers (?, +, *, ??, +?, *?
), the output of the real-time continuous query can differ from the output of the on-demand iterator
execution: The continuous query will output a match (or multiple matches) as soon as matches are detected at a given time upon arrival of events (not knowing what further events may arrive). The on-demand execution, since it knows all possible events in advance, can determine the longest match(es). Thus elimination of duplicate matches can lead to different results between real-time and on-demand use.
If the all matches
keywords are specified, then all matches are returned as the result and no elimination of duplicate matches as below occurs.
Otherwise matches to a pattern in a partition are ordered by preferment. Preferment is given to matches based on the following priorities:
A match that begins at an earlier row is preferred over a match that begins at a later row.
Of two matches matching a greedy quantifier, the longer match is preferred.
Of two matches matching a reluctant quantifier, the shorter match is preferred.
After ranking matches by preferment, matches are chosen as follows:
The first match by preferment is taken.
The pool of matches is reduced as follows based on the SKIP TO clause: If SKIP PAST LAST ROW is specified, all matches that overlap the first match are discarded from the pool. If SKIP TO NEXT ROW is specified, then all matches that overlap the first row of the first match are discarded. If SKIP TO CURRENT ROW is specified, then no matches are discarded.
The first match by preferment of the ones remaining is taken.
Step 2 is repeated to remove more matches from the pool.
Steps 3 and 4 are repeated until there are no remaining matches in the pool.
Reluctant quantifiers are indicated by an additional question mark (*?, +?, ??,
). Reluctant quantifiers try to match as few rows as possible, whereas non-reluctant quantifiers are greedy and try to match as many rows
as possible.
Greedy and reluctant come into play only for match selection among multiple possible matches. When specifying all matches
there is no difference between greedy and reluctant quantifiers.
Consider the below example. The conditions may overlap: an event with a temperature reading of 105 and over matches both A and B conditions:
select * from TemperatureSensorEvent match_recognize ( partition by device measures A.id as a_id, B.id as b_id pattern (A?? B?) define A as A.temp >= 100 B as B.temp >= 105)
A sample sequence of events and pattern matches:
Table 7.7. Example
Arrival Time | Tuple | Output Event (if any) |
---|---|---|
1000 | id=E1, device=1, temp=99 | |
2000 | id=E2, device=2, temp=106 | a_id=null, b_id=E2 |
3000 | id=E3, device=1, temp=100 | a_id=E3, b_id=null |
As the ?
qualifier on condition B is greedy, event E2 matches the pattern and is indicated as a B event by the measure
clause (and not as an A event therefore a_id
is null).
The one-or-more quantifier (+) must be matched one or more times by events. The operator is greedy and the reluctant version is +?
.
In the below example with pattern (A+ B+)
the pattern consists of two variable names, A and B, each of which is quantified with +
, indicating that they must be matched one or more times.
The pattern looks for one or more events in which the temperature is over 100 followed by one or more events indicating a higher temperature:
select * from TemperatureSensorEvent match_recognize ( partition by device measures first(A.id) as first_a, last(A.id) as last_a, B[0].id as b0_id, B[1].id as b1_id pattern (A+ B+) define A as A.temp >= 100, B as B.temp > A.temp)
An example sequence of events that matches the pattern above is:
Table 7.8. Example
Arrival Time | Tuple | Output Event (if any) |
---|---|---|
1000 | id=E1, device=1, temp=99 | |
2000 | id=E2, device=1, temp=100 | |
3000 | id=E3, device=1, temp=100 | |
4000 | id=E4, device=1, temp=101 | first_a = E2, last_a = E3, b0_id = E4, b1_id = null |
5000 | id=E5, device=1, temp=102 |
Note that for continuous queries, there is no match that includes event E5 since after the pattern matches for E4 the pattern skips to start fresh at E5 (by default skip clause). When performing on-demand matching via iterator
, event E5 gets included in the match
and the output is first_a = E2, last_a = E3, b0_id = E4, b1_id = E5
.
The zero-or-more quantifier (*) must be matched zero or more times by events. The operator is greedy and the reluctant version is *?
.
The pattern looks for a sequence of events in which the temperature starts out below 50 and then stays between 50 and 60 and finally comes over 60:
select * from TemperatureSensorEvent match_recognize ( partition by device measures A.id as a_id, count(B.id) as count_b, C.id as c_id pattern (A B* C) define A as A.temp < 50, B as B.temp between 50 and 60, C as C.temp > 60)
An example sequence of events that matches the pattern above is:
Table 7.9. Example
Arrival Time | Tuple | Output Event (if any) |
---|---|---|
1000 | id=E1, device=1, temp=55 | |
2000 | id=E2, device=1, temp=52 | |
3000 | id=E3, device=1, temp=49 | |
4000 | id=E4, device=1, temp=51 | |
5000 | id=E5, device=1, temp=55 | |
6000 | id=E5, device=1, temp=61 | a_id=E3, count_b=2, c_id=E6 |
The zero-or-one quantifier (?) must be matched zero or one time by events. The operator is greedy and the reluctant version is ??
.
The pattern looks for a sequence of events in which the temperature is below 50 and then dips to over 50 and then to under 50 before indicating a value over 55:
select * from TemperatureSensorEvent match_recognize ( partition by device measures A.id as a_id, B.id as b_id, C.id as c_id, D.id as d_id pattern (A B? C? D) define A as A.temp < 50, B as B.temp > 50, C as C.temp < 50, D as D.temp > 55)
An example sequence of events that matches the pattern above is:
Table 7.10. Example
Arrival Time | Tuple | Output Event (if any) |
---|---|---|
1000 | id=E1, device=1, temp=44 | |
2000 | id=E2, device=1, temp=49 | |
3000 | id=E3, device=1, temp=51 | |
4000 | id=E4, device=1, temp=49 | |
5000 | id=E5, device=1, temp=56 | a_id=E2, b_id=E3, c_id=E4, d_id=E5 |
6000 | id=E5, device=1, temp=61 |
Within define
are listed the boolean conditions that defines a variable name that is declared in the pattern.
A variable name does not require a definition and if there is no definition, the default is a predicate that is always true. Such a variable name can be used to match any row.
The definitions of variable names may reference the same or other variable names as prior examples have shown.
If a variable in your condition expression is a singleton variable, then only individual columns may be referenced. If the variable is not matched by an event, a null
value is returned.
If a variable in your condition expression is a group variable, then only indexed columns may be referenced. If the variable is not matched by an event, a null
value is returned.
Aggregation functions are not allowed within expressions of the define
clause.
The prev
function may be used in a define
expression to access columns of the previous row of a variable name. If there is no previous row, the null value is returned.
The prev
function can accept an optional non-negative integer argument indicating the offset to the previous rows. That argument must be a constant. In this case, the engine returns the property from the
N-th row preceding the current row, and if the row doesn’t exist, it returns null
.
This function can access variables currently defined, for example:
Y as Y.price < prev(Y.price, 2)
It is not legal to use prev
with another variable then the one being defined:
// not allowed Y as Y.price < prev(X.price, 2)
The prev
function returns properties of events in the same partition. Also, it returns properties of events according to event order-of-arrival. When using data windows or deleting events from a named window, the remove stream does not remove events from the prev
function.
The pattern looks for an event in which the temperature is greater or equal 100 and that, relative to that event, has an event preceding it by 2 events that also had a temperature greater or equal 100:
select * from TemperatureSensorEvent match_recognize ( partition by device measures A.id as a_id pattern (A) define A as A.temp > 100 and prev(A.temp, 2) > 100)
An example sequence of events that matches the pattern above is:
Table 7.11. Example
Arrival Time | Tuple | Output Event (if any) |
---|---|---|
1000 | id=E1, device=1, temp=98 | |
2000 | id=E2, device=1, temp=101 | |
3000 | id=E3, device=1, temp=101 | |
4000 | id=E4, device=1, temp=99 | |
5000 | id=E5, device=1, temp=101 | a_id=E5 |
The measures
clause defines exported columns that contain expressions over the pattern variables. The expressions can reference partition columns, singleton variables and
any aggregation functions including last
and first
on the group variables.
Expressions in the measures
clause must use the as
keyword to assign a column name.
If a variable is a singleton variable then only individual columns may be referenced, not aggregates. If the variable is not matched by an event, a null
value is returned.
If a variable is a group variable and used in an aggregate, then the aggregate is performed over all rows that have matched the variable. If a group variable is not in an aggregate function, its variable name must be post-fixed with an index. See Section 7.4.5, “Variables Can be Singleton or Group” for more information.
When using match recognize with a named window or stream bound by a data window, all events removed from the named window or data window also removed the match-in-progress that includes the event(s) removed.
The next example looks for four sensor events from the same device immediately following each other and indicating a rising temperature, but only events that arrived in the last 10 seconds are considered:
select * from TemperatureSensorEvent.win:time(10 sec) match_recognize ( partition by device measures A.id as a_id pattern (A B C D) define B as B.temp > A.temp, C as C.temp > B.temp, D as D.temp > C.temp)
An example sequence of events that matches the pattern above is:
Table 7.12. Example
Arrival Time | Tuple | Output Event (if any) |
---|---|---|
1000 | id=E1, device=1, temp=80 | |
2000 | id=E2, device=1, temp=81 | |
3000 | id=E3, device=1, temp=82 | |
4000 | id=E4, device=1, temp=81 | |
7000 | id=E5, device=1, temp=82 | |
9000 | id=E6, device=1, temp=83 | |
13000 | id=E7, device=1, temp=84 | a_id=E4, a_id=E5, a_id=E6, a_id=E7 |
15000 | id=E8, device=1, temp=84 | |
20000 | id=E9, device=1, temp=85 | |
21000 | id=E10, device=1, temp=86 | |
26000 | id=E11, device=1, temp=87 |
Note that E8, E9, E10 and E11 doe not constitute a match since E8 leaves the data window at 25000.
With the optional interval
keyword and time period you can control how long the engine should wait for further events to arrive that may be part of a matching event sequence, before indicating a match (or matches). This is not applicable to on-demand pattern matching.
The interval timer starts are the arrival of the first event matching a sequence for a partition. When the time interval passes and an event sequence matches, duplicate matches are eliminated and output occurs.
The next example looks for sensor events indicating a temperature of over 100 waiting for any number of additional events with a temperature of over 100 for 10 seconds before indicating a match:
select * from TemperatureSensorEvent match_recognize ( partition by device measures A.id as a_id, count(B.id) as count_b, first(B.id) as first_b, last(B.id) as last_b pattern (A B*) interval 5 seconds define A as A.temp > 100, B as B.temp > 100)
An example sequence of events that matches the pattern above is:
Table 7.13. Example
Arrival Time | Tuple | Output Event (if any) |
---|---|---|
1000 | id=E1, device=1, temp=98 | |
2000 | id=E2, device=1, temp=101 | |
3000 | id=E3, device=1, temp=102 | |
4000 | id=E4, device=1, temp=104 | |
5000 | id=E5, device=1, temp=104 | |
7000 | a_id=E2, count_b=3, first_b=E3, last_b=E5 |
Notice that the engine waits 5 seconds (5000 milliseconds) after the arrival time of the first event E2 of the match at 2000, to indicate the match at 7000.
You may match different types of events using match-recognize by following any of these strategies:
Declare a variant stream.
Declare a supertype for your event types in the create schema
syntax.
Have you event classes implement a common interface or extend a common base class.
A short example that demonstrates variant streams and match-recognize is listed below:
// Declare one sample type create schema S0 as (col string)
// Declare second sample type create schema S1 as (col string)
// Declare variant stream holding either type create variant schema MyVariantStream as S0, S1
// Populate variant stream insert into MyVariantStream select * from S0
// Populate variant stream insert into MyVariantStream select * from S1
// Simple pattern to match S0 S1 pairs select * from MyVariantType.win:time(1 min) match_recognize ( measures A.id? as a, B.id? as b pattern (A B) define A as typeof(A) = 'S0', B as typeof(B) = 'S1' )
Please note the following limitations:
Subqueries are not allowed in expressions within match_recognize
.
Joins and outer joins are not allowed in the same statement as match_recognize
.
match_recognize
may not be used within on-select
or on-insert
statements.
When using match_recognize
on unbound streams (no data window provided) the iterator
pull API returns no rows.
A Statement Object Model API for match_recognize
is not yet available.