PreEmptive Analytics Workbench User Guide

Computation Pipeline

This section gives a detailed overview of the data-processing pipeline, and how Plugins can be used to control it. This detailed understanding is not required to write most custom Indexers and queries; this section is primarily meant to be a reference.

The explanations below will often refer to the default Plugins, to highlight real-world uses of each step. It may be useful to refer to the source code for the default Plugins, which is available from PreEmptive Solutions.

You may also wish to refer to the component overview diagram.

Overview

The data-processing pipeline works in five general Phases.

  1. Receive - When clients send envelopes to the Workbench, they are batched and sent to the Extract, Propagate, and Publish phases.
  2. Extract - The data of each analytics message in an envelope batch is extracted and stored in the fields of temporary bin ("temp-bin") objects.
  3. Propagate - When all data from a batch has been extracted, additional changes propagate through the temp-bins, based on the newly-extracted data.
  4. Publish - If sufficient data has been stored in a temp-bin, the relevant values are written out to permanent storage.
  5. Query - When a consumer of the published data (e.g. the Portal) requests data via the Query API, the appropriate information is retrieved from permanent storage.

The first four Phases are triggered by incoming data to the Workbench endpoint, and occur primarily in the Computation Service. Any errors that occur are handled according to Computation Error Handling.

The Query Phase is triggered by requests to the Query API and occurs in the Query Web Service. Any errors that occur return an error message in the response.

Each Phase has a number of internal steps. Every Phase can be customized by the use of Plugins.

Phase 1: Receive

The Receive Phase is the simplest Phase. It is responsible for handling incoming envelopes from client applications and preparing them for the rest of the pipeline.

Input: Envelopes from client applications.
Output: A batch of messages to be used for the Extract, Propagate, and Publish Phases.
Components Used:

  • Endpoint Web Service
  • RabbitMQ
  • Computation Service
  • Plugins
    • Geolocation Implementation

Envelope Queuing

The Endpoint Web Service receives an envelope from a client application via HTTP POST. It then parses and queues it in a RabbitMQ queue named workbench-endpoint for later processing by the Computation Service.

During this process, the geographic location of the originating client is determined, based on the earliest known IP address that sent this envelope. This process can be customized.

If the POST is malformed or not a valid envelope, the HTTP request is rejected, and nothing is queued.

This step does not require the Computation Service to be running.
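As a purely illustrative sketch (not part of the Workbench or its client APIs), the following shows an envelope being delivered by a plain HTTP POST. The endpoint URL and payload are placeholders; real client applications build and send envelopes through the PreEmptive Analytics instrumentation APIs.

using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class EnvelopeSender
{
    static async Task Main()
    {
        using (var http = new HttpClient())
        {
            // Placeholder payload and URL; real envelopes are produced by the client APIs.
            var content = new StringContent("<Envelope>...</Envelope>", Encoding.UTF8, "text/xml");
            var response = await http.PostAsync("http://workbench.example.com/endpoint", content);

            // A malformed or invalid POST is rejected here, and nothing is queued.
            Console.WriteLine((int)response.StatusCode);
        }
    }
}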

Example

A client application is started and sends an envelope containing a few analytics messages to the Workbench endpoint. The Endpoint Web Service determines the location of the original client - even if the envelopes traveled through the Analytics Data Hub or other proxies - and records it with the envelope. Finally, the service queues the envelope in RabbitMQ and sends a success response to the client application.

Batch Creation

The Computation Service reads a set of envelopes from the endpoint queue (workbench-endpoint) into an envelope "batch". The envelopes are split into their constituent messages, and each message is sent to the Extract Phase.

After all of the messages in the envelope have been processed, the batch enters the Propagate and Publish Phases.

To prevent data loss in the event of a Computation Service shutdown, the envelopes are not actually removed from the queue during these phases, but they are reserved ("unacknowledged") to indicate they have been picked up by the Computation Service. The last step of the Publish Phase, Batch Cleanup, is responsible for removing these envelopes once the batch is complete.
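The following is a conceptual sketch only, not the Workbench's actual implementation; it uses the RabbitMQ.Client .NET library to show the reserve-then-acknowledge lifecycle described above.

using RabbitMQ.Client;

class BatchReader
{
    static void Main()
    {
        var factory = new ConnectionFactory { HostName = "localhost" };
        using (var connection = factory.CreateConnection())
        using (var channel = connection.CreateModel())
        {
            // autoAck = false: the envelope is reserved ("unacknowledged"), not removed.
            BasicGetResult delivery = channel.BasicGet("workbench-endpoint", false);
            if (delivery != null)
            {
                // ... the Extract, Propagate, and Publish Phases run on the batch ...

                // Batch Cleanup: only now is the envelope actually removed from the queue.
                channel.BasicAck(delivery.DeliveryTag, false);
            }
        }
    }
}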

Example

Two envelopes are read into a batch. The first is split into two messages: an application-start and a session-start. The second contains only one message, a session-stop. Each message is now ready for consumption by the Extract Phase.

Phase 2: Extract

The Extract Phase operates on individual messages, allowing Indexers to populate fields in temporary storage based on the incoming data.

Input: An analytics message.
Output: Modifications to temporary storage.
Components Used:

  • Computation Service
  • MongoDB
  • Plugins
    • Indexers
    • Custom Data Counters

De-duplication

The message's messageID is checked against the database's list of previously-processed messageIDs. If the Workbench has already processed this message, this copy is ignored, and the Computation Service skips to the next message in the batch, bypassing the rest of the Extract Phase for this message.

Example

A client application sends an envelope, E, to the Workbench, but due to network connectivity issues, the 200 OK response sent during the Envelope Queuing step is not received by the client. The client, thinking that the envelope was not received by the Workbench, sends another copy, E', and this time gets the response correctly.

When the Extract, Propagate, and Publish Phases run on messages from E, the Computation Service successfully processes the data, then records the messageIDs during the Batch Cleanup step. When the messages from E' arrive, none of them are processed, due to each message's messageID already being recorded. The batch proceeds to the Propagate Phase without doing any Extraction.

Indexer Selection and Grouping

The message's type is checked, and all Indexers that can extract that type are selected. These Indexers are then grouped by the Scope (and Parent Scopes) they operate on; if there are multiple Scopes, the rest of the Extract Phase repeats on this message for each Scope.

Indexers declare which message types they can extract via the DefineMessageTypesToExtract method, and declare the Scopes and Parent Scopes they operate on in their constructor calls.

Example 1

A session start message is checked; its type is SessionLifeCycle. There are three default Indexers that extract this type: the Custom Data Indexer, the User Indexer, and the Session Indexer. All three are selected to process this message.

Because the Custom Data Indexer operates on Message Scope, and the other two Indexers operate on Session Scope, they are grouped accordingly. The remainder of the Extract Phase proceeds with one of the two groups, and then repeats for the other.

Example 2

Consider a scenario where a custom "Company-Specific Feature Indexer" has been installed on the Workbench. This Indexer extracts messages of type Feature and operates on Feature Scope, but does not declare any Parent Scopes.

Even though this custom Indexer operates on the same message type and Scope as the default Feature Indexer, the default Feature Indexer declares a Parent Scope (in its case, Session), so the two Indexers are not grouped together when a Feature message is processed. As a result, the message is processed twice against the associated Feature temp-bin: once for each Indexer. A rough sketch of how such a custom Indexer might declare its message type and Scope follows.
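
The sketch below is illustrative only: the exact signature of DefineMessageTypesToExtract and the scope-class names are assumptions here, so consult the default Plugin sources for the authoritative shapes.

// Rough sketch only; method signatures and scope types are assumptions.
public class CompanyFeatureIndexer : IndexerBase
{
    private const string Namespace = "CompanySpecific";

    // The base constructor call declares Feature Scope with no Parent Scopes.
    public CompanyFeatureIndexer(FieldKeyFactory fieldKeyFactory, FeatureScope featureScope)
        : base(fieldKeyFactory, Namespace, featureScope)
    {
    }

    // Declares the message types this Indexer can extract.
    protected override IEnumerable<string> DefineMessageTypesToExtract()
    {
        return new[] { "Feature" };
    }
}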

Temp-bin Acquisition

For each group (Scope) of Indexers, the appropriate ID (messageGroupID, sessionID, featureGroupID, or messageID) is read from the message, and a corresponding temp-bin is acquired from temporary storage.

If the needed temp-bin does not already exist, it is created, and two special values, application and date, are automatically extracted from the message and stored in the temp-bin. This process does not cause any parent temp-bins to be created.

Example 1

We are processing a message with a messageGroupID of A. Our current group of Indexers operates in the Application Run Scope. The Computation Service looks up an Application Run temp-bin with messageGroupID = A and, not finding one, creates it. The newly-created temp-bin is automatically populated with application and date values.

Example 2

We are processing a message with a messageGroupID of A, sessionID of S, and featureGroupID of F. Our current group of Indexers operates in the Feature Scope. The Computation Service looks up a Feature temp-bin with featureGroupID = F and, not finding one, creates it. However, it does not check whether there is a Session temp-bin with sessionID = S; if such a temp-bin does not exist yet (because this message was processed before any session messages, for instance), it is not created.

Extraction Calls

The Computation Service now calls the Extract method of each Indexer in the group in order to process the data from the message and store it in the temp-bin. This is where the Workbench reads and handles the values stored in the message, forming the core of the Extract Phase.

Extract may also define dynamic keys and Output Schemas in the temp-bin.

The return object of the method, ExtractResult, indicates whether changes were made to the temp-bin, and optionally defines a Delayed Action that will be registered.

Example

The Session Indexer's Extract method is called and given a Session temp-bin and a SessionLifeCycle message.

public override ExtractResult Extract
    (IStateBin tempStateBin, EnvelopeAttributes envelopeAttributes, Message message)
{
    if (null == message.Event)
        throw new InvalidOperationException("Event information on message was null");
    ExtractTime(tempStateBin, envelopeAttributes, message);
    if (message.Event.Code.EndsWith("Start"))
    {
        tempStateBin.AddValue(GetFieldKey(SessionStartCount), 1);
        return new ExtractResult(true);
    }
    if (message.Event.Code.EndsWith("Stop"))
    {
        tempStateBin.AddValue(GetFieldKey(SessionStopCount), 1);
        return new ExtractResult(true);
    }
    throw new InvalidOperationException("Unknown event code " + message.Event.Code);
}

The Extract method calls a helper method to extract the message's timestamps, then determines that this message marks the start of a session, not a stop. It stores the value 1 into the temp-bin under the key corresponding to the SessionStartCount field, then returns an ExtractResult indicating that the temp-bin was modified.

Delayed Action Registration

If an Extraction Call returned a Delayed Action, the Computation Service determines the parent Scope and associated ID of the current Indexer. It then attempts to look up a temp-bin identified by these values.

  • If such a temp-bin is found, the Delayed Action runs immediately.
  • Otherwise, the Delayed Action is registered on the current temp-bin, to be activated later.

Example 1

We are processing a feature message with a messageGroupID of A, sessionID of S, and featureGroupID of F, and have just finished extraction from the Feature Indexer, which modified the Feature temp-bin F. However, it also returned a Delayed Action:

return new ExtractResult(true)
{
    DelayedAction = GetSessionAction
};

where GetSessionAction is a static method in the FeatureIndexer class. The Feature Indexer's constructor, when it called the base constructor in IndexerBase, declared that it primarily operates on Feature Scope, but has Session Scope as a parent:

public FeatureIndexer
    (FieldKeyFactory fieldKeyFactory, FeatureScope featureScope, SessionScope sessionScope,
        IFunctionalLogger logger)
    : base(fieldKeyFactory, Namespace, featureScope, sessionScope)
{
    _logger = logger;
}

The sessionID of this message is S. The Workbench looks up the Session temp-bin S and does not find it, so it registers this Delayed Action on Feature temp-bin F to be run when Session temp-bin S is created.

Example 2

The same scenario as Example 1, except that this time, Session temp-bin S does exist. In this case, the Delayed Action runs immediately, modifying both the Session temp-bin and the Feature temp-bin, per the GetSessionAction method. No Delayed Action is ultimately registered on Feature temp-bin F.

Indexer Registration

After a group of Indexers have made all of their Extraction Calls and registered any appropriate Delayed Actions, each Indexer in the group registers itself with the temp-bin, if it hasn't already. Registering means:

  • This temp-bin will be designated as a child of the Indexer's parent Scopes.
    • If the parent temp-bin exists, field values that have AllowCopyToChildren=true will be copied from the parent to the child at this time.
    • This will not create any parent temp-bins if they don't already exist.
  • This temp-bin will have the Indexer's Transforms associated with it, allowing further modification as more messages are processed.
  • This temp-bin will have the Indexer's Output Schemas associated with it, to enable publishing.
    • Additional Output Schemas may be registered based on the currently-installed set of Patterns.

After this, the pipeline repeats from the Temp-bin Acquisition step until all groups of Indexers previously identified have processed the message.

Example

We just finished processing a session message with a messageGroupID of A and sessionID of S, using Session temp-bin S and the Location and Session Indexers. Because this temp-bin was created during the Temp-bin Acquisition step, it does not have any registered Indexers.

The Location Indexer registers itself on this temp-bin, but because that Indexer has no parent scopes, Indexer Transforms, or Output Schemas, this operation otherwise has no effect on the temp-bin.

The Session Indexer registers itself on this temp-bin:

  • The Indexer has Application Run Scope as a parent, so the temp-bin is registered as a child of Application Run temp-bin A.
  • It has an Indexer Transform, LengthTransform, which is registered on the temp-bin.
  • It also has four Output Schemas: StartInformation, StopInformation, CompleteSessionStartInformation, and CompleteSessionStopInformation. All of these are registered on the temp-bin.
  • Additionally, because the OS/Runtime/Location Pattern adds additional Output Schemas to those declared by Indexers, new copies of those four Output Schemas are registered, modified to include the Pattern's Pivot Keys.

At this point, we are finished using this Session temp-bin for this message. Since our grouping identified another group of Indexers, using Message Scope, we repeat from the Temp-bin Acquisition step, this time acquiring a Message temp-bin.

Phase 3: Propagate

After all data in a batch of envelopes has been extracted, the Propagate Phase makes additional changes to temporary storage based on various Indexer-defined conditions that are now satisfied. The Phase operates on each temp-bin that was modified during the Extract Phase, and recursively on any temp-bins that were modified during this Phase.

Input: A temp-bin that was modified during either the Extract Phase or this Phase.
Output: Further changes to temporary storage.
Components Used:

  • Computation Service
  • MongoDB
  • Plugins
    • Indexers

Delayed Action Activation

Other temp-bins may have declared this temp-bin's Scope as a parent. If any of these child temp-bins contain Delayed Actions that have not yet been run, they run at this time.

Each Delayed Action on a particular temp-bin will run only once.

Note: A given temp-bin (e.g. a Session bin) may have many Delayed Actions registered against it, particularly if there is a long-running session. The arrival of a message that triggers the creation of that bin may cause extensive processing to occur as all the Delayed Actions are processed. This may appear to "pause" ingestion for long periods of time.

Example

The Extract Phase finished processing a session message with a sessionID of S. This modified a Session temp-bin with sessionID = S. Previously, a feature message with a sessionID of S and featureGroupID of F was processed. (It arrived first because of network issues.)

When the Feature Indexer ran extraction on the feature message, it declared a Delayed Action for the Session scope, with sessionID S. Now that such a Session temp-bin exists, the Delayed Action is activated, modifying both the Session temp-bin S and the Feature temp-bin F.

Indexer Transforms

An Indexer may define Transforms that are registered on the temp-bins it modifies. If this temp-bin satisfies the criteria for one of its registered Indexer Transforms, the Transform runs, modifying the temp-bin further.

Each Indexer Transform on a particular temp-bin may run only once.

Example

The Session Indexer defines an Indexer Transform, LengthTransform, which requires both the session's start and stop timestamps to be present in the temp-bin before activating. These timestamps are extracted from the session-start and session-stop messages, respectively.

  • The session-stop message arrives first (due to network issues), in its own envelope.
    1. The Session temp-bin is created.
    2. The Session Indexer's Extract method adds the stop timestamp to this temp-bin.
    3. The Session Indexer is registered on this Session temp-bin, which includes registering LengthTransform.
    4. After the batch has been extracted, the Session temp-bin is checked for Indexer Transforms. The LengthTransform is found, but it requires the start timestamp to be in the temp-bin, so the Transform is not run.
  • The session-start message arrives later, also in its own envelope.
    1. The same Session temp-bin is acquired.
    2. The Session Indexer's Extract method adds the start timestamp to this temp-bin.
    3. The Session Indexer has already been registered on this temp-bin, so it is not registered again.
    4. After the batch has been extracted, the Session temp-bin is checked for Indexer Transforms. The LengthTransform is found, and its necessary fields have been populated.
    5. LengthTransform runs, calculating the difference between the two timestamps, and recording it to additional fields in the temp-bin.

Parent Value Copying

All modifications to fields with AllowCopyToChildren=true are copied to this temp-bin's descendant temp-bins, if any exist. If this modifies a descendant temp-bin that was not modified during the Extract Phase, this entire Phase applies recursively to that descendant.

It is possible that many children will need values copied, particularly if there is a long-running session. This may cause the Workbench to delay ingestion for long periods.

Example

The Feature Indexer previously declared the Feature temp-bin F to be a child of Session scope S, and the Session Indexer previously declared the Session temp-bin S to be a child of Application Run scope A. After the OS Indexer extracts operating system information from an ApplicationLifeCycle message, this information will be stored in Application Run temp-bin A.

Now, the Computation Service checks for modifications in Application Run temp-bin A, and notes that OS information has been added. The value of this field (e.g., Windows 8.1) is copied to the Session temp-bin S. Later, when that Session temp-bin is processed, it will pass the value down to Feature temp-bin F.

Thus, even though the OS Indexer operates on Application Run Scope, because the Feature and Session Indexers specified a hierarchy from Application Run Scope to Feature Scope, information specific to the Feature (e.g., its name) can be associated with the OS information. This means the OS/Runtime/Location Pattern can publish this association to permanent storage and allow requests to the Query API to filter Feature information based on OS values.

Phase 4: Publish

The Publish Phase is responsible for updating permanent storage with newly-processed information from temporary storage. Indexers define when these updates occur, and on which fields.

This Phase is also the last Phase that operates on an envelope batch, so it is also responsible for cleaning up the batch and flushing changes to the database.

Input: A temp-bin that was modified during either the Extract Phase or Propagate Phase.
Output: Modifications to permanent storage.
Components Used:

  • RabbitMQ
  • Computation Service
  • MongoDB
  • Plugins
    • Indexers
    • Patterns
    • Custom Data Counters

Output Schemas

Indexers define Output Schemas, which are registered on temp-bins they modify. Each schema specifies which fields are required to be present before publishing.

Patterns can add additional Output Schemas.

At this step, each of the temp-bin's Output Schemas is checked to see whether all of its required fields are populated and the schema is thus ready for publishing. Allowing an Indexer to declare multiple Output Schemas, each publishing a different set of fields all at once, provides both flexibility and consistency when updating permanent storage.

Each Output Schema on a particular temp-bin may run only once.
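
As a rough sketch of what an Indexer-declared Output Schema might look like (borrowing the OutputSchema constructor and property shapes from the Pattern code below; the specific field names and collection types are assumptions):

// Rough sketch only; field names and collection types are assumptions.
var startInformation = new OutputSchema("StartInformation", this)
{
    // The schema activates once a session-start has been recorded in the temp-bin...
    RequiredFields = new HashSet<FieldKey>
    {
        _fieldKeyFactory.GetFieldKey(Namespace, SessionStartCount)
    },
    // ...and its rows in permanent storage are keyed by the hour the session started.
    PivotKeys = new HashSet<FieldKey>
    {
        _fieldKeyFactory.GetFieldKey("Standard", "StartTimeByHour")
    }
};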

Example 1

The Session Indexer defines several Output Schemas, which activate at different times:

  • After a session-start message is extracted, the temp-bin will contain all the required fields for the StartInformation schema to activate; this allows the database to reflect that a session has been observed as soon as possible, without having to wait for the session-stop.
  • After a session-stop message is extracted, the StopInformation schema will be activated, indicating that a session has reported completion.
  • The other two schemas depend on fields defined by the LengthTransform Indexer Transform; that is, these schemas will only activate after both session-start and session-stop messages have been extracted. This ensures that session length information is not published until both timestamps are available and the length has been correctly calculated.

The default Plugins have one Pattern that affects Output Schemas: the OS/Runtime/Location Pattern.

public OutputSchema[] ExtendOutputSchemaWithPattern(OutputSchema[] outputSchemas)
{
    return outputSchemas
        // Skip any schema that already uses the Location Country as a Pivot Key
        // (e.g. the Tamper Indexer's schema, described in Example 2 below).
        .Where(s =>
            !s.PivotKeys.Contains
            (_fieldKeyFactory.GetFieldKey(LocationIndexer.Namespace, LocationIndexer.LocationCountry))
        )
        .Select(outputSchema =>
    {
        // Copy the schema's Pivot Keys and add OS, Runtime, and Country keys.
        var pivotKeys = new HashSet<FieldKey>(outputSchema.PivotKeys)
        {
            _fieldKeyFactory.GetFieldKey(OsIndexer.Namespace, OsIndexer.OsName),
            _fieldKeyFactory.GetFieldKey(RuntimeIndexer.Namespace, RuntimeIndexer.RuntimeInformation),
            _fieldKeyFactory.GetFieldKey(LocationIndexer.Namespace, LocationIndexer.LocationCountry)
        };
        // Register a parallel schema with the same required fields and the extended keys.
        return new OutputSchema(outputSchema.Name + "_OSAndRuntimePattern", _osIndexer)
        {
            RequiredFields = outputSchema.RequiredFields,
            PivotKeys = pivotKeys
        };
    }).ToArray();
}

As a result of this pattern, each of the above Output Schemas can be triggered a second time with three more Pivot Keys, for Operating System, Runtime, and Country. These additional Output Schemas will not be satisfied until the OS, Runtime, and Location Indexers have populated the corresponding fields.

Example 2

The Tamper Indexer defines a single Output Schema, which not only contains information about tamper messages, but also location information. Because this includes the LocationCountry field in the Location namespace, the OS/Runtime/Location Pattern does not apply to this Output Schema (see the Where clause above).

This is important because tamper messages do not always occur during a session, so the Tamper Indexer operates on Message Scope with no Parent Scopes. Because all the fields added by the Pattern are populated at Session Scope, if the Pattern caused the Tamper Indexer's Output Schema to also require them, that schema would never activate: those fields would never be copied from a Session temp-bin to a Message temp-bin.

Publishing (with Aggregation)

Once fields have been selected for publishing from a temp-bin to permanent storage, the new values are actually merged into permanent storage.

The Application ID (including Company and Version) and the timestamp are automatically retrieved from the temp-bin. These, along with the Pivot Keys of the Output Schema, are used to find the unique rows within permanent storage to update or create.

Each Data field in a row is updated with the newly-published value based on the MergeOption of the field.

Example

When the Session Indexer's CompleteSessionStartInformation schema activates, the PivotKey StartTimeByHour (in the Standard namespace) and all Pattern-applied PivotKeys (such as application key) are used to find matching rows in permanent store. In this example, one row matches these keys.

Temp-bin data to be published

FullApplicationId  StartTimeByHour   SessionLength  MaxSessionLength  MinSessionLength  CompleteSessionStartCount
SampleApp_v1       2014-06-12 13:00  00:01:25       00:01:25          00:01:25          1

Permanent store before new data is published

FullApplicationId  StartTimeByHour   SessionLength  MaxSessionLength  MinSessionLength  CompleteSessionStartCount
SampleApp_v1       2014-06-12 13:00  02:41:31       01:21:01          00:05:30          3

Permanent store after publishing

FullApplicationId  StartTimeByHour   SessionLength  MaxSessionLength  MinSessionLength  CompleteSessionStartCount
SampleApp_v1       2014-06-12 13:00  02:42:56       01:21:01          00:01:25          4

Note the different ways the SessionLength, MaxSessionLength, and MinSessionLength fields are updated: though they all have the same new value from the temp-bin, they were each defined with a different MergeOption (Aggregate, Max, and Min, respectively), so the resulting fields in permanent storage each track different values.
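
Purely as an illustration of that arithmetic (this is not Workbench code), the merge of the row above works out as follows:

using System;

class MergeOptionDemo
{
    static void Main()
    {
        TimeSpan storedLength = TimeSpan.Parse("02:41:31");   // existing SessionLength
        TimeSpan storedMax    = TimeSpan.Parse("01:21:01");   // existing MaxSessionLength
        TimeSpan storedMin    = TimeSpan.Parse("00:05:30");   // existing MinSessionLength
        TimeSpan incoming     = TimeSpan.Parse("00:01:25");   // newly-published value

        TimeSpan newLength = storedLength + incoming;                      // Aggregate -> 02:42:56
        TimeSpan newMax    = storedMax > incoming ? storedMax : incoming;  // Max       -> 01:21:01
        TimeSpan newMin    = storedMin < incoming ? storedMin : incoming;  // Min       -> 00:01:25
        int newCount       = 3 + 1;                                        // Aggregate -> 4

        Console.WriteLine($"{newLength} {newMax} {newMin} {newCount}");
    }
}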

Batch Cleanup

Once all temp-bins have been processed by this Phase, the envelopes from which the data was extracted are no longer needed, and can be removed from RabbitMQ. Additional cleanup operations also occur during this step. In summary, the Computation Service:

  1. Flushes pending changes to the temporary and permanent stores, to ensure data consistency before completing the batch.
  2. Adds the messageID of each successfully-extracted message in the batch to the database's list of processed messages, to ensure that it will not be processed again.
  3. Removes all envelopes in the batch from the workbench-endpoint RabbitMQ queue.
  4. Triggers temp-bin cleanup, if necessary.

Example

A batch with three envelopes, X, Y, and Z, was created and processed successfully. The changes are written to the database, including de-duplication information, and the envelopes leave RabbitMQ.

Computation Error Handling

If errors are raised at any time during the Extract, Propagate, or Publish phases, the batch is aborted, remaining phases and steps are skipped, and pending changes are discarded. This mirrors the Batch Cleanup step, except that changes are not saved to the database. The Computation Service:

  1. Logs the error(s).
  2. Removes pending changes to the temporary and permanent stores.
  3. Reconciles the batch with RabbitMQ:
    • If the error was due to a database outage, the envelopes in the batch return to the workbench-endpoint queue, to be retried in the next batch.
    • Otherwise, the envelopes are moved to the workbench-error queue, to be retried at a later time.

Example

A batch with three envelopes, X, Y, and Z, was created. The Extract Phase succeeds on the envelopes in X, but an extraction call for a message in envelope Y threw an exception.

The exception is logged, then all pending changes from the Extract Phase for X and Y are discarded (Z had not been processed yet when the error occurred). All three envelopes are removed from workbench-endpoint and queued in workbench-error.

Phase 5: Query

The Query Phase occurs when the Query API receives a request for data. The nature of this Phase is defined by server-side Queries, in the QueryMetaData property. Note that these too may be augmented by Patterns.

The Queries also have several opportunities to transform or calculate fields through this Phase. However, they make no changes to the database; all transforms operate entirely on working copies before they are returned to the Query API's client.

Input: A request, via HTTP POST to the Query API, to provide data from permanent storage.
Output: A JSON response, per the Data Query specification.
Workbench Components Used:

  • Query Web Service
  • MongoDB (read-only)
  • Plugins
    • Queries
    • Patterns
    • Custom Data Counters

Web Service Request

The Phase begins when a client of the Query API makes a request. The queries available through the Metadata Query are determined by which Queries are loaded.

Example

When a Portal user refreshes a report, the browser makes a request to the Query Web Service. An excerpt of the JSON body follows:

{
    "id" : "preemptive:key-stats",
    "name" : "KeyStats",
    "domain" : "PreEmptive.Sessions",
    "aggregate_on" : [{
            "field" : "AppId_Version"
        }
    ],
    "include_aggregates" : [],
    "fields" : [{
            "type" : "formatted",
            "field" : "AppId_Version",
            "as" : "Application and Version",
            "filter" : {
                // <Sample App, any version>
            },
            "include_data" : true
        }, {
            "type" : "timestamp",
            "field" : "Time",
            "as" : "Date Range",
            "filter" : {
                // <2014-06-12> through <2014-06-13> inclusive
            },
            "include_data" : false
        }, {
            "type" : "number",
            "field" : "CompleteCount",
            "as" : "Complete Sessions",
            "include_data" : true
        }, {
            "type" : "number",
            "field" : "ReturningUsers",
            "as" : "Returning Users",
            "include_data" : true
        }
    ]
}

The designers of the Portal components know about this Query (KeyStats in the PreEmptive.Sessions domain) and its fields because all installed Queries are listed in the Metadata Query.

Database Retrieval

The permanent store is queried for the rows that match the Application and Time filters given by the request. The appropriate fields are loaded into this Phase of the pipeline.

Note that field names will match the Query's specification, and will not necessarily match the names of the associated database-backed fields.

Example

Our request asked the KeyStats Query for Session and User data from the "Sample App" application, occurring on 12 June or 13 June, 2014. We find three rows in the database, as two versions of the application ran on the 12th.

AppId_Version   Time        CompleteCount  UniqueUsers*  NewUsers*
SampleApp_v0.2  2014-06-12  3              {E}           {}
SampleApp_v1.0  2014-06-12  5              {A, B, C}     {}
SampleApp_v1.0  2014-06-13  2              {B, D}        {D}

The loaded fields have different names than the database fields they draw from; for instance, the database's CompleteSessionStartCount is associated with the Key Stats Query field CompleteCount, based on the definition in the Query's QueryMetaData property:

new FieldMetadata
{
    AssociatedFieldKey = _fieldKeyFactory.GetFieldKey
        (SessionIndexer.Namespace, SessionIndexer.CompleteSessionStartCount),
    DataType = typeof(int),
    FieldName = "CompleteCount",
    FriendlyName = "Complete Sessions"
},

Note also that the Time field no longer has an hour associated with it; these results are already aggregated into daily summaries of the hours within each day. This is a performance optimization specific to the Time field: data is stored both by hour and by day in the database, and since we are not aggregating by hour, it is more efficient to query the day-level data.

Simple Field Aggregation

While the permanent store is being accessed, some data is aggregated within MongoDB before it is returned to the application, taking advantage of the database's ability to efficiently summarize data. Simple fields, such as int, are merged using the same rules as Publishing Aggregation. Complex fields are not fully merged at this point; they remain as arrays of values to be merged later.

Example

We have requested data to be aggregated by application, but not by date. Therefore, the rows that differ only by Time will be merged.

Before aggregation

AppId_Version   Time        CompleteCount  UniqueUsers*  NewUsers*
SampleApp_v0.2  2014-06-12  3              {E}           {}
SampleApp_v1.0  2014-06-12  5              {A, B, C}     {}
SampleApp_v1.0  2014-06-13  2              {B, D}        {D}

After aggregation

AppId_Version   CompleteCount  UniqueUsers*         NewUsers*
SampleApp_v0.2  3              [{E}]                [{}]
SampleApp_v1.0  7              [{A, B, C}, {B, D}]  [{}, {D}]

The CompleteCount field, having a simple type, is merged immediately (by its Merge Option, Aggregate, which sums the original rows). The UniqueUsers and NewUsers fields, being of a complex type HyperLogLog, are stored as arrays of their original values*.

Pre-Aggregation Transform

The Query may apply a Query Transform at this point on fields that have been retrieved.

This type of transform is defined by the PreAggregationValueTransform property of a Query metadata's FieldMetadata object.

Example

The Exception Indexer stores the full information of all reported exceptions as a Pivot Key, StackTrace. However, users of the Exception Query might want to disregard differences in the stack trace, instead only grouping by the actual exception message. To accomplish this, the Exception Query essentially "splits" the database Pivot Key into two different Pivot Keys using Pre-Aggregation Transforms:

new FieldMetadata
{
    AssociatedFieldKey = _fieldKeyFactory.GetFieldKey
        (ExceptionIndexer.Namespace, ExceptionIndexer.StackTrace),
    FieldName = "ExceptionMessage",
    FriendlyName = "Message",
    DataType = typeof(string),
    PreAggregationValueTransform = (md, v) => (v as ExceptionInformation).Entries[0].Message
},
new FieldMetadata
{
    AssociatedFieldKey = _fieldKeyFactory.GetFieldKey
        (ExceptionIndexer.Namespace, ExceptionIndexer.StackTrace),
    FieldName = "ExceptionTrace",
    FriendlyName = "Stack Trace",
    DataType = typeof(string),
    PreAggregationValueTransform = (md, v) => (v as ExceptionInformation).ToString()
},

This causes additional grouping, if necessary, in the next step.

Complex Field Aggregation

Query-time aggregation completes when the application merges the remaining complex values.

Example

Previously, Simple Field Aggregation produced the following results:

AppId_Version   CompleteCount  UniqueUsers*         NewUsers*
SampleApp_v0.2  3              [{E}]                [{}]
SampleApp_v1.0  7              [{A, B, C}, {B, D}]  [{}, {D}]

The complex fields, UniqueUsers and NewUsers, can now be merged by application code:

AppId_Version   CompleteCount  UniqueUsers*  NewUsers*
SampleApp_v0.2  3              {E}           {}
SampleApp_v1.0  7              {A, B, C, D}  {D}

For instance, the UniqueUsers field in the v1.0 row has been merged from {A, B, C} and {B, D} into {A, B, C, D}.
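
As a purely illustrative sketch (the real fields are HyperLogLog structures, which merge approximately rather than as exact sets), this merging behaves like a set union:

using System;
using System.Collections.Generic;

class ComplexMergeDemo
{
    static void Main()
    {
        // The two UniqueUsers values for SampleApp_v1.0, kept as an array by the previous step.
        var pending = new[]
        {
            new HashSet<string> { "A", "B", "C" },
            new HashSet<string> { "B", "D" }
        };

        // Merging behaves like a union: {A, B, C} + {B, D} -> {A, B, C, D}.
        var merged = new HashSet<string>();
        foreach (var set in pending)
            merged.UnionWith(set);

        Console.WriteLine(string.Join(", ", merged));   // contains A, B, C, D
    }
}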

Post-Aggregation Computation

After aggregation, computed fields may be derived from the data.

This type of computation is defined by the PostAggregationComputation property of a Query metadata's ComputedFieldMetadata object.

Example

Our Sample Indexer contains a Pivot Key called MemoryBucket, which groups the available memory in a session into buckets by megabyte (MB). Because we were not doing any calculations on this field, we originally defined it as a String; but suppose a consumer of our Query now wants the value to be numeric and expressed in gigabytes (GB). We can define an additional field, called MemoryBucketGB, for this purpose, in the Sample Query QueryMetaData:

return new QueryMetadata
{
    Name = "Sessions By Memory",
    ComputedFields = new List<ComputedFieldMetadata>()
    {
        new ComputedFieldMetadata()
        {
            DataType = typeof(int),
            FieldName = "MemoryBucketGB",
            FriendlyName = "Memory Amount in Gigabytes",
            Fields = new []{"MemoryBucket"},
            PostAggregationComputation = dict =>
            {
                return Int32.Parse((string)dict["MemoryBucket"]) / 1024;
            }
        }
    },
    Fields = new List<FieldMetadata>
    {
        // ...omitted...
    }
};

Post-Aggregation Transform

After aggregation, further Query Transforms on both database-backed and computed fields may run.

This type of transform is defined by the PostAggregationValueTransform property of a Query metadata's FieldMetadata and ComputedFieldMetadata objects.

Example

The Workbench's default User fields (such as UniqueUsers and NewUsers) use a third-party object, called a HyperLogLog, to track hashed user identities. However, the Key Stats Query needs to report the number of users represented by each tracking object. This is done by transforming each field value from the tracking object to a count of the elements in the set (i.e., its cardinality):

Before transform

AppId_Version   CompleteCount  UniqueUsers*  NewUsers*
SampleApp_v0.2  3              {E}           {}
SampleApp_v1.0  7              {A, B, C, D}  {D}

After transform

AppId_Version   CompleteCount  UniqueUsers  NewUsers
SampleApp_v0.2  3              1            0
SampleApp_v1.0  7              4            1
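
A rough sketch of how such a transform might be declared, assuming PostAggregationValueTransform takes the same (metadata, value) lambda shape as the PreAggregationValueTransform shown earlier; the User Indexer field names and the CountUsers helper below are hypothetical.

new FieldMetadata
{
    // Hypothetical field names; the actual User Indexer constants may differ.
    AssociatedFieldKey = _fieldKeyFactory.GetFieldKey(UserIndexer.Namespace, UserIndexer.UniqueUsers),
    FieldName = "UniqueUsers",
    FriendlyName = "Unique Users",
    DataType = typeof(int),
    // Replace the HyperLogLog tracking object with its cardinality (the user count).
    PostAggregationValueTransform = (md, v) => CountUsers(v)   // CountUsers is a hypothetical helper
},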

Post-Transform Computation

At this point, all transforms are complete, but additional computed fields may be generated.

This type of computation is defined by the PostTransformComputation property of a Query metadata's ComputedFieldMetadata object.

Example

The ReturningUsers field in the KeyStats query isn't backed by a database field; rather, it is calculated based on the values of the UniqueUsers and NewUsers fields.

Before computation

AppId_Version   CompleteCount  UniqueUsers  NewUsers
SampleApp_v0.2  3              1            0
SampleApp_v1.0  7              4            1

After computation

AppId_Version   CompleteCount  UniqueUsers  NewUsers  ReturningUsers
SampleApp_v0.2  3              1            0         1
SampleApp_v1.0  7              4            1         3
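
A sketch of how such a computed field might be defined, assuming PostTransformComputation takes the same dictionary-based lambda shape as the PostAggregationComputation example earlier; the actual Key Stats Query implementation may differ.

new ComputedFieldMetadata
{
    DataType = typeof(int),
    FieldName = "ReturningUsers",
    FriendlyName = "Returning Users",
    Fields = new []{"UniqueUsers", "NewUsers"},
    // Returning users are those seen in the period who were not new to it.
    PostTransformComputation = dict => (int)dict["UniqueUsers"] - (int)dict["NewUsers"]
}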

Output

Finally, the rows, with the requested fields, are returned to the Query API's client.

Example

A JSON response is sent back to the source of the request, e.g., the Portal. An abbreviated example:

{
    "table" : {
        "id" : "673d33b9-76f0-4086-9f46-45750eb0d169",
        "title" : "KeyStats",
        "columns" : [{
                "label" : "AppId_Version",
                "formatted_label" : "Application and Version",
                "column_type" : "formatted"
            }, {
                "label" : "CompleteCount",
                "formatted_label" : "Complete Sessions",
                "column_type" : "number"
            }, {
                "label" : "ReturningUsers",
                "formatted_label" : "Returning Users",
                "column_type" : "number"
            }
        ],
        "rows" : [{
                "values" : [<Sample App v0.2>, 3, 1]
            }, {
                "values" : [<Sample App v1.0>, 7, 3]
            }
        ]
    }
}

* These fields are represented here as sets of users, though in reality they are HyperLogLog data.



Workbench Version 1.2.0. Copyright © 2016 PreEmptive Solutions, LLC