PreEmptive Analytics Workbench User Guide

Computation Architecture

At its heart, the Workbench computation process is very similar to a traditional Extract-Transform-Load (ETL) pattern. The Workbench consists of a series of software components called Indexers that interact with the Workbench core to define how data is extracted and published. The data in permanent storage is exposed by Server Queries, which are made available through the Query API.

Indexers and Queries, as well as Enhanced API components, are contained in .NET binaries called Plugins. While the Workbench comes with a robust set of Plugins for general use cases, they can be supplemented or replaced with more customized data-processing components, when customizing the Portal is not sufficient.

This page explains the basics of the Workbench's data processing. The remainder of this section provides detailed instructions on creating Plugins, as well as a detailed reference description of the data-processing pipeline.

Database Structure

The core database of the Workbench is backed by a local MongoDB instance. The database is divided into two fundamental parts:

  • Temporary storage contains state-tracking data that is "live" and is to be later aggregated into permanent storage.
  • Permanent storage contains data that has been indexed and aggregated, and is ready for retrieval by the Query API.

Temporary Bins

Temporary storage consists of temporary bin objects. These "temp-bins" store data that has been processed by Indexers but may not yet have been published into permanent storage, and thus is not yet available for querying.

Temp-bins consist of key-value pairs corresponding to database fields. For instance, if there is a FeatureName field in permanent storage, an Indexer would extract the Feature Name from an incoming message and store it in a temp-bin (e.g., as {"FeatureName": "The Extracted Name of the Feature"}). When the temp-bin is ready to be published, these keys are used to match each value to the permanent-storage field it is merged into.

Each temp-bin is defined by a Scope, which consists of a Scope type and an associated scope ID. There are four types of Scopes:

  • Application Run, with a messageGroupID - contains data pertaining to an entire run of a client application. Each time a client application starts it creates a messageGroupID to be used in all outgoing messages from that run of the application.
  • Session, with a sessionID - contains data pertaining to a particular session during the application run. Some APIs permit multiple sessions during an application run (each with a unique sessionID), but many do not.
  • Feature, with a featureGroupID - contains data pertaining to an instance of a Feature. When a Feature Start message is generated by the client, an associated featureGroupID is also created, and used to correlate that Feature Start with a matching Feature Stop (Feature Tick messages also create this ID).
  • Message, with a messageID - contains data from a single analytics message. Each message has its own messageID.

Each Indexer defines a Scope type to work with, which determines what kind of temp-bin it will be able to store processed data in. For instance, the User Indexer operates on Application Run Scope, because information about the user of the application applies to all sessions that are started by a particular application run. On the other hand, the Custom Data Indexer operates on the Message Scope, because many types of messages could potentially contain Custom Data.

Indexers that use the same Scope type will share the same temp-bins. For instance, the Sample Indexer defined in this guide uses Application Run Scope, so it can access data extracted by the Application Run Indexer.
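
As a rough illustration of how Scopes select temp-bins, the following Python sketch models temporary storage as a dictionary keyed by Scope type and scope ID. It is a conceptual model only, not the Workbench API: the TempStorage class, the scope-type labels, and the sample message are all hypothetical.

```python
from collections import defaultdict

# Hypothetical scope-type labels, each paired with the ID named in the list above.
SCOPE_ID_FIELD = {
    "ApplicationRun": "messageGroupID",
    "Session": "sessionID",
    "Feature": "featureGroupID",
    "Message": "messageID",
}


class TempStorage:
    """Toy in-memory stand-in for temporary storage (not the Workbench API)."""

    def __init__(self):
        # One temp-bin (a dict of field -> value) per (scope type, scope ID).
        self._bins = defaultdict(dict)

    def bin_for(self, scope_type, message):
        scope_id = message[SCOPE_ID_FIELD[scope_type]]
        return self._bins[(scope_type, scope_id)]


storage = TempStorage()
message = {"messageGroupID": "run-1", "sessionID": "s-1",
           "messageID": "m-1", "user": "alice"}

# Two hypothetical Indexers that both use Application Run Scope share one bin.
storage.bin_for("ApplicationRun", message)["UserName"] = message["user"]
storage.bin_for("ApplicationRun", message)["RunCount"] = 1

# An Indexer that uses Message Scope writes to a different bin.
storage.bin_for("Message", message)["CustomData"] = "x"
```

In this model, the two writes under Application Run Scope land in the same temp-bin because they resolve to the same messageGroupID, while the Message Scope write is kept separately.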

While there is no inherent hierarchy among Scopes, and thus between temp-bins, Indexers may also define Parent Scopes. Doing so causes data from the Parent Scope's temp-bin to also be available in the temp-bin the Indexer normally operates on. With the default Plugins, this means that Application Run is a parent of Session, which is a parent of both Feature and Message. See the Advanced page for an example.
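
The effect of Parent Scopes can be sketched in Python as follows. This is only a conceptual model under stated assumptions: the PARENT table mirrors the default hierarchy described above, the visible_fields helper is hypothetical, and the rule that a closer Scope's value wins on a name collision is an assumption made for the sketch rather than documented Workbench behavior.

```python
# Hypothetical parent chain matching the default Plugins described above.
PARENT = {"Session": "ApplicationRun", "Feature": "Session", "Message": "Session"}


def visible_fields(scope_type, bins):
    """Combine a Scope's temp-bin with the temp-bins of its ancestors.

    `bins` maps scope type -> that Scope's temp-bin for the current message.
    Assumption: on a name collision, the closer Scope's value wins.
    """
    chain = []
    current = scope_type
    while current is not None:
        chain.append(current)
        current = PARENT.get(current)

    view = {}
    for scope in reversed(chain):  # outermost ancestor first
        view.update(bins.get(scope, {}))
    return view


bins = {
    "ApplicationRun": {"UserName": "alice"},
    "Session": {"SessionLength": 30},
    "Message": {"CustomData": "x"},
}
message_view = visible_fields("Message", bins)
```

Here message_view contains the Application Run and Session data alongside the Message data, while the Application Run view is unaffected by its children.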

Permanent Storage

Permanent Storage can be conceptualized as a traditional tabular database, with columns defined by database fields. Each "row" in this database is uniquely identified by a combination of Pivot Keys (a type of field), as well as two special keys:

  • Application - the identity of the client application associated with the data. This includes the Company and Application IDs, the application name, and application version.
  • Date - the time at which the data was originally generated. This is always expressed in UTC, with a granularity of 1 hour.

These values are automatically extracted and published by the Workbench. There is also a default Plugin, AppAndTimeQueryPattern, that exposes this data to the Query API.
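
To make the row key concrete, the Python sketch below builds a key from an Application identity, an hourly UTC Date, and any Pivot Key values. The hour_bucket and row_key helpers and the sample values are hypothetical, and the sketch assumes that "granularity of 1 hour" means truncating the timestamp to the start of its hour; the Workbench performs this extraction itself.

```python
from datetime import datetime, timedelta, timezone


def hour_bucket(timestamp):
    """Convert an aware datetime to UTC and truncate it to the hour."""
    utc = timestamp.astimezone(timezone.utc)
    return utc.replace(minute=0, second=0, microsecond=0)


def row_key(application, timestamp, pivot_values=()):
    """Build a hypothetical row key: Application, Date, then Pivot Key values."""
    return (application, hour_bucket(timestamp)) + tuple(pivot_values)


# A message generated at 09:47:12 in a UTC-5 time zone.
local = datetime(2016, 3, 1, 9, 47, 12, tzinfo=timezone(timedelta(hours=-5)))
application = ("company-id", "app-id", "MyApp", "1.2")
key = row_key(application, local, ("Windows",))
```

Under that assumption, every message from the same Application with the same Pivot Key values in the same UTC hour maps to the same row.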

Permanent Storage consists of multiple "tables", each having a different set of Pivot Keys, allowing queries to quickly find data that is relevant to their request. For instance, there may be two tables with the following keys:

  1. Application, Date
  2. Application, Date, OS, Runtime, Location

If a query requests data organized by Operating System, or only from a certain Location, Permanent Storage will provide table #2. If the query doesn't care about these fields, Permanent Storage will provide table #1.

Note that table #2 provides its three additional fields together, rather than there being a separate table for each field. This is because these three fields are published together, via the OsRuntimeLocationPattern. This concept is important when defining filterable fields for queries, but otherwise is handled automatically by the Query Web Service.
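
The table choice described above can be sketched in Python. This is an illustrative model only: the choose_table helper is hypothetical, and the rule of picking the covering table with the fewest keys is an assumption that happens to reproduce the two cases in the example.

```python
# The two example tables from the text, described by their keys.
TABLES = [
    ("Application", "Date"),
    ("Application", "Date", "OS", "Runtime", "Location"),
]


def choose_table(requested_fields, tables=TABLES):
    """Return the table with the fewest keys that covers every requested field."""
    needed = {"Application", "Date"} | set(requested_fields)
    candidates = [table for table in tables if needed <= set(table)]
    if not candidates:
        raise LookupError(f"no table contains all of {sorted(needed)}")
    return min(candidates, key=len)


plain = choose_table([])             # query ignores OS, Runtime, and Location
by_os = choose_table(["OS"])         # query groups by Operating System
by_place = choose_table(["Location"])  # query filters on Location
```

A query that touches any one of OS, Runtime, or Location is served from the same table, because those three keys are only available together.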

Namespacing

Extracted data may be shared among indexers and other components. To allow similar names to be used by multiple components without interference, each field in temporary and permanent storage additionally has a namespace, which is usually defined on a per-component basis (i.e., each Indexer has its own namespace, and all fields declared by that Indexer use that namespace by default).

This is not to be confused with the Domain declared by Server Queries, which is exposed to consumers of the Query API and applies to the visible queries and fields.

Database Fields

Fields in temporary and permanent storage are defined by each Indexer's DefineFields method, and have the following characteristics:

  • Namespace (see above).
  • Name used internally (each Server Query may define its own names for fields).
  • Type, a compatible C# type, such as Int32 or String.
  • FieldType, which determines how it affects rows in permanent storage; one of:
    • PivotKey - will be part of the unique key for rows in permanent storage,
    • Data - will be a non-key field in permanent storage, or
    • Metadata - will not be stored in permanent storage (only in temporary storage, for use by other calculations).
  • MergeOption, which determines how Data fields from temporary storage merge their values into permanent storage; one of:
    • Aggregate (default) - sum the existing (permanent) value with the new (temporary) value,
    • Max - take the maximum of the existing and new values,
    • Min - take the minimum of the existing and new values, or
    • Replace - always take the temporary value.
  • AllowLargeValue flag, which allows String values larger than 1024 characters (at a performance cost). Defaults to false.
  • AllowCopyToChildren flag, which determines whether the field is usable by child scopes. Defaults to false.

The public fields exposed by the Query API are defined separately, by Server Queries, but usually correspond to these internal database fields.
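
The FieldType and MergeOption rules listed above can be illustrated with a small Python model. It is a sketch under stated assumptions, not the DefineFields API: the Field class, the publish helper, and the "Sample" namespace are hypothetical, and only the merge semantics are taken from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Toy model of a field definition; option names mirror the list above."""
    namespace: str
    name: str
    field_type: str = "Data"         # "PivotKey", "Data", or "Metadata"
    merge_option: str = "Aggregate"  # "Aggregate", "Max", "Min", or "Replace"


MERGE = {
    "Aggregate": lambda existing, new: existing + new,
    "Max": max,
    "Min": min,
    "Replace": lambda existing, new: new,
}


def publish(row, temp_bin, fields):
    """Merge Data fields from a temp-bin into a permanent-storage row (a dict)."""
    for field in fields:
        key = (field.namespace, field.name)  # the namespace keeps names distinct
        if field.field_type != "Data" or key not in temp_bin:
            continue  # Pivot Keys identify the row; Metadata is never stored
        if key in row:
            row[key] = MERGE[field.merge_option](row[key], temp_bin[key])
        else:
            row[key] = temp_bin[key]
    return row


fields = [
    Field("Sample", "Count"),
    Field("Sample", "Longest", merge_option="Max"),
    Field("Sample", "Scratch", field_type="Metadata"),
]
row = {("Sample", "Count"): 4, ("Sample", "Longest"): 10}
temp_bin = {("Sample", "Count"): 2, ("Sample", "Longest"): 7,
            ("Sample", "Scratch"): 99}
publish(row, temp_bin, fields)
```

After publishing, Count has been summed, Longest keeps the larger existing value, and the Metadata field Scratch does not appear in the row.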



Workbench Version 1.2.0. Copyright © 2016 PreEmptive Solutions, LLC