PreEmptive Analytics Workbench User Guide

Queries

Queries define which data is requested from the Query Web Service for use in the Portal.

Query configuration files have the extension .query.

Query Configuration

A Query configuration file lets you specify:

  • which fields to fetch from the server.
  • any server-side aggregation to do.
  • which aggregated fields should not be retrieved from the server (i.e. "include_data" : false), if you only want to use them for server-side aggregation, not for display or client-side calculations.

This creates a Query Dataset for use by Transformations and Widgets.

Example Syntax

{
    "domain" : "PreEmptive.Sessions",
    "name" : "KeyStats",
    "fields" : [
        {
            "field" : "AppId_Version",
            "include_data" : false
        },
        {
            "field" : "Time"
        },
        {
            "field" : "StartCount"
        },
        {
            "field" : "StopCount"
        },
        {
            "field" : "CompleteCount"
        },
        {
            "field" : "NewUsers"
        },
        {
            "field" : "ReturningUsers"
        },
        {
            "field" : "UniqueUsers"
        },
        {
            "field" : "Count"
        },
        {
            "field" : "TotalLength"
        },
        {
            "field" : "MinLength"
        },
        {
            "field" : "MaxLength"
        }
    ],
    "aggregate_on" : [
        {
            "field" : "Time",
            "options" : "day"
        }
    ]
}
  • domain (string): the domain of Server Query, as provided by the Metadata Query
  • name (string): the name of the Server Query, as provided by the Metadata Query
  • fields (object): an array of fields to fetch with the query, or filter on. Each field can have:
    • field (string): the name of the field, as defined by the Metadata Query
    • include_data (optional, boolean): whether to return the data with the query or not. Defaults to true. If the field is only being used to filter on the server, you can set this to false to allow filtering without returning the values of the field.
  • aggregate_on: an array of fields to aggregate on the server. Each field can have:
    • field (string): the name of the field, as defined by the Metadata Query
    • options: Time fields must specify how they should be aggregated: (hour|hour_of_day|day|week|month).

Note that the order of the fields specified does not reflect the order they will ultimately be displayed in. The display order is dependent on the Widget used to display the fields.

Metadata Query

The server provides a "metadata query" that catalogs all the available Server Queries and their details (domain, name, fields). You can access this query directly to see which queries are available from the server: 127.0.0.1:88/Interaction/query_metadata (or substitute the appropriate host name).

This query can be viewed in JSON format as well: 127.0.0.1:88/Interaction/query_metadata?format=json

A note on dates and timestamps

When a query receives a response from the server, it automatically converts any field with a timestamp type (as specified in the metadata and query response) from a .NET timestamp to JS date. It also changes the data_type of the field to date so that this change is understood by downstream components.

Note that this does not apply to timelength types, which remain expressed in .NET ticks.

Making Meaningful Queries

Before you begin making queries, first take some time to understand what the data actually means. There are some aspects that might surprise you.

Let's take an example. Imagine that you want to know how many times each application feature was used on a given day. So you make a query that asks for the feature name, the count, and the sum of feature lengths, and then you limit the query to just yesterday's data. You expect to get back a table like this:

Feature Name Count Length
ClickButton 100 0
RunJob 78 3000
JobError 22 0

(Note that Length is only reported for features that use separate Start and Stop messages; Feature Ticks always have zero-length.)

This simple example contains a number of possible surprises, as detailed below.

You must aggregate certain fields

In order to provide a meaningful "Count" value, the query has to be counting something. In this case, you wanted to count distinct "Feature Name" values (i.e. ClickButton, RunJob, JobError). So when you specified the query, if you didn't tell it to aggregate the Feature Name field, then you wouldn't have gotten meaningful results. If you are familiar with SQL, this is roughly equivalent to the GROUP BY clause.

More than that, though, there are certain fields that must be aggregated - and FeatureName is one of them (as defined by the default indexers, which can be changed). The database doesn't keep track of each individual feature event; it only tracks aggregate data, and it automatically aggregates by Feature Name. So there is no meaningful way to ask for data about Feature Names without asking for aggregate data.

The changing truth about yesterday

In our example above we asked for yesterday's data, and we saw the result in the example table.

Let's imagine that today is Wednesday and we ran that query for Tuesday. But now imagine that it's tomorrow - Thursday - and we run the query again for Tuesday. It's quite likely that we'll see different results.

The reason is that the query is telling us about the date on which the event happened, but the data about that date doesn't necessarily all arrive on that day. Imagine that someone used the RunJob feature on Tuesday, but then their app crashed or shut down quickly. It might not have had time to send the message to the server, telling it about the run. (This is especially common on mobile platforms.) The next time they run the app, it will send the cached information about the RunJob feature (and about the crash, if there was one). But if they don't run the app again until Thursday, then the server won't hear about Tuesday's run until Thursday - so Wednesday's query results will be "incomplete".

This is a very common occurrence with application analytics - the results will continue to change so long as new data continues to come in.

Ticks, Starts, Stops, and Total

Feature data can come from applications in two forms - as a "tick" or as a pair of "start" and "stop" messages. A tick is just a single message that tells us that a feature happened at a moment in time. A start/stop pair, on the other hand, tells us when a feature started, and when it stopped. Both types of data are reported by the same query, and it's important to understand the difference when you construct a query.

Think about the "Count" field in the example table above. What is it counting? Is it counting ticks? Starts? Stops? Note that there is no guarantee that a Start will always be accompanied by a Stop, or vice-versa.

In fact, there are a number of different fields - StartCount, StopCount, TickCount, and Count (which can be thought of as "total features" - the number of ticks and pairs of start/stop messages). (Plus two more, discussed below.) Each of them counts a specific aspect of the feature data, and you have to choose which one(s) you want, when you construct your query.

Note that Count does what you might expect - if a Start and Stop are received on the same day, Count will only report 1 for that day.

This same consideration also applies to Session data, although in simpler form. Session data doesn't have the concept of a "tick" but it still has "start" and "stop" messages, and the corresponding options for how to count them.

Complete and Incomplete counts

When querying for session and/or feature data, one of the options is to ask for CompleteCount or IncompleteCount counts. CompleteCount tells you that both the Start and Stop were received. IncompleteCount tells you that one or the other was received, but not both.

There's a hidden complexity here, that can be demonstrated with an example. Imagine that a session was started on Monday but didn't finish until Tuesday. Now imagine a query that just asked for Monday's Complete and Incomplete counts. What should be reported?

What the Workbench does is report 1 Complete and 0 Incomplete sessions. In other words, it reports the truth based on all the data available to it, even if that data (i.e. the Stop message) isn't in the query window.

There's another complexity here - imagine that the query asked for Tuesday's data. What should be reported?

The Workbench will report 0 Complete and 0 Incomplete sessions. This is because Complete sessions (and features) are always treated as if they happened at the start time, regardless of how long they lasted and when they ended. This helps make queries consistent across different ranges, and makes more-complex calculations that use Complete sessions as an input work out appropriately.

Incomplete sessions (and features) are always reported at whichever time we are aware of, e.g. the Start time or the Stop time.

New, Returning, and Unique users

The Workbench keeps track of three distinct concepts that seem easy to understand but often have surprising behaviors: New users, Returning users, and Unique users. Part of the complexity arises because these values are determined a query-time using advanced algorithms that give a better answer than simple counting would provide.

Let's start with New users. The Workbench considers a user to be "new" if it has never seen their unique ID before. (The ID is specified by the analytics client - it is typically a hashed version of the user ID and the hostname.) A user can only be considered "new" exactly once. So if they were New on Tuesday, then Tuesday is the only day that they'll ever be reported as New.

But what if the same user ID was reported for two different instrumented apps on the same day? In such a case, if a query just asked for total New users, the answer would just be 1. But if a query asked for New user counts, by application, then each application will report a single New user.

Now let's look at Returning users. The Workbench considers a user to be "returning" if it has seen their unique ID before. So if they were New on Tuesday, and a query just asks for Friday's data, they'll still just appear as Returning (assuming they used the app on Friday).

Now let's look at Unique users. The Workbench counts each distinct user ID as a unique user, within whatever period is being counted. So if one user used an app on Monday and Tuesday, then a query for total unique users in that week will report 1. But a query that asks for that week's data, aggregated by Day, will report 1 on Monday and 1 on Tuesday.

What if that user used two different applications (and both reported the same user ID)? It depends on how the query is structured. If the query just asks for total unique users, then the answer will be 1. But if the query asks for unique users per application, then each application will report 1 (i.e. a total of 2).

Finally, there's a complexity with the way user counts interact with incomplete sessions. Specifically, if a Stop message is the only message received for a session, then the user associated with that Stop will not be counted. The Workbench always allocates users to the Start time, so if the start time is missing, the user isn't counted.



Workbench Version 1.2.0. Copyright © 2016 PreEmptive Solutions, LLC