PreEmptive Analytics Workbench User Guide

Temp-bin Cleanup

Recall from the Data-Processing Overview that indexers can save message data in temporary storage bins (temp-bins) so that it can be combined with later messages to generate new data. To prevent this storage from growing indefinitely, the Workbench periodically removes bins that are considered outdated relative to the most recently processed messages.

Configuration

To adjust the settings for this process, edit the following keys in <Installation Folder>\Windows\Computation Service\WorkbenchComputationService.exe.config and then restart the Computation Service:

<appSettings>
    <add key="StaleTime" value="30.00:00:00"/> <!-- Days.HH:MM:SS -->
    <add key="GcCollectEvery" value="1000"/>
    <add key="GcMaximumCollectionInterval" value="01:00:00"/>
    <add key="GcRollingCount" value="5000"/>
</appSettings>

Where

  • StaleTime is how long a bin must have gone without an update, relative to the most recently processed messages, before it is removed. The default is 30.00:00:00 (30 days).
  • GcCollectEvery is the number of messages processed since the last cleanup that will trigger a new cleanup. The default is 1000.
  • GcMaximumCollectionInterval is the amount of time elapsed since the last cleanup that will trigger a new cleanup. The default is 01:00:00 (1 hour).
  • GcRollingCount is the number of messages to consider "recent". The default is 5000.
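
As a rough illustration of how these values are read, the following Python sketch parses an appSettings fragment like the one above and converts the two time values from the Days.HH:MM:SS form into durations. It is not part of the Workbench; the function names parse_timespan and load_settings are hypothetical, and the parser only handles the [d.]hh:mm:ss forms that appear in this section.

```python
import re
import xml.etree.ElementTree as ET
from datetime import timedelta

# Same keys and defaults as the fragment shown in this section.
APP_SETTINGS = """
<appSettings>
    <add key="StaleTime" value="30.00:00:00"/>
    <add key="GcCollectEvery" value="1000"/>
    <add key="GcMaximumCollectionInterval" value="01:00:00"/>
    <add key="GcRollingCount" value="5000"/>
</appSettings>
"""

# Assumption: only the [d.]hh:mm:ss forms used in this guide are supported.
_TIMESPAN = re.compile(
    r"^(?:(?P<days>\d+)\.)?(?P<h>\d{1,2}):(?P<m>\d{2}):(?P<s>\d{2})$"
)


def parse_timespan(text: str) -> timedelta:
    """Convert a Days.HH:MM:SS (or HH:MM:SS) string into a timedelta."""
    match = _TIMESPAN.match(text.strip())
    if match is None:
        raise ValueError(f"not a [d.]hh:mm:ss value: {text!r}")
    return timedelta(
        days=int(match.group("days") or 0),
        hours=int(match.group("h")),
        minutes=int(match.group("m")),
        seconds=int(match.group("s")),
    )


def load_settings(xml_text: str) -> dict:
    """Read the four cleanup keys from an appSettings fragment."""
    root = ET.fromstring(xml_text.strip())
    raw = {add.get("key"): add.get("value") for add in root.iter("add")}
    return {
        "StaleTime": parse_timespan(raw["StaleTime"]),
        "GcCollectEvery": int(raw["GcCollectEvery"]),
        "GcMaximumCollectionInterval": parse_timespan(
            raw["GcMaximumCollectionInterval"]
        ),
        "GcRollingCount": int(raw["GcRollingCount"]),
    }


settings = load_settings(APP_SETTINGS)
print(settings["StaleTime"])                    # 30 days, 0:00:00
print(settings["GcMaximumCollectionInterval"])  # 1:00:00
```

A check like this can be useful for confirming that an edited value is well-formed before restarting the Computation Service.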

Adjusting the Cleanup Frequency

While it runs, the temp-bin cleanup process stops ingestion by the Computation Service in order to prevent an invalid database state. By default, a cleanup runs after every 1000 messages processed or every hour, whichever comes first. The frequency of these cleanups can be configured.

A cleanup is triggered when a batch is processed from the endpoint queue and either of the following is true since the last cleanup (or since the Computation Service started):

  • More than GcCollectEvery messages have been processed, or
  • More than the time specified by GcMaximumCollectionInterval has elapsed.
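
The trigger rule above can be sketched in Python as follows. This is only an illustration of the two conditions, not the Workbench's implementation; the function name cleanup_due and its parameters are hypothetical, and the sketch assumes the check is made once per processed batch with ordinary clock times.

```python
from datetime import datetime, timedelta


def cleanup_due(
    messages_since_cleanup: int,
    last_cleanup: datetime,
    now: datetime,
    gc_collect_every: int = 1000,
    gc_maximum_collection_interval: timedelta = timedelta(hours=1),
) -> bool:
    """Return True if either trigger condition holds.

    last_cleanup is the time of the last cleanup, or the service start
    time if no cleanup has run yet.
    """
    too_many_messages = messages_since_cleanup > gc_collect_every
    too_much_time = now - last_cleanup > gc_maximum_collection_interval
    return too_many_messages or too_much_time


started = datetime(2016, 1, 1, 12, 0, 0)
# Message-count trigger: 1001 messages after only 5 minutes.
print(cleanup_due(1001, started, started + timedelta(minutes=5)))  # True
# Time trigger: few messages, but more than an hour has elapsed.
print(cleanup_due(200, started, started + timedelta(minutes=61)))  # True
# Neither threshold exceeded.
print(cleanup_due(200, started, started + timedelta(minutes=5)))   # False
```

Both comparisons are strict, matching the "more than" wording of the two conditions.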

Increasing these values will decrease the frequency of cleanups, but may increase the amount of data that needs to be removed in each cleanup process (as more messages will be processed between cleanup triggers).

Adjusting the Cleanup Size

In general, a bin is eligible for removal if the time of its last update was more than StaleTime ago. By default, bins that have not been updated within the last 30 days will be removed. For most customers, this default is suitable, but some administrators may need to adjust StaleTime to fit their use cases.

Reducing the Cleanup Size

If the Workbench will receive an exceptionally large amount of data, reducing the StaleTime setting will decrease disk usage and increase ingestion performance.

However, reducing this value too much may prevent data from multiple messages from being correlated correctly. Consider the case where StaleTime is 15.00:00:00. If a session sends a Session.Start message, sends no data for 20 days, and then sends a Session.Stop message, the information from the Session.Start will already have been removed by the time the Session.Stop arrives. The result is two incomplete sessions being recorded rather than one complete session.

Note: For the purposes of temp-bin cleanup, when a child bin is modified, its ancestors are considered modified as well. In the previous scenario, if feature messages from that session had been regularly received within the 20 days, the session's last-updated time would continue to be updated as well, preventing the Session.Start information from being removed prematurely.
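
The scenario and the note above can be sketched in Python. This is a simplified model under stated assumptions, not the Workbench's data structures: the Bin class, its touch method, and is_stale are hypothetical names, and the reference time (derived by the Workbench from recently processed messages) is simply passed in as a parameter.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Bin:
    """Hypothetical temp-bin with a last-updated time and an optional parent."""
    name: str
    last_updated: datetime
    parent: Optional["Bin"] = None

    def touch(self, when: datetime) -> None:
        """Mark this bin and all of its ancestors as modified at `when`."""
        node = self
        while node is not None:
            node.last_updated = max(node.last_updated, when)
            node = node.parent


def is_stale(bin_: Bin, reference_time: datetime, stale_time: timedelta) -> bool:
    """A bin is eligible for removal if its last update is older than stale_time."""
    return reference_time - bin_.last_updated > stale_time


stale_time = timedelta(days=15)
start = datetime(2016, 1, 1)
stop = start + timedelta(days=20)

# Case 1: Session.Start, then silence for 20 days.
quiet_session = Bin("session", last_updated=start)
print(is_stale(quiet_session, stop, stale_time))   # True

# Case 2: a feature message arrives on day 10 and updates the session too.
active_session = Bin("session", last_updated=start)
feature = Bin("feature", last_updated=start, parent=active_session)
feature.touch(start + timedelta(days=10))
print(is_stale(active_session, stop, stale_time))  # False
```

In the first case the session bin would be removed before the Session.Stop arrives; in the second, the child update keeps the session bin within StaleTime.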

Increasing the Cleanup Size

Increasing the StaleTime will lead to more disk usage and reduced ingestion performance. We do not recommend increasing the StaleTime beyond the default unless:

  • The Workbench needs to support instrumented applications that emit no messages for long periods (i.e., more than 30 days) during their sessions, or
  • Your network scenario can prevent live data from reaching the Workbench for long periods of time (i.e., more than 30 days).

Replay Considerations

The cleanup process is designed to accommodate ingestion of both real-time data and replayed data (from the PreEmptive Analytics Standalone Repository & Replayer), and to return the same results regardless of how the data arrived. This is because the dates the cleanup process uses are based on when envelopes first arrived at any PreEmptive Analytics Suite product, rather than the server time at which these envelopes were processed.

This date is transmitted via the X-PreEmptive-ReceiveDate HTTP header. Because Standalone Repository version 1.0 did not support this header, the Workbench endpoint does not accept envelopes replayed from a 1.0 Standalone Repository (see the Known Issue).
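
To illustrate why basing the cleanup on receive dates makes live and replayed ingestion agree, here is a small Python sketch. It is an assumption-laden model only: the Envelope class and reference_time function are hypothetical, and this guide does not specify how the Workbench reduces its GcRollingCount "recent" messages to a single reference time, so the sketch simply takes the newest receive date in the rolling window.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class Envelope:
    receive_date: datetime  # first arrival at any Suite product
    processed_at: datetime  # server clock when the Workbench handled it


def reference_time(
    envelopes: Iterable[Envelope], rolling_count: int = 5000
) -> Optional[datetime]:
    """Newest receive date among the last `rolling_count` envelopes.

    Assumption: using the maximum is illustrative only. Note that
    processed_at is deliberately ignored.
    """
    recent = deque((e.receive_date for e in envelopes), maxlen=rolling_count)
    return max(recent) if recent else None


arrivals = [datetime(2016, 1, 1) + timedelta(hours=h) for h in range(48)]

# Live ingestion: processed moments after arrival.
live = [Envelope(t, t + timedelta(seconds=1)) for t in arrivals]
# Replay: the same envelopes processed 90 days later.
replay = [Envelope(t, t + timedelta(days=90)) for t in arrivals]

print(reference_time(live) == reference_time(replay))  # True
```

Because the processing time never enters the calculation, a bin judged stale during live ingestion is judged stale at the same point in a replay.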



Workbench Version 1.2.0. Copyright © 2016 PreEmptive Solutions, LLC