(Glue-2411) Optimizing Scheduler Configuration

The Glue Scheduler can be set up in multiple ways. This page discusses optimization techniques and recommended ways to use it.

Important settings

Package size (MB)

Size in MB of a single package transferred during execution of an extraction process. The recommended value for the majority of storage types is 100MB. This does not apply to replication via Snowpipe streaming or Google BigQuery streaming, where the recommended default is 10MB due to API limitations. When increasing this value, be mindful of the memory constraints of the system.
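
As a quick illustration of these defaults, here is a minimal Python sketch (the storage-type identifiers are hypothetical examples, not Glue's actual names):

```python
# Hypothetical storage-type names used only for illustration.
STREAMING_TARGETS = {"SNOWPIPE_STREAMING", "BIGQUERY_STREAMING"}

def recommended_package_size_mb(storage_type: str) -> int:
    """Recommended package size in MB for a given storage type."""
    # Streaming APIs impose payload limits, hence the smaller default.
    if storage_type in STREAMING_TARGETS:
        return 10
    return 100  # default for the majority of storage types

print(recommended_package_size_mb("SNOWPIPE_STREAMING"))  # -> 10
```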

Processes on specific tables might have problems with the package size calculation; for those tables, it may be necessary to set a custom package size. See Problematic processes below for details.

Max process runtime (minutes)

Maximum execution time in minutes for one extraction process. Note that this limit is not exact: once the maximum runtime is reached, no further packages are fetched, but the whole process might take more time to finish. The purpose of this setting is to prevent jobs from being blocked by the same extraction process for extended periods of time.
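
The cutoff semantics can be pictured with a short sketch (a simplified illustration in Python, not Glue's actual implementation): the deadline is checked only between packages, so a package already in flight runs to completion.

```python
import time

def run_extraction(fetch_next_package, max_runtime_minutes: float) -> None:
    """Fetch packages until the queue is empty or the deadline passes."""
    deadline = time.monotonic() + max_runtime_minutes * 60
    while time.monotonic() < deadline:   # checked only between packages
        package = fetch_next_package()
        if package is None:              # nothing left to extract
            break
        package.transfer()               # may still finish after the deadline
```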

Debug log level

This setting is meant only for troubleshooting purposes. Do not enable it under normal circumstances, as it generates a considerable amount of logs.

Max number of jobs

Maximum number of jobs that can be used by the scheduler (the scheduler job itself is not included). The number of jobs is a very important parameter that must be configured with respect to the capabilities of the system, the number of processes in the scheduler, and their expected data volume. It's recommended to fine-tune this value and confirm that the scheduler can reliably transfer all the data in sufficient time.
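
A back-of-the-envelope way to pick a starting point before fine-tuning (the throughput figure and formula below are illustrative assumptions, not Glue internals):

```python
import math

def jobs_needed(volume_mb: float,
                mb_per_minute_per_job: float,
                window_minutes: float) -> int:
    """Estimate how many parallel jobs can move the volume in time."""
    per_job_capacity_mb = mb_per_minute_per_job * window_minutes
    return math.ceil(volume_mb / per_job_capacity_mb)

# e.g. 500 GB per day at 50 MB/min per job, within a 12-hour window:
print(jobs_needed(500_000, 50, 12 * 60))  # -> 14 jobs
```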

Max jobs used for full loads (%)

Percentage of jobs that can be used for full loads, so that delta execution is not blocked for long periods. As with the maximum number of jobs, it's important to experiment with this option to ensure that both full loads and delta loads have sufficient resources.
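
For instance, this is how a percentage cap translates into job counts (a sketch; Glue's exact rounding behavior may differ):

```python
def job_split(max_jobs: int, full_load_pct: int) -> tuple[int, int]:
    """Split the job pool between full loads and delta loads."""
    full_load_jobs = max_jobs * full_load_pct // 100
    delta_jobs = max_jobs - full_load_jobs  # always reserved for deltas
    return full_load_jobs, delta_jobs

print(job_split(20, 25))  # -> (5, 15): at most 5 of 20 jobs on full loads
```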

Allow parallelization

This option is only available for the QUEUE scheduler type. If there are spare jobs, extraction processes that are already running can be executed again, transferring data from a single queue in parallel. Enabling this option can significantly improve performance.
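
Conceptually (a simplified sketch, not Glue's scheduler code), spare jobs are re-dispatched onto queues that are already being processed:

```python
def assign_jobs(waiting: list[str], running: list[str],
                spare_jobs: int, allow_parallelization: bool) -> list[str]:
    """Decide which process each spare job should work on."""
    assignments = waiting[:spare_jobs]  # waiting processes are started first
    spare_jobs -= len(assignments)
    if allow_parallelization and spare_jobs > 0 and running:
        # Remaining jobs join queues that are already being drained.
        assignments += [running[i % len(running)] for i in range(spare_jobs)]
    return assignments

print(assign_jobs(["P1"], ["P2"], spare_jobs=3, allow_parallelization=True))
# -> ['P1', 'P2', 'P2']: two jobs drain P2's queue in parallel
```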

Process grouping

We suggest grouping processes into multiple schedulers so it's easier to monitor performance and stability. Processes can be grouped by logical characteristics (module, purpose, storage, etc.) or by the expected data volume/flow on the source tables. Keeping high-volume tables in separate schedulers helps with performance optimization, as their settings can be customized separately and tested more efficiently.
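
As an illustration, a grouping might look like this (all scheduler names, tables, and values are hypothetical):

```python
# High-volume tables get their own scheduler so package size and job
# limits can be tuned independently of the rest.
SCHEDULERS = {
    "FINANCE":     {"tables": ["BKPF", "BSEG"],  "package_mb": 100, "max_jobs": 8},
    "HIGH_VOLUME": {"tables": ["ACDOCA"],        "package_mb": 50,  "max_jobs": 16},
    "STREAMING":   {"tables": ["SALES_ORDERS"],  "package_mb": 10,  "max_jobs": 4},
}
```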

Problematic processes

Some processes whose source tables contain fields of STRING or RAW types might have problems with package size calculation, and their execution may fail on memory limits. To overcome this, it's recommended to use a separate scheduler for such tables and gradually lower the package size setting until all processes run without problems. The minimum package size that can be set through the scheduler is 1MB.

There might be cases where even this value is too high. If so, it's recommended to create a variant for the problematic process reports and set the package size (in records, not MB) manually. As with the scheduler package size, this value should be gradually lowered until the process runs without issues.
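
The tuning procedure for both cases can be sketched as follows (the `run_process` callback is hypothetical and stands for one scheduler execution of the process):

```python
def tune_package_size(run_process, start_mb: int = 100, min_mb: int = 1):
    """Lower the package size until the process succeeds, or give up."""
    size = start_mb
    while size >= min_mb:
        if run_process(package_size_mb=size):  # True = no memory errors
            return size
        size //= 2                             # gradually lower and retry
    # Even 1MB failed: fall back to a report variant with a manual,
    # record-based package size, lowered the same way.
    return None
```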