(SM-2402) Best practices

Using Azure SQL with Datafridge

Azure SQL storage, especially the Serverless option, is a good fit for Datafridge: it provides good performance while keeping costs low for irregular access to data.

Some considerations:

  • Storage costs per GB are higher than with plain ADLS.

  • Read performance is very good once the optimizations described below are applied.

  • With page compression you can achieve a compression ratio of 0.3 to 1 compared to HANA.

Collation

Collation needs to be set up during database creation on Azure and can't be changed afterwards.

Use Latin1_General_100_BIN2_UTF8
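
A minimal sketch of setting the collation at creation time (the database name is hypothetical):

-- Collation can only be set when the database is created
CREATE DATABASE DatafridgeDB COLLATE Latin1_General_100_BIN2_UTF8;

-- Verify the collation of the current database
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS DatabaseCollation;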

Loading data into AzureSQL

If you are loading large amounts of data into AzureSQL, this is how you can achieve the best performance.

Recommended strategy (TLDR)

  1. Switch off Use clustered primary key

  2. Switch on TABLOCK

  3. Create all of the largest tables

  4. Load all of the largest tables

  5. Manually trigger creation of the PK indexes on the tables loaded so far

  6. Switch on Use clustered primary key

  7. Switch off TABLOCK

  8. Create and load all other tables

  9. Compress all tables

1. Create tables without PK

A primary key in Azure SQL creates a clustered index. This means every inserted row has to be placed at its proper position and the table kept sorted, which slows bulk inserts down more and more as the table grows. To keep bulk insert speed linear, create tables (at least the large ones) without a PK.

To do this, uncheck Use clustered primary key when creating tables.

After the tables are loaded, you will need to recreate the indexes!
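
A minimal sketch of the load pattern, with hypothetical table and column names:

-- Create the table as a heap (no PK, no clustered index) for fast, linear bulk loading
CREATE TABLE gd1.SALES_DATA (
    ID      BIGINT        NOT NULL,
    PAYLOAD NVARCHAR(200) NULL
);

-- ... bulk load the data ...

-- After loading, recreate the clustered primary key in one pass
ALTER TABLE gd1.SALES_DATA
    ADD CONSTRAINT PK_SALES_DATA PRIMARY KEY CLUSTERED (ID);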

2. Reach minimal logging scenario

We want to achieve the most efficient bulk insert scenario. This means reaching the "Minimal" value in the "Without TF 610" column of the table below, since trace flag 610 can't be switched on in Azure SQL.

Minimal logging depends on 3 things:

  • Active indexes on the table

  • Rows in the table (after the first package every table has some rows, so only the "Any" rows of the table below are relevant)

  • Hints in the bulk insert statement

We can affect indexes and hints.

The optimal combination is heap tables (i.e. without a clustered index) + the TABLOCK hint.

The second-best scenario is Cluster + no TABLOCK hint, because at least it allows parallelization, but whenever possible go with the first option.

Avoid all other combinations.

You can set hints in SM.

[Screenshot: hint settings in SM (image-20240129-162143.png)]
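
For reference, a bulk insert that hits the heap + TABLOCK scenario looks roughly like this (table names are hypothetical):

-- Heap target + TABLOCK hint = minimal logging without TF 610
INSERT INTO gd1.SALES_DATA WITH (TABLOCK)
SELECT ID, PAYLOAD
FROM gd1.SALES_DATA_STAGING;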

For reference about minimal logging scenarios, see the table below, taken from the MSFT documentation.

| Table Indexes   | Rows in table | Hints              | Without TF 610 | With TF 610 | Concurrent possible |
|-----------------|---------------|--------------------|----------------|-------------|---------------------|
| Heap            | Any           | TABLOCK            | Minimal        | Minimal     | Yes                 |
| Heap            | Any           | None               | Full           | Full        | Yes                 |
| Heap + Index    | Any           | TABLOCK            | Full           | Depends (3) | No                  |
| Cluster         | Empty         | TABLOCK, ORDER (1) | Minimal        | Minimal     | No                  |
| Cluster         | Empty         | None               | Full           | Minimal     | Yes (2)             |
| Cluster         | Any           | None               | Full           | Minimal     | Yes (2)             |
| Cluster         | Any           | TABLOCK            | Full           | Minimal     | No                  |
| Cluster + Index | Any           | None               | Full           | Depends (3) | Yes (2)             |
| Cluster + Index | Any           | TABLOCK            | Full           | Depends (3) | No                  |

3. Apply page compression

Page compression provides efficient compression for Datafridge (or Datatiering) use case. It shouldn’t impact read performance, and since writes are rare or non-existent, we don’t mind the impact on writes.

https://learn.microsoft.com/en-us/sql/relational-databases/data-compression/page-compression-implementation?view=azuresqldb-current

You should apply compression after loading all data because, as mentioned, it has an impact on writes.

The observed compression ratio is between 0.3 and 1 compared to HANA (this should be calibrated with better data). To compress a single table:

ALTER TABLE Production.TransactionHistory REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = PAGE);

Get a list of uncompressed tables:

SELECT DISTINCT t.name AS UncompressedTables
FROM sys.partitions p
INNER JOIN sys.tables t ON p.object_id = t.object_id
WHERE p.data_compression = 0;

Apply compression for all tables:

DECLARE @NAME VARCHAR(100);
DECLARE @SQL NVARCHAR(300);
DECLARE @MYSCHEMA VARCHAR(100);
SET @MYSCHEMA = 'gd1';  -- assumes all tables live in this schema

-- Cursor over every table that still has at least one uncompressed partition
DECLARE CUR CURSOR FOR
    SELECT DISTINCT t.name AS NAME
    FROM sys.partitions p
    INNER JOIN sys.tables t ON p.object_id = t.object_id
    WHERE p.data_compression = 0;

OPEN CUR;
FETCH NEXT FROM CUR INTO @NAME;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Rebuild each table with page compression
    SET @SQL = 'ALTER TABLE [' + @MYSCHEMA + '].[' + @NAME + '] REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = PAGE);';
    PRINT @SQL;
    EXEC sp_executesql @SQL;
    FETCH NEXT FROM CUR INTO @NAME;
END
CLOSE CUR;
DEALLOCATE CUR;

Optimizing read performance

To reach optimal read performance, secondary indexes are crucial in AzureSQL. There are two options:

  1. Turn on automatic performance tuning on AzureSQL (see the snippet after this list)

  2. Create all indexes manually.
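
For option 1, a sketch of enabling automatic index tuning on the current database:

-- Let Azure SQL create and drop nonclustered indexes automatically
ALTER DATABASE CURRENT SET AUTOMATIC_TUNING (CREATE_INDEX = ON, DROP_INDEX = ON);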

You can get existing recommendations for indexes using this SQL:
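
A sketch based on the standard missing-index DMV query from the MSFT documentation (the improvement-measure formula and the index naming are illustrative):

-- Generate CREATE INDEX statements from the missing-index DMVs,
-- ordered by a rough estimate of their impact
SELECT
    CONVERT(DECIMAL(28, 1),
        migs.avg_total_user_cost * migs.avg_user_impact *
        (migs.user_seeks + migs.user_scans)) AS improvement_measure,
    'CREATE INDEX missing_index_' + CONVERT(VARCHAR, mig.index_group_handle)
        + '_' + CONVERT(VARCHAR, mid.index_handle)
        + ' ON ' + mid.statement
        + ' (' + ISNULL(mid.equality_columns, '')
        + CASE WHEN mid.equality_columns IS NOT NULL
                AND mid.inequality_columns IS NOT NULL THEN ',' ELSE '' END
        + ISNULL(mid.inequality_columns, '') + ')'
        + ISNULL(' INCLUDE (' + mid.included_columns + ')', '') AS create_index_statement
FROM sys.dm_db_missing_index_groups mig
INNER JOIN sys.dm_db_missing_index_group_stats migs
    ON migs.group_handle = mig.index_group_handle
INNER JOIN sys.dm_db_missing_index_details mid
    ON mig.index_handle = mid.index_handle
ORDER BY improvement_measure DESC;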

Please note that the statements generated by this query should be taken with a bucket of salt; see https://learn.microsoft.com/en-us/sql/relational-databases/sql-server-index-design-guide?view=azuresqldb-current#Nonclustered for guidance on optimal nonclustered indexes.

 

MSFT recommends getting the relevant query plans as well, although I had no luck querying by query hashes…

Sources

https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms177445(v=sql.105)

https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008/dd425070(v=sql.100)?redirectedfrom=MSDN

https://learn.microsoft.com/en-us/sql/relational-databases/data-compression/page-compression-implementation?view=azuresqldb-current