(SM-2402) Best practices
Using Azure SQL with Datafridge
Azure SQL storage, especially the Serverless option, is a good fit for Datafridge: it provides good performance while keeping costs low for irregular access to data.
Some considerations:
Storage costs per GB are higher than plain ADLS.
Read performance is very good once the optimizations below are applied.
With page compression you can achieve a compression ratio of 0.3 to 1 compared to HANA.
Collation
Collation needs to be set during database creation on Azure and can’t be changed afterwards.
Use Latin1_General_100_BIN2_UTF8
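As a minimal sketch (the database name is a placeholder), the collation can be set in the CREATE DATABASE statement; on Azure it can also be chosen in the portal when the database is created:

```sql
-- Hypothetical database name; the collation cannot be changed later.
CREATE DATABASE DatafridgeDb
COLLATE Latin1_General_100_BIN2_UTF8;
```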
Loading data into AzureSQL
If you are loading large amounts of data into Azure SQL, this is how you can achieve the best performance.
Recommended strategy (TLDR)
1. Switch off Use clustered primary key and switch on TABLOCK.
2. Create all the largest tables.
3. Load all the largest tables.
4. Manually trigger creation of the PK indexes on the tables created so far.
5. Switch on Use clustered primary key and switch off TABLOCK.
6. Create and load all other tables.
7. Compress all tables.
1. Create tables without PK
A primary key in Azure SQL creates a clustered index. This means that every insert has to be placed into its proper position and the table kept sorted, which slows bulk inserts down, especially as the table grows. To keep bulk insert speed linear, create tables (at least the large ones) without a PK.
To do this, uncheck Use clustered primary key when creating tables.
After the tables are loaded, you will need to recreate the indexes!
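A minimal sketch of this workflow, with hypothetical table and column names (only the gd1 schema is taken from this document):

```sql
-- Create the table as a heap (no PK, no clustered index) for fast bulk loads.
CREATE TABLE gd1.SalesOrder (
    OrderId BIGINT        NOT NULL,
    Amount  DECIMAL(18,2) NULL
);

-- ... bulk load the data ...

-- After loading, recreate the primary key; this builds the clustered index once.
ALTER TABLE gd1.SalesOrder
ADD CONSTRAINT PK_SalesOrder PRIMARY KEY CLUSTERED (OrderId);
```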
2. Reach minimal logging scenario
We want to achieve the most efficient bulk insert scenario. This means a Minimal value in the Without TF 610 column of the table below, as trace flag 610 can’t be switched on in Azure SQL.
It depends on 3 things:
Active indexes on the table
Rows in the table (after the first package all tables will have some, so we only care about the “any” rows case)
Hints in the bulk insert statement
Of these, we can affect the indexes and the hints.
The optimal combination is heap tables (i.e. without a clustered index) + the TABLOCK hint.
The second-best scenario is a clustered index + no TABLOCK hint, because it at least allows parallelization; but whenever possible, go with the first option.
Avoid all other combinations.
You can set hints in SM.
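For illustration, a sketch of what the TABLOCK hint looks like in a plain INSERT … SELECT (table names are hypothetical; in practice SM generates the statements):

```sql
-- Bulk insert into a heap with TABLOCK: eligible for minimal logging.
INSERT INTO gd1.SalesOrder WITH (TABLOCK)
SELECT OrderId, Amount
FROM staging.SalesOrder;
```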
For reference on minimal logging scenarios, see the table below from the Microsoft documentation.
| Table Indexes | Rows in table | Hints | Without TF 610 | With TF 610 | Concurrent possible |
|---|---|---|---|---|---|
| Heap | Any | TABLOCK | Minimal | Minimal | Yes |
| Heap | Any | None | Full | Full | Yes |
| Heap + Index | Any | TABLOCK | Full | Depends (3) | No |
| Cluster | Empty | TABLOCK, ORDER (1) | Minimal | Minimal | No |
| Cluster | Empty | None | Full | Minimal | Yes (2) |
| Cluster | Any | None | Full | Minimal | Yes (2) |
| Cluster | Any | TABLOCK | Full | Minimal | No |
| Cluster + Index | Any | None | Full | Depends (3) | Yes (2) |
| Cluster + Index | Any | TABLOCK | Full | Depends (3) | No |
3. Apply page compression
Page compression provides efficient compression for the Datafridge (or Datatiering) use case. It shouldn’t impact read performance, and since writes are rare or non-existent, we don’t mind the impact on writes.
You should apply compression after loading all data because, as mentioned, it has an impact on writes.
The observed compression ratio is between 0.3 and 1 compared to HANA (this should be calibrated with better data).
ALTER TABLE Production.TransactionHistory REBUILD PARTITION = ALL
WITH (DATA_COMPRESSION = PAGE);
Get a list of uncompressed tables:
SELECT DISTINCT t.name AS UncompressedTables
FROM sys.partitions p
INNER JOIN sys.tables t
    ON p.object_id = t.object_id
WHERE p.data_compression = 0
Apply page compression for all uncompressed tables in a schema:
DECLARE @NAME sysname
DECLARE @SQL NVARCHAR(500)
DECLARE @MYSCHEMA sysname
SET @MYSCHEMA = 'gd1'
DECLARE CUR CURSOR FOR
    -- Only pick uncompressed tables from the target schema.
    SELECT DISTINCT t.name
    FROM sys.partitions p
    INNER JOIN sys.tables t
        ON p.object_id = t.object_id
    INNER JOIN sys.schemas s
        ON t.schema_id = s.schema_id
    WHERE p.data_compression = 0
      AND s.name = @MYSCHEMA
OPEN CUR
FETCH NEXT FROM CUR INTO @NAME
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @SQL = 'ALTER TABLE ' + QUOTENAME(@MYSCHEMA) + '.' + QUOTENAME(@NAME)
             + ' REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = PAGE);'
    PRINT @SQL
    EXEC sp_executesql @SQL
    FETCH NEXT FROM CUR INTO @NAME
END
CLOSE CUR
DEALLOCATE CUR
Optimizing read performance
To reach optimal read performance, secondary indexes are crucial in Azure SQL. There are two options:
Turn on automatic performance tuning on Azure SQL
Create all indexes manually
You can get the existing index recommendations by querying the missing-index DMVs.
Please note that the generated statements should be taken with a grain of salt; see https://learn.microsoft.com/en-us/sql/relational-databases/sql-server-index-design-guide?view=azuresqldb-current#Nonclustered for how to design optimal nonclustered indexes.
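A sketch of such a query over the missing-index DMVs (the column selection and the impact-based ordering are my assumptions, not an official recommendation):

```sql
-- Index recommendations collected by the engine since the last restart.
SELECT TOP (20)
    mid.statement AS TableName,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,
    migs.user_seeks,
    migs.avg_user_impact
FROM sys.dm_db_missing_index_details AS mid
INNER JOIN sys.dm_db_missing_index_groups AS mig
    ON mid.index_handle = mig.index_handle
INNER JOIN sys.dm_db_missing_index_group_stats AS migs
    ON mig.index_group_handle = migs.group_handle
ORDER BY migs.user_seeks * migs.avg_user_impact DESC;
```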
MSFT recommends retrieving the query plans for the slow queries, although I had no luck querying by query hashes…
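As a hedged sketch, plans for the most expensive cached statements can be pulled from the plan-cache DMVs like this (a filter on qs.query_hash could be added once you have a hash that actually matches):

```sql
-- Top cached statements by total CPU, with their text and XML plans.
SELECT TOP (10)
    qs.query_hash,
    qs.execution_count,
    qs.total_worker_time,
    st.text AS query_text,
    qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
ORDER BY qs.total_worker_time DESC;
```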
Sources
https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms177445(v=sql.105)