(SM-2402) Best practices

Using Azure SQL with Datafridge

Azure SQL storage, especially the Serverless option, is a good fit for Datafridge: it provides good performance while keeping costs low for irregular access to data.

Some considerations:

  • Storage costs per GB are higher than with plain ADLS.

  • Read performance is very good once the optimizations described below are applied.

  • With page compression you can achieve a compression ratio of 0.3 to 1 compared to HANA.

Collation

Collation needs to be set up during database creation on Azure and can't be changed afterwards.

Use Latin1_General_100_BIN2_UTF8
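
A minimal sketch of setting the collation at creation time (the database name is hypothetical):

-- Collation can only be set when the database is created
CREATE DATABASE DatafridgeDB COLLATE Latin1_General_100_BIN2_UTF8;

-- Verify the collation of the current database
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS DatabaseCollation;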

Loading data into AzureSQL

If you are loading large amounts of data into AzureSQL, this is how you can achieve the best performance.

Recommended strategy (TLDR)

  1. Switch off Use clustered primary key

  2. Switch on TABLOCK

  3. Create all of the largest tables

  4. Load all of the largest tables

  5. Manually trigger creation of the PK indexes on the tables loaded so far

  6. Switch on Use clustered primary key

  7. Switch off TABLOCK

  8. Create and load all other tables

  9. Compress all tables

1. Create tables without PK

A primary key in Azure SQL creates a clustered index. This means every inserted row has to be placed at its proper position and the table kept sorted, which slows bulk inserts down more and more as the table grows. To keep bulk insert speed linear, create tables (at least the large ones) without a PK.

To do this, uncheck Use clustered primary key when creating tables.

After the tables are loaded, you will need to recreate the indexes!
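
A minimal sketch of the load pattern, with hypothetical table and column names:

-- Create the table as a heap (no PK, no clustered index) for fast, linear bulk loading
CREATE TABLE gd1.SALES_DATA (
    ID      BIGINT        NOT NULL,
    PAYLOAD NVARCHAR(200) NULL
);

-- ... bulk load the data ...

-- After loading, recreate the clustered primary key in one pass
ALTER TABLE gd1.SALES_DATA
    ADD CONSTRAINT PK_SALES_DATA PRIMARY KEY CLUSTERED (ID);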

2. Reach minimal logging scenario

We want to achieve the most efficient bulk insert scenario. This means reaching the "Minimal" value in the "Without TF 610" column of the table below, since trace flag 610 can't be switched on in Azure SQL.

Minimal logging depends on 3 things:

  • Active indexes on the table

  • Rows in the table (after the first package every table has some rows, so only the "Any" rows of the table below are relevant)

  • Hints in the bulk insert statement

We can affect indexes and hints.

The optimal combination is heap tables (i.e. without a clustered index) + the TABLOCK hint.

The second-best scenario is Cluster + no TABLOCK hint, because at least it allows parallelization, but whenever possible go with the first option.

Avoid all other combinations.

You can set hints in SM.

[Screenshot: hint settings in SM (image-20240129-162143.png)]
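
For reference, a bulk insert that hits the heap + TABLOCK scenario looks roughly like this (table names are hypothetical):

-- Heap target + TABLOCK hint = minimal logging without TF 610
INSERT INTO gd1.SALES_DATA WITH (TABLOCK)
SELECT ID, PAYLOAD
FROM gd1.SALES_DATA_STAGING;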

For reference about minimal logging scenarios, see the table below, taken from the MSFT documentation.

| Table Indexes   | Rows in table | Hints              | Without TF 610 | With TF 610 | Concurrent possible |
|-----------------|---------------|--------------------|----------------|-------------|---------------------|
| Heap            | Any           | TABLOCK            | Minimal        | Minimal     | Yes                 |
| Heap            | Any           | None               | Full           | Full        | Yes                 |
| Heap + Index    | Any           | TABLOCK            | Full           | Depends (3) | No                  |
| Cluster         | Empty         | TABLOCK, ORDER (1) | Minimal        | Minimal     | No                  |
| Cluster         | Empty         | None               | Full           | Minimal     | Yes (2)             |
| Cluster         | Any           | None               | Full           | Minimal     | Yes (2)             |
| Cluster         | Any           | TABLOCK            | Full           | Minimal     | No                  |
| Cluster + Index | Any           | None               | Full           | Depends (3) | Yes (2)             |
| Cluster + Index | Any           | TABLOCK            | Full           | Depends (3) | No                  |

3. Apply page compression

Page compression provides efficient compression for Datafridge (or Datatiering) use case. It shouldn’t impact read performance, and since writes are rare or non-existent, we don’t mind the impact on writes.

https://learn.microsoft.com/en-us/sql/relational-databases/data-compression/page-compression-implementation?view=azuresqldb-current

You should apply compression after loading all data because, as mentioned, it has an impact on writes.

The observed compression ratio is between 0.3 and 1 compared to HANA (this should be calibrated with better data). To compress a single table:

ALTER TABLE Production.TransactionHistory REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = PAGE);

Get a list of uncompressed tables:

SELECT DISTINCT t.name AS UncompressedTables
FROM sys.partitions p
INNER JOIN sys.tables t ON p.object_id = t.object_id
WHERE p.data_compression = 0;

Apply compression for all tables:

DECLARE @NAME VARCHAR(100);
DECLARE @SQL NVARCHAR(300);
DECLARE @MYSCHEMA VARCHAR(100);
SET @MYSCHEMA = 'gd1';  -- assumes all tables live in this schema

-- Cursor over every table that still has at least one uncompressed partition
DECLARE CUR CURSOR FOR
    SELECT DISTINCT t.name AS NAME
    FROM sys.partitions p
    INNER JOIN sys.tables t ON p.object_id = t.object_id
    WHERE p.data_compression = 0;

OPEN CUR;
FETCH NEXT FROM CUR INTO @NAME;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Rebuild each table with page compression
    SET @SQL = 'ALTER TABLE [' + @MYSCHEMA + '].[' + @NAME + '] REBUILD PARTITION = ALL WITH (DATA_COMPRESSION = PAGE);';
    PRINT @SQL;
    EXEC sp_executesql @SQL;
    FETCH NEXT FROM CUR INTO @NAME;
END
CLOSE CUR;
DEALLOCATE CUR;

Optimizing read performance

To reach optimal read performance, secondary indexes are crucial in AzureSQL. There are two options:

  1. Turn on automatic performance tuning on AzureSQL (see the snippet after this list)

  2. Create all indexes manually.
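
For option 1, a sketch of enabling automatic index tuning on the current database:

-- Let Azure SQL create and drop nonclustered indexes automatically
ALTER DATABASE CURRENT SET AUTOMATIC_TUNING (CREATE_INDEX = ON, DROP_INDEX = ON);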

You can get existing recommendations for indexes using this SQL:
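
A sketch based on the standard missing-index DMV query from the MSFT documentation (the improvement-measure formula and the index naming are illustrative):

-- Generate CREATE INDEX statements from the missing-index DMVs,
-- ordered by a rough estimate of their impact
SELECT
    CONVERT(DECIMAL(28, 1),
        migs.avg_total_user_cost * migs.avg_user_impact *
        (migs.user_seeks + migs.user_scans)) AS improvement_measure,
    'CREATE INDEX missing_index_' + CONVERT(VARCHAR, mig.index_group_handle)
        + '_' + CONVERT(VARCHAR, mid.index_handle)
        + ' ON ' + mid.statement
        + ' (' + ISNULL(mid.equality_columns, '')
        + CASE WHEN mid.equality_columns IS NOT NULL
                AND mid.inequality_columns IS NOT NULL THEN ',' ELSE '' END
        + ISNULL(mid.inequality_columns, '') + ')'
        + ISNULL(' INCLUDE (' + mid.included_columns + ')', '') AS create_index_statement
FROM sys.dm_db_missing_index_groups mig
INNER JOIN sys.dm_db_missing_index_group_stats migs
    ON migs.group_handle = mig.index_group_handle
INNER JOIN sys.dm_db_missing_index_details mid
    ON mig.index_handle = mid.index_handle
ORDER BY improvement_measure DESC;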

Please note that the statements generated by this query should be taken with a bucket of salt; see https://learn.microsoft.com/en-us/sql/relational-databases/sql-server-index-design-guide?view=azuresqldb-current#Nonclustered for guidance on optimal nonclustered indexes.

 

MSFT recommends getting the relevant query plans as well, although I had no luck querying by query hashes…

Sources

https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms177445(v=sql.105)

https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008/dd425070(v=sql.100)?redirectedfrom=MSDN

https://learn.microsoft.com/en-us/sql/relational-databases/data-compression/page-compression-implementation?view=azuresqldb-current