In this post I would like to demonstrate how basic SQL commands can be written in Python.
You can find good documentation of this topic on the pandas website.
First we will need to import the following packages:
import pandas as pd
import numpy as np
import pyodbc as py
To get our data, we will establish a connection to a SQL server:
# server holds the hostname of your SQL Server instance
conn = py.connect('DRIVER={SQL Server};SERVER=' + server + ';DATABASE=AW;Trusted_Connection=yes')
Next we will have to define the SQL query:
sql = '''SELECT so.SalesOrderID
,st.Name
,so.OrderDate
,so.DueDate
,so.ShipDate
,so.SalesOrderNumber
,so.PurchaseOrderNumber
,so.AccountNumber
,so.CurrencyRateID
,so.TotalDue
FROM Sales.SalesOrderHeader so
INNER JOIN Sales.SalesTerritory st ON so.TerritoryID = st.TerritoryID'''
We will execute the query and store the result into a dataframe:
df = pd.read_sql(sql, conn)
print(df.head(5))
WHERE Filter
To filter on a certain column, the following syntax is used:
dataframe[dataframe['columnname'] == 'value']
print(df[df['SalesOrderNumber'] == 'SO70708'].head(5))
OR + AND Filter
Like in other programming languages, a pipe | is used for OR and an ampersand & for AND:
print(df[((df['SalesOrderNumber'] == 'SO70708') | (df['DueDate'] == '2008-05-28')) & (df['TotalDue'] <= 5)])
NULL and NOT NULL
To find NULLs or NOT NULLs you can simply use the built-in functions isnull() or notnull():
print(df[df['PurchaseOrderNumber'].isnull()])
print(df[df['PurchaseOrderNumber'].notnull()])
GROUP BY
GROUP BY can be applied with the groupby() function in combination with an aggregate such as size():
print(df.groupby('PurchaseOrderNumber').size())
It is possible to use multiple aggregate functions at once in combination with group by:
print(df.groupby('AccountNumber').agg({'TotalDue': ['mean', 'sum'], 'AccountNumber': 'size'}))
JOINS
Now, let's move to the core element of SQL...Joins!
This can be achieved in Python with the merge function.
For better demonstration, I will create two simple DataFrames and show how to write an inner, left, right and full outer join in Python.
#Inner Join
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'E']})
df2 = pd.DataFrame({'key': ['B', 'C', 'D', 'D', 'F']})
print(pd.merge(df1, df2, on = 'key'))
#Left Outer Join
print(pd.merge(df1, df2, on = 'key', how = 'left'))
#Right Join
print(pd.merge(df1, df2, on = 'key', how = 'right'))
#Full Join
print(pd.merge(df1, df2, on = 'key', how = 'outer'))
UNION (ALL)
The concat() function will append the DataFrames listed. This delivers the same result as UNION ALL in SQL. If you want the result of a UNION instead, add drop_duplicates() after the concat() call.
#Union all
print(pd.concat([df1, df2]))
#Union
print(pd.concat([df1, df2]).drop_duplicates())
Business Intelligence & Data Warehouse Tutorials
Saturday, 13 January 2018
Tuesday, 26 September 2017
SQL SERVER ADMINISTRATION: FULL, DIFFERENTIAL AND TRANSACTION LOG BACKUP TYPES
In this tutorial, I will go through the different types available for creating Backups, which will be essential for your decision on which Backup Strategy to go for.
I will mainly focus on Full, Differential and Transaction Log Backups.
FULL DATABASE BACKUPS
This backup type will include all objects (views, procedures, functions, ...), the tables including their data, users and rights.
A Full Backup enables you to restore the database as it was at the time the Backup was created. Transactions occurring during the Backup process will also be captured.
DIFFERENTIAL BACKUPS
Differential Backups will capture the data that has been altered since the last Full Backup. So when restoring, you will need the Full + Differential Backup files.
An advantage is that Differential Backups are faster than Full Backups (the difference can be huge depending on the size of the Database).
TRANSACTION LOG BACKUPS
Transaction Log Backups will capture all changes that have occurred since the last Transaction Log Backup. Within this process the transaction log is also cleaned up (transactions that have been committed or cancelled are removed).
This Backup Type captures data up to the time the Backup process was started, whereas Full and Differential Backups also capture transactions that occur during the backup process.
FILE AND FILEGROUP BACKUPS
This type is usually used for large Databases. Basically specific Files and Filegroups are backed up.
COPY-ONLY BACKUPS
Copy-Only Backups will deliver the same results as a Full Database Backup or a Transaction Log Backup. The reason for using a Copy-Only Backup is that it is ignored by the backup sequence: a Copy-Only Backup initiated between a Full and a Transaction Log Backup will not affect the timestamp of the last Full Backup.
You can create a Backup either manually or with a T-SQL command.
If you decide to do it manually, right click on any Database -> Tasks -> Back Up
T-SQL:
--Full Backup
BACKUP DATABASE Databasename TO DISK = 'YourPath'
--Transaction Log Backup
BACKUP LOG Databasename TO DISK = 'YourPath'
--Differential Backup
BACKUP DATABASE Databasename TO DISK = 'YourPath' WITH DIFFERENTIAL
Select the Database, the Backup Type and Destination
To get a better understanding of the different Backup Types, let's have a look at the table below.
At the time I created the Backup, the table was empty:
I inserted 1 country (Sweden) into the table and created a Differential and a Transaction Log Backup.
Finally I inserted another row by adding Netherlands to the table and created another Transaction Log Backup.
RESTORING BACKUPS
Now that we have created our Backups, we will need to restore them. Right click on the Destination Database -> Tasks -> Restore -> Database
On the General tab, select device and click on the ... button to select location of the stored Backup Files.
You will also need to select a Destination Database.
On the Files tab, ensure your files are pointing to the right folder (and not to the original source).
On the Options tab, select Overwrite and RESTORE WITH RECOVERY.
To restore a FULL Backup you need to select the Full Backup File:
To restore a TRANSACTION LOG Backup you need to select the Full Backup File + Transaction Log files:
As you can see in the picture above, you have the option to restore to any point in time at which a Transaction Log Backup was created.
To get a better picture I will summarize the 3 Backup Types in relation to entries in the Country table.
Tuesday, 23 May 2017
SQL SERVER ADMINISTRATION: EXTRACT SHOWPLAN RESULTS FROM SQL SERVER PROFILER FOR FURTHER ANALYSIS
The Estimated or Actual Execution Plan provided by SQL Server gives a nice graphical overview, but it can be hard to read if there are too many objects, and you also need to point your cursor at each object for further information. There is also an option to extract the Execution Plan as XML, but that requires additional steps to make it readable. In this tutorial I will show an alternative way with SQL Server Profiler that provides detailed execution plan information in text form, which can easily be reused for further analysis.
Once you've opened SQL Server Profiler, click on the File tab and then on New Trace.
SQL Server Profiler returns a massive amount of information. To narrow down the results and to focus on our main goal, some settings are essential. In the Trace Properties, check Show all events and expand the Performance section to capture the Showplan data.
Under Performance you will see all types of Showplan information. The one most relevant for our exercise here is Showplan All.
Meanwhile I executed my query from SSMS, and SQL Server Profiler immediately returned the following records:
Showplan All returns the execution plan and, in addition, detailed information such as which type of join operator is used, CPU costs, the number of executions etc. Simply mark the relevant text, right click and copy.
You can save this information as a txt file or simply paste it into Excel. Once it is in the right format, the whole thing is very easy to read, and you can sort descending by the most expensive objects to lead you straight to the root cause.
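Once the Showplan text has been saved, the sorting step can also be scripted. Below is a minimal Python sketch, assuming a tab-separated export whose columns include StmtText and TotalSubtreeCost; the sample rows are made up, and a real Showplan All export contains many more columns:

```python
import csv
import io

# Hypothetical Showplan All rows pasted as tab-separated text;
# a real export has many more columns than shown here
showplan_txt = """StmtText\tEstimateCPU\tTotalSubtreeCost
Clustered Index Scan\t0.000035\t0.0033
Clustered Index Seek\t0.000158\t0.0100
Nested Loops\t0.000131\t0.0150
"""

rows = list(csv.DictReader(io.StringIO(showplan_txt), delimiter='\t'))

# Sort descending by subtree cost to surface the most expensive operators first
rows.sort(key=lambda r: float(r['TotalSubtreeCost']), reverse=True)

for r in rows:
    print(r['StmtText'], r['TotalSubtreeCost'])
```

The same sort is what you would do with the Data -> Sort dialog in Excel; scripting it just makes the step repeatable.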
Saturday, 11 March 2017
T-SQL: PARSING VALUES BASED ON NTH OCCURRENCE OF A CHARACTER IN A STRING
I recently had to parse values stored in a column that followed this pattern: '-Word1-Word2-Word3-Word4-' etc.
SQL Server already has the built-in function CHARINDEX(), which is quite helpful, but the issue is that it only returns the position of the 1st occurrence.
Technically I could nest one CHARINDEX() inside another, but that would create a big piece of code that is a little hard to read, plus it is not a dynamic solution. So, I decided to create a function that returns the value based on input parameters:
CREATE FUNCTION dbo.ParseStringValue(
@SplitCharacter nvarchar(5),
@TargetString nvarchar(50),
@NthOccurence int)
RETURNS nvarchar(50)
AS
BEGIN
DECLARE @Pos1 int = 0,
@Pos2 int = 0,
@Count int = 0,
@Result nvarchar(50)
WHILE @Count <= @NthOccurence
BEGIN
SET @pos1 = @pos2
SET @pos2 = CHARINDEX(@SplitCharacter, @TargetString, @pos1 + 1)
SET @Count = @Count +1
END
SET @Result = SUBSTRING(@TargetString, @pos1+1, (@pos2-1)- @pos1)
RETURN (@Result)
END
I will use this string for our sample, '-Aziz-Sharif-blogspot-com-', and will search for the value behind the 4th occurrence of the split character '-'.
The function is called in the SELECT part, and I need to pass 3 parameters:
@SplitCharacter: the separator value '-'
@TargetString: the string value - here a fixed string for demonstration, but this could be a column as well
@NthOccurence: the nth occurrence of the character in the string after which the value should be returned
SELECT dbo.ParseStringValue('-', '-Aziz-Sharif-blogspot-com-', 4)
The while loop will repeat the same piece of code until it reaches the nth occurrence:
WHILE @Count <= @NthOccurence
BEGIN
SET @pos1 = @pos2
SET @pos2 = CHARINDEX(@SplitCharacter, @TargetString, @pos1 + 1)
SET @Count = @Count +1
END
I will run the query in debug mode and break down the values of each run.
Within the loop I define the positions of the split character. Since I need to find the values between 2 characters, I have declared 2 integer variables: @pos1 and @pos2.
SET @pos1 = @pos2
SET @pos2 = CHARINDEX(@SplitCharacter, @TargetString, @pos1 + 1)
@pos2 is assigned to @pos1, so @pos1 stores the most recent position, while @pos2 returns the next position of the split character after @pos1:
SET @pos2 = CHARINDEX(@SplitCharacter, @TargetString, @pos1 + 1)
The @pos1 + 1 part defines the start position of the search.
The final relevant values stored are:
@pos1 = 22 -> '-Aziz-Sharif-blogspot-com-'
@pos2 = 26 -> '-Aziz-Sharif-blogspot-com-'
Since I need to extract the value between the 2 split characters, I need to add 1 position to the start position and deduct 1 from the end position:
SET @Result = SUBSTRING(@TargetString, @pos1+1, (@pos2-1)- @pos1)
The function returns the value after the 4th occurrence in the defined string.
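For comparison, the same loop can be sketched in Python as a toy re-implementation of the T-SQL function above, with a small helper mimicking the 1-based semantics of CHARINDEX:

```python
def charindex(needle, haystack, start=1):
    # Mimic T-SQL CHARINDEX: 1-based position, 0 when not found
    i = haystack.find(needle, start - 1)
    return i + 1 if i >= 0 else 0

def parse_string_value(split_char, target, nth):
    pos1 = pos2 = count = 0
    # Walk from one occurrence of the split character to the next
    # until the nth occurrence is reached
    while count <= nth:
        pos1 = pos2
        pos2 = charindex(split_char, target, pos1 + 1)
        count += 1
    # SUBSTRING(target, pos1 + 1, (pos2 - 1) - pos1) in 1-based terms
    return target[pos1:pos2 - 1]

print(parse_string_value('-', '-Aziz-Sharif-blogspot-com-', 4))  # com
```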
Sunday, 26 February 2017
T-SQL: JOIN OPERATORS - NESTED LOOP JOIN, MERGE JOIN AND HASH JOIN
In this tutorial, I will describe the 3 Join Operators SQL Server uses and why the Optimizer selects each of them.
I will focus on the Inner Join and will stick to using 2 tables. We basically have 3 scenarios:
1) None of the tables have an index on the joining column
2) One of the two tables has an index on the joining column
3) Both tables have an index on the joining column
In the samples below I will use 2 tables from the AdventureWorks Database and modify indexes to demonstrate why the Optimizer varies the join operator. The following 2 tables will be used: Sales.SalesOrderHeader and Sales.SalesTerritory.
The following query will be used throughout all samples:
SELECT so.SalesOrderID
,st.Name
,so.OrderDate
,so.DueDate
,so.ShipDate
,so.SalesOrderNumber
,so.PurchaseOrderNumber
,so.AccountNumber
,so.CurrencyRateID
,so.TotalDue
FROM Sales.SalesOrderHeader so
INNER JOIN Sales.SalesTerritory st ON so.TerritoryID = st.TerritoryID
NESTED LOOP JOIN
Tablename | Row Count | Indexed Column |
Sales.SalesOrderHeader | 31465 | Clustered Index on joining column |
Sales.SalesTerritory | 10 | No index |
A Nested Loop Join consists of an inner input (the lower object in the Execution Plan) and an outer input (the upper object in the Execution Plan).
The outer input is scanned row by row, and an index seek is performed on the inner input (the larger, indexed table).
This is exactly the advantage of Nested Loop Joins compared to Hash and Merge Joins: the smaller outer input is placed in memory and compared against an indexed inner input.
Let's have a closer look at the Execution plan by placing the cursor above the object in question. The Clustered Index Scan shows us that there was 1 Execution returning 10 rows (all rows in that table).
The Clustered Index Seek provides the information that 10 Executions took place, returning 31465 rows. Therefore keep in mind that Nested Loop Joins will always work well with a small outer input and an indexed inner input.
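The strategy can be illustrated with a toy Python sketch (made-up rows, not SQL Server internals): the small outer input is scanned row by row, and a dict stands in for the index on the inner input.

```python
# Outer input: the small table, scanned row by row
territories = [(1, 'Northwest'), (2, 'Northeast'), (3, 'Central')]

# Inner input with an "index" on TerritoryID; the dict lookup
# plays the role of the clustered index seek
orders_by_territory = {
    1: ['SO43659', 'SO43661'],
    2: ['SO43660'],
    3: ['SO43662'],
}

result = []
for territory_id, name in territories:                       # outer scan
    for order in orders_by_territory.get(territory_id, []):  # index seek
        result.append((order, name))

print(len(result))  # 4 joined rows
```

One lookup per outer row is exactly why this plan is cheap when the outer input is small.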
MERGE JOIN
Tablename | Row Count | Indexed Column |
Sales.SalesOrderHeader | 31465 | Clustered Index on joining column |
Sales.SalesTerritory | 10 | Clustered Index on joining column |
The requirement for a Merge Join is that both inputs are sorted on the joining column. Assuming you have an index on both columns, the sort is already covered.
If you force a Merge Join by passing a hint and an index is missing on one or both of the columns, a Sort operator will be added, which is quite expensive due to its high memory and I/O consumption.
A Merge Join can be very fast when the columns are already ordered by an index. In this case matching rows are produced while the sorted inputs are compared for equality.
The big advantage of Merge Joins is that each input is read only once.
Although I have an index on both tables, the Optimizer will probably still go for a Nested Loop Join here, since the outer input is quite small. Merge Joins are best suited for larger, indexed tables.
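The single-pass behaviour can be illustrated with a toy Python sketch (made-up rows; for simplicity it assumes unique join keys on each side):

```python
# Both inputs are sorted on the join key, as a Merge Join requires
left = [(1, 'a'), (2, 'b'), (4, 'c')]
right = [(2, 'x'), (3, 'y'), (4, 'z')]

result, i, j = [], 0, 0
# Two cursors walk forward; each input is read exactly once
while i < len(left) and j < len(right):
    lk, rk = left[i][0], right[j][0]
    if lk == rk:
        result.append((lk, left[i][1], right[j][1]))
        i += 1
        j += 1
    elif lk < rk:
        i += 1  # advance the cursor with the smaller key
    else:
        j += 1

print(result)  # [(2, 'b', 'x'), (4, 'c', 'z')]
```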
HASH JOIN
Tablename | Row Count | Indexed Column |
Sales.SalesOrderHeader | 31465 | No index |
Sales.SalesTerritory | 10 | No Index |
Hash Joins are usually selected when there is no other option, meaning the data is neither sorted nor indexed. A Hash Join consists of 2 inputs:
1) Build Input: the upper object in the execution plan, usually the smaller table, since a hash table is built from it in memory and a smaller build input keeps memory usage low
2) Probe Input: the lower object in the execution plan
Hash Joins are used by the Optimizer to process large, unsorted and nonindexed inputs.
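The build and probe phases can be illustrated with a toy Python sketch (made-up rows, not SQL Server internals):

```python
# Build input: the smaller table, from which the hash table is built
build_input = [(1, 'Northwest'), (2, 'Northeast')]
# Probe input: the larger table, scanned once against the hash table
probe_input = [('SO1', 1), ('SO2', 2), ('SO3', 1), ('SO4', 9)]

# Build phase: hash table keyed on the join column
hash_table = {}
for key, name in build_input:
    hash_table.setdefault(key, []).append(name)

# Probe phase: one pass over the larger input
result = [(order, name)
          for order, key in probe_input
          for name in hash_table.get(key, [])]

print(len(result))  # 3 matches; SO4 has no matching key
```

Neither input needs to be sorted or indexed, which is exactly why the Optimizer falls back to this operator.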
In most cases the Optimizer will make the right choice, but when statistics are not up to date it may be necessary to enforce the right option yourself.
You can do this either by adding an OPTION (x JOIN) hint at the end of the query
OPTION (LOOP JOIN)
OPTION (MERGE JOIN)
OPTION (HASH JOIN)
or by adding the hint in the INNER JOIN part of the query:
INNER LOOP JOIN
INNER MERGE JOIN
INNER HASH JOIN