Tuesday, May 15, 2012

Find pattern-matched duplicate rows within a group using PARTITION BY in SQL Server.


This article aims to explain how to find duplicate rows within a group matching patterns.

Scenario:
Following is the structure and data of “tblPeople” table:

PersonID
CityName
ForeNames    
LastName
1
Ahmedabad
Bhavin
Parikh
2
Ahmedabad
Bharat
Soni
3
Ahmedabad
Jagdish
Shah
4
Ahmedabad
Byomesh
Soni
5
Ahmedabad
Jagdish
Shyam
6
Vadodara
Bhavin
Patel
7
Vadodara
Amay
Kapoor
8
Vadodara
Ajit
Shah
9
Vadodara
Ajit
Shah
10
Vadodara
Soham
Patel
11
Vadodara
Amay
Shah
12
Surat
Rajesh
Desai
13
Surat
Bhavin
Parikh
14
Surat
Bhaumik
Shah

Our SQL should have input parameters for ForeNames, LastName, and CityName to define search pattern for duplicate items search.
Parameters - @foreNames, @lastName, @cityName
Our objective is as follow:
1.       If none of the above parameters are supplied, find all persons having same ForeNames and LastName within same City.
Expected result:

2.       If @cityName is only supplied, then find all persons having same ForeNames within particular City.
Expected result: (input >> @cityName = ‘%Vad%’)


3.       If @foreNames is supplied (with/ without @cityName supplied), but @lastName is not supplied.
In that case, consider all persons matching @foreNames within same city as duplicates, and return them.
Expected result: (input >> @foreNames=’%Bh%’)


4.       If @lastName is supplied (with /without @cityName supplied), but @foreNames is not supplied.
In that case, consider all persons matching @lastName within same city as duplicates, and return them.
Expected result: (input >> @lastName=’%Sh%’)



5.       If @foreNames and @lastName are supplied (with/without @cityName supplied).
In that case, consider all persons matching @foreNames, and @lastName within same city as duplicates, and return them.
Expected result: (input >> @foreNames=’%A%’, @lastName=’%S%’, @cityName=’%Ahm%’)

Solution:
First of all, we cannot use GROUP BY here, because we need also need ID, and other details of the rows which are duplicated. GROUP BY simply cannot allow selecting those columns which are not used in grouping or aggregate.

To achieve this, we need to partition the rows in groups of “City”, “Forenames”, and “LastName”. Here, while partitioning the rows we also need to consider pattern-matching for ForeNames and LastName.

This can be done using PARTITION BY in SELECT query, and within single T-SQL statement only.

Following is the T-SQL serving the purpose:

WITH CTE AS
(SELECT ID, City, ForeNames, LastName,
COUNT(*) OVER(PARTITION BY City,
CASE WHEN @foreNames <> '' THEN @foreNames
ELSE
CASE WHEN @lastName <> '' THEN '' ELSE ForeNames END
END,
CASE WHEN @lastName <> ''  THEN @lastName
ELSE
CASE WHEN @foreNames <> '' THEN ''
ELSE
CASE WHEN @cityName <> '' THEN ''
ELSE LastName
END
END
END) as MAXCount
FROM tblPeople
WHERE ForeNames LIKE  
CASE WHEN @foreNames <> '' THEN @foreNames ELSE '%' END
AND LastName LIKE CASE WHEN @lastName <> '' THEN @lastName ELSE '%' END
AND City LIKE CASE WHEN @cityName <> '' THEN @cityName ELSE '%' END
)

SELECT * FROM CTE WHERE MAXCount > 1

Here, I have used Common Table Expression (but you can use temporary table to store the result of SELECT statement) before further narrowing down the resultset to exclude non-duplicates.

The point of interest in above T-SQL is how the field “MAXCount” gets generated.
COUNT(*) – indicates the total number of rows in each partition. This is what ultimately will be stored in “MAXCount” field.

Now, each partition is built using following values:
City >> Because we need to find duplicates within the same city always, so this column is an obvious choice in partition.

ForeNames >> Partition on this column depends upon @foreNames parameter.
·         If it is not supplied, and neither @lastName is supplied it will create partition by grouping rows having exact ForeNames.
·         If it is not supplied, but @lastName is supplied, it will ignore ForeNames column in partitioning.
·         If supplied, it will create partition by grouping all matching rows in a single partition.

LastName >> Partition on this column depends upon @lastName parameter.
·         If it is not supplied, and not even @foreNames, and @cityName, then it will create partition by grouping rows having same LastName.
·         If it is not supplied, but either @foreNames, or @cityName is supplied, it will ignore LastName column in partitioning.
·         If supplied, it will create partition by grouping all matching rows in single partition.

No comments:

Post a Comment

Thanks for visiting my blog.
However, if this helped you in any way, please take a moment to write a comment.

Thanks
Nirman