ASP.NET, ASP.NET MVC, Core .NET, and SQL Server: Find pattern-matched duplicate rows within a group using PARTITION BY in SQL Server.

This article aims to explain how to find duplicate rows within a group matching patterns.

Scenario:

Following is the structure and data of “tblPeople” table:

PersonID	CityName	ForeNames	LastName
1	Ahmedabad	Bhavin	Parikh
2	Ahmedabad	Bharat	Soni
3	Ahmedabad	Jagdish	Shah
4	Ahmedabad	Byomesh	Soni
5	Ahmedabad	Jagdish	Shyam
6	Vadodara	Bhavin	Patel
7	Vadodara	Amay	Kapoor
8	Vadodara	Ajit	Shah
9	Vadodara	Ajit	Shah
10	Vadodara	Soham	Patel
11	Vadodara	Amay	Shah
12	Surat	Rajesh	Desai
13	Surat	Bhavin	Parikh
14	Surat	Bhaumik	Shah

Our SQL should have input parameters for ForeNames, LastName, and CityName to define search pattern for duplicate items search.

Parameters - @foreNames, @lastName, @cityName

Our objective is as follow:

1. If none of the above parameters are supplied, find all persons having same ForeNames and LastName within same City.

Expected result:

2. If @cityName is only supplied, then find all persons having same ForeNames within particular City.

Expected result: (input >> @cityName = ‘%Vad%’)

3. If @foreNames is supplied (with/ without @cityName supplied), but @lastName is not supplied.

In that case, consider all persons matching @foreNames within same city as duplicates, and return them.

Expected result: (input >> @foreNames=’%Bh%’)

4. If @lastName is supplied (with /without @cityName supplied), but @foreNames is not supplied.

In that case, consider all persons matching @lastName within same city as duplicates, and return them.

Expected result: (input >> @lastName=’%Sh%’)

5. If @foreNames and @lastName are supplied (with/without @cityName supplied).

In that case, consider all persons matching @foreNames, and @lastName within same city as duplicates, and return them.

Expected result: (input >> @foreNames=’%A%’, @lastName=’%S%’, @cityName=’%Ahm%’)

Solution:

First of all, we cannot use GROUP BY here, because we need also need ID, and other details of the rows which are duplicated. GROUP BY simply cannot allow selecting those columns which are not used in grouping or aggregate.

To achieve this, we need to partition the rows in groups of “City”, “Forenames”, and “LastName”. Here, while partitioning the rows we also need to consider pattern-matching for ForeNames and LastName.

This can be done using PARTITION BY in SELECT query, and within single T-SQL statement only.

Following is the T-SQL serving the purpose:

WITH CTE AS

(SELECT ID, City, ForeNames, LastName,

COUNT(*) OVER(PARTITION BY City,

CASE WHEN @foreNames <> '' THEN @foreNames

ELSE

CASE WHEN @lastName <> '' THEN '' ELSE ForeNames END

END,

CASE WHEN @lastName <> '' THEN @lastName

ELSE

CASE WHEN @foreNames <> '' THEN ''

ELSE

CASE WHEN @cityName <> '' THEN ''

ELSE LastName

END

END) as MAXCount

FROM tblPeople

WHERE ForeNames LIKE

CASE WHEN @foreNames <> '' THEN @foreNames ELSE '%' END

AND LastName LIKE CASE WHEN @lastName <> '' THEN @lastName ELSE '%' END

AND City LIKE CASE WHEN @cityName <> '' THEN @cityName ELSE '%' END

)

SELECT * FROM CTE WHERE MAXCount > 1

Here, I have used Common Table Expression (but you can use temporary table to store the result of SELECT statement) before further narrowing down the resultset to exclude non-duplicates.

The point of interest in above T-SQL is how the field “MAXCount” gets generated.

COUNT(*) – indicates the total number of rows in each partition. This is what ultimately will be stored in “MAXCount” field.

Now, each partition is built using following values:

City >> Because we need to find duplicates within the same city always, so this column is an obvious choice in partition.

ForeNames >> Partition on this column depends upon @foreNames parameter.

· If it is not supplied, and neither @lastName is supplied it will create partition by grouping rows having exact ForeNames.

· If it is not supplied, but @lastName is supplied, it will ignore ForeNames column in partitioning.

· If supplied, it will create partition by grouping all matching rows in a single partition.

LastName >> Partition on this column depends upon @lastName parameter.

· If it is not supplied, and not even @foreNames, and @cityName, then it will create partition by grouping rows having same LastName.

· If it is not supplied, but either @foreNames, or @cityName is supplied, it will ignore LastName column in partitioning.

· If supplied, it will create partition by grouping all matching rows in single partition.

ASP.NET, ASP.NET MVC, Core .NET, and SQL Server

Tuesday, May 15, 2012

Find pattern-matched duplicate rows within a group using PARTITION BY in SQL Server.

No comments:

Post a Comment

Blog Archive

About Me