This article aims to explain how to find duplicate rows
within a group matching patterns.
Scenario:
Following is the structure and data of “tblPeople” table:
PersonID
|
CityName
|
ForeNames
|
LastName
|
1
|
Ahmedabad
|
Bhavin
|
Parikh
|
2
|
Ahmedabad
|
Bharat
|
Soni
|
3
|
Ahmedabad
|
Jagdish
|
Shah
|
4
|
Ahmedabad
|
Byomesh
|
Soni
|
5
|
Ahmedabad
|
Jagdish
|
Shyam
|
6
|
Vadodara
|
Bhavin
|
Patel
|
7
|
Vadodara
|
Amay
|
Kapoor
|
8
|
Vadodara
|
Ajit
|
Shah
|
9
|
Vadodara
|
Ajit
|
Shah
|
10
|
Vadodara
|
Soham
|
Patel
|
11
|
Vadodara
|
Amay
|
Shah
|
12
|
Surat
|
Rajesh
|
Desai
|
13
|
Surat
|
Bhavin
|
Parikh
|
14
|
Surat
|
Bhaumik
|
Shah
|
Our SQL should have input parameters for ForeNames,
LastName, and CityName to define search pattern for duplicate items search.
Parameters - @foreNames, @lastName, @cityName
Our objective is as follow:
1.
If none of the above parameters are supplied,
find all persons having same ForeNames and LastName within same City.
Expected
result:
2.
If @cityName is only supplied, then find all
persons having same ForeNames within particular City.
Expected
result: (input >> @cityName = ‘%Vad%’)
3.
If @foreNames is supplied (with/ without @cityName
supplied), but @lastName is not supplied.
In that case, consider all persons matching
@foreNames within same city as duplicates, and return them.
Expected
result: (input >> @foreNames=’%Bh%’)
4.
If @lastName is supplied (with /without
@cityName supplied), but @foreNames is not supplied.
In that case, consider all persons matching
@lastName within same city as duplicates, and return them.
Expected
result: (input >> @lastName=’%Sh%’)
5.
If @foreNames and @lastName are supplied
(with/without @cityName supplied).
In that case, consider all persons matching
@foreNames, and @lastName within same city as duplicates, and return them.
Expected
result: (input >> @foreNames=’%A%’, @lastName=’%S%’, @cityName=’%Ahm%’)
Solution:
First of all, we cannot use GROUP BY here, because
we need also need ID, and other details of the rows which are duplicated. GROUP
BY simply cannot allow selecting those columns which are not used in grouping
or aggregate.
To achieve this, we need to partition the
rows in groups of “City”, “Forenames”, and “LastName”. Here, while partitioning
the rows we also need to consider pattern-matching for ForeNames and LastName.
This can be done using PARTITION BY in SELECT
query, and within single T-SQL statement only.
Following is the T-SQL serving the purpose:
WITH CTE AS
(SELECT ID,
City, ForeNames,
LastName,
COUNT(*) OVER(PARTITION BY City,
CASE WHEN
@foreNames <> ''
THEN @foreNames
ELSE
CASE WHEN
@lastName <> ''
THEN '' ELSE ForeNames END
END,
CASE WHEN
@lastName <> '' THEN
@lastName
ELSE
CASE WHEN
@foreNames <> ''
THEN ''
ELSE
CASE WHEN
@cityName <> ''
THEN ''
ELSE LastName
END
END
END) as MAXCount
FROM tblPeople
WHERE ForeNames LIKE
CASE WHEN
@foreNames <> ''
THEN @foreNames ELSE
'%' END
AND LastName LIKE
CASE WHEN
@lastName <> ''
THEN @lastName ELSE
'%' END
AND City LIKE
CASE WHEN
@cityName <> ''
THEN @cityName ELSE
'%' END
)
SELECT * FROM CTE WHERE
MAXCount > 1
Here, I have used Common Table Expression
(but you can use temporary table to store the result of SELECT statement)
before further narrowing down the resultset to exclude non-duplicates.
The point of interest in above T-SQL is how
the field “MAXCount” gets generated.
COUNT(*) – indicates the total number of
rows in each partition. This is what ultimately will be stored in “MAXCount”
field.
Now, each partition is built using
following values:
City >> Because we need to find
duplicates within the same city always, so this column is an obvious choice in
partition.
ForeNames >> Partition on this column
depends upon @foreNames parameter.
·
If it is not supplied, and neither @lastName is
supplied it will create partition by grouping rows having exact ForeNames.
·
If it is not supplied, but @lastName is
supplied, it will ignore ForeNames column in partitioning.
·
If supplied, it will create partition by grouping
all matching rows in a single partition.
LastName >> Partition on this column
depends upon @lastName parameter.
·
If it is not supplied, and not even @foreNames,
and @cityName, then it will create partition by grouping rows having same
LastName.
·
If it is not supplied, but either @foreNames, or
@cityName is supplied, it will ignore LastName column in partitioning.
·
If supplied, it will create partition by
grouping all matching rows in single partition.