Learn SQL with me: NZ names

SQL, database design and the most popular names in New Zealand
Published March 22, 2023

In this blog post, we will look at the most popular names in New Zealand and build a relational database out of them. We will go through database design, including the first, second and third normal forms, consider how the database will be used and how the data will be queried, and try to de-duplicate the information.

A note of warning: I am not an expert on database design or SQL. I am just learning, trying to put the theoretical knowledge from a few books into practice on an interesting project that is more than just a toy example.

The Data

The data we will be working with comes from the Department of Internal Affairs. It contains the 100 most popular male and female names from 1954 to 2022. The data is presented as a series of column pairs containing (name, count) tuples, with male and female names separated into their own tables.

Rank   1954               1955               1956
       Name       Count   Name       Count   Name       Count
1      Christine    779   Susan        743   Susan        851
2      Susan        735   Christine    689   Christine    754
3      Margaret     562   Margaret     559   Karen        615

For the following database design, there are some considerations:

  • Each column contains 100 names, but that doesn’t mean that we are operating with only 100 names. Some names popular in 1960 might not be popular in 2000.
  • The data covers names over 69 years, but we would like to make adding new data (new years) easy.
  • Some names are unisex and may appear in both girl and boy tables. For instance, Ashley used to be a popular unisex name in the 1980s.

The data is presented in what one would call the wide format. We could go even wider by including a sex column and merging the tables into a single wide table, moving each name into its own row, removing rank, and so on.

Consider a much simpler row-oriented long format:

Name        Year   Count
Christine   1954     779
Susan       1954     735
Margaret    1954     562
Susan       1955     743
Christine   1955     689
Margaret    1955     559

We could easily add an additional column for sex and represent the whole dataset with just 4 columns. Or 5, if we also included rank, but rank feels like a derived variable. We can imagine new data, such as a newly discovered record book from a now-defunct hospital, bringing information about a few more births, so that the historical ranks would change and we would have to recalculate them.

In data science, the long format is especially popular because it is easy to work with. And in my experience, it is much easier for software to grok a million rows than a million columns.

Both of these forms are flat-file representations, something you would get in a CSV file (and load into R or the like).

The Design

When designing the SQL table, we want to consider a few issues:

  • how the data will be queried
  • how to prevent duplication
  • how to prevent adding invalid data

Invalid data and duplication are prevented by normalizing the database schema into the famous first, second and third normal forms. (For instance, storing a comma-separated list of years next to each name would violate the first normal form, which demands atomic values.) Database in Depth by C. J. Date mentions fourth and fifth normal forms and more, but that book feels very theoretical to me; it will probably make more sense once I am a bit more familiar with the practice.

On the other hand, the practical usage of a database might sometimes run against normalization. Apparently, there is a lot of discussion about whether normalization is really needed or whether it is, on the contrary, just common sense.

First of all, the wide format. With only a few years of data, it would be incredibly easy to define a names table, to serve as a foreign key target, and then a table for every year, such as year1954, year1955 and so on. Querying data and joining tables would be straightforward. And since there is no relationship between the years, each year table would depend only on the names table, so adding a new one would be simple.

[Figure: Example of a database design using the wide format.]

However, most databases are optimized for rows, not columns. Adding a new table is a relatively expensive operation compared to adding a row. This way, we would end up with 69 tables, one for each year, and more every year. That is quite a lot of tables! Additionally, if we wanted to know the most common name since 1954, we would have to query every single table.
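To make the cost concrete, here is a sketch of how this hypothetical per-year design could look and how painful a cross-year query would get (none of this is implemented; the table names are illustrative):

CREATE TABLE year1954 (
    name varchar(50),
    count int NOT NULL,
    PRIMARY KEY (name),
    FOREIGN KEY (name) REFERENCES Names(name)
);
-- ...and one such table for every other year...

-- Finding the most common name since 1954 touches every table:
SELECT name, SUM(count) AS total
FROM (
    SELECT name, count FROM year1954
    UNION ALL
    SELECT name, count FROM year1955
    -- ...67 more UNION ALL branches...
)
GROUP BY name
ORDER BY total DESC
LIMIT 1;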

Yet, this approach has some advantages. Old tables would never be modified, since each table is populated only once, and a new table would be added just once a year as the data are obtained. The database structure would also be simple: every name could occur in each year table only once, a constraint that is easy to enforce.

What I think would be a better and more traditional approach is to model this relationship as one-to-many: one name to many years. We would have a names table, a years table, and a counts table. The first two are simple tables containing the names and the years respectively.

[Figure: Example of a database design using the long format.]

And this is the database we will implement.

The Implementation

The primary key is a constraint that uniquely identifies every row of a table. Each table can have only a single primary key. This is an incredibly powerful tool to enforce the integrity of a table.

Similarly, a foreign key is a field that refers to the primary key of another table; this enforces referential integrity.

A common practice is to create a synthetic ID that serves as the primary key, but we do not need that here. We might need it if we allowed for varied spellings and treated all of them as a single unique name.
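For comparison, the surrogate-key variant we decided against might look like this (just a sketch; in SQLite, an INTEGER PRIMARY KEY column is automatically assigned unique rowid values):

CREATE TABLE Names (
    id INTEGER PRIMARY KEY,          -- surrogate key, auto-assigned by SQLite
    name varchar(50) NOT NULL UNIQUE
);

Instead, we keep the name itself as the primary key: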

CREATE TABLE Names (
    name varchar(50),
    PRIMARY KEY (name)
);

CREATE TABLE Years (
    year int,
    PRIMARY KEY (year)
);

The Counts table is more interesting. We can have any combination of a name and a year, but each combination may appear in the table only once. We can enforce this by building a composite primary key from the tuple (name, year), while tying both values to the Names and Years tables through foreign keys.

CREATE TABLE Counts (
    name varchar(50) NOT NULL,
    year int NOT NULL,
    count int NOT NULL,
    PRIMARY KEY (name, year),
    FOREIGN KEY (name) REFERENCES Names(name),
    FOREIGN KEY (year) REFERENCES Years(year)
);

Now we can populate the tables:

INSERT INTO Years(year) VALUES (1954);

INSERT INTO Names(name) VALUES ('Christine');
INSERT INTO Counts (name, year, count) VALUES ('Christine', 1954, 779);

INSERT INTO Names(name) VALUES ('Susan');
INSERT INTO Counts (name, year, count) VALUES ('Susan', 1954, 735);

INSERT INTO Names(name) VALUES ('Margaret');
INSERT INTO Counts (name, year, count) VALUES ('Margaret', 1954, 562);
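To see these constraints doing their job, here are a few inserts that should all be rejected (a sketch; note that in SQLite, foreign key enforcement must first be switched on with PRAGMA foreign_keys = ON, as we do below in R):

-- Duplicate (name, year) violates the composite primary key:
INSERT INTO Counts (name, year, count) VALUES ('Susan', 1954, 735);

-- A name missing from Names violates the first foreign key:
INSERT INTO Counts (name, year, count) VALUES ('Bartholomew', 1954, 3);

-- A year missing from Years violates the second foreign key:
INSERT INTO Counts (name, year, count) VALUES ('Susan', 1953, 100);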

The Names table is a bit extra, but it gives us referential integrity. It protects us against typos, gives us the option to add a potential Rank table, and would allow us to record alternative spellings, translations and other features. It also gives us a visible list of all names. All these advantages are, I think, worth a little more pain when populating the tables.
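As an example of such a feature, a hypothetical spellings extension (not implemented here) could hang off the Names table like this:

CREATE TABLE Spellings (
    spelling varchar(50),
    name varchar(50) NOT NULL,                 -- the canonical name
    PRIMARY KEY (spelling),
    FOREIGN KEY (name) REFERENCES Names(name)
);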

Now, populating tables manually is a pain, so here we do it in R using the DBI and RSQLite packages.


First, we need to parse the file and get data for boys and girls:

library("openxlsx")
file = "Top-100-girls-and-boys-names-since-1954.xlsx"


get_name_table = function(file, sheet){
    # The raw data has a rather complex structure:
    #  * 4 rows of noise/empty cells
    #  * 2 rows of headers (year and name/number)
    #  * 1 empty row
    #  * 100 rows of data
    #  * 3 rows of noise/empty cells
    # Helper function
    get_year_table = function(data, cols, year){
        data.frame("Name" = data[,cols[1]],
                   "Year" = as.integer(year),
                   "Count" = data[,cols[2]])
        }

    data = openxlsx::read.xlsx(
        file, sheet=sheet,
        skipEmptyCols=FALSE, skipEmptyRows=FALSE,
        colNames=FALSE
        )

    # Automatic skipEmptyCols and skipEmptyRows do not work properly in this case
    data = data[-(1:4),] # 4 rows are noise
    data = head(data, n=-3) # last 3 rows of noise
    data = data[,-(1:2)] # First two columns are noise/rank

    # convert to numeric first to remove any NAs that would turn into "NA" string otherwise
    years = as.numeric(data[1,]) |> na.omit() |> c() |> as.character()

    # Now we can remove the 2 rows of header and one empty row
    data = data[-(1:3),]

    # First year is formed by two columns (name, value)
    res = list()
    res[[years[1]]] = get_year_table(data, c(1,2), years[1])

    # Every other year is formed by three columns (empty, name, value)
    for(i in seq(2, length(years))){
        # for year i, name and count sit in the last two of its three columns
        cols = 2:3+3*(i-2)+2
        year = years[i]
        res[[year]] = get_year_table(data, cols, year)
        }

    res = do.call(rbind, res)
    rownames(res) = NULL

    res
    }

girls = get_name_table(file, 1)
boys = get_name_table(file, 2)

Then we create tables as shown before:

library("RSQLite")
library("DBI")
library("glue") # for better multiline strings
database = dbConnect(RSQLite::SQLite(), "nznames.db")

# Names and counts are split by sex into separate tables
dbExecute(database, "PRAGMA foreign_keys = ON;") |> invisible()
dbExecute(database, glue("
    CREATE TABLE GirlNames (
        name varchar(50),
        PRIMARY KEY (name)
    );")) |> invisible()

dbExecute(database, glue("
    CREATE TABLE BoyNames (
        name varchar(50),
        PRIMARY KEY (name)
    );")) |> invisible()

dbExecute(database, glue("
    CREATE TABLE Years (
        year int,
        PRIMARY KEY (year)
    );")) |> invisible()

dbExecute(database, glue("
    CREATE TABLE GirlCounts (
        name varchar(50) NOT NULL,
        year int NOT NULL,
        count int NOT NULL,
        PRIMARY KEY (name, year),
        FOREIGN KEY (name) REFERENCES GirlNames(name),
        FOREIGN KEY (year) REFERENCES Years(year)
    );")) |> invisible()

dbExecute(database, glue("
    CREATE TABLE BoyCounts (
        name varchar(50) NOT NULL,
        year int NOT NULL,
        count int NOT NULL,
        PRIMARY KEY (name, year),
        FOREIGN KEY (name) REFERENCES BoyNames(name),
        FOREIGN KEY (year) REFERENCES Years(year)
    );")) |> invisible()

And now we can populate the tables:

years = data.frame("year"=union(girls$Year, boys$Year) |> sort())
girl_names = data.frame("name" = girls$Name |> unique())
boy_names = data.frame("name" = boys$Name |> unique())

dbAppendTable(database, "Years", years)
dbAppendTable(database, "GirlNames", girl_names)
dbAppendTable(database, "BoyNames", boy_names)
dbAppendTable(database, "GirlCounts", girls)
dbAppendTable(database, "BoyCounts", boys)

Interestingly, the last statement fails because a primary key would be duplicated. How so? Turns out someone typed Michael twice!

# Check whether any (name, year) pair occurs more than once
any(duplicated(boys[,1:2]))

# Show every row whose chosen columns are duplicated, including the first occurrence
find_duplicated = function(x, cols){
    x[duplicated(x[,cols]) | duplicated(x[,cols], fromLast=TRUE),]
    }
find_duplicated(boys, 1:2)

In 1989, 741 boys were named Michael. And somehow, also 52 boys. Given the popularity of Michael, the second occurrence is surely a typo, but unfortunately, we don't know what the true value should be. Maybe if they had used a properly designed database like ours, it would have warned them about this mistake!
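Incidentally, had the data already been sitting in an unconstrained table, the same duplicate could be hunted down in SQL (a sketch; the staging table name is hypothetical):

SELECT name, year, COUNT(*) AS n
FROM StagingBoyCounts
GROUP BY name, year
HAVING COUNT(*) > 1;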

So we clean the data and rerun the statement.

# anyDuplicated() returns the index of the first duplicated row, so this drops it
boys = boys[-anyDuplicated(boys[,1:2]),]
dbAppendTable(database, "BoyCounts", boys)
dbDisconnect(database)

Now that we have the database, we can construct queries!

The Querying

Select the most popular girl name for each year since 2010. (Selecting name next to MAX(count) in a GROUP BY query relies on SQLite's bare-column behavior, which takes the name from the row holding the maximum; it is not portable SQL.)

SELECT name, MAX(count), year
FROM GirlCounts
WHERE year >= 2010
GROUP BY year;
Ruby|335|2011
Olivia|312|2012
Charlotte|303|2013
Charlotte|255|2014
Olivia|268|2015
Olivia|266|2016
Charlotte|277|2017
Charlotte|233|2018
Amelia|255|2019
Isla|243|2020
Charlotte|227|2021
Isla|246|2022

Charlotte seems to be quite popular, how popular was the name historically?

SELECT count, year
FROM GirlCounts
WHERE name = 'Charlotte';
65|1974
65|1975
75|1976
106|1977
93|1978
95|1979
88|1980
85|1981
93|1982
111|1983
87|1984
106|1985
135|1986
123|1987
118|1988
158|1989
149|1990
155|1991
175|1992
155|1993
160|1994
130|1995
140|1996
138|1997
133|1998
150|1999
149|2000
152|2001
201|2002
243|2003
330|2004
278|2005
324|2006
263|2007
269|2008
257|2009
305|2010
258|2011
285|2012
303|2013
255|2014
260|2015
262|2016
277|2017
233|2018
248|2019
222|2020
227|2021
208|2022

Charlotte entered the top 100 in 1974 and steadily increased in popularity until about 2010, after which it began to drop slightly.

Another interesting question would be to find names that have remained popular over the whole 69-year period, that is, names that occur many times in our table.

SELECT name, COUNT(name)
FROM GirlCounts
GROUP BY name
ORDER BY COUNT(name) DESC;
Elizabeth|69
Anna|63
Sarah|62
Maria|56
Emma|54
Victoria|50
Charlotte|49
Jennifer|49
Emily|48
Stephanie|47
Olivia|46
Rebecca|46
Catherine|45
Hannah|45
Alice|44
Amber|44
Megan|44
Hayley|43
Jessica|43
Michelle|43
Jasmine|42
Kate|42
Lucy|42
Natalie|42
Rachel|42
Katherine|41
Kathryn|41
Louise|41
Angela|40
Nicola|40
Nicole|39
Sophie|39
Amy|38
Amanda|37
Amelia|37
Christina|37
Christine|37
Helen|37
Holly|37
Katie|37
Mary|37
Samantha|37
Grace|36
Leah|36
Lisa|36
Melissa|36
Chloe|35
Jade|35
Natasha|35
Zoe|35
Claire|34
Jane|34
Lauren|34
Georgia|33
Jacqueline|33
Joanne|33
Julie|33
Karen|33
Andrea|32
Fiona|32
Tracey|32
Abigail|31
Deborah|31
Donna|31
Ruby|31
Toni|31
Laura|30
Paige|30
Susan|30
Alexandra|29
Brooke|29
Caroline|29
Danielle|29
Ella|29
Julia|29
Kayla|29
Kelly|29
Madison|29
Melanie|29
Renee|29
Tania|29
Vanessa|29
Aimee|28
Eden|28
Isabella|28
Kim|28
Sharon|28
Teresa|28
Wendy|28
Heather|27
Maddison|27
Paula|27
Phoebe|27
Rose|27
Sophia|27
Suzanne|27
Alison|26
Ashley|26
Katrina|26
Lily|26
Marie|26
Rachael|26
Sally|26
Shannon|26
Caitlin|25
Linda|25
Shelley|25
Summer|25
Tracy|25
Bronwyn|24
Carolyn|24
Gemma|24
Joanna|24
Maia|24
Margaret|24
Monique|24
Sandra|24
Stacey|24
Kirsty|23
Leanne|23
Molly|23
Patricia|23
Penelope|23
Vicki|23
Zara|23
Bridget|22
Mia|22
Robyn|22
Stella|22
Tessa|22
Anita|21
Brenda|21
Courtney|21
Erin|21
Eva|21
Isabelle|21
Janine|21
Kerry|21
Rochelle|21
Taylor|21
Ashleigh|20
Ava|20
Bella|20
Chelsea|20
Denise|20
Diane|20
Kathleen|20
Mackenzie|20
Maree|20
Mikayla|20
Philippa|20
Rebekah|20
Ruth|20
Aaliyah|19
Anne|19
Annette|19
Debra|19
Isla|19
Kylie|19
Nikita|19
Tina|19
Carol|18
Casey|18
Lilly|18
Pauline|18
Poppy|18
Raewyn|18
Scarlett|18
Sienna|18
Tayla|18
Alana|17
Barbara|17
Cheryl|17
Judith|17
Kimberley|17
Lynda|17
Lynette|17
Pamela|17
Tanya|17
Ann|16
Aria|16
Brooklyn|16
Debbie|16
Jamie|16
Jessie|16
Madeleine|16
Millie|16
Riley|16
Sara|16
Sheryl|16
Alyssa|15
Ayla|15
Dianne|15
Faith|15
Frances|15
Hazel|15
Jan|15
Kirsten|15
Piper|15
Sofia|15
Yvonne|15
Belinda|14
Cherie|14
Crystal|14
Gabrielle|14
Georgina|14
Gillian|14
Shona|14
Addison|13
Janet|13
Jillian|13
Layla|13
Matilda|13
Maya|13
Mila|13
Olive|13
Rosemary|13
Violet|13
Willow|13
Alexis|12
Angel|12
Colleen|12
Diana|12
Ellie|12
Evelyn|12
Ivy|12
Jenna|12
Jordan|12
Kay|12
Keira|12
Lesley|12
Lorraine|12
Lynne|12
Peyton|12
Alicia|11
Annabelle|11
Ariana|11
Brianna|11
Brittany|11
Claudia|11
Gail|11
Harper|11
Indie|11
Isabel|11
Janice|11
Jeanette|11
Krystal|11
Maureen|11
Michele|11
Morgan|11
Nevaeh|11
Nina|11
Shirley|11
Aurora|10
Evie|10
Florence|10
Heidi|10
Justine|10
Kelsey|10
Vivienne|10
Abby|9
Ashlee|9
Beverley|9
Caitlyn|9
Daisy|9
Freya|9
Harriet|9
Imogen|9
Lola|9
Pippa|9
Savannah|9
Thea|9
Adrienne|8
Cassandra|8
Charlie|8
Eleanor|8
Emilia|8
Frankie|8
Glenda|8
Josephine|8
Michaela|8
Quinn|8
Sadie|8
Sonya|8
Tara|8
Tiana|8
Zoey|8
Elsie|7
Jocelyn|7
Kaitlyn|7
Lydia|7
Naomi|7
Paris|7
Trinity|7
Billie|6
Cheyenne|6
Harmony|6
Jill|6
Jodie|6
Jorja|6
Joy|6
Karyn|6
Katelyn|6
Luna|6
Mya|6
Tegan|6
Amaia|5
Carla|5
Elaine|5
Gina|5
Glenys|5
Jolene|5
Kiara|5
Lucia|5
Tamara|5
Valerie|5
Ada|4
Dorothy|4
Esther|4
Harlow|4
Jemma|4
Lois|4
Madeline|4
Maisie|4
Manaia|4
Marilyn|4
Marion|4
Sonia|4
Anahera|3
Bailey|3
Carmen|3
Destiny|3
Eliza|3
Haley|3
Hope|3
Joan|3
Kaia|3
Kaye|3
Kristy|3
Lynley|3
Maeve|3
Margot|3
Rosie|3
Shania|3
Sharlene|3
Sheree|3
Vicky|3
Aleisha|2
Amara|2
Athena|2
Bonnie|2
Cora|2
Delilah|2
Ellen|2
Eloise|2
Gabriella|2
Gloria|2
Irene|2
Isobel|2
Khloe|2
Kiri|2
Kora|2
Lynnette|2
Marley|2
Nadine|2
Ngaire|2
Rhonda|2
Skye|2
Skyla|2
Tyla|2
Abbey|1
Angelina|1
Anika|1
April|1
Arabella|1
Aroha|1
Briana|1
Britney|1
Chantelle|1
Cindy|1
Clara|1
Clare|1
Cleo|1
Dawn|1
Eilish|1
Eliana|1
Elise|1
Gaylene|1
Indi|1
Iris|1
Jaime|1
Janette|1
Jean|1
Jodi|1
Kaitlin|1
Karla|1
Kellie|1
Kyla|1
Leonie|1
Libby|1
Lynn|1
MacKenzie|1
Marian|1
Mikaela|1
Monica|1
Niamh|1
Nikki|1
Payton|1
Rachelle|1
Sasha|1
Shakira|1
Sinead|1
Skylar|1
Tori|1
Virginia|1
Whitney|1

Only a single name, Elizabeth, remained popular over the whole 69-year period, occurring among the top 100 names every single year.

Finally, I said that we don't need a Rank table because we can calculate the ranks from the counts. We can do that in SQL as well, using window functions!

Here we look at the 5 most popular names in each year since 2020. The window function RANK() OVER (PARTITION BY year ORDER BY count DESC) assigns each name its rank within its year, and the outer query keeps the top 5:

SELECT name, year, count_rank, count
FROM
    (SELECT
        name,
        year,
        count,
        RANK() OVER (PARTITION BY year ORDER BY count DESC) count_rank
    FROM GirlCounts
    )
WHERE count_rank <= 5 AND year >= 2020;
Isla|2020|1|243
Charlotte|2020|2|222
Amelia|2020|3|213
Olivia|2020|4|208
Willow|2020|5|184
Charlotte|2021|1|227
Isla|2021|2|214
Amelia|2021|3|206
Olivia|2021|4|185
Ava|2021|5|184
Isla|2022|1|246
Amelia|2022|2|210
Charlotte|2022|3|208
Mila|2022|4|182
Lily|2022|5|180
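One more idea before wrapping up: since some names are unisex (remember Ashley?), a set operation can list the names that appear in both tables. A sketch:

SELECT name FROM GirlNames
INTERSECT
SELECT name FROM BoyNames;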

The Books

Finally, if you would like to learn more about SQL as I did, here is a small review of the books I read.

The best book for beginners is The Manga Guide to Databases. It is the most comprehensive introduction to databases, database design, and SQL. Every other book I have read assumes you already know about databases or SQL, so they often skip over important details.

Next, I would suggest SQL Antipatterns. It introduces you to some patterns and anti-patterns in SQL and database design. Easy to read, comprehensive and quite practical if you want to get into designing databases.

Seven Databases in Seven Weeks would be next in line. Unlike the previous ones, it doesn't teach database design or SQL tricks. Instead, it is an overview of different database types and products, not just relational databases and not just SQL. It is not an in-depth introduction to SQL or relational databases, so do not make it the first book you read; rather, let it introduce you to the multitude of other solutions out there.

Finally, there is the gold standard, Database in Depth. It assumes that you already know quite a lot about databases and doesn't talk much about SQL; instead, it introduces the language D, an idealized relational language, and focuses on the theory. Unlike other theory books I have read on other topics, this one feels quite fluffy: I have read 30 pages and think the information could be condensed into 10. But then, you don't need to stare at a single unexplained equation for 2 hours to understand it, like in those very dense theory books, because there is a 10-page explanation instead. So fluffiness might not be a disadvantage.

The Summary

We have picked a dataset and gone over some ideas and thoughts on how to convert it into a database. We then implemented the database using SQLite, populated it through R, and queried a bunch of interesting data. The database we designed was not the most exciting, since the structure of the dataset was quite simple. Yet, I have learned quite a lot. I hope you enjoyed this journey as much as I did. I will try to explore the dataset a bit more, because I think there is a story hidden inside.