Week 02: Library Survey Tables Case Study

Author

Eric Araujo

Published

September 18, 2023

0. Creating the Library Survey Tables

In the US, the Institute of Museum and Library Services (IMLS) measures library activity as part of its annual Public Libraries Survey. The survey collects data from more than 9,000 library administrative entities, defined by the survey as agencies that provide library services to a particular locality. The data includes the number of branches, staff, books, hours open per year, and so on. To teach the concepts below, we will build three tables containing the survey data for the years 2016, 2017, and 2018. To do so, we read from the CSV files downloaded from the IMLS website. More specifically, only some columns will be selected in the process, to reduce the number of unused attributes.

We are running SQL queries in a Jupyter environment.

%%capture
%load_ext sql
%sql sqlite:///dbs/w02/library.db
%config SqlMagic.style = '_DEPRECATED_DEFAULT'

This opens our database library.db for us. Don’t worry about the config details; that line is just a trick to make the SQL queries run in this environment.

Now we turn to our tables. We need to create three of them. Let’s start with the table for 2018.

%%sql
-- Creating the 2018 Public Libraries Survey table

-- We drop an old copy of the table, if it exists.
DROP TABLE IF EXISTS libraries_2018;

CREATE TABLE libraries_2018 (
    stabr text NOT NULL,
    fscskey text CONSTRAINT fscskey_2018_pkey PRIMARY KEY,
    libid text NOT NULL,
    libname text NOT NULL,
    address text NOT NULL,
    city text NOT NULL,
    zip text NOT NULL,
    county text NOT NULL,
    phone text NOT NULL,
    c_relatn text NOT NULL,
    c_legbas text NOT NULL,
    c_admin text NOT NULL,
    c_fscs text NOT NULL,
    geocode text NOT NULL,
    lsabound text NOT NULL,
    startdate text NOT NULL,
    enddate text NOT NULL,
    popu_lsa integer NOT NULL,
    popu_und integer NOT NULL,
    centlib integer NOT NULL,
    branlib integer NOT NULL,
    bkmob integer NOT NULL,
    totstaff numeric(8,2) NOT NULL,
    bkvol integer NOT NULL,
    ebook integer NOT NULL,
    audio_ph integer NOT NULL,
    audio_dl integer NOT NULL,
    video_ph integer NOT NULL,
    video_dl integer NOT NULL,
    ec_lo_ot integer NOT NULL,
    subscrip integer NOT NULL,
    hrs_open integer NOT NULL,
    visits integer NOT NULL,
    reference integer NOT NULL,
    regbor integer NOT NULL,
    totcir integer NOT NULL,
    kidcircl integer NOT NULL,
    totpro integer NOT NULL,
    gpterms integer NOT NULL,
    pitusr integer NOT NULL,
    wifisess integer NOT NULL,
    obereg text NOT NULL,
    statstru text NOT NULL,
    statname text NOT NULL,
    stataddr text NOT NULL,
    longitude numeric(10,7) NOT NULL,
    latitude numeric(10,7) NOT NULL
);
 * sqlite:///dbs/w02/library.db
Done.
Done.
[]

This is an empty table. To fill it, we need to convert our CSV rows into entities in our database.

Here I came up with another trick: a subprocess that opens SQLite3 and runs the command to import a CSV file into a table inside the database. The code is as follows:

import subprocess
import tempfile
import os

# Paths and settings
csv_path = "dbs/w02/pls_fy2018_libraries.csv"
db_path = "dbs/w02/library.db"
table_name = "libraries_2018"

# Step 1: Create a temporary CSV file without the header
with open(csv_path, 'r', encoding='utf-8') as original, tempfile.NamedTemporaryFile('w', delete=False, newline='', encoding='utf-8') as noheader:
    next(original)  # skip the header
    for line in original:
        noheader.write(line)
    temp_csv_path = noheader.name

# Step 2: Build the SQLite shell command
sqlite_cmd = f"""
.mode csv
.import '{temp_csv_path}' {table_name}
"""

# Step 3: Run the command with subprocess
subprocess.run(["sqlite3", db_path], input=sqlite_cmd, text=True)

# Step 4: Clean up the temp file
os.remove(temp_csv_path)

print(f"✅ Imported '{csv_path}' into '{table_name}' in '{db_path}' (header skipped)")
✅ Imported 'dbs/w02/pls_fy2018_libraries.csv' into 'libraries_2018' in 'dbs/w02/library.db' (header skipped)
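If the sqlite3 command-line tool happens not to be installed, a similar import can be done entirely in Python with the standard csv and sqlite3 modules. This is only a sketch of the idea, not the notebook’s actual code; the `load_csv` helper name is mine.

```python
import csv
import sqlite3

def load_csv(csv_path: str, db_path: str, table_name: str) -> int:
    """Load a CSV file (with a header row) into an existing table.

    Returns the number of rows inserted.
    """
    with open(csv_path, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header, as in the subprocess version
        rows = list(reader)
    # One "?" placeholder per column, inferred from the first data row
    placeholders = ", ".join("?" * len(rows[0]))
    with sqlite3.connect(db_path) as conn:
        conn.executemany(
            f"INSERT INTO {table_name} VALUES ({placeholders})", rows
        )
    return len(rows)
```

Unlike the `.import` shell command, this route inserts the values through the table’s declared column types, so SQLite’s type affinity rules apply in the usual way.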

Did the data load? Let’s peek at the first 10 rows.

%%sql
SELECT * FROM libraries_2018
LIMIT 10;
 * sqlite:///dbs/w02/library.db
Done.
stabr fscskey libid libname address city zip county phone c_relatn c_legbas c_admin c_fscs geocode lsabound startdate enddate popu_lsa popu_und centlib branlib bkmob totstaff bkvol ebook audio_ph audio_dl video_ph video_dl ec_lo_ot subscrip hrs_open visits reference regbor totcir kidcircl totpro gpterms pitusr wifisess obereg statstru statname stataddr longitude latitude
AK AK0001 AK0001-002 ANCHOR POINT PUBLIC LIBRARY 34020 NORTH FORK ROAD ANCHOR POINT 99556 KENAI PENINSULA 9072355692 NO NP SO Y CI1 N 7/1/17 6/30/18 2057 2040 1 0 0 0.67 18201 0 179 0 3320 0 0 9 1245 6032 4 1726 11316 2294 60 8 1495 1182 08 00 00 00 -151.825 59.77965
AK AK0002 AK0002-011 ANCHORAGE PUBLIC LIBRARY 3600 DENALI STREET ANCHORAGE 99503 ANCHORAGE 9073432892 NO CO MO Y MA1 N 1/1/17 12/31/17 295365 292940 1 4 0 76.65 370812 21587 30963 12992 76404 0 14 7311 11400 723272 54306 135828 1574942 553651 1924 165 126846 90135 08 00 00 00 -149.876 61.18744
AK AK0003 AK0003-002 ANDERSON COMMUNITY LIBRARY 101 FIRST STREET ANDERSON 99744 DENALI 9075822628 NO CI SO Y CI1 N 7/1/17 6/30/18 269 267 1 0 0 0.75 15314 0 201 0 1403 0 0 0 480 592 300 113 1137 264 4 5 225 0 08 00 00 00 -149.187 64.34363
AK AK0006 AK0006-002 KUSKOKWIM CONSORTIUM LIBRARY 420 CHIEF EDDIE HOFFMAN HIGHWAY BETHEL 99559 BETHEL 9075434516 NO MJ SO Y CI1 N 7/1/17 6/30/18 6135 6085 1 0 0 3 34860 21587 425 12992 3782 0 0 5 2040 51000 1530 1912 14067 3746 406 8 6600 5716 08 00 00 00 -161.771 60.79114
AK AK0007 AK0007-002 BIG LAKE PUBLIC LIBRARY 3140 SOUTH BIG LAKE ROAD BIG LAKE 99652 MATANUSKA-SUSITNA 9078617635 NO CO SO Y CO1 N 7/1/17 6/30/18 12847 12742 1 0 0 3 28698 21587 2140 12992 3805 0 0 47 2551 62249 887 2890 36670 15691 363 18 15022 8125 08 00 00 00 -149.819 61.5475
AK AK0008 AK0008-002 CANTWELL COMMUNITY LIBRARY 1 SCHOOL ROAD CANTWELL 99729 DENALI 9077682372 NO NP SO Y CI1 N 7/1/17 6/30/18 202 200 1 0 0 0.8 10376 21587 275 12992 562 0 0 15 720 2500 390 183 2828 1198 45 1 50 95 08 00 00 00 -148.9 63.3912
AK AK0011 AK0011-002 CHINIAK PUBLIC LIBRARY 43318 SPRUCE WAY CHINIAK 99615 KODIAK ISLAND 9075120880 NO NP SO N CI1 N 7/1/17 6/30/18 44 44 1 0 0 0 2300 21587 0 12992 257 0 0 18 132 302 0 39 465 301 27 2 75 220 08 00 00 00 -152.231 57.61245
AK AK0014 AK0014-002 COLD BAY PUBLIC LIBRARY 10 BARANOV ROAD COLD BAY 99571 ALEUTIANS EAST 9075322878 NO NP SO Y CI1 N 7/1/17 6/30/18 63 62 1 0 0 0.25 3226 0 445 0 2673 0 0 0 400 1080 31 36 121 11 5 4 235 893 08 00 00 00 -160.693 55.69687
AK AK0015 AK0015-002 COOPER LANDING COMMUNITY LIBRARY 18511 BEAN CREEK ROAD COOPER LANDING 99572 KENAI PENINSULA 9075951241 NO NP SO N CI1 N 7/1/17 6/30/18 265 263 1 0 0 0 4163 0 192 0 926 0 0 1 1092 2452 70 834 2232 772 11 4 395 357 08 00 00 00 -149.823 60.49404
AK AK0016 AK0016-002 CORDOVA PUBLIC LIBRARY 601 FIRST STREET CORDOVA 99574 VALDEZ-CORDOVA 9074246667 NO CI SO Y CI1 N 1/1/17 12/31/17 2360 2341 1 0 0 5 17888 21587 560 12992 1200 0 0 48 2575 20844 1024 2527 14435 4298 282 6 5357 6997 08 00 00 00 -145.758 60.54716

Now, let’s count how many rows are in this table.

%%sql
SELECT count(*) FROM libraries_2018;
 * sqlite:///dbs/w02/library.db
Done.
count(*)
9261

The same should be done for the years 2016 and 2017. The code below is long, but essentially identical to what we’ve done above.

%%sql

-- Creating the 2016 Public Libraries Survey table

-- We drop an old copy of the table, if it exists.
DROP TABLE IF EXISTS libraries_2016;
DROP TABLE IF EXISTS libraries_2017;

CREATE TABLE libraries_2016 (
    stabr text NOT NULL,
    fscskey text CONSTRAINT fscskey_2016_pkey PRIMARY KEY,
    libid text NOT NULL,
    libname text NOT NULL,
    address text NOT NULL,
    city text NOT NULL,
    zip text NOT NULL,
    county text NOT NULL,
    phone text NOT NULL,
    c_relatn text NOT NULL,
    c_legbas text NOT NULL,
    c_admin text NOT NULL,
    c_fscs text NOT NULL,
    geocode text NOT NULL,
    lsabound text NOT NULL,
    startdate text NOT NULL,
    enddate text NOT NULL,
    popu_lsa integer NOT NULL,
    popu_und integer NOT NULL,
    centlib integer NOT NULL,
    branlib integer NOT NULL,
    bkmob integer NOT NULL,
    totstaff numeric(8,2) NOT NULL,
    bkvol integer NOT NULL,
    ebook integer NOT NULL,
    audio_ph integer NOT NULL,
    audio_dl integer NOT NULL,
    video_ph integer NOT NULL,
    video_dl integer NOT NULL,
    ec_lo_ot integer NOT NULL,
    subscrip integer NOT NULL,
    hrs_open integer NOT NULL,
    visits integer NOT NULL,
    reference integer NOT NULL,
    regbor integer NOT NULL,
    totcir integer NOT NULL,
    kidcircl integer NOT NULL,
    totpro integer NOT NULL,
    gpterms integer NOT NULL,
    pitusr integer NOT NULL,
    wifisess integer NOT NULL,
    obereg text NOT NULL,
    statstru text NOT NULL,
    statname text NOT NULL,
    stataddr text NOT NULL,
    longitude numeric(10,7) NOT NULL,
    latitude numeric(10,7) NOT NULL
);

CREATE TABLE libraries_2017 (
    stabr text NOT NULL,
    fscskey text CONSTRAINT fscskey_2017_pkey PRIMARY KEY,
    libid text NOT NULL,
    libname text NOT NULL,
    address text NOT NULL,
    city text NOT NULL,
    zip text NOT NULL,
    county text NOT NULL,
    phone text NOT NULL,
    c_relatn text NOT NULL,
    c_legbas text NOT NULL,
    c_admin text NOT NULL,
    c_fscs text NOT NULL,
    geocode text NOT NULL,
    lsabound text NOT NULL,
    startdate text NOT NULL,
    enddate text NOT NULL,
    popu_lsa integer NOT NULL,
    popu_und integer NOT NULL,
    centlib integer NOT NULL,
    branlib integer NOT NULL,
    bkmob integer NOT NULL,
    totstaff numeric(8,2) NOT NULL,
    bkvol integer NOT NULL,
    ebook integer NOT NULL,
    audio_ph integer NOT NULL,
    audio_dl integer NOT NULL,
    video_ph integer NOT NULL,
    video_dl integer NOT NULL,
    ec_lo_ot integer NOT NULL,
    subscrip integer NOT NULL,
    hrs_open integer NOT NULL,
    visits integer NOT NULL,
    reference integer NOT NULL,
    regbor integer NOT NULL,
    totcir integer NOT NULL,
    kidcircl integer NOT NULL,
    totpro integer NOT NULL,
    gpterms integer NOT NULL,
    pitusr integer NOT NULL,
    wifisess integer NOT NULL,
    obereg text NOT NULL,
    statstru text NOT NULL,
    statname text NOT NULL,
    stataddr text NOT NULL,
    longitude numeric(10,7) NOT NULL,
    latitude numeric(10,7) NOT NULL
);
 * sqlite:///dbs/w02/library.db
Done.
Done.
Done.
Done.
[]
# Importing the 2017 Public Libraries Survey data

# Paths and settings
csv_path = "dbs/w02/pls_fy2017_libraries.csv"
db_path = "dbs/w02/library.db"
table_name = "libraries_2017"

# Step 1: Create a temporary CSV file without the header
with open(csv_path, 'r', encoding='utf-8') as original, tempfile.NamedTemporaryFile('w', delete=False, newline='', encoding='utf-8') as noheader:
    next(original)  # skip the header
    for line in original:
        noheader.write(line)
    temp_csv_path = noheader.name

# Step 2: Build the SQLite shell command
sqlite_cmd = f"""
.mode csv
.import '{temp_csv_path}' {table_name}
"""

# Step 3: Run the command with subprocess
subprocess.run(["sqlite3", db_path], input=sqlite_cmd, text=True)

# Step 4: Clean up the temp file
os.remove(temp_csv_path)

print(f"✅ Imported '{csv_path}' into '{table_name}' in '{db_path}' (header skipped)")
✅ Imported 'dbs/w02/pls_fy2017_libraries.csv' into 'libraries_2017' in 'dbs/w02/library.db' (header skipped)
# Importing the 2016 Public Libraries Survey data

# Paths and settings
csv_path = "dbs/w02/pls_fy2016_libraries.csv"
db_path = "dbs/w02/library.db"
table_name = "libraries_2016"

# Step 1: Create a temporary CSV file without the header
with open(csv_path, 'r', encoding='utf-8') as original, tempfile.NamedTemporaryFile('w', delete=False, newline='', encoding='utf-8') as noheader:
    next(original)  # skip the header
    for line in original:
        noheader.write(line)
    temp_csv_path = noheader.name

# Step 2: Build the SQLite shell command
sqlite_cmd = f"""
.mode csv
.import '{temp_csv_path}' {table_name}
"""

# Step 3: Run the command with subprocess
subprocess.run(["sqlite3", db_path], input=sqlite_cmd, text=True)

# Step 4: Clean up the temp file
os.remove(temp_csv_path)

print(f"✅ Imported '{csv_path}' into '{table_name}' in '{db_path}' (header skipped)")
✅ Imported 'dbs/w02/pls_fy2016_libraries.csv' into 'libraries_2016' in 'dbs/w02/library.db' (header skipped)

Let’s check the row count for each of the tables now.

%%sql
    SELECT 2016 AS 'Year', count(*) AS NumRows FROM libraries_2016
UNION ALL
    SELECT 2017, count(*) AS NumRows FROM libraries_2017
UNION ALL
    SELECT 2018, count(*) AS NumRows FROM libraries_2018;
 * sqlite:///dbs/w02/library.db
Done.
Year NumRows
2016 9252
2017 9245
2018 9261

1.1 Counting Distinct Values in a Column

We can combine DISTINCT with aggregate functions like count(). This returns a count of the distinct values in a column.

%%sql
SELECT count(libname)
FROM libraries_2017;
 * sqlite:///dbs/w02/library.db
Done.
count(libname)
9245
%%sql
SELECT count(DISTINCT libname)
FROM libraries_2017;
 * sqlite:///dbs/w02/library.db
Done.
count(DISTINCT libname)
8455

Removing duplicates reduces the number of library names: some libraries share a name with other agencies. We will see soon how to inspect those duplicates.
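As a preview, a GROUP BY combined with HAVING count(*) > 1 is enough to surface the repeated names. A minimal sketch on an in-memory database (the sample names here are made up, not taken from the survey):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE libs (libname text)")
conn.executemany("INSERT INTO libs VALUES (?)",
                 [("OXFORD PUBLIC LIBRARY",),
                  ("OXFORD PUBLIC LIBRARY",),
                  ("ANCHOR POINT PUBLIC LIBRARY",)])

# Names that appear more than once, with their counts
dupes = conn.execute("""
    SELECT libname, count(*) AS n
    FROM libs
    GROUP BY libname
    HAVING count(*) > 1
""").fetchall()
print(dupes)  # [('OXFORD PUBLIC LIBRARY', 2)]
```

The same query pattern, run against libraries_2017, would list the 790 names shared by more than one agency.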

1.2 Finding Maximum and Minimum Values using max() and min()

The max() and min() functions return the largest and smallest values in a column. In our case, we will use them to detect some anomalies in the data. Let’s check the number of visits for 2018.

%%sql
SELECT max(visits), min(visits)
FROM libraries_2018;
 * sqlite:///dbs/w02/library.db
Done.
max(visits) min(visits)
16686945 -3

The value of -3 is unexpected. It happens that the negative values for this database indicate “not applicable”, used when a library has closed temporarily or permanently. This is a terrible way of coding missing data, but we can cope with that.

We could filter with WHERE visits > 0 to guarantee that only positive numbers are in play (note that an aggregate like min() cannot appear in a WHERE clause).

A better way of handling these situations would be to use NULL values and add a column holding text that explains why the value is NULL.
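One way to apply that idea in SQLite is nullif(), which turns a sentinel value into NULL; aggregates then skip it. A sketch with toy values (assuming -3 is the only sentinel; a CASE WHEN visits < 0 THEN NULL expression would handle any negative code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE v (visits integer)")
conn.executemany("INSERT INTO v VALUES (?)", [(100,), (-3,), (50,)])

# Recode the -3 sentinel as NULL; aggregates ignore NULLs
row = conn.execute(
    "SELECT sum(nullif(visits, -3)), min(nullif(visits, -3)) FROM v"
).fetchone()
print(row)  # (150, 50)
```

With the sentinel recoded, min() now reports the smallest genuine count instead of the missing-data marker.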

1.3 Aggregating with GROUP BY

We can use the GROUP BY clause to group rows that have the same values in specified columns into summary rows. The GROUP BY statement is often used with aggregate functions like COUNT(), SUM(), AVG(), etc., to perform operations on each group of data.

Let’s see how many libraries are in each state. But first, let’s check which states are in the database.

%%sql
SELECT stabr
FROM libraries_2018
GROUP BY stabr
ORDER BY stabr;
 * sqlite:///dbs/w02/library.db
Done.
stabr
AK
AL
AR
AS
AZ
CA
CO
CT
DC
DE
FL
GA
GU
HI
IA
ID
IL
IN
KS
KY
LA
MA
MD
ME
MI
MN
MO
MP
MS
MT
NC
ND
NE
NH
NJ
NM
NV
NY
OH
OK
OR
PA
RI
SC
SD
TN
TX
UT
VA
VI
VT
WA
WI
WV
WY

Notice there are no duplicates in the state abbreviations. These are the 50 states plus Washington, D.C., and several U.S. territories. We don’t have to worry about duplicates in this case.

We could also group by city and state, using two columns instead of one. Let’s see how that would work.

%%sql
SELECT city, stabr
FROM libraries_2018
GROUP BY city, stabr
ORDER BY city, stabr
LIMIT 20;
 * sqlite:///dbs/w02/library.db
Done.
city stabr
ABBEVILLE AL
ABBEVILLE LA
ABBEVILLE SC
ABBOTSFORD WI
ABERDEEN ID
ABERDEEN SD
ABERNATHY TX
ABILENE KS
ABILENE TX
ABINGDON IL
ABINGDON VA
ABINGTON CT
ABINGTON MA
ABINGTON PA
ABIQUIU NM
ABSECON NJ
ACCOMAC VA
ACKERMAN MS
ACKLEY IA
ACTON MA

We limit the number of rows to 20 so we can see the result; the LIMIT clause specifies the maximum number of records to return. As you can see, the results are sorted first by city name and then by state. The ORDER BY clause sorts the result set in ascending or descending order; by default, it sorts in ascending order. We could have used the DESC keyword to sort in descending order. I will leave that as an exercise for you.

By combining GROUP BY with an aggregate function, we can count the libraries in each state.

%%sql
SELECT stabr, count(*) as total
FROM libraries_2018
GROUP BY stabr
ORDER BY total DESC
LIMIT 20;
 * sqlite:///dbs/w02/library.db
Done.
stabr total
NY 756
IL 623
TX 560
IA 544
PA 451
MI 398
WI 381
MA 369
KS 325
NJ 295
ME 263
OH 251
NE 245
IN 236
NH 225
AL 224
CA 218
CT 192
TN 186
VT 184

This returns the values in the stabr column and a count of how many rows have each stabr value. The count(*) function counts all rows in each group; the GROUP BY clause groups the result set by the stabr column, and the ORDER BY clause sorts it by the library count in descending order.

Library agencies can have many branches, and not all of them provide an outlet for the public to walk in and borrow books; some are just warehouses for books not currently in use. We can see how many central and branch libraries are in each state by summing the centlib and branlib columns with the sum() function. This is left as an exercise for you.

We can also use GROUP BY to group by multiple columns. For example, we can verify how many libraries changed their addresses by state. The column stataddr contains a code indicating whether the agency’s address changed or not. The code is as follows:

  • 00 - No change
  • 07 - Moved to a new address
  • 15 - Minor address change

The code below will show how many libraries changed their address by state.

%%sql
SELECT stabr, stataddr, count(*) as total
FROM libraries_2018
GROUP BY stabr, stataddr
ORDER BY stabr, stataddr
LIMIT 20;
 * sqlite:///dbs/w02/library.db
Done.
stabr stataddr total
AK 00 82
AL 00 220
AL 07 3
AL 15 1
AR 00 58
AR 07 1
AR 15 1
AS 00 1
AZ 00 88
AZ 15 1
CA 00 180
CA 07 2
CA 15 36
CO 00 113
CT 00 190
CT 07 1
CT 15 1
DC 00 1
DE 00 21
FL 00 79

As expected, most libraries did not change their address.

📙 Visits over the years

We saw earlier that the visits column contains some negative numbers, due to the way the data was coded. Now we turn to the trends in library visits over the years, using the sum() function to get the total number of visits for each year. We discard the negative values with a WHERE clause for this analysis; sum() by itself would not ignore them, so we do have to filter them out explicitly.

%%sql
SELECT sum(visits) AS visits_2018
FROM libraries_2018
WHERE visits > 0;
 * sqlite:///dbs/w02/library.db
Done.
visits_2018
1292348697
%%sql
SELECT sum(visits) AS visits_2017
FROM libraries_2017
WHERE visits > 0;
 * sqlite:///dbs/w02/library.db
Done.
visits_2017
1319803999
%%sql
SELECT sum(visits) AS visits_2016
FROM libraries_2016
WHERE visits > 0;
 * sqlite:///dbs/w02/library.db
Done.
visits_2016
1355648987

We went from 1.36 billion visits in 2016 to 1.32 billion in 2017, and then to 1.29 billion in 2018. This is a decrease of about 5% over the three years.
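A quick sanity check of that percentage, using the totals above:

```python
# Total visits from the queries above
v2016 = 1_355_648_987
v2018 = 1_292_348_697

# Relative decrease from 2016 to 2018, in percent
pct_drop = (v2016 - v2018) / v2016 * 100
print(round(pct_drop, 1))  # 4.7
```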

But we know that the number of libraries changes as agencies close or open. To get a better idea of the trend, we will only consider libraries present in all three years. We can do this with an INNER JOIN on the key column. The code below shows how many visits were made in each year, but only for libraries that were present in all three years.

%%sql
SELECT sum(pls18.visits) AS visits_2018,
       sum(pls17.visits) AS visits_2017,
       sum(pls16.visits) AS visits_2016
FROM libraries_2018 AS pls18
    JOIN libraries_2017 AS pls17 ON pls18.fscskey = pls17.fscskey
    JOIN libraries_2016 AS pls16 ON pls18.fscskey = pls16.fscskey
WHERE pls18.visits >= 0
  AND pls17.visits >= 0
  AND pls16.visits >= 0;
 * sqlite:///dbs/w02/library.db
Done.
visits_2018 visits_2017 visits_2016
1278148838 1319325387 1355078384

The results didn’t change much. We can also look at the number of wifi sessions over the years, in the same way we did for the visits.

%%sql
SELECT sum(pls18.wifisess) AS wifisess_2018,
       sum(pls17.wifisess) AS wifisess_2017,
       sum(pls16.wifisess) AS wifisess_2016
FROM libraries_2018 AS pls18
    JOIN libraries_2017 AS pls17 ON pls18.fscskey = pls17.fscskey
    JOIN libraries_2016 AS pls16 ON pls18.fscskey = pls16.fscskey
WHERE pls18.wifisess >= 0
  AND pls17.wifisess >= 0
  AND pls16.wifisess >= 0;
 * sqlite:///dbs/w02/library.db
Done.
wifisess_2018 wifisess_2017 wifisess_2016
349767271 311336231 234926102

Even though the number of visits decreased, the number of wifi sessions increased. This suggests that people are using the library for more than just borrowing books: they are also using it as a place to study and work.

To investigate whether the decrease in visits is a trend in all regions, we must group by state. The code below shows the visits per state in each year, again only for libraries that were present in all three years.

%%sql
SELECT  pls2018.stabr,
        sum(pls2018.visits) AS visits_2018,
        sum(pls2017.visits) AS visits_2017,
        sum(pls2016.visits) AS visits_2016,
        round ( ( 
                 (sum(pls2018.visits) - sum(pls2017.visits) ) * 1.0 / sum(pls2017.visits) ) * 100, 1
               ) AS visits_ratio_2018_2017,
        round ( ( (sum(pls2017.visits) - sum(pls2016.visits) ) * 1.0 / 
                 sum(pls2016.visits) ) * 100, 1) AS visits_ratio_2017_2016
FROM libraries_2018 AS pls2018
    JOIN libraries_2017 AS pls2017 ON pls2018.fscskey = pls2017.fscskey
    JOIN libraries_2016 AS pls2016 ON pls2018.fscskey = pls2016.fscskey
WHERE pls2018.visits >= 0
  AND pls2017.visits >= 0
  AND pls2016.visits >= 0
GROUP BY pls2018.stabr
ORDER BY visits_ratio_2018_2017 DESC;
 * sqlite:///dbs/w02/library.db
Done.
stabr visits_2018 visits_2017 visits_2016 visits_ratio_2018_2017 visits_ratio_2017_2016
SD 3824804 3699212 3722376 3.4 -0.6
MT 4332900 4215484 4298268 2.8 -1.9
FL 68423689 66697122 70991029 2.6 -6.0
ND 2216377 2162189 2201730 2.5 -1.8
ID 8179077 8029503 8597955 1.9 -6.6
DC 3632539 3593201 3930763 1.1 -8.6
UT 15326963 15295494 16096911 0.2 -5.0
NH 7045010 7028800 7236567 0.2 -2.9
ME 6746380 6731768 6811441 0.2 -1.2
DE 4122181 4117904 4125899 0.1 -0.2
OK 13399265 13491194 13112511 -0.7 2.9
WY 3338772 3367413 3536788 -0.9 -4.8
MA 39926583 40453003 40427356 -1.3 0.1
WA 37338635 37916034 38634499 -1.5 -1.9
MN 22952388 23326303 24033731 -1.6 -2.9
NM 6908686 7036582 7178428 -1.8 -2.0
VA 33913162 34563079 35649602 -1.9 -3.0
KS 13483333 13737900 13699223 -1.9 0.3
NY 97921323 100012193 103081304 -2.1 -3.0
WI 30097183 30865470 31442577 -2.5 -1.8
AL 14188647 14583055 15637164 -2.7 -6.7
MI 44758918 46052561 46734166 -2.8 -1.5
CO 31085356 31975615 32011432 -2.8 -0.1
NJ 40947978 42181061 42429576 -2.9 -0.6
CA 146656984 151056672 155613529 -2.9 -2.9
CT 20423515 21051597 21603777 -3.0 -2.6
RI 5490076 5669309 5778025 -3.2 -1.9
PA 40885876 42243049 44105513 -3.2 -4.2
OR 19592295 20244499 20391927 -3.2 -0.7
IN 30836051 31849195 33363879 -3.2 -4.5
IA 16674976 17245764 17753953 -3.3 -2.9
NV 9334070 9684935 9733359 -3.6 -0.5
NE 7449868 7726127 7873829 -3.6 -1.9
VT 3526357 3673501 3721332 -4.0 -1.3
SC 13989511 14567585 15802934 -4.0 -7.8
AK 3268073 3402486 3467234 -4.0 -1.9
IL 63466887 66166082 67336230 -4.1 -1.7
NC 31263894 32621293 33605264 -4.2 -2.9
MD 24976429 26089963 27481583 -4.3 -5.1
AZ 23439707 24584201 25315276 -4.7 -2.9
OH 68176967 71895854 74119719 -5.2 -3.0
WV 4944242 5231251 5231443 -5.5 -0.0
MO 24663467 26117633 27065546 -5.6 -3.5
KY 16910828 17909495 18028488 -5.6 -0.7
LA 16227594 17211007 20262385 -5.7 -15.1
TX 66168387 70514138 70975901 -6.2 -0.7
TN 18102460 19396554 18701973 -6.7 3.7
GA 26835701 28816233 27987249 -6.9 3.0
AR 9551686 10358181 10596035 -7.8 -2.2
GU 75119 81572 71813 -7.9 13.6
MS 7602710 8581994 8915406 -11.4 -3.7
HI 3456131 4135229 4490320 -16.4 -7.9
AS 48828 67848 63166 -28.0 7.4
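A note on the `* 1.0` factor in the ratio expressions above: SQLite performs integer division when both operands are integers, so without it the percentage would truncate to zero. A minimal demonstration, using Florida’s 2018 and 2017 totals from the table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Integer / integer truncates; multiplying by 1.0 forces float division
int_div, float_div = conn.execute(
    "SELECT (68423689 - 66697122) / 66697122, "
    "(68423689 - 66697122) * 1.0 / 66697122"
).fetchone()
print(int_div)                    # 0
print(round(float_div * 100, 1))  # 2.6, Florida's ratio above
```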

1.4 Filtering Groups with HAVING

The HAVING clause is used to filter records that work on summarized GROUP BY results. It is similar to the WHERE clause, but it is used for filtering groups instead of individual rows.
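The difference is easy to see on a toy table: WHERE filters individual rows before grouping, while HAVING filters the groups afterwards. A sketch with made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (state text, visits integer)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("AK", 10), ("AK", -3), ("AL", 40), ("AL", 25)])

# WHERE drops the invalid row first; HAVING keeps only large groups
rows = conn.execute("""
    SELECT state, sum(visits) AS total
    FROM t
    WHERE visits >= 0          -- row-level filter
    GROUP BY state
    HAVING sum(visits) > 20    -- group-level filter
""").fetchall()
print(rows)  # [('AL', 65)]
```

AK’s group survives the WHERE filter with a total of 10 but is then discarded by HAVING, exactly the two-stage filtering the query below applies to the state totals.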

Let’s try to compare libraries located in states with similar characteristics. For that, we will use HAVING to place conditions on groups created by aggregating.

%%sql
SELECT pls2018.stabr,
sum(pls2018.visits) AS visits_2018,
sum(pls2017.visits) AS visits_2017,
sum(pls2016.visits) AS visits_2016,
round ( ( 
         (sum(pls2018.visits) - sum(pls2017.visits) ) * 1.0 / sum(pls2017.visits) ) * 100, 1
       ) AS visits_ratio_2018_2017,
round ( ( (sum(pls2017.visits) - sum(pls2016.visits) ) * 1.0 / 
         sum(pls2016.visits) ) * 100, 1) AS visits_ratio_2017_2016
FROM libraries_2018 AS pls2018
    JOIN libraries_2017 AS pls2017 ON pls2018.fscskey = pls2017.fscskey
    JOIN libraries_2016 AS pls2016 ON pls2018.fscskey = pls2016.fscskey
WHERE pls2018.visits >= 0
  AND pls2017.visits >= 0
  AND pls2016.visits >= 0
GROUP BY pls2018.stabr
HAVING sum(pls2018.visits) > 50000000
ORDER BY visits_ratio_2018_2017 DESC;
 * sqlite:///dbs/w02/library.db
Done.
stabr visits_2018 visits_2017 visits_2016 visits_ratio_2018_2017 visits_ratio_2017_2016
FL 68423689 66697122 70991029 2.6 -6.0
NY 97921323 100012193 103081304 -2.1 -3.0
CA 146656984 151056672 155613529 -2.9 -2.9
IL 63466887 66166082 67336230 -4.1 -1.7
OH 68176967 71895854 74119719 -5.2 -3.0
TX 66168387 70514138 70975901 -6.2 -0.7

We are including only groups whose 2018 visits sum to more than 50 million. Six states meet this arbitrary threshold, and only Florida experienced an increase in visits from 2017 to 2018.