First, I will load the require library tidyverse and CSV file to R and display a first few rows to inspect the dataset.
# Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load CSV file to R
setwd("/Users/tampham/Desktop")
getwd()
## [1] "/Users/tampham/Desktop"
AQ <- read.csv("Air_Quality.csv")
# Display first few rows of dataset
head(AQ)
## Unique.ID Indicator.ID Name Measure Measure.Info
## 1 336867 375 Nitrogen dioxide (NO2) Mean ppb
## 2 336741 375 Nitrogen dioxide (NO2) Mean ppb
## 3 550157 375 Nitrogen dioxide (NO2) Mean ppb
## 4 412802 375 Nitrogen dioxide (NO2) Mean ppb
## 5 412803 375 Nitrogen dioxide (NO2) Mean ppb
## 6 412676 375 Nitrogen dioxide (NO2) Mean ppb
## Geo.Type.Name Geo.Join.ID Geo.Place.Name
## 1 CD 407 Flushing and Whitestone (CD7)
## 2 CD 107 Upper West Side (CD7)
## 3 CD 414 Rockaway and Broad Channel (CD14)
## 4 CD 407 Flushing and Whitestone (CD7)
## 5 CD 407 Flushing and Whitestone (CD7)
## 6 CD 107 Upper West Side (CD7)
## Time.Period Start_Date Data.Value Message
## 1 Winter 2014-15 12/01/2014 23.97 NA
## 2 Winter 2014-15 12/01/2014 27.42 NA
## 3 Annual Average 2017 01/01/2017 12.55 NA
## 4 Winter 2015-16 12/01/2015 22.63 NA
## 5 Summer 2016 06/01/2016 14.00 NA
## 6 Winter 2015-16 12/01/2015 26.43 NA
Then, I check for structure of the dataset and missing values. At first, I notice that the Message column contained NA values for all 18862 rows.
# Check structure of datatset
str(AQ)
## 'data.frame': 18862 obs. of 12 variables:
## $ Unique.ID : int 336867 336741 550157 412802 412803 412676 412677 603044 412804 825832 ...
## $ Indicator.ID : int 375 375 375 375 375 375 375 375 375 375 ...
## $ Name : chr "Nitrogen dioxide (NO2)" "Nitrogen dioxide (NO2)" "Nitrogen dioxide (NO2)" "Nitrogen dioxide (NO2)" ...
## $ Measure : chr "Mean" "Mean" "Mean" "Mean" ...
## $ Measure.Info : chr "ppb" "ppb" "ppb" "ppb" ...
## $ Geo.Type.Name : chr "CD" "CD" "CD" "CD" ...
## $ Geo.Join.ID : int 407 107 414 407 407 107 107 314 407 107 ...
## $ Geo.Place.Name: chr "Flushing and Whitestone (CD7)" "Upper West Side (CD7)" "Rockaway and Broad Channel (CD14)" "Flushing and Whitestone (CD7)" ...
## $ Time.Period : chr "Winter 2014-15" "Winter 2014-15" "Annual Average 2017" "Winter 2015-16" ...
## $ Start_Date : chr "12/01/2014" "12/01/2014" "01/01/2017" "12/01/2015" ...
## $ Data.Value : num 24 27.4 12.6 22.6 14 ...
## $ Message : logi NA NA NA NA NA NA ...
# Check missing value
colSums(is.na(AQ))
## Unique.ID Indicator.ID Name Measure Measure.Info
## 0 0 0 0 0
## Geo.Type.Name Geo.Join.ID Geo.Place.Name Time.Period Start_Date
## 0 0 0 0 0
## Data.Value Message
## 0 18862
I decide to drop this column and create the new data frame AQ1 without the Message column.
# New dataframe, Message column has been removed
AQ1 <- AQ[ , colSums(is.na(AQ))==0]
I loaded the required libraries:
I use ggplot() to build the chart and convert it to an interactive plot using ggplotly().
# Bar Chart of indicators
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(forcats)
p0 <-ggplot(AQ1, aes(y = fct_infreq(Name))) + geom_bar() +
labs(title = "Bar Chart of Indicators", y = "", x = "Count") + theme_light()
ggplotly(p0)
This bar chart displays the indicators used to assess New York Air Quality. The x-axis represents the number of recorded measurements, while the y-axis lists various types of air quality indicators, including:
The most frequently recorded indicators are Nitrogen Dioxide (NO₂) and Fine Particles (PM2.5), each with 6,345 counts, followed by Ozone (O₃) with 2,115 counts.
# Histrogram of indicators measurement
ggplot(AQ1, aes(x=Data.Value)) + geom_histogram( fill="cyan", color = "black", binwidth = 5) +
labs(title = "Histogram of Data Value", x="Pollutant Measurement", y = "Count")
This chart shows how often different air pollution levels were measured. Most of the measurements are low—especially between 0 and 10—and occurred over 6,000 times. A few higher values, up to 400, were also recorded but are much less common. This pattern tells us that low pollutant levels are the most frequent, while high levels are rare but still present in the air quality data.
# Bar Chart of Geography Type
ggplot(AQ1,aes(Geo.Type.Name, fill =Measure)) + geom_bar(color=gray(0.55)) +
labs(title = "Bar Chart of Geography Type", x= "Geo Type Name", y ="Count") +
scale_fill_brewer(palette = "Set2")
This stacked bar chart shows how many air quality measurements were recorded across different geographic areas in New York City:
The tallest bars are from UHF42 and CD, meaning these areas have the most data recorded. The “Mean” measure (shown in yellow) appears most often across all areas.
The meaning of each Geo Type Name:
This chart helps us understand where air quality data is most frequently collected and what type of measurements are most common.
To perform Time Series Analysis of Ozone (O3), Nitrogen dioxide (NO2), and Fine Particles (PM2.5), we need to convert Start Date (character format) to (date format).
# Convert Start Date column to Date format
AQ1$Start_Date <- as.Date(AQ1$Start_Date, format = "%m/%d/%Y")
str(AQ1)
## 'data.frame': 18862 obs. of 11 variables:
## $ Unique.ID : int 336867 336741 550157 412802 412803 412676 412677 603044 412804 825832 ...
## $ Indicator.ID : int 375 375 375 375 375 375 375 375 375 375 ...
## $ Name : chr "Nitrogen dioxide (NO2)" "Nitrogen dioxide (NO2)" "Nitrogen dioxide (NO2)" "Nitrogen dioxide (NO2)" ...
## $ Measure : chr "Mean" "Mean" "Mean" "Mean" ...
## $ Measure.Info : chr "ppb" "ppb" "ppb" "ppb" ...
## $ Geo.Type.Name : chr "CD" "CD" "CD" "CD" ...
## $ Geo.Join.ID : int 407 107 414 407 407 107 107 314 407 107 ...
## $ Geo.Place.Name: chr "Flushing and Whitestone (CD7)" "Upper West Side (CD7)" "Rockaway and Broad Channel (CD14)" "Flushing and Whitestone (CD7)" ...
## $ Time.Period : chr "Winter 2014-15" "Winter 2014-15" "Annual Average 2017" "Winter 2015-16" ...
## $ Start_Date : Date, format: "2014-12-01" "2014-12-01" ...
## $ Data.Value : num 24 27.4 12.6 22.6 14 ...
AQ_NO2 is a new data frame created from AQ1 where Name is filtered by Nitrogen dioxide (NO2). The Date x-axis has been divided by every two years and have year format.
# Time Series Analysis of NO2
library(plotly)
AQ_NO2 <- AQ1[AQ1$Name == "Nitrogen dioxide (NO2)", ]
p <- ggplot(AQ_NO2, aes(Start_Date,Data.Value)) + geom_line(color = "steelblue") +
labs(title = "Time Series of NO2", x = "Date", y ="NO2") +
scale_x_date(date_breaks = "2 years", date_labels = "%Y") +
geom_smooth(method = "loess", se = FALSE, linetype = "dashed") + theme_minimal()
ggplotly(p)
## `geom_smooth()` using formula = 'y ~ x'
This time series plot displays the average levels of Nitrogen Dioxide (NO₂) in New York from 2010 to 2023. The data show a clear downward trend, with NO₂ levels gradually decreasing over the years. Most values fall between 10 and 30, with occasional peaks, but the overall pattern suggests improving air quality. This steady decline may reflect the impact of cleaner vehicle technologies, stricter emission regulations, and other air quality improvement efforts in the city.
AQ_FP is a new data frame created from AQ1 where Name is filtered by Fine particles (PM 2.5). The Date x-axis has been divided by every two years and have year format.
# Time Series Analysis of Fine Particles
AQ_FP <- AQ1[AQ1$Name == "Fine particles (PM 2.5)", ]
p1 <- ggplot(AQ_FP, aes(Start_Date,Data.Value)) + geom_line(color = "purple") +
labs(title = "Time Series of Fine Particles", x = "Date", y ="Fine particles (PM 2.5)") +
scale_x_date(date_breaks = "2 years", date_labels = "%Y") +
geom_smooth(method = "loess", se = FALSE, linetype = "dashed") + theme_minimal()
ggplotly(p1)
## `geom_smooth()` using formula = 'y ~ x'
This time series plot shows the average levels of Fine Particles (PM2.5) in New York from 2010 to 2023. The overall trend is downward, indicating that PM2.5 levels have gradually decreased over time—suggesting an improvement in air quality. Most values fall between 7 and 12, with occasional peaks. However, there is a slight increase in 2023, which may indicate a potential reversal or temporary rise in PM2.5 levels.
AQ_O is a new data frame created from AQ1 where Name is filtered by Ozone (O3). The Date x-axis has been divided by every two years and have year format.
# Time Series Analysis of Ozone
AQ_O <- AQ1[AQ1$Name == "Ozone (O3)", ]
p2 <- ggplot(AQ_O, aes(Start_Date,Data.Value)) + geom_line(color = "red") +
labs(title = "Time Series of Ozone", x = "Date", y ="Ozone (O3)") +
scale_x_date(date_breaks = "2 years", date_labels = "%Y") +
geom_smooth(method = "loess", se = FALSE, linetype = "dashed") + theme_minimal()
ggplotly(p2)
## `geom_smooth()` using formula = 'y ~ x'
This time series plot shows the average levels of Ozone (O₃) in New York from 2010 to 2023. The overall trend is upward, indicating that Ozone levels have gradually increased over time. Most values fall between 27 and 35, with occasional peaks. This rising pattern may suggest growing concern for ozone-related air quality in recent years.
top_NO2 is a new data frame created from AQ_NO2, grouped by geographic place name and summarized by the average NO₂ levels. The data is then filtered to show the top 10 locations with the highest averages. In the plot, the x- and y-axes are flipped for readability, and the y-axis is marked at intervals of 5 units.
# Summarize average NO2 by Geo Place Name (top 10)
top_NO2 <- AQ_NO2 %>%
group_by(Geo.Place.Name) %>%
summarise(Avg_NO2 = mean(Data.Value, na.rm = TRUE)) %>% # calculate mean of Data Value excluding NA (na.rm = TRUE)
arrange(desc(Avg_NO2)) %>%
slice_head(n = 10)
# Bar chart
p3 <- ggplot(top_NO2, aes(x = reorder(Geo.Place.Name, Avg_NO2), y = Avg_NO2, fill = Geo.Place.Name)) +
geom_bar(stat = "identity") + scale_fill_brewer(palette="Set3") +
scale_y_continuous(breaks = seq(0, max(top_NO2$Avg_NO2), by = 5)) + # break y axis every 5 units
coord_flip() + # flip between x and y axis
labs(title = "Top 10 Neighborhoods with Highest Avg NO2",
x = "",
y = "Average NO2") +
theme_minimal() + theme(legend.position = "none")
ggplotly(p3)
This bar chart highlights the top 10 neighborhoods in New York City with the highest average levels of Nitrogen Dioxide (NO₂), a common air pollutant. The bars are arranged from highest to lowest, making it easy to identify the areas with the most pollution. Midtown (CD5) ranks the highest with an average NO₂ level of 33.95, followed by Gramercy Park – Murray Hill at 31.66, and Chelsea – Clinton at 29.98. This visualization helps pinpoint neighborhoods that may face more serious air quality issues and could benefit from targeted environmental improvement efforts.
top_FP is a new data frame created from AQ_FP, grouped by geographic place name and summarized by the average fine particles levels. The data is then filtered to show the top 10 locations with the highest averages. In the plot, the x- and y-axes are flipped for readability, and the y-axis is marked at intervals of 5 units.
# Summarize average Fine Particles (PM 2.5) by Geo Place Name (top 10)
top_FP <- AQ_FP %>%
group_by(Geo.Place.Name) %>%
summarize(Avg_FP =mean(Data.Value, na.rm = TRUE)) %>%
arrange(desc(Avg_FP)) %>%
slice_head(n = 10)
# Bar Chart
p4 <- ggplot(top_FP, aes(x = reorder(Geo.Place.Name, Avg_FP), y = Avg_FP, fill = Geo.Place.Name)) +
geom_bar(stat ="identity") + scale_fill_brewer(palette="Paired") +
scale_y_continuous(breaks = seq(0, max(top_FP$Avg_FP), by = 5)) +
coord_flip() +
labs(title = "Top 10 Neighborhoods With Highest Average FP",
x = "",
y = "Average Fine Particles (PM 2.5)") +
theme_minimal() + theme(legend.position = "none")
ggplotly(p4)
This bar chart shows the top 10 neighborhoods in New York City with the highest average levels of fine particles (PM 2.5), a common air pollutant. The bars are ordered from highest to lowest, helping to easily identify the most affected areas. Midtown (CD5) has the highest average PM 2.5 level at 12.79, followed by Gramercy Park – Murray Hill (11.86), and Chelsea – Clinton (11.57). This visualization highlights neighborhoods that may face more serious air quality challenges and could be priorities for environmental improvement efforts.
Midtown (CD5), Gramercy Park – Murray Hill, and Chelsea – Clinton appear in the top three for both Nitrogen Dioxide (NO₂) and Fine Particles (PM 2.5), suggesting these areas may benefit most from targeted air quality interventions.
top_O3 is a new data frame created from AQ_O, grouped by geographic place name and summarized by the average Ozone (O3). The data is then filtered to show the top 10 locations with the highest averages. In the plot, the x- and y-axes are flipped for readability, and the y-axis is marked at intervals of 5 units.
# Summarize average Ozone (O3) by Geo Place Name (top 10)
top_O3 <- AQ_O %>%
group_by(Geo.Place.Name) %>%
summarise(Avg_O3 =mean(Data.Value, na.rm = TRUE)) %>%
arrange(desc(Avg_O3)) %>%
slice_head(n = 10)
# Bar Chart
p5 <-ggplot(top_O3, aes(x = reorder(Geo.Place.Name, Avg_O3), y = Avg_O3, fill = Geo.Place.Name)) +
geom_bar(stat ="identity") + scale_fill_viridis_d(option = "plasma") +
scale_y_continuous(breaks = seq(0, max(top_O3$Avg_O3), by = 5)) +
coord_flip() +
labs(title = "Top 10 Neighborhoods With Highest Avg O3",
x = "",
y = "Average Ozone (O3)") +
theme_minimal() + theme(legend.position = "none")
ggplotly(p5)
The earlier time series plot showed an upward trend in ozone (O₃) levels across New York City.
This bar chart highlights the top 10 neighborhoods in New York City with the highest average ozone (O₃) levels. The bars are arranged from highest to lowest to allow easy comparison. Rockaway and Broad Channel (CD14) ranks the highest with an average level of 37.13, followed by Rockaways at 37.04 and Coney Island (CD13) at 35.02. This visualization helps identify neighborhoods that may face greater air quality issues and could be priorities for targeted environmental improvement efforts.
AQ_annual is the new data frame created from AQ1, where data is filtered for all information related to annual vehicle miles. top_Annual is a new data frame created from AQ_Annual, grouped by geographic place name and summarized by the average annual vehicle miles. The data is then filtered to show the top 5 locations with the highest averages. In the plot, the x- and y-axes are flipped for readability.
# Filter out data with Annual Vehicles Miles
Annual_vehicle <- c("Annual vehicle miles traveled", "Annual vehicle miles traveled (cars)", "Annual vehicle miles traveled (trucks)")
AQ_Annual <- AQ1[AQ1$Name %in% Annual_vehicle, ]
# Summarize average Annual Vehicle Miles by Geo Place Name (top 5)
top_Annual <- AQ_Annual %>%
group_by(Geo.Place.Name) %>%
summarise(Avg_Annual =mean(Data.Value, na.rm = TRUE)) %>%
arrange(desc(Avg_Annual)) %>%
slice_head(n = 5)
# Bar Chart
p6 <-ggplot(top_Annual, aes(x = reorder(Geo.Place.Name, Avg_Annual), y = Avg_Annual, fill = Geo.Place.Name)) +
geom_bar(stat ="identity") + scale_fill_viridis_d("A") +
coord_flip() +
labs(title = "Top 5 Neighborhoods of Avg Annual Miles",
x = "",
y = "Average Annual Miles (millions of miles)") +
theme_minimal() + theme(legend.position = "none")
ggplotly(p6)
This bar chart highlights the top 5 neighborhoods in New York City with the highest average annual vehicle miles traveled. The bars are arranged from highest to lowest for easy comparison. Stuyvesant Town and Turtle Bay (CD6) ranks first with an average of 119.24 million miles per year, followed by Gramercy Park – Murray Hill at 118.81 million and Upper East Side (CD8) at 91.54 million. These high travel volumes may contribute to more traffic-related emissions and could be key areas to consider for transportation and environmental planning.
AQ_Health is the new data frame created from AQ1, where data is filtered for all information related to negative effects of Ozone (O₃) and Fine Particles (PM2.5) to health. Avg_Health is a new data frame created from AQ_Health, grouped by Name and summarized by the average annual rate of negative effects per 100,000 people. A new column has been added to the Avg_Health data frame to indicate pollutant types (Ozone or PM 2.5).
# Filter out data with negative effects of Ozone (O3) and Fine Particles (PM2.5) to health
Health <- c("Asthma emergency department visits due to PM2.5", "Respiratory hospitalizations due to PM2.5 (age 20+)", "Asthma hospitalizations due to Ozone", "Cardiovascular hospitalizations due to PM2.5 (age 40+)", "Deaths due to PM2.5", "Cardiac and respiratory deaths due to Ozone", "Asthma emergency departments visits due to Ozone")
AQ_Health <- AQ1[AQ1$Name %in% Health, ]
# Summarize average annual rate of negative effects per 100,000 children or adults.
Avg_Health <- AQ_Health %>%
group_by(Name) %>%
summarise(Avg_Annual_Rate =mean(Data.Value, na.rm = TRUE)) %>%
arrange(desc(Avg_Annual_Rate))
# Adding a new column per pollutant types (Ozone or PM2.5)
Avg_Health <- Avg_Health %>%
mutate(Pollutant = case_when(
grepl("Ozone", Name) ~ "Ozone",
grepl("PM2.5", Name) ~ "PM2.5",
TRUE ~ "Other"
))
To distinguish between different pollutant types (Ozone or PM 2.5), the Pollutant column is converted to a factor. This enables the use of scale_fill_manual() in the dot plot to assign different colors based on pollutant type. In the plot, the x-axis is scaled with breaks every 10 units to enhance readability.
# Convert Pollutant to factors
Avg_Health$Pollutant <- factor(Avg_Health$Pollutant, levels = c("Ozone", "PM2.5"))
# Dot Plot
p7 <- ggplot(Avg_Health, aes(x = Avg_Annual_Rate, y = reorder(Name, Avg_Annual_Rate))) +
geom_point(aes(fill = Pollutant), shape = 21, color = "black", size = 3) +
scale_fill_manual(values = c("Ozone" = "violet", "PM2.5" = "cyan")) +
scale_x_continuous(breaks = seq(0, max(Avg_Health$Avg_Annual_Rate), by = 10)) +
labs(title = "Negative Effects of Ozone and PM2.5 to Health",
x = "Average Annual Rate per 100,000 people", y = "") +
theme_minimal() +
theme(plot.title = element_text(hjust = 1)) +
theme(legend.title = element_blank(), legend.position = "bottom")
ggplotly(p7)
This dot plot illustrates the average annual rates of health impacts linked to air pollutants Ozone and PM 2.5, measured per 100,000 people. Each dot represents a specific health outcome and is colored by pollutant type to help distinguish between them. The points are ordered from highest to lowest to make comparison easier.
Asthma emergency department visits due to Ozone have the highest average rate at 71.87, followed by visits caused by PM 2.5 at 65.85, and deaths related to PM 2.5 at 46.08. These results highlight the significant health burden associated with air pollution.
In the next section, I will further explore the top ten neighborhoods with the highest average annual rates of asthma emergency visits and deaths due to Ozone and PM 2.5.
Asthma_O is the new data frame created from AQ_Health, where data is filtered for asthma emergency visits due to Ozone, grouped by Name and Geo.Place.Name, and summarized by average annual rate per 100,000 people. The data is then filtered to show the top 10 locations with the highest averages. In the plot, the x-axis is scaled with breaks every 10 units to enhance readability.
# Filter out top 10 neighborhood names with Asthma emergency departments visits due to Ozone
Asthma_O <- AQ_Health %>%
filter(Name == "Asthma emergency departments visits due to Ozone") %>%
group_by(Name, Geo.Place.Name) %>%
summarise(Avg_Annual_Rate = mean(Data.Value, na.rm = TRUE), .groups = "drop") %>%
arrange(desc(Avg_Annual_Rate)) %>%
slice_head(n = 10)
# Dot Plot
p8 <-ggplot(Asthma_O, aes(x = Avg_Annual_Rate, y= reorder(Geo.Place.Name, Avg_Annual_Rate))) +
geom_point(aes(fill = Geo.Place.Name), shape = 21, color = "black", size = 3) +
scale_fill_viridis_d(option = "inferno") +
scale_x_continuous(breaks = seq(0, max(Asthma_O$Avg_Annual_Rate), by = 10)) +
labs(title ="Top 10 Neighborhoods with Highest Avg Asthma ER Visits due to Ozone",
x = "Average Annual Rate per 100,000 people", y = "") +
theme_minimal() + theme(legend.position = "none") +
theme(plot.title = element_text(hjust = 1))
ggplotly(p8)
This dot plot highlights the top 10 neighborhoods in New York City with the highest average annual rates of asthma emergency visits attributed to Ozone exposure, measured per 100,000 people. The neighborhoods are ranked from highest to lowest to facilitate comparison. East Harlem has the highest average rate at 194.12, followed by Central Harlem – Morningside Heights at 189.88, and Hunts Point – Mott Haven at 167.24.
Asthma_PM is the new data frame created from AQ_Health, where data is filtered for asthma emergency visits due to Ozone, grouped by Name and Geo.Place.Name, and summarized by average annual rate per 100,000 people. The data is then filtered to show the top 10 locations with the highest averages. In the plot, the x-axis is scaled with breaks every 10 units to enhance readability.
# Filter out top 10 neighborhood names with Asthma emergency departments visits due to PM2.5
Asthma_PM <- AQ_Health %>%
filter(Name == "Asthma emergency department visits due to PM2.5") %>%
group_by(Name, Geo.Place.Name) %>%
summarise(Avg_Annual_Rate = mean(Data.Value, na.rm = TRUE), .groups = "drop") %>%
arrange(desc(Avg_Annual_Rate)) %>%
slice_head(n = 10)
# Dot Plot
p9 <-ggplot(Asthma_PM, aes(x = Avg_Annual_Rate, y= reorder(Geo.Place.Name, Avg_Annual_Rate))) +
geom_point(aes(fill = Geo.Place.Name), shape = 21, color = "black", size = 3) +
scale_fill_viridis_d(option = "plasma") +
scale_x_continuous(breaks = seq(0, max(Asthma_PM$Avg_Annual_Rate), by = 10)) +
labs(title ="Top 10 Neighborhoods with Highest Avg Asthma ER Visits due to PM2.5",
x = "Average Annual Rate per 100,000 people", y = "") +
theme_minimal() + theme(legend.position = "none") +
theme(plot.title = element_text(hjust = 1))
ggplotly(p9)
This dot plot highlights the top 10 neighborhoods in New York City with the highest average annual rates of asthma emergency visits attributed to PM2.5 exposure, measured per 100,000 people. The neighborhoods are ranked from highest to lowest to facilitate comparison. East Harlem shows the highest average rate at 195.15, followed by Central Harlem – Morningside Heights at 193.35, and Hunts Point – Mott Haven at 175.02.
When comparing this chart with the previous one (focused on Ozone-related asthma visits), the same three neighborhoods—East Harlem, Central Harlem – Morningside Heights, and Hunts Point – Mott Haven have consistently ranked at the top. This suggests these areas are especially affected by both PM2.5 and Ozone in terms of asthma-related health issues.
Death_PM is the new data frame created from AQ_Health, where data is filtered for deaths due to PM2.5, grouped by Name and Geo.Place.Name, and summarized by average annual rate per 100,000 people. The data is then filtered to show the top 10 locations with the highest averages. In the plot, the x-axis is scaled with breaks every 5 units to enhance readability.
# Filter out top 10 neighborhood names with deaths due to PM2.5
Death_PM <- AQ_Health %>%
filter(Name == "Deaths due to PM2.5") %>%
group_by(Name, Geo.Place.Name) %>%
summarise(Avg_Annual_Rate = mean(Data.Value, na.rm = TRUE), .groups = "drop") %>%
arrange(desc(Avg_Annual_Rate)) %>%
slice_head(n = 10)
# Dot Plot
p10 <-ggplot(Death_PM, aes(x = Avg_Annual_Rate, y= reorder(Geo.Place.Name, Avg_Annual_Rate))) +
geom_point(aes(fill = Geo.Place.Name), shape = 21, color = "black", size = 3) +
scale_fill_viridis_d(option = "rocket") +
scale_x_continuous(breaks = seq(0, max(Death_PM$Avg_Annual_Rate), by = 5)) +
labs(title ="Top 10 Neighborhoods With Highest Avg Deaths due to PM2.5",
x = "Average Annual Rate per 100,000 adults/ children", y = "") +
theme_minimal() + theme(legend.position = "none")
ggplotly(p10)
This dot plot shows the top 10 neighborhoods in New York City with the highest average annual death rates linked to PM2.5 exposure, measured per 100,000 people. The neighborhoods are sorted from highest to lowest to make comparison easier. Kingsbridge – Riverdale ranks highest at 75.53, followed by East Harlem at 65.49, and Rockaways at 62.92.
Kingsbridge – Riverdale appears here for the first time, indicating a high rate of deaths due to PM2.5 but not a high rate of asthma emergency visits. In contrast, East Harlem ranks second here and has consistently appeared at the top in previous charts for asthma-related emergencies due to both Ozone and PM2.5, suggesting a broader impact of air pollution on health in that area.
Boiler emissions refer to the pollutants released into the atmosphere when fuel is burned in a boiler to produce heat. These pollutants include particulate matter, carbon monoxide, sulfur dioxide, nitrogen oxides (NOx), volatile organic compounds (VOCs), and greenhouse gases like carbon dioxide. Effective control of these emissions is crucial for protecting air quality and public health. These Boiler Emissions measurement have been included in the data set: Particulate Matter (PM), Sulfur Dioxide (SO2), Nitrogen Oxides (NOx) for year of 2013 and 2015.
BE is the new data frame created from AQ1 for all information related to Boiler Emissions - total NOx, SO2 and PM 2.5 Emissions.
# Filter out Boiler Emissions data
Boiler <-c("Boiler Emissions- Total NOx Emissions", "Boiler Emissions- Total SO2 Emissions", "Boiler Emissions- Total PM2.5 Emissions")
BE <- AQ1[AQ1$Name %in% Boiler, ]
# Box Plot
p11 <- ggplot(BE, aes(x=Name, y=Data.Value, color=Name)) +
geom_boxplot() +
labs(title = "Boiler Emissions of NOx, SO2, PM2.5", x="", y="Number per km2") +
theme_minimal() + theme(legend.position = "none") +
theme(plot.title = element_text(hjust = 0.5))+
theme(axis.text.x = element_text(angle = 10, hjust = 0.5))
ggplotly(p11)
This box plot visualizes total boiler emissions of NOx, PM2.5, and SO2, represented by red, green, and blue colors respectively.
NOx emissions range from a minimum of 2 to a maximum of 284.40, with a median of 27.80 and an upper fence of 132.50. There are many high outliers. PM2.5 emissions have much smaller values overall, ranging from approximately 4.10 to 11.40. SO2 emissions range from 0 to a maximum of 99.70, with a median of 1.95 and an upper fence of 26.80. Similar to NOx, there are many outliers in SO2 data.
This plot highlights the differences in emission levels and variability among the three pollutants.
top_NOx_E is the new data frame created from BE, where data is filtered for total NOx Emissions, grouped by Name and Geo.Place.Name, and summarized by average total NOx Emissions. The data is then filtered to show the top 10 neighborhoods with the highest averages. In the plot, the x-axis is scaled with breaks every 20 units to enhance readability.
# Summarize Average Number per km2 for NOx Emissions in 2013 and 2015 (top 10)
top_NOx_E <- BE %>%
filter(Name == "Boiler Emissions- Total NOx Emissions") %>%
group_by(Name, Geo.Place.Name) %>%
summarise(Avg_Emissions = mean(Data.Value, na.rm = TRUE), .groups = "drop") %>%
arrange(desc(Avg_Emissions)) %>%
slice_head(n = 10)
# Dot Plot
p12 <- ggplot(top_NOx_E, aes(x = Avg_Emissions, y= reorder(Geo.Place.Name, Avg_Emissions))) +
geom_point(aes(fill = Geo.Place.Name), shape = 21, color = "black", size = 3) +
scale_x_continuous(breaks = seq(0, max(top_NOx_E$Avg_Emissions), by = 20)) +
labs(title ="Top 10 Neighborhoods With Highest Avg NOx",
x = "Average Emissions (Number per km2)", y = "") +
theme_minimal() + theme(legend.position = "none")
ggplotly(p12)
This dot plot displays the top 10 neighborhoods in New York City with the highest average total NOx emissions per square kilometer. Neighborhoods are ranked from highest to lowest to support visual comparison. Gramercy Park – Murray Hill has the highest average at 270.45, followed by Upper East Side at 247.85, and Upper West Side at 229.20.
top_SO is the new data frame created from BE, where data is filtered for total SO2 Emissions, grouped by Name and Geo.Place.Name, and summarized by average total SO2 Emissions. The data is then filtered to show the top 10 neighborhoods with the highest averages. In the plot, the x-axis is scaled with breaks every 10 units to enhance readability.
# Summarize Average Number per km2 for SO2 Emissions in 2013 and 2015 (top 10)
top_SO <- BE %>%
filter(Name == "Boiler Emissions- Total SO2 Emissions") %>%
group_by(Name, Geo.Place.Name) %>%
summarise(Avg_Emissions = mean(Data.Value, na.rm = TRUE), .groups = "drop") %>%
arrange(desc(Avg_Emissions)) %>%
slice_head(n = 10)
# Dot Plot
p13 <- ggplot(top_SO, aes(x = Avg_Emissions, y= reorder(Geo.Place.Name, Avg_Emissions))) +
geom_point(aes(fill = Geo.Place.Name), shape = 21, color = "black", size = 3) +
scale_x_continuous(breaks = seq(0, max(top_SO$Avg_Emissions), by = 10)) +
labs(title ="Top 10 Neighborhoods With Highest Avg SO2",
x = "Average Emissions (Number per km2)", y = "") +
theme_minimal() + theme(legend.position = "none")
ggplotly(p13)
This dot plot shows the top 10 neighborhoods in New York City with the highest average total SO₂ emissions per square kilometer. Neighborhoods are ranked from highest to lowest to facilitate visual comparison. Upper West Side has the highest average at 75.30, followed by Upper East Side at 67.20, and Gramercy Park – Murray Hill at 60.15.
top_PM is the new data frame created from BE, where data is filtered for total PM2.5 Emissions, grouped by Name and Geo.Place.Name, and summarized by average total PM2.5 Emissions. The data is then filtered to show the top 10 neighborhoods with the highest averages.
# Summarize Average Number per km2 for PM 2.5 Emissions in 2013 and 2015 (top 10)
top_PM <- BE %>%
filter(Name == "Boiler Emissions- Total PM2.5 Emissions") %>%
group_by(Name, Geo.Place.Name) %>%
summarise(Avg_Emissions = mean(Data.Value, na.rm = TRUE), .groups = "drop") %>%
arrange(desc(Avg_Emissions)) %>%
slice_head(n = 10)
# Dot Plot
p14 <- ggplot(top_PM, aes(x = Avg_Emissions, y= reorder(Geo.Place.Name, Avg_Emissions))) +
geom_point(aes(fill = Geo.Place.Name), shape = 21, color = "black", size = 3) +
labs(title ="Top 10 Neighborhoods With Highest Avg PM2.5",
x = "Average Emissions (Number per km2)", y = "") +
theme_minimal() + theme(legend.position = "none")
ggplotly(p14)
This dot plot shows the top 10 neighborhoods in New York City with the highest average total PM2.5 emissions per square kilometer. Neighborhoods are ranked from highest to lowest to support visual comparison. Upper West Side has the highest average at 9.20, followed by Upper East Side at 8.25, and Gramercy Park – Murray Hill at 7.25.
Compared to the previous NOx and SO₂ plots, this one shows a much smaller data range. Among the three pollutants, NOx has the highest overall average emissions per neighborhood, followed by SO₂, and then PM2.5.
Time series analysis using the ARIMA (AutoRegressive Integrated Moving Average) model in R is a widely used approach for modeling and forecasting data that varies over time. R offers functions such as arima() and auto.arima() from the forecast package to fit ARIMA models to historical data and generate forecasts based on observed patterns.
In this analysis, the necessary libraries are dplyr and lubridate. The ozone_annual data frame is derived from the AQ_O dataset, where the Year is extracted from the Start_Date column. The data is then grouped by year and summarized to calculate the average ozone levels. The resulting time series spans from 2009 to 2023.
library(dplyr)
library(lubridate)
# Summarize average annual ozone per year to fit ARIMA model
ozone_annual <- AQ_O %>%
# extract year from Start_Date
mutate(Year = year(Start_Date)) %>%
# group by calendar year
group_by(Year) %>%
# take mean ozone (ppb) for that year
summarize(Avg_O3 = mean(Data.Value, na.rm = TRUE),
.groups = "drop") %>%
arrange(Year)
head(ozone_annual)
## # A tibble: 6 × 2
## Year Avg_O3
## <dbl> <dbl>
## 1 2009 24.8
## 2 2010 32.4
## 3 2011 31.8
## 4 2012 32.9
## 5 2013 30.0
## 6 2014 30.5
I will load the required forecast package to create time series object, since it is yearly data, the frequency is 1.
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
# start at the first year, annual frequency = 1
o3_ts <- ts(ozone_annual$Avg_O3,
start = min(ozone_annual$Year),
frequency = 1)
Then I will fit ARIMA model to my time series object (o3_ts) and forecast two years ahead, 2024 and 2025.
# Fit ARIMA model
fit <- auto.arima(o3_ts)
# forecast 2 steps ahead = years 2024 and 2025
fc <- forecast(fit, h = 2)
I will load required library ggrepel, create the forecast dataframe for plotting, and use autoplot() to plot.
library(ggrepel)
# Prepare forecast values for annotation
forecast_years <- seq(max(ozone_annual$Year) + 1, by = 1, length.out = 2)
forecast_df <- data.frame(
Year = forecast_years,
Forecast = as.numeric(fc$mean))
# Basic forecast plot using autoplot
p <- autoplot(fc) +
labs(title = "Annual Average Ozone: Observed + Forecast",
x = "Year", y = "Ozone (ppb)") +
theme_minimal(base_size = 13) +
theme( plot.title = element_text(face = "bold", hjust = 0.5),
axis.title.x = element_text(margin = margin(t = 10)),
axis.title.y = element_text(margin = margin(r = 10)))
# Extract forecast mean for annotation for 2025
mean_ozone_2025 <- round(fc$mean[2], 1)
# Add forecast mean label
p <- p + annotate("text",
x = 2025.25, y = mean_ozone_2025,
label = paste0(mean_ozone_2025),
color = "blue", fontface = "bold", size = 4)
print(p)
The plot shows the trend of average annual ozone levels from 2009 to 2023, along with a forecast for 2025. Based on the ARIMA model, the predicted average ozone level for 2025 is approximately 30.8 ppb. The shaded area represents the forecast uncertainty range.
This forecast is based solely on historical ozone data from 2009 to 2023. It does not take into account other influential factors such as changes in traffic patterns, weather conditions, or air quality regulations. The model also assumes that past trends will continue, which may not always be the case. As a result, the further the forecast looks into the future, the greater the uncertainty. This projection offers a general outlook but should not be interpreted as a precise prediction.