Abstract

Squirrels are mammals in the family Sciuridae, a family of small to medium-sized rodents. They are an important part of our ecosystem because of their impact on nature and gardening. They tend to bury seeds throughout the year, and the seeds that do not get retrieved can grow into trees, which are crucial to life. Squirrels in New York City are just as important. Because the city consists primarily of buildings and public transportation, the squirrels' job of regulating and preserving nature is even more crucial. This analysis looks at the squirrels in New York City, focusing on where they are (longitude and latitude), their fur color, their age, and the time they are spotted.

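# load the packages used below: tidyverse (readr, dplyr, ggplot2),
# rsample (initial_split, training, testing), and patchwork (combining plots)
library(tidyverse)
library(rsample)
library(patchwork)
# reading in the data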
nyc_squirrels <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-29/nyc_squirrels.csv")
## Rows: 3023 Columns: 36
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): unique_squirrel_id, hectare, shift, age, primary_fur_color, highli...
## dbl  (9): long, lat, date, hectare_squirrel_number, zip_codes, community_dis...
## lgl (13): running, chasing, climbing, eating, foraging, kuks, quaas, moans, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# fix the random seed so the 80/20 train/test split is reproducible
set.seed(222)
squirrels_split <- initial_split(nyc_squirrels, prop = 0.8)
train <- training(squirrels_split)
test <- testing(squirrels_split)

Exploratory Analysis

Hypotheses:

There will be a higher population of squirrels in Central Park, and a nonexistent population in Times Square.

The majority of squirrels will be gray and seen in the AM.

Most squirrels will either be running or climbing trees.

Below we look at the first six rows of the training data to get an idea of what variables we are working with. The whole dataset has a total of 3023 observations and 36 variables. An important thing to note is that many columns contain NA values. Because this was an observational study, it might have been hard to record exact details on every squirrel and its frequent movements, especially when many squirrels were present.
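To see how widespread the NA values are, one option is to count them per column. The sketch below is only an illustration: `toy_na` is a small made-up data frame standing in for `train`, and `count_missing()` is a hypothetical helper name, not part of the original analysis.

```r
# Made-up stand-in for `train`, for illustration only (not real observations).
toy_na <- data.frame(
  age                 = c("Adult", NA, "Juvenile", "Adult"),
  primary_fur_color   = c("Gray", "Gray", NA, "Cinnamon"),
  highlight_fur_color = c(NA, "White", NA, NA),
  stringsAsFactors    = FALSE
)

# Hypothetical helper: number of NA values in each column,
# sorted from the most to the least missing.
count_missing <- function(df) {
  n_missing <- colSums(is.na(df))
  sort(n_missing, decreasing = TRUE)
}

na_counts <- count_missing(toy_na)
print(na_counts)
```

Calling `count_missing(train)` on the real training data should give the same kind of summary for all 36 columns.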

train %>%
  head()
## # A tibble: 6 × 36
##    long   lat uniqu…¹ hectare shift   date hecta…² age   prima…³ highl…⁴ combi…⁵
##   <dbl> <dbl> <chr>   <chr>   <chr>  <dbl>   <dbl> <chr> <chr>   <chr>   <chr>  
## 1 -74.0  40.8 10A-AM… 10A     AM    1.01e7       3 Juve… Cinnam… <NA>    Cinnam…
## 2 -74.0  40.8 36I-PM… 36I     PM    1.01e7       1 Adult Gray    Cinnam… Gray+C…
## 3 -74.0  40.8 34A-PM… 34A     PM    1.01e7       1 Adult Gray    Cinnam… Gray+C…
## 4 -74.0  40.8 37H-AM… 37H     AM    1.02e7       2 Adult Gray    <NA>    Gray+  
## 5 -74.0  40.8 19E-PM… 19E     PM    1.02e7       5 <NA>  Gray    White   Gray+W…
## 6 -74.0  40.8 7H-AM-… 07H     AM    1.01e7      14 Adult Cinnam… Gray    Cinnam…
## # … with 25 more variables: color_notes <chr>, location <chr>,
## #   above_ground_sighter_measurement <chr>, specific_location <chr>,
## #   running <lgl>, chasing <lgl>, climbing <lgl>, eating <lgl>, foraging <lgl>,
## #   other_activities <chr>, kuks <lgl>, quaas <lgl>, moans <lgl>,
## #   tail_flags <lgl>, tail_twitches <lgl>, approaches <lgl>, indifferent <lgl>,
## #   runs_from <lgl>, other_interactions <chr>, lat_long <chr>, zip_codes <dbl>,
## #   community_districts <dbl>, borough_boundaries <dbl>, …

Next we look at the primary fur color of each squirrel and compare the colors by number of observations. This is important because different colors could signify different species, which could potentially behave differently from other squirrels.

train %>%
  ggplot() +
  geom_bar(aes(y = primary_fur_color), fill = "maroon") +
  labs(title = "The Color of Squirrels in NYC", x = "Count", y = "Fur Color")

From the plot above, we can see that out of the roughly 2,400 observations in our training data set (80% of 3023), almost 2000 are gray. In second place are cinnamon squirrels (at about 400), and black squirrels fall in last place. There is also an NA group whose color was unfortunately not recorded.
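Reading counts off a bar chart is approximate, so it can help to tabulate them directly. This is a minimal sketch, assuming a made-up vector `toy_colors` in place of `train$primary_fur_color`; the helper name `color_summary()` is hypothetical.

```r
# Made-up fur colors standing in for train$primary_fur_color (illustration only).
toy_colors <- c("Gray", "Gray", "Gray", "Cinnamon", "Black", NA)

# Hypothetical helper: count each value (keeping NA as its own group)
# and report its share of all observations.
color_summary <- function(x) {
  counts <- table(x, useNA = "ifany")
  data.frame(
    value = names(counts),
    n     = as.integer(counts),
    share = round(as.integer(counts) / length(x), 3),
    stringsAsFactors = FALSE
  )
}

fur <- color_summary(toy_colors)
print(fur)
```

The same helper could be applied to `train$shift` to get exact AM and PM counts.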

Next we create a scatter plot of the latitude and longitude of each squirrel observed. This is important because it will help support or reject our hypothesis of a large group of squirrels in Central Park and close to none in Times Square.

train %>%
  ggplot() +
  geom_point(aes(x=lat, y=long), color = "maroon") +
  labs(title= "Location of NYC Squirrels", x= "Latitude", y="Longitude")

From the scatter plot above, we can see that the squirrels were observed in an area running from about 40.765 N, 73.98 W to 40.80 N, 73.95 W. The coordinates of Central Park are 40.7826 N, 73.9656 W, which fall inside this range, so the sightings are in and around the park rather than spread across the city. Because the plot shows no observations from anywhere else, it cannot tell us whether Central Park has a higher population than other parts of New York, so this part of our hypothesis is neither confirmed nor rejected. The coordinates of Times Square do not fall on this chart, so we cannot answer that part of our hypothesis either.
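Whether a place lies inside the observed area can be checked numerically instead of by eye. The sketch below uses a made-up data frame `toy_loc` in place of `train` and a hypothetical helper `in_bbox()`. Western longitudes are negative in the data, and the Times Square coordinates used here are an approximation that does not come from the dataset.

```r
# Made-up coordinates standing in for train$lat and train$long (illustration only).
toy_loc <- data.frame(
  lat  = c(40.768, 40.781, 40.799),
  long = c(-73.975, -73.966, -73.952)
)

# Hypothetical helper: TRUE if the point (lat, long) lies inside the
# bounding box spanned by the observed latitudes and longitudes.
in_bbox <- function(df, lat, long) {
  lat_rng  <- range(df$lat,  na.rm = TRUE)
  long_rng <- range(df$long, na.rm = TRUE)
  lat >= lat_rng[1] && lat <= lat_rng[2] &&
    long >= long_rng[1] && long <= long_rng[2]
}

# Central Park coordinates as quoted in the text.
central_park_inside <- in_bbox(toy_loc, lat = 40.7826, long = -73.9656)
# Approximate Times Square coordinates (an assumption, not from the data).
times_square_inside <- in_bbox(toy_loc, lat = 40.7580, long = -73.9855)

print(c(central_park = central_park_inside, times_square = times_square_inside))
```

A bounding box is a coarse test: a point can be inside the box and still have no sightings nearby, so it only shows whether a location is covered by the plotted range.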

Below, we check whether age is related to the primary fur color of our squirrels. Earlier, we saw that the majority of our squirrels were gray; is that because they are all older or younger, or does age not play a role in fur color?

train %>%
  ggplot() +
  geom_bar(aes(x = age, fill = primary_fur_color)) +
  labs(title = "Age and Fur Color", x = "Age", y = "Count")

From the bar plot above, we can see that the majority of our squirrels are adults. Primary fur color does not seem to be associated with age in these observations, although this could be because adults make up most of the squirrels observed. There are cinnamon and black squirrels among them, though, so this may just be a coincidence as well. Further research would be needed to draw a direct conclusion about age and fur color.
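Because adults dominate the raw counts, comparing the share of each fur color within each age group may be easier than comparing bar heights. This is a minimal sketch with a made-up data frame `toy_age` in place of `train`; `color_share_by_age()` is a hypothetical helper name.

```r
# Made-up ages and fur colors standing in for `train` (illustration only).
toy_age <- data.frame(
  age               = c("Adult", "Adult", "Adult", "Adult", "Juvenile", "Juvenile"),
  primary_fur_color = c("Gray", "Gray", "Gray", "Cinnamon", "Gray", "Cinnamon"),
  stringsAsFactors  = FALSE
)

# Hypothetical helper: cross-tabulate age by fur color, then convert each
# row to proportions so every age group sums to 1.
# Note: table() drops rows with NA in either column by default.
color_share_by_age <- function(df) {
  counts <- table(df$age, df$primary_fur_color, dnn = c("age", "fur"))
  prop.table(counts, margin = 1)
}

shares <- color_share_by_age(toy_age)
print(round(shares, 2))
```

If the rows look similar across age groups in the real data, that would be consistent with fur color not depending on age, though it would not prove it.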

Below, we are going to compare the different behaviors of the squirrels observed. The behaviors that are examined in these bar plots are running, climbing, and foraging.

p1 <- train %>%
  ggplot() +
  geom_bar(aes(x = age, fill = running), show.legend = FALSE) +
  labs(title = "Running", x = "Age", y = "Count")

p2 <- train %>%
  ggplot() +
  geom_bar(aes(x = age, fill = climbing), show.legend = FALSE) +
  labs(title = "Climbing", x = "Age", y = "Count")

p3 <- train %>%
  ggplot() +
  geom_bar(aes(x = age, fill = foraging)) +
  labs(title = "Foraging", x = "Age", y = "Count", fill = "Behavior")

(p1 + p2) / p3

From the three bar plots above, we can see that foraging was the most common of the three behaviors. My hypothesis was that climbing and running would be the most common, but that was incorrect. Foraging, which is searching for food and other resources, was seen in roughly half of the squirrels observed.
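The three behaviors can also be compared as plain proportions, since each behavior column is a TRUE/FALSE flag. The sketch below assumes a made-up data frame `toy_behavior` in place of `train`; `behavior_share()` is a hypothetical helper name.

```r
# Made-up behavior flags standing in for `train` (illustration only).
toy_behavior <- data.frame(
  running  = c(TRUE, FALSE, FALSE, FALSE),
  climbing = c(FALSE, TRUE, FALSE, FALSE),
  foraging = c(FALSE, TRUE, TRUE, FALSE)
)

# Hypothetical helper: share of observations where each behavior is TRUE,
# sorted from most to least common. A squirrel can show several behaviors,
# so the shares do not need to sum to 1.
behavior_share <- function(df, cols = c("running", "climbing", "foraging")) {
  sort(colMeans(df[, cols, drop = FALSE], na.rm = TRUE), decreasing = TRUE)
}

behavior_shares <- behavior_share(toy_behavior)
print(behavior_shares)
```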

Below we look at what time of day the majority of our squirrels were seen. This is important because it will answer our hypothesis that squirrels are seen more frequently in the AM.

train %>%
  ggplot() +
  geom_bar(aes(y = shift), fill = "maroon") +
  labs(title = "The Time Squirrels Are Seen in NYC", x = "Count", y = "Time")

From our bar plot, we can see that the majority of our squirrels in NYC were seen in the PM hours. This is not what I expected. I figured that since I always see squirrels in the morning, that was their prime time to be out. However, this plot tells a different story. A little more than a thousand squirrels were seen in the AM, but more than half of the squirrels were seen in the PM.

Conclusion

Looking at our training data, we can revisit the hypotheses that were presented earlier. We assumed that (1) there would be a higher population of squirrels in Central Park and a nonexistent population in Times Square, (2) the majority of squirrels would be gray and seen in the AM, and (3) most squirrels would be either running or climbing trees. Not all of these turned out to be true. The plotted locations did not let us compare Central Park with Times Square, so we could not draw any direct conclusion about the first hypothesis. This surprised me, because Central Park is full of trees and trails, which is why I expected the data to clearly show more squirrels there. The majority of squirrels were gray, but they were seen in the PM rather than the AM, which conflicted with our hypothesis. Lastly, foraging was more common among the observed squirrels than climbing or running. This also surprised me, and I had not taken into consideration how far away the observers were from the squirrels. Further observation could look into why these squirrels weren't necessarily running away. Overall, this data showed us the behavior, age, color, and other traits of squirrels in New York City. Although we did not come up with any definite conclusions, that just means there is room for new observations and new material. If we looked at other locations in the state of New York, would we see different results?