Statement of Purpose

The objective of this report is to develop a model that predicts the amount of CO2 a given car emits. To build and visualize this model, we compare different properties of cars, such as engine size, number of cylinders, transmission, and fuel consumption. This report is intended to give buyers an understanding of how their choice of car model can affect their CO2 emissions and the carbon footprint associated with climate change.

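#loading packages and reading in the data
library(tidyverse)   # read_csv(), ggplot2, and the %>% pipe
library(tidymodels)  # initial_split(), vfold_cv(), recipes, workflows
c02 <- read_csv("~/Desktop/c02data/CO2 Emissions_Canada.csv")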
## Rows: 7385 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Make, Model, Vehicle Class, Transmission, Fuel Type
## dbl (7): Engine Size(L), Cylinders, Fuel Consumption City (L/100 km), Fuel C...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(c02) <- janitor::make_clean_names(names(c02))

c02
## # A tibble: 7,385 × 12
##    make  model   vehic…¹ engin…² cylin…³ trans…⁴ fuel_…⁵ fuel_…⁶ fuel_…⁷ fuel_…⁸
##    <chr> <chr>   <chr>     <dbl>   <dbl> <chr>   <chr>     <dbl>   <dbl>   <dbl>
##  1 ACURA ILX     COMPACT     2         4 AS5     Z           9.9     6.7     8.5
##  2 ACURA ILX     COMPACT     2.4       4 M6      Z          11.2     7.7     9.6
##  3 ACURA ILX HY… COMPACT     1.5       4 AV7     Z           6       5.8     5.9
##  4 ACURA MDX 4WD SUV - …     3.5       6 AS6     Z          12.7     9.1    11.1
##  5 ACURA RDX AWD SUV - …     3.5       6 AS6     Z          12.1     8.7    10.6
##  6 ACURA RLX     MID-SI…     3.5       6 AS6     Z          11.9     7.7    10  
##  7 ACURA TL      MID-SI…     3.5       6 AS6     Z          11.8     8.1    10.1
##  8 ACURA TL AWD  MID-SI…     3.7       6 AS6     Z          12.8     9      11.1
##  9 ACURA TL AWD  MID-SI…     3.7       6 M6      Z          13.4     9.5    11.6
## 10 ACURA TSX     COMPACT     2.4       4 AS5     Z          10.6     7.5     9.2
## # … with 7,375 more rows, 2 more variables: fuel_consumption_comb_mpg <dbl>,
## #   co2_emissions_g_km <dbl>, and abbreviated variable names ¹​vehicle_class,
## #   ²​engine_size_l, ³​cylinders, ⁴​transmission, ⁵​fuel_type,
## #   ⁶​fuel_consumption_city_l_100_km, ⁷​fuel_consumption_hwy_l_100_km,
## #   ⁸​fuel_consumption_comb_l_100_km

In this chunk of code we split the CO2 emissions data into two sets so that we can later check how well our models perform on data they have not seen. The training set (90% of the rows) is used to build and compare models, whereas the test set (the remaining 10%) is held back to evaluate the predictions of our final model.

#splitting the data into test and train
set.seed(823)
cO2_splits <- initial_split(c02, prop = 0.90) 
train <- training(cO2_splits)
test <- testing(cO2_splits)
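
The size of each piece can be worked out directly from the 7,385 rows reported when the file was read. The following is a minimal base R sketch, not a description of what `initial_split()` does internally: it assumes the training size is the row count times 0.90, rounded down, and it will not select the same rows as the call above.

```r
# Illustrative 90/10 split of row indices using base R only.
# Assumption: the training size is floor(0.90 * n); the rows chosen
# here will differ from those picked by initial_split().
set.seed(823)
n_rows <- 7385                      # row count reported by read_csv()
n_train <- floor(0.90 * n_rows)     # rows used for modeling
train_idx <- sample(n_rows, size = n_train)
test_idx <- setdiff(seq_len(n_rows), train_idx)

length(train_idx)  # 6646
length(test_idx)   # 739
```

The 6,646 training rows agree with the nobs value that glance() reports for the fitted models later in the report.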

Below, we use a technique called cross validation. It splits the training data into k folds; each fold takes one turn as the held-out set while the model is trained on the remaining folds. This lets us assess each model on data it was not fit to and check that it is not overfit.

#building cross validation folds
train_folds <- vfold_cv(train)
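
To make the idea of folds concrete, here is a small base R sketch of how rows can be assigned to folds. It assumes 10 folds, which matches the n = 10 resamples shown in the metrics later in the report, and 6,646 training rows. `fold_id` and `fold_sizes` are hypothetical names, and the assignment will not match the folds created by `vfold_cv()`.

```r
# Illustrative 10-fold assignment using base R only.
set.seed(823)
n_train <- 6646   # assumed number of training rows
k <- 10           # assumed number of folds
fold_id <- sample(rep(seq_len(k), length.out = n_train))

# Each fold takes one turn as the held-out (assessment) set,
# while the other k - 1 folds are used to fit the model.
fold_sizes <- vapply(seq_len(k), function(i) sum(fold_id == i), integer(1))
fold_sizes                # every fold holds 664 or 665 rows
n_train - fold_sizes[1]   # rows left for fitting when fold 1 is held out
```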

Executive Summary

Throughout this report, we try to predict the amount of CO2 emitted per vehicle in Canada. CO2 emissions are a pressing problem because of their large contribution to global warming and climate change, and their effects on the environment can last for thousands of years. Vehicles are one of the main sources of these emissions. Because cars are everywhere today, emissions have risen sharply, and that rise has the potential to harm the future of the Earth. This matters because everyone has a part to play in protecting the climate. This data set is therefore helpful for anyone who wants to minimize the CO2 emissions of their vehicle but does not know where to start. This report gives a brief overview of the variables that come into play when discussing vehicles and their emissions, and gives readers a place to start.

To do this, we use the variables available in the data set, such as vehicle class, fuel type, engine size, fuel consumption, make, and number of cylinders. We use these variables to predict emissions so that people who are shopping for a vehicle and are concerned about their carbon footprint can look at this information before making a decision. We built three models. First, we used a simple linear regression model, which uses a single variable as the predictor of emissions. Second, we created a step_poly model, which uses higher order terms to capture any curvature in the relationship and, unlike the simple linear model, includes more than one predictor. Lastly, we used a multiple linear regression model, which is similar to the simple linear regression model except that it includes several predictors. Because this data set has so many variables, we had a lot of room to experiment and look for our best predictors. Each model was evaluated with cross validation folds to guard against overfitting. We separated our data into a training set and a test set so that we could apply our best model to the test set and compare its RMSE with the cross-validated RMSE.

In conclusion, this report produced three models: a simple linear model, a step_poly model, and a multiple regression model. Each model was evaluated with cross validation on the training data set to see how well it predicted emissions. The best model was the multiple linear regression model. It gave the highest r squared and adjusted r squared values and the lowest cross-validated RMSE. Overall, the data show that vehicle class, fuel type, engine size, fuel consumption, make, and number of cylinders are all strongly related to the amount of CO2 a vehicle emits. We see that a larger engine goes with more emissions, and that a larger number of cylinders does as well. Because our best model fits well and its metrics are significant, we conclude that there is a strong association between these variables and the amount of emissions a vehicle gives off. So, when choosing a more eco-friendly car that gives off less CO2, the number of cylinders, the engine size, the fuel type, and the vehicle class are important things to consider.

Introduction

Carbon dioxide (CO2) is a colorless and odorless greenhouse gas. As CO2 emissions increase, they contribute to global warming, which has negative impacts on the Earth. A large share of those emissions comes from vehicles. In this dataset, we look at several variables that contribute to the CO2 a vehicle releases. The goal is to use this data to predict emission rates for vehicles across North America. Unfortunately, the data set only covers vehicles in Canada. It is still valuable, however, because vehicles are sold worldwide and the makes and types of vehicles in this data set are also sold in the United States. The only difference may be in fuel consumption: I assume that in the United States, specifically in places like New York and Boston, people travel more and use more fuel. Beyond that, there are no significant limitations or concerns in applying this data set to our question of which vehicle variables impact CO2 emissions. This information is important for anyone who is looking to reduce their CO2 emissions by changing or buying a vehicle.

Exploratory Data Analysis

Here we begin our exploratory data analysis on the dataset. We are going to look at multiple graphs to give us an idea of what has the most/least impact on our CO2 emissions rates. Doing this will allow us to create models based off of our findings.

Below we are going to get a glimpse of 6 of our rows to see what variables we are working with and what they look like.

train %>%
 head()
## # A tibble: 6 × 12
##   make   model   vehic…¹ engin…² cylin…³ trans…⁴ fuel_…⁵ fuel_…⁶ fuel_…⁷ fuel_…⁸
##   <chr>  <chr>   <chr>     <dbl>   <dbl> <chr>   <chr>     <dbl>   <dbl>   <dbl>
## 1 BMW    M235i   SUBCOM…     3         6 AS8     Z          11.6     7.7     9.9
## 2 GMC    SAVANA… VAN - …     4.8       8 A6      X          21.3    14.3    18.1
## 3 NISSAN MURANO… STATIO…     3.5       6 AV7     X          11.2     8.4     9.9
## 4 FORD   EDGE    SUV - …     3.5       6 AS6     X          13.4     9      11.4
## 5 GMC    YUKON … SUV - …     6.2       8 A6      Z          16.8    11.7    14.5
## 6 GMC    Canyon  PICKUP…     2.5       4 A6      X          12.1     9.2    10.8
## # … with 2 more variables: fuel_consumption_comb_mpg <dbl>,
## #   co2_emissions_g_km <dbl>, and abbreviated variable names ¹​vehicle_class,
## #   ²​engine_size_l, ³​cylinders, ⁴​transmission, ⁵​fuel_type,
## #   ⁶​fuel_consumption_city_l_100_km, ⁷​fuel_consumption_hwy_l_100_km,
## #   ⁸​fuel_consumption_comb_l_100_km

We are about to create a bar plot that shows the number of vehicles at each cylinder count, with the bars filled by make. We want an overall look at our data and at how the vehicles are distributed. A cylinder is the chamber in the engine where fuel is combusted and power is generated. This is important because it can help us see whether the make of a vehicle is related to its number of cylinders. More cylinders generally mean more power can be generated, but they also tend to mean more fuel is burned, so we expect these vehicles to have higher fuel consumption.

train %>%
  ggplot() + 
  geom_bar(aes(y= cylinders, fill = make))+
  labs(title = "Counts of Cylinders Per Vehicle",
       x = "Count", y="Cylinders")

From the bar plot above, we can see that most of the vehicles have four cylinders and the fewest have 16 cylinders. The makes seem to be spread fairly evenly across cylinder counts, which tells us that no make is limited to a single cylinder count. This lets buyers know that if they are looking for a specific number of cylinders, the make should not restrict them, because there should be options in every cylinder category.

We are about to create a scatter plot of fuel consumption (in the city) against CO2 emissions. The fuel consumption rates are in liters per 100 kilometers (L/100km) and the CO2 emissions are in grams per kilometer (g/km). The points will be colored by the number of cylinders each vehicle has. This is beneficial because it lets us see how much fuel a vehicle uses and how that relates to its CO2 emissions.

train %>%
  ggplot() +
  geom_point(aes(x= fuel_consumption_city_l_100_km, y= co2_emissions_g_km, color = cylinders)) +
  labs(title = "Fuel Consumption (City) and CO2 Emissions", y="CO2 Emissions (g/km)", x= "Fuel Consumption in City Roads (L/100km)")

From the scatter plot above, we can see a strong positive relationship: the lower the fuel consumption rate (L/100km), the lower the CO2 emissions. This makes sense because a vehicle that does not use much fuel is expected to have relatively low emissions as well. One interesting thing to note is that the lighter blue marks the higher cylinder counts, which we only seem to see at the higher fuel consumption and emission values.

We are about to create a scatter plot of fuel consumption (on the highway) against CO2 emissions. The fuel consumption rates are in liters per 100 kilometers (L/100km) and the CO2 emissions are in grams per kilometer (g/km). The points will be colored by the number of cylinders each vehicle has. This is beneficial because it lets us see how much fuel a vehicle uses and how that relates to its CO2 emissions. It will also let us see whether there is any difference between the previous graph of city rates and this new graph of highway rates.

train %>%
  ggplot() +
  geom_point(aes(x= fuel_consumption_hwy_l_100_km, y= co2_emissions_g_km, color = cylinders)) +
  labs(title = "Fuel Consumption (Highway) and CO2 Emissions", y="CO2 Emissions (g/km)", x= "Fuel Consumption on Highway (L/100km)")

From the scatter plot above, we can see more variance in CO2 emissions, but the plot still looks quite similar to the previous one. Once again, the lower the fuel consumption rate (L/100km), the lower the CO2 emissions. We also see higher cylinder counts going along with higher CO2 emissions. One interesting thing to note is that some vehicles with fewer cylinders and higher fuel consumption have lower CO2 emissions than vehicles with a high number of cylinders. This is important because it shows that the number of cylinders a vehicle has can matter when considering fuel consumption as well as CO2 emissions on the highway.

We are about to create a scatter plot of engine size against CO2 emissions. The engine burns fuel and converts it into mechanical power that moves the car. We color the points by the type of transmission each car has to see whether that plays a role as well. This is important because the bigger the engine, the more fuel it can burn and the more power it gives off, which in turn can increase CO2 emissions.

train %>%
  ggplot() +
  geom_point(aes(x=engine_size_l, y = co2_emissions_g_km,
             color=transmission)) +
  labs(title = "Engine Size and CO2 values", x = "Engine Size (L)", y = "CO2 Emissions (g/km)")

Looking at the scatter plot above, we can see a correlation between a vehicle's engine size and its CO2 emissions. A larger engine size is associated with higher CO2 emissions. The type of transmission does not seem to play a clear role in either engine size or CO2 emissions. This is important for readers who want to reduce their CO2 emissions by getting a vehicle with a smaller engine.

We are about to look at another plot of engine size against CO2 emissions; this time, we color the points by fuel type instead of transmission type. There are four categories of fuel type: D for diesel, E for ethanol (E85), X for regular gasoline, and Z for premium gasoline. This is important because it will tell individuals which type of fuel goes with higher or lower CO2 emissions for their vehicle.

train %>%
  ggplot() +
  geom_point(aes(x=engine_size_l, y = co2_emissions_g_km,
             color=fuel_type)) +
  labs(title = "Engine Size and CO2 values", x = "Engine Size (L)", y = "CO2 Emissions (g/km)")

In the plot above, we have the same scatter plot as before, so we only look at the differences in color. We can see that Z, premium gasoline, is scattered everywhere on this graph but is concentrated at the larger engine sizes and higher CO2 emissions. These results were quite surprising because I assumed that diesel would do the worst and account for most of the high CO2 emissions. However, diesel did not seem to exceed 300 g/km. This could also mean that not enough diesel vehicles were included in the data.

We are about to look at fuel consumption in the city and CO2 emissions once again, this time colored by fuel type. So unlike the previous plot, which had engine size on the x axis, this one uses fuel consumption (city). This is important because it tells us how much fuel vehicles of each fuel type consume on city roads and how that relates to their overall emission rates.

train %>%
  ggplot() +
  geom_point(aes(x= fuel_consumption_city_l_100_km, y= co2_emissions_g_km, color = fuel_type)) +
  labs(title = "Fuel Consumption (City) and CO2 Emissions", y="CO2 Emissions (g/km)", x= "Fuel Consumption in City Roads (L/100km)")

From the scatter plot above, we can see quite interesting results. The ethanol vehicles show the lowest fuel consumption, whereas the diesel vehicles show the highest. As for CO2 emissions, Z (premium gasoline) and X (regular gasoline) play the largest role. As fuel consumption on city roads increases, CO2 emissions also increase.

Model Construction

We are about to create a simple linear regression model with fuel type as the single predictor and examine how much it contributes to predicting the CO2 emissions of each vehicle. The simple regression model lets us see how strong the connection between these two variables is. I chose fuel type because our exploratory analysis showed that some fuel types, specifically Z, go with higher CO2 values.

We are going to fit a model that predicts CO2 emissions using fuel type as the predictor. We set an engine for the linear regression model, build a recipe, and pack the model and the recipe into a workflow. We then evaluate the workflow on the cross validation folds and fit it to our training data.

#build and fit model for CO2 using fuel type
co2_reg <- linear_reg() %>%
  set_engine("lm")

co2_rec <- recipe(co2_emissions_g_km ~ fuel_type, data=train)
co2_wf <- workflow() %>%
  add_model(co2_reg) %>%
  add_recipe(co2_rec)

lin_results <- co2_wf %>%
  fit_resamples(train_folds)
lin_results %>%
  collect_metrics()
## # A tibble: 2 × 6
##   .metric .estimator    mean     n std_err .config             
##   <chr>   <chr>        <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   56.5       10 0.690   Preprocessor1_Model1
## 2 rsq     standard    0.0747    10 0.00760 Preprocessor1_Model1
co2_fit <- co2_wf %>%
  fit(train)

co2_fit %>%
  glance()
## # A tibble: 1 × 12
##   r.squared adj.r.…¹ sigma stati…²   p.value    df  logLik    AIC    BIC devia…³
##       <dbl>    <dbl> <dbl>   <dbl>     <dbl> <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
## 1    0.0733   0.0729  56.5    175. 2.93e-109     3 -36240. 72491. 72525.  2.12e7
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## #   variable names ¹​adj.r.squared, ²​statistic, ³​deviance

From the output above, we can see several statistics for the model, such as the r squared, adjusted r squared, sigma (the residual standard error, which is close to the RMSE), and p-value. These statistics tell us how good our simple linear regression model is. The r squared value tells us what proportion of the variation in CO2 emissions is explained by our predictor, fuel type. The r squared of about 0.07 is low and means fuel type is far from the only variable affecting CO2 emissions. However, r squared never decreases as we add variables to a model, regardless of what the variables are, which is why we also use adjusted r squared. Adjusted r squared penalizes extra predictors, so it tells us whether adding predictors actually improves the model. The adjusted r squared here is about 7%, which tells us that only 7% of the variation in CO2 emissions is explained by this model. That number is very low. Because this model has a single predictor variable, the r squared and adjusted r squared are nearly identical. The sigma value is about 56.5, which matches the cross-validated RMSE of 56.5. It measures the typical size of the prediction error, meaning that predictions of CO2 emissions from this model are typically off by about 56.5 g/km. The p-value is extremely low, telling us fuel type is a statistically significant predictor; however, the other values show that we need additional variables in order to build a good model for predicting CO2 emissions.
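
The link between r squared and adjusted r squared can be checked by hand. The sketch below uses the standard formula `1 - (1 - R2) * (n - 1) / (n - p - 1)`, where n is the number of training rows and p is the number of model terms. The helper name `adj_r_squared` is hypothetical, and n = 6646 is an assumption based on taking 90% of the 7,385 rows; the r squared (0.0733) and df (3) values come from the glance() output above.

```r
# Adjusted R-squared from R-squared, sample size n, and number of terms p.
adj_r_squared <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values from the glance() output; n = 6646 is an assumed training size.
adj_fuel <- adj_r_squared(r2 = 0.0733, n = 6646, p = 3)
round(adj_fuel, 4)   # 0.0729
```

With only three terms and thousands of rows the penalty is tiny, which is why the two values are nearly the same for this model.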

We are about to use step_poly to build a model that includes curvature. This lets us include higher order terms; in this case we use engine size and combined fuel consumption. We use a degree of 3 and only two variables to reduce the risk of overfitting. Using more variables and higher degrees makes a model more likely to be overfit, meaning that it fits the training data set too closely and predicts new data poorly.

poly_reg_spec <- linear_reg() %>%
  set_engine("lm")

poly_reg_rec <- recipe(co2_emissions_g_km ~ engine_size_l + fuel_consumption_comb_mpg, data=train) %>%
  step_poly(engine_size_l, degree = 3, options = list(raw = TRUE)) %>%
  step_poly(fuel_consumption_comb_mpg, degree = 3, options = list(raw = TRUE))

poly_reg_wf <- workflow() %>%
  add_model(poly_reg_spec) %>%
  add_recipe(poly_reg_rec)

poly_results <- poly_reg_wf %>%
  fit_resamples(train_folds)
poly_results %>%
  collect_metrics()
## # A tibble: 2 × 6
##   .metric .estimator   mean     n std_err .config             
##   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   18.1      10 0.300   Preprocessor1_Model1
## 2 rsq     standard    0.905    10 0.00226 Preprocessor1_Model1
poly_reg_fit <- poly_reg_wf %>%
  fit(train)

poly_reg_fit %>%
  glance() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
r.squared  adj.r.squared  sigma     statistic  p.value  df  logLik     AIC       BIC      deviance  df.residual  nobs
0.904841   0.904755       18.11075  10521.41   0        6   -28676.94  57369.88  57424.3  2177588   6639         6646

Above, we have the statistics for how our step_poly model did. We can see dramatic changes from our simple linear model to this new model. The r squared and adjusted r squared are both about 90%, which is very good. This tells us that 90% of the variation in CO2 emissions is explained by this model. Our sigma value has decreased, which is also a good sign. It tells us that our predicted CO2 emissions are typically off by about 18 g/km, which is a much smaller error. Lastly, the p-value is displayed as 0, meaning it is too small to show, which tells us the model is statistically significant.
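
To show what the higher order terms look like, here is a small base R sketch. With raw = TRUE, a degree 3 polynomial expansion of a variable is just the variable, its square, and its cube. The engine sizes below are taken from the first rows of the data printed earlier. Using base `poly()` here is a choice made for illustration, on the understanding that `step_poly()` passes its `options` on to `stats::poly()`; the column names are hypothetical.

```r
# Raw (non-orthogonal) polynomial terms for a few engine sizes (L).
engine_size_l <- c(2.0, 2.4, 1.5, 3.5)
poly_terms <- poly(engine_size_l, degree = 3, raw = TRUE)

# The three columns are x, x^2, and x^3.
colnames(poly_terms) <- c("x", "x_squared", "x_cubed")
round(unclass(poly_terms)[, 1:3], 3)
```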

Below, we fit a multiple linear regression model, which lets us use several variables at once to predict CO2 emissions. In this model, we use vehicle class, fuel type, engine size, and the combined fuel consumption rating (55% city, 45% highway). Compared with the simple linear model, this gives us three additional variables to help us predict emission rates more accurately.
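
The 55%/45% weighting can be checked against the first rows of the data printed at the top of the report. This is a small sketch in base R; `city`, `hwy`, and `comb_reported` are hypothetical names holding values copied from rows 1 to 4 of that printout, all in L/100 km.

```r
# City and highway ratings (L/100 km) from rows 1-4 of the printed data.
city <- c(9.9, 11.2, 6.0, 12.7)
hwy <- c(6.7, 7.7, 5.8, 9.1)
comb_reported <- c(8.5, 9.6, 5.9, 11.1)

# Combined rating as a weighted average: 55% city, 45% highway.
comb_calc <- round(0.55 * city + 0.45 * hwy, 1)
comb_calc                              # 8.5 9.6 5.9 11.1
all.equal(comb_calc, comb_reported)    # TRUE
```

The model code uses fuel_consumption_comb_mpg, which is the same combined rating expressed in miles per gallon.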

#build and fit model for CO2 using multiple variables
mult_reg <- linear_reg() %>%
  set_engine("lm")

mult_rec <- recipe(co2_emissions_g_km ~ fuel_type + engine_size_l + vehicle_class + fuel_consumption_comb_mpg, data=train)

mult_wf <- workflow() %>%
  add_model(mult_reg) %>%
  add_recipe(mult_rec)

mult_results <- mult_wf %>%
  fit_resamples(train_folds)
mult_results %>%
  collect_metrics()
## # A tibble: 2 × 6
##   .metric .estimator   mean     n std_err .config             
##   <chr>   <chr>       <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   14.8      10 0.365   Preprocessor1_Model1
## 2 rsq     standard    0.937    10 0.00212 Preprocessor1_Model1
mult_fit <- mult_wf %>%
  fit(train)

mult_fit %>%
  glance()
## # A tibble: 1 × 12
##   r.squared adj.r.sq…¹ sigma stati…² p.value    df  logLik    AIC    BIC devia…³
##       <dbl>      <dbl> <dbl>   <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
## 1     0.937      0.937  14.8   4925.       0    20 -27308. 54659. 54809.  1.44e6
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## #   variable names ¹​adj.r.squared, ²​statistic, ³​deviance

From the output above, we can see the statistics for our multiple regression model: r squared, adjusted r squared, sigma, and the p-value. Comparing this model to our simple linear regression model, we can already see a huge difference in performance. The r squared and adjusted r squared both went from about 7% to about 93%. This model fits better than both previous models. We already knew that r squared would increase as we added predictors; however, the adjusted r squared is also about 93%, which tells us that this model did a good job of predicting CO2 emissions with the variables that were chosen. The sigma value is 14.8. This is encouraging because it means that when our multiple regression model predicts CO2 emissions, it is typically off by only about 15 g/km. Lastly, the p-value is displayed as 0, meaning that the model is statistically significant and there is a clear relationship between the predictors and CO2 emissions. Overall, this is the best model we built on this dataset for predicting CO2 emissions by vehicle.

We are about to use our test data to evaluate the best model we made, the multiple linear regression model. This is the final check of how well our model predicts CO2 emissions on data it has not seen.

mult_fit %>%
  augment(test) %>%
  rmse(co2_emissions_g_km, .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        13.8

To see how our model did on the test set, we compare RMSE values. RMSE is the root mean squared error, which measures the typical difference between the values predicted by a model and the actual values. The cross-validated RMSE was 56.5 for our simple linear regression model, 18.1 for our step_poly model, and 14.8 for our multiple regression model. When we ran the multiple regression model (our best model) on the test data set, we got an RMSE of 13.8. This is very close to the cross-validated value of 14.8, which tells us that the model generalizes to new data and is not overfit. Whether an RMSE is 'good' depends on the scale of the data. Here, an error of about 14 g/km is small compared with the emission values in our plots, which run into the hundreds of g/km, so we can confidently say it is a good value. A lower RMSE is better because it represents the standard deviation of what the model left unexplained in the test set.
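
For readers who want to see the calculation, RMSE can be computed by hand in a few lines of base R. The emission values below are made up for illustration and are not taken from the data set; `rmse_manual` is a hypothetical helper, not the yardstick function used above.

```r
# RMSE: square the errors, average them, then take the square root.
rmse_manual <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# Hypothetical actual and predicted CO2 emissions (g/km).
actual <- c(200, 250, 300)
predicted <- c(210, 240, 320)

rmse_value <- rmse_manual(actual, predicted)
round(rmse_value, 1)   # 14.1
```

Because the errors are squared before averaging, the single 20 g/km miss pulls the RMSE above the average absolute error of about 13.3 g/km.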

Model Interpretation and Inference

Looking at our exploratory analysis, we can make quite a few inferences. We consistently saw that a higher fuel consumption rate goes with a higher CO2 emissions rate. A few things come into play here: the number of cylinders, the type of vehicle, and the engine size. We saw that vehicles with higher CO2 emissions tend to have more cylinders and larger engines. The cylinders are where fuel is burned to generate power, and the engine converts the heat from the burning fuel into motion so that the car can run. From these relationships, we infer that these two aspects are crucial when discussing CO2 emissions. This is important for individuals who want to buy a car that produces less CO2.

With the simple linear regression model, we used a single variable as the sole predictor of CO2 emissions. The predictor in this instance was fuel type. Because there are four fuel types, the model produces only one predicted emission value per fuel type, which is very inaccurate because emissions vary greatly and depend on multiple factors. We quickly found that this model was lacking because of the low r squared and adjusted r squared values of 7%. It showed us that only 7% of the variation was explained by the first model. We can confidently say that fuel type is not the only variable that plays a role in CO2 emissions. Second, we fit a step_poly model to see whether our data has any curvature. This let us add higher order polynomial terms for two variables, engine size and combined fuel consumption. With this model we immediately saw better results than with our first model. Adding more variables helped in predicting CO2 emission rates. Our r squared and adjusted r squared increased all the way to 90%, which is very good. This suggested that there was curvature in the relationship with CO2 emissions. The last model we fit was a multiple regression model. The goal of this model was to add multiple variables and see whether they worked together to predict CO2 emissions. This gave us an even better model: our r squared and adjusted r squared went up by another 3 percentage points. The step_poly model was very good, and this model was even better. Its r squared shows us that 93% of the variation is explained by this model, which is a really good result. To build this model we used vehicle class, fuel type, engine size, and the combined fuel consumption rating (55% city, 45% highway) to predict emissions.

Finally, we applied our best model, the multiple regression model, to the test data set, which had been held out until this point. We used the model to predict the CO2 emissions column and found that the test RMSE (13.8) was within about 1 g/km of the cross-validated RMSE for the same model (14.8). This was a very good sign because it means our model worked well on the test data set.

Conclusion

In conclusion, we came up with significant findings on predicting CO2 emissions from vehicles. We fit several models and found the multiple linear regression model to be the best one. We cross validated this model, applied it to our previously unseen test data set, and obtained a test RMSE close to the cross-validated RMSE. From our results we propose that the number of cylinders, fuel type, and engine size play a crucial role in the amount of CO2 a vehicle gives off. We base this on the positive relationships we saw between cylinders, engine size, and emission rates, and on the differences between fuel types. The most important finding is that the number of cylinders and the engine size, taken together, tell us a lot about what the emissions rate is going to look like. Our multiple regression model gave us an r squared and adjusted r squared of 93%, telling us that the model accounts for 93% of the variation, which is a substantial amount. Because of these metrics, we are confident in saying that this model does a good job of predicting the amount of CO2 that a vehicle gives off. This report is intended to help individuals who are looking to reduce the CO2 emissions of their vehicle. They can look at this information on engine size and number of cylinders to determine what works best for them and roughly how much CO2 they will emit by driving their car.

References

https://www.kaggle.com/datasets/debajyotipodder/co2-emission-by-vehicles