1.Which programming language are we going to use throughout this course?
Answer : Python
The Basics
What is Machine Learning?
1. Which of the statements are correct?
Answer :
-Machine Learning uses computer power to build models and predict future results.
-Machine Learning takes data and turns it into insights.
2.Which of the following is used for reading data and data manipulation?
Answer : Pandas
3. Which of the following are examples of classification problems?
Answer :
-Predicting if a credit card charge is fraudulent
-Determining if an image is of a car, bus or bike
The Basics
Statistics Review
1. The following list includes the number of children that each of 5 different families have. Based on this information, what are the mean and median number of children per family?
0, 1, 1, 2, 6
Answer : mean 2 / median 1
2. Say we have a sample of 11 families and the number of kids per family. Each number in the list represents the number of kids in the family. Thus there is 1 family with 0 kids, 5 families with 1 kid, etc.
0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 6
The 25th percentile is 1.
The 50th percentile (aka median) is 1.
The 75th percentile is 2.
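The two lists above can be checked with numpy. This is a minimal sketch; the variable names `small_sample` and `sample` are chosen here for illustration and are not part of the course material.

```python
import numpy as np

# The five families from question 1.
small_sample = np.array([0, 1, 1, 2, 6])
mean_children = np.mean(small_sample)
median_children = np.median(small_sample)
print(mean_children, median_children)

# The eleven families from question 2.
sample = np.array([0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 6])
p25, p50, p75 = np.percentile(sample, [25, 50, 75])
print(p25, p50, p75)
```

The large value 6 pulls the mean of the first list up to 2 while the median stays at 1.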
3. Which of the following statements are correct?
Answer :
(1) the standard deviation and variance are measures of how dispersed the data is.
(2) the standard deviation is the square root of the variance.
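Both statements can be illustrated with numpy. The sketch below reuses the list of children per family as example data and assumes the population versions of the formulas, which is numpy's default (ddof=0).

```python
import numpy as np

data = np.array([0, 1, 1, 2, 6])

variance = np.var(data)  # mean of the squared distances from the mean
std_dev = np.std(data)   # square root of the variance
print(variance, std_dev)
print(np.isclose(std_dev, np.sqrt(variance)))
```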
4.Complete the code for calculating and printing the 60th percentile of the data array.
Answer :
import numpy as np
print(np.percentile(data, 60))
The Basics
Reading Data with Pandas
1.What do we call the Pandas data object?
Answer : DataFrame
2.Fill in the blanks to complete the code to read in a file called mydata.csv as a pandas DataFrame and print a table of the first 5 rows.
Answer :
import pandas as pd
df = pd.read_csv('mydata.csv')
print(df.head())
3.Based on our dataset, what is the maximum Pclass and the median age?
Answer : max = 3 / median = 28
The Basics
Manipulating Data with Pandas
1. Based on the following code, what datatype is the variable age_col? age_col = df['Age']
Answer : Pandas Series
3.Write the code to create the “First Class” column, which is True if the passenger is in Pclass 1 and False otherwise.
Answer : df['First Class'] = df['Pclass'] == 1 (blanks: First Class, Pclass, 1)
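The two answers in this section can be tried on a small DataFrame. The rows below are hypothetical stand-ins for the Titanic data, used only so that the sketch runs on its own.

```python
import pandas as pd

# Hypothetical rows standing in for the Titanic data.
df = pd.DataFrame({
    'Pclass': [3, 1, 2, 1],
    'Age': [22.0, 38.0, 26.0, 35.0],
})

age_col = df['Age']                    # selecting one column gives a pandas Series
df['First Class'] = df['Pclass'] == 1  # elementwise comparison gives True/False
print(type(age_col).__name__)
print(df)
```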
The Basics
Numpy Basics
1. The data object in numpy is called a(n):
Answer : Array
2. Write the code to get a numpy array of the values in the 'Age' column from the DataFrame df.
Answer : df['Age'].values (blanks: Age, values)
3. The following code returns what type of numpy array: df[['Survived', 'Pclass']].values
Answer : 2 dimensional
4. Recall that we have 887 datapoints in our Titanic dataset. What is the output of the following code?
arr = df[['Survived', 'Pclass']].values
print(arr.shape)
Answer : (887, 2)
The Basics
More with Numpy Arrays
1. We have an array of some of our data: arr = df[['Pclass', 'Fare', 'Age']].values How would you select the Fares of all the passengers?
Answer : arr[:, 1]
2. We continue with our array of some of our data: arr = df[['Pclass', 'Fare', 'Age']].values Which of the following would complete the code to subset the array to get just the passengers in Pclass1?
Answer : arr[:,0]
3.Which of the following is correct for counting the number of passengers in Pclass 1? We have the following array definition: arr = df[['Pclass', 'Fare', 'Age']].values
Answer : (arr[:, 0] == 1).sum()
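The three answers above use column selection and boolean masks. Here is a short sketch on a hypothetical array with the same column order (Pclass, Fare, Age); the values are made up for illustration.

```python
import numpy as np

# Hypothetical rows with columns in the order Pclass, Fare, Age.
arr = np.array([
    [3, 7.25, 22.0],
    [1, 71.28, 38.0],
    [2, 13.00, 26.0],
    [1, 53.10, 35.0],
])

fares = arr[:, 1]                       # every row, column 1 (Fare)
first_class_mask = arr[:, 0] == 1       # True where Pclass is 1
first_class = arr[first_class_mask]     # keeps only the rows where the mask is True
n_first_class = first_class_mask.sum()  # each True counts as 1
print(arr.shape, fares, n_first_class)
```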
The Basics
Plotting Basics
1. Write the code to create a scatter plot with Pclass on the y-axis and Fare on the x-axis. Color code it according to whether or not they survived. Add the labels “Fare” and “Pclass” on the x and y axes respectively.
Answer :
plt.scatter(df['Fare'], df['Pclass'], c=df['Survived'])
plt.xlabel('Fare')
plt.ylabel('Pclass')
2. Which of the following would draw a straight line that goes from the point (10, 0) to (100, 3)?
Answer : plt.plot([10, 100], [0, 3])
The Basics
Module 1 Quiz
1. Which of the following is used for reading and manipulating data with the main data object of a DataFrame?
Answer : Pandas
2. Which is used for doing computations and analysis of numerical data with the main data object of an array?
Answer : Numpy
3. From the list of the temperature highs over the past 5 days, what are the median and mean temperatures? 10, 20, 40, 30, 40
Answer : median 30 / mean 28
4. What is a measure of how spread out the data is?
Answer : standard deviation & variance
5. We have a csv file called people.csv. The data has three columns: Name, Country, Gender. It looks as follows:
Name, Country, Gender
Maria, USA, female
Davit, Armenia, male
Write the code to load the data as a pandas DataFrame and then print a pandas Series of just the Name column.
Answer : read_csv / 'people.csv' / 'Name'
6. We have a pandas DataFrame of people’s heights (in centimeters) and weights (in kg). Which of the following is the correct code to take the height and weight columns as a numpy array?
Answer : df[['Height', 'Weight']].values
7. Complete the following code to draw a graph of the previous dataframe, with the Height column on the x-axis and the Weight column on the y-axis.
import matplotlib.pyplot as plt
plt.scatter(df['Height'], df['Weight'])
What is Classification?
1. Classification is:
Answer :
- a type of supervised learning
- predicting a categorical value
2. A feature is what we’re trying to predict and a target is a piece of data we can use to make our prediction.
Answer : false
Classification
A Linear Model for Classification
1. For a classification problem, we build a model to separate the positive cases from the negative cases.
Answer : true
2. From this equation for a line, which of the following points are on the line?
0 = (2)x + (1) y - 5
Answer : (0, 5) and (2, 1)
3. Let’s look at the following equation for a line. 0 = 2x + y - 5. Which of these datapoints would have a positive prediction?
Answer : (3,0) (-1, 8)
4. What is the goal of drawing the line in a linear model for classification?
Answer : To separate the two classes
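The answers to questions 2 and 3 can be verified by plugging the points into the line equation. The sketch below assumes the convention that a point is predicted positive when 2x + y - 5 is greater than 0; the function names are illustrative.

```python
def line_value(x, y):
    """Left-hand side of the line equation 0 = 2x + y - 5."""
    return 2 * x + y - 5


def predict_positive(x, y):
    """Assumed convention: positive when the line value is above 0."""
    return line_value(x, y) > 0


on_line = [(0, 5), (2, 1)]
positive_side = [(3, 0), (-1, 8)]
print([line_value(x, y) for x, y in on_line])
print([predict_positive(x, y) for x, y in positive_side])
```

Points that give exactly 0 lie on the line, and the sign of the value says on which side of the line a point falls.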
Classification
Logistic Regression Model
1. In Logistic Regression, we calculate a probability. For the Titanic dataset, we predict that the passenger survives if the probability is:
Answer : 0.75 and 1
2. If the predicted probability is 0.75 and the passenger didn’t survive, what’s the likelihood score?
Answer : 0.25
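The likelihood rule behind this answer can be written as a small function. This is a sketch of the rule as used in the quiz (p when the passenger survived, 1 - p otherwise); the function name is hypothetical.

```python
import math


def likelihood(predicted_probability, survived):
    """Return p if the passenger survived and 1 - p otherwise."""
    if survived:
        return predicted_probability
    return 1 - predicted_probability


print(likelihood(0.75, survived=False))
print(likelihood(0.8, survived=True))
print(math.isclose(likelihood(0.6, survived=False), 0.4))
```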
Classification
Build a Logistic Regression Model with Sklearn
1. Scikit-learn’s primary use is:
Answer : Machine Learning algorithms
2. Complete this code to create a numpy array X of the Fare and Age features and a numpy array y of the target, where the target is the Survived column.
import pandas as pd
df = pd.read_csv('./titanic.csv')
X = df[['Fare', 'Age']].values
y = df['Survived'].values
3. Complete the code to build a Logistic Regression model. Assume that we have a 2d numpy array X of the features and a 1d numpy array y of the target.
from ____.linear_model import LogisticRegression
model = LogisticRegression()
model.____(X, y)
Answer : sklearn, fit
4. Say X is a matrix of features and y is a target of True/False values. Let’s run the following code.
model = LogisticRegression()
model.fit(X, y)
print(model.predict(X[:5]))
Which of the following are possible results?
Answer : [0 0 0 0 0] and [1 0 1 0 1]
5. Assume that y=[0, 0, 0, 1, 1] and the result of model.predict(X) is [0, 0, 1, 1, 0]. What is the expected output of the following code? model.score(X, y)
Answer : 0.6
Classification
Logistic Regression with the Breast Cancer Dataset
1. Complete the code to load the breast cancer dataset from scikit-learn.
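from sklearn.____ import load_breast_cancer
cancer_data = ____breast_cancer()
Answer : datasets and load (the completed lines read from sklearn.datasets import load_breast_cancer and cancer_data = load_breast_cancer())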
2. The target for the first datapoint is 0, so the tumor is:
Answer : Malignant
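The mapping from target value to tumor type can be read from the dataset object itself. A minimal sketch, assuming scikit-learn is installed:

```python
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()
print(cancer_data.data.shape)    # rows are datapoints, columns are features
print(cancer_data.target_names)  # position 0 is the label for target value 0

# Look up the label of the first datapoint from its target value.
first_label = cancer_data.target_names[cancer_data.target[0]]
print(first_label)
```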
3.The accuracy score of 96% means: Select all that apply
Answer :
- Our model has 96% of the datapoints on the correct side of the line.
- Our model made the correct prediction for 96% of the datapoints.
Classification
Module 2 Quiz
1. If the target of a classification problem has a categorical value, it means that it has how many possible values?
Answer : Finite
2. Select all that are true for how we have built models for the Titanic dataset.
Answer :
-The survived column is the target
-The Pclass column is a feature
3. Reorder these lines of code to build a Logistic Regression model with X
and y and print the percent of values predicted correctly.
Answer :
- from sklearn.linear_model import LogisticRegression
- model = LogisticRegression()
- model.fit(X, y)
- print(model.score(X, y))
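The four lines in this answer run as written once X and y exist. The sketch below supplies synthetic stand-in data (not the Titanic dataset) so that it is self-contained; the data-generating rule is an assumption made only for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: two features, target is 1 when their sum is positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)

predictions = model.predict(X[:5])  # one 0 or 1 per row
accuracy = model.score(X, y)        # fraction of rows predicted correctly
print(predictions, accuracy)
```

The value returned by score is the same number you get by comparing model.predict(X) with y and taking the mean of the matches.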
4. We’ve used Logistic Regression to find a line to separate the Titanic
dataset. In which of the following values of the predicted probability
is the passenger predicted to survive and the datapoint is the furthest
from the line of separation?
Answer : 0.9
5. If we predict a passenger has 0.8 chance of survival, and the passenger
survived, the likelihood is 0.8. If we predict a passenger has 0.6
chance of survival and the passenger did not survive, what is the
likelihood?
Answer : 0.4
Answer : likelihood
7. Given the following code and output, what is the accuracy of the model?
print(model.predict(X))
print(y)
Output:
[1 0 0 0 1]
[1 1 0 0 0]
Answer : 60 %
Model Evaluation
Evaluation Metrics
Answer : 950/1000 = 0.95, and 0.95 * 100 = 95%
2. Based on the confusion matrix below, compute the accuracy of the model.
                     Actual     Actual
                     Positive   Negative
Predicted positive   20         26
Predicted negative   10         44
What will the accuracy be?
Answer : 64 %
3. Fill in the blanks based on the provided data:
Our confusion matrix is as follows:
                     Actual     Actual
                     Positive   Negative
Predicted positive   233        65
Predicted negative   109        480
There are 233 .....
There are 65 .....
There are 109 .....
There are 480 .....
Answer :
There are 233 true positives
There are 65 false positives
There are 109 false negatives
There are 480 true negatives
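Accuracy follows directly from the four counts of a confusion matrix. A small sketch in plain Python, applied to the two matrices in this section; the helper name `accuracy` is illustrative.

```python
def accuracy(tp, fp, fn, tn):
    """Correct predictions (true positives and true negatives) over all predictions."""
    return (tp + tn) / (tp + fp + fn + tn)


# Matrix from question 2.
accuracy_q2 = accuracy(tp=20, fp=26, fn=10, tn=44)

# Matrix from question 3.
total_q3 = 233 + 65 + 109 + 480
accuracy_q3 = accuracy(tp=233, fp=65, fn=109, tn=480)
print(accuracy_q2, total_q3, round(accuracy_q3, 4))
```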
Model Evaluation
Precision and Recall
1. Our confusion matrix is as follows.
                     Actual     Actual
                     Positive   Negative
Predicted positive   30         20
Predicted negative   10         40
What is the precision?
Answer : 30/(30+20)=0.6
2. Our confusion matrix is as follows.
                     Actual     Actual
                     Positive   Negative
Predicted positive   30         20
Predicted negative   10         40
What is the recall?
Answer : 30/(30+10)=0.75
3. We’re building a model to predict spam email. The positive cases are
spam and the negative cases are legitimate. If we’re going to delete
email that we predict is spam, which is more important to maximize?
4. You have built a model that has precision of 0.8 and recall of 0.5. What is the equation for the F1 score?
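The precision, recall and F1 questions in this section can be worked through in plain Python. This is a sketch using the standard definitions; the function names are chosen here and are not from the course.

```python
import math


def precision(tp, fp):
    """Share of positive predictions that were actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that were predicted positive."""
    return tp / (tp + fn)


def f1(p, r):
    """F1 = 2 * precision * recall / (precision + recall)."""
    return 2 * p * r / (p + r)


p = precision(tp=30, fp=20)
r = recall(tp=30, fn=10)
f1_q4 = f1(0.8, 0.5)  # the values given in question 4
print(p, r, round(f1_q4, 4))
```

For question 4 the equation is 2 * (0.8 * 0.5) / (0.8 + 0.5), which is roughly 0.62.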
Model Evaluation
Calculating Metrics in Scikit-learn
1. Assume we have a 2D numpy array X of features and a 1D numpy array y of
target values. Rearrange the code to build a model on the data and print
the precision, recall, and f1 score in that order.
Answer :
1. model = ...
2. model.fit(...)
3. y_pred = ...
4. print(precision ...)
5. print(recall ...)
6. print(f1 ...)
2. We get the following output from sklearn’s confusion_matrix function.
[[4 1]
[3 2]]
How many of each of the following are there?
True positives:
False positives:
False negatives:
True negatives:
Answer :
True positives: 2
False positives: 1
False negatives: 3
True negatives: 4
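The layout of sklearn's confusion matrix differs from the tables used earlier: rows are the actual values, columns are the predictions, and the negative class comes first. The sketch below uses hypothetical labels chosen so that the output matches the matrix in the question.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels chosen to reproduce the matrix from the question.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0, 0, 0, 1, 1]

matrix = confusion_matrix(y_true, y_pred)
# Flattening the 2x2 matrix row by row yields tn, fp, fn, tp.
tn, fp, fn, tp = matrix.ravel()
print(matrix)
print(tp, fp, fn, tn)
```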
Model Evaluation
Training and Testing
1. Overfitting is when:
Select all that apply
Answer : * We do a good job making predictions on data we've already seen
* We don't perform well on new data
2. Which of the following is used to evaluate the model ?
3. We have 2d numpy array X of 100 datapoints and 4 features and 1d array y of 100 target values. What is the output of this code?
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape)
Result:
( , )( ,)
Answer : 75, 4, 75, so the output is (75, 4) (75,)
4. Assume we have a 2-dimensional numpy array X of features and a 1-dimensional numpy array y of target values.
We start with the following:
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = LogisticRegression()
Which of the following is a correct use of the training and test sets in scikit-learn?
Answer :
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
5. We use a random_state parameter to ensure that we get the same random split every time the same code is run.
Answer : True
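The split sizes and the effect of random_state can be checked on placeholder data. In this sketch the arrays are made up, and the seed value 27 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 datapoints with 4 features each.
X = np.arange(400).reshape(100, 4)
y = np.arange(100) % 2

# By default 25% of the datapoints go to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=27)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

# The same random_state reproduces the same split.
X_train_again, _, _, _ = train_test_split(X, y, random_state=27)
print(np.array_equal(X_train, X_train_again))
```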
Model Evaluation
Foundations for the ROC Curve
1. If we choose a threshold of 0.75, which of the following is true about our predictions and precision?
Answer : more negative predictions / precision will be higher
2. From the following confusion matrix, what would the sensitivity and specificity be?
                     Actual     Actual
                     Positive   Negative
Predicted positive   30         20
Predicted negative   10         40
Answer : 0.75 & 0.67
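Sensitivity and specificity come from the same four counts. A minimal sketch in plain Python using the standard definitions:

```python
# Counts read from the confusion matrix in the question.
tp, fp, fn, tn = 30, 20, 10, 40

sensitivity = tp / (tp + fn)  # share of actual positives found (same as recall)
specificity = tn / (tn + fp)  # share of actual negatives correctly rejected
print(round(sensitivity, 2), round(specificity, 2))
```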
3. What value do we get with the following code?
p, r, f, s = precision_recall_fscore_support(y_test, y_pred)
print(r[1])
Answer : Sensitivity
4. Which of the following gives us an array of the predicted probabilities
(each value will be the probability that the datapoint belongs to the
positive class).
Answer : model.predict_proba(X_test)[:, 1]
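The positive-class probabilities make it possible to apply a custom threshold such as 0.75. The sketch below uses synthetic, noisy stand-in data and, for brevity, predicts on the same data it was fit on; both are assumptions made only to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with a noisy target.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)

positive_proba = model.predict_proba(X)[:, 1]  # column 1 is the positive class
default_pred = model.predict(X)                # corresponds to a 0.5 threshold
strict_pred = positive_proba > 0.75            # stricter threshold
print(default_pred.sum(), strict_pred.sum())
```

Raising the threshold can only remove positive predictions, which is why a threshold of 0.75 gives more negative predictions than the default.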
Model Evaluation
The ROC Curve
1. Order the lines of code to build the ROC curve with scikit-learn.
Answer :
1. model = ...
2. model.fit(...)
3. y_pred_proba = ...
4. fpr, tpr, ... = ...
5. plt.plot(...)
6. plt.show()
2. In which corner of the ROC curve plot would the ideal model lie?
Answer : Upper left
3. Say we have a model for predicting credit card fraud. If we detect a
fraudulent charge on someone’s account we’re going to disable their
credit card. Thus we want to make sure when we make a positive
prediction that we are accurate. Which of the three models from the ROC
plot is preferred in this case?
Answer : A
4. Complete the code to calculate the AUC score of a Logistic Regression model. Assume X_train, X_test, y_train, y_test have already been created.
model = ____()
model.____(X_train, y_train)
y_pred_proba = model.predict_proba(____)
print(roc_auc_score(____, y_pred_proba[:, 1]))
Answer : LogisticRegression, fit, X_test, y_test
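Here is a runnable sketch of the ROC and AUC steps from this section, on synthetic stand-in data. Plotting is left out so that the example does not depend on a display; the data and seed values are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a noisy target.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred_proba = model.predict_proba(X_test)

# One (false positive rate, true positive rate) pair per threshold.
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba[:, 1])
auc = roc_auc_score(y_test, y_pred_proba[:, 1])
print(auc)
```

The fpr and tpr arrays are what would be passed to plt.plot to draw the curve. An AUC close to 1 means the curve passes near the upper left corner.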
Model Evaluation
k-fold Cross Validation
1. Splitting the dataset into a single training set and test set for evaluation purposes might yield an inaccurate measure of the evaluation metrics when:
Answer : The dataset is small
2. If we were to have a dataset of 100 datapoints and we break it into 5 chunks to make 5 training and test sets, how many datapoints would be in each training set and test set?
Answer : 80 training set datapoints / 20 test set datapoints
3. What is the precision value for our model if we do a 5-fold cross validation and get the following 5 values for precision?
0.7, 0.6, 0.6, 0.8, 0.8
Answer : 0.7
4. After doing a 5-fold cross validation, how do we choose a final single model?
Answer : Build a new model with all of the data
Model Evaluation
k-fold Cross Validation in Sklearn
1. Let’s say we have a 2-dimensional numpy array X of the features and a 1-dimensional numpy array y of the target values. Finish the code below to build the k-fold object with k=5 and generate the 5 chunks.
from sklearn.model_selection import KFold
kf = KFold(n_splits= ____, shuffle=True)
chunks = kf.____(X)
Answer : 5, split
2. Which of the following could be the output of this code assuming X has 3 datapoints?
kf = KFold(n_splits=3, shuffle=True)
splits = list(kf.split(X))
print(splits[0])
Select all that apply
Answer : ([0,1],[2])
([0,2],[1])
3. Let’s say we have a feature matrix X. Drag and drop in order to correctly define X_train and X_test for the second fold.
kf = KFold(n_splits=5, shuffle=True)
splits = list(kf.split(X))
a, b = splits[___]
X_train = X[___]
X_test = X[___]
Answer :
a,b=splits[1]
X_train=X[a]
X_test=X[b]
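The second-fold selection can be tried on a small placeholder matrix. In this sketch X has 10 made-up datapoints, and random_state is set only so that repeated runs give the same folds.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder feature matrix: 10 datapoints, 2 features.
X = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
splits = list(kf.split(X))  # one (train indices, test indices) pair per fold

train_indices, test_indices = splits[1]  # index 1 is the second fold
X_train = X[train_indices]
X_test = X[test_indices]
print(len(splits), X_train.shape, X_test.shape)
```

Each pair holds row indices rather than data, which is why they are used to index into X.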
4. Say we have a list precision_scores of all the 5 precision values for each of the folds. Which of the following is the final single precision value?
Select all that apply
Answer : np.sum(precision_scores) / len(precision_scores) and np.mean(precision_scores)
Model Evaluation
Model Comparison
1.Which of the following is a goal of using evaluation metrics?
Answer : Compare two different models for the Titanic dataset
2. Complete the code to create a fourth feature matrix that has just the Pclass and Sex features and uses the score_model function to print the scores. Assume we’ve defined y to be the target values and kf to be the KFold object.
X4 = df[['Pclass', 'male']]._____
score_model(X4, y, kf)
Answer : values
3. Model 1 has precision 0.77 and recall 0.68. Model 2 has precision 0.70 and recall 0.72. Which model is better?
Answer : It depends on the situation
Model Evaluation
Module 3 Quiz
1. We’ve built a model that has the following confusion matrix. Based on this, how many total datapoints are there, how many positive datapoints are there and how many negative datapoints are there?
                     Actual     Actual
                     Positive   Negative
Predicted positive   4          4
Predicted negative   1          11
Answer : 20 total / 5 positive / 15 negative
2. We have built a model that has the following confusion matrix.
                     Actual     Actual
                     Positive   Negative
Predicted positive   4          4
Predicted negative   1          11
What would the accuracy be?
Answer : 0.75
3. Accuracy is a bad measure of model performance when:
Select all that apply
Answer : There are more positive cases than negative
There are more negative cases than positive
4. When evaluating a model, we:
Answer : Split the data into a training
set and a test set so that we build the model on the training set and
test the model on unseen data.
5. Let's say we have a dataset of 150 datapoints and we are doing k-fold cross validation with k=5.
How many training sets do we have and of what size?
Answer : 5 sets and 120 datapoints
6. Complete the code to do a k-fold cross validation where k=5 and calculate the accuracy. X is the feature matrix and y is the target array.
scores = []
kf = KFold(n_splits=5, shuffle=True)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LogisticRegression()
    model.____(X_train, y_train)
    scores.____(model.score(X_test, y_test))
print(np.____(scores))
Answer : fit, append, mean
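The completed loop runs as follows on synthetic stand-in data with 150 datapoints, matching the sizes from question 5. The data-generating rule, the seeds and the extra `train_sizes` list are assumptions added for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data: 150 datapoints, 2 features.
rng = np.random.default_rng(3)
X = rng.normal(size=(150, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

scores = []
train_sizes = []  # recorded only to show the size of each training set
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LogisticRegression()
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))
    train_sizes.append(len(train_index))

mean_accuracy = np.mean(scores)
print(train_sizes, mean_accuracy)
```

With 150 datapoints and k=5, each of the 5 training sets has 120 datapoints and each test set has 30.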