Monday, 8 June 2020

Machine Learning

The Basics

1. Which programming language are we going to use throughout this course?
Answer : Python


The Basics 
What is Machine Learning?
 
1. Which of the statements are correct?
Answer :
-Machine Learning uses computer power to build models and predict future results.
-Machine Learning takes data and turns it into insights.
2. Which of the following is used for reading data and data manipulation?
Answer : Pandas

3. Which of the following are examples of classification problems?
Answer :  
-Predicting if a credit card charge is fraudulent
-Determining if an image is of a car, bus or bike

The Basics 
Statistics Review

1. The following list includes the number of children that each of 5 different families have. Based on this information, what are the mean and median number of children per family?
0, 1, 1, 2, 6
Answer : mean = 2 / median = 1
2. Say we have a sample of 11 families and the number of kids per family. Each number in the list represents the number of kids in the family. Thus there is 1 family with 0 kids, 5 families with 1 kid, etc.
0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 6

The 25th percentile is 1.
The 50th percentile (aka median) is 1.
The 75th percentile is 2.
3. Which of the following statements are correct?
Answer :
(1) The standard deviation and variance are measures of how dispersed the data is.
(2) The standard deviation is the square root of the variance.
4. Complete the code for calculating and printing the 60th percentile of the data array.
Answer :
import numpy as np
print(np.percentile(data, 60))
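A minimal sketch for checking the answers in this section with NumPy. The variable names children and kids are made up here; the numbers come from questions 1 and 2.

```python
import numpy as np

# Question 1: number of children in 5 families
children = np.array([0, 1, 1, 2, 6])
mean_children = np.mean(children)      # (0 + 1 + 1 + 2 + 6) / 5
median_children = np.median(children)  # middle value of the sorted list

# Question 2: number of kids in 11 families
kids = np.array([0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 6])
p25, p50, p75 = np.percentile(kids, [25, 50, 75])

# Question 3: the standard deviation is the square root of the variance
std_kids = np.std(kids)
var_kids = np.var(kids)

print(mean_children, median_children)
print(p25, p50, p75)
print(np.isclose(std_kids, np.sqrt(var_kids)))
```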

The Basics 
Reading Data with Pandas

1. What do we call the Pandas data object?
Answer : DataFrame
2. Fill in the blanks to complete the code to read in a file called mydata.csv as a pandas DataFrame and print a table of the first 5 rows.
Answer :
import pandas as pd
df = pd.read_csv('mydata.csv')
print(df.head())
3. Based on our dataset, what is the maximum Pclass and the median age?
Answer : max = 3 / median = 28

The Basics 
Manipulating Data with Pandas
1. Based on the following code, what datatype is the variable age_col?
age_col = df['Age']
Answer : Pandas Series
2. Which of the following creates a DataFrame new_df which has the columns Pclass, Age, and Fare?
Answer : new_df = df[['Pclass', 'Age', 'Fare']]
3. Write the code to create the “First Class” column, which is True if the passenger is in Pclass 1 and False otherwise.
Answer : 'First Class' / 'Pclass' / 1, giving df['First Class'] = df['Pclass'] == 1
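The three answers above can be tried out on a small table. This is only a sketch: the DataFrame below is a hypothetical stand-in with made-up rows, not the real Titanic file.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data (made-up rows)
df = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 2],
    'Age': [22.0, 38.0, 26.0, 35.0, 27.0],
    'Fare': [7.25, 71.28, 7.92, 53.10, 13.00],
    'Survived': [0, 1, 1, 1, 0],
})

age_col = df['Age']                     # single brackets give a pandas Series
new_df = df[['Pclass', 'Age', 'Fare']]  # double brackets give a DataFrame
df['First Class'] = df['Pclass'] == 1   # True where Pclass is 1

print(type(age_col).__name__)
print(list(new_df.columns))
print(df['First Class'].tolist())
```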

The Basics 
Numpy Basics
1. The data object in numpy is called a(n):
Answer : Array
2. Write the code to get a numpy array of the values in the 'Age' column from the DataFrame df.
Answer : 'Age' / values, giving df['Age'].values
3. The following code returns what type of numpy array: df[['Survived', 'Pclass']].values
Answer : 2-dimensional
4. Recall that we have 887 datapoints in our Titanic dataset. What is the output of the following code?

arr = df[['Survived', 'Pclass']].values
print(arr.shape)
Answer : (887, 2)
 
The Basics 
More with Numpy Arrays
1. We have an array of some of our data:
arr = df[['Pclass', 'Fare', 'Age']].values
How would you select the Fares of all the passengers?
Answer : arr[:, 1]
2. We continue with our array of some of our data:
arr = df[['Pclass', 'Fare', 'Age']].values
Which of the following would complete the code to subset the array to get just the passengers in Pclass 1?
Answer : arr[:, 0] (the column compared against 1 in the mask, as in arr[arr[:, 0] == 1])
3. Which of the following is correct for counting the number of passengers in Pclass 1? We have the following array definition:
arr = df[['Pclass', 'Fare', 'Age']].values
Answer : (arr[:, 0] == 1).sum()
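A short sketch of the column selection, masking and counting from this section. The array holds made-up rows in the order Pclass, Fare, Age; it is not the real dataset.

```python
import numpy as np

# Hypothetical rows in the order Pclass, Fare, Age
arr = np.array([
    [3, 7.25, 22.0],
    [1, 71.28, 38.0],
    [3, 7.92, 26.0],
    [1, 53.10, 35.0],
])

fares = arr[:, 1]                       # every row, column 1 (Fare)
mask = arr[:, 0] == 1                   # True for rows where Pclass is 1
first_class = arr[mask]                 # keep only those rows
n_first_class = (arr[:, 0] == 1).sum()  # each True counts as 1

print(arr.shape)
print(fares)
print(first_class)
print(n_first_class)
```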

The Basics 
Plotting Basics
1. Write the code to create a scatter plot with Pclass on the y-axis and Fare on the x-axis. Color code it according to whether or not they survived. Add the labels “Fare” and “Pclass” on the x and y axes respectively.
Answer :
plt.scatter(df['Fare'], df['Pclass'], c=df['Survived'])
plt.xlabel('Fare')
plt.ylabel('Pclass')
2. Which of the following would draw a straight line that goes from the point (10, 0) to (100, 3)?
Answer : plt.plot([10, 100], [0, 3])

The Basics 
Module 1 Quiz 
  
1. Which of the following is used for reading and manipulating data with the main data object of a DataFrame?
Answer : Pandas
2. Which is used for doing computations and analysis of numerical data with the main data object of an array?  
Answer : Numpy
3. From the list of the temperature highs over the past 5 days, what are the median and mean temperatures? 10, 20, 40, 30, 40
Answer : median 30 / mean 28
4. What is a measure of how spread out the data is?
Answer :
standard deviation & variance
5. We have a csv file called people.csv. The data has three columns: Name, Country, Gender. It looks as follows:

Name, Country, Gender
Maria, USA, female
Davit, Armenia, male
Write the code to load the data as a pandas DataFrame and then print a pandas Series of just the Name column.
Answer : read_csv / 'people.csv' / 'Name', giving:
df = pd.read_csv('people.csv')
print(df['Name'])
6. We have a pandas DataFrame of people’s heights (in centimeters) and weights (in kg). Which of the following is the correct code to take the height and weight columns as a numpy array?
Answer : df[['Height', 'Weight']].values
7. Complete the following code to draw a graph of the previous dataframe, with the Height column on the x-axis and the Weight column on the y-axis.
Answer :
import matplotlib.pyplot as plt
plt.scatter(df['Height'], df['Weight'])
 
Classification 
What is Classification?

1. Classification is:
Answer :
-A type of supervised learning
-Predicting a categorical value
2. A feature is what we’re trying to predict and a target is a piece of data we can use to make our prediction.
Answer : false
Classification 
A Linear Model for Classification
 
1. For a classification problem, we build a model to separate the positive cases from the negative cases.
Answer : true
2. From this equation for a line, which of the following points are on the line?
0 = (2)x + (1) y - 5
Answer : (0, 5) and (2, 1)
3. Let’s look at the following equation for a line. 0 = 2x + y - 5. Which of these datapoints would have a positive prediction?
Answer : (3, 0) and (-1, 8)
4. What is the goal of drawing the line in a linear model for classification?
Answer : To separate the two classes
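The answers to questions 2 and 3 can be checked by plugging the points into the line. A minimal sketch; the function names are made up, and (0, 0) is an extra point added here to show a negative prediction.

```python
def line_value(x, y):
    """Evaluate 2x + y - 5, the right-hand side of 0 = 2x + y - 5."""
    return 2 * x + y - 5


def predict(x, y):
    """Positive prediction when the point falls on the positive side of the line."""
    return line_value(x, y) > 0


# Points on the line evaluate to exactly 0
on_line = [line_value(0, 5), line_value(2, 1)]

# (3, 0) and (-1, 8) are from question 3; (0, 0) is an extra illustrative point
predictions = [predict(3, 0), predict(-1, 8), predict(0, 0)]

print(on_line)
print(predictions)
```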

Classification 
Logistic Regression Model

1. In Logistic Regression, we calculate a probability. For the Titanic dataset, we predict that the passenger survives if the probability is: 
Answer : 0.75 and 1 (both are at least 0.5)
2. If the predicted probability is 0.75 and the passenger didn’t survive, what’s the likelihood score?
Answer : 0.25
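A small sketch of the two ideas in this section. The function names are hypothetical, and the 0.5 cutoff for predicting survival is an assumption stated here rather than something the questions above spell out.

```python
def predicts_survival(p, threshold=0.5):
    """Predict survival when the probability reaches the threshold (0.5 assumed)."""
    return p >= threshold


def likelihood(p, survived):
    """Likelihood for one datapoint: p if the passenger survived, otherwise 1 - p."""
    return p if survived else 1 - p


print(predicts_survival(0.75), predicts_survival(1))
print(likelihood(0.75, survived=False))
```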

Classification 
Build a Logistic Regression Model with Sklearn

1. Scikit-learn’s primary use is:
Answer : Machine Learning algorithms
2. Complete this code to create a numpy array X of the Fare and Age features and a numpy array y of the target, where the target is the Survived column.
Answer :
import pandas as pd
df = pd.read_csv('./titanic.csv')
X = df[['Fare', 'Age']].values
y = df['Survived'].values
3. Complete the code to build a Logistic Regression model. Assume that we have a 2d numpy array X of the features and a 1d numpy array y of the target.
from ____.linear_model import LogisticRegression
model = LogisticRegression()
model.____(X, y)
Answer : sklearn / fit
4. Say X is a matrix of features and y is a target of True/False values. Let’s run the following code.
model = LogisticRegression()
model.fit(X, y)
print(model.predict(X[:5]))
Which of the following are possible results?
Answer : [0 0 0 0 0] and [1 0 1 0 1]
5. Assume that y = [0, 0, 0, 1, 1] and the result of model.predict(X) is [0, 0, 1, 1, 0]. What is the expected output of the following code?
model.score(X, y)
Answer : 0.6
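A sketch of the fit / predict / score steps from this section, run on a tiny made-up dataset (the numbers in X and y are hypothetical). It also recomputes the accuracy from question 5 by hand.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two features, binary target
X = np.array([[1.0, 20.0], [2.0, 25.0], [3.0, 30.0],
              [8.0, 40.0], [9.0, 45.0], [10.0, 50.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)
y_pred = model.predict(X)

# score() is the fraction of predictions that match the target
manual_accuracy = (y_pred == y).mean()
print(y_pred, model.score(X, y), manual_accuracy)

# The arrays from question 5: 3 of the 5 predictions match
y_q5 = np.array([0, 0, 0, 1, 1])
pred_q5 = np.array([0, 0, 1, 1, 0])
accuracy_q5 = (y_q5 == pred_q5).mean()
print(accuracy_q5)
```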

Classification 
Logistic Regression with the Breast Cancer Dataset
 

1. Complete the code to load the breast cancer dataset from scikit-learn.
from sklearn.____ import load_breast_cancer
cancer_data = ____breast_cancer()
Answer : datasets / load_
2. The target for the first datapoint is 0, so the tumor is:
Answer : Malignant
3. The accuracy score of 96% means: Select all that apply
Answer :
- Our model has 96% of the datapoints on the correct side of the line.
- Our model makes the correct prediction for 96% of the datapoints.

Classification 
Module 2 Quiz

1. If the target of a classification problem has a categorical value, it means that it has how many possible values?
Answer : Finite
2. Select all that are true for how we have built models for the Titanic dataset.
Answer : 
-The survived column is the target 
-The Pclass column is a feature 
3. Reorder these lines of code to build a Logistic Regression model with X and y and print the percent of values predicted correctly.
Answer : 
- from sklearn.linear_model import LogisticRegression 
- model = LogisticRegression() 
- model.fit(X, y) 
- print(model.score(X, y))
4. We’ve used Logistic Regression to find a line to separate the Titanic dataset. In which of the following values of the predicted probability is the passenger predicted to survive and the datapoint is the furthest from the line of separation?
Answer : 0.9
5. If we predict a passenger has 0.8 chance of survival, and the passenger survived, the likelihood is 0.8. If we predict a passenger has 0.6 chance of survival and the passenger did not survive, what is the likelihood?
Answer : 0.4


6. A Logistic Regression model will find the line that has the highest:
Answer : likelihood 
7. Given the following code and output, what is the accuracy of the model?
print(model.predict(X))
print(y)
Output:
[1 0 0 0 1]
[1 1 0 0 0] 


Answer : 60%

Model Evaluation 
Evaluation Metrics
 
1. Say you’re tasked with building a model to predict spam email. Your training set has 1000 emails, 950 are legitimate emails and 50 are spam emails. You build a model that just predicts every email is legitimate. What is the accuracy of the model?

Answer : 950/1000 = 0.95, so the accuracy is 95%
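A minimal sketch of why this accuracy is misleading. The label encoding (0 for legitimate, 1 for spam) and the variable names are assumptions made for this example.

```python
# 950 legitimate emails (0) and 50 spam emails (1), as in question 1
y_true = [0] * 950 + [1] * 50
y_pred = [0] * 1000  # the model predicts "legitimate" for every email

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Fraction of the spam emails that the model actually flags
spam_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
spam_caught_rate = spam_caught / sum(y_true)

print(accuracy)
print(spam_caught_rate)
```

The accuracy is high even though the model never flags a single spam email.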

2.  Based on the confusion matrix below, compute the accuracy of the model.

                      Actual     Actual
                      Positive   Negative
Predicted positive    20         26
Predicted negative    10         44

What will the accuracy be?

Answer : (20 + 44) / 100 = 0.64, so 64%

3. Fill in the blanks based on the provided data:
Our confusion matrix is as follows:
                      Actual     Actual
                      Positive   Negative
Predicted positive    233        65
Predicted negative    109        480

There are 233 ____
There are 65 ____
There are 109 ____
There are 480 ____

Answer :

There are 233 true positives
There are 65 false positives
There are 109 false negatives
There are 480 true negatives

Model Evaluation 
Precision and Recall
 
1. Our confusion matrix is as follows.

                      Actual     Actual
                      Positive   Negative
Predicted positive    30         20
Predicted negative    10         40

What is the precision?
 
Answer : 30/(30+20)=0.6
 
2. Our confusion matrix is as follows.

                      Actual     Actual
                      Positive   Negative
Predicted positive    30         20
Predicted negative    10         40

What is the recall?

Answer : 30/(30+10)=0.75
 
3. We’re building a model to predict spam email. The positive cases are spam and the negative cases are legitimate. If we’re going to delete email that we predict is spam, which is more important to maximize?
 
Answer : precision
 
4. You have built a model that has precision of 0.8 and recall of 0.5. What is the equation for the F1 score?

Answer : F1 = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.8 * 0.5) / (0.8 + 0.5), which is about 0.615
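The formulas in this section can be collected into one helper. A sketch only: the function name metrics is made up, and the counts are the ones from the confusion matrix in questions 1 and 2.

```python
def metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of the predicted positives, how many are right
    recall = tp / (tp + fn)     # of the actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Counts from the confusion matrix in questions 1 and 2
accuracy, precision, recall, f1 = metrics(tp=30, fp=20, fn=10, tn=40)
print(accuracy, precision, recall, round(f1, 3))

# Question 4: precision 0.8 and recall 0.5
f1_q4 = 2 * 0.8 * 0.5 / (0.8 + 0.5)
print(round(f1_q4, 3))
```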
 
Model Evaluation 
Calculating Metrics in Scikit-learn  

1. Assume we have a 2D numpy array X of features and a 1D numpy array y of target values. Rearrange the code to build a model on the data and print the precision, recall, and f1 score in that order.
 
Answer : 

1. model = ...
2. model.fit(...)
3. y_pred = ...
4. print(precision_score(...))
5. print(recall_score(...))
6. print(f1_score(...))

2. We get the following output from sklearn’s confusion_matrix function.

[[4 1]
 [3 2]]

How many of each of the following are there?
True positives: ____
False positives: ____
False negatives: ____
True negatives: ____

Answer : True positives: 2 / False positives: 1 / False negatives: 3 / True negatives: 4
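sklearn lays its confusion matrix out with actual values as rows and predicted values as columns, so it reads [[TN FP] [FN TP]]. The sketch below uses made-up labels chosen to reproduce the matrix from question 2, then prints the three scores from question 1 in the same order.

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels chosen so the matrix comes out as [[4 1] [3 2]]
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0, 0, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # row by row: TN, FP, FN, TP

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(cm)
print(tp, fp, fn, tn)
print("precision:", precision)
print("recall:", recall)
print("f1 score:", f1)
```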
 
Model Evaluation 
Training and Testing  

1. Overfitting is when:
Select all that apply
 
Answer :
- We do a good job making predictions on data we've already seen
- We don't perform well on new data

2. Which of the following is used to evaluate the model?

Answer : Test set
 
3. We have 2d numpy array X of 100 datapoints and 4 features and 1d array y of 100 target values. What is the output of this code?

X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, y_train.shape)
Result:
(__, __) (__,)

Answer : 75 / 4 / 75, so the output is (75, 4) (75,)
 
4. Assume we have a 2-dimensional numpy array X of features and a 1-dimensional numpy array y of target values.

We start with the following:

X_train, X_test, y_train, y_test = train_test_split(X, y)
model = LogisticRegression()

Which of the following is a correct use of the training and test sets in scikit-learn?
 
Answer :
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
 
5. We use a random_state parameter to ensure that we get the same random split every time the same code is run.
 
Answer : True
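A sketch that ties questions 3 to 5 together. The data is randomly generated for illustration, and the random_state value 27 is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 datapoints, 4 features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)

# By default 25% of the data is held out for the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=27)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

# Build the model on the training set, evaluate it on the test set
model = LogisticRegression()
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)

# The same random_state gives the same split again
X_train_again, _, _, _ = train_test_split(X, y, random_state=27)
same_split = np.array_equal(X_train, X_train_again)
print(same_split)
```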

Model Evaluation 
Foundations for the ROC Curve

1. If we choose a threshold of 0.75, which of the following is true about our predictions and precision?
 
Answer : more negative predictions / precision will be higher
 
2. From the following confusion matrix, what would the sensitivity and specificity be?

                      Actual     Actual
                      Positive   Negative
Predicted positive    30         20
Predicted negative    10         40

Answer : sensitivity = 30/(30+10) = 0.75 & specificity = 40/(40+20) = 0.67 (rounded)
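The same calculation as a short sketch, using the counts from the confusion matrix in question 2 (the variable names are made up).

```python
# Counts from the confusion matrix in question 2
tp, fp, fn, tn = 30, 20, 10, 40

sensitivity = tp / (tp + fn)  # same as recall: share of actual positives found
specificity = tn / (tn + fp)  # share of actual negatives found

print(sensitivity, round(specificity, 2))
```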
 
3. What value do we get with the following code?

p, r, f, s = precision_recall_fscore_support(y_test, y_pred)
print(r[1])
 
Answer : Sensitivity
 
4. Which of the following gives us an array of the predicted probabilities (each value will be the probability that the datapoint belongs to the positive class)?
 
Answer : model.predict_proba(X_test)[:, 1]
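A sketch of how a stricter threshold changes the predictions. The data is randomly generated for illustration, and for brevity the probabilities are computed on the same data the model was fit on rather than on a separate test set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 200 datapoints, 2 features, noisy binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)

# Column 1 holds the probability of the positive class
proba = model.predict_proba(X)[:, 1]

default_pred = proba > 0.5   # usual cutoff
strict_pred = proba > 0.75   # stricter cutoff: fewer positive predictions

print(default_pred.sum(), strict_pred.sum())
```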
 
Model Evaluation 
The ROC Curve

1. Order the lines of code to build the ROC curve with scikit-learn.
 
Answer :
 
model = ...
model.fit(...)
y_pred_proba = ...
fpr, tpr, ...
plt.plot(...)
plt.show()
 
2. In which corner of the ROC curve plot would the ideal model lie? 
 
Answer : Upper left

3. Say we have a model for predicting credit card fraud. If we detect a fraudulent charge on someone’s account we’re going to disable their credit card. Thus we want to make sure when we make a positive prediction that we are accurate. Which of the three models from the ROC plot is preferred in this case?
 
Answer : A

4. Complete the code to calculate the AUC score of a Logistic Regression model. Assume X_train, X_test, y_train, y_test have already been created.
model = ____()
model.____(X_train, y_train)
y_pred_proba = model.predict_proba(____)
print(roc_auc_score(____, y_pred_proba[:, 1]))

Answer : LogisticRegression / fit / X_test / y_test
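A sketch of the ROC and AUC steps from questions 1 and 4, run on randomly generated data (not the course dataset). The plotting calls are left out so the example does not need a display; fpr and tpr are the arrays that would be passed to plt.plot.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Hypothetical data: 200 datapoints, 2 features, noisy binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred_proba = model.predict_proba(X_test)

# One (fpr, tpr) point per threshold; tpr is the sensitivity, fpr is 1 - specificity
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba[:, 1])
auc = roc_auc_score(y_test, y_pred_proba[:, 1])

print(auc)
```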

Model Evaluation 
k-fold Cross Validation  
 
1. Splitting the dataset into a single training set and test set for evaluation purposes might yield an inaccurate measure of the evaluation metrics when:
 
Answer : The dataset is small
 
2. If we were to have a dataset of 100 datapoints and we break it into 5 chunks to make 5 training and test sets, how many datapoints would be in each training set and test set?
 
Answer : 80 training set datapoints / 20 test set datapoints
 
3. What is the precision value for our model if we do a 5-fold cross validation and get the following 5 values for precision?

0.7, 0.6, 0.6, 0.8, 0.8
 
Answer : 0.7

4. After doing a 5-fold cross validation, how do we choose a final single model?
 
Answer : Build a new model with all of the data 
 
Model Evaluation 
k-fold Cross Validation in Sklearn  
 
1. Let’s say we have a 2-dimensional numpy array X of the features and a 1-dimensional numpy array y of the target values. Finish the code below to build the k-fold object with k=5 and generate the 5 chunks.
from sklearn.model_selection import KFold
kf = KFold(n_splits= ____, shuffle=True)
chunks = kf. ____(X)
 
Answer : 5 / split
 
2. Which of the following could be the output of this code assuming X has 3 datapoints?
kf = KFold(n_splits=3, shuffle=True)
splits = list(kf.split(X))
print(splits[0])
Select all that apply
 
Answer : ([0, 1], [2]) and ([0, 2], [1])
 
3. Let’s say we have a feature matrix X. Drag and drop in order to correctly define X_train and X_test for the second fold.
kf = KFold(n_splits=5, shuffle=True)
splits = list(kf.split(X))
a, b = splits[___]
X_train = X[___]
X_test = X[___]
 
Answer :
a, b = splits[1]
X_train = X[a]
X_test = X[b]
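A sketch of the KFold steps from this section on a made-up feature matrix of 10 datapoints. The random_state argument is added here only so the shuffle is repeatable; it is not part of the questions above.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical feature matrix: 10 datapoints, 2 features
X = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
splits = list(kf.split(X))  # 5 pairs of (train indices, test indices)

a, b = splits[1]            # index 1 is the second fold
X_train = X[a]
X_test = X[b]

print(a, b)
print(X_train.shape, X_test.shape)
```

Each datapoint lands in the test set of exactly one fold, so the train and test indices of any fold together cover the whole dataset.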
 
4. Say we have a list precision_scores of all the 5 precision values for each of the folds. Which of the following is the final single precision value?
Select all that apply
 
Answer : np.sum(precision_scores) / len(precision_scores) and np.median(precision_scores)
 
Model Evaluation 
Model Comparison  
 
1. Which of the following is a goal of using evaluation metrics?
 
Answer : Compare two different models for the Titanic dataset
 
2. Complete the code to create a fourth feature matrix that has just the Pclass and Sex features and uses the score_model function to print the scores. Assume we’ve defined y to be the target values and kf to be the KFold object.
X4 = df[['Pclass', 'male']].____
score_model(X4, y, kf)

Answer : values
 
3. Model 1 has precision 0.77 and recall 0.68. Model 2 has precision 0.70 and recall 0.72. Which model is better?
 
Answer : It depends on the situation
 
Model Evaluation 
Module 3 Quiz  
 
1. We’ve built a model that has the following confusion matrix. Based on this, how many total datapoints are there, how many positive datapoints are there and how many negative datapoints are there?

                      Actual     Actual
                      Positive   Negative
Predicted positive    4          4
Predicted negative    1          11
 
Answer : 20 total / 5 positive / 15 negative
 
2. We have built a model that has the following confusion matrix.

                      Actual     Actual
                      Positive   Negative
Predicted positive    4          4
Predicted negative    1          11

What would the accuracy be?

Answer : (4 + 11) / 20 = 0.75
 
3. Accuracy is a bad measure of model performance when:
Select all that apply
 
Answer : There are more positive cases than negative 
               There are more negative cases than positive

4. When evaluating a model, we:
 
Answer : Split the data into a training set and a test set so that we build the model on the training set and test the model on unseen data. 
 
5. Let's say we have a dataset of 150 datapoints and we are doing k-fold cross validation with k=5.

How many training sets do we have and of what size?
 
Answer : 5 sets and 120 datapoints
 
6. Complete the code to do a k-fold cross validation where k=5 and calculate the accuracy. X is the feature matrix and y is the target array.
scores = []
kf = KFold(n_splits=5, shuffle=True)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LogisticRegression()
    model. ____(X_train, y_train)
    scores.____(model.score(X_test, y_test))
print(np.____(scores))

Answer : fit / append / mean
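The completed loop from question 6, run on randomly generated data of 150 datapoints so that it also illustrates question 5. The data, the random_state value and the train_sizes and final_model names are assumptions made for this sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Hypothetical data: 150 datapoints, 2 features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

scores = []
train_sizes = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LogisticRegression()
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))
    train_sizes.append(len(train_index))

print(train_sizes)      # 5 training sets of 120 datapoints each
print(np.mean(scores))  # the single reported accuracy

# The 5 models are only for evaluation; the final model uses all of the data
final_model = LogisticRegression()
final_model.fit(X, y)
```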