+++
title = 'Programming reference'
+++

# Numpy & matplotlib
Load an external file:

```python
import numpy as np
data = np.loadtxt('./filepath.csv', delimiter=',')
```

Print the dimensions of the data:

```python
data.shape
```

Graph two columns of data:

```python
import matplotlib.pyplot as plt
%matplotlib inline
x = data[:, 0]
y = data[:, 1]
# includes size and transparency settings; uses the third column for color
plt.scatter(x, y, s=3, alpha=0.2, c=data[:, 2], cmap='RdYlBu_r')
plt.xlabel('x axis')
plt.ylabel('y axis');
```

Histogram plotting:

```python
# bins sets the number of bars; start and end are placeholders bounding the plotted range
plt.hist(data, bins=100, range=[start, end])
```

The identity matrix:

```python
np.eye(2) # for a 2x2 matrix
```

Matrix multiplication:

```python
a * b    # element-wise product
a.dot(b) # matrix product
```

Useful references:
* [The official numpy quickstart guide](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)
* [A more in-depth tutorial, with in-browser samples](https://www.datacamp.com/community/tutorials/python-numpy-tutorial)
* [A very good walk through the most important functions and features](http://cs231n.github.io/python-numpy-tutorial/). From the famous [CS231n course](http://cs231n.github.io/) at Stanford.
* [The official pyplot tutorial](https://matplotlib.org/users/pyplot_tutorial.html). Note that pyplot can accept basic python lists as well as numpy data.
* [A gallery of example MPL plots](https://matplotlib.org/gallery.html). Most of these do not use the pyplot state-machine interface, but the more low-level objects like [Axes](https://matplotlib.org/api/axes_api.html).
* [In-depth walk through the main features and plot types](http://www.scipy-lectures.org/intro/matplotlib/matplotlib.html)


# Sklearn
Split the data into train and test sets, on features `x` and target `y`:

```python
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)
```

An estimator implements a method `fit(x, y)` that learns from the data, and a method `predict(T)` that takes new instances and predicts their target values.
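A minimal sketch of that interface, using a k-NN classifier; the toy data here is purely illustrative:

```python
from sklearn.neighbors import KNeighborsClassifier

# toy data: four instances with two features each, and a binary target
x = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 0, 1]

model = KNeighborsClassifier(n_neighbors=1)
model.fit(x, y)              # learn from the training data
model.predict([[0.9, 0.8]])  # predict the target for a new instance
```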
Linear classifier, using the SVC model with a linear kernel:

```python
from sklearn.svm import SVC
linear = SVC(kernel='linear')
linear.fit(x_train, y_train)
```

Decision tree classifier:

```python
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(x_train, y_train)
```

k-Nearest Neighbors:

```python
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(15) # We set the number of neighbors to 15
knn.fit(x_train, y_train)
```

Classify new data:

```python
linear.predict(some_data)
```

Compute accuracy on the test data:

```python
from sklearn.metrics import accuracy_score
y_predicted = linear.predict(x_test)
accuracy_score(y_test, y_predicted)
```

Make a plot of the classification, with colors showing the classifier's decision:

```python
from mlxtend.plotting import plot_decision_regions
plot_decision_regions(x_test[:500], y_test.astype(int)[:500], clf=linear, res=0.1);
```

Compare classifiers via ROC curves:

```python
from sklearn.metrics import roc_curve, auc

# The linear classifier doesn't produce class probabilities by default.
# We'll retrain it for probabilities.
linear = SVC(kernel='linear', probability=True)
linear.fit(x_train, y_train)

# We'll need class probabilities from each of the classifiers
y_linear = linear.predict_proba(x_test)
y_tree = tree.predict_proba(x_test)
y_knn = knn.predict_proba(x_test)

# Compute the points on the curve
# We pass the probability of the second (positive) class as the y_score
curve_linear = roc_curve(y_test, y_linear[:, 1])
curve_tree = roc_curve(y_test, y_tree[:, 1])
curve_knn = roc_curve(y_test, y_knn[:, 1])

# Compute the Area Under the Curve
auc_linear = auc(curve_linear[0], curve_linear[1])
auc_tree = auc(curve_tree[0], curve_tree[1])
auc_knn = auc(curve_knn[0], curve_knn[1])

plt.plot(curve_linear[0], curve_linear[1], label='linear (area = %0.2f)' % auc_linear)
plt.plot(curve_tree[0], curve_tree[1], label='tree (area = %0.2f)' % auc_tree)
plt.plot(curve_knn[0], curve_knn[1], label='knn (area = %0.2f)' % auc_knn)

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve');

plt.legend();
```

Cross-validation:

```python
from sklearn.model_selection import cross_val_score

# The cross_val_score function does all the training for us. We simply pass
# it the complete data, the model, and the metric.

linear = SVC(kernel='linear', probability=True)

# Train for 3 folds, returning ROC AUC. You can also try 'accuracy' as a scorer.
scores = cross_val_score(linear, x, y, cv=3, scoring='roc_auc')

print('scores per fold ', scores)
```

Regression:

```python
from sklearn import datasets
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset, and select one feature (Body Mass Index)
x, y = datasets.load_diabetes(return_X_y=True)
x = x[:, 2].reshape(-1, 1)

# -- the reshape operation ensures that x still has two dimensions
# (that is, we need it to be an n by 1 matrix, not a vector)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

# feature space on the horizontal axis, output space on the vertical axis
plt.scatter(x_train[:, 0], y_train)
plt.xlabel('BMI')
plt.ylabel('disease progression');

# Train three models: linear regression, tree regression, knn regression
from sklearn.linear_model import LinearRegression
linear = LinearRegression()
linear.fit(x_train, y_train)

from sklearn.tree import DecisionTreeRegressor
tree = DecisionTreeRegressor()
tree.fit(x_train, y_train)

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(10)
knn.fit(x_train, y_train);

# Plot the models
plt.scatter(x_train, y_train, alpha=0.1)

xlin = np.linspace(-0.10, 0.2, 500).reshape(-1, 1)
plt.plot(xlin, linear.predict(xlin), label='linear')
plt.plot(xlin, tree.predict(xlin), label='tree')
plt.plot(xlin, knn.predict(xlin), label='knn')

print('MSE linear', mean_squared_error(y_test, linear.predict(x_test)))
print('MSE tree  ', mean_squared_error(y_test, tree.predict(x_test)))
print('MSE knn   ', mean_squared_error(y_test, knn.predict(x_test)))

plt.legend();
```
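MSE depends on the scale of the target, so R² can be an easier number to compare. A small follow-up sketch, reusing `linear`, `tree`, `knn`, and the test split from the block above:

```python
from sklearn.metrics import r2_score

# R^2 on the held-out test data: 1.0 is a perfect fit,
# 0.0 is no better than always predicting the mean
for name, model in [('linear', linear), ('tree', tree), ('knn', knn)]:
    print('R2 %-6s %.3f' % (name, r2_score(y_test, model.predict(x_test))))
```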
Useful references:
* [The official quickstart guide](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)
* [A DataCamp tutorial with interactive exercises](https://www.datacamp.com/community/tutorials/machine-learning-python)
* [Analyzing text data with SKLearn](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)