KNN中不同距離度量對比和介紹（1）

發(fā)布人：數(shù)據(jù)派THU 時(shí)間：2023-05-22 來源：工程師

加入技術(shù)交流群
- 掃碼加入
  和技術(shù)大咖面對面交流
  海量資料庫查詢

k近鄰算法KNN是一種簡單而強(qiáng)大的算法，可用于分類和回歸任務(wù)。他實(shí)現(xiàn)簡單，主要依賴不同的距離度量來判斷向量間的區(qū)別，但是有很多距離度量可以使用，所以本文演示了KNN與三種不同距離度量(Euclidean、Minkowski和Manhattan)的使用。

KNN算法概述

KNN是一種惰性、基于實(shí)例的算法。它的工作原理是將新樣本的特征與數(shù)據(jù)集中現(xiàn)有樣本的特征進(jìn)行比較。然后算法選擇最接近的k個(gè)樣本，其中k是用戶定義的參數(shù)。新樣本的輸出是基于“k”最近樣本的多數(shù)類(用于分類)或平均值(用于回歸)確定的。
有很多距離度量的算法，我們這里選取3個(gè)最常用度量的算法來進(jìn)行演示：
1. 歐氏距離 Euclidean Distance

 def euclidean_distance(x1, x2):     return math.sqrt(np.sum((x1 - x2)**2))

euclidean_distance函數(shù)計(jì)算多維空間中兩點(diǎn)(x1和x2)之間的歐氏距離，函數(shù)的工作原理如下:

從x1元素中減去x2，得到對應(yīng)坐標(biāo)之間的差值。
使用**2運(yùn)算將差值平方。
使用np.sum()對差的平方求和。
使用math.sqrt()取總和的平方根。

歐幾里得距離是歐幾里得空間中兩點(diǎn)之間的直線距離。通過計(jì)算歐幾里得距離，可以識(shí)別給定樣本的最近鄰居，并根據(jù)鄰居的多數(shù)類(用于分類)或平均值(用于回歸)進(jìn)行預(yù)測。在處理連續(xù)的實(shí)值特征時(shí)，使用歐幾里得距離很有幫助，因?yàn)樗峁┝艘环N直觀的相似性度量。
2. 曼哈頓距離 Manhattan Distance
兩點(diǎn)坐標(biāo)的絕對差值之和。

 def manhattan_distance(x1, x2):     return np.sum(np.abs(x1 - x2))

Manhattan _distance函數(shù)計(jì)算多維空間中兩點(diǎn)(x1和x2)之間的曼哈頓距離，函數(shù)的工作原理如下:

用np計(jì)算x1和x2對應(yīng)坐標(biāo)的絕對差值。Abs (x1 - x2)
使用np.sum()對絕對差進(jìn)行求和。

曼哈頓距離，也稱為L1距離或出租車距離，是兩點(diǎn)坐標(biāo)的絕對差值之和。它代表了當(dāng)運(yùn)動(dòng)被限制為網(wǎng)格狀結(jié)構(gòu)時(shí)，點(diǎn)之間的最短路徑，類似于在城市街道上行駛的出租車。
在數(shù)據(jù)特征具有不同尺度的情況下，或者當(dāng)問題域的網(wǎng)格狀結(jié)構(gòu)使其成為更合適的相似性度量時(shí)，使用曼哈頓距離可能會(huì)有所幫助。曼哈頓距離可以根據(jù)樣本的特征來衡量樣本之間的相似性或差異性。
與歐幾里得距離相比，曼哈頓距離對異常值的敏感性較低，因?yàn)樗鼪]有對差異進(jìn)行平方。這可以使它更適合于某些數(shù)據(jù)集或異常值的存在可能對模型的性能產(chǎn)生重大影響的問題。
3. 閔可夫斯基距離 Minkowski Distance
它是歐幾里得距離和曼哈頓距離的一般化的表現(xiàn)形式，使用p進(jìn)行參數(shù)化。當(dāng)p=2時(shí)，它變成歐氏距離，當(dāng)p=1時(shí)，它變成曼哈頓距離。



 def minkowski_distance(x1, x2, p):    return np.power(np.sum(np.power(np.abs(x1 - x2), p)), 1/p)

minkowski_distance函數(shù)計(jì)算多維空間中兩點(diǎn)(x1和x2)之間的閔可夫斯基距離。
當(dāng)你想要控制單個(gè)特征的差異對整體距離的影響時(shí)，使用閔可夫斯基距離會(huì)很有幫助。通過改變p值，可以調(diào)整距離度量對特征值或大或小差異的靈敏度，使其更適合特定的問題域或數(shù)據(jù)集。
閔可夫斯基距離可以根據(jù)樣本的特征來衡量樣本之間的相似性或不相似性。該算法通過計(jì)算適當(dāng)p值的閔可夫斯基距離，識(shí)別出給定樣本的最近鄰居，并根據(jù)鄰居的多數(shù)類(用于分類)或平均值(用于回歸)進(jìn)行預(yù)測。

KNN 算法的代碼實(shí)現(xiàn)

因?yàn)镵NN算法的原理很簡單，所以我們這里直接使用Python實(shí)現(xiàn)，這樣也可以對算法有一個(gè)更好的理解：

def knn_euclidean_distance(X_train, y_train, X_test, k):     # List to store the predicted labels for the test set     y_pred = []
     # Iterate over each point in the test set     for i in range(len(X_test)):         distances = []         # Iterate over each point in the training set         for j in range(len(X_train)):             # Calculate the distance between the two points using the Euclidean distance metric             dist = euclidean_distance(X_test[i], X_train[j])             distances.append((dist, y_train[j]))
         # Sort the distances list by distance (ascending order)         distances.sort()
         # Get the k nearest neighbors         neighbors = distances[:k]
         # Count the votes for each class         counts = {}         for neighbor in neighbors:             label = neighbor[1]             if label in counts:                 counts[label] += 1             else:                 counts[label] = 1
         # Get the class with the most votes         max_count = max(counts, key=counts.get)         y_pred.append(max_count)
     return y_pred

這個(gè)' knn_euclidean_distance '函數(shù)對于解決分類問題很有用，因?yàn)樗梢愿鶕?jù)' k '個(gè)最近鄰居中的大多數(shù)類進(jìn)行預(yù)測。該函數(shù)使用歐幾里得距離作為相似性度量，可以識(shí)別測試集中每個(gè)數(shù)據(jù)點(diǎn)的最近鄰居，并相應(yīng)地預(yù)測它們的標(biāo)簽。我們實(shí)現(xiàn)的代碼提供了一種顯式的方法來計(jì)算距離、選擇鄰居，并根據(jù)鄰居的投票做出預(yù)測。
在使用曼哈頓距離時(shí)，KNN算法與歐氏距離保持一致，只需要將距離計(jì)算函數(shù)euclidean_distance修改為manhattan_distance。而閔可夫斯基距離則需要多加一個(gè)參數(shù)p，實(shí)現(xiàn)代碼如下：



































def knn_minkowski_distance(X_train, y_train, X_test, k, p):     # List to store the predicted labels for the test set     y_pred = []
     # Iterate over each point in the test set     for i in range(len(X_test)):         distances = []         # Iterate over each point in the training set         for j in range(len(X_train)):             # Calculate the distance between the two points using the Minkowski distance metric             dist = minkowski_distance(X_test[i], X_train[j], p)             distances.append((dist, y_train[j]))
         # Sort the distances list by distance (ascending order)         distances.sort()
         # Get the k nearest neighbors         neighbors = distances[:k]
         # Count the votes for each class         counts = {}         for neighbor in neighbors:             label = neighbor[1]             if label in counts:                 counts[label] += 1             else:                 counts[label] = 1
         # Get the class with the most votes         max_count = max(counts, key=counts.get)         y_pred.append(max_count)
     return y_pred

*博客內(nèi)容為網(wǎng)友個(gè)人發(fā)布，僅代表博主個(gè)人觀點(diǎn)，如有侵權(quán)請聯(lián)系工作人員刪除。

博客專欄

KNN中不同距離度量對比和介紹（1）

相關(guān)推薦

技術(shù)專區(qū)