I'd like to get some help understanding vectorized operations on multi-dimensional arrays. Specifically, I've got a problem and some code that I think should work, but it doesn't, and I'm sure it's because my thinking is wrong somewhere; I just can't figure out where.

Some caveats:

  • This is for some homework. I really don't want a plop of code that I'm supposed to copy/paste without understanding. If I wanted that, I'd go to StackOverflow. I want the concepts.
  • I want to do this using only numpy. I know that scipy and other ML libraries have fancy functions that would do what I'm asking about as a black box, but that's not what I want. This is a learning exercise.

The Scenario

I've got two datasets of Iris data (yes, that Iris data): a training set and a test set. Both sets have 4 columns of float values and an associated vector of labels classifying each data point.

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3,1.4,0.2,Iris-setosa
7,3.2,4.7,1.4,Iris-versicolor
...
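
For context, here's roughly how I'm loading each file (a sketch; the filename and the np.genfromtxt approach are just what I happened to use, not anything mandated by the assignment):

import numpy as np

# Hypothetical filename. Columns 0-3 are the float features,
# column 4 is the string label.
data = np.genfromtxt("iris_train.csv", delimiter=",", usecols=(0, 1, 2, 3))
labels = np.genfromtxt("iris_train.csv", delimiter=",", usecols=4, dtype=str)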

We're doing a 1-Nearest-Neighbor classification. The goal is to do the following:

  • For each data point in the testing set, compare it to all the points in the training set by calculating the "distance" between the two. Distance is calculated as
import math

def distance(x, y):
    return math.sqrt((x[0] - y[0])**2 + (x[1] - y[1])**2
                     + (x[2] - y[2])**2 + (x[3] - y[3])**2)

Also known as the Root-Sum-Square of the differences between corresponding features of the two points (i.e., the Euclidean distance).
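
For a single pair of points, I believe the same computation can be written in NumPy without spelling out each feature by hand. A quick sketch, using the first two sample rows from above:

import numpy as np

x = np.array([5.1, 3.5, 1.4, 0.2])
y = np.array([4.9, 3.0, 1.4, 0.2])

# Subtract elementwise, square, sum, take the square root: the same
# root-sum-square as the hand-written formula.
d = np.sqrt(np.sum((x - y) ** 2))

# np.linalg.norm computes the same Euclidean distance.
print(d, np.linalg.norm(x - y))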

My Code

import numpy as np

def distance(x, y):
    return np.sqrt(np.sum((x - y)**2, axis=1))


def main():
    # ... blah blah load data
    # training_data is 75 rows x 4 cols of floats
    # testing_data is 75 rows x 4 cols of floats
    # training_labels is 75 rows x 1 col of strings
    # testing_labels is 75 rows x 1 col of strings

    # My thought is to use "broadcasting" to do it without loops.
    # So far, to me, "broadcasting" == "magic" (there's a small
    # broadcasting demo after this code block).

    training_data = training_data.reshape((1, 4, 75))
    testing_data = testing_data.reshape((75, 4, 1))

    # So this next bit should work like magic, producing a 75 x 1 x 75 matrix of
    # distances between the testing data (row indices) and the training data
    # (column indices)

    distances = distance(testing_data, training_data)

    # And the column index of the minimum distance should in theory be the
    # index of the training point that is the "closest" to the given testing
    # point for that row (see the argmin/indexing sanity check after this block)

    closest_indices = distances.argmin(axis=1)

    # And this should build an array of labels corresponding to the indices
    # gathered above
    predicted_labels = training_labels[closest_indices]

    number_correct = np.sum(predicted_labels == testing_labels)
    accuracy = number_correct/len(testing_labels)

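To poke at the "magic" a little, here's a tiny standalone broadcasting demo I ran with made-up values (nothing to do with the Iris data): NumPy stretches any size-1 axis to match the other operand, so a (3, 1) array and a (1, 4) array combine elementwise as if both were (3, 4).

import numpy as np

a = np.array([[0], [1], [2]])        # shape (3, 1)
b = np.array([[10, 20, 30, 40]])     # shape (1, 4)

# Each size-1 axis is stretched to match the other operand, so the
# sum behaves as if both arrays were (3, 4).
print((a + b).shape)   # (3, 4)
print(a + b)
# [[10 20 30 40]
#  [11 21 31 41]
#  [12 22 32 42]]
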
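And here's the sanity check I ran for the argmin / label-indexing / accuracy steps, again with toy numbers (all the distances and labels below are made up for illustration):

import numpy as np

# Pretend 2 test points vs. 3 training points: one distance per pair.
distances = np.array([[0.9, 0.1, 0.5],
                      [0.3, 0.8, 0.2]])

training_labels = np.array(["setosa", "versicolor", "virginica"])

# argmin along axis 1 gives, for each row (test point), the column
# index of the smallest distance, i.e. the nearest training point.
closest_indices = distances.argmin(axis=1)
print(closest_indices)                    # [1 2]

# Indexing with an integer array ("fancy indexing") picks out the labels.
predicted = training_labels[closest_indices]
print(predicted)                          # ['versicolor' 'virginica']

# Comparing to the true labels gives booleans; summing counts the hits.
true_labels = np.array(["versicolor", "setosa"])
print(np.sum(predicted == true_labels))   # 1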

And this all seems right to me.

But.

When I run it, per the assignment prompt I should be expecting an accuracy somewhere in the 0.94 range, and I'm getting something in the 0.33 range. Which is poop.

So. What am I missing? What key concepts am I totally misunderstanding?

Thank you!
