Question

Having the following rdd that contains:
('37.1242586781', '-119.424203758')
('37.7374074392', '-121.790061298')
('45.0929780547', '-117.473248072')
('34.9786739919', '-111.700983823')
('35.8736386531', '-120.421476019')
('34.0912863966', '-118.144076285')
('35.6720438532', '-120.416438975')
('36.284116521', '-119.030325609')
('44.4092246446', '-122.891970809')
('38.4770167448', '-122.296973442')
('38.4296857989', '-121.418135072')
('37.3822266494', '-122.016158235')
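
Before clustering, the string coordinates need to be numeric. A minimal sketch, assuming `rdd` already holds the string pairs shown above:

```python
# Sketch: convert the string pairs into (float, float) tuples.
# `rdd` is assumed to already contain tuples like ('37.1242586781', '-119.424203758').
points = rdd.map(lambda p: (float(p[0]), float(p[1]))).cache()  # cached: reused every iteration
```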

Implement an iterative k-means algorithm in Spark, in Python, to compute the k-means clustering of a set of points read from a file into an RDD. Do not use the K-means implementation in Spark's MLlib to solve the problem. Follow this pattern:

1. Choose k = 5 random points as starting centers
2. Find all points closest to each center (use groupByKey or reduceByKey; a helper sketch follows this list)
3. Find the new center (mean) of each cluster
4. If the centers changed by more than convergeDist (e.g.
convergeDist = 0.1), iterate again starting from step 2;
otherwise, terminate
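
For step 2, one common approach (an illustrative sketch, not part of the assignment text) is a helper that returns the index of the nearest current center; each point can then be keyed by that index and aggregated with reduceByKey:

```python
import numpy as np

def closestCenter(p, centers):
    # Index of the center with the smallest squared Euclidean distance to point p.
    distances = [np.sum((np.array(p) - np.array(c)) ** 2) for c in centers]
    return int(np.argmin(distances))
```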

For example, use (currentcenter_x_1, currentcenter_y_1), (currentcenter_x_2, currentcenter_y_2), ...,
(currentcenter_x_5, currentcenter_y_5) to denote the current 5 center points of the 5 clusters, and use
(newcenter_x_1, newcenter_y_1), (newcenter_x_2, newcenter_y_2), ..., (newcenter_x_5, newcenter_y_5)
to denote the new center of each cluster. Then in step 4, calculate the total squared distance between
the current centers and the new centers:

tempDist = (currentcenter_x_1 - newcenter_x_1)^2 + (currentcenter_y_1 - newcenter_y_1)^2
         + (currentcenter_x_2 - newcenter_x_2)^2 + (currentcenter_y_2 - newcenter_y_2)^2
         + ...
         + (currentcenter_x_5 - newcenter_x_5)^2 + (currentcenter_y_5 - newcenter_y_5)^2
Update the centers of the clusters using the new center points.
If tempDist > convergeDist, iterate again starting from step 2; otherwise, terminate the loop.
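
Putting the four steps together, a driver loop might look like the sketch below. It assumes `points` is the cached numeric RDD and `closestCenter` is the helper from the earlier sketch; names such as `newCenters` are illustrative, not mandated by the assignment:

```python
import numpy as np

K = 5
convergeDist = 0.1

# Step 1: choose K random points as the starting centers.
centers = [np.array(c) for c in points.takeSample(False, K, seed=42)]

tempDist = float("inf")
while tempDist > convergeDist:
    # Step 2: key each point by its nearest center, carrying (point, 1) for averaging.
    assigned = points.map(lambda p: (closestCenter(p, centers), (np.array(p), 1)))

    # Step 3: sum coordinates and counts per cluster, then divide to get each mean.
    sums = assigned.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    newCenters = sums.mapValues(lambda s: s[0] / s[1]).collectAsMap()

    # Step 4: total squared distance between the current and new centers.
    tempDist = sum(np.sum((centers[i] - newCenters[i]) ** 2) for i in newCenters)

    # Update the centers of the clusters with the new center points.
    for i, c in newCenters.items():
        centers[i] = c

print("Final centers:", [tuple(c) for c in centers])
```

A groupByKey followed by averaging each group would also satisfy step 2, but reduceByKey combines partial sums on each partition before the shuffle, which avoids moving every point's coordinates across the cluster.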
