Assignment 9 Computing Proximities
School: Arizona State University
Course: 511
Subject: Computer Science
Date: Apr 3, 2024
Type: docx
Pages: 13
Assignment 9: Computing Proximities
Prasad Srinivas
IFT 511: Analyzing Big Data
Professor: Asmaa Elbadrawy
Tuesday and Thursday (12:00 PM – 1:15 PM)
October 8th, 2023
Similarities Between Binary Data Points
3. User 3 is more similar to User 1 than User 2 is, as shown by their Simple Matching Coefficients (SMCs). The SMC between User 1 and User 3 is 0.8, while the SMC between User 1 and User 2 is 0.4. Since a higher SMC means that more attributes agree, User 3 is the closer match to User 1 under this measure.
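The comparison above can be sketched in code. The SMC is the number of attributes on which two binary vectors agree (both 1 or both 0), divided by the total number of attributes. The user vectors are not included in this preview, so the three vectors below are hypothetical values chosen only so that the results match the reported 0.8 and 0.4. The function name is also an assumption made for illustration.

```python
def simple_matching_coefficient(a, b):
    """Return the fraction of positions where two binary vectors agree."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if not a:
        raise ValueError("vectors must not be empty")
    # Count both 1-1 and 0-0 agreements; SMC treats them equally.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


# Hypothetical data: NOT the assignment's actual vectors, which are
# not shown in the preview. Chosen to reproduce SMC values of 0.4 and 0.8.
user1 = [1, 0, 1, 1, 0]
user2 = [0, 1, 1, 0, 0]
user3 = [1, 0, 1, 1, 1]

smc_12 = simple_matching_coefficient(user1, user2)  # 2 of 5 attributes agree
smc_13 = simple_matching_coefficient(user1, user3)  # 4 of 5 attributes agree

print(f"SMC(User 1, User 2) = {smc_12}")
print(f"SMC(User 1, User 3) = {smc_13}")
```

Because the SMC counts 0-0 agreements as matches, it suits data where absence is as informative as presence. For sparse data where shared zeros carry little meaning, the Jaccard coefficient is a common alternative.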