ASS-1.docx
School
Illinois Institute Of Technology
Course
429
Subject
Computer Science
Date
Feb 20, 2024
Type
docx
Pages
11
Uploaded by BarristerBraveryFlamingo36
Information Retrieval Assignment-1
A20545920
Satya Jaidev
Exercise 1.1
Draw the inverted index that would be built for the following document collection.
Doc 1 new home sales top forecasts
Doc 2 home sales rise in july
Doc 3 increase in home sales in july
Doc 4 july new home sales rise
Term        Postings (docIDs)
forecasts   1
home        1, 2, 3, 4
in          2, 3
increase    3
july        2, 3, 4
new         1, 4
rise        2, 4
sales       1, 2, 3, 4
top         1
Exercise 1.2
Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
a. Draw the term-document incidence matrix for this document collection.
Term            Doc 1  Doc 2  Doc 3  Doc 4
approach          0      0      1      0
breakthrough      1      0      0      0
drug              1      1      0      0
for               1      0      1      1
hopes             0      0      0      1
new               0      1      1      1
of                0      0      1      0
patients          0      0      0      1
schizophrenia     1      1      1      1
treatment         0      0      1      0
b. Draw the inverted index representation for this collection.
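One way to obtain the inverted index is to read it off the incidence matrix: each term's postings list holds the docIDs whose matrix entry is 1. The sketch below assumes whitespace tokenization and no stemming; the variable names are illustrative only.

```python
# Document collection from Exercise 1.2 (docID -> text).
docs = {
    1: "breakthrough drug for schizophrenia",
    2: "new schizophrenia drug",
    3: "new approach for treatment of schizophrenia",
    4: "new hopes for schizophrenia patients",
}

doc_ids = sorted(docs)
tokens = {d: set(text.split()) for d, text in docs.items()}
vocabulary = sorted(set().union(*tokens.values()))

# Term-document incidence matrix: one 0/1 row per term, one column per document.
matrix = {t: [int(t in tokens[d]) for d in doc_ids] for t in vocabulary}

# Inverted index: keep only the docIDs whose matrix entry is 1.
inverted = {
    t: [d for d, bit in zip(doc_ids, row) if bit]
    for t, row in matrix.items()
}

for t in vocabulary:
    print(t.ljust(14), matrix[t], "->", inverted[t])
```

The matrix stores an entry for every term-document pair, while the inverted index stores only the non-zero entries, which is why the index is preferred for large, sparse collections.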
Exercise 1.3
For the document collection shown in Exercise 1.2, what are the returned results for these queries:
a. schizophrenia AND drug
Doc 1, Doc 2
b. for AND NOT(drug OR approach)
Doc 4
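The two results can be checked with set operations, treating each postings list as a set of docIDs. This is a sketch for verification only; a real system would merge sorted postings lists rather than build Python sets.

```python
# Postings for the relevant terms of the Exercise 1.2 collection.
index = {
    "approach": {3},
    "drug": {1, 2},
    "for": {1, 3, 4},
    "schizophrenia": {1, 2, 3, 4},
}
all_docs = {1, 2, 3, 4}

# a. schizophrenia AND drug
result_a = index["schizophrenia"] & index["drug"]

# b. for AND NOT (drug OR approach)
#    NOT is the complement with respect to the whole collection.
result_b = index["for"] & (all_docs - (index["drug"] | index["approach"]))

print(sorted(result_a))  # [1, 2]
print(sorted(result_b))  # [4]
```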
Exercise 1.7
Recommend a query processing order for d. (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes) given the following postings list sizes:
Term            Postings size
eyes            213312
kaleidoscope    87009
marmalade       107913
skies           271658
tangerine       46653
trees           316812
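The size of each OR group is bounded above by the sum of its two postings list sizes:
N(tangerine) + N(trees) = 46653 + 316812 = 363465
N(marmalade) + N(skies) = 107913 + 271658 = 379571
N(kaleidoscope) + N(eyes) = 87009 + 213312 = 300321
Processing the groups in increasing order of these estimates gives the order:
(kaleidoscope OR eyes) AND (tangerine OR trees) AND (marmalade OR skies)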
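The ordering heuristic can be sketched in a few lines: estimate each OR group by the sum of its postings sizes (a conservative upper bound), then sort the groups by that estimate. The helper name or_upper_bound is illustrative.

```python
# Postings list sizes given in the exercise.
postings_size = {
    "eyes": 213312,
    "kaleidoscope": 87009,
    "marmalade": 107913,
    "skies": 271658,
    "tangerine": 46653,
    "trees": 316812,
}

# The three OR groups of the conjunctive query.
groups = [
    ("tangerine", "trees"),
    ("marmalade", "skies"),
    ("kaleidoscope", "eyes"),
]

def or_upper_bound(group):
    """Upper bound on the size of an OR: the sum of the postings sizes."""
    return sum(postings_size[term] for term in group)

# Process the smallest estimated group first.
order = sorted(groups, key=or_upper_bound)
for group in order:
    print(" OR ".join(group), or_upper_bound(group))
```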
Exercise 1.8
If the query is:
friends AND romans AND (NOT countrymen)
how could we use the frequency of countrymen in evaluating the best query evaluation order? In particular, propose a way of
handling negation in determining the order of query processing.
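Fetch the postings list for each of the n terms and process the terms in order of increasing estimated result size, starting with the smallest set and shrinking the intermediate result at each step. For a plain term X, the estimate is the number of documents in which X occurs, df(X). For a negated term NOT X, the estimate is N - df(X), where N is the total number of documents in the collection. If countrymen is very frequent, N - df(countrymen) is small, so the negated term is highly selective and is worth applying early. If countrymen is rare, NOT countrymen matches almost every document, so it is best applied last: intersect friends and romans first, then remove from that result the documents that contain countrymen.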
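A minimal sketch of this ordering rule. The collection size and document frequencies below are hypothetical values chosen for illustration, and the function names are not from the exercise.

```python
def estimated_size(term, negated, doc_freq, n_docs):
    """Estimated result size: df(X) for X, and N - df(X) for NOT X."""
    df = doc_freq[term]
    return n_docs - df if negated else df

def processing_order(conjuncts, doc_freq, n_docs):
    """Sort (term, negated) pairs by increasing estimated size."""
    return sorted(
        conjuncts,
        key=lambda c: estimated_size(c[0], c[1], doc_freq, n_docs),
    )

# Hypothetical statistics, for illustration only.
n_docs = 1_000_000
doc_freq = {"friends": 120_000, "romans": 8_000, "countrymen": 3_000}

# friends AND romans AND (NOT countrymen)
query = [("friends", False), ("romans", False), ("countrymen", True)]

order = processing_order(query, doc_freq, n_docs)
for term, negated in order:
    label = ("NOT " if negated else "") + term
    print(label, estimated_size(term, negated, doc_freq, n_docs))
```

With these numbers, countrymen is rare, so the negated term has the largest estimate and is applied last. In practice the complement need not be materialized: x AND NOT y can be computed by a single merge that drops from x the docIDs also present in y.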
Exercise 2.1
Are the following statements true or false?
a. In a Boolean retrieval system, stemming never lowers precision.
False
b. In a Boolean retrieval system, stemming never lowers recall.
True
c. Stemming increases the size of the vocabulary.
False
d. Stemming should be invoked at indexing time but not while processing a query.
False
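The answers to (b), (c) and (d) can be illustrated with a toy example. The stemmer below is a deliberately crude stand-in (it only strips one trailing "s") and the four short documents are made up for this sketch; neither comes from the exercise.

```python
def crude_stem(token):
    """Toy stemmer, an assumption for illustration: strip one trailing 's'."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token

# Hypothetical mini-collection.
docs = {1: "home sale", 2: "home sales", 3: "sales forecasts", 4: "forecast"}

def build_index(collection, normalize):
    """Inverted index whose terms are passed through `normalize`."""
    index = {}
    for doc_id, text in collection.items():
        for token in text.split():
            index.setdefault(normalize(token), set()).add(doc_id)
    return index

raw_index = build_index(docs, lambda t: t)
stemmed_index = build_index(docs, crude_stem)

# (c) Stemming merges terms, so the vocabulary shrinks.
print(len(raw_index), len(stemmed_index))

# (b) With stemming on both sides, the result is a superset of the
#     unstemmed result, so recall cannot drop.
raw_hits = raw_index.get("sale", set())
stemmed_hits = stemmed_index.get(crude_stem("sale"), set())
print(sorted(raw_hits), sorted(stemmed_hits))

# (d) An unstemmed query term against a stemmed index finds nothing,
#     which is why the query must be stemmed as well.
unstemmed_query_hits = stemmed_index.get("sales", set())
print(sorted(unstemmed_query_hits))
```

The extra documents returned for the stemmed query are also why precision can fall, as in (a): some of them may not be relevant to a user who wanted the exact word form.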
Exercise 2.6
We have a two-word query. For one term the postings list consists of the following 16 entries: [4,6,10,12,14,16,18,20,22,32,47,81,120,122,157,180] and for the other it is the one-entry postings list: [47]. Work out how many