ECE101-Lab5

School

University of Illinois, Urbana Champaign

Course

101

Subject

Computer Science

Date

Dec 6, 2023

Lab 5: Search Engines
NetID: julesa2
Link to published notebook: https://www.wolframcloud.com/obj/julesa2/Published/ECE101/Fall2023/ECE101-Lab5.nb

In this lab, we will build a web crawler and create a web graph from the results. We will also calculate the PageRank of a node in a web graph.

Crawling the Web

In this part, we will crawl a small part of the World Wide Web and build a graph to represent the websites we crawled.

Let’s set a starting page:

In[90]:= startURL = "https://www.netflix.com/browse";

Step 1

Import all the hyperlinks from this page:

In[91]:= homeHyperlinks = Import[startURL, "Hyperlinks"]
Out[91]= {https://www.netflix.com//, https://www.netflix.com/LoginHelp, https://www.netflix.com//, https://policies.google.com/privacy, https://policies.google.com/terms, tel:1-844-505-2993, https://help.netflix.com/support/412, https://help.netflix.com, https://netflix.shop/, https://help.netflix.com/legal/termsofuse, https://help.netflix.com/legal/privacy, https://help.netflix.com/legal/corpinfo, https://www.netflix.com/dnsspi, https://netflix.com/adchoices-us}

Check how many hyperlinks are present on this page alone:

In[92]:= Length[homeHyperlinks]
Out[92]= 14

Step 2

Let’s pick a link that is not a phone number, an email address, or something on Google Maps:

In[93]:= linkedPages = DeleteCases[homeHyperlinks, s_ /; StringMatchQ[s, ___ ~~ "mailto" | "tel" | "maps" ~~ ___]]
Out[93]= {https://www.netflix.com//, https://www.netflix.com/LoginHelp, https://www.netflix.com//, https://policies.google.com/privacy, https://policies.google.com/terms, https://help.netflix.com/support/412, https://help.netflix.com, https://netflix.shop/, https://help.netflix.com/legal/termsofuse, https://help.netflix.com/legal/privacy, https://help.netflix.com/legal/corpinfo, https://www.netflix.com/dnsspi, https://netflix.com/adchoices-us}

Step 3

Let’s pick one link at random from this list:

In[94]:= selectedLink = RandomChoice[linkedPages]
Out[94]= https://help.netflix.com

Step 4

Repeat steps 1, 2, and 3 for the selected link:

In[95]:= newLink = RandomChoice[DeleteCases[Import[selectedLink, "Hyperlinks"], s_ /; StringMatchQ[s, ___ ~~ "mailto" | "tel" | "maps" ~~ ___]]]
Out[95]= https://help.netflix.com/en/node/365?ui_action=kb-article-popular-categories

Putting it together

We need to repeat steps 1, 2, and 3 for each link we end up with. Here is a function that performs steps 1, 2, and 3 when given any URL as input:

In[101]:= getLinkedURL[link_] := RandomChoice[DeleteCases[Import[link, "Hyperlinks"], s_ /; StringMatchQ[s, ___ ~~ "mailto" | "tel" | "maps" ~~ ___]]]
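
The pattern used in Step 2 and in getLinkedURL drops any link that contains "mailto", "tel", or "maps" anywhere in the URL, so an ordinary page whose address happens to contain "tel" (a hotel site, for instance) is dropped as well. One possible alternative, sketched here, is to keep only links that start with an http or https scheme. This is not the lab's method and it behaves differently (Google Maps links would be kept). The list sampleLinks and its example.com addresses are made up for illustration.

```wolfram
(* Hypothetical sample list; the example.com entries are invented for illustration *)
sampleLinks = {
   "https://www.netflix.com/LoginHelp",
   "tel:1-844-505-2993",
   "mailto:help@example.com",
   "https://hotel.example.com/rooms"
   };

(* Filter used in the lab: removes anything containing mailto, tel, or maps *)
labFilter = DeleteCases[sampleLinks,
   s_ /; StringMatchQ[s, ___ ~~ "mailto" | "tel" | "maps" ~~ ___]];

(* Alternative: keep only links that start with an http or https scheme *)
schemeFilter = Select[sampleLinks, StringStartsQ[#, "http://" | "https://"] &];
```

On this sample, labFilter keeps only the LoginHelp link, because "hotel" contains "tel", while schemeFilter keeps both web pages and still drops the phone number and the email address.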

In[99]:= listOfLinks = NestList[getLinkedURL[#] &, startURL, 10]
Out[99]= {https://www.netflix.com/browse, https://policies.google.com/privacy, https://policies.google.com/privacy#infodelete, https://policies.google.com/privacy#inforetaining, https://policies.google.com/privacy/google-partners, https://www.google.com/, http://www.google.com/history/optout?hl=en, https://accounts.google.com/ServiceLogin?passive=1209600&continue=https://www.google.com/history/optout?hl%3Den&followup=https://www.google.com/history/optout?hl%3Den&hl=en&ec=GAZAjQI, https://accounts.google.com/TOS?loc=US&hl=en, https://policies.google.com/privacy?gl=US&hl=en, https://myaccount.google.com/profile?utm_source=pp&hl=en}

Create a web graph from these links:

In[102]:= simpleG = Graph[MapThread[#1 -> #2 &, {Most[listOfLinks], Rest[listOfLinks]}], VertexLabels -> Placed["Name", Tooltip]]
Out[102]= (graph output not shown)

Putting it together: Advanced (Optional - Extra credit)

Instead of selecting just one hyperlink from each page, we can select several at once and grow the graph in each direction.

Set the number of links you’d like to follow from a page:

In[103]:= numLinks = 3
Out[103]= 3

Set the number of hops you’d like to go in the graph:

In[104]:= numHops = 3
Out[104]= 3

Note: The larger the values of numLinks and numHops, the longer the code will take to execute!
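
The warning about run time follows from how the crawl branches: each hop can multiply the number of pages by up to numLinks. The sketch below estimates upper bounds for the values chosen above and then checks the branching offline with a toy function in place of Import. The function toyLinks and its integer "pages" are assumptions made for illustration, and the bounds assume that no page is reached twice.

```wolfram
numLinks = 3;
numHops = 3;

(* Upper bound on pages in the graph: 1 + 3 + 9 + 27 *)
maxPages = Sum[numLinks^k, {k, 0, numHops}]   (* 40 *)

(* Upper bound on pages that get expanded: every level except the last hop *)
maxImports = Sum[numLinks^k, {k, 0, numHops - 1}]   (* 13 *)

(* Toy stand-in for a page's links: "page" n links to pages 3n+1, 3n+2, 3n+3 *)
toyLinks[n_Integer] := 3 n + Range[numLinks];

(* Same call shape as the crawl, but with no network access *)
toyG = NestGraph[toyLinks, 0, numHops];
VertexCount[toyG]
```

Comparing VertexCount[toyG] with maxPages shows how quickly the graph grows. In a real crawl every expanded page costs one Import call, which is where the time goes.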

In[105]:= g = NestGraph[Take[DeleteCases[Import[#, "Hyperlinks"], s_ /; StringMatchQ[s, ___ ~~ "mailto" | "tel" | "maps" ~~ ___]], numLinks] &, startURL, numHops, VertexLabels -> Placed["Name", Tooltip]]
Out[105]= (graph output not shown)

Problem 1

Change the startURL to a website of your choice and rerun the code above to create another graph. Copy and paste the graph into the answer cell below.

Answer

https://www.netflix.com/browse

Answer (Extra Credit)

Problem 2

Looking at the graph, which website do you think has (roughly) the highest PageRank? Why?

Answer

The Netflix login page has the highest PageRank because it takes you to different websites within the main
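
A guess for Problem 2 can be checked numerically. PageRank rewards pages that receive links, especially from other well-ranked pages, rather than pages that link out to many others. The sketch below uses the built-in PageRankCentrality on a small made-up graph. The vertex names A, B, and C and the damping factor 0.85 are assumptions, and the same calls could be applied to the crawled graph g.

```wolfram
(* Hypothetical three-page web: A and B both link to C, and C links back to A *)
toyWeb = Graph[{"A" -> "C", "B" -> "C", "C" -> "A"}];

(* PageRank of each vertex, in the same order as VertexList[toyWeb] *)
ranks = PageRankCentrality[toyWeb, 0.85];

(* Pair each page with its score and sort from highest to lowest *)
ranking = SortBy[Transpose[{VertexList[toyWeb], ranks}], -Last[#] &]
```

In this toy graph C is expected to rank first, since it is the only page with two incoming links. B, which nothing links to, is expected to rank last.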