cs cribsheet 1
Computer Science
Jul 3, 2024
Beautiful Soup
with open(file) as f:
    soup = BeautifulSoup(f, "html.parser")
soup.find("tag"): returns a tag object of the first instance
soup.find_all("tag"): returns a list of tag objects of all instances
soup.find_all("td", {"class": "J"}): list of tag objects with a specific tag and attribute value
tag methods:
tag.text: returns a string of the text displayed by a tag
tag["attrib"]: accesses the tag's attribute, e.g. <tag attribute="attribute value">
response = requests.get("https://someurl.com"): sends a request to the website, returns a response object
string.find("the"): returns the starting index of where "the" is found (-1 if it is not found)
dtype="int32" -> makes the elements integers
array: an n-dimensional, fixed-size object that holds homogeneous data types
np.array([1,2,3], dtype = None)
np.zeros((rows, columns)): np.zeros((2,3)) -> array([[0., 0., 0.], [0., 0., 0.]])
np.ones(shape): np.ones(3) -> array([1., 1., 1.])
np.full(shape, fill value): np.full((3,3), 8) -> array([[8, 8, 8], [8, 8, 8], [8, 8, 8]])
np.arange(start, stop (exclusive), step)
np.arange(4, 16, 3) -> array([4, 7, 10, 13])
np.linspace(start, stop, num=50, endpoint=True, dtype=None): stop is inclusive when endpoint is True
np.linspace(0, 10, 5) -> array([ 0. , 2.5, 5. , 7.5, 10. ])
np.random.random((rows, columns)): random floats in range [0, 1)
np.random.randint(low, high = None, size = None, dtype = int): gives one integer when size is None
Index/Slice
arr = np.array([1,2,3,4,5]) -> arr[-3] -> 3
arr = np.array([[1, 2, 3],[4, 5, 6]]) -> arr[1,2] -> 6
arr[:,2:]: all the rows, third column onward
Vector Operations: do not change the original array
arr = np.array([1, 2, 3]) -> arr * 2 -> array([2, 4, 6])
arr <= 2 -> array([True, True, False])
Masking: arr[arr % 2 == 0] returns an array of only the even numbers
Bitwise: & (and), | (or), ~ (not)
np.sum((arr % 2 == 0) & (arr < 13)): counts the elements that are even and less than 13
np.isnan(arr): gives T or F for each element
arr = np.array([-2, 1.5, np.nan, 2, -5], dtype = float)
arr[~np.isnan(arr)] -> array([-2., 1.5, 2., -5.])
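The masking rules above can be tried out with a short sketch. It reuses the array from the notes; the variable names (valid, clean, small_positive, num_negative) are made up for illustration.

```python
import numpy as np

arr = np.array([-2, 1.5, np.nan, 2, -5], dtype=float)

# Boolean mask: True wherever the value is NOT NaN
valid = ~np.isnan(arr)
clean = arr[valid]  # masking returns a new array; arr is unchanged

# Combine conditions with & and |; each condition needs its own parentheses
small_positive = clean[(clean > 0) & (clean < 2)]

# np.sum on a boolean mask counts the True values
num_negative = int(np.sum(clean < 0))

print(clean)
print(small_positive)
print(num_negative)
```

Using the Python keywords and/or instead of & and | on arrays raises an error, which is why the bitwise operators are listed.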
.dtype - returns the data type of the elements within an array
.ndim - returns the number of dimensions of an array
.size - returns the number of elements in an array
.shape - returns the shape in the order (rows, columns) of an array
.copy() - returns a copy of an array that can be assigned to another variable
.fill(value) - replaces all elements of an array with the specified value but does not return it
.reshape(rows, columns) - returns an array with the new shape but does not change the original
np.where(condition, arr if true, arr if false) - returns a new array
.resize(rows, columns) - changes the shape of an array in place but does not return the array
.sort(axis = 0) - sorts the array in place in ascending order
np.concatenate([arr1, arr2], axis = 0) - joins the arrays along the given axis
Aggregate Methods:
.max() .mean() .sum() .min()
np.savetxt() np.loadtxt()
DataFrame: 2D, size mutable
Series: s = pd.Series(data, index=index)
selecting data:
by label s.loc[], by index s.iloc[] or s[]
masking s[s<3]
changing data: use loc or iloc
append new values use s.loc[“x”] =4
sort: s.sort_values(ascending = True)
ascending: small to large, abc order
delete: s.drop("index") (returns a new Series)
Data Frame
df= pd.DataFrame(data, index=, columns=)
selecting one column: df[“column”] returns series
selecting rows: use .loc or .iloc
df.loc[“row”, “column”]
df.set_index("Course", inplace = True) -> Course becomes the index and no longer counts as a column
masking: df[df["avg GPA"] < 3]
Add/Replace Row Values
df.loc[row, col] = 850 -> changes the value to 850; if the index doesn't exist, it adds it
df.iloc[-1, :] = [100,3] last row, all col
add row df.loc[“isye20”]=[150,3.1]
Add/Replace Column
df.loc[:, "new"] = df["column"] >= 2 or df["new"] = df["column"] >= 2
adds a new column with T/F values
Sorting: df.sort_values(by = "col", ascending = False, inplace = False)
Removing: df.drop(["col"], axis = 1) drops columns
df.drop(["CS2316", "CS1331"], axis = 0) drops rows
df.drop_duplicates(subset = ["Course"]): drops rows with duplicate values in column Course
df.nunique(axis = 0): counts the number of distinct elements in each column
Reading/Writing
x = pd.read_csv("<path>file.csv", index_col=0)
x.to_csv("<path>fileout.csv", index=True)
Missing Data
df.loc[“CS2603”] = [50, np.nan, 0]
CS2603 50.0 NaN 0.0
df.dropna(): removes all rows that contain NaN
df.fillna(0): fills NaN with the given value
pd.isna(df): checks if a value is NaN, gives T or F
Aggregates .mean() .sum() .min() .max() .count()
add column by taking mean of each row:
df[“new col”] = df.mean(axis=1).round(2)
add row for mean of each column
df.loc[“new row”]= df.mean(axis=0)
str method
df[“col”].str.contains(“A”, na= False)
total= df.groupby(“country”)[“medals”].count()
counts the number of medals for each country; country is the index, medals is the only column
group by more than 1 column: df.groupby(["country", "gender"])["medal"].count()
.agg() = when applying more than 1 aggregate on more than 1 column after groupby()
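A small sketch of the groupby patterns above. The data is invented for illustration (it is not from any real medal table), and the "points" column is a hypothetical numeric column added so that .agg() has something to sum.

```python
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({
    "country": ["USA", "USA", "CAN", "CAN", "USA"],
    "gender": ["F", "M", "F", "F", "M"],
    "medal": ["gold", "silver", "gold", "bronze", "gold"],
    "points": [3, 2, 3, 1, 3],  # hypothetical numeric column
})

# One grouping column: the result is a Series indexed by country
total = df.groupby("country")["medal"].count()

# Two grouping columns: the result has a (country, gender) MultiIndex
by_both = df.groupby(["country", "gender"])["medal"].count()

# .agg() applies several aggregates to several columns at once
summary = df.groupby("country").agg({"points": ["sum", "mean"], "medal": "count"})

print(total)
print(by_both)
print(summary)
```

Because "points" gets a list of aggregates, the columns of summary are (column, aggregate) pairs, so a single value is read with summary.loc["USA", ("points", "sum")].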
pd.concat(dfs, axis = 0) , joins them vertically
horizontal axis = 1
line: series.plot(x=series.index, y=series.values)
series.plot(x=series.index, y=series.values, kind="bar")
kind = barh, hist, box
percent.plot(kind="pie")
Plotly: px.bar px.pie px.histogram px.box
fig= px.scatter(data, x = 'date', y = 'new_deaths', color = 'location') x= “column title”
fig = px.line(cases, x = 'date', y = 'new_cases', labels = {'date': 'Day', 'new_cases': 'Number of New Cases'}, title = 'North America')
*the matplotlib library is easiest to use
class = blueprint for creating objects
object = data structure created using a class as its blueprint
class attribute - NO self
instance attribute - self.attribute
define a class:
class Dog:
instance attributes:
class Dog:
    def __init__(self, name, age):
        self.name = name
        self.age = age
class attributes:
class Dog:
    numLegs = 4
def __eq__(self, other):
    # determines what makes two objects equal to each other
    return self.name == other.name and self.age == other.age
def __lt__(self, other):
    # used to define sorting, called when < is used
    return self.age < other.age
def __str__(self):
    # called when the object is printed or cast to a string
    return f"{self.name} is {self.age} years old"
def __repr__(self):
    # called by repr(), e.g. when the object is shown inside a list or in the interactive shell
    return f"{self.name}"
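Putting the pieces above into one class gives a sketch like the following. The dog names and ages are made up, and the __init__ signature (name, age) is an assumption chosen to match the attributes used by __eq__ and __lt__.

```python
class Dog:
    numLegs = 4  # class attribute: shared by every Dog, no self

    def __init__(self, name, age):
        # instance attributes: one copy per object
        self.name = name
        self.age = age

    def __eq__(self, other):
        return self.name == other.name and self.age == other.age

    def __lt__(self, other):
        return self.age < other.age

    def __str__(self):
        return f"{self.name} is {self.age} years old"

    def __repr__(self):
        return f"{self.name}"


dogs = [Dog("Rex", 5), Dog("Bo", 2)]
youngest_first = sorted(dogs)  # sorted() compares with __lt__

print(dogs[0])          # print uses __str__
print(youngest_first)   # a list shows its items with __repr__
print(Dog("Rex", 5) == dogs[0])  # == uses __eq__
```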
list1 = [1,2,3]
list2 = list1 # this does not copy the list, it simply copies the memory location
list2.append(999) # they both get 999 at the end
lista = [3,4,5]
listb = copy.copy(lista) # does not share memory, it's a newly constructed list object
lista.append(98)
print(listb) # [3, 4, 5]
nested_lista = [[1,2],[3,4],[5,6]]
nested_listb = copy.copy(nested_lista)
nested_lista.append([1,1,1]) # only list a changed
nested_lista = [[1,2],[3,4],[5,6]]
nested_listb = copy.copy(nested_lista)
nested_lista[0][0] = 9 # both changed because the nested lists are shared
nested_lista = [[1,2],[3,4],[5,6]]
nested_listb = copy.deepcopy(nested_lista)
nested_lista[0][0] = 9 # only a changes
*if you assign an identifier to an existing object, an alias is created
*copy.copy = copies references to the sublists
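The three behaviors above (alias, shallow copy, deep copy) can be compared side by side. This is a minimal sketch; the variable names are chosen for illustration.

```python
import copy

original = [[1, 2], [3, 4]]
alias = original                 # same object, just a second name
shallow = copy.copy(original)    # new outer list, same inner lists
deep = copy.deepcopy(original)   # new outer list and new inner lists

original.append([5, 6])  # outer change: only the alias sees it
original[0][0] = 9       # inner change: alias and shallow copy see it

print(alias)    # [[9, 2], [3, 4], [5, 6]]
print(shallow)  # [[9, 2], [3, 4]]
print(deep)     # [[1, 2], [3, 4]]
```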
Inplace: inplace = True changes the original DataFrame
[:-1] everything but last column
Command line
mkdir: creates a directory
cd: changes the current directory
pwd: full path of current folder
ls: lists content in current directory
cat: displays content of file
if __name__ == "__main__": -> the code under it executes only when you run the file as a script (e.g. from the command line), not when it is imported
Lists: mutable, can iterate through
Method: Description
append(): Adds an element at the end of the list
extend(): Adds the elements of a list (or any iterable) to the end of the current list
index(): Returns the index of the first element with the specified value
count(): Returns the number of elements with the specified value
remove(): Removes the first item with the specified value
sorted() -> returns a new list of sorted values
sorted(alist, key=lambda x:x[1]) -> sorts by the element at index 1 of each item
.sort() -> mutates original list, returns None
alist.sort(reverse= True) -> do not assign the result to anything, it is None
list.append(4) -> adds 4 to end of list
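The difference between sorted() and .sort() is easy to check with a short sketch; the list of pairs is made up for illustration.

```python
pairs = [("b", 3), ("a", 1), ("c", 2)]

# sorted() returns a new list; the key picks index 1 of each tuple
by_second = sorted(pairs, key=lambda x: x[1])

# .sort() changes the list in place and returns None
result = pairs.sort(reverse=True)

print(by_second)  # [('a', 1), ('c', 2), ('b', 3)]
print(pairs)      # [('c', 2), ('b', 3), ('a', 1)]
print(result)     # None
```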
Tuples: immutable and cannot be sorted in place; can iterate, index, slice
tup = (1,2,3)
count(): Returns the number of times a specified value occurs in a tuple
index(): Searches the tuple for a specified value and returns the position of where it was found
Strings: immutable and iterable
.lower() .upper() .isdigit() .split() .replace()
string.split() -> makes string into a list by splitting at the spaces, returns a list
string.join(): joins the items of an iterable with the string as separator, returns a string
" ".join(alist) -> must have a string before the .
f-strings: print(f"{2} plus {3} equals {2 + 3}")
dictionary dict= {90:”a”, 80:”b”} key:value
keys: for key in mydict.keys()
value: for val in mydict.values()
both: for key,val in mydict.items()
access value: dictionary[“key”]
updating: dictionary["key"] = value -> if it already exists it gets updated
delete: del dict['key']
sets: no indexing/slicing, set = {1, 2, "3"}
can add/remove, takes out duplicates
Fundamentals
range(start, stop, step) stop is exclusive
indexing-> list[start:stop:step]
enumerate(): creates a tuple (index, value), returns an enumerate object
for index, val in enumerate(["anna", "emily"])
list(enumerate(["anna", "emily"])) -> [(0, 'anna'), (1, 'emily')]
zip(): pairs items of each iterable, creates tuples, returns a zip object
new = list(zip(list1, list2)) -> [(1, 'one'), (2, 'two')]
lambda: add = lambda a: a + 10
returned value is defined after the colon
conditional: print("even" if num % 2 == 0 else "odd")
list comprehension: list = [expression for item in iterable if condition]
list = [i**2 for i in range(10)] -> the i**2 values are in the list
dictionary comp: {key:val for item in iterable if condition}
{i:i**2 for i in range(4)} -> {0:0, 1:1, 2:4, 3:9}
{lis[i]:lis2[i] for i in range(len(lis))}
{key:val for key,val in zip([1,2,3], "abc")}
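The fundamentals above fit into one runnable sketch. The names and scores are made up for illustration.

```python
names = ["anna", "emily"]
scores = [90, 80]

# enumerate() and zip() are lazy; wrap them in list() to see the tuples
indexed = list(enumerate(names))
paired = list(zip(names, scores))

# lambda: the expression after the colon is the return value
add_ten = lambda a: a + 10

# list comprehension with a condition
even_squares = [i**2 for i in range(10) if i % 2 == 0]

# dictionary comprehension built from zip()
score_by_name = {name: score for name, score in zip(names, scores)}

# conditional expression
label = "even" if scores[0] % 2 == 0 else "odd"

print(indexed, paired, add_ten(5), even_squares, score_by_name, label)
```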
File I/O -> list of strings
open file, readlines to create a list of all lines, strip newline char with .strip(), split on delimiter with .split(",")
with open("file.txt", "r") as f:
    text = f.read()
f.read(): one long string; f.read(4): reads 4 characters
f.readline(): string, one line at a time
f.readlines(): list of every line as a string
seek(): moves cursor through the file, fileObject.seek(offset)
writing: open file, write header, loop through data writing each row as a string with a newline char at the end
with open("text.file", "w") as out:
    out.write("one\ntwo\n")
CSV file -> list of lists
each list represents a row of data
with open("names.csv", "r") as f:
    reader = csv.reader(f)
    data = list(reader) # list of lists
data[1:] eliminates the header line
with open("files.csv", "w") as fout:
    writer = csv.writer(fout) # creates writer object
    writer.writerow(row) # writes one new row
    writer.writerows(data) # writes all rows to the file
with open("csvFileName.csv", "r") as fin:
    dictReader = csv.DictReader(fin)
    listOfDicts = [dict(line) for line in dictReader]
with open("csvFile.csv", "w") as fout:
    dw = csv.DictWriter(fout, fieldnames = ['key1', 'key2', ...])
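A round trip through the csv module might look like the sketch below. The rows, field names, and temporary file are assumptions made for illustration, and writeheader() and writerows() are the DictWriter calls used to write the header line and the data.

```python
import csv
import os
import tempfile

# Made-up rows; csv reads every value back as a string
rows = [{"name": "anna", "score": "90"}, {"name": "emily", "score": "80"}]
path = os.path.join(tempfile.mkdtemp(), "scores.csv")

# Write: header line first, then one line per dict
with open(path, "w", newline="") as fout:
    dw = csv.DictWriter(fout, fieldnames=["name", "score"])
    dw.writeheader()
    dw.writerows(rows)

# Read back as a list of dicts (header becomes the keys)
with open(path, "r", newline="") as fin:
    list_of_dicts = [dict(line) for line in csv.DictReader(fin)]

# Read back as a list of lists (header is the first row)
with open(path, "r", newline="") as fin:
    data = list(csv.reader(fin))
body = data[1:]  # drop the header line

print(list_of_dicts)
print(body)
```

Opening the file with newline="" is what the csv documentation recommends so that line endings are not doubled on some platforms.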
JSON: web service responses
double quotes for strings, true/false for Boolean, null instead of None, dict keys must be string type
load: JSON to python
loads(): parses a string of JSON and turns it into a python dictionary
load(): parses a JSON file into a python dictionary
with open("file.json", "r") as f:
    data = json.load(f)
dump: python to JSON
dumps(): takes a python dict and returns a JSON string
dump(): takes a python dict and dumps it into a JSON file
with open("fileout.json", "w") as f:
    json.dump(output_dict, f) # f = what you dump to
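The type conversions listed above can be seen with loads() and dumps() on strings, without touching a file. The JSON text here is made up for illustration.

```python
import json

text = '{"name": "anna", "passed": true, "grade": null, "scores": [90, 80]}'

# loads(): JSON string -> python dict (true -> True, null -> None)
record = json.loads(text)

# dumps(): python dict -> JSON string (True -> true, None -> null,
# and the integer key 1 is converted to the string "1")
back = json.dumps({"ok": True, "missing": None, 1: "int key"})

print(record)
print(back)
```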
XML: formatted as element trees
HTML: for data display
starts with <!doctype html>, begins with <html> and ends with </html>
visible part is between <body> and </body>
headings are defined with the <h1> to <h6> tags
links with <a> tag
<ul> unordered list, <ol> ordered list, <li> list item
table defined with <table>
<tr> row, <th> table header, <td> data cell
image: <img src = "xx.jpg">
API: requests module
import requests: imports the requests module
response = requests.get("https://someurl.com"): sends a request to the website, returns a response object
response.status_code: the status code of the request (an attribute), gives an integer
response.text: the text that was retrieved by the get request (an attribute)
response.json(): returns the text that was retrieved converted into python (only works if the text was stored in the json format)
print(response) = status code only
print(response.text[:500]) = first 500 characters
response = requests.post(...) -> sends info to a website
200: successful request, 404: url not found, 500: internal error, 401/403: unauthorized
Escape Sequences: not printable characters
\n = newline, \t = tab, \\ = backslash
To make non greedy, put ? after the + or *
[A-Z][a-z]*: capital letter followed by zero or more lowercase letters
print(re.findall(".+C", text)): matches one or more characters followed by a capital letter C
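Greedy versus non-greedy matching is easiest to see on a string with two possible end points. This is a small sketch with made-up input text.

```python
import re

text = "<b>one</b><b>two</b>"

# Greedy: .+ grabs as much as possible, so there is one long match
greedy = re.findall(r"<b>.+</b>", text)

# Non-greedy: .+? stops at the first place the rest of the pattern fits
lazy = re.findall(r"<b>.+?</b>", text)

# Capital letter followed by zero or more lowercase letters
caps = re.findall(r"[A-Z][a-z]*", "Hello World ABC")

print(greedy)  # ['<b>one</b><b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
print(caps)    # ['Hello', 'World', 'A', 'B', 'C']
```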
Meta character: What it does
. : Matches any character (except a newline)
\ : Escape special/meta characters
| : Or operator
^ : Match at beginning of string/line. Represents "not" in a character class
$ : Match at end of string/line
* : Match 0 or more of the preceding regex
+ : Match 1 or more of the preceding regex
? : Match 0 or 1 of the preceding regex
{} : Bounded repetition
[] : Create a character class
() : Capture group within the matched substring
Character class: What it matches
[a-z] : Any lowercase letter
[A-Z] : Any uppercase letter
[a-zA-Z] : Any letter
[^A-Z] : Anything except uppercase letters
Predefined character class: What it matches
\d : Any digit, equivalent to [0-9]
\D : Any non-digit, equivalent to [^0-9]
\s : Any whitespace char, equal to [ \t\n\r\f\v]
\S : Any non-whitespace char, equal to [^ \t\n\r\f\v]
\w : Any alphanumeric char, equal to [a-zA-Z0-9_]
\W : Any non alphanumeric char, equal to [^a-zA-Z0-9_]
re.match('regEx', a_string): Checks if the beginning of a_string matches the pattern. If it does, returns a Match object. Otherwise, it returns None
re.search('regEx', a_string): Checks if any part of a_string matches the pattern. Returns a Match object corresponding to the first matching part
re.findall('regEx', a_string): Checks a_string for all non-overlapping matches to the regex supplied and returns a list of the strings that match
re.sub('regEx', new_string, a_string): Checks a_string for all matches to the regex and returns a string with each match replaced by new_string
Match object method: What it does
match_object.start(): Returns the index of the start of the string that matched
match_object.end(): Returns the index after the end of the string that matched
match_object.group(): Returns the string that matched
match_object.span(): Returns a tuple of the starting and ending indices of the string matched by the regex
*Ending index is exclusive
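The four re functions and the Match object methods can be compared on one string. This is a minimal sketch; the input text is made up for illustration.

```python
import re

text = "cat 12, dog 345"

# match() only looks at the beginning of the string, so this is None
at_start = re.match(r"\d+", text)

# search() finds the first match anywhere in the string
m = re.search(r"\d+", text)
first = m.group()      # the matched string
where = m.span()       # (start, end), end is exclusive

# findall() returns every non-overlapping match as a list of strings
all_nums = re.findall(r"\d+", text)

# sub() returns a new string with every match replaced
masked = re.sub(r"\d+", "#", text)

print(at_start, first, where, all_nums, masked)
```

Since the ending index is exclusive, text[m.start():m.end()] gives back exactly the matched string.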
SQL: structured query language
Schema - a collection of related tables and constructs
