Lab 10: List comprehension

Ling 1330/2330: Intro to Computational Linguistics
Na-Rae Han
Objectives
• List comprehension
• Filtering
• Transforming
2/7/2017
Filtering a list, the old way
>>> mary = 'Mary had a little lamb, whose fleece was white as snow.'.split()
>>> mary
['Mary', 'had', 'a', 'little', 'lamb,', 'whose', 'fleece', 'was',
'white', 'as', 'snow.']
• How to make a list of words that have 'a'?
>>> alist = []
>>> for w in mary:
...     if 'a' in w:
...         alist.append(w)
...
>>> alist
['Mary', 'had', 'a', 'lamb,', 'was', 'as']
You need to make a new empty list, and then iterate through mary to find items to put in it.
Filtering with list comprehension
>>> mary = 'Mary had a little lamb, whose fleece was white as snow.'.split()
>>> mary
['Mary', 'had', 'a', 'little', 'lamb,', 'whose', 'fleece', 'was',
'white', 'as', 'snow.']
• How to make a list of words that have 'a'?
>>> [w for w in mary if 'a' in w]
['Mary', 'had', 'a', 'lamb,', 'was', 'as']
The power of LIST COMPREHENSION
• Creating a new list where elements meet a certain condition:
[x for x in list if ... ]
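A minimal sketch of the pattern above, assuming the same mary list used on the earlier slides. The loop version and the list comprehension version should build the same list:

```python
# Build the word list used throughout these slides.
mary = 'Mary had a little lamb, whose fleece was white as snow.'.split()

# The old way: start with an empty list and append matching items.
alist = []
for w in mary:
    if 'a' in w:
        alist.append(w)

# List comprehension: the same filter in a single expression.
blist = [w for w in mary if 'a' in w]

print(blist)
print(alist == blist)
```

The condition after "if" can be any expression that evaluates to True or False for each item.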
Try it out (2 minutes)
>>> mary = 'Mary had a little lamb, whose fleece was white as snow.'.split()
>>> mary
['Mary', 'had', 'a', 'little', 'lamb,', 'whose', 'fleece', 'was',
'white', 'as', 'snow.']
• Syntax: [x for x in list if ... ]
Words that have 'a':
>>> [w for w in mary if 'a' in w]
['Mary', 'had', 'a', 'lamb,', 'was', 'as']
Words that are 5 chars or longer (use len()):
>>> [w for w in mary if len(w) >= 5]
['little', 'lamb,', 'whose', 'fleece', 'white',
'snow.']
Words that are 5 chars or longer and without symbols (use .isalnum()):
>>> [w for w in mary if len(w) >= 5 and w.isalnum()]
['little', 'whose', 'fleece', 'white']
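Why do 'lamb,' and 'snow.' drop out of the last result? A small sketch, using the same mary list, that pairs each long word with its .isalnum() value:

```python
mary = 'Mary had a little lamb, whose fleece was white as snow.'.split()

# .isalnum() is True only when every character is a letter or a digit,
# so words with attached punctuation fail the test.
checks = [(w, w.isalnum()) for w in mary if len(w) >= 5]

for word, ok in checks:
    print(word, ok)
```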
A list of English words (2 minutes)
In the Python shell, load up the ENABLE word list we used in the last class.
• If you saved a pickle file 'words.p', unpickle it.
• If you don't have a pickled list, build it from scratch.
  • Download the ENABLE word list, posted on Norvig's site:
    http://norvig.com/ngrams/
  • Open the file and make a list object:
>>> f = open('enable1.txt')
>>> txt = f.read()
>>> f.close()
>>> wlist = txt.split()
>>> print(wlist[:100])
['aa', 'aah', 'aahed', 'aahing', 'aahs', …
'abaka', 'abakas', 'abalone', 'abalones', …
enable1.txt (excerpt):
…
abaka
abakas
abalone
abalones
…
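A rough sketch of the build-then-pickle workflow, written with "with" blocks so the files get closed automatically. To keep it runnable on its own, it writes a tiny stand-in file (sample_words.txt, a hypothetical name) instead of reading the real enable1.txt:

```python
import os
import pickle
import tempfile

# Stand-in for enable1.txt: a tiny file with one word per line (hypothetical sample).
tmpdir = tempfile.mkdtemp()
txt_path = os.path.join(tmpdir, 'sample_words.txt')
with open(txt_path, 'w') as f:
    f.write('aa\naah\naahed\naahing\naahs\n')

# Read the file and split on whitespace; "with" closes the file for us.
with open(txt_path) as f:
    wlist = f.read().split()

# Pickle the list so it can be reloaded later without re-reading the text file.
pickle_path = os.path.join(tmpdir, 'words.p')
with open(pickle_path, 'wb') as pf:
    pickle.dump(wlist, pf)

# Unpickle it again, as in the 'words.p' case above.
with open(pickle_path, 'rb') as pf:
    reloaded = pickle.load(pf)

print(reloaded)
```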
ENABLE word list: what's in
>>> import pickle
>>> f = open('words.p', 'rb')
>>> wlist = pickle.load(f)
>>> f.close()
>>> len(wlist)
172820
>>> wlist[:10]
['aa', 'aah', 'aahed', 'aahing', 'aahs', 'aal', 'aalii',
'aaliis', 'aals', 'aardvark']
>>> wlist[-10:]
['zymology', 'zymosan', 'zymosans', 'zymoses', 'zymosis',
'zymotic', 'zymurgies', 'zymurgy', 'zyzzyva', 'zyzzyvas']
>>> 'platypus' in wlist
True
>>> 'syntactician' in wlist
False
>>> 'a' in wlist
False
WHAA?
Try it out (2 minutes)
• Syntax: [x for x in list if ... ]
Words that have 'wkw':
>>> [x for x in wlist if 'wkw' in x]
['awkward', 'awkwarder', 'awkwardest', 'awkwardly',
'awkwardness', 'awkwardnesses', 'hawkweed', 'hawkweeds']
Words that are 25+ chars:
>>> [x for x in wlist if len(x) >= 25]
['electroencephalographically',
'ethylenediaminetetraacetate',
'ethylenediaminetetraacetates',
'immunoelectrophoretically', 'phosphatidylethanolamines']
Words that are 15+ chars and start with 'x':
>>> [x for x in wlist if len(x) >= 15 and x.startswith('x')]
['xerographically', 'xeroradiographies', 'xeroradiography']
Too easy for you? Get creative! Show us what you could find.
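Two "get creative" filters, sketched on a hypothetical mini word list (sample is made up for illustration and stands in for wlist):

```python
# Hypothetical mini word list standing in for wlist.
sample = ['awkward', 'level', 'linguist', 'noon', 'rhythms', 'xerox', 'zyzzyva']

# Palindromes: words that read the same backwards.
palindromes = [w for w in sample if w == w[::-1]]

# Words with a doubled letter (the same letter twice in a row).
doubled = [w for w in sample if any(a == b for a, b in zip(w, w[1:]))]

print(palindromes)
print(doubled)
```

Swapping sample for wlist should run the same searches over the full ENABLE list.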
Try it out (2 minutes)
• Syntax: [x for x in list if ... ]
Words starting with 'lingui':
>>> [w for w in wlist if w.startswith('lingui')]
['linguine', 'linguines', 'linguini', 'linguinis',
'linguist', 'linguistic', 'linguistical',
'linguistically', 'linguistician', 'linguisticians',
'linguistics', 'linguists']
Words that are 7+ characters and do not have a 'vowel':
>>> [w for w in wlist if len(w) >= 7 and 'a' not in w
...  and 'e' not in w and 'i' not in w and 'o' not in w
...  and 'u' not in w]
['glycyls', 'rhythms', 'tsktsks']
Anagrams of 'cried':
>>> [w for w in wlist if sorted(w) == sorted('cried')]
['cider', 'cried', 'dicer', 'riced']
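The chain of five "not in" tests can get long. One possible shortcut, shown on a hypothetical mini list (sample), uses all() inside the condition. This is a sketch of an alternative, not the form used in the slides:

```python
# Hypothetical mini word list standing in for wlist.
sample = ['glycyls', 'rhythms', 'tsktsks', 'linguist', 'cried', 'cider', 'sky']

# No-vowel filter: all() replaces the five chained "not in" tests.
no_vowels = [w for w in sample
             if len(w) >= 7 and all(v not in w for v in 'aeiou')]

# Anagram filter: two words are anagrams when their sorted letters match.
anagrams = [w for w in sample if sorted(w) == sorted('cried')]

print(no_vowels)
print(anagrams)
```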
Try it out (2 minutes)
• Syntax: [x for x in list if ... ]
Words that start with 'un' and end with 'ed':
>>> [w for w in wlist if w.startswith('un') and w.endswith('ed')]
Think before you press ENTER!
>>> foo = [w for w in wlist if w.startswith('un') and w.endswith('ed')]
>>> len(foo)
1076
>>> foo[:10]
['unabashed', 'unabated', 'unabraded', 'unabridged',
'unabsorbed', 'unabused', 'unaccented', 'unaccepted',
'unacclimated', 'unacclimatized']
1076 items. This is not a small list.
Careful with list comprehension
How many are 8 chars or longer?
>>> [w for w in wlist if len(w) >= 8]
This is going to return a long list!
Unless you're reasonably sure your list is short, assign the list to a new variable first…
>>> foo = [w for w in wlist if len(w) >= 8]
>>> len(foo)
120872
… and then look at snippets using slice indexing
>>> foo[:10]
['aardvark', 'aardvarks', 'aardwolf', 'aardwolves',
'aasvogel', 'aasvogels', 'abacterial', 'abacuses',
'abalones', 'abampere']
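The assign-first habit, sketched with a stand-in list of numbers (numbers is hypothetical and only plays the role of a large wlist):

```python
# Stand-in for a large list: the numbers 0..99999.
numbers = list(range(100000))

# Assign the result instead of letting the shell print all of it.
foo = [n for n in numbers if n % 3 == 0]

print(len(foo))    # check the size first
print(foo[:10])    # then peek at the beginning
print(foo[-3:])    # ... and at the end
```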
Transforming items in list, the old way
>>> mary
['Mary', 'had', 'a', 'little', 'lamb,', 'whose', 'fleece', 'was',
'white', 'as', 'snow.']
• How to make a new list with uppercase words?
>>> mary.upper()
. . .
AttributeError: 'list' object has no attribute 'upper'
Cannot uppercase a list.
>>> mup = []
>>> for w in mary:
...     mup.append(w.upper())
...
>>> mup
['MARY', 'HAD', 'A', 'LITTLE', 'LAMB,', 'WHOSE',
'FLEECE', 'WAS', 'WHITE', 'AS', 'SNOW.']
You have to create a new empty list and then put the uppercased words in it.
Transforming items in list
>>> mary
['Mary', 'had', 'a', 'little', 'lamb,', 'whose', 'fleece', 'was',
'white', 'as', 'snow.']
• Uppercased list, using list comprehension
>>> [w.upper() for w in mary]
['MARY', 'HAD', 'A', 'LITTLE', 'LAMB,', 'WHOSE', 'FLEECE', 'WAS',
'WHITE', 'AS', 'SNOW.']
• Creating a new list where each element is transformed:
[f(x) for x in list]
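A minimal sketch of the transforming pattern, assuming the same mary list. Here f(x) stands for any expression built from x; the first_two example is an extra illustration not taken from the slides:

```python
mary = 'Mary had a little lamb, whose fleece was white as snow.'.split()

# The old way: empty list plus append inside a loop.
mup = []
for w in mary:
    mup.append(w.upper())

# List comprehension: the transformation goes in front of "for".
mup2 = [w.upper() for w in mary]

# f(x) can be any expression, e.g. a slice followed by a method call.
first_two = [w[:2].lower() for w in mary]

print(mup == mup2)
print(first_two)
```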
Try it out (2 minutes)
>>> mary
['Mary', 'had', 'a', 'little', 'lamb,', 'whose', 'fleece', 'was',
'white', 'as', 'snow.']
• Syntax: [f(x) for x in list]
>>> [w.upper() for w in mary]
['MARY', 'HAD', 'A', 'LITTLE', 'LAMB,', 'WHOSE',
'FLEECE', 'WAS', 'WHITE', 'AS', 'SNOW.']
List of first characters:
>>> [w[0] for w in mary]
['M', 'h', 'a', 'l', 'l', 'w', 'f', 'w', 'w',
'a', 's']
List of word lengths:
>>> [len(w) for w in mary]
[4, 3, 1, 6, 5, 5, 6, 3, 5, 2, 5]
List of True/False for having 'a' as substring:
>>> ['a' in w for w in mary]
[True, True, True, False, True, False, False,
True, False, True, False]
Try it out (2 minutes)
>>> mary
['Mary', 'had', 'a', 'little', 'lamb,', 'whose', 'fleece', 'was',
'white', 'as', 'snow.']
Words that are 6 chars or longer, in upper case:
>>> [w.upper() for w in mary if len(w) >= 6]
['LITTLE', 'FLEECE']
Calculate the average word length … in one line! (Use sum() and len().)
>>> [len(w) for w in mary]
[4, 3, 1, 6, 5, 5, 6, 3, 5, 2, 5]
>>> sum([len(w) for w in mary])
45
>>> sum([len(w) for w in mary]) / len(mary)
4.090909090909091
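A small variation on the one-liner, assuming Python 3 (where / is true division). sum() also accepts the comprehension without the square brackets, as a generator expression, so no intermediate list is built:

```python
mary = 'Mary had a little lamb, whose fleece was white as snow.'.split()

# Average word length: total characters divided by number of words.
avg = sum(len(w) for w in mary) / len(mary)

# Filtering and averaging can be combined: average length of the long words only.
long_words = [w for w in mary if len(w) >= 6]
avg_long = sum(len(w) for w in long_words) / len(long_words)

print(avg)
print(avg_long)
```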
Back to English words (2 minutes)
• Syntax: [f(x) for x in list]
"Most words are 9 characters or longer."
• True or False?
>>> TorF = [len(x) >= 9 for x in wlist]
>>> TorF[:20]
[False, False, False, False, False, False, False,
False, False, False, True, False, True, False,
False, False, False, False, True, False]
>>> TorF.count(True)
92452
>>> TorF.count(False)
80368
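TorF is a list of True/False values, one per word x, telling whether x is at least 9 characters long.
There are more True values (92452) than False values (80368), so the statement is True.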
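The counting step can be sketched on a hypothetical mini list (sample stands in for wlist). Since True counts as 1 and False as 0 in Python, sum() gives the same number as .count(True):

```python
# Hypothetical mini word list standing in for wlist.
sample = ['aa', 'aardvark', 'abalone', 'awkwardness', 'linguistics', 'zymurgy']

# One True/False per word: is it at least 9 characters long?
TorF = [len(x) >= 9 for x in sample]

n_true = TorF.count(True)
n_true_alt = sum(TorF)            # True counts as 1, False as 0
share = n_true / len(sample)      # proportion of long words

print(TorF)
print(n_true, n_true_alt, share)
```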
Filtering + transforming (2 minutes)
• Syntax: [f(x) for x in list if ...]
filter…
>>> [x for x in wlist if len(x) >= 23]
['carboxymethylcelluloses', 'deinstitutionalizations',
'dichlorodifluoromethane', 'dichlorodifluoromethanes',
… 'reinstitutionalizations']
tuplify…
>>> [(len(x), x) for x in wlist if len(x) >= 23]
[(23, 'carboxymethylcelluloses'), (23, 'deinstitutionalizations'), (23,
'dichlorodifluoromethane'), (24, 'dichlorodifluoromethanes'),
… (23, 'reinstitutionalizations')]
and sort!
>>> sorted([(len(x), x) for x in wlist if len(x) >= 23], reverse=True)
[(28, 'ethylenediaminetetraacetates'), (27, 'ethylenediaminetetraacetate'),
(27, 'electroencephalographically'), (25, 'phosphatidylethanolamines'),
… (23, 'carboxymethylcelluloses')]
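The filter, tuplify, and sort steps, sketched on a hypothetical mini list (sample stands in for wlist, and the cutoff of 11 is chosen only to suit this small sample). Tuples are compared by their first element, then by the second, which is why putting the length first sorts by length:

```python
# Hypothetical mini word list standing in for wlist.
sample = ['aardvark', 'abalone', 'awkwardness', 'linguistics',
          'xeroradiography', 'zymurgy']

# Filter (11+ chars) and transform (word -> (length, word)) in one step.
pairs = [(len(x), x) for x in sample if len(x) >= 11]

# Sort longest first; ties on length are broken by the word itself.
ranked = sorted(pairs, reverse=True)

print(pairs)
print(ranked)
```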
so-initial bigrams made easy
• Many of our past tasks can be accomplished through list comprehension. so-initial bigrams from last homework:
>>> import pickle
>>> pf = open('bigramf-austen.p', 'rb')
>>> bigramf = pickle.load(pf)
>>> pf.close()
>>> bigramf[('so', 'much')]
207
>>> bigramf[('so', 'will')]
1
>>> sograms = [x for x in bigramf if x[0] == 'so']
>>> sograms[:10]
[('so', 'to'), ('so', 'indeed'), ('so', 'affectionate'), ('so',
'prudent'), ('so', 'nervous'), ('so', 'mistake'), ('so',
'sought'), ('so', 'cheap'), ('so', 'i'), ('so', 'friendly')]
With list comprehension: so much simpler!
Also:
[(w1, w2) for (w1, w2) in bigramf if w1 == 'so']
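A sketch of how the so-initial bigrams could then be ranked by frequency. The plain dict below is a hypothetical stand-in for the pickled bigramf object; only the counts for ('so', 'much') and ('so', 'will') come from the slides, the other entries are made up:

```python
# Hypothetical mini frequency table standing in for bigramf.
# Only the two 'so' counts (207 and 1) are taken from the slides.
bigramf = {('so', 'much'): 207, ('so', 'will'): 1,
           ('and', 'so'): 5, ('it', 'was'): 12}

# Iterating over a dict yields its keys, here the (w1, w2) bigram tuples.
sograms = [(w1, w2) for (w1, w2) in bigramf if w1 == 'so']

# Filter + transform + sort: pair each bigram with its count, largest first.
so_counts = sorted([(bigramf[b], b) for b in sograms], reverse=True)

print(sograms)
print(so_counts)
```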
List comprehension: summary
• Syntax: [f(x) for x in list if ...]
>>> mary = 'Mary had a little lamb'.split()
>>> mary
['Mary', 'had', 'a', 'little', 'lamb']
Same as mary:
>>> [w for w in mary]
['Mary', 'had', 'a', 'little', 'lamb']
Filter in only those elements that meet a condition:
>>> [w for w in mary if len(w) > 3]
['Mary', 'little', 'lamb']
>>> [w for w in mary if 'a' in w]
['Mary', 'had', 'a', 'lamb']
Transform each element in list:
>>> [w.upper() for w in mary]
['MARY', 'HAD', 'A', 'LITTLE', 'LAMB']
>>> [len(w) for w in mary]
[4, 3, 1, 6, 4]
Wrapping up
• Next class:
  • How to process corpora
  • How to process external resources as formatted data files
  • Download count_1w.txt and words.js from Norvig's site
• Exercise 6
  • List comprehension practice
• Midterm exam
  • 2/21 (Tuesday)
  • At LMC's PC lab (CL G17) → More room!