>> PARLANTE: Back to the PyQuick Python class.
So, the–so, this morning I'm going to talk about regular expressions a little bit, in
particular, regular expressions in Python. You may or may not know regular expressions
beforehand. That's okay. And I'm not going to show you all of regular expressions. I
will show you, like, just enough for us to get some useful stuff done. But regular expressions
are a very powerful combination with Python. There's a nice integration and so on. I will
show you that. Also, the exercises later today will, of course, you know, have little elements
which are solved nicely with Python regular expressions. Just as a–regular expressions
are sort of a good news, bad news situation. They–regular expressions are very–I mean,
I could use the word powerful but also, like, I could use the word very dense. Like, if
you sort of measured the amount of thought and cups of coffee that get poured into, like,
per character, like, regular expressions are, like, the most dense language possible.
You could puzzle for hours over, like, one line of text trying to get, like, all the
hats and backslashes, and whatever, correct.
We're not going to get into that scary [INDISTINCT]
but they are–we're going to sort of touch into a little bit of that part. So, one word
of warning, when messing with regular expressions, it's the–I tend–I try to move a little slowly,
like, they are very powerful, they are a little tricky, so I'm going to try to be careful.
And for, you know, today's discussion, like, yeah, I'm going to show you just sort of basic
stuff. And if you're extremely familiar with regular expressions, well, you know, just
bear with me for a little bit.
We're not going to do this for too long. And, obviously, really
I'm going to emphasize on Python. That's the bad news. The good news is also on all the
exercises we're going to do later today, in case I forget to mention it, if there is a regular
expression component, at the very end, printed in little tiny print, I put what the regular
expression solution is. So, it's kind of like you can sort of flip to the back and get the
answer if you get–if you're struggling with that part of it, because really, you know,
it is a Python class.
So, I don't want you to block on regular expressions too much.
All right, so with that introduction, let me–I'm going to start talking about how these
things work. But first, I have to tell you a joke which [INDISTINCT]. What do you call
a pig with three eyes? Piiig. All right, now, that will be covered. The necessity of that
will become clear in a little bit. So let me fire up the interpreter here. So, regular
expressions in Python are supported by a module called "re." So, I'm just going to import
that. I'm going to do a lot of stuff here in the interpreter. I'm going to sort of build
this up. So, the basic idea with regular expressions is they're a way of searching for a pattern inside
of a larger text. So very much like, you know, search in Microsoft Word where you have the
little pattern you're looking for, and it's going to look over this huge text and find
the first instance of that pattern.
But it's this whole language where the patterns can
be very powerful. So the way this works in Python, the simplest way, is there is a function
inside of re called "search." And I'll sort of spec this out. It's going to work basically
this way where the first argument to search is the pattern, which I'm going to talk about a
lot, the second argument is just kind of whatever the text I want to search. And what it returns
is actually not a Boolean, not text, but a match object. So, here, I'll write this as
"match." And then the match object will indicate–it will show us a bunch of things about the found
text. So, let me do an example. So, for the text, I'll use our punch line,
you know, it's called piiig. All right, and let's say for the pattern, we're looking for–and
we'll just start talking about patterns here. Maybe I'll just look for "iiig." I'll
just put the simplest possible case. So, I run it. Now, it returns this match object.
So, for this type, match–I mean, it's not really going to print, but it will say, well,
that's, you know, that's some kind of Python object.
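A minimal sketch of that first search, written for Python 3 (the lecture's interpreter session predates it, so the exact printed form of the match object will differ). The text and pattern are the ones from the lecture; `group()` is the method described next.

```python
import re

# Pattern first, text second, as described above.
match = re.search('iiig', 'called piiig')

print(match)          # some kind of match object; its printed form varies by version
print(match.group())  # the text that actually matched: iiig
```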
So it turns out for the first 20 minutes
this morning, the only thing you need to know about match is that it has this–it responds
to this method called "group." If you call group on it, it shows you here is what
the matching text was. So, this is our first example of a regular expression. And the
simplest case in a regular expression, like "iiig" here, is that a character like i
or g or something like that matches itself. So the lowercase i matches the lowercase
i. Now, I'm going to build up the vocabulary to have a lot more complicated matches. But
that's just characters matching themselves is the simplest case. All right, another thing
to point out here. So, this match was successful. Now, I'm going to do one that's not successful.
So, like what, say, we're looking for the pattern "igs" and that pattern just doesn't
appear in there.
So, if I run that and I look at the match object, it's None. In the interpreter,
if it's None, it just prints nothing. But–it's just not there. So, if I were to try and say,
oh, "match.group," it's a very common error, it's not going to work. Like, because match
doesn't point to an object that has a group behavior. match just points to nothing.
So, the absolutely standard way to use re.search is–I'll sort of do an interpreter–first
you do re.search and then the next line is something like "if match:." Like, if the match
is there, then we found it.
We can look at the group. Otherwise, it's not there. Now,
what I'm going to do–I'm actually going to write–just–I'm going to def a little find
function just here in the interpreter. I'm going to do so many regular expression searches
today. I just want to encapsulate that behavior. So that what I'm going to show you here are
some of the prototypical use of re.search and then I'll just use it for half an hour.
So I'm going to say–I'm just going to call this thing a "find" and it will take a pattern
and some text. This is a little weird. I'm doing this in the interpreter, but this works.
I'm going to type a colon and I hit return, and now the interpreter is saying, "Okay,
what's the next line?" And so, I'll say space, space and I'll say if match. Now, I'm relying
on the fact that–I talked a little about this yesterday, the rules for true and false.
Now, there's a bunch of things that kind of count as false: zero counts as false, the
empty string counts as false.
It happens that the value None also counts as false. So, what
this if statement is sort of saying is, like, yeah, if that match is not None, essentially.
If it's there, it succeeded. So, if match is there, I'm going to say "print match.group."
Okay, return again, two spaces, I'll say otherwise–I want to say what happens. So, essentially,
yeah, not found. All right, those are the two cases. The question is, do you always need
group? I'm going to always use it today. In reality, the match object has–you could
read the docs–it has, you know, in what character position did it start, where is
that, and all sorts of other stuff about the match you might want to know and sort
of composite in there. All right, I'm going to–here, return. All right, so now I have
defined my find function, so now I can sort of use this for–well, I'll just–I'll use
it for my earlier example. So, now, if I say find on that, I get "not found." And if I
say ig–why didn't that work? I did what? Oh, I didn't do the match function. Yeah,
you're right. All right. All right, here, I'll just do it really quickly. So, there's
def and then I'll say "match = re.search." Oh, sure. Now, you guys tell me. Sure, where
were you five minutes ago? Okay. So, there is the match and I say "if match:," well,
okay, everyone is going to know this code by heart.
If match print "match.group", else:
print not found. Okay. So, now, what will we say, this time for sure–excellent. Okay.
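The finished helper, as a sketch in Python 3 syntax (the lecture types it with the older print statement). The name `find` and its two parameters come from the lecture; the docstring and comments are added here.

```python
import re

def find(pat, text):
    """Print the first match of pat in text, or 'not found'."""
    match = re.search(pat, text)
    if match:                  # None counts as false, a match object as true
        print(match.group())
    else:
        print('not found')

find('iiig', 'called piiig')   # prints: iiig
find('igs', 'called piiig')    # prints: not found
```

Testing `if match:` before calling `group()` avoids the common error mentioned above, where `group` is called on `None`.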
So, now, I'm just going to do–you know, I'm going to build the vocabulary of regular expressions.
So, the simplest rule, rule number one is that simple characters just match themselves.
Rule number two is that–and I'm actually going to–a little hi-tech here. I'm going
to make a little space. Special characters, I made a little table up here. The dot, that
one is special: dot matches any character. It means anything, except it does not match a newline.
So, I could have said, "Well, I'm looking for, let's say, any three characters and then
a g." That's the pattern I'm looking for. And so, in this case, that's going to find
"iiig." So you can get, you know, a little bit of a sense of, like, how this is going to be,
you know, more powerful than just regular Microsoft Word search. So, the–another rule
that's going on here is–so, for example, if I were to say "...g," you know–or I don't know,
x. There. It's not going to find that. Well, maybe I'll do it the way I did it before to
make it better, put the s. So, if I say "...gs," that's not found. There's an asymmetry
here where in order to succeed, all of the pattern must match. So, in this case, I've
got four characters or whatever–you know, I can't say, like, "Oh, well, three out of
four," no. A hundred percent of the pattern has to be kind of consumed and matched, but
that's not true about the text, right? In all of my examples, I'm not matching the whole
text. Whatever, I can use a little bit of this at the end of this word or whatever,
so that is, so to say, a fundamental asymmetry.
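A small sketch of the dot rule and the asymmetry just described, using the lecture's text. The variable names are only for illustration.

```python
import re

text = 'called piiig'

m1 = re.search('...g', text)    # any three characters, then a g
m2 = re.search('...gs', text)   # also insists on an s, which the text does not have

print(m1.group())   # iiig  (only part of the text is used)
print(m2)           # None  (all of the pattern must match, and it cannot)
```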
The other thing that's going to happen here is
that the search is going to go left to right, and it's going to–it's satisfied as
soon as it finds a solution. So, we could make a case where it's maybe–say for example
I'm looking for "...g" and then I add here, like–and then I'll make this like, "Uh-oh, there's
a much better solution." You know, this one is eg. What's going to happen is it's not
finding that second one, right? It's just not getting to it. But, how can I make it find
that one? What if I said, "Well, really what I'm looking for is an x and then two characters
I don't know about, and then the g."
So then, it, like, passes over the first one and finds
the second one. Now, the regular expression engine, I'm not getting into too much detail, is–you
know, I mean, it finds all the things it's supposed to and it's smart. It understands.
It does–it will backtrack. So, for example–this is–what if I said, well, I'm looking for
"...g" and then I insist that there's an s. And here, I'll go fix this one. It's like
kind of an s here. So you can imagine–so that succeeds. So you can imagine it like it
maybe sort of tries to make this one work and it doesn't work, and so it says, "Well,
it didn't work." It will keep going.
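The transcript does not record the exact text typed for this part, so the strings below are hypothetical stand-ins chosen to show the same behavior: the search settles for the leftmost match, and when a candidate fails partway through it moves on and tries later positions.

```python
import re

# Hypothetical text with a second candidate ('xyzg') further to the right.
text = 'called piiig much better: xyzg'

first = re.search('...g', text)     # the leftmost solution wins
second = re.search('x..g', text)    # demanding an x skips the earlier one

# Same idea with a trailing s: the attempt at 'iiig' fails (no s follows),
# so the engine keeps going and succeeds later in the text.
third = re.search('...gs', 'called piiig much better: xyzgs')

print(first.group())    # iiig
print(second.group())   # xyzg
print(third.group())    # xyzgs
```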
So, the other thing is it is going–it is left to
right. So, it finds the first one. Okay, yes. So the question is why didn't it match the first
one. The trick here is I added an S to the end of the pattern. If you read the pattern,
it says–that says, any character, any character gs. And the problem is it could–this one?
One more, this one. That will succeed it. Oh, why didn't I find that? Okay, yeah. So
what it does is it–sorry, yeah–it goes left to right. And once it finds a solution, it's
like, "Okay, I'm done. Don't try it anymore." Yeah, question.
>> For the three partners.
>> PARLANTE: Oh, yes. Okay. Thanks for the question.
>> Statement with a…
>> PARLANTE: Yeah, so if you are actually looking for the period characters, so like
what, say, you know, I don't know. There's a dot there. But you were right, it is C.
And you can always put a backslash in and that inhibits the specialness of a character.
So, I could look for, you know, "c\.l" there. Now, I'm going to
introduce a slight extra bit of syntax here
which Python has, which is–where it's a little
troubling, like, the backslash could be interpreted at different levels, like maybe
Python or, like, Java–or–it might get taken out by the language. So, without talking too much
on this, I'm just going to say–Python has an option called a raw string, where you
put a lowercase r to the left of the leading quote. And what
the lowercase r means is, it says, "Do not do any special processing with backslashes. Whatever
I type, just send it through absolutely raw and un-interpreted." This feature–I mean,
it's a little bit obscure, but it happens to be very useful for writing regular expressions
because it frees us from having to worry about layers of backslash processing. So, in fact–even
though I've done my examples so far without the r, I'm just going to use the lowercase
r for all of my examples from here on out.
So, I just don't have to think about it. So,
in this case, let's just try it. Yes, so then it's able to find this–you know, so it's
matching now. So, that's an l with the dot. So that, you know, lets us talk
about a literal dot. All right, so the–where we got so far, so it goes left to right, dot
matches any character, and all of the pattern has to be matched but the text, you know,
we don't care. You don't have to get all the text.
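A sketch of the raw-string and escaped-dot points. The text `'c.lled piiig'` is a hypothetical stand-in for whatever was typed in the session; the `len` calls are just one way to see what the `r` prefix changes.

```python
import re

# An ordinary string literal lets Python process backslash escapes first;
# a raw string passes the backslash through untouched.
print(len('\n'))     # 1: a single newline character
print(len(r'\n'))    # 2: a backslash followed by the letter n

text = 'c.lled piiig'                    # hypothetical text containing a real dot

m_any = re.search(r'c.l', 'called')      # unescaped dot matches any character
m_dot = re.search(r'c\.l', text)         # escaped dot matches only a literal dot
m_none = re.search(r'c\.l', 'called')    # no literal dot in this text

print(m_any.group())    # cal
print(m_dot.group())    # c.l
print(m_none)           # None
```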
So, let me show you a slightly more
scary example. So let's say in my text here I've got–you know, there's some text that
I don't care about and I'm going to say, you know, ":cat" and then there's more text.
So let's say I want to pick that part out. So, the next sort of regular expression code
I'm going to talk about is "\w." So "\w," which is actually what I have up here, "\w"
matches what's called a word character. So that means a letter or a digit and I think
it also includes underbar. So in this case, I'm going to say, well, let's say I'm looking
for a colon followed by three word characters. So that's going to work okay. So, now, I could
say–sort of like the "\w," there's "\d," which matches a digit.
So, for example, if I say cat and
I'll say like ":123."
>> Between the "\w\w\w" and "..."
>> PARLANTE: So, "..."–it's an excellent question. The question is what's the difference between
"..." and, you know, three "\w"? The "..." is just any character, like, it could be a space,
a colon, anything. The "\w" matches a character so long as it looks like a word
character, like, A through Z, zero through nine, underbar. Now, if this was a Unicode
string, the word, it's smarter about being–there is a basic notion of an alphabetic character
versus, essentially, punctuation. Oh, yeah, question.
>> Zero through nine, is that true, the word character?
>> PARLANTE: Yeah, a word character includes digits zero through nine. So, I mean, you
know what it's a little bit like is usernames, right? You know, blah-vadi-blah 123, you know,
That will show up in a later example. Okay. So, digit is a little bit similar.
Where is my cat example going? Oh, well, I suppose we'll do it again. So, if I were looking for,
you know, have blah, you know, ":123xxx," let's see. So I could look for–actually,
here, I'll just look for–I'm just looking for three digits in a row. I can write that
as–I'm going to put the r also here. So, that pulls out the 123. Now, it happens there's–so
you can see there are sort of these different regular expression codes: "\w," "\d." You know,
they kind of represent sort of common cases.
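A sketch of the two codes so far, using text close to what the lecture describes (the exact strings typed are partly reconstructed). The last line is an added check that underscore counts as a word character.

```python
import re

m_word = re.search(r':\w\w\w', 'blah :cat blah blah')   # colon plus three word characters
m_digit = re.search(r'\d\d\d', 'blah :123xxx ')         # three digits in a row
m_under = re.search(r'\w\w\w', '&_a1&')                 # underscore and digits count for \w

print(m_word.group())    # :cat
print(m_digit.group())   # 123
print(m_under.group())   # _a1
```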
I'm not going to show you all of them, but
I'll show you the ones that I think are the most useful and, you know, ones that kind
of show up in the problems we're going to do later. All right, let's say–last one I'm
going to show here is whitespace. So, suppose I'm looking for, like, this pattern. It's like,
well, I want some digits and they're separated by spaces. So the simplest way you do that
is a "\s" represents a whitespace character. And the "\s" is smart that it knows about
space, tab, new line, those all count as a whitespace character.
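A sketch of `\s`, assuming digits separated by single whitespace characters as in the lecture's example. The strings are illustrative; the third case anticipates the two-spaces question that comes up next.

```python
import re

pat = r'\d\s\d\s\d'

m_space = re.search(pat, 'xx 1 2 3 xx')   # plain spaces
m_mixed = re.search(pat, '1\t2\n3')       # tab and newline also count as whitespace
m_two = re.search(pat, '1  2 3')          # two spaces in a row: one \s is not enough

print(m_space.group())      # 1 2 3
print(m_mixed is not None)  # True
print(m_two)                # None
```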
It knows about the whole
sort of space of whitespace characters. So, hopefully, that will work. So that finds
it.
>> One, two…
>> PARLANTE: Yeah, so look–so the question is if you had two spaces. Just hold that thought
for a second because I'm about to–we're about to get there. So, so far, I haven't done any
repetition. I've just had, like, you know, fixed numbers of things. So the–probably,
the most powerful part of regular expressions is that there are these modifiers, plus and
star. So plus to the right of something means one or more of that, and star means zero or
more. So, I'm just going to do that with my digit example. So the question is,
like, what if I've got these three digits–whatever.
There's just some amount of space in between
them. So the way you would say that in regular expression is, like, put a plus to the right
of that "\s." And that means–yeah, one or more–that element repeats. There's just one
or more of those. And I'll do it with this one as well. So, here, return there. So
that matches. So, adding the plus and the star, and I'll do a bunch of examples with
these, you know, is exactly what we want to start matching more complicated patterns. Also,
remember how I was saying how, per character, regular expressions, I think, are pretty much
the densest language that any normal person would use. And, like, look at that little–okay,
look at that little bit of code, right? That really means something, right? Every character
in the order–I mean, it's all really pretty significant.
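A sketch of plus and star on the digit example. The amounts of whitespace and the surrounding `xx` are made up for illustration.

```python
import re

# Plus: one or more whitespace characters between the digits.
m_plus = re.search(r'\d\s+\d\s+\d', 'xx 1   2      3 xx')

# Star: zero or more, so digits with no whitespace between them also match.
m_star = re.search(r'\d\s*\d\s*\d', 'xx123xx')

# With plus, at least one whitespace character is required, so this fails.
m_fail = re.search(r'\d\s+\d\s+\d', 'xx123xx')

print(m_plus.group())   # 1   2      3
print(m_star.group())   # 123
print(m_fail)           # None
```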
So, it's getting–I was about
to say it's going to get worse. But what I meant to say is it's about to get even more
interesting. Whenever a professor uses the word interesting, you always know you're in
big trouble. All right, so let me do–I want to use those pluses a little bit. So, let's
say I'm looking for–I've got this random text and I'm looking for a colon and then,
you know, let's say, a word character. You know, I'll use kitten instead and just more
texts here. Now earlier, I've said, oh, you know, I kind of knew how long the word was,
but that was kind of a ridiculous assumption. The more typical way to do this would be, let's
say, well, there's a colon and I'll say, "And then, there's just some number of word characters."
So, I would write that as "\w+." That's the much more typical way, right? Like, you have
some, you know, marker, something that sort of starts it, and then you're like, "Yeah, whatever,"
then just take all the word characters from there.
So, if I write it that way, then it
will like–it just picks up the kitten part. So, that is a–beginning to look a little
more the way these things actually work.
>> A word character…
>> PARLANTE: Yeah, so the space is not a word character. That's what's making it stop there.
So, all right–and actually there was a question before, like, does it include digits and so
forth? What if there was like "kitten123?" That still works. But if it's like "kitten123"
and I–except I add a character, like, let's say ampersand, then it stops at the
ampersand. So that's the thing, so what the plus does is it–the plus is greedy. It goes as
far as it can and then it stops. So, kind of the mnemonic for regular expressions
is it finds the leftmost solution, the first one, and the largest solution.
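A sketch of the greedy `\w+` behavior on the kitten examples. The surrounding "blah" text is filler, as in the lecture.

```python
import re

pat = r':\w+'   # a colon, then as many word characters as possible

m1 = re.search(pat, 'blah blah :kitten blah blah')
m2 = re.search(pat, 'blah blah :kitten123 blah')
m3 = re.search(pat, 'blah blah :kitten123& blah')

print(m1.group())   # :kitten     (stops at the space)
print(m2.group())   # :kitten123  (digits are word characters)
print(m3.group())   # :kitten123  (stops at the ampersand)
```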
So, to the
extent that there's a plus or a star, it will just go as far as it can. Yeah, question.
>> So period-plus will take you all the way to the end of the line?
>> PARLANTE: Oh, yeah, so the question is period-plus would take you–well, it's not
a question. It's actually a suggestion. So, I'm going to rephrase it as a question. What if
I said period-plus? And the answer is, as you said, like, yeah, it just goes all the way
to the end, right, so period matches dot, ampersand, everything, okay, except for [INDISTINCT].
All right, yeah.
>> When you say largest, do you mean that
if you say kitten123, 123, you will find the whole?
>> PARLANTE: So, if I say–you mean here if I say kitten123, 123? And I'll go back. So,
this is "\w+." Okay.
What do you think it's going to do? All right, yeah. So it will go
through both 123s, and then it will stop because the space is not a word character but digits
zero through nine are word characters. I mean, it's a bit of an abuse of the word "word,"
right? I mean, to a normal person, it just like a word character. But compared to, like,
ampersand, it's a word character. All right, so one more code I'm going to show, which
I'll just type in here, is backslash uppercase S is a non-whitespace character. It's kind
of like the opposite. And I–I'm a little saddened that whoever designed regular expressions
chose to have uppercase and lowercase mean something different because it just makes
it a little bit confusing. But backslash uppercase S is really pretty darned handy. So,
let's say, for example, I knew that it was "kitten123&a=123&," you know, yatta, whatever.
It's all this junk and then there's a space. And I want to write a regular expression that
picks up all of that and the ease–but I don't know.
It's not just word characters, you know
what I mean. It includes all sorts of stuff but at least there's not spaces in it. And
a common pattern is if I write that as there's a colon and then there's just some series
of non-whitespace characters, and then they'll just sort of terminate with the first whitespace
character. So that'll–that just catches the whole thing even though, like, the Lord knows
what sort of characters are in there. So, just as a practical matter, that backslash upper
case S is potentially a handy way to catch that stuff. All right, so let me show you–so
those are all the–so the plus, the star, and those backslash codes, those are
the ones I want to build on. Now, let me–so far my examples have been like, you know,
a little bit limited. I want to–now, I'm going to do an example with e-mails, I mean,
how to build it up and hopefully I'll show you pretty practical patterns you can use.
All right, so I'm going to make up some texts here.
I'll keep the "blah." So, let's say
we are looking for, you know, firstname.lastname@example.org and then there's more junky text and let me
just add an @ sign just by itself. I'll just leave it that for now. All right, so the problem
I want to solve is pulling e-mail. I want to imagine. I've got this big body of text
and I want to pull e-mail addresses out of it using regular expressions. So, the–I'm
going to try–first I'm going to try to write this as "\w+" and then there's an "@" sign
and then there's "\w+." It's kind of a, you know, plausible first stab at this.
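A sketch of running that first attempt. The address in the transcript appears as a placeholder, so the text below assumes an address of the shape `nick.p@gmail.com`, inferred from the narration that follows (the "p", "gmail", and "Nick"); treat it as a stand-in.

```python
import re

# Assumed stand-in text: some junk, an address with dots in it, and a lone @ sign.
text = 'blah nick.p@gmail.com yatta @ '

m = re.search(r'\w+@\w+', text)

print(m.group())   # p@gmail  (the dots on either side stop \w+ short)
```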
So, if we run that, so what's happened here? So what's happened is, well, you know, it
finds–there's the "@" sign.
It gets the "p," but it can't go further left than that because
the dot does not count as a "\w" character. And then likewise, it gets gmail over here
but then it's confounded by the dot. So what I want to say with the pattern is I want
to sort of expand–it's not just word characters, really it's word characters plus some other
stuff. So, in regular expressions, there's this very old syntax for indicating a set of characters
and it's going to use the square brackets. So, inside of those square brackets, I can
put "Well, here's the set of characters that I'm going to allow here." And actually the
"\w" works inside of the square brackets because it's just a common case. So, what I want to
say here is, well, "\w" or let's say dot, and then let's say–well, just leave it at that.
So, the question is–yes, that's a very natural question.
That dot–it happens, you don't
have to backslash that one, that it understands that the dot inside of the square bracket
is just the dot.
>> "\w" is included in that?
>> PARLANTE: No, so "\w" word character is just A through Z, zero through nine. It just
not–oh I'm sorry, I'm sorry. That, no, that dot there means literally a dot. It doesn't
match any character.
>> Because it's in the square bracket.
>> PARLANTE: Because it's in the square bracket, it means literally a dot. I mean, it's–I
mean it's sort of what's going on here. I mean, you could do work on your Ph.D. and
the text or something that it's kind of levels of quoting. It's kind of what's going on here.
And it's–it is a necessary complexity to talk about this kind of thing. Let's all just
see what dot does. All right, so you can see, you know, you can sort of see that–so that
picks up its word character and then it stops at the space essentially, right? So, it is–it's
not really a dot.
I'm sorry. It's not a regular expression dot. It's just a regular dot. All
right, so I'm going to fix the other side as well. So, I'll put a square bracket over
here and I'll put the dot in there, oops, and the plus goes outside, right, I'm saying
that whole set repeats. All right, so that's kind of fixed it. So, the square bracket is
probably the most convenient way if there are some set of characters that you're looking
for, you know, you kind of build, let's say, well, yeah, here's I'm looking for. What if
I had a dot for Nick? I'm sorry, dot, you mean, like a he…
>> Text.
>> PARLANTE: Oh, it would just–it would pick
it up. I mean, we've said, you know, we've said to the left of the @ sign just, you know,
as many of these as possible so it'll–now, suppose we wanted to say that the first character
can't be a dot, it must be a word character, can you think of way in the pattern we could
say that? Yeah, I could have a single "\w."
I would say there must be a word character
and then it's followed by one or more of the thing that includes a dot. That'll be it.
Although, then to be super–I think, really correct, then we should change that one to
a star, right? Then there must be one word character and then zero or more of these patterns.
>> [INDISTINCT] mostly inside the brackets the order doesn't matter in the…
>> PARLANTE: Yes, once it–exactly, yes. So once it's inside the bracket, the order doesn't
matter or, I mean, try to use the word set, it's a set of characters.
All right, so let's
try that. Yeah, so then that refines them. All right, so anyway, I mean–yes, it is a
sort of a bottomless topic. I mean, you know, there's–but, I mean, hopefully, I'll show
you stuff that are useful. All right, so I've got my e-mails example, so that's the first
thing. So that's just using group, right? All I've been doing there is just using group.
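A sketch of the square-bracket versions. As before, the address is an assumed stand-in (the transcript shows a placeholder), and the second text with a leading dot is hypothetical, added only to show what the refined pattern changes.

```python
import re

text = 'blah nick.p@gmail.com yatta @ '        # assumed stand-in text

# A set of characters: word characters or a literal dot, repeated one or more times.
m_set = re.search(r'[\w.]+@[\w.]+', text)
print(m_set.group())     # nick.p@gmail.com

# Hypothetical text where the address is preceded by a stray dot.
dotted = 'blah .nick.p@gmail.com yatta @ '

m_loose = re.search(r'[\w.]+@[\w.]+', dotted)
m_strict = re.search(r'\w[\w.]*@[\w.]+', dotted)   # first character must be a word character

print(m_loose.group())   # .nick.p@gmail.com
print(m_strict.group())  # nick.p@gmail.com
```

Inside the brackets the dot needs no backslash, and the order of the items in the set does not matter.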
So now, what I'd like to show you is I'm going to stop using my find function. I'm going
to start doing this raw here. And what I'd like to do is I want to imagine that I want
to pick out the username and the hostname separately. I want to sort of pick those out.
And so, just go back here. I'll just change this to "m=re.search" so I'm just doing it
And you can do this–I'm going to change this back to just the regular way
here. By putting parentheses in the regular expression around the parts that you care
about. Now, the way I'm doing it here, the parentheses are not changing what it's going
to match. I'm just kind of putting in that markup, saying, "Well, these are the two
parts that I care about." And here I'll get rid of this dot. So, if I do that, right,
so I put parentheses around the part that matches the username and parentheses around
the part that matches the host and the @ sign I've just–I've done, I don't care about.
So, now, I've done this.
If we look at "m," it's a match object. If I say, "m.group,"
it's the whole thing, like, just like we've always been doing. But there's also a form
of group where you pass it a number. So, if I say, "m.group(1)," that's now just
the username part. And the "1" refers to the first set of parentheses. So if you count
the parentheses–and it goes by the left parenthesis because you could actually nest them. So the
"group(1)" refers to the leftmost parenthesis, which if you look up here, there's, like,
yeah, that's that guy.
And then, "m.group"–oops, you can just guess, "group(2)," that's now
the hostname. So, a more–the way this is going to work for, you know, some problems
with regular expressions is a lot of times you'll write a regular expression for the thing you're
kind of looking for, right? Well, I'm looking for a URL, or an I.P. address, or something,
and then you'll maybe put parentheses and then say, "And here's the part that I want
to extract." Then you'll call "re.search," you'll get this match object and then you'll
use "group(1)" and "group(2)," whatever, to just kind of–it's already, you know, parsed
it for you.
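A sketch of pulling out the username and host with groups, again on an assumed stand-in address since the transcript shows a placeholder.

```python
import re

text = 'blah nick.p@gmail.com yatta @ '          # assumed stand-in text

# The parentheses mark the parts to extract; they do not change what matches.
m = re.search(r'([\w.]+)@([\w.]+)', text)

print(m.group())    # nick.p@gmail.com  (the whole match)
print(m.group(1))   # nick.p            (first left parenthesis)
print(m.group(2))   # gmail.com         (second left parenthesis)
```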
You'll just pull out the parts that you want as text and then process from
there. Yeah, question.
>> If I put a plus or a star after one of
these [INDISTINCT] the first parenthesis and it matches that twice, is that "group(1)"
and then "group(2)" or is it still a group…?
>> PARLANTE: Yeah, so the question is if there's
a plus after the parenthesis, you know, does that change how the group number works? The
answer is no. The group numbering is based on just statically looking at the pattern
as an unchanging thing and just counting the left parentheses from–going from left to
right. So that is the shortest answer I can give there. All right, so I've got–I want to–so
"re.search" is my second favorite Python regular expression function. My
favorite one–actually, let me make my data a little more complicated here. I guess I'll
also add a "foo@bar." My absolute favorite regular expression function is called "findall."
So, I'll just say "findall" here. And what "findall" is going to do is I've just still
got a pattern, and now I've just changed my text to–you know, I put a second e-mail address
in there. What "findall" does is it just takes the pattern and rather than just stopping
at the first match, it just continues and it just finds all of the matches and it returns
them to you–it returns to you essentially the ".group," right, the whole text, just in
a Python list of strings. So, for example, we talked about for a file how you
could just say "f.read" to get the entire text as one string.
So a pattern I enjoy, because it just saves me so much work,
is I just call "re."–I just call "f.read"
and I pass that in as the second argument to "findall." I just feed the entire file
into "re.findall." I have a pattern, I just let it rip through the entire text,
skip new lines, whatever, all that stuff it just handles, and it just pulls out the things
I want and just returns them to me as a Python list. And then, I can–you can write a for
loop, you know, stuff we're doing yesterday. Now, you can just process this list. So that
is–that's really how you use this stuff. So, in this case–so notice I took the parentheses
out. So I just left this as a simple pattern. So I just got, you know, I just got the system
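A sketch of `findall` on text with two addresses. The first address is an assumed stand-in (the transcript shows a placeholder); `foo@bar` is from the lecture. The second call shows the parenthesized variation that the lecture turns to next.

```python
import re

text = 'blah nick.p@gmail.com yatta foo@bar '    # first address is an assumed stand-in

# No parentheses: a list of the whole matched strings.
emails = re.findall(r'[\w.]+@[\w.]+', text)
print(emails)    # ['nick.p@gmail.com', 'foo@bar']

# With two groups: a list of tuples, one tuple per match.
pairs = re.findall(r'([\w.]+)@([\w.]+)', text)
print(pairs)     # [('nick.p', 'gmail.com'), ('foo', 'bar')]

# Either list can be processed with an ordinary for loop.
for user, host in pairs:
    print(user, host)
```

For a file, the same call works on the string returned by reading the whole file, as described above.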
There's this one other variation I can do here. What do you supposed is going
to happen if I put the parentheses back in? I'm, like, well, this is, you know, it's not
really–it's a pattern but it has this grouping in it. And, yeah, we have tuples. What it's
going to do is if there are parentheses in there, instead of just returning the whole
match, it says, "Oh, well, there's two parens. I'm going to return tuples of length two." So
each tuple represents a single match and then the tuple just has the groups in there. So,
that–yeah, you can see where this can be pretty handy. If you got some big file, you
just want to kind of–there is some part that you care about, you just want to rip
it out as lazily as possible, so "re.findall."
>> You'll lose the format.
>> PARLANTE: Excuse me?
>> You'll lose the format.
>> PARLANTE: Yeah, I mean, I would say–so let's say you lose the format.
Well, the regular expression is narrowing. You get to say what you want to keep. And
so, if you want to keep more, you know, write the regular expression bigger, you know, to
keep more. All right, so that's, you know, not hard to imagine how we're–it's going
to be easy for you to work that in to–doing stuff later on. So, I'll just mention, there
are some optional arguments that you can add sort of here as a third argument to the regular
expression functions. And what I'm going to actually do, I'm going to do a dir on re. So that's
the re module. I can say, "Oh, yeah, hey, what are these symbols in here?" So these
are some constants. So, if you add the constant "IGNORECASE"–this works on ".search"
or ".findall"–that means that it'll consider upper and lowercase the same. So a lowercase
"i" matches an uppercase "I" and vice-versa. You can do the "DOTALL." I had said that the
dot matches any character except for newline, and that's kind of a historical thing because
the processing tended to go line by line.
If you add the "DOTALL" flag, then the dot
will match new line as well. And so, you could–because right now, if you use dot, you're pattern
can't expand more than one line. Although, if you use "\s" where you think there's a
new line, that'll expand the line but the dot will not go over one. So, if you mean–if
you add "DOTALL," you can turn off that behavior and that will just truly match anything. So,
if you were to say "DOTSTAR" with nothing else, it would just go to the end of the file.
So there's more–I think the most common ones to use there.
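A short sketch of the two flags just discussed, passed as the last argument to "re.findall" and "re.search". The sample text is made up for illustration; "re.IGNORECASE" and "re.DOTALL" are the real constants in the re module.

```python
import re

# Hypothetical two-line sample text.
text = 'Hello\nWORLD hello'

# Default matching is case-sensitive.
plain = re.findall(r'hello', text)

# IGNORECASE treats upper and lowercase as the same.
nocase = re.findall(r'hello', text, re.IGNORECASE)

# By default the dot stops at a newline, so .* ends at the end of the line.
one_line = re.search(r'Hello.*', text).group()

# With DOTALL the dot also matches newline, so .* runs to the end of the text.
all_text = re.search(r'Hello.*', text, re.DOTALL).group()

# \s matches a newline even without DOTALL.
across = re.search(r'Hello\sWORLD', text).group()
```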
>> [INDISTINCT] at the end of, like, [INDISTINCT].
>> PARLANTE: Yeah, so let me–I'll give you
an example. So the way you would use those constants is as a third argument. So
you would say, "re.IGNORECASE" or whatever. That's the last argument there. All right,
and so the, you know, a couple–the handouts for today, the first one–if you didn't get
one, you can get one in a second. The, you know, there's a nice, you know, an explanation
of regular expressions and it shows the syntax and, you know, a lot of kind of stuff there
for whatever we're doing here. All right, so I think we're ready for an exercise. So the
exercises today are going to be a little bigger. There's three of them I'd like to do today
and, you know, kind of incorporate all that sort of stuff. So let me demo this first one.
So, this first one is going to involve a brief foray into a little understood part of the
government called the Social Security Administration.
Now, the Social Security Administration in
my life experience is in charge of putting certain fields on everyone's paycheck that
no normal person understands and it just caused you [INDISTINCT] just not that was going on,
but they do this other thing. If you do a Google search for Social Security Administration
baby names, they do this thing where they keep track of what the popular baby names
are for babies born in a given year. And they've been doing that actually for a hundred years.
So you could look at 1900, 1950, whatever, you can just see–it turns out for baby names,
there's sort of a–there's a popularity to it. There are sort of names that kind of ebb
and flow. So, I look at this and I see an assignment idea. So, let's go to–oh,
I don't know, 1980. I'm not even going to try and think about when you guys were all
born. Do you know what I'm saying? And let's just go with, like, I don't know, the top thousand.
So we get those, like–you'll look at this and you're just thinking like, "tr," "td,"
that kind of thing, you know, good thinking. All right, so, here's just for 1980. The list
of baby names, and what this is saying is that for boy children, Michael was the
most popular name. And then, next most popular was Christopher, Jason, David, and so on.
And over here, we have Jennifer, number one; Amanda, number two; and it just goes down
to, like, you know, here we have, you know, Bobbie, Emil, Jermain, Kraig with the K, down
to the, you know, the less popular names. All right, so what I would like to do–going
back to Python here, let's see what I have. Okay. So I'm going to go into "day2" here,
and there's a directory, baby names. So, if I look inside here, let me look at "baby1990,"
I've pulled this sort of–I just sort of copied and cleaned up just a tiny bit the text from
the Social Security Administration site. I put this very [INDISTINCT]. Okay, well, there's
some partly written CSS and whatever. And then eventually, there's–here's the "h1"
and then here's a table blah, blah, blah, and at some point it's going to say–all right,
here we go. So here's the "h3." This is popularity in 1990, or as I like to think of it, popularity
in "\d\d\d\d." And then, there's some whatever junk you want to skip and then here's a "tr."
And then here, there's the "tr." And if you don't know HTML, well, whatever, that's the HTML
for that first row. So it says, "tr," "td," and then there's the number one. And then
there's more "td" stuff and there's Michael, and there's Jessica, and then here's row two,
and so on. And it just goes on, like, there's all the data. It's beginning to look like
an actual problem. All right, so, the first thing I want your baby names program
to do is given a file, like, "baby1990.html" and I'm going to pipe this into more. What
I want you to do is I want you to rip through that entire file. I want you to figure out
what year it represents. I want you to pull out all the names and all the ranks. I want
you to organize it so that you can then produce a printout that's just in alphabetical order
by name. So just as I've shown here. So you say–so the first–first you print the year and then
I want to see Aaron 34, Abbey 42, and so on. So you just show an alphabetical list,
here's what all the names are.
So that will get us through the…
>> Combine male and female?
>> PARLANTE: Yeah, so what's going to happen
is there's a strange case but sometimes a name will appear as both male and female.
And I'm not making any distinction of male from female. So in that case, I want you to give it the
more popular, essentially the smaller number, whatever the smaller number is, all right?
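One possible shape for the extraction described above, offered as a hedged sketch and not as the exercise's reference solution. The sample HTML is an assumption: the "Popularity in 1990" heading and the Michael/Jessica row follow the description in the talk, but the exact tags and attributes are guessed and the other rows are invented purely for illustration. The function name "extract_names" is hypothetical.

```python
import re

# Assumed, simplified stand-in for the real file; rows 2 and 3 are invented.
SAMPLE_HTML = """<h3 align="center">Popularity in 1990</h3>
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Jordan</td><td>Ashley</td>
<tr align="right"><td>3</td><td>Matthew</td><td>Jordan</td>
"""

def extract_names(text):
    """Return [year, 'Name rank', ...] with names in alphabetical order."""
    year_match = re.search(r'Popularity in (\d\d\d\d)', text)
    if not year_match:
        raise ValueError('no year found')
    # Three groups, so findall returns (rank, boy, girl) tuples.
    rows = re.findall(r'<td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>', text)
    ranks = {}
    for rank, boy, girl in rows:
        for name in (boy, girl):
            # A name in both columns keeps the smaller (more popular) rank.
            if name not in ranks or int(rank) < ranks[name]:
                ranks[name] = int(rank)
    return [year_match.group(1)] + ['%s %d' % (n, ranks[n]) for n in sorted(ranks)]

result = extract_names(SAMPLE_HTML)
```

In this made-up sample, "Jordan" appears in both columns at ranks 2 and 3, so the dictionary keeps 2.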
So, the–oh, let me talk–I'm going to go back to Python for just a second. There's
something I mentioned I think maybe very briefly yesterday but it's about to come up, which
is we did file opening, right? So, "f = open(filename," so I'll remind you, if you want to
open a file for writing, then for that second argument you pass a "w." So,
yesterday we just did "r." We just read the file, so that's fine. So, you put a "w" there
for writing. And in that case, then probably the simplest way to write to the file is then
it has a ".write" and then you could just have, you know, whatever kind of text you
want in there. And so, you've got to be careful, you can zero out a file here but it'll write
that text. And so, this is [INDISTINCT] as well.
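A small sketch of opening a file with "w" and calling ".write", including the zero-out behavior mentioned above. The helper names and the temporary file path are assumptions for illustration.

```python
import os
import tempfile

def write_text(filename, text):
    """Write text to filename, replacing whatever the file held before."""
    with open(filename, 'w') as f:   # 'w' truncates an existing file
        f.write(text)

def read_text(filename):
    """Read the whole file back as one string."""
    with open(filename, 'r') as f:
        return f.read()

def demo():
    # Hypothetical scratch file in a fresh temporary directory.
    path = os.path.join(tempfile.mkdtemp(), 'example.txt')
    write_text(path, 'first version\n')
    write_text(path, 'second\n')     # the first contents are gone
    return read_text(path)
```

Opening with "w" empties the file before anything is written, so only the text from the second call survives.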
That's about to come up. All right,
so that is–so part A is I want you to just pull out all the names, you know, use a regular
expression, findall, maybe a dictionary, I mean, just total regular work. So for part
B, what I'm going to do is there's an option called "--summaryfile." And I'm going to run
this on star, or I'm going to say "baby*.html." In that case, I want you to produce no output.
What did that just do? When the "--summaryfile" option is given, what I want you to do–oh,
notice in this case, I ran it on "baby*." So I ran it on all of the baby files, right?
So the shell just expands that out, so in that case "argv" is going to be all of them. If the
"--summaryfile" option is given like I've just done, what I want you to do is for each file I want you
to read it and I want you to create a new file with the same filename but ending in
".summary," and then I want you to take that output that we were printing to the screen
earlier and I want you to write it to that file.
Now, there's a little bit of a trick
here which I've shown. So, for example, when you have a low level function, you
don't necessarily want it to print to standard out directly. You want to have a function that,
like, given the file, returns to you, say, a Python list or a dictionary or something.
And then the code that got that data structure can choose what to do with it. It can either
print it to standard out, or it could print to a file. So you've got that [INDISTINCT].
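The separation described above can be sketched like this. Everything here is an assumption made for illustration: "summarize" is a stand-in that just returns the file's non-empty lines sorted, "process" is a hypothetical caller, and the summary file is named by appending ".summary" to the input filename, which is one reading of "same filename but ending in .summary."

```python
import os
import tempfile

def summarize(filename):
    """Low-level step: return a list of lines and print nothing."""
    with open(filename) as f:
        return sorted(line.strip() for line in f if line.strip())

def process(filenames, summary=False):
    """Caller decides where the data goes: standard out or a .summary file."""
    for filename in filenames:
        text = '\n'.join(summarize(filename)) + '\n'
        if summary:
            with open(filename + '.summary', 'w') as out:
                out.write(text)
        else:
            print(text, end='')
```

Because "summarize" only returns data, the same function serves both the print-to-screen mode and the write-to-file mode.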
Certainly that technique will come up trying to solve this. All right, so once
you've got that, then you can do something kind of neat. So, when I've got these ".summary" files,
what's happened is because I've done them over a decade, you can sort of see patterns,
right? So this is going in increasing order by year. So my name–the "Nick" format list
was, like, not looking so good and it's getting worse.
Now, well, there's a lot of interesting
data in here. So you can, you know, interesting data makes for fun is what I'm trying to say.
Here's probably the funniest part of this thing. There's a "Trinity" and the question
is, "In what year did the movie The Matrix come out?" And, yeah, there's another few
things that you could do here. So, like, well, maybe The Matrix was reacting to a social
phenomenon, or it's the other way around. It's all very complicated.
>> PARLANTE: Yeah, yeah, [INDISTINCT] there's a New York Times Magazine [INDISTINCT].
It just turns out this entire topic of baby name popularity is sort of very interesting
and at least now you're–and you get to do the, like, the nitty-gritty work of actually
pulling out that data. All right, so here's what I'd like you to do, work on this and
then–that will get us to–and then have lunch and I'd like you back here at, let's say,
12:45, all right? So a little bit of coding, a little bit of lunch, and then back here,
all right, go.