>> PARLANTE: Back to the PyQuick Python class. So, this morning I’m going to talk about regular expressions a little bit, in particular, regular expressions in Python. You may or may not know regular expressions beforehand. That’s okay. And I’m not going to show you all of regular expressions. I will show you, like, just enough for us to get some useful stuff done. But regular expressions are a very powerful combination with Python. There’s a nice integration and so on. I will show you that. Also, the exercises later today will, of course, you know, have little elements which are solved nicely with Python regular expressions. Regular expressions are sort of a good news, bad news situation. Regular expressions are very–I mean, I could use the word powerful but also, like, I could use the word dense.
Like, if you sort of measured the amount of thought and cups of coffee that get poured in, like, per character, regular expressions are, like, the most dense language possible. You could pore for hours over, like, one line of text trying to get, like, all the hats and backticks and whatever correct. We’re not going to get into the really scary part, but we’re going to sort of touch on a little bit of it. So, one word of warning: when messing with regular expressions, I try to move a little slowly. Like, they are very powerful, they are a little tricky, so I’m going to try to be careful. And for, you know, today’s discussion, like, yeah, I’m going to show you just sort of basic stuff. And if you’re extremely familiar with regular expressions, well, you know, just bear with me for a little bit. We’re not going to do this for too long.
And, obviously, really I’m going to put the emphasis on Python. That’s the bad news. The good news is that on all the exercises we’re going to do later today, in case I forget to mention it, if there is a regular expression component, at the very end, printed in little tiny print, I put what the regular expression solution is. So, it’s kind of like you can sort of flip to the back and get the answer if you’re struggling with that part of it, because really, you know, it is a Python class. So, I don’t want you to block on regular expressions too much. All right, so with that introduction, I’m going to start talking about how these things work.
But first, I have to tell you a joke. What do you call a pig with three eyes? Piiig. All right, now, the necessity of that will become clear in a little bit. So let me fire up the interpreter here. So, regular expressions in Python are supported by a module called “re.” So, I’m just going to import that. I’m going to do a lot of stuff here in the interpreter. I’m going to sort of build this up. So, the basic idea with regular expressions is they’re a way of searching for a pattern inside of a larger text. So very much like, you know, search in Microsoft Word where you have the little pattern you’re looking for, and it’s going to look over this huge text and find the first instance of that pattern. But it’s this whole language where the patterns can be very powerful. So the way this works in Python, the simplest way, is there is a function inside of re called “search.” And I’ll sort of spec this out.
It’s going to work basically this way, where the first argument to search is the pattern, which I’m going to talk about a lot, and the second argument is just whatever text I want to search. And what it returns is actually not a Boolean, not text, but a match object. So, here, I’ll write this as “match.” And then the match object will indicate–it will show us a bunch of things about the found text.
So, let me do an example. So, for the text, I’ll use our punch line. We’ll say, you know, it’s called piiig. All right, and let’s say for the pattern, we’re looking for–and we’ll just search–start talking about patterns here. Maybe I’ll just look for “iiig.” I’ll just put the simplest possible case. So, I run it. Now, it returns this match object. So, for this type, match–I mean, it’s not really going to print, but it will say, well, that’s, you know, that’s some kind of Python object.
So it turns out for the first 20 minutes this morning, the only thing you need to know about match is that it responds to this method called “group.” If you call group on it, it shows you what the matching text was. So, this is our first example of a regular expression. And the simplest case in a regular expression, like iiig here, is that a character like I or G or something like that matches itself. So the lowercase I matches the lowercase I. Now, I’m going to build up the vocabulary to have a lot more complicated matches.
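A minimal sketch of that first search, written for Python 3 (an assumption; the session above uses the older print statement). The variable names are illustrative only.

```python
import re

text = 'called piiig'

# First argument is the pattern, second is the text to search.
match = re.search('iiig', text)

# re.search returns a match object; group() gives the matching text.
matched_text = match.group()
print(matched_text)
```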
But characters just matching themselves is the simplest case. All right, another thing to point out here. So, this match was successful. Now, I’m going to do one that’s not successful. So, let’s say we’re looking for the pattern “igs,” and that pattern just doesn’t appear in there. So, if I run that and I look at the match object, it’s None. In the interpreter, None just prints as nothing. It’s just not there. So, if I were to try and say, oh, “match.group,” it’s a very common error, it’s not going to work, because match doesn’t point to an object that has a group behavior. Match just points to nothing. So, the absolutely standard way to use re.search is–I’ll sort of do it in the interpreter–first you do re.search and then the next line is something like “if match:.” Like, if the match is there, then we found it.
We can look at the group. Otherwise, it’s not there. Now, what I’m going to do–I’m actually going to write–just–I’m going to def a little find function just here in the interpreter. I’m going to do so many regular expression searches today. I just want to encapsulate that behavior. So that what I’m going to show you here are some of the prototypical use of re.search and then I’ll just use it for half an hour. So I’m going to say–I’m just going to call this thing a “find” and it will take a pattern and some text. This is a little weird. I’m doing this in the interpreter, but this works. I’m going to type a colon and I hit return, and now the interpreter is saying, “Okay, what’s the next line?” And so, I’ll say space, space and I’ll say if match. Now, I’m relying on the fact that–I talked a little about this yesterday, the rules for true and false. Now, there’s a bunch of things that kind of count as false: zero counts as false, the empty string counts as false.
It happens that the value None also counts as false. So, what this if statement is sort of saying is, like, yeah, if that match is not None, essentially. If it’s there, it counts as true. So, if match is there, I’m going to say “print match.group.” Okay, return again, two spaces, I’ll say otherwise–I want to say what happens. So, essentially, yeah, not found. All right, those are the two cases. The question is, do you always need the .group? I’m going to always use it today. In reality, the match object has–you could read the docs–it has, you know, at what character position did it start, where is that, and all sorts of other stuff about the match you might want to know, sort of composited in there. All right, I’m going to–here, return. All right, so now I have defined my find function, so now I can sort of use this for–well, I’ll use it for my earlier example. So, now, if I say find on that, I get “not found.” And if I say ig–why didn’t that work? I did what? Oh, I didn’t do the match function.
Yeah, you’re right. All right. All right, here, I’ll just do it really quickly. So, there’s def and then I’ll say “match = re.search.” Oh, sure. Now, you guys tell me. Sure, where were you five minutes ago? Okay. So, there is the match and I say “if match:,” well, okay, everyone is going to know this code by heart. If match print “match.group”, else: print not found. Okay. So, now, what will we say, this time for sure–excellent. Okay. So, now, I’m just going to do–you know, I’m going to build the vocabulary of regular expressions. So, the simplest rule, rule number one is that simple characters just match themselves. Rule number two is that–and I’m actually going to–a little hi-tech here.
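The find helper built up above might look like the following sketch. It assumes Python 3, and as a small variation on what was typed in the session it also returns the matched text (or None) so the result can be checked; the names `hit` and `miss` are illustrative.

```python
import re

def find(pat, text):
    """Print the matched text (or 'not found') and return it (or None)."""
    match = re.search(pat, text)
    if match:                      # a match object counts as true, None as false
        print(match.group())
        return match.group()
    print('not found')
    return None

hit = find('iiig', 'called piiig')    # prints iiig
miss = find('igs', 'called piiig')    # prints not found
```

Checking `if match:` before calling `group()` avoids the common error of calling a method on None.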
I’m going to make a little space. Special characters: I made a little table up here. The dot is special: dot matches any character, it means anything, except it does not match a newline. So, I could have said, “Well, I’m looking for, let’s say, any three characters and then a G.” That’s the pattern I’m looking for. And so, in this case, that’s going to find pig. So you can get, you know, a little bit of a sense of, like, how this is going to be, you know, more powerful than just regular Microsoft Word search.
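As a small illustration of the dot rule, here is a sketch using made-up sample text (not the text from the session).

```python
import re

# Two "any character" dots followed by a literal g.
dot_match = re.search('..g', 'a big dog')
dot_text = dot_match.group()
print(dot_text)

# The dot does not match a newline by default, so this search fails.
newline_match = re.search('a.b', 'a\nb')
print(newline_match)
```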
So, another rule that’s going on here is–so, for example, if I were to say “…G,” you know–or I don’t know, X. There. It’s not going to find that. Well, maybe I’ll do it the way I did it before to make it better, put the S. So, if I say “…gs,” that’s not found. There’s an asymmetry here where, in order to succeed, all of the pattern must match. So, in this case, I’ve got four characters or whatever. You know, I can’t say, like, “Oh, well, three out of four,” no. A hundred percent of the pattern has to be kind of consumed and matched, but that’s not true about the text, right? In all of my examples, I’m not matching the whole text. Whatever, I can use a little bit of this at the end of this word or whatever, so that’s, say, a fundamental asymmetry. The other thing that’s going to happen here is that the search is going to go left to right, and it’s satisfied as soon as it finds a solution.
So, we could make a case where it’s maybe–say, for example, I’m looking for “…g” and then over here I’ll make this, like, “Oh-oh, there’s a much better solution.” You know, this one is eg. What’s going to happen is it’s not finding that second one, right? It’s just not getting to it. But how can I make it find that one? What if I said, “Well, really what I’m looking for is an X and then two characters I don’t know about, and then the G.” So then, it, like, passes over the first one and finds the second one. Now, the regular expression engine–I’m not getting into too much detail–you know, I mean, it finds all the things it’s supposed to and it’s smart. It will backtrack. So, for example, what if I said, well, I’m looking for “…g” and then I insist that there’s an S. And here, I’ll go fix this one. It’s like, kind of an S here. So that succeeds. So you can imagine it, like, maybe sort of tries to make this one work and it doesn’t work, and so it said, “Well, that didn’t work.” It will keep going.
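The left-to-right and whole-pattern rules can be sketched as follows. The sample text is an assumption made up for illustration; it is not necessarily what was typed in the session.

```python
import re

text = 'called piiig much better: xyzg'

first = re.search('...g', text).group()    # leftmost match wins
second = re.search('x..g', text).group()   # only the later spot has an x

# Every part of the pattern must match: there is no "gs" in this text.
no_match = re.search('...gs', text)

# With an s added at the end of the text, the engine moves past the
# first candidate (g followed by a space) and succeeds later on.
later = re.search('...gs', text + 's').group()

print(first, second, no_match, later)
```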
So, the other thing is it is left to right. So, it finds the first one. Okay, yes. So the question is, why didn’t it match the first one? The trick here is I added an S to the end of the pattern. If you read the pattern, that says any character, any character, gs. And the problem is it could–this one? One more, this one. That will succeed. Oh, why didn’t it find that? Okay, yeah. So what it does is–sorry, yeah–it goes left to right.
And once it finds a solution, it’s like, “Okay, I’m done. Don’t try it anymore.” Yeah, question. >> For the three partners. >> PARLANTE: Oh, yes. Okay. Thanks for the question. >> Statement with a… >> PARLANTE: Yeah, so if you are actually looking for the period characters, so like what, say, you know, I don’t know. There’s a dot there. But you were right, it is C. And you can always put a backslash in and that inhibits the specialness of a character. So, I could look for c.g. You know, I can look for c.–“\.l” there. Now, I’m going to introduce a slight extra for the syntax here which Python has, which is–where it’s a little troubling like the backslash, it could be interpreted at different levels like maybe Python or like a Java–or–it might get taken out by the language.
So, without talking too much on this, I’m just going to say: Python has an option called a raw string, where you put a lowercase r to the left of the leading quote. And what the lowercase r means is, “Do not do any special processing with backslashes. Whatever I type, just send it through absolutely raw and uninterpreted.” This feature–I mean, it’s a little bit obscure, but it happens to be very useful for writing regular expressions because it frees us from having to worry about layers of backslash processing.
So, in fact, even though I’ve done my examples so far without the r, I’m just going to use the lowercase r for all of my examples from here on out, so I just don’t have to think about it. So, in this case, let’s just try it. Yes, so then it’s able to find this–you know, so it’s matching now. So, that’s an L with the dot. So that, you know, that enables you to talk about a dot if you need to. All right, so where we’ve got to so far: it goes left to right, dot matches any character, and all of the pattern has to be matched, but the text, you know, we don’t care. You don’t have to get all the text. So, let me show you some slightly more scary examples. So let’s say in my text here I’ve got–you know, there’s some text that I don’t care about and I’m going to say, you know, “:cat” and then there’s more text. So let’s say I want to pick that part out. So, the next sort of regular expression code I’m going to talk about is “\w.” So “\w,” which is actually what I have up here, matches what’s called a word character.
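A short sketch of the backslash-dot escape and the raw-string prefix described above. The sample text is made up for illustration.

```python
import re

text = 'called c.lled'

# Unescaped dot: matches any character, so "cal" in "called" is found first.
any_char = re.search(r'c.l', text).group()

# Backslash-dot: matches only a literal period.
literal_dot = re.search(r'c\.l', text).group()

# The r prefix leaves backslashes alone: r'\n' is two characters
# (backslash, n) while '\n' is a single newline character.
raw_len = len(r'\n')
plain_len = len('\n')

print(any_char, literal_dot, raw_len, plain_len)
```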
So that means a letter or a digit, and I think it also includes underbar. So in this case, I’m going to say, well, let’s say I’m looking for a colon followed by three word characters. So that’s going to work okay. So, now, I could say–similar to the “\w,” there’s “\d,” which matches a digit. So, for example, if I say cat and I’ll say, like, “:123.” >> What’s the difference between the “\w\w\w” and “…”? >> PARLANTE: So, it’s an excellent question. The question is what’s the difference between “…” and, you know, three “\w”? The “…” is just any character, like, it could be a space, a colon, anything. The “\w” matches a character so long as it looks like a word character, like, A through Z, zero through nine, underbar.
No, so if this was a Unicode string, the word, it’s smarter about being–there is a basic notion of an alphabetic character versus, essentially, punctuation. Oh, yeah, question. >> Zero through nine, is that true, the word character? >> PARLANTE: Yeah, a word character includes digits zero through nine. So, I mean, you know what it’s a little bit like is usernames, right? You know, blah-vadi-blah 123, you know, the username. That will show up in the later example. Okay. So, digit is little bit similar. Where is my cat example going? Oh, well, suppose we’ll do it again. So, if I were looking for, you know, have blah, you know, “:123xxx,” let’s see. So I could look for–actually, here, I’ll just look for–I’m just looking for three digits in a row. I can write that as–I’m going to put the r also here. So, that pulls out the 123. Now, it happens there’s–so you can see there are sort of these different regular expression codes: “\w\w.” You know, they kind of represent sort of common cases.
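The word-character and digit codes can be tried out with a sketch like this one; the sample strings follow the ones mentioned above, with `:a_1` added as an assumption to show the underscore case.

```python
import re

# Colon followed by exactly three word characters.
word3 = re.search(r':\w\w\w', 'blah :cat blah blah').group()

# Word characters include letters, digits, and underscore.
mixed = re.search(r':\w\w\w', 'blah :a_1 blah').group()

# Three digits in a row.
digits = re.search(r'\d\d\d', 'blah :123xxx ').group()

# \d is narrower than \w: letters are not digits.
no_digits = re.search(r'\d\d\d', 'blah :cat blah')

print(word3, mixed, digits, no_digits)
```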
I’m not going to show you all of them, but I’ll show you the ones that I think are the most useful and, you know, the ones that kind of show up in the problems we’re going to do later. All right, the last one I’m going to show here is whitespace. So, suppose I’m looking for, like, this pattern. It’s like, well, I want some digits and they’re separated by spaces. So the simplest way you do that is “\s,” which represents a whitespace character. And the “\s” is smart in that it knows about space, tab, newline; those all count as whitespace characters. It knows about the whole sort of space of whitespace characters. So, hopefully, that will work. So that finds it. >> One, two… >> PARLANTE: Yeah, so the question is, what if you had two spaces? Just hold that thought for a second because we’re about to get there. So, so far, I haven’t done any repetition. I’ve just had, like, you know, fixed numbers of things. So, probably the most powerful part of regular expressions is that there are these modifiers, plus and star.
So plus to the right of something means one or more of that, and star means zero or more. So, I’m just going to do that with my digit example. So the question is, like, what if I’ve got these three digits and, whatever, there’s just some amount of space in between them? So the way you would say that in a regular expression is, like, put a plus to the right of that “\s.” And that means–yeah, one or more–that element repeats. There’s just one or more of those. And I’ll do it with this one as well. So, here, turn there. So now that matches. So, adding the plus and the star, and I’ll do a bunch of examples with these, is, you know, exactly what we want to start matching more complicated patterns. Also, remember how I was saying that per character, regular expressions, I think, are pretty much the densest language that any normal person would use. And, like, look at that little–okay, look at that little bit of code, right? That really means something, right? Every character and the order–I mean, it’s all really pretty significant. So, it’s getting–I was about to say it’s going to get worse.
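The whitespace code and the plus modifier from the digit example can be sketched like this. The sample strings are assumptions built to make the spacing explicit.

```python
import re

# One whitespace character between digits: works only for single gaps.
single = re.search(r'\d\s\d\s\d', 'xx 1 2 3 xx').group()

# Wider gaps need \s+ (one or more whitespace characters).
wide_text = 'xx 1' + ' ' * 3 + '2' + ' ' * 6 + '3 xx'
fixed_width = re.search(r'\d\s\d\s\d', wide_text)        # fails
flexible = re.search(r'\d\s+\d\s+\d', wide_text).group()

# \s also covers tab and newline, not just the space character.
mixed_ws = re.search(r'\d\s+\d', '1\t\n2').group()

print(repr(single), fixed_width, repr(flexible), repr(mixed_ws))
```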
But what I meant to say is it’s about to get even more interesting. Whenever a professor uses the word interesting, you always know you’re in big trouble. All right, so let me do–I want to use those a little bit. So, let’s say I’ve got this random text and I’m looking for a colon and then, you know, let’s say, a word. You know, I’ll use kitten instead, and just more text here. Now earlier, I’ve said, oh, you know, I kind of knew how long the word was, but that was kind of a ridiculous assumption. The more typical way to do this would be, let’s say, well, there’s a colon and I’ll say, “And then, there’s just some number of word characters.” So, I would write that as “\w+.” That’s the much more typical way, right? Like, you’ve got a colon, something that sort of starts it, and then you’re like, “Yeah, whatever,” then just take all the word characters from there. So, if I write it that way, then it just picks up the kitten part.
So, that is beginning to look a little more the way these things actually work. >> A word character… >> PARLANTE: Yeah, so the space is not a word character. That’s what’s making it stop there. So, all right–and actually there was a question before, like, does it include digits and so forth? What if there was, like, “kitten123?” That still works. But if it’s like “kitten123” and I add a character, like, let’s say, ampersand, then it stops at the ampersand. So the thing is, the plus is greedy. It goes as far as it can and then it stops. So, the mnemonic for regular expressions is it finds the leftmost solution, the first one, and the largest solution. So, to the extent that there’s a plus or a star, it will just go as far as it can. Yeah, question. >> So period-plus will take you all the way to the end of the line? >> PARLANTE: Oh, yeah, so the question is period-plus would take you–well, it’s not a question.
It’s actually a suggestion. So, I’m going to rephrase it as a question. What if I said period-plus? And the answer is as he said. Like, yeah, it just goes all the way to the end, right, so period matches dot, ampersand, everything, okay, except for newline. All right, yeah. >> When you say largest, do you mean that if you say kitten123, 123, you will find the whole thing? >> PARLANTE: So, if I say–you mean here if I say kitten123, 123? And I’ll go back. So, this is “\w+.” Okay. What do you think it’s going to do? All right, yeah. So it will go through both 123s, and then it will stop because the space is not a word character but digits zero through nine are word characters. I mean, it’s a bit of an abuse of the word “word,” right? I mean, to a normal person, it’s just not like a word character.
But compared to, like, ampersand, it’s a word character. All right, so one more code I’m going to show, which I’ll just type in here, is backslash uppercase S is a non-whitespace character. It’s kind of like the opposite. And I–I’m a little saddened who ever designed regular expressions chose to have uppercase and lowercase mean something different because it just makes a little bit confusing. But backslash upper case S is really pretty darned handy. So, let’s say, for example, I knew that it was “kitten123&a=123&,” you know, yatta, whatever. It’s all this junk and then there’s a space. And I want to write a regular expression that picks up all of that and the ease–but I don’t know. It’s not just word characters, you know what I mean. It includes all sorts of stuff but at least there’s not spaces in it. And a common pattern is if I write that as there’s a colon and then there’s just some series of non-whitespace characters, and then they’ll just sort of terminate with the first whitespace character.
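The greedy behavior of plus, and the difference between “\w+”, “\S+” and “.+”, can be sketched as follows. The sample strings are based on the kitten examples above; the exact text typed in the session is an assumption.

```python
import re

# \w+ takes as many word characters as it can, then stops at the space.
kitten = re.search(r':\w+', 'blah blah :kitten blah blah').group()

# Digits are word characters, the ampersand is not.
stops_at_amp = re.search(r':\w+', 'blah :kitten123& blah').group()

# \S+ runs through any non-whitespace characters up to the first space.
non_space = re.search(r':\S+', 'blah :kitten123&a=123&yatta blah').group()

# .+ is greedy too: it runs to the end of the line.
to_end = re.search(r':.+', 'blah :kitten123& blah').group()

print(kitten, stops_at_amp, non_space, to_end)
```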
So that just catches the whole thing even though, like, the Lord knows what sort of characters are in there. So, just as a practical matter, that backslash uppercase S is potentially a handy way to catch that stuff. All right, so the plus, the star, and those backslash codes, those are the ones I want to build on. Now, so far my examples have been, like, you know, a little bit limited. Now, I’m going to do an example with e-mails, I mean, show how to build it up, and hopefully I’ll show you pretty practical patterns you can use. All right, so I’m going to make up some text here. I’ll keep the “blah.” So, let’s say we are looking for, you know, email@example.com and then there’s more junky text and let me just add an @ sign just by itself. I’ll just leave it at that for now. All right, so the problem I want to solve is pulling out e-mail. I want to imagine.
I’ve got this big body of text and I want to pull e-mail addresses out of it using regular expressions. So, first I’m going to try to write this as “\w+” and then there’s an “@” sign and then there’s “\w+.” It’s kind of, you know, a plausible first shot at this. So, if we run that, what’s happened here? So what’s happened is, well, you know, it finds–there’s the “@” sign. There’s the “p,” but it can’t go further left than that because the dot does not count as a “\w” character. And then likewise, it gets Gmail over here but then it’s confounded by the dot. So what I want to say with the pattern is I want to sort of expand–it’s not just word characters, really it’s word characters plus some other stuff.
So, in regular expressions, there’s this very old syntax for indicating a set of characters, and it uses square brackets. So, inside of those square brackets, I can put, “Well, here’s the set of characters that I’m going to allow here.” And actually the “\w” works inside of the square brackets because it’s just a common case. So, what I want to say here is, well, “\w” or, let’s say, dot, and then let’s say–well, just leave it at that. So, the question is–yes, that’s a very natural question.
That dot–it happens, you don’t have to backslash that one, that it understands that the dot inside of the square bracket is just the dot. >> “\w” is included in that? >> PARLANTE: No, so “\w” word character is just A through Z, zero through nine. It just not–oh I’m sorry, I’m sorry. That, no, that dot there means literally a dot. It doesn’t match any character. >> Because it’s in the square bracket. >> PARLANTE: Because it’s in the square bracket, it means literally a dot. I mean, it’s–I mean it’s sort of what’s going on here. I mean, you could do work on your Ph.D. and the text or something that it’s kind of levels of quoting. It’s kind of what’s going on here. And it’s–it is a necessary complexity to talk about this kind of thing. Let’s all just see what dot does. All right, so you can see, you know, you can sort of see that–so that picks up its word character and then it stops at the space essentially, right? So, it is–it’s not really a dot.
I’m sorry. It’s not a regular expression dot. It’s just a regular dot. All right, so I’m going to fix the other side as well. So, I’ll put a square bracket over here and I’ll put the dot in there, oops, and the plus goes outside, right, I’m saying that whole set repeats. All right, so that’s kind of fixed it. So, the square bracket is probably the most convenient way if there’s some set of characters that you’re looking for. You know, you kind of build, let’s say, well, yeah, here’s what I’m looking for.
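A sketch of the first-try pattern and the square-bracket fix. The address below is a made-up placeholder chosen for illustration, not the one from the session.

```python
import re

# Hypothetical sample text; the address is made up for illustration.
text = 'blah alice.b@example.com yatta @ '

# \w does not match a dot, so both sides get cut short.
too_short = re.search(r'\w+@\w+', text).group()

# A bracketed set allows word characters or a literal dot on each side.
# The + sits outside the brackets, so the whole set repeats.
full = re.search(r'[\w.]+@[\w.]+', text).group()

print(too_short, full)
```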
>> What if I had a dot before Nick? >> PARLANTE: I’m sorry, dot, you mean, like, a… >> Text. >> PARLANTE: Oh, it would just pick it up. I mean, we’ve said, you know, to the left of the @ sign, just, you know, as many of these as possible. Now, suppose we wanted to say that the first character can’t be a dot, it must be a word character. Can you think of a way in the pattern we could say that? Yeah, I could have a single “\w.” I would say there must be a word character and then it’s followed by one or more of the thing that includes a dot. That’ll be it. Although, then to be super–I think, really correct, we should change that one to a star, right? Then there must be one word character and then zero or more of these. >> Inside the brackets, the order doesn’t matter in the… >> PARLANTE: Yes, once it–exactly, yes.
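The refinement discussed in this exchange, a required leading word character followed by zero or more characters from the set, can be sketched as below. The sample text with a stray leading dot is an assumption made up for illustration.

```python
import re

# Hypothetical text with a stray dot in front of the address.
text = 'blah .alice.b@example.com yatta'

# The set allows a leading dot, so the match starts at the dot.
loose = re.search(r'[\w.]+@[\w.]+', text).group()

# One required word character, then zero or more from the set.
strict = re.search(r'\w[\w.]*@[\w.]+', text).group()

# Order inside the brackets does not matter: same set, same match.
reordered = re.search(r'[.\w]+@[.\w]+', text).group()

print(loose, strict, reordered)
```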
So once it’s inside the bracket, the order doesn’t matter or, I mean, try to use the word set, it’s a set of characters. All right, so let’s try that. Yeah, so then that refines them. All right, so anyway, I mean–yes, it is a sort of a bottomless topic. I mean, you know, there’s–but, I mean, hopefully, I’ll show you stuff that are useful. All right, so I’ve got my e-mails example, so that’s the first thing. So that’s just using group, right? All I’ve been doing there is just using group. So now, what I’d like to show you is I’m going to stop using my find function. I’m going to start doing this raw here. And what I’d like to do is I want to imagine that I want to pick out the username and the hostname separately.
I want to sort of pick those out. And so, just go back here. I’ll just change this to “m=re.search,” so I’m just doing it manually again. I’m going to change this back to just the regular way here. And you can do this by putting parentheses in the regular expression around the parts that you care about. Now, the way I’m doing it here, the parentheses are not changing what it’s going to match. I’m just kind of putting in this markup saying, “Well, these are the two parts that I care about.” And here I’ll get rid of this dot.
So, if I do that, right, so I put parentheses around the part that matches the username and parentheses around the part that matches the host, and the @ sign I’ve just left out; I don’t care about it. So, now I’ve done this. If we look at “m,” it’s a match object. If I say “m.group,” it’s the whole thing, like, just like we’ve always been doing. But there’s also a form of group where you pass it a number. So, if I say “m.group(1),” that’s now just the username part. And the “1” refers to the first set of parentheses. So you count the parentheses–and it goes by the left parenthesis because you could actually nest them.
So the “group(1)” refers to the leftmost parenthesis, which, if you look up here, there’s, like, yeah, that’s that guy. And then “m.group”–oops, you can just guess, “group(2),” that’s now the hostname. So, the way this is going to work for, you know, some problems with regular expressions is a lot of times you’ll write a regular expression for the thing you’re kind of looking for, right? Well, I’m looking for a URL, or an I.P. address, or something, and then you’ll maybe put in parentheses to say, “And here’s the part that I want to extract.” Then you’ll call “re.search,” you’ll get this match object and then you’ll use “group(1)” and “group(2),” whatever, to just kind of–it’s already, you know, parsed it for you. You’ll just pull out the parts that you want as text and then process from there.
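Group extraction as described above, in a minimal sketch. The address is a made-up placeholder; `user` and `host` are illustrative names.

```python
import re

text = 'blah alice.b@example.com yatta @ '

# Parentheses mark the two parts to extract; they do not change what matches.
m = re.search(r'([\w.]+)@([\w.]+)', text)

whole = m.group()     # entire match, same as before
user = m.group(1)     # first left parenthesis: the username part
host = m.group(2)     # second left parenthesis: the host part

print(whole, user, host)
```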
Yeah, question. >> If I put a plus or a star after the first parenthesis and it matches that twice, is that “group(1)” and then “group(2)” or is it still a group…? >> PARLANTE: Yeah, so the question is if there’s a plus after the parenthesis, you know, does that change how the group numbering works? The answer is no. The group numbering is based on just statically looking at the pattern as an unchanging thing and just counting the left parentheses, going from left to right. So that is the shortest answer I can give there. All right, so “re.search” is my second favorite Python regular expression function. My absolute favorite one–actually, let me make my data a little more complicated here.
I guess I’ll also add a “foo@bar.” My absolute favorite regular expression function is called “findall.” So, I’ll just say “findall” here. And what “findall” is going to do is–I’ve still got a pattern, and now I’ve just changed my text; you know, I put a second e-mail address in there. What “findall” does is it takes the pattern and, rather than just stopping at the first match, it continues and finds all of the matches and it returns to you essentially the “.group,” right, the whole text, just in a Python list of strings. So, for example, we talked about, for a file, how you could just say “f.read” to get the entire text as one string. So a pattern I enjoy, because it just saves me so much work, is I just call “f.read” and I pass that in as the second argument to “findall.” I just feed the entire file into “re.findall.” I have a pattern, I just let it rip through the entire text, skip newlines, whatever, all that stuff it just handles, and it just pulls out the things I want and returns them to me as a Python list.
And then you can write a for loop, you know, stuff we were doing yesterday. Now, you can just process this list. So that’s really how you use this stuff. So, in this case–notice I took the parentheses out. So I just left this as a simple pattern. So I just got, you know, the simple matches. There’s this one other variation I can do here. What do you suppose is going to happen if I put the parentheses back in? I’m, like, well, this is, you know, it’s a pattern but it has this grouping in it.
And, yeah, we have tuples. What it’s going to do is, if there are parentheses in there, instead of just returning the whole match, it says, “Oh, well, there’s two parens. I’m going to return tuples of length two.” So each tuple represents a single match and then the tuple just has the groups in there. So, yeah, you can see where this can be pretty handy. If you’ve got some big file and there’s some part of it you care about, you just want to rip it out as lazily as possible, so “re.findall.” >> You’ll lose the format. >> PARLANTE: Excuse me? >> You’ll lose the format. >> PARLANTE: Yeah, I mean, I would say–so let’s say you lose the format.
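The two findall behaviors described above, sketched with made-up sample text (in practice the text could just as well be the result of `f.read()` on a file). Python 3 is assumed for the print call.

```python
import re

text = 'blah alice.b@example.com yatta @ foo@bar '

# No parentheses: a list of whole-match strings.
addresses = re.findall(r'[\w.]+@[\w.]+', text)

# Two groups: a list of tuples of length two, one tuple per match.
pairs = re.findall(r'([\w.]+)@([\w.]+)', text)

for user, host in pairs:
    print(user, host)
```

The lone @ sign is skipped in both cases because the pattern requires at least one character on each side.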
Let’s say, well, the regular expression is narrowing. You get to say what you want to keep. And so, if you want to keep more, you know, write the regular expression bigger, you know, to keep more. All right, so it’s, you know, not hard to imagine how it’s going to be easy for you to work that in to doing stuff later on. So, I’ll just mention, there are some optional arguments that you can add here as a third argument to the regular expression functions. And what I’m going to actually do, I’m going to do a dir on re. So that’s the re module. I can say, “Oh, yeah, hey, what are these symbols in here?” So these are some constants. So, if you add the constant “IGNORECASE”–this works on “.search” or “.findall”–that means that it’ll consider upper and lowercase the same.
So a lowercase letter matches the same uppercase letter and vice versa. You can also do “DOTALL.” I had said that the dot matches any character except for newline, and that's kind of a historical thing, because the processing tended to go line by line. If you add the “DOTALL” flag, then the dot will match newline as well. Because right now, if you use dot, your pattern can't span more than one line. Although, if you use “\s” where you think there's a newline, that will cross the line break, but the dot will not go over one. So, if you add “DOTALL,” you can turn off that behavior and the dot will just truly match anything.
So, if you were to say “.*” with nothing else, it would just go to the end of the file. So those are, I think, the most common ones to use there. >> at the end of, like, . >> PARLANTE: Yeah, so let me give you an example. So the way you would use those constants is as a third argument. So you would say, “re.IGNORECASE” or whatever. That's the last argument there. All right, and so, the handouts for today, the first one, if you didn't get one, you can get one in a second.
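A small sketch of passing those constants as the third argument. The two-line sample string is an assumption chosen only to make the difference visible.

```python
import re

# Hypothetical two-line sample text.
text = 'Alice\nBOB'

# Without a flag, matching is case-sensitive.
plain = re.findall(r'bob', text)
# With re.IGNORECASE as the third argument, upper and lowercase count as the same.
ignoring_case = re.findall(r'bob', text, re.IGNORECASE)

# By default the dot stops at a newline, so .* ends at the end of the line.
one_line = re.search(r'A.*', text).group()
# With re.DOTALL the dot matches newline too, so .* runs to the end of the text.
to_the_end = re.search(r'A.*', text, re.DOTALL).group()

print(plain, ignoring_case)
print(repr(one_line), repr(to_the_end))
```

Running “dir(re)” in the interpreter lists these constants along with the rest of the module's names.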
There's a nice explanation of regular expressions there, and it shows the syntax and a lot of the kind of stuff that we're doing here. All right, so I think we're ready for an exercise. So the exercises today are going to be a little bigger. There are three of them I'd like to do today, and they kind of incorporate all of this sort of stuff. So let me demo this first one. So, this first one is going to involve a brief foray into a little understood part of the government called the Social Security Administration.
Now, the Social Security Administration, in my life experience, is in charge of putting certain fields on everyone's paycheck that no normal person understands, and you just accept whatever is going on. But they do this other thing. If you do a Google search for Social Security Administration baby names, they do this thing where they keep track of what the popular baby names are for babies born in each year. And they've been doing that actually for a hundred years. So you could look at 1900, 1950, whatever, and you can just see, it turns out for baby names, there's a popularity to it.
Names sort of ebb and flow. So, I look at this and I see an assignment idea. So, let's go to, oh, I don't know, 1980. I'm not even going to try and think about when you guys were all born. Do you know what I'm saying? And let's just go with, like, I don't know, the top thousand. So we get those, and you'll look at this and you're just thinking, like, “tr,” “td,” that kind of thing, you know, good thinking. All right, so, here it is just for 1980. It's the list of baby names, and what this is saying is that for boy children, Michael was the most popular name. And then the next most popular was Christopher, Jason, David, and so on. And over here, we have Jennifer, number one; Amanda, number two; and it just goes down to, like, you know, here we have Bobbie, Emil, Jermain, Kraig with a K, down to the less popular names.
All right, so what I would like to do, going back to Python here, let's see what I have. Okay. So I'm going to go into “day2” here, and there's a directory, baby names. So, if I look inside here, let me look at “baby1990.” I just sort of copied and cleaned up just a tiny bit the text from the Social Security Administration site. I put this very . Okay, well, there's some partly written CSS and whatever. And then eventually, here's the “h1,” and then here's a table, blah, blah, blah, and at some point it's going to say, all right, here we go. So here's the “h3.” This is popularity in 1990, or as I like to think of it, popularity in “\d\d\d\d.” And then there's some whatever junk you want to skip, and then here's a “tr.” And if you don't know HTML, well, whatever, that's the HTML for that first row.
So it says “tr,” “td,” and then there's the number one. And then there's more “td” stuff, and there's Michael, and there's Jessica, and then here's row two, and so on. And it just goes on, like, there's all the data. It's beginning to look like an actual problem. All right, so, the first thing I want your baby names program to do is, given a file like “baby1990.html,” and I'm going to pipe this into more, I want you to rip through that entire file. I want you to figure out what year it represents. I want you to pull out all the names and all the ranks.
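A rough sketch of pulling the year and the rows out with “re.search” and “re.findall.” The markup below is an assumption modeled on the description above (an “h3” with the year, then “tr”/“td” rows); the real file's attributes and spacing may differ, and the second row's names are placeholders.

```python
import re

# Assumed markup; the actual baby1990.html may look different.
html = '''<h3 align="center">Popularity in 1990</h3>
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>
'''

# The year: "Popularity in" followed by four digits, captured as group 1.
year_match = re.search(r'Popularity in (\d\d\d\d)', html)
year = year_match.group(1) if year_match else None

# Each row: three td cells holding rank, boy name, girl name.
# With three groups, findall returns a list of 3-tuples of strings.
rows = re.findall(r'<td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>', html)

print(year)
for rank, boy, girl in rows:
    print(rank, boy, girl)
```

Note that the ranks come back as strings, so they need “int()” before any numeric comparison.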
I want you to organize it so that you can then produce a printout that's in alphabetical order by name, just as I've shown here. So first you print the year, and then I want to see Aaron 34, Abbey 42, and so on. So you just show an alphabetical list of what all the names are. So that will get us through the… >> Combine male and female? >> PARLANTE: Yeah, so what's going to happen is, it's a strange case, but sometimes a name will appear as both male and female. And I'm not making any distinction between male and female.
So in that case, I want you to give it the more popular rank, essentially whatever the smaller number is, all right? So, oh, I'm going to go back to Python for just a second. There's something I mentioned, I think, maybe very briefly yesterday, but it's about to come up, which is we did file opening, right? So, “f=open(filename,” so I'll remind you, if you want to open a file for writing, then for that second argument you pass a “w.” So, yesterday we just did “r.” We just read the file, so that's fine.
So, you put a “w” there for writing. And in that case, probably the simplest way to write to the file is that it has a “.write,” and then you can just have whatever kind of text you want in there. And you've got to be careful, you can zero out a file here, but it'll write that text. And so, there's this as well. That's about to come up. All right, so part A is I want you to just pull out all the names, you know, use a regular expression, findall, maybe a dictionary, I mean, just totally regular work.
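A minimal sketch of opening a file with “w” and calling “.write.” The temporary directory and the filename are assumptions so the example does not clobber a real file.

```python
import os
import tempfile

# Hypothetical path inside a fresh temp directory.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# 'w' opens for writing. Careful: if the file already exists,
# opening it this way zeroes it out before anything is written.
f = open(path, 'w')
f.write('1990\n')       # write() does not add a newline for you
f.write('Aaron 34\n')
f.close()

# Read it back with 'r' to check what was written.
f = open(path, 'r')
contents = f.read()
f.close()
print(contents)
```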
So for part B, there's an option called “--summaryfile.” And I'm going to run this on star, so I'm going to say “baby*.html.” In that case, I want you to produce no output. What did that just do? When the “--summaryfile” option is given, what I want you to do, oh, notice in this case, I ran it on “baby*.” So I ran it on all of the baby files, right? So the shell just expands that out, so in that case “argv” is going to be all of them.
If “--summaryfile” is given, like I've just done, what I want you to do is, for each file, I want you to read it and create a new file with the same filename but ending in “.summary,” and then I want you to take that output that we were printing to the screen earlier and write it to that file. Now, there's a little bit of a trick here. So, for example, when you have a low-level function, you don't necessarily want it to print to standard out directly. You want to have a function that, given the file, returns to you, say, a Python list or a dictionary or something. And then the code that got that data structure can choose what to do with it. It can either print it to standard out, or it could write it to a file. So you've got that . I'm sorry. Certainly that technique will come up trying to solve this. All right, so once you've got that, then you can do something kind of neat. So, I've got these “.summary” files, and what's happened is, because I've done them over a decade, you can sort of see patterns, right? So this is going in increasing order by year.
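One way that separation might look, as a sketch rather than the assignment's official solution. The function name “extract_names,” the sample markup, the made-up names and ranks, and the choice to append “.summary” to the full filename are all assumptions.

```python
import os
import re
import sys
import tempfile

def extract_names(text):
    """Return [year, 'Name rank', ...] sorted by name (hypothetical helper)."""
    year = re.search(r'Popularity in (\d\d\d\d)', text).group(1)
    ranks = {}
    rows = re.findall(r'<td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>', text)
    for rank, boy, girl in rows:
        for name in (boy, girl):
            # If a name shows up twice, keep the smaller (more popular) rank.
            if name not in ranks or int(rank) < ranks[name]:
                ranks[name] = int(rank)
    return [year] + ['%s %d' % (name, ranks[name]) for name in sorted(ranks)]

# Made-up sample in an assumed format; 'Jordan' appears in both columns.
sample = '''<h3>Popularity in 1990</h3>
<tr><td>1</td><td>Michael</td><td>Jessica</td>
<tr><td>2</td><td>Jordan</td><td>Ashley</td>
<tr><td>3</td><td>Aaron</td><td>Jordan</td>
'''

names = extract_names(sample)   # the low-level function only returns data
text = '\n'.join(names) + '\n'

# The caller decides where the data goes: standard out...
sys.stdout.write(text)

# ...or a summary file (written to a temp directory here).
summary_path = os.path.join(tempfile.mkdtemp(), 'baby1990.html.summary')
with open(summary_path, 'w') as out:
    out.write(text)
```

Because “extract_names” never prints, the same function serves both the plain run and the “--summaryfile” run.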
So my name, “Nick,” on the list was, like, not looking so good, and it's getting worse. Now, well, there's a lot of interesting data in here. So, you know, interesting data makes for fun, is what I'm trying to say. Here's probably the funniest part of this thing. There's a “Trinity,” and the question is, “In what year did the movie The Matrix come out?” And, yeah, there are another few things that you could do here.
So, like, well, maybe The Matrix was reacting to a social phenomenon, or it's the other way around. It's all very complicated. >> PARLANTE: Yeah, yeah, there's a New York Times Magazine . Anyway, it just turns out this entire topic of baby name popularity is sort of very interesting, and now you get to do, like, the nitty-gritty work of actually pulling out that data. All right, so here's what I'd like you to do: work on this, and then have lunch, and I'd like you back here at, let's say, 12:45, all right? So a little bit of coding, a little bit of lunch, and then back here, all right, go.