Yeah, of course, there can be some overlap. Normally not everyone who registers attends. I don't know about online events, but for face-to-face events with industry we have attendance rates of about 50%. Yeah. But I think the topic is very interesting. In the previous advertisement we posted the link on Teams, so people may be confused. Last time we checked, the number of attendees was going to be very high, more than 200 people or so.
So we decided to change the link for the webinar. Some of them might still be going to the old link for the Teams meeting, so we have already put a person in there to let people know to use the new link. I can share my screen, obviously? Can I use it? In the webinar, yeah, it should be possible. Let me ask my team members to see what it looks like.
Yeah, I'll try, and if it doesn't work, no problem. Can you try now to see what's going on? Because I think you're a panelist already. It says the host disabled participant screen sharing. Alright, you do not see the share screen option? I see it, but it's disabled. It says "Host disabled participant screen sharing". Don't worry, I will make you a co-host so that you can share your screen. Maybe a panelist is considered to be a participant. You can try. I think you are a co-host now, so you should have it. Okay, looks good. Now you've got it. Okay. That's alright. Just sorting this out beforehand. Now we can say hello to each other. I'm not really looking at the chat, but if there is anything exciting in the chat, just let me know. Yeah, sure, let me monitor the chat. I can monitor the chat and I can also summarize the questions as well.
Okay. And at the moment we are live streaming on Facebook as well. Alright. The quality looks very good. Let's hope the Australian internet keeps up; it's not that reliable. It should be alright. How's the weather over there, Harry? Oh well, it's sunny. It's actually my favorite time of the year. Winter is a bit cold and sometimes rainy, and then as we get into August and September there is sunshine almost every day. Blue sky. You know Sydney: a beautiful city with lots of water, and at the moment everything is very quiet, so there are lots of birds, even in the city. Yeah, I miss it too. It's nice. Do you also want to say a few words about AVSE? Yes, yes. Is this the first time you're organizing this? We have conducted some internal seminars already, which were also open to others, but yes, this one is officially the first one.
And we also aim to invite speakers frequently so that we can interact with others in the field as well. And is the focus more academic or industry? Both at the moment, but the core is going to be academic. We try to connect people academically, and we also include some practitioners, meaning a practical approach: we try to make sure that what we do in research can be applied in practice as well. Yeah. That is also what we are trying to do. In a way that is what academia should be: not isolated, but together with stakeholders.
Yeah. Changing things. In terms of practitioners, professionals and industry, we particularly focus on financial modeling, such as predictive models like the ones you are going to present. Yeah. Over the years I have noticed that Vietnam is pretty strong in economics, so there are a lot of good people coming out of the country. And you guys also have a global spread. Yeah, at the moment there are quite a lot of scholars, both in academia and in industry. That's why we want to connect this resource and work together to learn, share ideas and improve together. Nice, right. We'll start whenever you want, right? Sure.
Yeah. I think we have about fifty people here already, so it's a good time to start. First of all, hi everyone. I hope that you are doing great, and thank you very much for your interest in our first webinar. Thank you also very much to Harry for agreeing to give a talk on a very interesting topic, credit risk. Before we start, let me take a few minutes to introduce our network. As you can see from our name, the Finance and Banking Network, we are a nonprofit organization specializing in finance and banking within the Association of Vietnamese Scientists and Experts (AVSE Global). Our network aims to serve as one of the primary channels for Vietnamese scholars and practitioners to connect, learn and share knowledge, not only among ourselves but also with international experts.
Like many of you here. We would very much love to connect with you, to exchange ideas and to build collaborations together. For this purpose we have been developing multiple platforms, including a research pathway. We conduct research training and research connection activities, we organize seminars, webinars and educational workshops, and in particular we have hosted the annual conference, the Vietnam Symposium in Banking and Finance (VSBF), since 2016.
This year we again have the conference, the Vietnam Symposium in Banking and Finance 2021, with a speaker from the University of Kansas, USA, who is the co-editor of the Journal of Money, Credit and Banking, and Professor Haluk Ünal from the University of Maryland, USA, who is the managing editor of the Journal of Financial Services Research. Secondly, we have also organized an editors' session with several professors, including Professor Jonathan Batten, editor of the Journal of International Financial Markets, Institutions and Money and of Finance Research Letters, among many others, and we also include editors of the Review of Quantitative Finance and Accounting. That is all at this year's conference, VSBF. To make our conference even more attractive, we also provide many publication opportunities, with special issues in several journals, including Economic Modelling and Research in International Business and Finance.
So please check out our VSBF 2021 website for further details and submission guidelines. The link will be provided in the chat box by one of our members right now. Let me start with the webinar. I won't take up any more of your time, and we go straight to the webinar presented by Harry. But first, let me introduce Professor Harry to you. I personally have had the honor of being a mentee, a colleague and a friend of Harry since 2015, so it has been my great honor and my great pleasure to work with Harry for many years already. Harry is currently Professor of Finance at the University of Technology Sydney. He specializes in banking systems, credit and liquidity risk, housing finance, insurance, machine learning, regulation, securitization and structured finance. Harry is a strategic partner for banks and regulators in Australia and North America. He has influenced financial institutions that have applied his work to improve their risk management practices.
His award-winning research has been widely cited and published in leading journals, and he currently serves on the editorial board of the Journal of Risk Model Validation. Harry is also a very dedicated educator who consistently receives excellent student feedback, and his students have produced impact for industry and research. Harry's textbooks on credit risk analytics are used around the world in data analytics courses, and as it happens I also use the book in my own teaching, in the advanced analytics course and in the analytics courses for both undergraduate and graduate students. So now, please join me in welcoming Harry with Deep Credit Risk: Machine Learning with Python. The floor is yours. Thank you very much, Harry. Thanks. First of all, thanks to you and your peers at AVSE.
It's a great network and a beautiful conference. We have had papers presented there in the past, and the lineup is also very impressive, with Robert DeYoung and colleagues. It's a growing network and really good things come out of it. I also like the touch you have of bringing industry and academia together. We need more of that, because keeping the two apart is not good. With that, I didn't really want to give a paper presentation, which is common for our background of being in universities, because I thought the audience is broader than that, and we have people joining today with lots of ambitions. We all share a common interest, maybe in data science, maybe in finance, maybe in risk management or modeling, but we have a very heterogeneous group today. I suspect, just from the large number of registrations this event has attracted, that a lot of people are interested in things like Python, machine learning and credit risk analytics, and so we bring them together. But we only have an hour or two.
I don't really want to go into the nitty-gritty details. I want to talk about concepts, about what really matters and about the intersections. With that, let's get started. It is a webinar format, so I don't see you guys. If you want to ask questions, you can put them in the chat. He's going to monitor the chat and let me know what's going on. So feel free to put everything in the chat that bothers you or that you would like to have an answer to, and we'll try to address it if we can. I also need to say that the work I'm going to show you might be interesting for many reasons, either because you work in industry or because you do research, but it is also really important that you bring this material into the teaching space, from PhD students to honours students, master's students and bachelor students.
Maybe you find some elements today that you could apply in your courses, and that would also be a great outcome of today. The work that I'm going to show you is not only my own work. I've contributed to it, but it is the work of a larger network and also of a colleague, another academic who has been very important in my research life, Daniel Rösch.
Daniel and I have worked over many, many years in the area of risk modeling, for research but also for industry models. He was my mentor when I was doing my PhD and thereafter, so a lot of the material is based on his work. He is a professor of statistics at the University of Regensburg in Germany, a beautiful city too. So with that, let's get started. I see things coming in on the chat; I'm not looking at it so as not to be distracted. In this presentation we'll talk about how to manipulate data and produce results with data, and last but not least, I'm going to benchmark machine learning techniques so that we get a feeling for what machine learning can do for credit risk, but also in other areas.
Let's get started with the current situation. I'm speaking from Sydney, Australia. I'm sitting in my office, my family is around me, and everyone is at home because we're in lockdown. This has been the situation for many of us over the last two years, and with that a lot of things have changed: the way we define work, the way we work, and what we spend our money on. A lot of things have changed, not only for households but also for companies. Companies have changed the way they organize their production processes, the way they sell their products and the way they ship their products.
I think there has hardly ever been a period in human life in which things have changed as dramatically as in the last two years. Of course we have had other episodes, but I think the speed of change is different: things have changed in a very short period of time. The current situation is very unique, and credit risk is a good example of that. When COVID started early last year, in 2020, the notion was that credit risk and credit portfolios in particular would become very distressed. A lot of banks were worried, and effectively they ramped up their provisioning.
They tried to preserve capital by cutting dividends, and if you worked in credit risk in a bank, you would have had one of the busiest times of your life. There were loan deferments mandated by governments, where people had to manually override systems that did not have the concept of loan deferments embedded, and people had lots and lots of modeling issues. Now, with governments stepping in, various governments increasing their spending through salary payments, helicopter money and all sorts of subsidies to households and companies, the downturn that was anticipated has been avoided. We're back in happy days, and today credit risk is at historic lows. I recently saw a presentation by an academic summarizing risk outcomes, and he said: well, if you look at default rates, they are currently at historic lows globally, so effectively there is no credit risk and we shouldn't worry. But I would caution: things move in cycles, and perhaps the government expenditures have just postponed
what we might have seen last year by some period of time. Okay? There are clouds darkening, and it started in New Zealand. The first announcement was made last week by the Reserve Bank of New Zealand, putting a spoke in the wheels of banks by effectively mandating banks to lend at a lower loan-to-value ratio, let's say 60% for loans. The Australians are now following and thinking about implementing similar action. Usually there is a chain in terms of regulation, from New Zealand via Australia and Asia to Europe and North America. New Zealand regulates first because the banks are not New Zealand based.
They're Australian, and it is always easier to regulate an overseas industry. Australia follows because they see themselves at the top of regulation, prudential regulation and banking. The Europeans and Americans often follow because they want to safeguard their borrowers, the manufacturing companies, because they live from producing goods that the rest of the world consumes. So perhaps what we saw last week starting in New Zealand may trigger a trough and increased loss rates going forward. A lot of things come out of this discussion, and it does not necessarily mean that we have high or low risk; what matters more is how we measure the risks. Across the world we have two situations. We have regions like Europe and North America that have an abundance of loss information because they have gone through economic downturns in which their loans defaulted. The global financial crisis in particular has led to a huge data collection of losses. Asia, Australia and New Zealand sit at the other end.
We have had a growth period ever since we started collecting loss data. We have very few default events, and we know very little about what happens in the case of default, just because we were in very benign, very positive economic periods. So I just want to highlight that there are two parts of the world, people who have a lot of loss data and people who don't, and the modeling approaches people apply might differ between these two groups. I am coming from Australia, where we don't have much data. People with rich data can often basically just subsample.
They can just say: let's take the global financial crisis, a couple of years, and look at how bad the risk outcomes were in those years. With these rich databases you can do scenario-based stress testing, and you can also build much more complicated models such as switching models. We see the outcome of these two worlds in many cases. If I look at the literature as an academic, it is European and American studies. Why? Because the databases in Asia and Australia are not good enough to provide models that meet academic standards. If I were to estimate a model based on Australian or New Zealand data, my peer referees would very quickly turn it down, asking how general these models are and whether they really reflect what is going on in an economic downturn. We don't know. The last downturn in Australia was in 1991, and in 1990 we did not collect data,
either public or private within the banks, on which models could be built. Those issues, I think, will become a problem. COVID may result in higher loss rates, and if that happens we are just following, because we are always building models on past observations. Wouldn't it be great to have models that actually predict what is going to happen in the future before being exposed to it? I also want to highlight the two big dimensions of credit risk, and you'll see them in our talk later on. The big drivers are liquidity and equity. Regardless of whether you talk about corporate lending, mortgage lending or any other form of consumer lending, any credit risk model should include these two dimensions, and at the moment COVID has an impact on both.
So at the moment we talk about government spending and the purchasing of bonds, which has led to inflated asset values. Inflated asset values have effectively increased equity values, be it at the household level or at the corporate level, and that suggests very low risk. Why? Because the so-called loan-to-value ratios, or collateral ratios, are very, very low: equity ratios are high but loan-to-value ratios are low, indicating very safe loans. Obviously, if we talk about asset bubbles and the bursting of asset bubbles, that may not hold. Second, when we talk about liquidity risk, there is currently also a concern that one of the many expenditure items of borrowers is the interest payment, the servicing of the loans, and that inflation rates and interest rates will one day move up. They have moved up already,
I would say, if not at the central banks then in secondary markets, bond markets for example. We talk about bank funding costs and the banks passing on these increased funding costs to borrowers. The impact of rising interest rates on borrower expenditures will cause further distress in servicing loans going forward. Those are all outcomes where COVID is involved.
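The two dimensions described above, equity and liquidity, can be illustrated with a small sketch. This is not code from the talk or the book: the function names and all figures are made up for illustration, and the interest burden is a deliberately simplified measure (interest only, no principal repayment).

```python
# Illustrative sketch only: function names and all figures are assumptions.

def loan_to_value(balance: float, collateral_value: float) -> float:
    """Equity dimension: outstanding balance relative to collateral value."""
    return balance / collateral_value


def annual_interest_burden(balance: float, rate: float, income: float) -> float:
    """Liquidity dimension: share of annual income spent on interest."""
    return balance * rate / income


balance, income = 400_000.0, 100_000.0

# Equity: a fall in the asset value raises the LTV for an unchanged balance.
ltv_inflated = loan_to_value(balance, collateral_value=800_000.0)
ltv_after_fall = loan_to_value(balance, collateral_value=500_000.0)

# Liquidity: a rate rise from 2% to 5% raises the interest burden.
burden_low_rate = annual_interest_burden(balance, rate=0.02, income=income)
burden_high_rate = annual_interest_burden(balance, rate=0.05, income=income)

print(f"LTV: {ltv_inflated:.2f} -> {ltv_after_fall:.2f}")
print(f"Interest burden: {burden_low_rate:.2f} -> {burden_high_rate:.2f}")
```

The point of the sketch is only that the same loan can look safe on both measures today and risky on both after an asset price fall and a rate rise, without the borrower changing anything.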
Everything looks very rosy and very positive at the moment, but it can very quickly move in a different direction. And you know, for us as modelers, credit events are actually good. Banks losing money is not a good outcome, but any additional observations we get on these events are a good outcome. I should also highlight the other dimensions, for example time effects. When you are working on one of these problems there are different time dimensions. It is not only about the macroeconomy, it is also about other effects. Let me just pick one, the vintage effect. We've discussed that the abundance of money in the financial system has led to loose lending standards in the banking system, and that usually also suggests an increase in risk in future years. Last but not least, the economic cycles that I described, the ups and downs, cause challenges in the machine learning field, because effectively you are training machines on economic episodes, and we are using these machines mainly not to forecast individual outcomes.
The bank doesn't care so much whether one mortgage loan defaults. With millions of loans outstanding, it worries more about whether a significant fraction of these loans defaults. The testing or application of these models out of time and out of sample, in other economic regimes and periods, is a huge challenge for any modeling exercise, so I'll also be talking a little bit today about the validation of models. Harry, we just got one question, asking whether you can provide a PDF of the slides later on. The answer is no, because it would cause a bit of work on my end. It's not because I'm a lazy person; I'm going to show you why this slide is not in PowerPoint. This slide is what we call a notebook. A notebook is an IDE for Python.
It's an HTML format, and it's a bit harder to move it into a PDF. I can move it into a PDF, but we have different dimensions and I would have to store them separately, which might confuse you. But I know you are recording this, so you can just watch the recording, if that's alright. Yes, I'll also give that answer later. The webinar today is being recorded and will be posted on our YouTube channel, so anyone who wants to can go back to the video on our YouTube channel later. Alright, thanks, beautiful.
So definitely watch the video. But also, what I'm going to talk about are things we have summarized in the book, so now a bit of advertising. This is the book. It's called Deep Credit Risk: Machine Learning with Python. What I'm covering today is a very tiny fraction of the concepts in it, maybe ten pages out of 500, and there is so much more in the book. If you want to dig deeper, I strongly recommend getting an e-book or a physical copy, or if you can't afford it (it's not very expensive), just ask your library to buy it. Every university library in the world should have it. Okay. What the book does is a new way of teaching. It has the character of a traditional textbook, but if you open it, and I'll just open an example here, it has code, output and text. The book is written in Python from front to end.
Every single example that you see in the book works, because the machine has compiled it. There is no room for errors. Unlike in other books, there is not just a formula; there is actually the code that supports everything, and what you see can be reproduced. You can download the data and the code that underlie the book
at deepcreditrisk.com. You go there, download the data, and then you can basically reproduce all these examples using Python. There are also tons of videos. There is a YouTube channel if you want to have a look at some of them. There is a video that helps you get started with installing Python, for example, if you're totally new to the game, and then there are little videos that take you from one level to the next.
So check it out. All you need to do is go to this website (I'm not sure whether it works right now), and then you either follow the link for the data and the code, or you look at the videos. Okay? There are also other resources there, like academic papers we've written, and we keep adding new material as it comes out. Now, after you've downloaded the data and the code: the data we're looking at is mortgage data for the US, and it covers the period from 2000 to 2016.
So it's fairly recent, and most importantly it covers the downturn experience of the global financial crisis, when loss rates in the mortgage industry went up for banks in the US, but also in other countries. Okay? Now, before I go into detail, I also want to talk a little bit about Python. As we analyze data, we use programming languages. There is a whole range of languages, and they all effectively do the same thing.
For example, if you run a linear regression model on the data, you get identical outputs in all the programming languages. That said, if you move on from linear to nonlinear regression models and to machine learning techniques, which are more recent and where new developments are happening as we speak, you find that the languages often provide slightly different degrees of functionality. It is actually machine learning, and within machine learning the area of credit risk, that has probably helped one language in particular, Python, the most to gain a significant market share. On this discussion of which language is better, I can only say that I've studied a range of them over the years, not many: I am Stata literate, SAS literate, R literate and Python literate. These languages are fairly similar, even in the setup, except that I've noticed over the last years that there is a drive towards Python. Because we've written the book in Python, we've spoken to users in many different countries, and it's a bit regional.
In the US, clearly all the industry players, all the bankers, have almost all moved to Python. In Europe it's a bit of a mix between SAS and R, with Python up and coming, but SAS and R, I would say, are still more dominant. In Asia I find all the languages, and in Australia I find R being a bit more popular with the community than the others. But these are just a few data points; you might have others and you might disagree, and it doesn't matter, because they are all very similar in concept.
These languages are not developing in isolation. They actually copy and paste from each other to some degree: if something is successful, other providers implement it too. Now, having worked with all these languages, what I would say is that I find Python the most intriguing one, even though my roots are in SAS. I started off with years of modeling in SAS and now I have to change, and there is a bit of work involved if you want to change, but Python allows a lot of things.
It allows me to integrate a lot with the internet, HTML in particular. And with Python it turns out that it's kind of "the majority wins". As I wrote this book, my son, who is in year seven now and was in year five then, walked in through this door behind me, looked at my screen and said, "Oh Dad, are you doing Python?" I said, "Yeah, I'll show you a few things," and he said, "No need to, I already know Python." It turns out that in New South Wales (you know, Sydney, Australia) primary school kids learn some Python coding, not much, but some. They keep going in high school, and by the time they reach university they have quite a good understanding of the subject. Now, the other beautiful thing about Python is that it comes with many IDEs.
If you talk about R, people usually use one particular IDE, RStudio, but in Python you can use many. Some people use Spyder; in this case I use Jupyter Notebook. You can choose the IDE that suits your needs. My son didn't use Jupyter Notebook in school. He used a particular IDE that's made for primary school students, which makes it much easier for them to understand the matter and focus on what's really important, without too many links and buttons, and with a lot of helpful tools.
Now, either way, in all these languages you have a box where you put in code, and what you see now is such a box to put in code. The way we start is by bringing in the code. We have created a package of functions. Actually it's a module, a combination of multiple functions. It's the dcr module, and what we do is just import it into our system. Not much seems to be happening, but once this is executed,
it has imported a number of packages that we require (I'll show you a few), and it has also imported the data. You can look at the data either by just typing the name of the data, or some people use the print command, which gives a slightly different format, but it doesn't matter. So you just look at the data. Okay? What we're looking at now is a traditional dataset in the mortgage literature, or call it the mortgage risk modeling discipline. The first column is the index.
The index is just a number. The difference is that Python uses zero-based indexing, so any sort of numbering starts at zero. The second column here is the loan ID, which is just an identifier for a loan. Here we have the number four; it just means our first loan has the ID 4, and we have multiple observations for it, because each row is a different observation. We have a number of time stamps. It turns out we have five different time stamps in this dataset. A lot of economic modeling does not measure time only as you think of time; it matters fundamentally in the analytics what sort of time you look at. You look at the observation time, the origination time, the first observation time, the maturity time, and, when we talk about default modeling, we also have the concept of the resolution time.
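The panel layout described here, one row per loan and observation period with several time stamps, can be sketched as follows. This is a minimal illustration, not the dataset shown in the webinar: the column names and all values are assumptions chosen to mirror the five time stamps mentioned.

```python
import pandas as pd

# Hypothetical mini panel: one row per loan and observation period.
# Column names and values are assumptions for illustration only.
panel = pd.DataFrame({
    "id":         [4, 4, 4, 7, 7],
    "time":       [25, 26, 27, 25, 26],        # observation time
    "orig_time":  [-2, -2, -2, 18, 18],        # origination time
    "first_time": [25, 25, 25, 25, 25],        # first observation time
    "mat_time":   [119, 119, 119, 138, 138],   # maturity time
    "res_time":   [float("nan")] * 5,          # resolution time (none resolved here)
})

# Zero-based indexing: the first row sits at position 0.
first_row = panel.iloc[0]

# Differences between time stamps give derived variables.
panel["age"] = panel["time"] - panel["orig_time"]
panel["time_to_maturity"] = panel["mat_time"] - panel["time"]

# Each loan ID appears once per observation period.
obs_per_loan = panel.groupby("id").size()

print(panel)
print(obs_per_loan)
```

Loan age and time to maturity are only two of the possible differences; the same pattern applies to any pair of time stamps.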
So, only here we have five different time stats and they can compute, you know, differences between times. we can do all sorts of stuff sampling and then clustering exercises just the these five variables um So, yeah. Yeah. Can I, can I have a question over here? Uh from Carlos Um asking is the data is the data frame. Yeah. So, um so first of all, what we see here is a is a panel and in Python and now there's a python specific.
There are two main call it uh worlds or packages where people analyze the data. Uh the first world is Nampa and the second rule is pandas as um uh uh Carlos has suggested now what we currently look at is pandas and when we later moved on to machine learning, a lot of people move to Nampa. so much of the machine learning is based on Nampa. Uh when it comes to Python and P for modeling uh in industry, you're most likely to need both packages because Pandas has all the data time series uh transformations for example, you can easily merch merch merch uh data sets. you can introduce lead lags. you can um you know, computer returns of that um and so so um there's a lot of functionality in in pandas and sodas effective enhanced uh uh data frames enhanced a, it's not not pay effectively just multidimensional call it um a um and uh you have multiple dimensions.
But in NumPy there is not this notion of "this is an observation" and "this is a column that stands for a variable". So in daily data science work you most likely need pandas, but when it comes to machine learning, things that are repeatable and executed by machines, most of that works with NumPy arrays, and I'll show you both. As a matter of fact, this is pandas, and if you just add the values command then you convert the pandas data frame into a NumPy array. So this is how a NumPy array looks, but clearly for someone who's starting Python coding it is a bit hard because suddenly all the columns have no names anymore. What do they mean? Who knows? And there is lots of room for bugs and errors. That's why for now we stay with pandas, so that we have readable names for columns and you can imagine a little bit more what's behind these numbers.
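A minimal sketch of the conversion just described, using a tiny made-up table rather than the webinar data. The column names (`id`, `time`, `LTV_time`) and all values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy panel: one loan (id 4) observed at three points in time (made-up values).
data = pd.DataFrame({
    "id": [4, 4, 4],
    "time": [25, 26, 27],
    "LTV_time": [78.5, 79.1, 80.2],
})

# .values returns the underlying NumPy array; .to_numpy() is the equivalent
# method that the current pandas documentation recommends.
array = data.values

print(type(array))         # a NumPy ndarray
print(array.shape)         # rows x columns, but the column names are gone
print(list(data.columns))  # the names only live in the DataFrame
```

After the conversion the numbers are the same, but nothing in `array` says which column is which, which is the source of the bugs mentioned above.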
So, for example, after the time stamps, just to keep going, we have loan balances. So we know the profiles of loans. We have, for example, loan-to-value ratios in percentage points, interest rates in percentage points and so on, and we'll take a look at the data a little bit more, so just stay tuned. Now, by running this code here we have also imported packages. Some are packages that others have created. The major ones we use are pandas and NumPy, and we use scikit-learn later for machine learning. There are then a few minor ones.

For example, if you want to do survival analysis, you need a package, lifelines, for survival analysis. If you want to do certain plotting, there are different plotting packages and so on. We have brought everything into our system so you don't have to do all the installation yourselves. On top of that, we have also imported a number of functions that we've written to promote data science in credit risk. I would say the one that people who work in that field benefit from the most is going to be the validation function, number four, but there are other ones, for example weight of evidence, resolution bias and data preparation, functions that help you work with Python in credit risk. I'm going to show you a few so you get a feel for it.
And the data set, maybe to expand a little bit more, has different parts, call them themes. We have the ID variable. We have seen we have about five time stamps. We have information that is measured at loan origination and information that is measured at loan observation, and then we also have outcomes. Now a bit of terminology. When you move into the world of machine learning, people in machine learning, like people in regression modelling, try to connect dependent and independent variables, y's and x's. The y's are called outcomes, so they're usually the left-hand-side variable, and the x's are called features. That's what you find, because Python machine learning is very engineering-infused. People no longer talk about dependent and independent variables. They talk about outcomes and features, but it's the same thing, it doesn't matter.

The point being, our features are going to be information measured at loan origination and at loan observation. The reason for this distinction is that banks, when they make a lending decision, collect most information when they make the first loan, because they have a bargaining chip. They can say, "You give us all the information or we don't give you a loan", and so they are able to collect a lot of information, especially from households.
They can collect pay slips, they can collect expenditure reports, they can collect all sorts of information. Thereafter, once the money is out the door, at observation, it's much harder to collect information, because the borrower has already spent the money. The risk is now for the bank to recollect the money. Covenants are usually non-existent in consumer loans because governments don't allow them, or they are very weak for corporates because it's very hard to measure. So we have most of the information upfront, and we collect a little bit on top of that during the life of a loan.

And then, last but not least, outcomes. The outcome for credit is usually the default event. We speak of probability of default models. Or loss rates given default, where we speak of LGD models. Or exposures, the outstanding loan amount, where we speak of exposure models. Or so-called refinance models, because banking, like any other industry, is subject to customers changing providers. If you're a bank and you're lending to a borrower, the borrower can at any point in time change to a different lender in a refinance transaction, and governments strongly support refinance because they want to ensure banking systems are competitive and borrowers, in particular household borrowers, are not locked in for 30 years to service a loan with one lender. It turns out that the biggest activity in our data set, and we could show that, does not necessarily come from default. It actually comes from refinance and from paying off a loan prior to maturity. This is the same as if you're a telecom company and your customers walk to competitors, to whatever other providers there are. That also exists in banking, and all these outcomes, be it default, maturity or payoff, effectively have financial consequences for the banks. I don't want to go further, we don't have time for that, but I could spend days and days talking about refinance and payoff because, next to default, that's the other very exciting area. And we do have a UTS PhD student, she might be in here today, who's an expert on payoff and refinance, if you want to contact me.

Okay. So now let's talk a bit more about Python and then we get really started. Let me take it in steps. We can look at the data just by writing the name of the data, and here that name is data. That's the name of the object that carries the data frame, and we can also create slices of the data.
So, for example, we can create a slice with double square brackets. If you're interested in just extracting the loan ID, I put the ID in two square brackets. Okay, and now we have just cut out from the data set the index, that's on the left, and the ID variable. You can also look at ID and FICO score, and then we get three columns. Now, one important thing in Python is that it's case sensitive. So you cannot just replace id with capital letters. Nothing will work. I get an error message, and so you always need to be aware that the syntax is case sensitive. You can use upper and lowercase letters, no problem. You just need to know which you're using, lower or uppercase, and that holds for variable names, it holds for object names, it holds for functions, packages, methods.
There are all sorts of elements that Python uses, and it's always case sensitive. Okay. Another important concept is what's called chaining. For example, from those columns let's focus on one, the FICO score. It's a credit score. If I just wanted to get descriptive statistics, things like the mean, then I can chain, chain as in a link, a method, which is something that I want to do. For example, I want to describe that variable, so I just chain the describe command with a dot. Then I get the empirical distribution of the FICO score. That's great. And of course, if I wanted to round, I can use the round method, for example round to two decimals, and I can execute that as well.
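The three ideas above (double square brackets, case sensitivity, chaining with a dot) can be sketched on a small made-up table. The column names `id`, `FICO_orig_time` and `LTV_time` and all values are assumptions for illustration, not the data shown in the webinar.

```python
import pandas as pd

# Made-up toy data; names and values are illustrative assumptions.
data = pd.DataFrame({
    "id": [4, 4, 7, 7, 9],
    "FICO_orig_time": [587, 587, 701, 701, 819],
    "LTV_time": [78.5, 79.1, 65.0, 64.2, 91.3],
})

# Double square brackets select one or several columns and keep the index.
id_only = data[["id"]]
id_fico = data[["id", "FICO_orig_time"]]

# Column names are case sensitive: "ID" is not "id".
try:
    data[["ID"]]
    case_error = False
except KeyError:
    case_error = True
print("KeyError for 'ID':", case_error)

# Chaining: describe the variable, then round the result to two decimals.
fico_stats = data["FICO_orig_time"].describe().round(2)
print(fico_stats)

# The same chain works for several variables at once.
both_stats = data[["FICO_orig_time", "LTV_time"]].describe().round(2)
print(both_stats)
```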
Okay, execution, like in many programming languages, is Shift+Enter. I can also do that for multiple variables. For example, for the loan-to-value ratio at observation time I can do that as well, and that is how I build my tables for the analysis. What that tells me is that the average credit score is 673, the minimum is 129 and the maximum is 819. There's a lot of economic content in these values. Okay, any comments so far?

Harry, we just got a question regarding the reason why Python is often used for machine learning credit models. Why not Perl for credit risk? And the other question is about the liquidity that you are referring to. Does this include buildings, house and land?

Yeah. So first, let's do the second question, liquidity. We haven't really computed liquidity in our data yet.
I could talk about it, but it would expand my time frame here. If you go to the book, there's a section on liquidity, and this section looks at loan balances, so you could construct a liquidity measure by looking at loan balances. Other measures in banking don't look at levels. They look at flows. One of the most important flows would be the debt-to-income ratio, DTI as the acronym. So there are many concepts, and liquidity itself is a discipline.

People can spend a whole lifetime, and there are actually a few of them, just worrying about liquidity and changes in liquidity. That's liquidity risk. So currently we don't have anything for liquidity, but it can be constructed. I can show you later where, but for today's session I just let the function that I've shown you earlier compute it for us and I tell you it's liquidity. You need to believe me, but clearly if you had more time, you could compute it yourself.
And the other question is, why not other languages? Why Python? First of all, there are surveys on what the most important languages are, and the surveys usually show the dominance of R or Python for data science. It's regional and it also depends on who is sponsoring the survey. As always, it turns out there are some surveys that come up with other languages, which are usually financially supported by providers, and some people are suggesting other languages as the next Python. I think perhaps it's about the mass market, what most people are using, because even just to set things up and to process them you need reference points. With the larger languages you have lots of manuals, lots of help, lots of guidance, and you run into problems very quickly because we work on different operating systems, we work on different browsers, we work on different versions, and very quickly deploying machines and programs becomes super complicated. When it comes to credit modelling, I would say maybe 80% of the credit models globally are sitting in banks and institutions, and they have no choice what language they use. It's given by the team or by the bank, and so we need to use what the bank has a license for and what the bank is comfortable using. Only 5 years ago, Python and R would not have been used by most banks because they said there's no provider that you can take recourse to if something goes wrong.
For example, with SAS there is the SAS Institute, but with Python and R you have the crowd. No one's going to pay you if something goes wrong. Even in our book, when we wrote it, we saw this: every language has new versions coming out all the time, and very quickly your numbers are going to change a little bit. Why? Because there's some package that's wrapping another package, and the other package just discovered a bug in a certain line. If you're a bank and all your loan loss provisions use that procedure, and now you have to make a loan loss provision adjustment just because there was a version change by some geek sitting somewhere, you cannot justify that to shareholders. You need to be fairly confident that every time you run the program, you get the same results.
Of course you can manage that. You can introduce versioning and so-called environment controls, and thanks to that, these days most banks have environments where they know which program version they use, and they are open to open source languages. But to the degree that you now move to languages that are less common, you have a hard time finding staff that's literate in them, and there's also less confidence by big banks to deploy these programs in the organization. I had a discussion on that with one of the major banks in Australia a few years ago, and they basically said that out of 20 thousand employees about 5000 were literate in Excel, 250 were literate in SAS or Python, and only six or seven staff members in the organization were probably on the level of Perl and above. And not even Perl was mentioned as a reference point. It's just advanced. So I think that explains a little bit what's going on, but it's based on anecdotal evidence.
A very good comment. I just want to say one more thing about SAS and Python. They all have, I would say, the same procedures. The syntax is different. The good thing about Python is that it's very homogeneous across implementations. For example, if you have one machine learning technique and then another one, the syntax does not change. You only replace a single word and rerun the code. SAS is a bit different because it has a legacy system where new procedures are always added on and over time the syntax changes. Because it's a somewhat older language, you find that the syntax between the machine learning techniques is very different. The other aspect is that SAS is good at selling. They often ask you to subscribe to different packages, so you don't usually have access to the whole universe of methods, whereas in Python and R it's just a matter of downloading. Okay.
Now I want to talk a little bit about non-linear transformations, because machine learning lives off that. Let's pick one variable, the loan-to-value ratio. The loan-to-value ratio is a key driver in credit risk. It is the proxy for equity, well, one minus equity. What I do here in Python very quickly is form categories, fifteen categories. I use the method qcut, so I form fifteen categories with an equal number of observations. Fifteen, but it doesn't matter, it could be two, ten or twenty. I can get the boundaries which are chosen, and I then create a new variable in my data set that has the category in it. Let me just show you that so you know what's going on. I can print the data set here, and it has now created a new variable which is the category, zero to 14, fifteen categories, zero-based indexing. What it has also created is the boundaries. It gives me the class thresholds.

So basically a loan-to-value ratio from zero to 46 is class zero, from 46 to 56 is class one, from 56 to 63 is class two and so on. Then what I'm doing is just calculating the mean loan-to-value ratio and the default rate, the default rate being the mean of the default indicators. The new data object that I get has as an index the number of the loan-to-value category, then the average loan-to-value ratio and the default rate that's observed in the data history. Then I can plot what I've just computed. Generally we have huge databases, and the issue is that if you want to plot something, you need to aggregate information.
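The binning and aggregation steps described above can be sketched as follows. This is a minimal illustration on simulated data, not the webinar's data set: the column names `LTV_time` and `default_time`, the simulated default probabilities and the sample size are all assumptions.

```python
import numpy as np
import pandas as pd

# Simulated loan observations (assumption: made-up data for illustration).
rng = np.random.default_rng(0)
n = 6000
ltv = rng.uniform(20, 140, n)
pd_true = 0.01 + 0.04 / (1 + np.exp(-(ltv - 100) / 8))  # assumed, rises around LTV 100
data = pd.DataFrame({
    "LTV_time": ltv,
    "default_time": rng.binomial(1, pd_true),
})

# Fifteen categories with (roughly) equal numbers of observations.
# labels=False gives integer classes 0..14, retbins=True returns the thresholds.
categories, bins = pd.qcut(data["LTV_time"], q=15, labels=False, retbins=True)
data["LTV_cat"] = categories

# Mean LTV and default rate (mean of the 0/1 default indicator) per category.
summary = data.groupby("LTV_cat")[["LTV_time", "default_time"]].mean()

print(np.round(bins, 1))   # class thresholds
print(summary.round(3))    # 15 rows: one point per category for plotting
```

The resulting `summary` table has one row per category, which is what makes a plot of default rate against loan-to-value ratio readable even when the underlying data set is large.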
So in this case we just want to show the fifteen points. Let me just make it a bit smaller. What I see here on the x-axis is the loan-to-value ratio, that's, call it, the debt funding of a loan, and on the other axis the default rate. I hope this works. You see here three regions. Usually below eighty you find a very flat, risk-insensitive region of the loan-to-value ratio. Then between eighty and 120 you find a much more sensitive region, and maybe here, who knows what's going on, maybe you'll find a flattening out of that relationship. But you can see that in the relationship of the feature with the outcome variable you find three different regions, and this is the start of thinking about non-linearities, because pretty much every variable you can think of features non-linearities. Finance is a very, should I say, innocent discipline, because we do a lot of regression, but we very rarely look at this. We just look at whether an effect is positive or negative and that's it.
But really, most features are non-linear. This is loan to value. I could do it for liquidity. I could do it for all sorts of things. The reason is thresholds, and here we have a few thresholds in there. This one is usually the lending threshold. Banks tend not to lend above a loan-to-value ratio of 80%. This one is the underwater mark. When you reach an LTV of 100% or go above, you have negative equity, so you are structurally, call it, broke. And yes, maybe the risk increases here, but here maybe most people have already defaulted, so it doesn't go further. So there's a lot of economic content in the non-linearity, and until a few years ago, pretty much until the evolution of R and Python, people didn't worry about non-linearity. And this is not only about machine learning.

Okay, we can move on. What you can do in econometric regression models is include non-linear features. One option is to put in polynomials. Polynomials are effectively the feature itself, the feature squared, the feature to the power of three and so on. You can put in splines. Splines are effectively similar to polynomials, but the way they work is they look at intervals, for example, like what I just said, below 80, 80 to 100 and above 100. Then they estimate the regression separately for these regions and make sure that the fitted estimates sit next to each other at the thresholds. We have absolute and relative spline coding, and it's all in the book, but for now I just want to make one point here. There are polynomials and splines. You can put in the categories, or weight of evidence. Weight of evidence is an old concept, but it's still valid. With all these concepts, regression models are able to model the non-linearity we observe. That's it.
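A hedged sketch of the two feature constructions just mentioned, polynomial terms and a linear spline with knots at 80 and 100. The data are made up and the fit is ordinary least squares via NumPy, which is an assumption for illustration; the book's own spline coding may differ. The spline is written with "hinge" terms `max(LTV - knot, 0)`, which is one common way to let the slope change at each threshold while keeping the fitted line continuous there.

```python
import numpy as np

# Toy category-level data (made up): a default rate that is nearly flat
# below 80, steep between 80 and 100, and flatter again above 100.
ltv = np.linspace(30, 130, 15)
default_rate = (0.01
                + 0.0001 * ltv
                + 0.0015 * np.maximum(ltv - 80, 0)
                - 0.0010 * np.maximum(ltv - 100, 0))

# Polynomial features: feature, feature squared, feature cubed.
# LTV is standardized first to keep the powers numerically well behaved.
z = (ltv - ltv.mean()) / ltv.std()
X_poly = np.column_stack([np.ones_like(z), z, z**2, z**3])

# Linear spline features: the slope is allowed to change at 80 and at 100.
X_spline = np.column_stack([np.ones_like(ltv), ltv,
                            np.maximum(ltv - 80, 0),
                            np.maximum(ltv - 100, 0)])

def fit_mse(X, y):
    """Least-squares fit; returns coefficients and in-sample mean squared error."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, float(np.mean((y - X @ coef) ** 2))

_, mse_poly = fit_mse(X_poly, default_rate)
_, mse_spline = fit_mse(X_spline, default_rate)
print(f"cubic polynomial MSE: {mse_poly:.2e}")
print(f"linear spline MSE:    {mse_spline:.2e}")
```

Because the toy curve was generated with kinks at exactly 80 and 100, the spline recovers it by construction; with real data neither fit would be exact.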
Now the non-linearity for polynomials. If you look here, your model is the solid line and the outcome is the scatter. It is not perfect. There's a deviation. There's also a deviation for splines. The fitted and the observed values are not always identical. But for categories and for weight of evidence the match is perfect. So in the top two charts we have an imperfect match. It's good but not perfect, and in the bottom two charts we have a perfect match.

Now, what's the difference? I haven't told you how I did the polynomials and the splines, but we have fifteen data points. Remember, fifteen data points. In the top two charts I have included three degrees of freedom, we call it. So I've included a polynomial of degree three or a spline term that has three sections. In the bottom charts I've just used the fifteen categories to get category-implied default rates, or I've calculated the weight of evidence for each one of these categories. So in the bottom charts I have a model with fifteen degrees of freedom, and if you have fifteen observations and fifteen degrees of freedom, of course you have a perfect match. What we see here is that, whatever the technique, the fit gets better and better the more parameters you put in, and obviously you have a perfect match if the number of parameters equals the number of data points you look at.

The same holds in machine learning. Today we will not become experts in machine learning in terms of the implementation. I would need a whole class, a couple of classes, to make you literate in that, but I can show you concepts, and these concepts apply to various techniques. Machine learning models achieve the same non-linearity with different approaches. One class, for example, is neural networks, which work with hidden layers and activation functions, which are link functions between the outcomes and the features. Decision trees work with splitting of the feature space.
For example, you split on the loan-to-value ratio being above or below some value. Random forests are bootstraps, that is, random samples of decision trees which are then aggregated later on. There are more machine learning techniques out there, and they all have their way to fit the model to the observed data, and effectively what they are doing is exactly what's going on here.

In many of these techniques you can actually decide on the design. For example, you can decide how many bootstraps you want to run, two or 100 or a thousand. These parameters are called hyperparameters in machine learning. Here we have just used our existing regression models, and I have used, I call it, a model of complexity three and a model of complexity fifteen, and the same can be done in machine learning.
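To make the "complexity three versus complexity fifteen" comparison concrete, here is a small sketch with fifteen made-up data points. It assumes ordinary least squares as the regression (the webinar does not confirm which regression was used on the slide) and compares a degree-three polynomial with one dummy variable per category.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fifteen category-level observations (made up for illustration).
ltv = np.linspace(30, 130, 15)
default_rate = (0.01 + 0.04 / (1 + np.exp(-(ltv - 95) / 8))
                + rng.normal(0, 0.003, 15))

# Complexity three: intercept plus feature, feature^2, feature^3.
z = (ltv - ltv.mean()) / ltv.std()
X3 = np.column_stack([np.ones(15), z, z**2, z**3])

# Complexity fifteen: one dummy per category, i.e. one parameter per data point.
X15 = np.eye(15)

def in_sample_mse(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((y - X @ coef) ** 2))

mse3 = in_sample_mse(X3, default_rate)
mse15 = in_sample_mse(X15, default_rate)
print(f"complexity 3:  MSE = {mse3:.2e}")   # good, but not perfect
print(f"complexity 15: MSE = {mse15:.2e}")  # as many parameters as points
```

With as many parameters as data points the fitted values simply reproduce the observations, which is why the in-sample match is perfect and says nothing yet about how the model behaves on new data.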
I will stop in a minute and then we can go through all your questions. I'm sure you have many. What follows now just builds on that, so let's quickly finish it and then we go on to all your questions.

Harry?

Yes. We have time to ask that question now, or you can do it later, no problem.

Okay, let's do it now. There are questions over here. The first one is, can you share some advice on how to handle missing data before modelling?

Okay, keep going, I'm making my notes.

The second one is, would a perfect match not risk overfitting the data? Let's go with these two questions first, and later on we'll go with the others.

Okay. So, the first question, missing values.
It sounds easy, and as academics we always say, a missing value, just replace it with a cross-sectional average or just kick it out, use a dropna statement. But it's actually much more complicated than that. Missing values are probably the most underestimated feature, because in missing values you have information. Often there's a reason why the value is missing. If it's an error, okay, it's an error, kick it out. But there might actually be a reason, an economic reason, why it's missing, and so for many, many applications missing values should actually be the centre point of your research or of your model building. And this is where the problem starts.
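One simple way to keep the information carried by a missing value, rather than dropping the row, is to add a missing-value indicator next to an imputed column. The sketch below is an illustration on made-up data (the `income` and `default_time` columns and all values are assumptions), not the treatment from the book.

```python
import numpy as np
import pandas as pd

# Made-up toy data: income is missing for two borrowers.
data = pd.DataFrame({
    "income": [5200.0, np.nan, 4100.0, np.nan, 6100.0],
    "default_time": [0, 1, 0, 1, 0],
})

# Option 1 (the quick fix): drop incomplete rows and lose whatever they tell us.
dropped = data.dropna()

# Option 2: flag missingness as its own feature, then impute the value.
data["income_missing"] = data["income"].isna().astype(int)
data["income_filled"] = data["income"].fillna(data["income"].median())

# Check whether missingness itself is related to the outcome.
rate_by_missing = data.groupby("income_missing")["default_time"].mean()
print(dropped)
print(data)
print(rate_by_missing)
```

In this toy table the borrowers with missing income are exactly the ones who default, so dropping them would throw away the most informative rows; real data will be less extreme, but the check is the same.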
If you acknowledge that a missing value stands for something, then how do you treat it? In the book we have a whole section on missing values, but it's only, let's say, twenty pages. It goes in the right direction, but there could be a lot more done on missing values.

Second question. Overfitting is coming in a slide, so bear with me, but you're correct. If you have a perfect match, you overfit. So, spot on. And this slide here is exactly the point. We can decide how to build the models, be it with ordinary econometric techniques or with machine learning models. You can say you want little complexity or a lot of complexity. Now, if we have little complexity, then we have a problem called underfitting.
We introduce a bias. Our model does not describe the data well. Let me give you an example. One of the key models in credit risk is the Merton model. It's a model for the probability of default. It's based on the option pricing framework. People got the Nobel Prize for it, and you think it's a cool model. It is a cool model, but effectively it's very prescriptive. It's very underfitting. Given all the assumptions it is perhaps the right thing to do, but if you apply it to any credit portfolio, the deviations between your fitted values and what you observe are so stark that it would not pass the internal and external validation of the banks.
We have worked for twenty-plus years in this field and have never ever seen the Merton model applied in the industry. Or, to go one step further, the Merton model is implemented by KMV. I have never seen the KMV model, that is, a calibration of the distance to default to observed defaults, applied in the industry. I hear some people do it, but I've not seen it. But there are many more examples of under- and overfitting. Let's look at a few.
Okay. I just want to highlight that there is also a third source of error. Imagine you flip a coin. Heads and tails each have a 50% chance, but the outcome, head or tail, is hard to predict. Some of us are statisticians, and we think the best you can do is predict the probability, 50%. Other people out there, scientists and mathematicians usually, say, "Perhaps you can predict whether it's a head or a tail", because maybe you can figure out how you flip it, and then you can look at the air and measure the air and how the coin is going to turn around, and maybe you can calculate everything and actually predict head or tail. It depends. If you are a statistician, you say, "I cannot be better than the 50%." If you're a mathematician, maybe you say, "I can predict it perfectly." So for the statisticians among us there is still this third source of error. The mathematicians among you can just drop the third bullet on the slide. It doesn't exist in your world.
Now there is the trade-off between under- and overfitting. You make a decision, do I want to underfit or overfit, and the trade-off is usually handled with concepts such as cross-validation and hyperparameter tuning. Again, I don't have much time to go into that today. I just want to highlight it in case you hear these concepts, and again, the book has a lot of material on that, six or seven whole chapters on tuning, which show you how to get the optimal mix between under- and overfitting. But now, what is under- and overfitting? I think this is best shown with an example.
Let's look at the charts. First of all, we have six charts. We have two columns, and I'm going to focus now on the left column. What the left column has is the yellow line, think of it as a default rate over the economic cycle, and then you get the scatter plots. Basically the yellow line is the true data generating process and the blue dots are what we observe. Now, the estimated model. In this first case we estimate a simple model. Alternatively, we can of course model three terms, polynomials of degree one, two and three, so the feature, the feature squared and the feature cubed, and then we get the green dashed line, which follows the data generating process quite well, but it's not perfect because the observed blue dots are still away from the green line. And then, as we have twenty data points, we can introduce a model of twenty polynomials, a perfectly fitting model, and clearly the green dashed line runs through each scatter point, so it perfectly describes the data. This is all for the train sample in the first column.

Now let's have a look at what happens if you apply that model to the same experiment but with another random draw. Then we are here in the right column, that is, the second column. Basically we have the same fitted model. It's the same line as here. It's just that the blue dots have changed, because we had another random experiment and all these blue dots are different. On top of that, we can now calculate performance measures. At a basic level, here is the model and here is the observation, and the MSE, the mean squared error, measures the distance between them. The greater the error, the poorer the fit of my model. Now I'm computing that for a new sample. I call it the test sample. So the first one was the training sample and this is the test sample, and we see that something quite unexpected happened. The test sample turns out to have an MSE that's better, in the sense that it's lower. So it's not a bad model, because the training and the test samples are similar. Actually, I make this statement: if I did not have twenty data points but increased the number of data points, maybe to infinity, these MSEs would converge to each other.
So train and test would be similar. That's my guess. Now I can go down to the next setting, where I estimate the next model. The next model is the degree-three polynomial model here. The green line is the same as before because it's the same model, but the dots have changed, and I recalculate the MSE, and you see here that the MSE has changed.
Let me just get the pointer right. The MSE in the training sample is 7%. Again, 7% is lower than the 50% before. Why? Because we now have a degree-three polynomial. But in our test sample it has changed to 22%. Now, 22% is still better than the 44% we had earlier, but it is bigger than the MSE in the train sample. So as you increase the complexity of your model, the fit gets better, but you also see that the fit in the test sample deteriorates relative to the train sample, and this is exactly what is meant by overfitting. Your model, applied to other data sets, suddenly becomes not good enough. Here, at degree three, this is still good, because three polynomials give us a better result in the test sample than one polynomial. But of course, then we can do the next one. We can go to a perfectly fitting model. That's this model here, twenty polynomials for twenty data points, and in the train sample it's spot on. The MSE is almost zero.
It is zero. It's just some computational error in here. But if I do it now for the test sample, what's happening is that not in a single instance does the green dashed line hit the dots. It might look here as if it does, but it turns out it doesn't. What I mean is this one, for example. Here the observation and the fit are so close to each other that you think they are overlapping, but they're not. They're actually far apart. And so we see here that the MSE shoots up from 7% to 9000. That's astronomical. So somewhere between one, three and twenty polynomials there is a turning point where suddenly your error increases. And you know, the MSE, when I did my PhD and saw it the first time, RMSE, MSE, I was thinking, "Oh, it's a crazy number."
But the MSE is actually a very, very simple measure, and it's actually the measure that's closest to bank profitability, because the distance between your observation and your fit gives you an indication of what wrong decisions you as a bank can make when assessing credits. Usually, if you overestimate or underestimate the risk, you have losses in real life, and these losses are heavily correlated with the MSE. So, out of all the measures, a variation of the MSE (the MSE itself is not asymmetric) is probably what's going on in the minds of banks. And so the MSE actually summarizes under- and overfitting very well.
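A minimal sketch of the experiment described above: polynomials of degree 1, 3, and 20 are fitted to a small training sample and scored by MSE on both the training sample and a separate test sample. The sine-shaped data-generating process, the noise level, and the sample sizes are assumptions made for illustration; they are not the data shown in the talk.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def true_curve(x):
    # Assumed data-generating process, for illustration only.
    return np.sin(2 * np.pi * x)


# Small training sample and a larger, separate test sample.
x_train = np.linspace(0.0, 1.0, 21)
y_train = true_curve(x_train) + rng.normal(0.0, 0.3, x_train.size)
x_test = rng.uniform(0.0, 1.0, 200)
y_test = true_curve(x_test) + rng.normal(0.0, 0.3, x_test.size)


def mse(y, y_hat):
    """Mean squared error between observations and fitted values."""
    return float(np.mean((y - y_hat) ** 2))


train_mse, test_mse = {}, {}
for degree in (1, 3, 20):
    # Least-squares polynomial fit on the training sample only.
    fit = Polynomial.fit(x_train, y_train, degree)
    train_mse[degree] = mse(y_train, fit(x_train))
    test_mse[degree] = mse(y_test, fit(x_test))
    print(f"degree {degree:2d}: train MSE {train_mse[degree]:.4f}, "
          f"test MSE {test_mse[degree]:.4f}")
```

With 21 training points, the degree-20 polynomial passes through every training observation, so its training MSE is essentially zero, while its test MSE is driven by how wildly the curve swings between those points.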
Okay. So this is the problem. We haven't even done much machine learning yet, but it's exactly the problem with machine learning: these procedures tend to overfit. And now I'm talking to my own discipline, because I work in finance a lot and I see a lot of machine learning papers coming out, because it's trendy and it's easy to take an existing paper, put a machine learning technique on it, and say, hey, this paper found a performance of X, and with machine learning I'm 5% or 10% or 100% better. You can do it for everything, for every economic relationship that has ever been published. Just do it. And I see papers being published where people do not scrutinize overfitting. It's easy, with a machine learning technique or with any technique, to get a better model as you increase the number of parameters, but that does not mean that what you're observing is actually what's going on in the data. So this is the problem, and in any case where we use machine learning to analyze things, you need to demonstrate that your machine learning technique provides a benefit but is also robust. Usually you do that with cross-validation and hyperparameter tuning in certain regions. So machine learning is far more complex than just executing a line of Python code where the features go in and the outcomes come out. You will actually spend more time on it than on estimating a regression model, because you need to understand what all these parameters do and how they impact your model results. Also, going forward, referees are becoming more sensitive to that issue, and they will demand such checks. Now, let's just move on.
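As a hedged illustration of the cross-validation step mentioned above, the sketch below picks a polynomial degree by 5-fold cross-validation on simulated data. The data, the range of candidate degrees, and the helper name `cv_mse` are assumptions for this example and do not come from the talk.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

# Simulated data; the sine curve and the noise level are assumptions.
x = rng.uniform(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, x.size)


def cv_mse(x, y, degree, n_folds=5):
    """Average validation MSE of a polynomial fit over n_folds folds."""
    idx = np.arange(x.size)
    folds = np.array_split(idx, n_folds)  # x was drawn in random order
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)  # all rows not in the current fold
        fit = Polynomial.fit(x[train], y[train], degree, domain=[0.0, 1.0])
        errors.append(np.mean((y[fold] - fit(x[fold])) ** 2))
    return float(np.mean(errors))


# The hyperparameter here is the polynomial degree (model complexity).
scores = {degree: cv_mse(x, y, degree) for degree in range(1, 9)}
best_degree = min(scores, key=scores.get)

for degree, score in scores.items():
    print(f"degree {degree}: CV MSE {score:.4f}")
print("selected degree:", best_degree)
```

The point of the design is that every candidate is scored only on observations it was not fitted to, so a more complex model is selected only if it actually predicts better.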
In banking, in credit risk, what we do is split the data not along the cross-section but along the time series. That is, we look at different time periods. I'm going to show you that in a second. Either way, whether you look at regression techniques or machine learning techniques, you can manipulate the results. This is how it works. Basically, you can state your hypothesis after you have analyzed the data. You can test multiple times. You can include different features until you find the feature combination that supports what you're after. Or you can be selective in what you report: if you have multiple results, you report the ones that suit you. With machine learning you can do pretty much the same thing, as Carlos described and as we showed earlier: increase the number of parameters and you find a very good fit. Both p-value hacking and HARKing result in overfitting.
Now I just want to show you one last thing. Let's have a look at machine learning techniques. Let's first look at the data. If I calculate the default rate here, let me just show you that. The issue here is that we have anonymized the data, so we don't talk about the years. The time values here, starting at zero, correspond to the years 2000 to 2015. And really, if I look at the data, I see two periods.
I see this period here. It's a benign period. Let me do it the other way around in terms of colors. So I see this period, the green period, and then I see the period on the right, which is a period of increased risk. For our experiment, we're going to slice the data this way. We're going to have a training period of zero to twenty-six, and for the test period we're going to have the global financial crisis. And so the average default rate for the test sample might be here, and the average default rate for the training sample might be here, maybe even lower.
Okay. So we're going to do the harshest test of all: we're going to estimate a model during benign periods and forecast the risk for the economic downturn, one of the biggest shocks, the global financial crisis in mortgage lending. This is where one function comes in handy that we created for the book. It's a sort of tool with which you can slice and dice the data into training and test periods. It also calculates various feature transformations for you, things like principal components and clusters.
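The out-of-time split described above can be sketched as follows. This is not the book's function; it is a hypothetical stand-in on simulated data. The period index range (0 to 60), the crisis window, and the panel size are assumptions, while the cutoff of 26 and the default rates of roughly 1% and 3% are taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical loan-period panel: one row per loan and observation period.
n_rows = 5000
time = rng.integers(0, 61, n_rows)      # anonymized period index, 0..60 (assumed)
in_crisis = time > 26                   # assumed crisis window for this sketch
default = rng.random(n_rows) < np.where(in_crisis, 0.03, 0.01)

CUTOFF = 26  # last training period, as described in the talk

# Split along the time series, not randomly across the cross-section.
train_mask = time <= CUTOFF
test_mask = ~train_mask

print("train rows:", int(train_mask.sum()),
      "default rate:", round(float(default[train_mask].mean()), 4))
print("test rows: ", int(test_mask.sum()),
      "default rate:", round(float(default[test_mask].mean()), 4))
```

Because every test row lies strictly after the last training period, the model never sees the downturn during estimation, which is what makes this a harsher test than a random split.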
So I'm going to run that. There's not much to see because there's no output, but okay. There are various features I'm going to include: liquidity and equity, as discussed. Someone asked how we include liquidity. We include liquidity as the cumulative excess payments at any point in time. It's a bit hard to explain and I would need more time for that, but it's a measure of liquidity. We include equity, which is one minus the loan-to-value ratio. We include the loan contract rate. You can ask yourself, should I include it, because in the end this is maybe what the bank wants to measure, but on the other hand a loan rate also includes private bank information that one can exploit. Then FICO, GDP, principal components, and clusters. Now, where are they? Well, they are in the data that I have just created. Let's just comment out these lines here, so I'm not going to execute them anymore.
And I can have a look. First of all, of course, I have here a training data set. We can look at the data in a bit more aggregated terms. Compared with the original data set, the training data set has 12 thousand rows and our test data set has 27 thousand rows. Why does the test data set have more rows than the training data set? Just because over time the portfolio keeps growing in size; we have ramped up the portfolio. Normally people have larger training samples than test samples, but it doesn't matter for us. In our experiment we want to predict outcomes for crisis periods. So that's it. I'm going to turn to machine learning techniques, and I just want to make this point: I'm not going to use these training and test data sets anymore. I'm going to use these X variables here. Let's first look at the data.
So if I look at the data, I get a DataFrame, and if I look at the X data sets, I actually get arrays. Let's keep going. We then estimate a logistic regression model. There are various nuances to how it works, but we use the package called scikit-learn. This is a machine learning package, and the method is logistic regression. We first set up the model, that is, select the model, and then we fit the model using the fit method. We fit it by specifying the NumPy array of the X features. There are some details to that, what we have done with the features and so on, but I think it would go into too much detail; I'd need more time to explain it to you. Basically, we are fitting on the observed features, transformed for the machine learning fit, and the observed outcome variable. And then I can print the parameters of that model. What's the problem? It's currently not working. Just one second. I think it's okay; it's just a bit long in terms of calculation time. Okay, so this is a logistic regression model. These are the parameters: the parameter for the intercept and then the parameters for the various features. These estimates correspond to the features listed here, the first, the second, and so on. Normally in econometrics we have a nice estimation output here, but because this is scikit-learn, which is machine learning, you can print the estimates but there is no longer this sort of comprehensive table. You can create a lot of that, but it requires using other functions that come with scikit-learn. And I'm going to show you how to use them in a second.
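A minimal sketch of the steps just described, using scikit-learn: set up the model, fit it on a NumPy array of transformed features, and print the intercept and coefficients. The simulated data, the feature names, and the coefficient values are assumptions for illustration and are not the data set or estimates from the talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Simulated features; names and true coefficients are illustrative assumptions.
feature_names = ["liquidity", "equity", "interest_rate"]
X = rng.normal(size=(5000, 3))
true_beta = np.array([-0.5, -1.0, 0.7])
prob = 1.0 / (1.0 + np.exp(-(-3.0 + X @ true_beta)))
y = (rng.random(5000) < prob).astype(int)  # 1 = default, 0 = no default

# Transform the features for the fit (here: standardization).
X_scaled = StandardScaler().fit_transform(X)

# Set up the model first, then fit it on the feature array and outcomes.
model = LogisticRegression()
model.fit(X_scaled, y)

# scikit-learn exposes the estimates as arrays, not as a regression table.
print("intercept:", model.intercept_[0])
for name, coef in zip(feature_names, model.coef_[0]):
    print(f"{name}: {coef:.3f}")

# Estimated probabilities of default for each observation.
pd_hat = model.predict_proba(X_scaled)[:, 1]
```

Note that scikit-learn's `LogisticRegression` applies L2 regularization by default, so the printed coefficients are not identical to unpenalized maximum-likelihood estimates, and no standard errors are reported.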
Once you have done the modeling, you can estimate the probabilities of default, and then one of the most important functions that we have in our package in the book is the validation function. It allows you to summarize your model. This is a powerful tool, and I've not seen anything like it elsewhere that is specific to credit risk. What we have here are summary statistics. So we know, for example, that in-sample we have 12 thousand observations.
We know that our average default rate is 1% in the data. You have the area under the curve. It's an important measure, and these measures are important for other things; regulators, for example, care about them. The book goes on and on about it, but for now I don't have time. I just want you to focus here on the fit: the area under the receiver operating characteristic curve, a number between 50% and 100%, is 78%. We also see here the fit over time. That's a very powerful chart. You may say 78% is not high enough because you've seen someone with higher numbers.
There's also some literature on that. I suggest you read a paper by Daniel and Blots from, I think, 2005 in the magazine. It talks about how to interpret these numbers correctly. But anyway, in this example, happy days. And now we use the same model, our model here, but we use the test data and rerun our validation output. And this is where the problem starts, because now we have here the timeline and here the outcome. The outcome is an increased risk. Before, we had an average default rate of 1%. Now we have an average default rate of 3%, which is much higher. So in the crisis the risks are up, but our predictions don't follow. That's a problem with the model. Can we do better? Well, I don't know. We'll try.
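To make the validation output more concrete, here is a small sketch that computes the kind of summary statistics mentioned above: number of observations, observed default rate, average predicted PD, and the area under the ROC curve. The function names `auc` and `summarize` are hypothetical; this is not the validation function from the book, and the four observations are a toy example.

```python
import numpy as np


def auc(y, score):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formula."""
    y = np.asarray(y).astype(bool)
    score = np.asarray(score, dtype=float)
    order = np.argsort(score, kind="mergesort")
    ranks = np.empty(score.size)
    ranks[order] = np.arange(1, score.size + 1)
    # Tied scores share the average of their ranks.
    for value in np.unique(score):
        tie = score == value
        ranks[tie] = ranks[tie].mean()
    n_pos, n_neg = int(y.sum()), int((~y).sum())
    return float((ranks[y].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))


def summarize(y, pd_hat):
    """Summary statistics for observed defaults y and predicted PDs pd_hat."""
    y = np.asarray(y)
    pd_hat = np.asarray(pd_hat, dtype=float)
    default_rate = float(y.mean())
    mean_pd = float(pd_hat.mean())
    return {
        "n_obs": int(y.size),
        "default_rate": default_rate,
        "mean_pd": mean_pd,
        "auc": auc(y, pd_hat),
        # True when the model predicts less risk than was realized.
        "underpredicts": mean_pd < default_rate,
    }


summary = summarize([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(summary)
```

Running the same summary on the training period and on the crisis period side by side shows both effects discussed in the talk: how well the model ranks borrowers (AUC) and whether the average PD keeps up with the realized default rate (calibration).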
In the last chart, the calibration, you see that for any average PD level, our outcome levels, the default rates, are much higher. So it shows that our model is underpredicting the real-life default rates. Okay. Now we can do a machine learning technique, and there are a lot of them out there. We quite like the voting classifier because it combines machine learning techniques. In this case we combine a logistic model with a random forest and a neural network, and we can include other models as well. This is a bit of coding, and it's all in the book, but effectively you need to estimate these models in separation first, and then there's a technique to aggregate all of these models. In this case there are two techniques, a hard and a soft aggregation. We use the soft one, which is a mixture model, and the outcome of that is this chart here.
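A hedged sketch of soft voting with scikit-learn's `VotingClassifier`, combining a logistic regression, a random forest, and a small neural network. The simulated data and all hyperparameters are assumptions for illustration; the talk does not state the settings used. Unlike the workflow described above, `VotingClassifier` fits the individual models itself when `fit` is called.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)

# Simulated data with a non-linear effect in the first feature (assumed).
X = rng.normal(size=(2000, 3))
logit = -2.5 + 1.5 * np.maximum(X[:, 0], 0.0) - 0.8 * X[:, 1]
y = (rng.random(2000) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

ensemble = VotingClassifier(
    estimators=[
        ("logit", LogisticRegression()),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(8,), max_iter=500,
                              random_state=0)),
    ],
    voting="soft",  # average predicted probabilities instead of majority vote
)
ensemble.fit(X, y)

# Ensemble PD: the equally weighted mean of the three models' PDs.
pd_hat = ensemble.predict_proba(X)[:, 1]
print("average predicted PD:", round(float(pd_hat.mean()), 4))
```

With `voting="hard"` the ensemble would instead take a majority vote over predicted classes, which does not yield a probability of default, so the soft variant is the natural choice when the output has to be a PD.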
And we see, especially if you look at the upper right chart, that suddenly the machine learning technique is much better at predicting the increase in the crisis. Why? Because it's accommodating many of these non-linearities that we have observed. You may find similar observations from others. So, long story short, should we use machine learning techniques? I think it's definitely what banks should use. Banks have different applications that they can use it for: loan pricing, loan loss provisioning, and bank capital regulation. And you may find that at the moment, where you do something that is regulated by accountants, as in loan loss provisioning, or by the regulators, as in capital allocation, machine learning techniques are not there yet.
They don't fulfill one of the requirements, which is transparency, because for a lot of these models I can no longer show parameter estimates. With that, it may take some time until these models make their entry, but we are hopeful, because we think these models are reproducible. If there's no equation, who cares? In the end you just want to have a good outcome predictor to have a better system. So we have good hope that in 10 to 20 years these models will also make their entry into credit risk analytics for bank regulation. Regardless of that, of course, you can use them in loan pricing and anything that's not regulatory, and perhaps you should.
The only comment here is exactly Carlos's comment again: you need to really understand these models and understand the issue of overfitting to be able to estimate models that are predictive, predictive on other loan portfolios and predictive in out-of-time periods. And this is where COVID comes in. Let's close the loop to the beginning. COVID has changed so many things in our lives. It's not necessarily the case that the risk is higher at the moment. It might be higher, who knows. But it is the interactions that are different. What if these non-linearities are different? What does that mean for every single exposure? People have discussed it, but there's a big, big problem with it, and so the question is whether machine learning techniques are able to capture it. We need to find out. I don't want to bore you with all these stories now, but we can have a chat later. Last but not least, we do offer training events.
We also do it commercially for industry. So if anyone here is from industry and your employer is happy to sponsor you for a paid industry event, look it up. There's one coming up in September. In that workshop you will estimate machine learning models from the ground up. We will talk about things like standardization, how to calibrate, how to cross-validate, and how to tune the hyperparameters. And last but not least, LinkedIn is the center of our life, in a way, except for our families of course. These days we do a lot on LinkedIn, in the sense that whenever we have a paper published or something is coming out that could interest you, we share it. Likewise, if you have something interesting, share it with us, because we live from your experience and observations. With that said, it would be great if you could connect to Daniel and also to myself. So that's what I prepared from my end.
We can have a chat if you have any questions.
Sure, there are a lot of questions over here. First, thanks a lot for the interesting material on machine learning and the future trends that are going to be applied in banking. There are a few questions here regarding the models and the material that you have just presented. The first question is: are there any minimum system requirements for running these models? I think the attendees want to ask about the hardware requirements.
A lot, I think. One could spend a whole week talking about it. First of all, you need to put yourself into the shoes of an academic. You probably have the lowest entry barrier, because you have many liberties, so you can run it on any computer.
Usually universities have big cluster computers. It also means you can change the data. If it's too big for you, make it smaller; you do end up sampling. But if you're in industry, in a workforce, you have many constraints. It starts with whether you're working by yourself or in a big team. What if you have 100 models? In order to do all of what we've seen today, one of the requirements would be that the whole team uses the same language, so not R and Python, choose one; that you have coding standards and guidelines on how you communicate with each other to help people implement that in the workflows; and, because of all these versions of Python or R that come out, that you have environments where people use the same versions, so that the next person does not use the same code and get different results.
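One lightweight way to check that a team runs the same versions is to record them next to every set of results. The sketch below uses only the Python standard library; the package list and the helper name `environment_report` are assumptions for illustration, not something shown in the talk.

```python
import sys
from importlib import metadata

# Packages to record; this list is an assumption for the example.
PACKAGES = ["numpy", "pandas", "scikit-learn"]


def environment_report(packages):
    """Return the Python version and the installed version of each package."""
    report = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


report = environment_report(PACKAGES)
for name, version in report.items():
    print(f"{name}=={version}")
```

Comparing this output between two machines is a quick first check when the same code produces different results.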
And then, of course, once you enter the machine learning techniques, there are requirements where you need a lot of computing power. We have used one package, and there are many other packages. So very quickly this can get very complicated. We have different processors, for example, and of course some processors are quicker, and you need to make wise decisions. To come back to your question: one of the many reasons I didn't execute, for example, the last model is that it takes a lot of time, about 15 minutes, and in our seminar we don't have the luxury to wait.
Yeah, sure. Alright, thanks, Harry. I think the choice of the software also depends on the regulator, right? I remember that when I worked in a bank, at that point in time the regulator tended to accept commercial software rather than the testing and results produced by open-source tools. Because, like they said, there is a whole company behind it that backs up the output and things like that.
Yeah. Just two comments on that. First, the regulators are loosening up a bit, so they also acknowledge the entry of Python. But second, they are critical for a reason: the version control problem is a huge issue. Daniel and I have worked on two different computers with different packages for the book, and there were issues we had to overcome to get consistent results.
That is in terms of version management. And if you just let someone loose, not within the bank but outside of the bank organization, think about accountants and consultants, then there is a problem. So regulators are a bit critical for a reason.
Right. Thanks, Harry. Let's go to the next questions. Coming back to the spline regression and the polynomial regression: there are some questions about when you use the spline regression instead of the polynomial. Is it trial and error?
No. Regardless of what you're doing, be it regression modeling or machine learning, I'm not a big fan of trial and error. There are approaches, for example stepwise, forward, and backward approaches, that test all sorts of combinations. This is what engineers would do. If you work in credit modeling, you usually have a background in finance or economics, even if some have a background in engineering, and you should always try to ask yourself what the important drivers are and which regions make sense, so that you're not running a grid search from A to B but focusing on the areas that are important. There you get much more, because in the end everything is limited.
Your resources are limited, but you need to come out with the best model given the resources you have, and you only achieve that if you know what you're looking for. For example, if you read one or two papers on mortgage lending, you know very quickly that most of the activity is at an LTV of eighty and above. So I wouldn't spend too much on modeling the LTV below eighty, or below sixty, or below whatever your threshold is, but you need to be very careful in modeling everything above, and the higher, the more careful you have to be. The same goes for the discussion around missing values, selection biases, and outliers. There are a lot of problems in modeling, and in all these cases you build better models if you are a subject matter expert, for example on bank lending, or, if you work in telecom, on forecasting churn rates for consumers.
So whatever you're doing, having subject matter expertise is important, even for modeling.
Thanks a lot. There are a few other questions as well. It looks like the attendees are very excited. The next one comes from a problem that he or she faces in practice. The attendee says: I had a problem with p-hacking. When I change the way I split the train and test sets, the results change. Even after combining cross-validation and hyperparameter tuning, this phenomenon still happens.
What should he or she do?
Well, this is an important question, but it's also a very hard one to answer, because I would need to see the data and work with it. I can only make a general statement: whenever you get a suspicion that something's wrong, that usually tells you that there's something not robust. But that is a general rule for many economic analyses, and it's very hard to give a solution that works so that you then have a perfect model. Every model is imperfect by definition. It could also be that you are worried about decimals and so on, which are not really economically meaningful. Could be. But I would need to understand more about the problem.
Alright. Yes, I think the most important thing is to look at the data and the characteristics of the data you're working on to identify the exact problem, right? The next question: are there any limitations on the amount of input data for the model, the one that is a mixture of the neural network and the other models? The minimum system requirements we have already covered. Is there any need to upgrade the system as the amount of data increases, and how does that affect the run time?
Yeah. The computation time or the processing effort of course increases with the size of the data you process, whether because you have more features or because you have more observations. It increases in both dimensions. Also, with Python and R you need to be a little bit careful with how you use the working memory of computers.
For example, if you use a commercial solution that has been designed for panel data sets, what it does is very resource-efficient: it reads observations into the workspace, processes them, and takes them out. R and Python bring the whole matrix, the whole array, into the workspace, and it's not efficient to process it this way. But there are other packages you can include that allow for more resourceful, call it, data management. So that's also possible; you just need to know about the problem and the way to solve it. Regardless of that, of course, the more features and the more observations, the more computation time and computation resources you need.
Thanks a lot.
And just to add, there's no upper limit. There's some minimum, of course, but you can have models whose run time goes to infinity, so if you set it up the wrong way, you can wait and never get a result. So this is a very important issue.
Yeah, exactly. I remember the time when we worked on the two-step modeling and moved to new kinds of models with non-linear optimization. At that time I needed to run three machines at the same time, one on my laptop, one on my desktop at home, and one at the workplace, in order to run the different scenarios we were working on.
Yeah. You can always start small. You start with a small setup, time it, increase the size, and then you get a feeling for what size means for time, and you can decide how much time you want to invest to get a result.
Yeah, exactly. Alright. I think the next question is from a PhD student who is interested in the machine learning area. The PhD student asks: if I want to submit a paper using machine learning models, what's your advice on how to convince the reviewers to believe my results? I've seen many machine learning papers that claim to outperform out of sample, but who knows if they actually used the test data to fit the models.
That's the hardest part for us as a discipline.
The more people submit papers that are sloppy in their use of the methodology, the less accepted it will be. So basically you need to find an experimental setup that ensures integrity. A little bit comes with reputation. I think referees often have a feeling about people's work, and if you have done some work on validation, maybe they trust you more. One of the dangers is that you have a whole suite of validation tools, like our validation function, and you can apply them to any fit, but the biggest problem is overfitting.
You need to tell that story. You need to address it, so don't just present results, say this is a better model, and be done with it. You need to show why it's better and how much better it is. It's not easy, and it's often not done, because there are limitations. For example, for many applications you don't have enough data. In many cases you need three data sets: a training set and a test set, and then another test data set that you apply only once. You often don't have that luxury. So perhaps one piece of advice is to look for applications where you have a lot of data, because otherwise there are a lot of issues. If you have low-risk portfolios with not many default events or not many time periods, perhaps you don't even want to start doing machine learning. But in our case, we feel we have enough data.
So we do it. In this data set we have, just to give you an example, 15 thousand default events. We feel that's adequate, and we have an economic downturn and 16 or 17 years of data. But if I look at Australia, for example, you may have the same time period but only a few hundred default events, say ten a year, and then perhaps you don't want to apply machine learning techniques.
So I think in general it works better the more data you have.
Thanks a lot, Harry. I think the last question is from people interested in the book that you introduced and in the training events. They ask whether the book provides all of the details of the models that you presented and their applications.
Sorry, the question was?
The question is whether the book provides all of the details regarding the models that you presented and their application in practice.
Yes. The benefit of the book is that it covers the whole execution from front to end, so it's not a black box or anything. It takes you from the data to the results. There's no gap, and with that it's very honest: you will also find that some techniques and data perform better than others. So I think it's perfect for someone who wants to learn and build better models.
Yeah, sure. Thanks a lot, Harry. Some of the attendees are also interested in the training events. Can you put some information in the chat box about where they can find the event details?
Yeah, I'll put the link in the chat, okay? Also, if you connect via LinkedIn, I can send you an update if anything changes. It's a link for the event. And yes, connect via LinkedIn. We run events maybe once or twice a year, not so often. Also, we're all academics and we write papers, and it's interesting to see what people are working on. If you feel you've worked in a related field, machine learning or credit risk, it would be great if you connect and send a paper, so that we get a feeling for what you're working on and what the issues are, and stay in touch.
Yeah. Thanks a lot, Harry. I must say, I also attended the class before, but at that time I was learning R rather than Python. My Python practice is about 2 years old. Alright. Are there any other questions from the attendees? Okay, just a question: what if CP is still in lockdown? Do you mean the physical class? I think the masterclass can be online, right?
Yeah, it's on Zoom. We are hopeful that one day we will all be vaccinated. So it's always a call for everyone to get vaccinated and let the borders open. At the moment it looks like we are more or less following the European and American model, in that we are now getting vaccinated, and hopefully this whole COVID story comes to an end very soon. That's all I can say. And then, of course, we would love to see you in person, not only for the workshop but also if you want to meet up. Daniel is in Regensburg, Germany, and many of you might be in Germany or Europe, and I'm in Sydney, Australia, and we all travel quite a lot, COVID permitting. So if you want to catch up in person, let us know where you are and we'll find each other.
Yeah, I very much hope that the situation is resolved so that we can also do it in New Zealand as well. I very much hope for that because I love the social interaction after class.
We have informal discussions and can pick up a lot of the ideas and problems that we face in practice in banking, right?
Yeah, and you know, credit risk has lots of stakeholders. There are regulators, and there are industry practitioners on different ends: you have the banks, the fintech companies, the rating agencies, and it goes on and on. Everyone is interested in credit. You could do a quick check on LinkedIn of how many people are actually working in the industry. It has become an industry globally; it has 100 thousand members. So credit risk is not niche and small anymore. It's huge. I can only encourage everyone to be part of it.
Right. And the salary is high as well, I would say.
Okay. I've got a question here: will the recording be shared with the attendees? Ross, I can answer this question for you. We have a YouTube channel for our network, and later on the recording is going to be shared on the YouTube channel, so you can review the video anytime you want. I think I'll upload the video today or tomorrow. If we do not have any further questions, then I think we can end the webinar here. Thanks a lot, Harry, for giving your time, giving a great talk, and inspiring us about using machine learning in credit risk modeling. And thank you all for joining us and attending the webinar. Let's make our events successful, and hopefully we will organize even more.