Big Data is a Big Deal

Episode 2 June 01, 2024 00:28:07

Hosted By

Michael Hatfield

Show Notes

"Big Brother is Watching!"  Many times I have been aware of massive database systems that can process enormous amounts of data.  Much efficiency and knowledge has been derived from these huge systems for any size of business.  I wanted to know more!

In this episode, Michael Hatfield interviews Mr. Shelby Thornton, a long-time data engineer who knows his way around an exabyte. He is one of those exceptional people who have made Big Data their life’s work, and what a task it must be.

You’ll not want to miss this latest discussion on Big Data-A Big Deal.

Please remember to go to our new YouTube handle, MyRealTalkShow, that’s MyRealTalkShow at youtube.com, and touch the Subscribe button! You can also find past-aired shows at our handle MyRealTalkShow on youtube.com.

Hot topics of the day, amazing people, and we talk about the latest in real estate, too, each week as Michael Hatfield hosts the “Real Estate and MORE!” show.

The weekly Saturday show of two episodes airs every Saturday on the San Francisco Bay Area’s largest AM radio stations: KGO 810 AM from 9:00 a.m. to 10:00 a.m. and KSFO 560 AM from 5:00 p.m. to 6:00 p.m.

The Michael Hatfield RE/MAX Team is an experienced real estate broker choice for home buyers and sellers in the Bay Area. If topics of the day, interesting people, or Bay Area real estate fascinate you, you will want to tune in to each episode.

View the Michael Hatfield Homes Website or contact Michael directly via email.

Show 42, Segment 2, originally airing June 1, 2024.

Episode Transcript

[00:00:12] Speaker A: Welcome back to the Real Estate and More show. Thank you for listening this morning. Today I have a good friend who is very knowledgeable in the field of big data, most recently director of data engineering for a well-known multinational corporation. I've known this individual for probably three, four decades, and he's come in today to share with us an overview of what he has learned about his incredible venture of big data. Big deal. Welcome to the show, Mr. Shelby Thornton.

[00:00:44] Speaker B: Well, thank you, Michael. It's a pleasure to be here. I think there's a lot we can go through with big data. It starts out with small data and works its way up.

[00:00:52] Speaker A: It sure does. For all the years I've known you, you've been involved in the field of technology. You have pretty much focused on databases from day one. Please give our listeners an overview of your background with engineering of relational database products, and then we'll get into the big data.

[00:01:11] Speaker B: Okay, well, right out of college, I went to a technology startup that was producing relational database products. I was an engineer there, so I was developing the internals. In fact, one of the things that I was responsible for there was creating a test system that was the first relational database ever to exceed 100 transactions per second on a TPC benchmark. Basically, it's a financial industry benchmark. Now you get thousands of transactions per second. But back then, 100 was a big deal.

[00:01:43] Speaker A: A big deal. I'd imagine 100 transactions per second is pretty fast.

[00:01:48] Speaker B: It was then; it is now kind of a snail's pace, I understand.

[00:01:51] Speaker A: Well, what kind of databases have you worked on from day one, Mr. Thornton?

[00:01:56] Speaker B: Basically, I've been a relational database guy, and relational databases have actually grown into big data.
So I'm sure everybody in your audience has heard of Oracle. Oracle started with relational databases. It's still their bread and butter, but now, you know, they've moved into HR systems. They've bought a lot of different companies to become a huge corporation, but their bread and butter, their base, is relational databases.

[00:02:22] Speaker A: What do you mean when you say relational database? What do you mean by that?

[00:02:25] Speaker B: It's a way of laying out data. If you have records that have, I don't know, let's say it's employee, department, and accounting, you could have one record that has all that in there, but then you'd have a lot of duplication of data if you had 100 employees in the same department. So you break it out. You have an employee table, which just has employee data, and then you can have a table that has department data, and then you can relationally relate those two tables together so that you don't have a duplication. It's much more efficient and is less prone to error.

[00:03:04] Speaker A: Wow, pretty interesting. So are these databases primarily for businesses? They're primarily for businesses, I would imagine.

[00:03:13] Speaker B: Absolutely, yeah.

[00:03:13] Speaker A: With us in real estate, ours is not big data. It's more relational data, and it operates on a customer relationship basis. I imagine with what you're doing, there's a lot more purposes for the data that you get in.

[00:03:31] Speaker B: Right. And there's a lot of different ways of using data. Early on, there was a little saying that we had that's still true today: he who has the data wins.

[00:03:42] Speaker A: Kind of interesting.

[00:03:43] Speaker B: So understanding, for instance, the example that you just gave, and that is you have customer data. Well, there's a lot of data about your customer that isn't necessarily in your database, but it's out there.
So, you know John Smith, you know Mary Stanton, all those folks, they do a lot of things on the Internet these days, right? And all that data is out there, and somebody like a Google or a Facebook can take that data and relate it back to that person. So it could enrich the data. If you wanted to pay Google to give you this data, you could learn a lot more about your client, which might tell you some of their proclivities, where they might really want to live, what kind of house they, you know, what type of things they look for when they're just searching the net that might relate to a housing purchase. So there's a lot of data out there, and that's where the Internet came in and really exploded data, because before, as we discussed earlier, it was pretty simple data. It wasn't anything that was going to really, you know, shake the world, but it was going to help you run your business. Now you can really take a lot more data and pull things together. And sometimes it feels a little insidious, to be honest.

[00:04:55] Speaker A: It's evolved a lot. So the primary purpose, pardon me, as I understand it, of a big data database is for a company to use the data in their systems to improve operations, provide better customer service, create personalized marketing campaigns, and take other actions that ultimately can increase revenue and profits.

[00:05:19] Speaker B: Absolutely. What you can do is, you know, for the customer, just the customer data itself. The more you know about your customer, the more you can target your marketing campaigns to that type of person, and they would speak to them. By the same token, let's talk about something that's completely different from that, and that would be security. Well, now you can take in security data from all over and be able to relate that back and go and see what the norm is.
And when you see something out of the norm, you go, I need to take a look at that, because there might be a security breach, based on logs that you're getting from a lot of different sources. I'm sure that you read in the paper all the time about, oh, this company just got breached. You know, someone just kind of broke in. Well, what you want to be able to do is just take the security data that's there and either be able to look back and go, okay, how'd they get in?

[00:06:11] Speaker A: Or are they inside?

[00:06:13] Speaker B: Or can you actually look at that on what we call a streaming basis, see it in the stream, like, hey, that's out of the norm, and shut that system down before they have a chance to get in?

[00:06:22] Speaker A: Wow, that's pretty amazing. I recall back flying the old TriStar, the L-1011. They had Rolls-Royce there, and Rolls-Royce has set an example with big data. They started moving the data of 100 parameters into the flight data recorder so they could pull it at the end of the flight and determine how well, or if there's any issues and deficiency with those engines. Nowadays, I understand that using the cloud-based solutions, they are able to take it on an ongoing basis and not just wait until the flight is over, right?

[00:07:01] Speaker B: I mean, what you're describing is what we call the Internet of Things, IoT. And factories do this, airplanes do this. A lot of different entities will take data, and, going back to streaming data, they will stream the data. And they're able to actually take a look at that stream and go, hey, this is out of the ordinary. Why is, in your case, the engine running an extra two degrees hotter? Normally, it would run at this temperature. Now it's running a little hot. We need to take a look at that when that plane lands to see if there's an issue with something, you know, that needs to be replaced, needs to be adjusted.
But, you know, the same thing happens on a factory floor. If you have a lot of these sensors that are streaming data back to, let's call it the mothership, it streams the data back. And you know what you're expecting that data to look like. When that data goes outside those parameters, now you need to take a look at what's going on at the factory. There may be something different that, you know, you need to adjust.

[00:07:58] Speaker A: Well, I noticed that nowadays, with the streaming on the Boeing airplanes, they can stream that data right directly to the operational control center while the aircraft is in flight. And they also have been able to increase the number of parameters of measurement on those engines considerably, from, like, 100 potential to as much as, you know, 500 or 1000 parameters, so to speak. So it has helped to improve efficiencies in a lot of ways. And as the databases have evolved from just a standard relational database to big data, I imagine that it's provided a lot of help to companies utilizing this type of database.

[00:08:42] Speaker B: The IoT type of data you're discussing is twofold. One is streams, and they can read the stream, so they can decide on stuff that's going on at this very moment. They can also take that data and store it in a big data platform, which is still a relational platform for the most part. They can store that, and then they can aggregate that data or summarize it and be able to look and see what the averages are and what is trending. And so that's where the big data comes in, from the storage, so that you can actually go back and do data analytics on it, or data analysis, and be able to go, yeah, these things are trending this way, or now they're trending down. Where did that trend come from? And you can dig into that, because now you have the data.

[00:09:23] Speaker A: Folks, on our show today, we have Mr. Shelby Thornton.
He most recently was director of data engineering for a very large company, a household name. Now he is a consultant. He likes to focus on the security end of big data, which can be a big advantage to a company trying to protect what it has in the way of assets. He'll be able to be reached through us, if you wish. And the number here is at the end of our show. And what is basically big data, Shelby, what is it?

[00:09:56] Speaker B: Well, when the Internet came along and really started to explode, all of a sudden your data streams really exploded as well. And you've got three types of data. There's one that's most prevalent. It's structured data. And when I say structured, I mean it's going to be in such a way that it's tagged. So you can say, like, name, and you put a name tag on it, you get a name. An address tag, you get an address. But you tag all of those data pieces within that object. There's also semi-structured data, where you get some of that, but you don't have all. So you may have a catch-all area in that, you know, that data record, shall we say. And then there's completely unstructured data. Now, unstructured data is a lot harder to deal with. Not that it can't be, but it takes a lot more processing power to really pull it apart. And at the end of the day, you're still going to tag that to make it semi-structured, at least, so that it's usable.

[00:10:51] Speaker A: It's beyond me when you're talking like this. When you say unstructured data, does that mean that you haven't set the parameters up ahead of time, or that you're setting up new parameters to derive some result that you're hoping to find?

[00:11:05] Speaker B: Well, you're pulling in data that you really don't know what's in it. So at that point you've got to analyze that data to provide some structure around it. But when I'm talking about structured versus unstructured versus semi-structured, that's just the raw stream.
The raw data coming in that you're capturing can be in those different formats. Obviously structured is the easiest to deal with because then it's readily available and usable. You don't have to do what we call ETL with it, which is extract, transform, and load. So you can basically take it as it is. If you want to turn that into something that is more relationally friendly, shall we say, there's some work that has to be done there, but it's a lot less to be done. And there's a lot of companies out there that are now producing products to do this. There's a lot of open source work that's being done out there. I mean, the biggest one that's been in use for big data is Hadoop. Hadoop is a platform that allows you to do fault tolerance, so that you have multiple copies of each piece of the data. So with, like, a disk drive, and I'm sure everybody's had this happen to them before, your home computer blows up a disk, and you lost the data on it. Well, Hadoop does it in such a way that it's fault tolerant: if one disk blows up, a copy of that is elsewhere on a different system, so you haven't lost it, and it can then be regenerated. But it's designed for just huge datasets, not the stuff that we were used to doing with just a financial record, for instance. And these datasets...

[00:12:46] Speaker A: I'm sorry, are most of the cloud-based databases that way, to where something goes up into the cloud and it has a separate redundancy to it in case that set of data is lost?

[00:13:00] Speaker B: It has to be done by the person or the company that is using the cloud. Whether it's within AWS, whether it's Azure by Microsoft, whether it's Google Cloud Platform, you really have to set it up yourself. They provide the facilities for you, but you need to really do it yourself as part of setting up your systems and design.

[00:13:20] Speaker A: Wow, a lot to know, right?
Didn't it used to start out with the three Vs, meaning big data is greater variety, increasing volumes, more velocity? And then recently, I guess, there were a couple more Vs that were added to that.

[00:13:40] Speaker B: Well, velocity is, it's really more about data size, right? Velocity means you're getting a lot of data and you have to have a big enough pipe that you can take that in without losing it. I don't necessarily look at big data in that respect. Another way to look at big data, for instance, is let's say that you are a company with a lot of products, and most companies have moved their products to being cloud based. You write your subscription, you're no longer loading a disk up on your PC, right? I mean, you know, for those of you that, for instance, bought Quicken back in the nineties, you got a disk, you put it up, everything was on your computer. Now, if you're going to buy that product or a lot of other offerings that are out there, you basically get a subscription. On your web browser, it brings up the interface for it. And so what they're doing now with that, though, is they can get telemetry data from those products. They can figure out where you're getting stuck, where you spend most of your time, what seems to be the features that are most used. So they can take all of that data, give it to a data scientist, which I am not. But these guys, you know, they're really good at taking that data, running their algorithms across it, and figuring out possible product changes that would then enhance their product. The next thing that happens on big data, now that you've got a ton of data, is you can do machine learning. You can understand from how people are using products, how people are using the Internet, you can kind of learn. And that's leading right into the latest thing, which is AI. So it's an ever-evolving thing, but at the end of the day, it's all about data.
If you don't have the data, you can't do any of that.

[00:15:30] Speaker A: What's kind of interesting, now that you said there's Quicken, and you load Quicken, and it was fine, it was on your desktop. And now they're saying QuickBooks, that in 2025 and thereafter, they won't offer the desktop version anymore. And so in essence, they gain your data and your usages of the program, and then they take and either develop the product or develop another product based upon the information that they learn from people that are now forced to use their books on the Internet.

[00:16:07] Speaker B: Yeah, I mean, the Internet has made it possible to do that. And for them, it's much easier. If they want to make updates, they don't have to send you anything, you don't have to download it. It just gets done in the product and it's seamless to the customer. So it's easier to support, it's easier to develop, and it's also easier to track and to be able to do the telemetry I spoke of earlier. So they can kind of see, you know, when you're setting something up, especially. QuickBooks is a great example. Let's say you're setting up, I don't know, your payroll for the first time. Well, there's a lot to go through to set up payroll on QuickBooks. And a lot of times customers get stuck and maybe they give up. Well, if they see one person do that, they don't necessarily, you know, they go, well, they had a problem and maybe they'll call us and all that. But if they see 10,000 people get stuck in the same place, now they know they've got to make that more user friendly.

[00:17:05] Speaker A: Interesting. It's my understanding big data started with large data sets back in the 1960s to seventies, but large data sets back in the sixties, the seventies were relatively small. Were there even things such as gigabytes then?

[00:17:21] Speaker B: There were, but they spanned so many disk drives. You know, a gigabyte was huge back in the sixties.
I don't know if anybody remembers the first PCs that ever came out. They had, like, 128K.

[00:17:36] Speaker A: Yeah.

[00:17:36] Speaker B: Of memory. And so if you had 256K, oh my God, you were in high cotton.

[00:17:44] Speaker A: Very interesting. So now it goes, you know, you got kilobytes, you got megabytes, you got gigabytes. And you got what else?

[00:17:53] Speaker B: Well, you go from gigabytes to terabytes, terabytes to petabytes, and petabytes to an exabyte or exabytes. And basically each time you make that jump, you add three zeros.

[00:18:06] Speaker A: Wow.

[00:18:07] Speaker B: So a thousand gigabytes is one terabyte, and a thousand terabytes is one petabyte. Now, as an example, when I was working at that multinational company doing just the security data, we were taking in about 20 terabytes of data from the different operating companies around the world per day.

[00:18:30] Speaker A: Wow.

[00:18:31] Speaker B: That meant we were taking in about one petabyte a week, which meant to get an exabyte out of that was about half a year, maybe a little less than half a year to get an exabyte. That's a lot of data, man. And so you really have to have a lot of processing power to get through that. And if you don't have it laid out correctly, you're using something that may have worked well on a terabyte of data a day. That product may start falling down at 20 terabytes a day. And you've got to figure out other ways of going about it.

[00:19:04] Speaker A: You mentioned earlier that it spanned many disk drives just to get a gigabyte. Where is all of this information stored? Where are these data sets stored? Are they stored in the cloud, in data centers? Where?

[00:19:23] Speaker B: Well, first off, the cloud really is just somebody else's data center.

[00:19:30] Speaker A: Okay.
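
The unit ladder Shelby describes, where each jump adds three zeros, can be checked with a few lines of arithmetic. The sketch below is an illustration and not something from the show: it assumes decimal (SI) units and takes the quoted 20 terabytes per day at face value. The names `UNITS`, `BYTES`, and `DAILY_INGEST_TB` are made up for this example.

```python
# Decimal (SI) units: each step up multiplies by 1,000 ("add three zeros").
UNITS = ["kilobyte", "megabyte", "gigabyte", "terabyte", "petabyte", "exabyte"]
BYTES = {name: 1000 ** (i + 1) for i, name in enumerate(UNITS)}

# Daily intake quoted in the interview, taken at face value (an assumption).
DAILY_INGEST_TB = 20

bytes_per_day = DAILY_INGEST_TB * BYTES["terabyte"]
tb_per_week = 7 * DAILY_INGEST_TB
days_per_petabyte = BYTES["petabyte"] // bytes_per_day
days_per_exabyte = BYTES["exabyte"] // bytes_per_day

print(f"{tb_per_week} TB per week")
print(f"{days_per_petabyte} days to reach 1 PB")
print(f"{days_per_exabyte} days to reach 1 EB")
```

At a flat 20 terabytes per day, this arithmetic gives 140 terabytes a week, 50 days per petabyte, and 50,000 days per exabyte, so the petabyte-a-week and half-a-year figures quoted on air are best read as rough, off-the-cuff estimates.
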
[00:19:31] Speaker B: So to answer your question, the simple answer is it's in a data center somewhere, but they are now outfitted to be able to have exabytes' worth of storage.

[00:19:40] Speaker A: Wow.

[00:19:41] Speaker B: And take AWS as an example. Anybody who has watched a football game in the past couple of years, they hear AWS and the analytics and all that. Well, AWS has a huge amount of storage. Not that Google doesn't, because Google's, you know, a bigger company than AWS, than Amazon, as is Microsoft. So they all have their own proprietary way of storing the data. But at the end of the day, it's just in one big data center in each case, or I shouldn't say one big, it's in data centers around the world that are run by those companies. You can do it on your own. If you're a large company that can afford the hardware and you want to support it, you can do it on your own, and you get the same result at the end of the day.

[00:20:28] Speaker A: Wow. Pretty interesting. I find it fascinating. Well, I had grocery stores at one time, and, you know, I wanted the information from the cash registers. I wanted to know if those registers had integrity when a cashier got done with them. And so I developed the software to run that on the computer. And I remember there were big hard drives that we would put into this console in order to store the information. And they were big. You put them in, you turn them, and it would go from there. And then I would eventually be able to get the information that I wanted from there. And it was really useful information, because if you didn't take an inventory every couple weeks, you could see that all of a sudden you're buying a lot more for a certain department. That's going to indicate from those parameters that you could have a problem there, that you might have a loss of product. Pretty interesting.
There's a lot back then that goes with today. Only today, it's more evolved, in so many different areas.

[00:21:40] Speaker B: Your cash register data would be considered transactional. You have a transaction of 20 items, so you have a line item for each one of those items. The total of the overall transaction would be whatever the sale was. So that's what you were storing, and then you could kind of cross-reference that against what you were buying and see if they matched up. I sold 20 jars of peanut butter, but I just had to replace 30. Where did those other ten go, you know? But again, you need the data to be able to go, someone's stealing ten jars of peanut butter from me, whether it's an employee, whether it's somebody coming in off the street and shoplifting. The only way to do that, though, is to have the data.

[00:22:22] Speaker A: You got to have the data. Well, that's a lot of data. You probably don't need every parameter that's out there. Each company has to decide what they want to collect by way of data. That would seem like a big job. How does that interface with the actual corporate structure of a company that is multinational? Who actually decides, well, I want you to draw parameters from that particular area of the business?

[00:22:49] Speaker B: Well, rather than pull the parameters from that area of the business, as a data admin, as a data architect, you're going to pull everything, and then you're going to work with the upper management to decide what parts of that data are most important. You don't drop it, because you may want it later on, but you're going to decide, I need to focus on A, B, C, and D. E, F, and G are still going to be stored, but they may not be used, or they may be used later.
When we're doing what we call a drill-down, we want to look into the data as we go deeper and see what other attributes may be useful in analyzing that data.

[00:23:30] Speaker A: Wow. I know recently, I don't know how many years ago, that IBM bought The Weather Company, and when they bought The Weather Company, they gathered information at the time from 100,000 weather monitoring devices and 2.2 billion data gathering points. And this system has become very effective because it provides airlines, as an example, the ability to know what the weather is going to be when you get there. And should you already think about diverting, or should you just wait it out on the ground until that weather has passed? That's pretty useful information. When you're burning, you know, 3000 gallons an hour carrying 300 people, you know, it's pretty important.

[00:24:14] Speaker B: No, again, in my eyes that falls into IoT, the Internet of Things. You're basically taking telemetry data of a different nature for weather. You're getting wind speeds, you're getting temperatures. The forecasting, I don't know. Forecasting, I think, is the hardest part of that business. I look at the people on the 10:00 or 12:00 news that are doing weather and I want their job. It's the only job I know of where you can be wrong more than 50% of the time and still keep it.

[00:24:46] Speaker A: But actually now they're getting better, and I think it's probably because of the number of parameters that have become available from the big data sets.

[00:24:57] Speaker B: Right? Well, it's available data and being able to take that data, do the analytics on it, see the trending. You want to be able to trend that. You know, for instance, this year Anchorage had the most snow it's ever had in recorded history. Usually it's about 102 inches a year. It's already about 110. And it's, what, beginning of February.

[00:25:18] Speaker A: Wow, that's amazing.
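
The idea of knowing what the data normally looks like and flagging readings that fall outside it, raised earlier for engine temperatures and factory sensors and again here for weather telemetry, can be sketched as a rolling check over a stream. This is a minimal illustration under stated assumptions, not a description of any system mentioned on the show: the function name `find_anomalies`, the window size, the threshold, and the temperature readings are all hypothetical.

```python
from collections import deque
from statistics import mean, stdev


def find_anomalies(readings, window=5, threshold=3.0):
    """Yield (index, value) for readings far outside the recent norm.

    window and threshold are illustrative choices, not values from the show.
    """
    recent = deque(maxlen=window)  # rolling window of readings treated as normal
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = stdev(recent)
            # Flag the reading if it sits more than `threshold` standard
            # deviations away from the rolling mean.
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value
                continue  # keep the outlier out of the baseline
        recent.append(value)


# Hypothetical engine temperatures; the last reading runs two degrees hot.
temps = [600.0, 600.2, 599.9, 600.1, 599.8, 600.0, 602.0]
anomalies = list(find_anomalies(temps))
print(anomalies)
```

Because the function is a generator, it can consume readings one at a time as they arrive, which is the streaming case; running the same check over stored history is the look-back case.
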
[00:25:20] Speaker B: But they're tracking all that, and then the temperatures, you know, it's amazing, the temperature tracking that can be done now. Because you have the data and most of it's real time, you can bring it in and you can track it on a per-millisecond basis. Not that that's that useful, but every second, every 10 seconds, you can kind of decide what your parameters are going to be, every five minutes, every ten minutes, and you kind of see, wow, it's getting cold real quick, or no, it's being pretty steady state. So back to your point about being able to fly in and feel comfortable that the weather is going to be amenable to you actually landing, it's a...

[00:25:56] Speaker A: Very, very good benefit. So I think we could say that any company that utilizes and implements big data is going to advantage themselves through efficiency, through a lot of various areas. Any of you folks out there who are hearing this here in Silicon Valley, as well as all around the Bay Area, if you need somebody to come in and take a look at your company, you can tell already this man has it together. Shelby Thornton. He used to be director of data engineering for a multinational corporation. And to end this session, what do you have to say, Mr. Shelby?

[00:26:34] Speaker B: What do I have to say? I say again, he who has the data wins. And there's more and more data every day. It expands. We find new uses for it. We find new ways to analyze it. It's just, you know, the sky's the limit, at least right now.

[00:26:50] Speaker A: I understand. At various times in my life, I've been aware of massive database systems that can deal with enormous amounts of data. Many important benefits have been derived from these systems. Mr. Shelby Thornton is one of these exceptional people who have made big data his life's work. And what a task it must be. I don't want that job.
Thank you for being on the show and for sharing, Shelby.

[00:27:15] Speaker B: Thanks for having me. I appreciate being here.

[00:27:17] Speaker A: You've been listening to the Real Estate and More show: incredible people like Shelby Thornton, important topics like Big Data, Big Deal, and of course, we talk about Bay Area real estate. Listen to archived Real Estate and More shows. And Michael Hatfield, that's michaelhatfieldhomes.com radio. The Real Estate and More show is now a podcast on demand, on Spotify, Amazon, Apple, iHeart, Pandora, and most major podcast platforms. I'm your host, Michael Hatfield. I do hope you tune in next Saturday morning, and have a blessed week.

[00:27:59] Speaker B: All right, well, how'd it go?
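
For readers who want to see the relational layout described near the start of the interview (an employee table, a department table, and a relationship between them so the department details are not repeated on every employee record), here is a minimal sketch using Python's built-in `sqlite3` module. The table names, column names, and sample rows are assumptions made for illustration; nothing here reflects a product discussed on the show.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (
        dept_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        dept_id INTEGER NOT NULL REFERENCES department(dept_id)
    );
""")

# The department is stored once, however many employees belong to it.
conn.execute("INSERT INTO department VALUES (1, 'Accounting')")
conn.executemany(
    "INSERT INTO employee (name, dept_id) VALUES (?, ?)",
    [("John Smith", 1), ("Mary Stanton", 1)],  # hypothetical rows
)

# Relate the two tables back together with a join.
rows = conn.execute("""
    SELECT e.name, d.name
    FROM employee AS e
    JOIN department AS d ON d.dept_id = e.dept_id
    ORDER BY e.emp_id
""").fetchall()
print(rows)
```

Renaming the department then means updating a single row in `department` rather than every employee record, which is the efficiency and error-reduction point made in the interview.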
