Presented by Guillermo A. Fisher (@guillermoandrae)
August 15, 2019
The data lake is a design pattern that supports the storage of data -- both structured and unstructured -- so that it can later be transformed, analyzed, and explored. As with some other popular trends in tech, there's a lot of confusion about what data lakes actually are. This session provides an introduction to the data lake pattern as well as some of the services provided by AWS that will allow you to deploy your own data lake.
About Guillermo A. Fisher
Guillermo is a Cloud Enablement Leader at Cloudreach who leads a team of data scientists, data architects, and data engineers. He is also the founder of 757ColorCoded, an organization focused on helping people of color achieve careers in technology. He's a husband, father of four, fan of clean code, and a consumer of very silly comedy.
This presentation is brought to you by RingCentral Developers. Revolutionize the way your business communicates with RingCentral's APIs for voice, SMS, team messaging, meetings, and fax. Don't just be a developer. Be a game changer with RingCentral Developers.
All right. So thanks, everyone, for attending the talk. My name again is Guillermo Fisher, and we're going to talk about data lakes in AWS. A little bit about me before we get started: my name is pronounced Guillermo, kind of like Guillermo del Toro. I have been in tech for almost 20 years and I've served in several different capacities. I've moved from graphic design over to front-end development, to back-end programming, and into architecture, or at least engineering, and people management. I currently work for Handshake; I'm almost done with my first week, and so far, so good. I lead a fully remote engineering team for Handshake. Last year I founded the 757ColorCoded meetup, aimed at people of color in Southeast Virginia. Like it was noted before, our aim is to help people of color achieve careers in technology.
We started out just wanting to create a safe space for folks to share their experiences, and we have begun to do much more. Now we've partnered with the city of Norfolk to offer some programs at the library, teach some sessions, do some workshops, and we've partnered with some other organizations locally to do some things. We've also partnered with AWS to offer some free training to people in the area. So it's been great. As I also said, I'm a husband and father of four, and you can learn more about me at vklyn.dev, and more about the organizations I mentioned at the handles that are listed on this slide. So before we begin talking about data lakes, I think it's important for us to spend some time talking about why the data lake solution exists and what problems it solves.
I also want to make sure I cover some important concepts within data science and analytics to clear up some confusion that might exist around definitions and roles, because it's a fairly popular topic right now and there are tons of buzzwords flying around that you may have heard but don't necessarily understand. So let's just dive right in. Modern applications generate a ton of data, and that data exists in many forms. We start referring to it as big data when we're dealing with gigabytes and petabytes of data. And just to note, big data can refer to the actual size of the dataset or to the technology associated with processing a large amount of data. So there's structured data, which adheres to a schema. A good way to think about that is to consider a table with columns and rows.
There's also unstructured data, which doesn't fit in any predefined schema or isn't necessarily organized in a predefined way. And when you consider the different types of data that are associated with software (application and error logs, transactional data, traffic data, customer data, et cetera), you begin to understand that even though there is inherent value in all that data, it can be difficult to actually draw insights from it. In many cases, companies have been collecting a ton of data for years, and many of these companies, of various sizes, have policies of not deleting any of that data. So they've got all of this data stored up, and they want to use it to answer questions about how their businesses are performing and how, for instance, their customers are interacting with their products. They want to essentially use that data to make decisions about their future. And when I say drawing insights from data, what I'm referring to is data analytics. Data analytics is the science of analyzing data to draw conclusions from it. And these days, more and more companies are realizing that they need to leverage data analytics to make data-driven decisions and get an edge over their competition.
So a company decides they want to start making sense of all the data that they've been storing. In many cases, the first thing they do is hire a data scientist, or a data analyst, or one or two data engineers. And unfortunately, many times they expect these data professionals to be magicians who can just pore over decades of disparate data and provide answers in a matter of weeks or even days, without understanding the amount of hard and often tedious work that needs to be done in order to actually start gaining insights from the data.
Well, let's just take a look at what a data team usually looks like. You've got data engineers, data analysts, and data scientists. Data engineers are generally responsible for building and maintaining architectures that support big data systems, such as ETL pipelines (ETL is extract, transform, load, for anybody that doesn't know). They create applications and tools to support data scientists and data analysts, so their customers, in a sense, are the data analysts and data scientists, and they collaborate with data scientists to build algorithms to derive meaning from datasets. Now, that's not all data engineers, but there are some that lean towards data science and have experience with machine learning models and that kind of thing. Then you've got data analysts, who are responsible for collecting and cleaning data from different data sources for analysis; identifying, analyzing, and interpreting data to uncover patterns; and creating domain-specific reporting in the form of visualizations or dashboards that executives or other people in the company can use to understand the data. Then, of course, there are data scientists. There's some crossover between what data analysts and data scientists do, but data scientists specialize in collecting and cleaning data, and in training and deploying models to predict outcomes. When I say models, machine learning models, I'm referring to the output of data algorithms. We'll get into that a little bit later; I'll talk about some of that at a high level without getting into too many details. They're also responsible for communicating their findings to business stakeholders.
So like I said, companies start hiring data folks, and oftentimes they prematurely hire data scientists specifically. I've seen this done firsthand myself. A company will make the decision to move forward with a loosely defined data strategy; they hire a data scientist, and that person spends months trying to build out the infrastructure they need to support the creation of those data algorithms, instead of actually spending time building those algorithms. As a result, both the company and the data scientist become frustrated, and sometimes just give up. So there are steps that need to be taken in order for an organization to get ready for AI or machine learning or what have you, and they can be represented in what is known as the data science hierarchy of needs. That is often represented as a pyramid, which is what I have on the screen here, where a strong foundation, the bottom half, has to be built in order to support everything above it. So I'm going to talk about each of these layers individually, what's involved in them, and who's generally responsible for each of the items in those steps.
So we'll start at the very bottom with the collect step.
Here we're dealing with getting all of the data, and that data can come from sensors, like Raspberry Pis, or an architecture that supports centralized logging, or external applications like, say for instance, Google Analytics. Different groups within an organization work together to actually determine what data is relevant to the platform that you're trying to build out. And once those data sources have been identified, data from those sources needs to actually move through the system somehow, right? So a destination needs to be identified, ETL scripts have to be built to move data from the data source to this new destination, and infrastructure also needs to be built to support the collection and the flow of data through the system. So at some point, decisions need to be made about how the data should be stored. Do they want to store it in a database?
What kind of database? Is it, again, structured or unstructured data? Do we use a traditional RDBMS, or do we use document storage, something like Mongo, or whatever it is? So the move/store and collect steps are usually the responsibility of the data engineers; they have the skills needed to write the ETL scripts, and they generally know enough to stand up the resources that they need. There may be data architects in there that will help out with that, depending on the organization, but in many cases, the engineers have the skills needed to stand up those resources.
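As a rough sketch of what one of those ETL scripts looks like, here's how raw data might be landed in the lake with date-partitioned keys. The bucket name, source names, and layout are illustrative assumptions, not something from the talk:

```python
from datetime import date

# Hypothetical bucket name -- illustrative, not from the talk.
RAW_BUCKET = "elephant-express-data-lake"

def raw_key(source: str, table: str, day: date, part: int = 0) -> str:
    """Build a date-partitioned S3 key for raw data landing in the lake.

    Partitioning by date keeps later crawls and queries cheap, because
    tools like Glue and Athena can prune partitions they don't need.
    """
    return (
        f"raw/{source}/{table}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"part-{part:05d}.json"
    )

# In a real ETL script, the upload itself would be a boto3 call, e.g.:
#   boto3.client("s3").put_object(Bucket=RAW_BUCKET, Key=key, Body=payload)
key = raw_key("postgres", "orders", date(2019, 8, 15))
print(key)
```

Running this prints the key `raw/postgres/orders/year=2019/month=08/day=15/part-00000.json`; each nightly (or streaming) job would write new parts under the day's prefix.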
At the explore/transform step, once that move/store step is satisfied, you can begin to actually look at the data. Some transformation usually has already occurred in the move/store step, but at this stage we're starting to take a closer look at what data we have, look at anomalies in the data, and go through the process of what's called cleaning the data, and re-cleaning the data, to prepare it for the next phase. You eliminate outliers in the data and do that sort of thing, so you can actually start to use it. The data analysts and data scientists usually work together at this level to determine whether or not any steps at the bottom of the pyramid need to be tweaked. So let's say they identified some anomalies: they've got to go back to the beginning and tweak those ETL scripts so that the pipeline produces cleaner data. That can be a cycle that happens a few times before they actually get it right.
The next level is the aggregate/label level. Here we actually get into performing the analytics, segmenting the data, and defining features. Features, within the context of data science, are measurable properties or characteristics of a phenomenon being observed. We also start working with training data, which is going to be used to build the machine learning models, which are, like I explained before, the output of these algorithms. After this step, we're ready to start building our models, and then do some testing and experimentation to improve upon those models. Data scientists are usually involved in these two steps, the learn/optimize and aggregate/label steps. And once you've satisfied those, you're ready to do the cool AI and deep learning stuff.
So organizations have to build a solution that will support the work that they're trying to do. Once they've understood those steps, what's involved, and who's involved, they're ready to get going. But there are some requirements for building that solution. Data has to be stored in one central location; it's easier to manage the data, and access to that data, when it's all in one place. Data has to be stored securely. Obviously, security is top of mind for everyone these days; it's an absolute must. And access is an issue: we need to make sure that the data is accessible to different kinds of users, because we've got, again, data scientists and analysts who need to get to the data, data engineers, and maybe some other folks within an organization who will need access to the data. The solution isn't viable unless it actually adds value to the lives of those people as they're trying to perform their jobs.
So these have traditionally been the requirements for solutions that solve these problems, and one of the solutions was the data mart. A data mart is kind of like a subject-oriented database; it's a piece of an enterprise data warehouse. The data warehouse is usually a large collection of structured data that contains data from different parts of the organization, but it all exists in one place. The subset of data held in a data mart usually aligns with a particular business unit. Like here in this example: we've got a data mart for sales, we've got one for finance, and we've got one for product. And here it's easier for the different folks in those divisions, or the analysts, to hit those different data marts and perform analysis on the data that exists in them.
There are some pitfalls to the data mart solution. One: you can only answer a predetermined set of questions. In designing the data mart, you know up front, for instance, that you're going to be building a sales data mart, and so by doing that you're excluding data, and the potential answers you could get about other types of data. A second pitfall is that there's no visibility into the data at its lowest levels, right? You're not storing the data in its raw form, so you're missing out on some of the information that's available in its raw form.
So that has led folks to come up with an additional requirement, which is that the data needs to be stored in its original format. By doing that, we get to make full use of all the information that's available. So what do we call the solution that allows us to do all of these things? A little while back, this definition was used to describe a data lake, and I'm just going to read it: if you think of a data mart as a store of bottled water, cleansed and packaged and structured for easy consumption, then the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
So here's what a data lake solution looks like. The data lake is actually a pattern that was built to, again, satisfy these requirements. On the far left, you've got your data sources, and they can be different data sources; data can exist in different formats. You've got a structured database, you've got some JSON data, you could have CSV or XML data, and that all gets put into the data lake and stored in its raw format. Then that data gets transformed and output in different ways. It could go into a data mart that you use, or you could put it into a data warehouse or a data catalog. A data catalog contains metadata about the raw data that actually exists in your data lake. Catalogs help make data more discoverable, and help organizations as they make decisions about how to actually use their data.
So a data lake is healthy when metadata exists in a data catalog, which is what I just described; when you only have relevant data in your data lake (you might have tons of different data, but if you're trying to make sense of your product, or the way customers interact with your product, you don't necessarily need stuff about your employees in there with your customer data, for example); when data governance policies and procedures govern how the data is stored and accessed, so you have procedures and policies around who can get access to the data, when data moves around within the system, that kind of thing; and when automated processes are employed to manage data flow, clean data, and enforce practices. When those things are not being done, the data in the data lake becomes, for lack of a better term, kind of swampy, right?
Sometimes the right processes are put into place at the beginning of the design of the data lake to ensure that it's well built and curated, and in some cases, over time, orgs just get careless about how they manage data in that lake, and it becomes what's known as a data swamp. It's an unkempt data lake, essentially. So let's talk about a data lake and see it through the lens of a fake company and a fake application, right? Since this is Nomad PHP, let's consider a company called Elephant Express that delivers goods. This company has a website, an iOS app, and an Android app, all powered by an API that's, let's say, built with Laravel, with Vue.js on the front end. And these are the different data sources for this application.
So we've identified our data sources. Let's say, for now, we're just going to start with the stuff that we have in Postgres, our application logs, and our customer information. We know where we want to start; now we need to figure out where we're going to store this data. There are some services we're already using on the AWS side, like, as I said, our CDN, so we're going to stay in AWS. There are other cloud providers that offer great solutions: Google's GCP has some great solutions, Azure has some great solutions, but we're just going to talk about AWS. Amazon Simple Storage Service, S3, which I referred to before, is at the heart of several services in AWS. It was, I believe, the first service launched. It's great for data lakes because of the built-in security options that are available,
and because of the lifecycle management features that help you manage the cost of storage for data that you intend to keep for long periods of time. Through S3, you can move data that's stored in traditional S3 buckets into Glacier, and it'll cost less for you to store that data, though there are some trade-offs there. So we're going to use S3 in our situation. But there is one other service that's pretty popular: Amazon Kinesis, specifically the data streams feature of that product. If you're dealing with streaming data, like clickstream data, which is something I referred to before, you want to use Kinesis Data Streams. It's a great service; it's durable, and it can capture a ton of data in a streaming fashion. But we're not going to focus on that.
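Those lifecycle transitions can be set up programmatically. Here's a minimal sketch that builds the rule structure S3's lifecycle API expects, transitioning a prefix to Glacier after a number of days; the prefix, day count, and bucket name are assumptions for illustration:

```python
def glacier_lifecycle_rule(prefix: str, days: int) -> dict:
    """Build one lifecycle rule in the shape that boto3's
    put_bucket_lifecycle_configuration call expects."""
    return {
        "ID": f"archive-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        # After `days` days, objects under the prefix move to Glacier,
        # trading retrieval speed for much cheaper storage.
        "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
    }

config = {"Rules": [glacier_lifecycle_rule("raw/", 90)]}
# Applying it would be a boto3 call along these lines:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="elephant-express-data-lake", LifecycleConfiguration=config)
print(config)
```

The trade-off mentioned in the talk shows up here: anything under `raw/` older than 90 days becomes cheap to keep but slow to retrieve.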
We're going to use S3 for our data lake here. So, boom, we're popping all this stuff into S3; we're going to push all that data in with some scripts that our engineers are going to write. And now we're going to start transforming the data. What are we going to use to do that? We've got Glue. Glue is a service offered by AWS; it's a fully managed ETL service that makes it easy for customers to prepare and load their data for analytics. In Glue, you can create jobs that move your data for you, just through using the actual interface in the management console. There is, of course, an API, and there are some CLI tools available for use, but you can easily do all that stuff through the management console. There are prebuilt crawlers that exist so that you can parse and catalog your raw data.
We'll get into the cataloging part of this in a few slides. But the predefined jobs and crawlers that exist in Glue cover lots of pretty common use cases. For instance, later on in here, we're going to want to take a JSON file and convert it to a Parquet file, because we're going to maybe consider using Redshift for a data warehouse; we'll talk about that later. But you can also write custom jobs and crawlers. So if you have specific business logic that you want to add to Glue, you can do that, so long as it's written in Python or Scala.
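Glue runs those conversions as managed Spark jobs, but the core idea of a JSON-to-Parquet conversion, pivoting row-oriented records into columns, can be illustrated in plain Python. This is only a conceptual sketch; it assumes every record has the same fields, and a real Parquet writer also handles types, nulls, and compression:

```python
import json

def rows_to_columns(json_lines: str) -> dict:
    """Pivot newline-delimited JSON records into columnar form.

    Columnar layout is what makes Parquet efficient for analytics:
    a query that only needs one field reads only that column.
    """
    records = [json.loads(line) for line in json_lines.splitlines() if line.strip()]
    columns: dict = {}
    for rec in records:
        for field, value in rec.items():
            columns.setdefault(field, []).append(value)
    return columns

raw = '{"id": 1, "item": "peanuts"}\n{"id": 2, "item": "hay"}'
print(rows_to_columns(raw))
```

This prints `{'id': [1, 2], 'item': ['peanuts', 'hay']}`: two row records reshaped into two columns, which is the reshaping a Glue job performs (at scale, with schema handling) when it writes Parquet.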
You can also use Lambda to write some ETL scripts on your own as well. Lambda allows you to run code without having to worry about managing the servers that the code runs on. You can use it, again, to write custom jobs in languages that aren't supported by Glue. Code can be triggered by events in S3; for instance, if you want to trigger a script when a file is uploaded to S3, you can do that. You can do timed event triggers, which you can set up through CloudWatch, or you could have Glue trigger a Lambda function as well.
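A minimal sketch of that S3-triggered pattern: an S3 put event arrives carrying the bucket and key of the new object, and the Lambda handler pulls those out before doing its transformation. The bucket and key values below are made up for illustration; the event shape is the standard one S3 sends to Lambda:

```python
def handler(event, context):
    """Minimal Lambda sketch: extract (bucket, key) pairs from an S3 event.

    A real ETL handler would go on to read, transform, or copy each
    object from here; we just return what was uploaded.
    """
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        objects.append((s3["bucket"]["name"], s3["object"]["key"]))
    return objects

# A trimmed-down example of the event S3 delivers on object upload.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "elephant-express-data-lake"},
                "object": {"key": "raw/logs/2019/08/15/app.log"}}}
    ]
}
print(handler(sample_event, None))
```

With the sample event, this prints `[('elephant-express-data-lake', 'raw/logs/2019/08/15/app.log')]`.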
Bref isn't an Amazon or AWS tool, but I do want to mention it, because if you've been following Lambda for a while, you know that PHP isn't one of the languages that's supported natively. But the recent addition of the layers feature in Lambda makes it possible for you to bring your own runtime to Lambda, so you can run PHP on Lambda now. Bref is a product that makes that easy, and it's something that I've been using a lot lately. I'm a big fan of it, so I just wanted to make sure other folks in the PHP community were aware of it, so that you can use it to do cool stuff on Lambda yourself as well.
So in our data lake, we're going to use a combination of Glue and Lambda, because of the different data sources that we're going to use; I'll talk about those a little bit later. And we're going to use Amazon Athena for the catalog. Athena is, out of the box, integrated with the AWS Glue Data Catalog, so when you crawl the data, it gets put into a catalog. Underneath the hood of Athena is PrestoDB. What's cool about Athena is that you can write SQL queries directly through the interface that's offered in the AWS Management Console, and it offers ODBC and JDBC connection strings so that you can connect to Athena through a tool like Power BI, for instance. So you can run all your data through Power BI instead of having to log into the AWS Management Console and do things there. It's a great tool to use. And then there's Redshift. Redshift is a pretty popular solution for data warehousing offered by AWS. It's fully managed; you can start with just a few gigabytes of data. It can get expensive, but it's a great solution to use, and under the hood there is Postgres as well.
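Besides the console, you can submit Athena queries through the API. This sketch builds the arguments that boto3's `athena.start_query_execution` call takes; the database name, query, and results bucket are illustrative assumptions:

```python
def athena_query_params(sql: str, database: str, output_s3: str) -> dict:
    """Build the keyword arguments for boto3's
    athena.start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes result files to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

params = athena_query_params(
    "SELECT item, COUNT(*) AS orders FROM deliveries GROUP BY item",
    "elephant_express",
    "s3://elephant-express-athena-results/",
)
# The actual submission would then be:
#   boto3.client("athena").start_query_execution(**params)
print(params["QueryString"])
```

The query itself is just standard SQL against whatever tables the Glue crawlers registered in the catalog.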
And then there's RDS, one of my favorite services. It makes it easy for you to set up a database in the cloud, whether MySQL, Postgres, or a bunch of other database engines. It's a really good tool; it's cost-efficient, you can set up read replicas and do all sorts of things with it, and it's great if you're just trying to build a data mart as part of your data lake. So what we've decided to do here is use Amazon Redshift: we're going to push our data into a data warehouse. We're also going to use the catalog that's offered through the integration between Glue and Athena to do some exploration of the data, so you can log right into the management console and write, again, SQL queries once your data has been crawled. And then we're going to have a data mart set up, and we're going to use RDS for that. Since we're already using Postgres, we can create our data mart in Postgres.
What we haven't talked about is security and access control. AWS offers Identity and Access Management, IAM. It's a tool that enables you to manage access to the services and resources in AWS securely. You can tie it to LDAP or Active Directory, you can manage users and groups through IAM, and you can manage keys through it. It's a great service, and it's one of the fundamental services in AWS. In our data lake, you would use IAM to manage access for the data scientists and data engineers. You'd create different roles for them to use to access the data, so that we make sure they only have access to the things that they need access to. And there's Amazon CloudWatch, a monitoring and observability service built for engineers, developers, et cetera.
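To make that "only what they need" idea concrete, here's a sketch of an IAM policy document granting read-only access to one prefix of the lake bucket. The bucket and prefix names are hypothetical; such a document would be attached to, say, an analyst role:

```python
import json

def read_only_lake_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document allowing listing and reading
    of a single prefix in the data lake bucket, nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Listing is granted on the bucket, scoped to the prefix.
            {"Effect": "Allow",
             "Action": ["s3:ListBucket"],
             "Resource": f"arn:aws:s3:::{bucket}",
             "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}}},
            # Reads are granted on the objects under the prefix.
            {"Effect": "Allow",
             "Action": ["s3:GetObject"],
             "Resource": f"arn:aws:s3:::{bucket}/{prefix}*"},
        ],
    }

policy = read_only_lake_policy("elephant-express-data-lake", "curated/")
print(json.dumps(policy, indent=2))
```

An analyst with this role can explore `curated/` data but can't touch `raw/` or write anything, which is the kind of fine-grained scoping that keeps a lake from turning into a free-for-all.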
It's another one of those fundamental tools in AWS. Like I said, you can set time-based triggers through CloudWatch to trigger your Glue jobs or your Lambda functions or whatever it is, and you can monitor the performance of the different services within your data lake through CloudWatch. And then there's CloudTrail, which is very important when you're talking about auditing the use of the different services that exist within your data lake. CloudTrail basically logs any access to any of the resources that exist in your cloud. You can tie it into different services, but it's a critical tool to use when you're trying to get a sense of who's actually accessing your resources and how they're using them.
So when you put it all together, this is what the data lake looks like. We've got our data sources on the far left; everything gets popped into S3; we're using Glue and Lambda to do some transformations on the data; and then we've got data ready to be used with Redshift, Athena, and RDS. At the bottom, a thread that runs through all of this is that in order to control access to the data that exists at different stages within the data lake, we use IAM, we use CloudWatch, we use CloudTrail. And several of these services, not all of them, but for instance Amazon S3, have features built in to handle encryption, both server-side and client-side. Through Glue and Athena, you can have pretty fine-grained controls over exactly what data is accessed and how.
So, great tools to use. In our data lake, let's say we have JSON data, and, like I said before, we may want to convert that JSON data and push it into Redshift, right? What we would do in that instance is use a Glue crawler to crawl that data and a Glue job to actually convert it into a Parquet file, and that Parquet file would be used to import the data into Redshift. You can then use a Lambda function to actually copy that data into Redshift. And that's how it would move through our lake.
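The load at the end of that flow is a Redshift `COPY` statement, which the Lambda function would execute against the cluster. Here's a sketch that builds the statement; the table name, S3 prefix, and IAM role ARN are placeholder assumptions:

```python
def redshift_copy_sql(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build the COPY statement that loads Parquet files from S3
    into a Redshift table, using an IAM role for authorization."""
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS PARQUET;"
    )

sql = redshift_copy_sql(
    "analytics.orders",
    "s3://elephant-express-data-lake/curated/orders/",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
# The Lambda would run `sql` over a connection to the cluster,
# e.g. with a Postgres driver, since Redshift speaks the Postgres protocol.
print(sql)
```

`COPY` pulls every Parquet file under the prefix in parallel, which is why the Glue job writes to a per-table prefix rather than one big file.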
So in addition to Power BI, there's QuickSight, a tool offered through AWS that allows you to do some quick visualizations with your data. As a complement to, or a replacement of, Power BI, you can use QuickSight. QuickSight is really easy to use; it's fast, it's cloud-based, and it's fully managed, so you don't have to worry about the servers or anything like that. When using QuickSight, you can easily create and publish dashboards and visualizations. There are some limitations: you can only share those visualizations with people that have accounts within QuickSight. As far as I know, as it exists today, you can't send emails to people that aren't in your organization; for instance, you can't send a visualization or report to somebody that doesn't have an account.
So it's a great solution for organizations whose folks have accounts within AWS. Now, what if you don't want to set all this stuff up on your own? Does AWS offer some kind of service that will allow you to do all of this quickly without having to worry about it yourself? They actually do: it's called AWS Lake Formation, and it was announced at the 2018 re:Invent. It is a service that does a bunch of the stuff we already talked about for you. It'll create the data lake within days, in some cases, versus the months, or maybe a year or more, that it would take for you to get all the data in, set up all the services, and get all the access control and all that stuff set up. What Lake Formation handles is everything that's in this orange box on this slide. So you've got S3 on the left, and you're going to add your different data sources there; you can use Lake Formation to set up those pieces in the middle, and then pop the data out in different formats so that you can use it in different ways.
So I've gone a bit faster than I wanted to go, but here are some key takeaways. A data lake is a centralized, secure repository that allows you to store all of your structured and unstructured data at any scale. It's the right solution when you don't know exactly what you want to do with your data, but you know you want to do something, and you don't know how much data you're going to be collecting, but you know it's going to be a lot. It's a flexible solution: you can store your data as it is, and run different types of analytics, from dashboards and visualizations to big data processing, analytics, and machine learning, to guide your decisions. Data lakes are a great workload to deploy into the cloud, because the cloud provides performance, scalability, and all these other great benefits at massive economies of scale.
So if, like the Elephant Express organization, you've got a bunch of stuff in a data center and you're thinking about moving to the cloud, building your data lake there is a great kind of first fully cloud-based project. You can move your data from your data center securely into the cloud through different services like Direct Connect; there are a bunch of different ways you can do that. But yeah, it's a great first step on your cloud journey. And AWS provides several services that allow you to easily deploy and manage data lakes. You can do it yourself by setting up all of the services yourself. There are also some CloudFormation templates you can use; CloudFormation is essentially a way for you to automate, it's an infrastructure-as-code solution offered by AWS, something like Terraform, but specific to AWS. And then, of course, there's the Lake Formation service that will handle all this stuff for you. So that's actually it.
Guillermo, great talk. I have a couple questions, and for those listening, if you have questions, please feel free to post them either in the Q&A or the chat, and we'll be happy to cover them. The first question I have for you is: big data and data science, I think you kind of alluded to this, are buzzwords, where every company feels that they need to be doing data science. And you had a great quote essentially about how without data science, you're like a deer in the highway. What would your advice be to a company on whether or not they need to do big data, or on when the appropriate time is to bring in a data scientist?
I'd say aim first to get some data analytics stuff going. I think there's a lot to be gained from just taking a look at the data that you have, really taking a closer look, and putting it in front of you in a way that makes sense, right? The actual AI and data science and machine learning stuff, that's great, but the first thing you need to do is just take a look at how you're actually doing, and data analytics gets you there. Then, once you've got that all worked out and buttoned up, you can start to work towards doing some of the machine learning stuff. Does that make sense?
Yes, that's very helpful. Another question I have is: with the tools that you've mentioned, are there any bad use cases for them, or things that people commonly try to do that they don't really work well for?
Yeah, so I can think of a few things. Number one, don't try to use Redshift as a transactional database. It's OLAP, not OLTP. It's not meant for, you know, handling the sales on your website or managing user accounts. It's for data warehousing; it's for data that's accessed infrequently. For Glue, definitely take a look at what your use case is. Like I said, if you're trying to pull in streaming data, you want to consider using Kinesis Data Streams instead. It's built for streaming data; Glue isn't built for that. I think those are the two I would caution people against the most.
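The OLTP-versus-OLAP distinction behind the Redshift advice can be illustrated in a few lines. This is a toy sketch, not how Redshift is implemented: row-oriented storage suits point lookups (transactions), while columnar storage lets an analytic aggregate scan one column contiguously.

```python
# Row-oriented layout, the shape an OLTP database stores records in.
rows = [
    {"id": 1, "name": "a", "amount": 10},
    {"id": 2, "name": "b", "amount": 20},
    {"id": 3, "name": "c", "amount": 30},
]

# Columnar layout, the shape a warehouse like Redshift favors:
# each column is stored contiguously.
columns = {
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "amount": [10, 20, 30],
}

# An OLTP-style point lookup works naturally on rows...
row = next(r for r in rows if r["id"] == 2)

# ...while an OLAP-style aggregate reads a single column directly,
# without touching the unrelated fields of every record.
total = sum(columns["amount"])
print(row["name"], total)  # b 60
```

Asking a columnar warehouse to do high-frequency single-row writes, or a row store to scan billions of values in one column, fights the storage layout either way.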
Okay. And I think you mentioned a tool called Bref. Can you share more about Bref, where it really excels, and some of the challenges that you've faced with it as well?
Yeah, so the guy who created the project, and I don't want to butcher his name, I think it's Matthieu Napoli. At the time that I was introduced to it, it leveraged what's called AWS SAM, the Serverless Application Model, and Lambda layers. Like I said, AWS recently launched this layers component that's part of Lambda now; I think you can now have up to five layers, and you can add your own runtime or shared libraries to a Lambda function. What Bref does is abstract away some of the work that you would otherwise need to do. For instance, you have to compile and package up PHP yourself, right? So if you're not interested in doing the work to package up PHP, Bref has already done that.
There are different runtimes available. There's a runtime for a full-blown application, so you can run a full app, like a Laravel application, for instance, through Lambda. If you're just doing a function, there's a packaged PHP runtime available for that. If you're doing some other stuff, there's, I can't remember what it's called, a worker, I think; there's a runtime for that too. So it does all that work for you. It also offers a CLI that makes it easy for you to push your function or your application directly to Lambda. There are some tradeoffs in running PHP in Lambda. It's not as fast as Python or Java, because PHP isn't natively supported, but I think for most use cases it's perfectly fine.
And I'm going to be using it for my personal website. I'm moving it over: I'm going to be writing some Lambda functions, sitting them behind an API Gateway, and building kind of a microservices architecture using Lambda functions written in PHP. So I'm going to be writing some blog posts about that, and I've actually written a blog post about building the CI/CD pipeline with Travis CI and Bref. That's on Medium; I can share the link with anyone who's interested.
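The Lambda-behind-API-Gateway shape described here is language-agnostic. The speaker's functions are PHP via Bref, but the same handler structure in Python looks like the sketch below; the event fields follow the API Gateway proxy integration format, and the function name and query parameter are hypothetical.

```python
import json

def handler(event, context):
    """Minimal Lambda handler for an API Gateway proxy integration.

    `event` carries the HTTP request; the returned dict becomes the
    HTTP response. Field names follow the API Gateway proxy format.
    """
    name = (event.get("queryStringParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello, {name}"}),
    }

# Local invocation (no AWS needed) to sanity-check the response shape:
response = handler({"queryStringParameters": {"name": "Guillermo"}}, None)
print(response["statusCode"], response["body"])
```

A microservices setup like the one described would map each API Gateway route to a small handler of this shape, with Bref supplying the PHP runtime layer in the speaker's case.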
That'd be awesome. It actually leads to my last question here, which is: you mentioned a lot of valuable tools and solutions outside of the Amazon documentation, and maybe Amazon doesn't have the best way to get started. What resources would you recommend for people wanting to get started with big data and with using the tools that you mentioned?
Good question. So outside of the AWS stuff, I would also advise people to take a look at GCP. They've got some really good services; BigQuery is a big one. I don't use it often, actually, I don't use it at all, but I'm familiar with it. I've heard people discuss it, and from what I understand, it does things that multiple AWS services do. So I would look at GCP. There are some articles, and I can add a slide with some resources to this deck before I share it with you all. There are some good articles on data science. There's a good article on the hierarchy of needs that I talked about, and good articles on the differences between a data mart, a data warehouse, and a data lake. I think it's important to understand those differences when you're