A quick look at Data Lake

The idea of a data lake is that, instead of loading everything into a formal database, we just throw all the raw data (log files, CSV or JSON files, etc.) into one big data repository. That repository is what we call a data lake.

It’s a common approach for big data applications.

AWS S3 + Glue + Athena

In this setup, an S3 bucket is our data lake: all the raw files are stored there.

When we want to query the data, we can use AWS Glue to crawl the raw data in S3 (the data lake) and infer a schema for it. Then we can use Amazon Athena to query the data with SQL.
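As a rough illustration of the Athena step, here is a minimal sketch of submitting a SQL query through a boto3-style Athena client. The database name, table name, and results location are hypothetical placeholders, and the sketch assumes a Glue crawler has already created the table.

```python
def start_query(athena, sql, database, output_location):
    """Submit a SQL query to Athena and return its execution id.

    `athena` is expected to behave like boto3.client("athena").
    """
    response = athena.start_query_execution(
        QueryString=sql,
        # Glue Data Catalog database that holds the crawled table.
        QueryExecutionContext={"Database": database},
        # Athena writes the query results to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Hypothetical names, for illustration only.
EXAMPLE_SQL = "SELECT * FROM logs LIMIT 10"
EXAMPLE_DATABASE = "datalake_db"
EXAMPLE_OUTPUT = "s3://my-athena-results/queries/"
```

With boto3 installed and AWS credentials configured, `boto3.client("athena")` could be passed as the `athena` argument. The query runs asynchronously, so the returned id would then be used to check its status and fetch the results.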

Partition for query performance

For example, if we are going to query by date, it makes sense to partition the raw data by date, so that a query for one particular day only has to read the logs of that day.

So if we store the data in a directory structure that corresponds to partitions based on our query criteria, Athena can skip the partitions a query does not need. That reduces the amount of data scanned and speeds up the query.
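To make the directory structure concrete, here is a small sketch that builds date-partitioned S3 keys. It assumes the Hive-style `key=value` folder naming, which Glue crawlers can recognize as partition columns. The `logs` prefix and the file name are hypothetical.

```python
from datetime import date


def day_prefix(prefix: str, day: date) -> str:
    """Return the S3 key prefix that holds all objects for one day."""
    return (
        f"{prefix}/year={day.year:04d}"
        f"/month={day.month:02d}"
        f"/day={day.day:02d}/"
    )


def partitioned_key(prefix: str, day: date, filename: str) -> str:
    """Return the full S3 key for a file stored under its date partition."""
    return day_prefix(prefix, day) + filename


# Hypothetical example: a log file written on 15 January 2024.
key = partitioned_key("logs", date(2024, 1, 15), "app.json")
print(key)  # logs/year=2024/month=01/day=15/app.json
```

With this layout, a query that filters on `year`, `month`, and `day` only needs to read the objects under the matching prefix.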
