The idea of a data lake is that, instead of loading everything into a formal database, we just throw all the raw data (log files, CSV or JSON files, etc.) into one big data repository. That repository is what we call a data lake.
It’s a common approach for big data applications.
AWS S3 + Glue + Athena
For example, we can dump all the unstructured raw data into the AWS S3 service, which provides virtually unlimited storage capacity.
So this S3 bucket is our data lake.
When we want to query the data, we can use AWS Glue to crawl the unstructured data in S3 (the data lake) and infer a schema for it. Then we can use Amazon Athena to query the data with SQL.
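As a rough sketch of the Athena step, the snippet below builds the request that boto3's `start_query_execution` call expects. The database, table, and result-bucket names are all made up for illustration; in practice the Glue crawler creates the database and table for you.

```python
# Sketch of querying the data lake with Athena via boto3.
# All names (database "weblogs", table "access_logs", result bucket) are
# hypothetical; the Glue crawler would create the real database/table.

def athena_query_params(sql: str, database: str, output_s3: str) -> dict:
    """Build the request dict for Athena's StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

params = athena_query_params(
    "SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    database="weblogs",
    output_s3="s3://my-athena-results/",  # Athena writes result files here
)
# With AWS credentials configured, you would then run:
#   boto3.client("athena").start_query_execution(**params)
print(params["QueryExecutionContext"]["Database"])
```

Athena itself is serverless: you only point it at the Glue schema and an S3 location for results, and pay per query.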
Partitioning for query performance
For query performance, it's usually not a good idea to just dump everything into one big flat bucket.
For example, if we are going to query by date, it makes sense to partition the raw data by date, so that a query for one particular day only has to scan that day's logs.
If we store the data in a directory structure that corresponds to partitions based on our query criteria, Athena can skip the partitions a query doesn't need, which greatly reduces the amount of data scanned.
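To make this concrete, a common convention is Hive-style partition paths (`key=value` directories), which Glue crawlers and Athena recognize as partition columns. The bucket and prefix names below are made up:

```python
from datetime import date

def partitioned_key(bucket: str, prefix: str, d: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key: year=/month=/day= directories."""
    return (f"s3://{bucket}/{prefix}/"
            f"year={d.year}/month={d.month:02d}/day={d.day:02d}/{filename}")

key = partitioned_key("my-data-lake", "logs", date(2024, 3, 7), "app.log.gz")
print(key)  # s3://my-data-lake/logs/year=2024/month=03/day=07/app.log.gz
```

With this layout, a query filtered by `year`, `month`, and `day` only scans the objects under one day's directory instead of the whole bucket.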