A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualisation, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video). A data lake can be established "on premises" (within an organisation's data centres) or "in the cloud" (using cloud services from vendors such as Amazon, Google and Microsoft).
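As a rough illustration of "stored in its natural/raw format", the sketch below copies files into a lake folder byte for byte and labels each one with the broad categories listed above. It is a minimal sketch under stated assumptions, not a reference implementation: a local directory stands in for the object store, and the names `CATEGORIES`, `categorise`, `land_raw`, `demo`, the `raw/<source_system>/` layout and the `web_shop` source are all hypothetical choices made for this example.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical mapping from file extension to the broad categories named above.
# Real lakes usually record richer metadata; the extension is only a stand-in.
CATEGORIES = {
    ".csv": "semi-structured",
    ".log": "semi-structured",
    ".xml": "semi-structured",
    ".json": "semi-structured",
    ".eml": "unstructured",
    ".docx": "unstructured",
    ".pdf": "unstructured",
    ".png": "binary",
    ".mp3": "binary",
    ".mp4": "binary",
}


def categorise(path: Path) -> str:
    """Return the broad data category for a file, judged only by its extension."""
    return CATEGORIES.get(path.suffix.lower(), "unknown")


def land_raw(source: Path, lake_root: Path, source_system: str) -> Path:
    """Copy a file into the lake unchanged, under raw/<source_system>/ (assumed layout)."""
    target_dir = lake_root / "raw" / source_system
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)  # no parsing, cleaning or transformation
    return target


def demo() -> dict:
    """Land three sample files in a temporary lake and report their categories."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        inbox = tmp_path / "inbox"
        inbox.mkdir()
        (inbox / "orders.csv").write_text("id,amount\n1,9.99\n")
        (inbox / "events.json").write_text('{"event": "click"}\n')
        (inbox / "logo.png").write_bytes(b"\x89PNG\r\n")

        lake = tmp_path / "lake"
        result = {}
        for source in sorted(inbox.iterdir()):
            landed = land_raw(source, lake, "web_shop")
            # The landed copy is identical to the source: the data stays raw.
            assert landed.read_bytes() == source.read_bytes()
            result[landed.name] = categorise(landed)
        return result


if __name__ == "__main__":
    print(demo())
```

The point of the design is that nothing is interpreted at landing time: any structure is applied later, when the data is read for reporting or analytics. In a cloud deployment the copy step would typically be an upload to the vendor's object storage rather than a local file copy.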
You may have heard the term 'big data', and it is big. Organisations that develop new consumer value offerings based on data tend to outperform their competitors by around 9% in organic revenue growth, according to an Aberdeen survey. The range of data that can be fed into a data lake is vast. It can come from web analytics, consumer insights, trend reports, financial reports, social media analytics, and in vitro, in vivo and clinical data, among other sources. All of this remains in its raw format, unedited and unsorted, and so provides an open resource from which to pull value: to develop new innovations and service or product offerings, or to steer business strategy.