AWS Lake Formation: the new Datalake solution proposed by Amazon

AWS Lake Formation is a service that makes it easy to set up a secure data lake in days. A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. A data lake enables you to break down data silos and combine different types of analytics to gain insights and guide better business decisions. However, setting up and managing data lakes today involves a lot of manual, complicated, and time-consuming tasks....

November 29, 2018 · 2 min · 332 words · Matteo Redaelli

Analyzing huge sensor data in near realtime with Apache Spark Streaming

For this demo I downloaded and installed Apache Spark 1.5.1. Suppose you have a stream of data from several (industrial) machines like:

MACHINE,TIMESTAMP,SIGNAL1,SIGNAL2,SIGNAL3,...
1,2015-01-01 11:00:01,1.0,1.1,1.2,1.3,..
2,2015-01-01 11:00:01,2.2,2.1,2.6,2.8,.
3,2015-01-01 11:00:01,1.1,1.2,1.3,1.3,.
1,2015-01-01 11:00:02,1.0,1.1,1.2,1.4,.
1,2015-01-01 11:00:02,1.3,1.2,3.2,3.3,..
...

Below is a system, written in Python, that reads data from a stream (use the command “nc -lk 9999” to send data to the stream) and every 10 seconds collects alerts from signals: an alert means at least 4 suspicious values of a specific signal of the same machine.

from pyspark import SparkContext
from pyspark....
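The alert rule described in this excerpt can be sketched in plain Python, without Spark, to show what happens to one 10-second batch of lines. This is only an illustration and not the code from the post: the function names, the `SUSPICIOUS_THRESHOLD` value, and the idea that "suspicious" means "above a fixed threshold" are all assumptions made here.

```python
from collections import Counter

# Hypothetical: the excerpt does not say what makes a value "suspicious".
SUSPICIOUS_THRESHOLD = 3.0
# From the excerpt: at least 4 suspicious values of one signal of one machine.
MIN_SUSPICIOUS = 4


def parse_line(line):
    """Split 'MACHINE,TIMESTAMP,SIGNAL1,SIGNAL2,...' into its parts."""
    fields = line.strip().split(",")
    machine, timestamp = fields[0], fields[1]
    signals = [float(value) for value in fields[2:]]
    return machine, timestamp, signals


def alerts_for_batch(lines, threshold=SUSPICIOUS_THRESHOLD,
                     min_count=MIN_SUSPICIOUS):
    """Return sorted (machine, signal) pairs that reach min_count
    suspicious values within one batch of lines."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        machine, _, signals = parse_line(line)
        for index, value in enumerate(signals, start=1):
            if value > threshold:
                counts[(machine, "SIGNAL%d" % index)] += 1
    return sorted(key for key, n in counts.items() if n >= min_count)


if __name__ == "__main__":
    batch = [
        "1,2015-01-01 11:00:01,3.5,1.0",
        "1,2015-01-01 11:00:02,3.6,1.0",
        "1,2015-01-01 11:00:03,3.7,3.1",
        "1,2015-01-01 11:00:04,3.8,1.0",
        "2,2015-01-01 11:00:01,3.9,1.0",
    ]
    print(alerts_for_batch(batch))
```

In a streaming setup the same counting would presumably run once per batch interval, with each batch holding the lines received in the last 10 seconds.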

November 25, 2015 · 2 min · 279 words · Matteo Redaelli

Before SQL then NOSQL and BIGDATA: now BIGDATA and SQL again

The trend in recent years has been a switch from SQL (RDBMS) databases to NoSQL and big data systems such as Hadoop, MongoDB, Cassandra, Riak, … SQL is an old but easy and fast way to query data, and people STILL turn to it for querying Hadoop and big data: Apache Drill (MapR), Apache Phoenix, Apache Hive (see the Stinger initiative), Apache Spark SQL, Presto (Facebook), Apache Tajo (a data warehouse), Impala (Cloudera). Read details in “10 ways to query Hadoop with SQL” ....

March 2, 2015 · 1 min · 76 words · Matteo Redaelli

Bigdata on Twitter 2015-02

Below are the top media (and other statistics) for the word “bigdata” on Twitter in February 2015.

March 2, 2015 · 1 min · 15 words · Matteo Redaelli

Workflows in Apache Hadoop

How to orchestrate your Hadoop jobs? Possible solutions are: Apache Oozie (included in the top Hadoop distributions), Azkaban from LinkedIn, Luigi from Spotify, and Apache Airflow from Airbnb. See for instance a comparison of Luigi, Airflow and Pinball at http://bytepawn.com/luigi-airflow-pinball.html

October 3, 2014 · 1 min · 39 words · Matteo Redaelli