Scheduling AWS EMR clusters resize

Below is a sample of how to schedule an Amazon Elastic MapReduce (EMR) cluster resize. It is useful if you have a cluster that is less used at night or on weekends. I used a Lambda function triggered by a CloudWatch rule. Here is my Python Lambda function:

    import boto3, json

    MIN=1
    MAX=10

    def lambda_handler(event, context):
        region = event["region"]
        ClusterId = event["ClusterId"]
        InstanceGroupId = event["InstanceGroupId"]
        InstanceCount = int(event['InstanceCount'])
        if InstanceCount >= MIN and InstanceCount <= MAX:
            client = boto3....
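The excerpt above is cut off right after `boto3`, so the rest of the author's function is not shown here. As a minimal sketch of how such a handler could be completed, the version below assumes the resize is done with the EMR client's `modify_instance_groups` call. The helper `build_resize_request`, the optional `emr_client` parameter, and the renamed constants are hypothetical additions for illustration, not part of the original post.

```python
import json

# Bounds on the allowed instance count (same values as MIN/MAX in the excerpt).
MIN_INSTANCES = 1
MAX_INSTANCES = 10


def build_resize_request(event):
    """Validate the event and build the arguments for the resize call.

    Hypothetical helper: it keeps the validation testable without AWS access.
    """
    count = int(event["InstanceCount"])
    if not MIN_INSTANCES <= count <= MAX_INSTANCES:
        raise ValueError(
            f"InstanceCount {count} is outside [{MIN_INSTANCES}, {MAX_INSTANCES}]"
        )
    return {
        "ClusterId": event["ClusterId"],
        "InstanceGroups": [
            {
                "InstanceGroupId": event["InstanceGroupId"],
                "InstanceCount": count,
            }
        ],
    }


def lambda_handler(event, context, emr_client=None):
    """Resize one EMR instance group to the count given in the event.

    `emr_client` is a hypothetical optional parameter; Lambda itself only
    passes `event` and `context`, so a real boto3 client is created then.
    """
    request = build_resize_request(event)
    if emr_client is None:
        # Imported lazily so the module can be loaded without boto3 installed.
        import boto3

        emr_client = boto3.client("emr", region_name=event["region"])
    # Assumption: the resize is performed with modify_instance_groups.
    emr_client.modify_instance_groups(**request)
    return {"statusCode": 200, "body": json.dumps(request)}
```

The CloudWatch rule would pass a constant JSON input with the four keys the handler reads, for example `{"region": "eu-west-1", "ClusterId": "j-EXAMPLE", "InstanceGroupId": "ig-EXAMPLE", "InstanceCount": "2"}` (all values here are placeholders). Two rules with different `InstanceCount` values, one for the evening and one for the morning, would then shrink and grow the cluster on a schedule.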

July 22, 2019 · 1 min · 136 words · Matteo Redaelli

Any faster alternative to #Hadoop HDFS?

I’d like to have an alternative to Hadoop HDFS: a faster, non-Java filesystem. Candidates: S3 (S3 Support in Apache Hadoop), if your servers are hosted at Amazon AWS; Ceph (using Hadoop with Ceph); GlusterFS (managing Hadoop-compatible storage); Lustre (running Hadoop with Lustre); OpenStack Swift (Hadoop OpenStack Support: Swift Object Store); XtreemFS (there is a Hadoop client). Which is better? Any suggestions? References: [1] https://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems

November 17, 2016 · 1 min · 66 words · Matteo Redaelli

A comparison of Hadoop distributions

Read the full article at http://www.hadoop360.com/blog/hadoop-whose-to-choose

May 21, 2015 · 1 min · 6 words · Matteo Redaelli

A case study of adopting Bigdata technologies in your company

Bigdata projects can be very expensive and can easily fail: I suggest starting with a small, useful but not critical project, ideally one about unstructured data collection and batch processing. That gives you time to get practice with the new technologies, and the Apache Hadoop system can afford some downtime because nothing critical depends on it. At home I have the following system running on a small Raspberry Pi: for sure it is not fast ;-) At work I introduced Hadoop just a few months ago for collecting web data and generating daily reports....

March 13, 2015 · 1 min · 93 words · Matteo Redaelli

Workflows in Apache Hadoop

How do you orchestrate your Hadoop jobs? Possible solutions are: Apache Oozie, included in the top Hadoop distributions; Azkaban from LinkedIn; Luigi from Spotify; Apache Airflow from Airbnb. See for instance a comparison of Luigi, Airflow and Pinball at http://bytepawn.com/luigi-airflow-pinball.html

October 3, 2014 · 1 min · 39 words · Matteo Redaelli