
Architecture Modernization with Cloudera


...

- Slow transfer from sources

Physical transformation is required (schema on write)

Cleaning, normalization required

Mandated RDBMS table targets

Metadata limited to system tables

Presentation layer vendor-mandated -> single-focus RDBMS, SQL only

Off limits except to ETL staff

• “we aren’t ready”

• “the data must be cleaned”

• “data governance trumps”

• “end users not trusted”

• Traditional IT control

New Backroom

Purpose built for high transfer rates

Physical transformation optional

Cleaning, normalization discouraged

Table targets optional or deferred

Extensible metadata via HCatalog

Presentation layer open-ended (a minimal schema-on-read sketch follows this list)

- Before or after any transformations

- Analytics client specific

- Multiple simultaneous personalities

Doors open to

• Qualified analytic users

• Automated processes

• Experiments, model building

• Clients other than SQL

• Open data marketplace
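
To make the schema-on-read contrast concrete, here is a minimal sketch, assuming PySpark and a hypothetical HDFS landing path: raw files are stored untouched at ingest, and a schema is applied only when an analyst queries them.

# Minimal schema-on-read sketch; path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Raw CSV files were landed as-is, with no physical transformation on ingest.
# The schema is declared only now, at query time.
schema = StructType([
    StructField("customer_id", StringType()),
    StructField("event_time", StringType()),
    StructField("amount", DoubleType()),
])

events = (spark.read
          .option("header", "true")
          .schema(schema)
          .csv("hdfs:///data/landing/crm/events/"))   # hypothetical landing directory

events.createOrReplaceTempView("events")
spark.sql(
    "SELECT customer_id, SUM(amount) AS total_amount "
    "FROM events GROUP BY customer_id"
).show()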

Modernizing architecture: Traditional BI

We see here the traditional data warehouse architecture:

several sources load into an EDW, with BI tools on top of it.

What we see evolving are three main cases.

1. Archiving: data is offloaded to the Enterprise Data Hub, e.g. keep only 3 months on the EDW, while on the EDH you can report on years of data (RelayHealth, for example, keeps electronic health records for 7 years, and analysts can access this historic data).

2. Do all the ETL on the EDH. Use the EDW for high-performance analytics, and use the freed space to grow instead of upgrading the DW machine every 3 years (SFR in France).

3. Get multi-structured data into the EDH, and send only aggregations to the EDW. Ingest and analyze all the click data on the EDH, and send only KPIs such as number of users and conversion rates to the EDW (sketched below).
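
A minimal sketch of case 3, assuming PySpark; the paths, column names, KPI table and EDW JDBC endpoint are hypothetical. Full click detail stays on the EDH, and only a small aggregated KPI table is exported to the warehouse.

# Keep raw clickstream on the EDH, ship only aggregated KPIs to the EDW.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clickstream-kpis").getOrCreate()

# Full click-level detail stays on the EDH (hypothetical path).
clicks = spark.read.parquet("hdfs:///data/integrated/clickstream/")

daily_kpis = (clicks
              .groupBy(F.to_date("event_time").alias("day"))
              .agg(F.countDistinct("user_id").alias("unique_users"),
                   F.avg(F.col("converted").cast("double")).alias("conversion_rate")))

# Only this small, aggregated result is exported to the data warehouse.
(daily_kpis.write
 .format("jdbc")
 .option("url", "jdbc:postgresql://edw-host:5432/edw")   # hypothetical EDW endpoint
 .option("dbtable", "kpi.daily_clickstream")
 .option("user", "etl_user")
 .mode("append")
 .save())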

Modernizing architecture: Big data analytics

Here we see two main patterns:

- Data exploration: being able to search across all data

An interesting case is the Durkheim Project, funded by DARPA (Defense Advanced Research Projects Agency). The US faces a high number of suicides among its military veterans. They identified critical correlations between veterans' communications (mails, tweets, call-center contacts) and suicide risk. Now they can better predict when help is needed.

- Scalable machine learning, test and production

What is interesting is that you do your machine-learning testing (the LAB) and production (the FACTORY) on the same environment with Spark machine learning, instead of creating a test on a desktop and then recoding everything again for scale (a minimal sketch follows).
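
A minimal sketch of the lab-and-factory idea with Spark MLlib; column names and paths are hypothetical. The point is that the same pipeline definition is trained and evaluated at full scale and then saved as the production artifact, so nothing has to be recoded.

# Same Spark MLlib pipeline for experimentation and production (hypothetical data).
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-lab-and-factory").getOrCreate()

training = spark.read.parquet("hdfs:///data/discovery/churn_training/")  # hypothetical

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["tenure", "monthly_spend", "support_calls"],
                    outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="churned"),
])

model = pipeline.fit(training)                            # "lab": train at full scale
model.write().overwrite().save("hdfs:///models/churn/")   # "factory": promote the same artifact

In production the saved model is simply loaded again with PipelineModel.load("hdfs:///models/churn/") and applied to new data on the same cluster.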

Modernizing architecture: Fast Data Analytics

To absorb streaming data and analyze and react in real time, you need a different architecture (e.g. OTTO can predict with 90% certainty whether a customer is going to abandon their basket and can react within milliseconds with an action or offer).
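
A minimal sketch of such a streaming pipeline, assuming Spark Structured Streaming and a hypothetical Kafka topic. This is not OTTO's actual system; the abandonment model is replaced by a simple idle-time rule for illustration.

# React to basket events in near real time (hypothetical topic and broker).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("basket-stream").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "basket-events")
          .load()
          .selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", "session_id STRING, idle_seconds INT").alias("e"))
          .select("e.*"))

# Stand-in for the abandonment model: flag sessions idle for more than 5 minutes
# so a downstream process can react with an action or offer.
at_risk = events.filter(F.col("idle_seconds") > 300)

query = (at_risk.writeStream
         .format("console")        # in practice this would feed an offer/action system
         .outputMode("append")
         .start())
query.awaitTermination()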

EDH Logical Information Architecture (1/2)

Data transforms from raw to discovery to integrated zones (conformed dimensions)

Landing Zone / Staging Layer

- Data loaded from source systems

- Separate directories for each source system

- As close as possible to original format and structure

Discovery Zone / Enriched Layer

- Still separate directories for each source system

- Data sets “enriched” e.g. by joining reference data

- Available for Discovery and Exploration by Analysts

Integrated Zone / Atomic Layer

- Data from multiple sources joined together

- One (or a small number of) atomic data model(s)

- Available for analysis but not optimized for speed

Optimized Zone / Mart Layer

- Data organized to provide optimized performance

- Typically organized by use case, not by source

- Denormalized; uses optimized formats, e.g. Parquet (see the sketch below)
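
A minimal sketch of promoting data from the integrated zone to the optimized (mart) zone, assuming PySpark; the directory layout and column names are hypothetical illustrations of the zones described above.

# Build a denormalized, Parquet-formatted mart table from the integrated zone.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mart-build").getOrCreate()

orders    = spark.read.parquet("hdfs:///data/integrated/orders/")     # atomic layer
customers = spark.read.parquet("hdfs:///data/integrated/customers/")  # atomic layer

# Denormalize: join the atomic tables into one use-case-oriented mart table.
order_mart = orders.join(customers, "customer_id")

# Store in a columnar, query-optimized format, partitioned for the use case.
(order_mart.write
 .mode("overwrite")
 .partitionBy("order_date")
 .parquet("hdfs:///data/optimized/sales_mart/orders/"))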

EDH Logical Information Architecture (2/2)

I want to end with two slides on a logical information structure.

We see here a logical information structure:

From left to right, different zones (Landing Zone, Discovery Zone, Integrated Zone and Optimized Zone)

Horizontally, a Managed layer and a User layer

At the bottom we see the process from raw to trusted and the possible different steps (ingest, validation, enrichment, transformation and routing)

Data pipelines

IT is responsible for the management of ingestion, curation, quality and so on.

The User zone is meant for self-service: users can enrich and combine data there. The managed zones carry a "no entry" sign for users; those should only be changed through managed, quality-controlled processes.

...
