Big Data and Machine Learning
Google Cloud Big Data Services
- They are serverless services, managed by Google
- Services:- Cloud Dataproc (managed Hadoop):- Hadoop is an open-source framework for big data
- It is based on the map-reduce programming model
- Dataproc is a fast and easy managed way to run Hadoop, Spark, Hive and Pig on GCP
- A cluster can be created in 90 seconds or less
- We can scale the cluster up and down on the fly, even while jobs are running
- We pay only for hardware resources used during the life of the cluster (billing is per second)
- We can save money by using preemptible compute instances (up to 80% cheaper)
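
The map-reduce model mentioned above can be sketched in plain Python. This is only a single-process illustration of the map, shuffle and reduce steps using a word count; it does not use Hadoop, Spark or any Dataproc API, and the function names are made up for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run the three steps in order over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(mapped))


counts = word_count(["big data", "Big clusters"])
```

On a real cluster the map and reduce steps run in parallel on many workers and the shuffle moves data between them; here everything runs in one process so the data flow is easy to follow.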
 
- Cloud Dataflow:- It is a unified programming model and managed service
- It can be used for ETL, batch processing, stream processing
- We use Dataflow to create data pipelines for both batch and stream processing
- Orchestration: we can create pipelines that coordinate services, including external services
- Dataflow integrates with Cloud Storage, Pub/Sub, BigQuery and Bigtable
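
To illustrate what a "unified" model for batch and streaming means, here is a minimal sketch in plain Python. It is not the Dataflow or Apache Beam API; the record format (`sensor,value`), the filter rule and all function names are assumptions chosen for the example. The point is that one pipeline definition is applied unchanged to a finite batch and to an endless source.

```python
from itertools import count, islice


def parse(record):
    """Extract step: split a hypothetical 'sensor,value' line into fields."""
    sensor, value = record.split(",")
    return sensor.strip(), float(value)


def pipeline(records):
    """Transform step: parse lazily and keep only positive readings.

    Works on any iterable, so the same code serves a finite batch
    and an endless stream.
    """
    for record in records:
        sensor, value = parse(record)
        if value > 0:
            yield sensor, value


def endless_source():
    """Hypothetical unbounded source that never stops producing records."""
    for i in count():
        yield f"sensor-{i % 2}, {i - 1}"


# Batch: a bounded list is processed to completion.
batch_result = list(pipeline(["a, 1.5", "b, -2", "c, 3"]))

# Stream: the same pipeline over an unbounded source; we take the first 3 results.
stream_result = list(islice(pipeline(endless_source()), 3))
```

A managed service adds what this sketch leaves out, such as distributing the work, windowing unbounded data and writing results to a sink.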
 
- BigQuery:- Ad-hoc SQL queries on massive datasets
- Provides near real-time interactive analysis of massive datasets
- It is a fully managed, low cost, petabyte scale data warehouse
- No cluster maintenance is required
- It lets us specify the region where the data will be kept
- Compute and storage are separated, with a terabit network in between
- We only pay for storage and processing used
- Provides an automatic discount for long-term data storage (after 90 days)
 
- Cloud Pub/Sub:- It is meant to serve as a simple, reliable, scalable foundation for stream analytics
- Pub: publisher, sub: subscriber
- Receiving messages does not have to be synchronous
- It is great for decoupling systems
- It is designed to provide at-least-once delivery at low latency (some messages might be delivered more than once)
- It is recommended for systems where data arrives at high and unpredictable rates (for example, IoT)
- We can configure subscribers to receive messages on a push or pull basis
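
Because delivery is at least once, a subscriber should tolerate duplicates. The sketch below shows one common way to do that: remember the IDs of messages already handled and skip repeats. It is plain Python and does not call the Pub/Sub client library; the class name, the message IDs and the in-memory set are assumptions made for illustration.

```python
class IdempotentSubscriber:
    """Processes each message ID once, even if it is delivered repeatedly."""

    def __init__(self, handler):
        self._handler = handler
        # In-memory only; a real system would need durable, shared storage.
        self._seen_ids = set()

    def receive(self, message_id, payload):
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._seen_ids:
            return False
        self._handler(payload)
        self._seen_ids.add(message_id)
        return True


processed = []
subscriber = IdempotentSubscriber(processed.append)

# "m1" is delivered twice, as at-least-once delivery allows.
deliveries = [("m1", "temp=21"), ("m2", "temp=22"), ("m1", "temp=21")]
results = [subscriber.receive(mid, body) for mid, body in deliveries]
```

The ID is recorded only after the handler succeeds, so a message whose handler raises an exception can be processed again when it is redelivered.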
 
- Cloud Datalab:- Built on project Jupyter, it provides managed lab notebooks
- It lets us create web based notebooks containing Python code
- We only pay for the resources used for the compute
- Data can be visualized with Google Charts or Matplotlib
 
 
- Google Machine learning solutions as a managed service
- TensorFlow:- Open-source tool to build and run neural network models
- It has wide platform support
 
- We can run TensorFlow on the GCP ML Platform
- TensorFlow can take advantage of Tensor Processing Units (TPUs) provided by GCP
- Google Cloud Machine Learning Engine:- Fully managed machine learning service
- Optimized for Google infrastructure, integrates with BigQuery and Cloud Storage
 
- Cloud Vision API:- Managed platform used to understand the content of an image
- Can quickly classify images
- Can provide sentiment analysis and text extraction
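
As a rough idea of how classification and text extraction are requested, the sketch below builds a JSON body in the shape used by the Vision API's REST method `images:annotate`, as far as I understand the v1 API. It only constructs the request and does not send it; the helper name, the placeholder image bytes and the `max_results` default are assumptions for this example, and authentication is left out entirely.

```python
import base64
import json


def build_annotate_request(image_bytes, feature_types, max_results=5):
    """Build the JSON body for one image and the requested feature types."""
    return {
        "requests": [
            {
                # Image bytes are sent base64-encoded inside the JSON body.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [
                    {"type": feature, "maxResults": max_results}
                    for feature in feature_types
                ],
            }
        ]
    }


# Placeholder bytes stand in for a real image file's contents.
body = build_annotate_request(
    b"fake image bytes", ["LABEL_DETECTION", "TEXT_DETECTION"]
)
payload = json.dumps(body)
```

`LABEL_DETECTION` corresponds to classifying the image and `TEXT_DETECTION` to extracting text; several features can be requested for the same image in a single call.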
 
- Cloud Speech API:- Convert audio to text
- We can transcribe audio files to text
 
- Cloud Natural Language API:- Uses machine learning models to reveal structure and meaning from text
- Can extract information about items mentioned in text documents, news articles and blog posts
 
- Cloud Translation API:- Translate arbitrary text into another language
 
- Cloud Video Intelligence API:- Can be used to annotate contents of videos
- Can detect scene changes
- Can be used to flag inappropriate content
- Supports a variety of video formats