We took a look at some of the foundations of big data systems (some of them are even outdated now), from a more academic point of view.
This is going to be a short video, since what we mostly did was try to understand the motivation and design decisions behind all these systems. I’ll put the links to all the papers we reviewed so you can take a look at them.
Google File System
This was the first distributed file system Google created to store all the information they manage; it has since been replaced by Colossus. However, the foundations remain.
From there we jumped to …
Hadoop Distributed File System (HDFS)
Which is, again, a distributed file system, in this case inspired by GFS. As I mentioned earlier, we started with the foundations, that is, version 1 of HDFS, only to see the differences with the second version (and now there is a third version out, yay!).
Once we learned a bit about HDFS, we learned about its companion, the programming model called…
MapReduce
Which is a useful technique to process vast amounts of information in a distributed way, taking advantage of having lots of relatively cheap computers. I made a whole video dedicated to MapReduce; you can check the link in the description.
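To make the idea a bit more concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps, using the classic word count example. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration; a real framework such as Hadoop would run the map and reduce functions on many machines and do the shuffle for you.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all the values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # In a real cluster each document (or split) would be mapped on a different node.
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data is big", "data is distributed"])
```

Because each call to `map_phase` only looks at its own document and each call to `reduce_phase` only looks at its own key, both steps can be spread over lots of cheap computers without them having to talk to each other.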
However, MapReduce is somewhat outdated too, and it has some limitations. We also reviewed a more modern approach to Big Data problems, using…
Spark
Which is a framework for distributed computing that lets us specify transformations over a dataset lazily, without actually running them right away. Spark is built on the concept of Resilient Distributed Datasets: read-only collections of data distributed over the nodes in a cluster. I’ll probably make a video about Spark in the future.
Both MapReduce and Spark run on top of a distributed file system and benefit from the characteristics of such systems.
After learning about these two processing approaches, we went on to learn about other ways of storing information, using distributed NoSQL data stores like…
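The lazy part is easier to see with a toy example. The sketch below is not Spark's API; `LazyDataset` is a hypothetical class written only for this illustration, and everything runs on one machine. It just shows the idea of recording transformations on a read-only dataset and only running them when a result is actually requested.

```python
class LazyDataset:
    """Toy read-only dataset that records transformations instead of running them."""

    def __init__(self, source, steps=()):
        self._source = source  # the original data, never modified
        self._steps = steps    # tuple of (kind, function) recorded so far

    def map(self, fn):
        # Returns a new dataset; nothing is computed yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        # Same as map: only the step is recorded.
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def collect(self):
        # The "action": only now is the recorded chain applied to the data.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def square(x):
    calls.append(x)  # lets us observe when the work really happens
    return x * x


pipeline = LazyDataset(range(5)).map(square).filter(lambda x: x % 2 == 0)
ran_before_collect = len(calls)  # still 0: only the plan exists so far
result = pipeline.collect()      # now square() runs on every element
```

Since every dataset keeps the list of steps that produced it, a lost piece could in principle be rebuilt by replaying those steps on the source, which is roughly where the "resilient" in Resilient Distributed Datasets comes from.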
Bigtable
Another one of those Google creations. The first line of the paper says what Bigtable actually is: a distributed storage system for managing structured data that is designed to scale to a huge size. And that’s about it. I mean, it is way more complex than I’m making it sound here, but I won’t go into details.
Again, we briefly saw an open-source version of Bigtable called HBase.
Finally, we reviewed Cassandra, another highly distributed data store, and its approach of decentralising the knowledge that the other systems had centralised in a master node. Another interesting thing is how easily it works across data centres.
As for the practical side of things, we did a couple of exercises: one using MapReduce and the other using Spark, both on a school-provided cluster. Both exercises involved calculating PageRank scores for some Wikipedia articles.
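For reference, this is roughly what the PageRank computation looks like on a single machine. It is only a sketch and not the code from the exercises: the tiny `graph`, the damping factor of 0.85, the number of iterations and the way pages without outgoing links are handled are all assumptions picked for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank over a dict mapping page -> list of outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}  # start with a uniform score

    for _ in range(iterations):
        # Every page gets a small base score, then receives shares from its in-links.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Assumed convention: a page with no links spreads its rank evenly.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


# Hypothetical three-page graph standing in for the Wikipedia link data.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In the MapReduce and Spark versions, each iteration of the outer loop becomes one distributed pass: pages emit shares to their targets, and the shares are summed per target.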
As you can probably guess, all of these systems involve a coordination hell, since all of them are distributed and hold redundant copies of data, some of them not only within a single cluster but across the entire world.
And that was it. As I said, for all of those systems we reviewed their main components, such as Master nodes or DataNodes or whatever they were called in each implementation, and the basic techniques that power their reliability, like writing to logs or creating checkpoints, along with some of the data structures that help these systems achieve great performance, like LSM-trees, SSTables and Bloom filters.
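Of those data structures, the Bloom filter is the easiest one to show in a few lines. Data stores like Bigtable use it to check whether an SSTable might contain a key before reading it from disk. The sketch below is a simplified version: the `BloomFilter` class, its default sizes and the use of SHA-256 with a seed prefix are choices made for this example, not what any of these systems actually ship.

```python
import hashlib


class BloomFilter:
    """Set-like structure that can say "definitely not here" or "maybe here"."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, to keep the code readable

    def _positions(self, key):
        # Derive several positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means the key was never added; True can be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


sstable_keys = BloomFilter()
for row_key in ("row-17", "row-42"):
    sstable_keys.add(row_key)

found = sstable_keys.might_contain("row-42")  # True: added keys are never missed
```

The trade-off is that the filter never misses a key that was added, but it can occasionally answer "maybe" for a key that was not, in which case the store just does a read it could have skipped.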