HBASE

딜레이라마 2017. 2. 9. 19:59

The Problem with Relational Database Systems

RDBMSes have typically played (and, for the foreseeable future at least, will play) an integral role when designing and implementing business applications. As soon as you have to retain information about your users, products, sessions, orders, and so on, you are typically going to use some storage backend providing a persistence layer for the frontend application server. This works well for a limited number of records, but with the dramatic increase of data being retained, some of the architectural implementation details of common database systems show signs of weakness.

Let us use Hush, the HBase URL Shortener discussed in detail in (to come), as an example. Assume that you are building this system so that it initially handles a few thousand users, and that your task is to do so with a reasonable budget; in other words, use free software. The typical scenario here is to use the open source LAMP stack to quickly build out a prototype for the business idea.

The relational database model normalizes the data into a user table, which is accompanied by a url, shorturl, and click table that link to the former by means of a foreign key. The tables also have indexes so that you can look up URLs by their short ID, or the users by their username. If you need to find all the shortened URLs for a particular list of customers, you could run an SQL JOIN over both tables to get a comprehensive list of URLs for each customer that contains not just the shortened URL but also the customer details you need.
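The normalized layout and the JOIN described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the column names, the sample rows, and the helper `urls_for_users` are assumptions made for this example and are not taken from the actual Hush schema.

```python
import sqlite3

# Hypothetical, simplified version of the Hush tables; all column names
# are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE user (
    id       INTEGER PRIMARY KEY,
    username TEXT NOT NULL UNIQUE,   -- UNIQUE also creates a lookup index
    email    TEXT
);
CREATE TABLE url (
    id      INTEGER PRIMARY KEY,
    address TEXT NOT NULL
);
CREATE TABLE shorturl (
    id       INTEGER PRIMARY KEY,
    short_id TEXT NOT NULL,
    user_id  INTEGER NOT NULL REFERENCES user(id),
    url_id   INTEGER NOT NULL REFERENCES url(id)
);
CREATE TABLE click (
    id          INTEGER PRIMARY KEY,
    shorturl_id INTEGER NOT NULL REFERENCES shorturl(id),
    clicked_at  TEXT NOT NULL
);
-- Index so that URLs can be looked up by their short ID.
CREATE UNIQUE INDEX idx_shorturl_short_id ON shorturl(short_id);
""")

conn.execute("INSERT INTO user (id, username, email) VALUES (1, 'alice', 'alice@example.com')")
conn.execute("INSERT INTO url (id, address) VALUES (10, 'https://hbase.apache.org/')")
conn.execute("INSERT INTO shorturl (short_id, user_id, url_id) VALUES ('a1b2', 1, 10)")


def urls_for_users(conn, usernames):
    """Return (username, email, short_id, address) rows for the given users."""
    placeholders = ",".join("?" for _ in usernames)
    query = f"""
        SELECT u.username, u.email, s.short_id, l.address
        FROM user AS u
        JOIN shorturl AS s ON s.user_id = u.id
        JOIN url AS l ON l.id = s.url_id
        WHERE u.username IN ({placeholders})
        ORDER BY u.username, s.short_id
    """
    return conn.execute(query, list(usernames)).fetchall()


rows = urls_for_users(conn, ["alice"])
```

Each returned row combines the customer details with the shortened URL, which is the kind of result the JOIN is meant to deliver.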

In addition, you are making use of built-in features of the database: for example, stored procedures, which allow you to consistently update data from multiple clients while the database system guarantees that there is always coherent data stored in the various tables. Transactions make it possible to update multiple tables in an atomic fashion so that either all modifications are visible or none are visible. The RDBMS gives you the so-called ACID [10] properties, which means your data is strongly consistent (we will address this in greater detail in “Consistency Models”). Referential integrity takes care of enforcing relationships between various table schemas, and you get a domain-specific language, namely SQL, that lets you form complex queries over everything. Finally, you do not have to deal with how data is actually stored, but only with higher-level concepts such as table schemas, which define a fixed layout your application code can reference.

This usually works very well and will serve its purpose for quite some time. If you are lucky, you may be the next hot topic on the Internet, with more and more users joining your site every day. As your user numbers grow, you start to experience an increasing amount of pressure on your shared database server. Adding more application servers is relatively easy, as they share their state only with the central database. Your CPU and I/O load goes up and you start to wonder how long you can sustain this growth rate.

[10] Short for Atomicity, Consistency, Isolation, and Durability. See “ACID” on Wikipedia.
[11] Memcached is an in-memory, nonpersistent, nondistributed key/value store. See the Memcached project home page.
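The all-or-nothing behavior of a transaction that touches several tables can be illustrated with a small sketch, again using sqlite3. The two tables, the `clicks` counter column, and the function `record_click` are hypothetical and exist only for this example; SQLite has no stored procedures, so only the transaction aspect is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shorturl (
    short_id TEXT PRIMARY KEY,
    clicks   INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE click (
    id       INTEGER PRIMARY KEY,
    short_id TEXT NOT NULL REFERENCES shorturl(short_id),
    country  TEXT NOT NULL
);
INSERT INTO shorturl (short_id) VALUES ('a1b2');
""")
conn.commit()


def record_click(conn, short_id, country):
    """Update two tables atomically: bump the counter and log the click."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE shorturl SET clicks = clicks + 1 WHERE short_id = ?",
            (short_id,),
        )
        conn.execute(
            "INSERT INTO click (short_id, country) VALUES (?, ?)",
            (short_id, country),
        )


record_click(conn, "a1b2", "DE")          # both statements succeed

try:
    record_click(conn, "a1b2", None)      # INSERT violates NOT NULL on country
except sqlite3.IntegrityError:
    pass                                  # the preceding UPDATE was rolled back too

clicks = conn.execute(
    "SELECT clicks FROM shorturl WHERE short_id = 'a1b2'"
).fetchone()[0]
click_rows = conn.execute("SELECT COUNT(*) FROM click").fetchone()[0]
```

After the failed second call, the counter and the click log still agree with each other, because the partial update never became visible.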

The first step to ease the pressure is to add slave database servers that are read from in parallel. You still have a single master, but it now only takes writes, and those are much fewer than the many reads your website users generate. But what if that starts to fail as well, or slows down as your user count steadily increases?

A common next step is to add a cache, for example Memcached [11]. Now you can offload the reads to a very fast, in-memory system; however, you are losing consistency guarantees, as you will have to invalidate the cache on modifications of the original value in the database, and you have to do this fast enough to keep the time where the cache and the database views are inconsistent to a minimum.

While this may help you with the number of reads, you have not yet addressed the writes. Once the master database server is hit too hard with writes, you may replace it with a beefed-up server (scaling up vertically), which simply has more cores, more memory, and faster disks, and costs a lot more money than the initial one. Also note that if you already opted for the master/slave setup mentioned earlier, you need to make the slaves as powerful as the master, or the imbalance may mean the slaves fail to keep up with the master’s update rate. This is going to double or triple the cost, if not more.

With more site popularity, you are asked to add more features to your application, which translates into more queries to your database. The SQL JOINs you were happy to run in the past are suddenly slowing down and are simply not performing well enough at scale. You will have to denormalize your schemas. If things get even worse, you will also have to cease your use of stored procedures, as they are also simply becoming too slow to complete. Essentially, you reduce the database to just storing your data in a way that is optimized for your access patterns.

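The consistency trade-off introduced by the caching step can be sketched as follows. Plain dictionaries stand in for both the database and Memcached, so none of this is a real Memcached client API; the names `read_address` and `write_address` and the `invalidate` flag are assumptions made for the illustration.

```python
# Stand-ins: a dict for the master database and a dict for the cache.
database = {"a1b2": "https://example.com/old"}
cache = {}


def read_address(short_id):
    """Read through the cache, falling back to the database on a miss."""
    if short_id in cache:
        return cache[short_id]
    address = database.get(short_id)
    if address is not None:
        cache[short_id] = address  # populate the cache for later reads
    return address


def write_address(short_id, address, invalidate=True):
    """Write to the database and, unless told otherwise, invalidate the cache."""
    database[short_id] = address
    if invalidate:
        cache.pop(short_id, None)


first = read_address("a1b2")  # cache miss, loaded from the database

# A write that skips invalidation leaves the cache serving the old value.
write_address("a1b2", "https://example.com/new", invalidate=False)
stale = read_address("a1b2")

# A write followed by invalidation makes the next read go back to the database.
write_address("a1b2", "https://example.com/newer")
fresh = read_address("a1b2")
```

Here `stale` still holds the old address although the database has changed, which is the window of inconsistency the text warns about; after the invalidating write, `fresh` reflects the current database value.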
Your load continues to increase as more and more users join your site, so another logical step is to prematerialize the most costly queries from time to time so that you can serve the data to your customers faster. Finally, you start dropping secondary indexes as their maintenance becomes too much of a burden and slows down the database too much. You end up with queries that can only use the primary key and nothing else.

Where do you go from here? What if your load is expected to increase by another order of magnitude or more over the next few months? You could start sharding (see the sidebar titled “Sharding”) your data across many databases, but this turns into an operational nightmare, is very costly, and still does not give you a truly fitting solution. You essentially make do with the RDBMS for lack of an alternative.
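One possible way to picture sharding is to route each row to a database by hashing its key. The sketch below assumes simple hash-modulo placement, which is only one of several schemes and is not described in the text above; `shard_for` and the generated usernames are hypothetical. It also hints at one reason resharding can become painful under this assumption: changing the number of shards moves most keys.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash and modulo placement."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


usernames = [f"user{i}" for i in range(10_000)]

# Placement with four databases, then again after adding a fifth.
before = {name: shard_for(name, 4) for name in usernames}
after = {name: shard_for(name, 5) for name in usernames}

# Fraction of users whose rows would have to be copied to another database.
moved = sum(1 for name in usernames if before[name] != after[name])
moved_fraction = moved / len(usernames)
```

With modulo placement, a key stays put only if its hash gives the same remainder for both shard counts, so going from four to five shards relocates the large majority of users. Every such move means copying data between live databases, which is part of the operational cost.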