Monday, August 5, 2013

Don't let Technophobia kill innovation

What? Me? Technophobic? I have the latest iPhone, my office is jam-packed with USB gadgets and my car is a Prius. How much more techno-friendly can one get?

That is all fine, but those are technologies we play with just for fun, or natural, cool and useful evolutions that come to most of us easily. Can you honestly say (I can't) that you always look at the promise of a new technology, at the advantages it will offer once it has developed into something useful, and never judge it only by its first, shaky implementation?

When I was in my early teens (which occurred around the time just after the Mayflower had arrived in New England), my family moved into our first house of our own. My parents were running a restaurant at the time (they ran one or another all through my childhood) and my mother had seen most of the weirdo Heath Robinson-designed (TM) commercial and domestic kitchen appliances. When we first entered our new home and mum looked in the kitchen and realized there was a dishwasher in there, her first reaction was "Well, I'm never going to use that one". One month later, the dishwasher was running more or less daily, and my mum never did the dishes by hand again.

Many years later, I, her only son, having spent the better part of my life playing with SQL-based relational databases (and looking at some of the code in them, I suspect that Heath Robinson is still around, now as a software engineer), started to look at NoSQL databases, and my reaction was largely that of my mum's when she saw the dishwasher: "Nah, I'm not going to use anything like that. Eventual consistency? What kind of silly idea is that?"

Yes, I was wrong, but I am still convinced that NoSQL databases (yes, I know NoSQL is a bad term, but this is Monday morning and I don't have enough energy to think up something better) will not replace SQL-based systems. What I do think is that we need both.

Come to think of it, my mum got it wrong twice: yes, the dishwasher really is a good idea, but some things are better handled without it. The result is an abundant lack of sharp knives in my mum's house (a dishwasher is a really effective knife-unsharpener). Myself, I use a dishwasher, but knives and beer glasses are still, to this day, washed by hand by yours truly (beer glasses because I don't want any leftover enzymes in my beer, as they are there to kill bacteria, including the really tasty bacteria that give beer its distinctive taste).

Too many words have so far been used to say this: the world needs both SQL and NoSQL databases, working together, serving different purposes and applications. As for Eventual Consistency, I still think the term is bogus; just say what it is, no consistency, and live with it. MongoDB, Cassandra and LevelDB are still very useful technologies, as is MySQL. In many cases you need ACID properties, atomic transactions and all that, but in many other cases this is gross overkill.

Look at something like virtualization. In that case, I think I got it right: I looked at the potential of the new features it brought, and without ignoring the issues with the first implementations (slow I/O, slow networking, complexity of use, complexity of installation, etc.), I thought less about those and more about what it could do in terms of cost reduction, effective systems management and so on.

Back then, when I was a big virtualization supporter, many opposed me with the obvious issue for databases (which is the field where I work, if that wasn't already obvious): that I/O was slow and unreliable. Yes it was, but that could be fixed. It was not a flaw in the technology per se, but in the specific implementation and the limitations of the underlying technology at the time. Not everyone needs the highest of high performance; many can do with less, and some can easily scale out to more machines. All in all, many can benefit from virtualization, maybe more than you think. These days, I think no one doubts that virtualization is useful.

This is not to say I am always right, but I am not so technophobic that anything I don't already know must suck. Also, we should be careful when comparing things. We often compare based on the attributes of existing technologies and tend to forget that new technologies might well have virtues of their own, virtues we don't use for comparison because we are unfamiliar with them, as they don't exist in the technologies we currently use.

One technology that is currently seen as inferior, I think, is the Cloud. We judge a cloud by taking something we run on big iron in-house, throwing it at Amazon and looking at the result. Maybe we should build our applications and infrastructure differently to support clouds, and maybe, if we do that, a cloud might well be more cost-effective, scalable and performant than the stuff we run in our in-house data center.

So don't let new innovative technologies die just because they lack a 9600 baud modem or a serial port. Or because they are no good for washing beer glasses (even if that is a very important dishwasher feature).

/Karlsson

Big Data.. So what? Part 1

This is the first blog post in a series where I hope to rise a bit above the technical stuff and instead focus on how we can put Big Data to effective use. I ran a SkySQL Webinar on the subject recently that you might also want to watch; a recording is available here: http://bit.ly/17TTQnJ

Yes, so what? Why do you need or want all that data? All the data you need about your customers you already have in your Data Warehouse, and all the data you need about the market you are in you can get from some analyst. Right?

Well, yes, those are sources of data, but there is more to it than that. The deal with data is that once you have enough of it, you can start to see things you haven't seen before. Trend analysis is only relevant when you have enough data, and the more you have, the more accurate it gets. Big Data is different from the data you already have in that it is bigger, hence the name, but not only that. Big Data also contains much more diverse types of data, such as images, sound, metadata and video. Also, Big Data has much more new data coming in and is constantly shifting. Research says that each day some 25 quintillion bytes of data are created; that is 25 000 000 000 000 000 000 bytes, in case you were wondering (which is some 25 000 petabytes or 25 000 000 terabytes). And yes, that is every day. (And yes, this is using 1000 bytes per kB, not 1024.)

As I already said, what is interesting with such huge amounts of data is that once the volumes are high enough, you can infer things that you couldn't with smaller or more focused data. You can spot changes that you couldn't before and, in some sense, make qualified predictions about what will happen in the world. Does this sound like rocket science? Well, it shouldn't; the truth is that we have been doing this in at least one field for a very long time, since the 1950s or so, and it was one of the first applications for computers. And no, I'm not talking about Angry Birds here.

What I am talking about is weather forecasting. Using knowledge about how winds blow, temperatures, geography and statistics, we can predict the weather reasonably well. As we all know, these forecasts aren't always right, but even when they go wrong, we get to know why they went wrong. The way these predictions work is to combine large amounts of data with experience and hard facts about how the weather behaves, and the data isn't necessarily directly related to the area where we try to predict the weather either. We can do very little to influence the weather, except of course plan a picnic, which is sure to create thunderstorms.

In the case of, say, sales of some consumer product, we are actually able to influence things more than we can influence the weather. And if we then add our knowledge of our market and its dynamics and combine that with truckloads of related and semi-related data, why shouldn't we be able to make some predictions? Not in the sense of knowing exactly what will happen in the future, but at least having an idea of what is most likely to happen and of how likely that is. Which is how weather forecasts work.

But this isn't all there is to it. Let's pop back to weather forecasting for a second. The analysis done on weather systems is a lot more complex than what is done in most data warehouses; there is more to this than some summaries and averages. Also, the way it is presented, using a map with an overlay of symbols (a sun, a cloud, some poor soul planning a picnic), is different from how we are used to seeing trend data in our data warehouses.

Conclusion:
  • We need ways of dealing with large amounts of fast-moving data - Big Data
  • We need new, better and more specialized analysis - Big Analytics
  • We need new ways to view data - Visualizations
I'll be back soon with something more specific on this subject, so don't touch that dial!

/Karlsson

Saturday, June 22, 2013

Some things never go out of style

Such as SQL I guess. And nerdiness and vi and emacs. Not to mention new potatoes!
Yes, Swedish new potatoes. To be honest, this wasn't the first batch of the year, but it was the first proper batch for me, with dill and some salt, eaten with some butter and an extra pinch of salt. Yummy, better than candy! And yes, this was yesterday, on Midsummer Eve!

And not only are the potatoes a tradition, my blog about them is one too, I guess. For you newcomers: this started out in the MySQL days as a mailing to life@mysql.com, and since I left MySQL in 2010 I have continued it as a blog.

And as a final word, this year's new potatoes were just as good as ever. My twins are really cute and love them dearly, and while they do take a lot of time and energy from me and the wife (in particular the wife), on the other hand we now have someone to pass the new-potato tradition on to.

And maybe some of you were thinking there weren't going to be any potatoes this year. Well, you were wrong; the blog about them was just a bit late.

Cheers folks, and have a great summer
/Karlsson

Friday, May 31, 2013

MyQuery 3.5.2 Beta available

I have just posted version 3.5.2 of MyQuery on SourceForge. This version adds just one feature: the ability to run LOAD DATA LOCAL INFILE, and in the process it adds a few things around this, such as a progress window and the ability to stop a running load.

Download and enjoy!

/Karlsson

Monday, May 27, 2013

JSON datatype in MariaDB prototype

I have patched up a MariaDB version with JSON support, just for the fun of it. This is not the best version of MariaDB around, as I am not a MariaDB developer by any means, and although I have played with the MySQL and MariaDB sources before, I would never claim to know them in detail. So although this works, it's probably full of memory leaks, errors and bad code (my additions, that is).

That said, it actually works. Surprise! As a simple prototype to show off a specific feature, that is.

So, this is what I have: I grabbed the MariaDB 10.0.2 sources and worked from there. To support the JSON specifics, I included the Jansson C library for JSON. So far so good. Then I wanted a JSON datatype; that was the first step. Adding a new datatype to MariaDB / MySQL is a pretty major undertaking though, so I decided to try something different: I decided to kidnap an existing datatype, preferably one with attributes I could reuse for my JSON type, and ended up grabbing the little-used MEDIUMBLOB datatype.

I also decided to use plain JSON text as the representation rather than some compacted binary representation; the only compaction I do is to remove duplicate members and unnecessary spaces. So for a JSON object like this:
{"fname":  "John", "dept": {  "name":  "accounting", "id": 1 }, "lname": "Doe", "fname":  "George"}
I will store
{"fname":"George","lname":"Doe","dept":{"name":"accounting","id":1}}
Which is no big deal I guess. So with this in hand, I can now create a table like this:
CREATE TABLE appuser(id VARCHAR(10) NOT NULL PRIMARY KEY, name JSON);

The JSON datatype checks that the input string is valid JSON, which is different from the way dynamic columns work. Also, the data is compacted as mentioned above.
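Just to illustrate what that check means in practice, here is a quick sketch (the exact error message is up to the prototype, so take that part as an assumption on my part, not actual output):
-- Well-formed JSON passes the type check and is stored compacted:
INSERT INTO appuser VALUES('okuser', '{"fname": "Ok", "lname": "User"}');
-- Malformed JSON (missing closing brace) should be rejected with an error:
INSERT INTO appuser VALUES('baduser', '{"fname": "Broken"');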

Let's insert some data to test this out:
INSERT INTO appuser VALUES('johndoe', '{"fname": "John", "lname": "Doe", "dept": {"name": "accounting", "id": 1}, "role": "admin"}');
INSERT INTO appuser VALUES('susanscott', '{"lname": "Scott", "fname": "Susan", "dept": {"name": "services", "id": 2}}');
INSERT INTO appuser VALUES('janeroe', '{"lname": "Roe", "fname": "Jane", "dept": {"name": "accounting", "id": 1}}');
INSERT INTO appuser VALUES('tomdraper', '{"fname": "Tom", "lname": "Draper", "dept": {"name": "management", "id": 3}, "subordinates": ["johndoe", "susanscott"]}');

Following this, we can do some simple JSON manipulation like this:
SELECT id FROM appuser WHERE json_object_get(name, 'role') = 'admin';

And if the name of the 'admin' role changes to 'administrator' we can do this:
UPDATE appuser SET name = json_object_set(name, 'role', 'administrator') WHERE json_object_get(name, 'role') = 'admin';

And to find whoever is supervising Susan Scott, we can do this:
SELECT id FROM appuser WHERE json_array_contains(json_object_get(name, "subordinates"), "susanscott");

Finally, John Doe is actually the supervisor for Jane Roe, so let's fix that:
UPDATE appuser SET name = json_object_set(name, "subordinates", '["janeroe"]') WHERE id = 'johndoe';

I have a few more things in there, and there are many things missing; this was done just to get a feel for things. The main tool for all this is Jansson, as mentioned above, and Jansson has a bunch of cool tricks up its sleeve. We could easily implement merging of objects and arrays, more array management functions and stuff like that.
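To give an idea of what I mean, functions along these lines would be natural additions (the names below are hypothetical; they are not in the prototype):
-- Merge another JSON object into an existing value (hypothetical function):
UPDATE appuser SET name = json_object_merge(name, '{"location": "Stockholm"}') WHERE id = 'johndoe';
-- Append an element to a JSON array attribute (hypothetical function):
UPDATE appuser SET name = json_array_append(name, 'subordinates', 'janeroe') WHERE id = 'tomdraper';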

Again, what do you think? Is this useful? I sure think this is easier to use than Dynamic Columns (though those have other advantages). I still have things to do: more functions to add, as well as SELECT ... INTO OUTFILE and LOAD DATA INFILE support and stuff like that.

Does anyone want to play with this? I do have an x86_64 build if someone wants to try it.

/Karlsson

Sunday, May 26, 2013

SQL and JSON, what do you think?

As you might know, I'm a big fan of JSON. One big reason is that I believe JSON is closer to most developers' view of data, whereas the relational, SQL-based model is closer to the view of someone working with the data itself, or with infrastructure. What I mean here is that neither view is wrong, but they are different.

So, given that, can we merge the object JSON world with the relational model? Well, not JSON, but Hibernate does it quite well. This is one of my objections to the NoSQL world: the data model is closely linked to the application at hand, and less so to the data itself and to other applications. Stuff such as accounts, privileges, accounting data, orders and many other things is global and not tied to a specific application, but in many NoSQL applications, that is what it ends up being.

And there are not that many good solutions: how can I easily explore data in a NoSQL database, where the data model is application-based, without knowing the application? See what a high-ranking 10gen person, Dwight Merriman (chairman of the board), has to say on the subject in an interview from 2011.

On the other hand, we relational folks have a few things to answer for. We insist that we know how data is to be managed (and I think we do), and we seem to ignore the fact that development tools and environments have changed since the 1980s (we still insist that proprietary client/server protocols are the way to go, and that SQL rules). In my mind, SQL and the relational model rule for data, but not necessarily for applications, and we (including yours truly) should wake up and appreciate what is going on here: we need better, more developer-focused means of accessing data, even if that data is relational.

The NoSQL camp is getting ready to fix what they can fix: proper query languages (Cassandra CQL being one), better data access methods and probably much more that I am not aware of right now.

Which brings me to what we can do to be more developer-friendly on the SQL and relational side of things. And the answer is, as I have already hinted, JSON support! MariaDB has a few tricks up its sleeve, like Dynamic Columns. But this is nowhere near enough; the idea is OK, but the way a developer has to use dynamic columns doesn't help much. Rather, the interface is more targeted at making the Cassandra Storage Engine features easier to access from the MariaDB SQL interface. But I'll do a blog post on Dynamic Columns and JSON later on anyway.
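To show what I mean by the developer experience, here is roughly what working with Dynamic Columns looks like today (the COLUMN_* functions are the real MariaDB ones as far as I know, but the table and data are made up for illustration):
-- Dynamic column values live in a regular blob, and every read and write
-- goes through helper functions rather than normal SQL syntax:
CREATE TABLE item(id INT NOT NULL PRIMARY KEY, attrs BLOB);
INSERT INTO item VALUES(1, COLUMN_CREATE('color', 'blue', 'size', 'XL'));
SELECT COLUMN_GET(attrs, 'color' AS CHAR) FROM item WHERE id = 1;
It works, but it is hardly the first thing a developer raised on JSON would reach for.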

What I would want, but maybe that's just me, is a proper JSON datatype, with JSON functions, syntax and semantics. That is a starting point, I think, and it shouldn't be too difficult. But many other JSON-related things that we could do spring to mind:

  • SQL / JSON language mix - What I mean here is that JSON isn't just a string datatype, but that we can include plain JSON, where appropriate, in the SQL:
    INSERT INTO table VALUES(57, {"empno": 1, "ename": "Scott"})
  • JavaScript stored procedures
  • ROW data as JSON support in the MySQL API
  • JSON column data as ROW data in the MySQL API and tools.
  • A JSON Storage Engine
  • JSON support in LOAD DATA INFILE
  • JSON support in SELECT .. INTO OUTFILE
  • REST / JSON interface to complement the MySQL interface
  • JSON attribute indexing
And then again some stuff I haven't thought of so far. One reason for this blog post is to get your opinion: what do you think? Would this be useful? Would it make MySQL easier to use?

An example of why this is useful: instead of having to add columns to a table for some new attribute that is useful mostly to the application, the application simply adds what it wants to a JSON column in the table. We have all seen this, haven't we? Some odd column that the application wants to keep track of in some table, columns whose data is pretty uninteresting as data but is still needed by the application? This is where a proper JSON type would help. MariaDB Dynamic Columns also help with that, but I think JSON would be a far more elegant solution.
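As a rough sketch of what I have in mind (nothing here exists today; the JSON type and the json_object_set() function are made up for illustration, along the lines of what I am asking for):
-- Application-specific attributes go into a JSON column instead of yet another ALTER TABLE:
CREATE TABLE customer(id INT NOT NULL PRIMARY KEY, name VARCHAR(100) NOT NULL, appdata JSON);
-- The application stores whatever it needs in appdata without touching the schema:
INSERT INTO customer VALUES(1, 'Acme Inc', '{"ui_theme": "dark", "newsletter": true}');
UPDATE customer SET appdata = json_object_set(appdata, 'ui_theme', 'light') WHERE id = 1;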

/Karlsson

Sunday, May 19, 2013

What's the deal with NoSQL?

Everybody seems to be looking at and debating NoSQL these days, and so am I, so I thought I'd say a few words about it. Which is not to say I haven't said stuff before, but then I was mainly targeting specific attributes of many NoSQL solutions (like "eventual consistency" or, as you might call it, "instant inconsistency". What I was opposing is the idea that "eventual consistency" has anything to do with just that, consistency. Rather, what it means is that at any point in time the system may be inconsistent, and even if it happens to be consistent, you cannot rely on it being so. Which is fine, but don't call it consistency, call it inconsistency. Allowing a database to be somewhat inconsistent doesn't necessarily mean there is something wrong with it).

All this said, what is going on here? Why are we MySQL and MariaDB users seeing so many MongoDB, Cassandra and LevelDB applications pop up? Come on, these are typically less functional implementations of a database than even the most basic MySQL setup: no transactions, no joins, no standards, etc. And the answer, if you want to hear what I have to say, is ease of use. So let's explore that a bit.

Think back to the Object Orientation frenzy of the 1990s, when any application project ended up consisting of endless sessions modeling objects, usually involving expensive consultants dressed in expensive blue suits. And when that was done (which took years!) you had a way cool object model, but no money left to do the actual implementation, i.e. the real programming (shiver), so you went off to some other project and the nicely dressed object design consultant left to find another OO sucker.

Now, objects are much more standard; even non-OO languages have a big chunk of OO features, and these are used to enhance programmer productivity and to produce better code and design. Which is fine (except if you were one of those OO consultants, in which case you are now out of a job, as such mundane tasks as writing code are not something you would ever do; such dirty stuff is better left to "programmers". Oh no, I forgot, you are now an ITIL consultant, that just slipped my mind), but how does this relate to MySQL and MariaDB? The answer is that MySQL, which was once considered really easy to use, is no longer as easy as it used to be. The relational data model is still brilliant when you look at data as data, and that is how many of us look at it, so we go through the process of mapping data to objects, if that is what it takes. SQL and Java, PHP or whatever are merged, and the application now contains a layer mapping objects to real data. Or we use Hibernate, which does this for us automatically.

But a new cadre of developers is emerging; they see OO as natural and they see objects as data (which they are not: data, in my mind, should be independent of the application using it, whereas objects are closely tied to the application at hand). By which I do not mean that there is something wrong with building applications using objects, quite the opposite. But if all you know is objects, then using relational technology turns difficult, and SQL, for all the good things about it, seems old-fashioned and arcane, which it is (but it is so widely used you cannot avoid it). So you go with something that treats objects as all you need, and presents them in some object format. Like JSON.

And again, there is nothing wrong with that. But if we who are on the SQL and relational track just discard these NoSQL technologies, we are not making any friends. We have to accept that MySQL and MariaDB really aren't that easy to use anymore, at least not for newcomers.

And then there is another thing: some data, like Big Data, has attributes that really don't fit well in a relational model, data where the attributes of a value can't easily be determined once and for all and you need to reprocess that data (large text objects, images and maps are some examples). In this case, you really need to extend the relational model somehow.

But SQL-based relational isn't going away. The relational model is still one of the best ways to look at data; it's just that we also need some other ways of looking at data. And it needs to be easier to access. And we shouldn't really have to push SQL down the throat of every single developer trying to build an application with some OO technology. The answer is that we need both, and these technologies need to interoperate. I want SQL for my data. But I also want JSON and REST for my data. And there shouldn't be much of a performance overhead. All in all, we SQL folks need to wake up and make data easier to use again. We know data better than the Cassandra and MongoDB folks. We know transactions better than them too. But they know how to work with developers who don't know who The Beatles were, and we need to make relational just as easy for them, without them having to learn JSON (and now you have to listen to a tirade about today's youngsters not knowing what real music is and that it died with John Lennon! What? You don't know who John Lennon was? That's exactly what I mean, you have no taste at all!).

Just my 2 cents...

/Karlsson