Saturday 11 August 2012

Arrays in CFML (part 1)

G'day:
I'm going to write up my understanding of how arrays work in CFML. This is a mundane topic in any programming language, but it's also absolutely critical, and having a good handle on how they work is very important. I will say up front that if you're a seasoned CFML developer, there probably won't be anything in here you don't already know.



Rationale

One would think that this is a well-trod topic, but I'm not finding it so: most of the coverage I've found has been fairly superficial. The chief online docs resource - Adobe's own docs - falls into this category. I don't know of any other online resource covering this (please point them out to me so I can cross reference them here). In print, I just found a CFWACK on one of our bookshelves in the office, and checked it out: the coverage in that is also pretty superficial (more so than the online docs, actually). I was a bit disappointed by that. So, anyway, I'm gonna DIY. I'm going to start from the beginning and do a knowledge dump of everything I know, my personal approach to things, and other stuff I can dig up. I'm going to split this across a coupla postings, because the coverage I have in mind will be more than will sensibly fit into one post.

Another reason I am writing this is because I come across a small but not insignificant number of people who will say "ooh... I don't use arrays: I don't really understand them", or they use arrays when they ought to be using structs or recordsets. It might be a case of them not seeking out the knowledge, but it might be a case of them just not understanding what they have successfully sought out. I don't claim to write in a more accessible fashion than anyone else, but everyone's writing style is different, and the way the reader understands things is different too. Maybe this'll strike a chord with some people, and it'll help them "get" arrays.

What's an array?

First up, what's an array (I told you I'm starting with the basics...)? For serious reading - and more in-depth theoretical coverage than I intend to go into here - read this Wikipedia article.

The short version is that an array is a complex data type ("complex" in that it comprises more than one value in a single data structure), the elements of which are accessed via a numerical index. In CFML the index must be a positive integer (a whole number, starting from one). Many languages - such as Java and JavaScript - start their indexes from zero. I discuss the merits of CFML's approach in another article I wrote a coupla weeks ago. There is another type of array - an associative array - wherein the index (more often referred to as a key) can be any value. In CFML, associative arrays are called structs, and I'll deal with them separately in some later articles unless I get booed off the stage after this one ;-).

What's the syntax?

Accessing an array is done using the following syntax:

myArray[2]

Where myArray is the name of the array variable, and 2 is a reference to the second element in the array. So the general form is:

variableName[index]

There are three terms to keep in mind when working with arrays:
  • the array variable itself, eg: myArray;
  • the array index (the number in the square brackets);
  • the array element. This is the value in the array variable at the position indicated by the index.

So given:

myArray[3] = "Toru";

The array variable is "myArray", the index is 3 (ie: the third element in the array), and the element at that index position is "Toru".

It's important to remember the difference between the index and the element. I stress this because CFML itself gets this wrong in at least one place I can think of: using <cfloop> to loop over an array: it refers to the element as the index. I've tried to get Adobe to rectify this, but they don't seem to mind that they're using the wrong term here. Not to worry.
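
To make the index/element distinction concrete, here is a minimal sketch of the two loop styles. It assumes CF8 or later (for the array attribute of <cfloop> and the array literal); the variable names element and i are just illustrative.

```cfml
<cfset myArray = ["Tahi", "Rua", "Toru"]>

<!--- the attribute is named "index", but on each iteration it receives the ELEMENT --->
<cfloop array="#myArray#" index="element">
    <cfoutput>#element#<br></cfoutput>
</cfloop>

<!--- an indexed loop, where the loop variable really is the index --->
<cfloop from="1" to="#arrayLen(myArray)#" index="i">
    <cfoutput>#i#: #myArray[i]#<br></cfoutput>
</cfloop>
```

In the first loop the variable holds "Tahi", "Rua" and "Toru" in turn; in the second it holds 1, 2 and 3, and the element has to be fetched with myArray[i].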

Arrays can also be referenced in their entirety, just via the variableName. One would do this if passing the array to a function, or performing some other operation that acts on the entire array, such as a loop. I'll get into that lot later.

Before we use an array, it needs to be created. There's a coupla ways of doing this in CFML.

The old-school way is to declare an array, then populate it, as follows:

myArray = arrayNew(1);
myArray[1] = "Tahi";
myArray[2] = "Rua";
// etc

This is fairly self-explanatory: it's a series of variable assignments.

First we set myArray to be an empty one-dimensional array using arrayNew(). The "1" indicates it's a one-dimensional array. An array can have multiple dimensions, but for the time being we'll stick to one-dimensional ones.

Next we set the elements at index positions one and two to have values in the same way we'd set any variable. Note that whilst I'm using simple string values for my array elements, an array can hold elements of any data type, and each element can be of a different data type. I make a point of mentioning this because other languages (eg: Java) require all elements to be of the same specific type.

Using arrayNew() and individual variable assignments is a valid - if slightly cumbersome - way of populating an array. Since CF8 one has been able to use a shorthand notation, or "array literal notation". This is equivalent to the code above:


myArray = ["Tahi", "Rua"];

This does the array creation and element assignment in one fell swoop.

This notation is often referred to as "implicit notation" in the CFML community, but this is a misnomer: there's nothing "implicit" about the notation; it's very explicitly creating an array. I think this usage stems from Adobe's own misuse of the term "implicit" as per these docs. It's best you are at least aware of this idiom, but perhaps don't perpetuate it.

There are other ways that array variables can be created: many of CF's inbuilt functions return arrays, or convert other data types to arrays (eg: listToArray(), structKeyArray()). I'll cover those later.
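
As a quick taste of those two functions, here's a minimal sketch. The variable names are illustrative only, and I'm making no promises about the order in which structKeyArray() returns the keys.

```cfml
<cfscript>
// listToArray() splits a delimited string (comma by default) into an array
numbers = listToArray("Tahi,Rua,Toru");
writeOutput(arrayLen(numbers)); // 3
writeOutput(numbers[2]);        // Rua

// structKeyArray() returns the keys of a struct as an array;
// the order of the keys is not something to rely on
person = {firstName="Donald", familyName="Cameron"};
keys = structKeyArray(person);
writeOutput(arrayLen(keys));    // 2
</cfscript>
```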

So that's how arrays are created and referenced. What are they for?

When does one (not ~) use an array?

Arrays are best used for representing multiple data elements that have some notion of sequencing. Let's go back to my earlier example:

myArray = ["Tahi", "Rua"];

Clearly there's a sequence to those values: the first one comes before the second one. One might also have an array of invoice lines, or of line items in a shopping cart. Or - jumping on the current zeitgeist - the result list in an Olympic event, or indeed the medal table: each is clearly multiple data elements that have a sequential relationship with each other. There's a sense of first, second, third.

(On reflection: the Olympic examples might not actually be great because occasionally participants in events finish equally, and medals are shared; also various countries have the same medal tally, so would appear in the same position in the array... which doesn't work with a simple array. But for illustrative purposes, you get my drift: there needs to be a sense of sequence).

But appropriateness of examples aside, the main point is there's an inferred numerical ordering to the data.

I stress the "numerical ordering" side of things because I occasionally come across code wherein something like a person's name is stored in an array, thus:

myName = ["Donald", "Adam", "Cameron"]; // yeah, my first name is actually Donald. Long story.

Whilst there's definitely a sense of order there (I'm "Donald Adam Cameron", not "Adam Donald Cameron"), the sequence here is not explicit: it's implied by convention. A person's name is far better stored as a struct, eg:

myName = {firstname="Donald", middleName="Adam", familyName="Cameron"};

(The curly bracket notation is the struct-equivalent of the square bracket notation for an array, in case that notation is unfamiliar to you).

Another consideration demonstrated here is that my name is not a sequence of "same things". Well they're all names, sure, but one's a first name, one's a middle name, one's a family name. If the elements of the data structure have a distinct meaning individually, rather than the data structure as a whole giving "meaning" to the elements, then the data structure oughtn't be a numerically indexed array. Hmmm... maybe that's not the best way to word it. I guess if the sequential / numeric-precedence of the data elements is the chief factor that makes the data a single entity, then it's an array; otherwise it's probably a struct. Or when you're thinking of accessing the elements, you think "right, I need the 2nd item" rather than "I need the middleName", then it's an array. Make sense? Or does it make less sense now than before I started? Hmmm.
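
If it helps, here's the same idea as a small side-by-side sketch (the variable names nameAsArray, nameAsStruct and counting are mine, purely for illustration). The question to ask is which of these accesses reads most naturally for the data at hand.

```cfml
<cfscript>
nameAsArray  = ["Donald", "Adam", "Cameron"];
nameAsStruct = {firstName="Donald", middleName="Adam", familyName="Cameron"};

// with the array, the reader has to know that position 2 means "middle name"
writeOutput(nameAsArray[2]);          // Adam

// with the struct, the key states the meaning
writeOutput(nameAsStruct.middleName); // Adam

// a sequence of "same things" is where the array earns its keep:
// here "the 2nd item" is exactly what one is after
counting = ["Tahi", "Rua", "Toru"];
writeOutput(counting[2]);             // Rua
</cfscript>
```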

Multi-dimensional arrays

Earlier I mentioned multi-dimensional arrays. In CFML a multi-dimensional array is basically an array of arrays rather than a true multi-dimensional array. Remember how I pointed out earlier that whilst all my examples are using elements that are simple values, an array element can actually be any data type? Well that means an element can itself be an array. This is demonstrated here:

myArray = [
    ["First row, first element", "First row, second element"],
    ["Second row, first element", "Second row, second element"]
];

Here I've split the code over multiple lines to make it clearer which array is which, plus I'm using the notion of "rows" to describe the first array. If I use <cfdump> to output myArray, I see this:

array
1  array
   1  First row, first element
   2  First row, second element
2  array
   1  Second row, first element
   2  Second row, second element

Note that it's an array with two elements, each element itself being an array.

If we were using old-school notation, we'd create this array thus:

myArray = arrayNew(1);
myArray[1] = arrayNew(1);
myArray[1][1] = "First row, first element";
myArray[1][2] = "First row, second element";
myArray[2] = arrayNew(1);
myArray[2][1] = "Second row, first element";
myArray[2][2] = "Second row, second element";

This also demonstrates how to reference elements of the second dimension of the array:

myArray[m][n]

Where m is the index of the element in the first dimension, and n is the index of the element in the second dimension.

One can also reference the entire array at each index of the first dimension using the one-dimension-array notation we covered earlier.

myArray[2]

EG:

<cfdump var="#myArray[2]#">

Which yields:

array
1  Second row, first element
2  Second row, second element


What one cannot do in CFML is return the slice of the data structure that is at a specific index of the second dimension of the multi-dimensional array. Given the array above, there is no syntax to retrieve the second dimension at index position 2, which would be "First row, second element" and "Second row, second element". This is because a CFML "multi-dimensional" array is no such thing: it's just an array of arrays. Or indeed it could simply be an array with one element being an array amongst a bunch of other data types. So I guess even trying to support the syntax is a fool's errand, plus there's really no sense of true dimensionality beyond the first one.
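
If one does need that "column", it has to be assembled by hand. Here's a minimal sketch of one way to do it, under the assumption that every first-dimension element is defined; columnIndex, column, rowIndex and row are hypothetical names of my own. Because the outer array can hold anything, the loop checks each element before reaching into it.

```cfml
<cfscript>
myArray = [
    ["First row, first element", "First row, second element"],
    ["Second row, first element", "Second row, second element"]
];

columnIndex = 2; // hypothetical: the second-dimension index we want
column = [];

for (rowIndex = 1; rowIndex <= arrayLen(myArray); rowIndex++) {
    row = myArray[rowIndex];
    // skip anything that is not an array, or is too short to have this index
    if (isArray(row) && arrayLen(row) >= columnIndex) {
        arrayAppend(column, row[columnIndex]);
    }
}
// column now holds "First row, second element" and "Second row, second element"
</cfscript>
```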

Just to make clear what I said about an array being able to contain any kind of data, here's a quick example:

myArray = [
    ["First row, first element", "First row, second element"],
    "Just a string",
    {key="This is a struct"}
];


Which - when dumped - yields:

array
1  array
   1  First row, first element
   2  First row, second element
2  Just a string
3  struct
   KEY  This is a struct


That said, one can define and initialise an array of arrays in CFML using arrayNew(), thus:

myArrayOfArrays = arrayNew(2); // that's a "two-dimensional array"

This really just signals intent rather than doing anything special: one could easily then go ahead and do this:

myArrayOfArrays[1] = "not an array";

arrayNew() supports up to three dimensions being defined. However there's nothing to stop one defining as many dimensions as one likes, as demonstrated here:

myTenDimensionalArray = [
    "Tahi",
    [
        "Rua",
        [
            "Toru",
            [
                "Wha",
                [
                    "Rima",
                    [
                        "Ono",
                        [
                            "Whitu",
                            [
                                "Waru",
                                [
                                    "Iwa",
                                    [
                                        "Tekau"
                                    ]
                                ]
                            ]
                        ]
                    ]
                ]
            ]
        ]
    ]
];


array
1  Tahi
2  array
   1  Rua
   2  array
      1  Toru
      2  array
         1  Wha
         2  array
            1  Rima
            2  array
               1  Ono
               2  array
                  1  Whitu
                  2  array
                     1  Waru
                     2  array
                        1  Iwa
                        2  array
                           1  Tekau

(Whoah, that right & bottom edge of that <cfdump> looks a bit trippy).

Or to extract "Tekau":

ten = myTenDimensionalArray[2][2][2][2][2][2][2][2][2][1];

The three-dimension limit can be exceeded simply because these multi-dimensional arrays are just arrays of arrays, as mentioned above.

So why would one use a multi-dimension array? Well, let's stick to two-dimensional ones to start with. What are they good for? Very little in my experience. People use 'em, sure, but usually the purposes they're putting them to would be better implemented as an array of structs or objects, or a struct of arrays, or some combination of arrays/structs/objects. In business applications (which is mostly what CFML code is used for) their uses are limited. If one was doing maths, then they're obviously good for representing matrices, or sets of coordinates. But how often does one need to do that? There's nothing wrong with CFML catering for not-oft-needed functionality (<cfpod>, anyone?), but the thing to take away here is that if you find yourself thinking "ah... I'll use a multi-dimension array for that...", I urge you to give it more thought: you might not be making the best decision.
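
To illustrate the sort of alternative I mean, here's a sketch of the same made-up shopping cart data held both ways. The data, the key names and the variable names are all invented for the example, and it assumes a CFML version that allows nested literals, as the examples above do.

```cfml
<cfscript>
// two-dimensional version: the reader must remember what each position means
cartAs2dArray = [
    ["Widget", 2, 9.99],
    ["Gadget", 1, 24.50]
];
gadgetPrice = cartAs2dArray[2][3]; // third "column" of the second line... which is what, again?

// array of structs: the lines are still a sequence (so: an array),
// but each line's values are named (so: a struct)
cart = [
    {description="Widget", quantity=2, unitPrice=9.99},
    {description="Gadget", quantity=1, unitPrice=24.50}
];
gadgetPrice = cart[2].unitPrice;   // the unit price of the second line
</cfscript>
```

Both versions hold the same values; the second one just doesn't rely on the reader knowing that position 3 is the price.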

One usage for two-dimensional arrays that I can think of is to represent the rows and columns of a spreadsheet, if one can't be bothered wrestling with ColdFusion's spreadsheet functions. I think with a spreadsheet, despite the columns being alphabetically indexed rather than numerically, they still represent sequentially indexed data, and an integer serves just as well as a letter (or series of letters) does.

If anyone else has good examples of where multi-dimension arrays are the best-fit for describing a data structure, let me know!

Sparse arrays

One last thing to note when considering initialising arrays is that one doesn't need to set the elements sequentially. This is legit:

mySparseArray = arrayNew(1);
mySparseArray[3] = "Toru";

Note how I've not set any elements at indexes one or two there. Here's the dump:

array
1  [undefined array element] Element 1 is undefined in a Java object of type class coldfusion.runtime.Array.
2  [undefined array element] Element 2 is undefined in a Java object of type class coldfusion.runtime.Array.
3  Toru

(Hmmm... I'm not sure what CF thinks it's doing spitting out an error message in there!).

Observe I used old-school notation for that demonstration. This equivalent code using shorthand notation doesn't compile:

mySparseArray = [ , "Toru"];

That yields:

Invalid CFML construct found on line 2 at column 17.

ColdFusion was looking at the following text:[

(Yeah, I know I can use javaCast() to put a null in there, but that's not equivalent code).

To be frank... as evidenced by <cfdump> including an error message in its output there, ColdFusion doesn't cope well with sparse arrays, and a lot of functionality will error if it encounters empty elements. Tread with caution if you find yourself needing to use a sparse array.
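
One way of treading with caution is to check each position before touching it. This is a minimal sketch, assuming CF8 or later (for arrayIsDefined()); the "[no element]" placeholder text is my own.

```cfml
<cfscript>
mySparseArray = arrayNew(1);
mySparseArray[3] = "Toru";

// arrayLen() reports 3 here, even though only one element has been set
for (i = 1; i <= arrayLen(mySparseArray); i++) {
    if (arrayIsDefined(mySparseArray, i)) {
        writeOutput(i & ": " & mySparseArray[i] & "<br>");
    } else {
        // reading mySparseArray[i] directly at this position would error
        writeOutput(i & ": [no element]<br>");
    }
}
</cfscript>
```

Note that the loop only ever checks positions from one to arrayLen(), so arrayIsDefined() is never asked about an index beyond the end of the array.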

Next...


I think that pretty much sums up what arrays are, and how to create them. I've not covered how to use them or what one can do with them from a CFML perspective... I'll get onto that next...

--
Adam