Tuesday, September 13, 2016

Week of Code - Sorting Algorithms

Sorting algorithms are a cornerstone of the study of computer science algorithms, and sorting routines appear frequently inside much larger coding projects.  There are a number of well-known algorithms, such as Merge Sort, Quick Sort, Insertion Sort and Bubble Sort, that any avid computer scientist should study.  Before analyzing each one, we assume a working understanding of Big-O Notation and algorithmic complexity analysis.

In this blog post, we look at some of these popular sorting algorithms and implement them in Python.  I've set up a way to benchmark the speed of each and verify its correctness, so we can see how they perform on inputs with different characteristics.


Bubble Sort

We start with perhaps the most trivial sorting algorithm - the Bubble Sort.

def bubbleSort(arr):
    larr = len(arr)
    for i in np.arange(larr):
        # each pass bubbles the largest remaining element to the end
        for j in np.arange(larr - i - 1):
            if arr[j] > arr[j+1]:
                arr[j], arr[j+1] = arr[j+1], arr[j]

This algorithm makes repeated passes over the array from left to right, comparing each pair of adjacent items and swapping them if they are out of order.  Each pass "bubbles" the largest remaining element to the end of the array, so after passing through every item in this manner, all of the items are ordered properly.

The outer loop always runs n times.  The inner loop as written always performs n-i-1 comparisons on pass i; what varies between inputs is the number of swaps.  One exercise is to consider when the extremes happen.  When the list is in reverse order, nearly every comparison triggers a swap (the worst case), while an already or nearly sorted list triggers very few.

It follows that the worst case of Bubble Sort is O(N²); with an early-exit check that stops once a pass performs no swaps, its best case can be O(N).  The Python implementation above produced the best-of-5 runtimes for Bubble Sort shown below.  Every time the size of the list increased ten-fold, the runtime increased one-hundred-fold.  For mostly sorted lists, the runtime was about 33% faster, thanks to the reduced number of swaps.
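The O(N) best case depends on an optimization the implementation above omits: track whether a pass performed any swaps, and stop as soon as one doesn't.  A minimal sketch of that variant (the name bubbleSortEarlyExit is mine):

```python
def bubbleSortEarlyExit(arr):
    n = len(arr)
    for i in range(n):
        swapped = False
        for j in range(n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
                swapped = True
        if not swapped:
            # a full pass with no swaps means the list is sorted,
            # so an already sorted list costs only one O(N) pass
            break
```

On an already sorted list, this performs a single pass of N-1 comparisons and stops.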

Indeed, some argue that Bubble Sort isn't even worth teaching, given how fast its runtime grows with the size of the list.  However, I'd argue that the simplicity of the sort offers a gentle problem-solving environment for anyone beginning to study algorithms.

Mostly Sorted?   N       Runtime (s)
No               10      0.00012
Yes              10      0.00011
No               100     0.00349
Yes              100     0.00227
No               1000    0.32612
Yes              1000    0.20100
No               10000   33.4892
Yes              10000   20.6090


Insertion Sort

The Insertion Sort, like Bubble Sort, is a relatively simple algorithm that is not especially fast.  While it sits in the same complexity class as Bubble Sort, Insertion Sort is much faster when the list is already mostly sorted.

def insertionSort(arr):
    for i in np.arange(1, len(arr)):
        currentvalue = arr[i]
        position = i
        # shift larger elements right until the insertion point is found
        while position > 0 and arr[position-1] > currentvalue:
            arr[position] = arr[position-1]
            position = position - 1
        arr[position] = currentvalue

Ultimately, the Insertion Sort still iterates over each element once in the outer loop.  For each element, it notes the value at that index, then scans backwards through the already-sorted prefix to its left, shifting larger elements one position to the right, and finally inserts the current value into the gap that opens at its correct position.

One other way to understand this algorithm is to consider sorting a hand of cards.  Typically, you'd pick up the cards one at a time and insert each new card into its correct position among the cards already in your hand, so that your hand stays sorted at every step.

At its worst case, the Insertion Sort requires N passes of the outer loop and up to i shifts in the inner loop, which happens when the list is reverse-sorted.  This is O(N²) at worst case, but O(N) at its best case.  That sounds identical to Bubble Sort, so why is Insertion Sort faster at its best case?  Our Insertion Sort uses a While loop in its inner loop, whereas the Bubble Sort used a For loop.  In the best case for Bubble Sort, it still had to iterate over N-i-1 items in every inner loop.  The Insertion Sort's While condition, however, fails immediately when no shift is needed, so on an already sorted list the inner loop exits on its very first check, every time.
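To make the difference concrete, here is a hedged sketch of an instrumented Insertion Sort that counts element shifts (insertion_sort_shift_count and its counter are my additions, not part of the benchmark code).  On an already sorted list of N elements the count is 0; on a reverse-sorted list it is N(N-1)/2.

```python
def insertion_sort_shift_count(arr):
    """Insertion sort that also returns the number of element shifts."""
    shifts = 0
    for i in range(1, len(arr)):
        currentvalue = arr[i]
        position = i
        while position > 0 and arr[position - 1] > currentvalue:
            arr[position] = arr[position - 1]
            position -= 1
            shifts += 1
        arr[position] = currentvalue
    return shifts

# a sorted list never enters the while loop: O(N) total work
# a reverse-sorted list shifts every pair: N*(N-1)/2 shifts, O(N^2) work
```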

The runtime analysis still demonstrates quadratic scaling: every time N increases by one order of magnitude, the runtime increases by two.  It's worth noting that in practice, lists being sorted are often already nearly sorted, so the "mostly sorted" case tends to be more common than a totally random one.  This is a good feature of Insertion Sort, but as we will see, there are far better sorting algorithms.

Mostly Sorted?   N       Runtime (s)
No               10      0.00003
Yes              10      0.00003
No               100     0.00213
Yes              100     0.00029
No               1000    0.20506
Yes              1000    0.02768
No               10000   20.7122
Yes              10000   2.39983


Merge Sort

Merge Sort is one of our more complex algorithms and, more generally, a Divide-and-Conquer algorithm.  Using recursion, the input is divided down into its most basic elements (lists of size 1), which are then "conquered" by recombining them into sorted parts, and finally into a sorted whole.

def mergeSortLen(part):
    """Sorts the list in place (and returns it) using merge sort."""

    if len(part) > 1:
        mid = len(part) // 2  # floor division works in Python 2 and 3
        left = mergeSortLen(part[:mid])
        right = mergeSortLen(part[mid:])

        i = 0  # pointer into left
        j = 0  # pointer into right
        k = 0  # pointer into part, where the merged output is written

        llen = len(left)
        rlen = len(right)
        while i < llen and j < rlen:
            if left[i] > right[j]:
                part[k] = right[j]
                j += 1
            else:
                part[k] = left[i]
                i += 1
            k += 1

        # one side is exhausted; copy the remainder of the other side
        if i < llen:
            remaining = left
            remlen = llen
            r = i
        else:
            remaining = right
            remlen = rlen
            r = j

        while r < remlen:
            part[k] = remaining[r]
            r += 1
            k += 1

    return part

The source code for Merge Sort is much more complex than that of Insertion Sort or Bubble Sort, but the pay-off is worth it.  Understanding Merge Sort assumes an understanding of basic recursion: the code calls itself on ever-smaller parts of the input until the input is so small that it matches the recursive base case, at which point it stops recursing and proceeds to recombine the pieces in sorted order.

Consider an unsorted list:

374591

This would first be split (forming a binary tree of sublists) into:
374     and     591

Which then would further keep splitting:
       3   &   74    and      5  &  91
                7  & 4                    9 & 1

Once the tree has been split down to its leaf nodes, we can begin reconstructing the list by marrying each leaf node with its direct sibling.  Our lowest two sibling pairs are "7 & 4" and "9 & 1".  Both are inverted, so the reconstruction swaps them into "4 & 7" and "1 & 9".

At this point, our recombination needs to marry "3 & 47" and also "5 & 19".  The merging technique is to place a pointer at the start of each side.  At each step, we compare the two pointed-at elements and move the smaller of the two into the joined list.  For instance, with "3 & 47", we'd compare 3 vs 4 and select the 3 for the first spot of the joined list; then we'd pick the 4, and then the 7, giving a joined list of "347".  For "5 & 19", we'd pick the 1 first, and then the 5 and 9.

This gives us a tree which looks like:
347    and    159

One final recombination step initializes pointers at the 3 and 1.   Since 1 is smaller, we select it first, which advances the right-side pointer to the 5, so that we are comparing 3 vs 5.  Pick the 3, and then we're comparing 4 vs 5, and so on.  The final combination gives us our final sorted list of 134579.

In our code, it is possible that one of the two pointers exhausts its side before the other.  In that case, we copy the remaining elements of the unexhausted side, which are already in sorted order, directly onto the end.
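The two-pointer merge step described above can be written as a standalone helper (merge here is my own sketch, separate from the mergeSortLen code): it repeatedly moves the smaller pointed-at element into the result, then copies whatever remains once one side is exhausted.

```python
def merge(left, right):
    """Merge two already-sorted lists into one sorted list."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    # one side is exhausted; the other's remainder is already sorted
    result.extend(left[i:])
    result.extend(right[j:])
    return result

# the final recombination from the example:
print(merge([3, 4, 7], [1, 5, 9]))  # [1, 3, 4, 5, 7, 9]
```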

The analysis of Merge Sort can be solved with recurrence relations.  There are two recursive calls, each operating on half of the input, followed by a linear O(N) pass to recombine the pieces.  Hence T(N) = 2T(N/2) + O(N), which is of the form aT(N/b) + f(N), and case two of the Master Theorem shows that T(N) = O(N log N), with the best case and worst case the same.

The runtime of Merge Sort can vary greatly with the details of its implementation.  In our code, we call len() as rarely as possible (caching the lengths once per merge) to achieve a fairly minimal runtime.  For a list of 10000 elements, the runtime is about 500 times faster than Bubble Sort, and even ~33 times faster than Insertion Sort's best case.  Merge Sort also shows very little variance between mostly sorted lists and totally random lists, which follows from the fact that its best and worst cases have the same complexity.

Mostly Sorted?   N       Runtime (s)
No               10      0.00003
Yes              10      0.00003
No               100     0.00058
Yes              100     0.00055
No               1000    0.00643
Yes              1000    0.00496
No               10000   0.06727
Yes              10000   0.06211


Quick Sort

Like Merge Sort, Quick Sort is a recursive Divide-and-Conquer algorithm.  Its primary difference is that it relies on the selection of a pivot to decide how to split the list into two parts.  After the parts are split and recursively sorted, the "conquering" recombination step is trivial: the partition has already placed each element on the correct side of the pivot, so the pieces are in order with nothing left to merge.

def quicksort(arr):
    quicksorthelper(arr, 0, len(arr)-1)

def quicksorthelper(arr, first, last):
    if first < last:
        pivot = partition(arr, first, last)
        quicksorthelper(arr, first, pivot-1)
        quicksorthelper(arr, pivot+1, last)

def partition(arr, first, last):
    # the last element serves as the pivot value
    for i in np.arange(first, last):
        if arr[i] <= arr[last]:
            swap(arr, i, first)
            first += 1
    # place the pivot between the two partitions
    swap(arr, first, last)
    return first

def swap(A, x, y):
    A[x], A[y] = A[y], A[x]

The selection of a pivot for Quick Sort is a big topic.  In general, three easy choices are the first element, the last element, or the middle element.  This source code selects the last element, though as we will see, the last element suffers the same weakness that is well known for the first: on an already sorted list it is always the largest remaining value, which produces maximally unbalanced splits.

Quick Sort works as follows.  After a pivot is selected, the list is partitioned into the elements less than or equal to the pivot's value and the elements greater than it (by value, not index).  Each side is then recursively sorted in the same way.  The recombination step has nothing to do, since the partition already moved every element greater than the pivot to its right and every smaller element to its left.

Asymptotically, Quick Sort is an O(N log N) algorithm in the average case, but its worst case is O(N²).  The worst case occurs when the splits are maximally unbalanced, e.g. one side receives all of the elements on every split.  This makes pivot selection a very important factor: in practice, lists are often already sorted or mostly sorted, and a first-element or last-element pivot degrades badly on exactly those inputs.
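A common mitigation, not used in the code above, is the median-of-three rule: use the median of the first, middle and last elements as the pivot, which avoids the degenerate splits on sorted input.  A hedged sketch (medianOfThreeIndex is a hypothetical helper of mine, not part of the benchmarked code):

```python
def medianOfThreeIndex(arr, first, last):
    """Return the index of the median of arr[first], arr[mid], arr[last]."""
    mid = (first + last) // 2
    # sorting three (value, index) pairs is constant-time work
    candidates = sorted([(arr[first], first), (arr[mid], mid), (arr[last], last)])
    return candidates[1][1]

# on a sorted list, the median of (smallest, middle, largest) is the middle
# element, so the partition splits roughly in half instead of N-1 vs 0
```

Before partitioning, one would swap this element into the last position and run the existing partition unchanged.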

Indeed, we see that Quick Sort performs noticeably worse on mostly-sorted lists.  It is still far faster than Bubble Sort and Insertion Sort, however, and compared to Merge Sort there is basically no difference except in the mostly-sorted case.

Mostly Sorted?   N       Runtime (s)
No               10      0.00002
Yes              10      0.00005
No               100     0.00036
Yes              100     0.00084
No               1000    0.00559
Yes              1000    0.01608
No               10000   0.06909
Yes              10000   0.16705


Tim Sort

So far, we could argue that Merge Sort is the clear winner among the algorithms shown.  How could one make Merge Sort faster?  Python's native sorting algorithm, Tim Sort, is a hybrid that combines features of both Merge Sort and Insertion Sort.  We have seen that Insertion Sort performs very well on already mostly sorted sequences.  Tim Sort finds subsequences of already-sorted data in the input and exploits those regions to sort the remaining portions more efficiently.
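The key idea of finding existing sorted runs can be illustrated with a small sketch (findRuns is my own illustration of the concept; CPython's actual implementation also handles descending runs and enforces a minimum run length):

```python
def findRuns(arr):
    """Split arr into maximal non-decreasing runs, preserving order."""
    runs = []
    start = 0
    for i in range(1, len(arr)):
        if arr[i] < arr[i - 1]:
            # the current run ends where the ordering breaks
            runs.append(arr[start:i])
            start = i
    runs.append(arr[start:])
    return runs

# a mostly sorted list decomposes into a few long runs,
# leaving far less merging work to do
```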

def timsort(arr):
    arr.sort()

Using Tim Sort amounts to simply calling the sort() method on any sortable list.  We wrap the call in a function so that it can be benchmarked alongside the other algorithms.

As we can see, Tim Sort's runtimes improve on mostly-sorted lists, and even in its slower cases it is about 20 times faster than Merge Sort, and up to 40 times faster on mostly-sorted lists.  Asymptotically, Tim Sort is O(N log N) in the worst case and O(N) in the best case, making it linear on its most favorable inputs.

Mostly Sorted?   N       Runtime (s)
No               10      0
Yes              10      0
No               100     0.00002
Yes              100     0.00001
No               1000    0.00027
Yes              1000    0.00015
No               10000   0.00369
Yes              10000   0.00154


Utility Source Code

We used a random list generator built on Numpy.  To ensure sparsity of the generated numbers, we make the range of possible values much larger than the number of elements needed.  To generate mostly sorted lists, we sorted a random list and then randomly swapped about 5% of its elements.

To verify the correctness of our algorithms, we ran their output through a verify_correct() method, which iterates over every adjacent pair of elements to ensure they are ordered properly.

Our benchmark tool ran for N = 10, 100, 1000 and 10000 elements, and for both randomized lists and slightly-out-of-order lists.  For each level of N and orderliness, we ran the sorting algorithm 5 times and chose the best runtime (best-of-five), meaning that every sorting algorithm ran 5*4*2 = 40 times.
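The best-of-five timing pattern can also be expressed with the standard library's timeit module; this is a sketch of the idea, not the benchmark code used for the tables above (best_of_five is my own name):

```python
import timeit

def best_of_five(sort_fn, data):
    """Return the best of 5 wall-clock runtimes of sort_fn on copies of data."""
    return min(timeit.repeat(lambda: sort_fn(list(data)), repeat=5, number=1))

# example: time the built-in sort on a 1000-element reversed list
runtime = best_of_five(lambda a: a.sort(), range(1000, 0, -1))
```

Taking the minimum of several runs filters out transient system noise, which is why best-of-N is generally preferred over averaging for microbenchmarks.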

import random
import time

import numpy as np

def genRandomList(n=21, sparsity=20):
    # the value range is much wider than n, so repeated values are rare
    return list(np.random.randint(low=-n*sparsity, high=n*sparsity, size=n))

def genMostlySortedList(n=21, sparsity=20):
    arr = list(np.random.randint(low=-n*sparsity, high=n*sparsity, size=n))
    arr.sort()
    f = 0.05  # fraction of elements to perturb
    larr = len(arr)
    for i in range(int(f * larr) + 1):
        x = random.randint(0, larr-1)
        y = random.randint(0, larr-1)
        arr[x], arr[y] = arr[y], arr[x]
    return arr

def verify_correct(arr):
    # every adjacent pair must be in non-decreasing order
    for item, nxt in zip(arr[:-1], arr[1:]):
        if item > nxt:
            return False
    return True


def benchmark(method):

    for n in [10, 100, 1000, 10000]:
        for mostly_sorted, gen in [(False, genRandomList), (True, genMostlySortedList)]:
            best_runtime = None
            correctness = True
            for repeat in range(5):
                x = gen(n=n)
                time_of_start = time.time()
                method(x)
                time_of_end = time.time()
                runtime = time_of_end - time_of_start
                if best_runtime is None:
                    best_runtime = runtime
                else:
                    best_runtime = min(best_runtime, runtime)
                correctness = correctness and verify_correct(x)
            print "{:<20s}, mostly sorted={!s:<5}, {:<8d}, {:.5f}, {}".format(
                method.__name__, mostly_sorted, n, best_runtime,
                "correct" if correctness else "incorrect")

    print "-----------------------------------------------------------"


Test Driver

Finally, to run our code, we called the benchmark() method on each of our sorting algorithms, passing in each function to be benchmarked.  The result is a print-out of the performance of each algorithm.

benchmark(mergeSortLen)
benchmark(timsort)
benchmark(quicksort)
benchmark(insertionSort)
benchmark(bubbleSort)


Full Code

The full code was written with Python 2.7.8 and Numpy 1.9.0.
Code available at: http://pastebin.com/YwY5K404
