Objects are intuitive because the world is full of objects, each with properties and abilities.
The first object-oriented programming (OOP) language, Simula, was designed for simulating real-world objects.
The OOP language Smalltalk was used to build the first graphical user interfaces, with desktops, windows and menus:
metaphors for real-world objects with properties and methods of use.
Programs need to abstract real-world objects such as customers, orders, products and contacts, so OOP is a natural fit.
There was a well-known book from the 1970s, Algorithms + Data Structures = Programs,
written before object-oriented programming became mainstream.
To cover OOP the title might be extended to Algorithms + Data Structures + Abstract Data Types = Programs.
For example, one can push to and pop from a stack, but the stack could be implemented using a number of different data structures,
and the operations push and pop do not constitute an algorithm.
Rather, push and pop together constitute an abstract data type.
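As a minimal sketch in Python (the class names are illustrative, not from the text), the push/pop interface stays the same whichever data structure backs it:

from collections import deque

class ListStack:
    def __init__(self):
        self._items = []          # backed by a dynamic array
    def push(self, item):
        self._items.append(item)
    def pop(self):
        return self._items.pop()

class DequeStack:
    def __init__(self):
        self._items = deque()     # backed by a linked structure of blocks
    def push(self, item):
        self._items.append(item)
    def pop(self):
        return self._items.pop()

for stack in (ListStack(), DequeStack()):
    stack.push(1)
    stack.push(2)
    print(stack.pop())            # 2 from either implementation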
Concepts in OOP include: abstraction, composition and aggregation, data hiding and encapsulation, polymorphism.
Encapsulation is about hiding the complexity of a system or a class behind interfaces.
Data hiding controls or restricts access to the data inside a class,
typically in C++ and Java by using the private and protected keywords (and, in C++, friend).
Dependency injection involves objects knowing only about abstractions:
objects are not responsible for constructing the data they contain,
and so need not know the concrete type of that data, only its interface.
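A minimal sketch of dependency injection in Python, assuming illustrative names (Repository, OrderService): the service knows only the abstraction, and the concrete type is constructed elsewhere and injected.

from typing import Protocol

class Repository(Protocol):
    def save(self, order: dict) -> None: ...

class InMemoryRepository:
    def __init__(self):
        self.orders = []
    def save(self, order: dict) -> None:
        self.orders.append(order)

class OrderService:
    def __init__(self, repository: Repository):   # the dependency is injected
        self._repository = repository
    def place_order(self, order: dict) -> None:
        self._repository.save(order)

# In production a database-backed repository could be injected instead.
service = OrderService(InMemoryRepository())
service.place_order({"product": "widget", "quantity": 3})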
Aggregation versus Composition
With aggregation a child can exist independently of its parent container, for example a class has many students.
With composition a child cannot exist independently of its parent container, for example a street has many houses.
From a relational database/model perspective, if a child has a foreign key reference to the parent then the child cannot be independent.
Rows in an 'order detail' table would all have a foreign key reference to the parent order and so this would be composition.
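A minimal sketch of the distinction in Python (class names are illustrative): the school class aggregates students that exist independently of it, while the street composes houses that it creates and owns.

class Student:
    def __init__(self, name):
        self.name = name

class SchoolClass:                 # aggregation: students are passed in and outlive the class
    def __init__(self, students):
        self.students = students

class House:
    def __init__(self, number):
        self.number = number

class Street:                      # composition: houses are created by, and die with, the street
    def __init__(self, numbers):
        self.houses = [House(n) for n in numbers]

alice = Student("Alice")
maths = SchoolClass([alice])       # alice exists whether or not the class does
street = Street([1, 2, 3])         # the houses have no life outside the street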
Polymorphism
A function takes parameters each of which has a type.
If the function interface is not bound to a single implementation we have polymorphism.
For instance the expression 'x + y' can add integers or floating-point numbers, or concatenate strings.
Traditionally, strongly typed languages supported a kind of ad hoc polymorphism
that worked for arithmetic operators and built-in functions but not for user-defined functions.
In OOP we can write or use code without caring about the implementation, and so polymorphism allows old code to call new code.
With interface inheritance we create an interface that extends another: a circle or square class can extend a shape, and so code can just ask for its area
without caring whether it is a square, a circle or something else.
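A minimal sketch in Python: calling code asks any Shape for its area without caring about the concrete class (the names Shape, Circle, Square are illustrative).

from abc import ABC, abstractmethod
import math

class Shape(ABC):
    @abstractmethod
    def area(self) -> float: ...

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius
    def area(self) -> float:
        return math.pi * self.radius ** 2

class Square(Shape):
    def __init__(self, side):
        self.side = side
    def area(self) -> float:
        return self.side ** 2

shapes = [Circle(1.0), Square(2.0)]
print(sum(shape.area() for shape in shapes))   # old code calling whatever new Shape appears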
...
Parametric polymorphism allows functions or methods to be implemented
in a generic way so that they can handle many different types.
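A minimal sketch of parametric polymorphism using Python's typing module: first() is written once, generically, and works for sequences of any element type T.

from typing import TypeVar, Sequence

T = TypeVar("T")

def first(items: Sequence[T]) -> T:
    return items[0]

print(first([1, 2, 3]))        # T is int
print(first(["a", "b", "c"]))  # T is str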
...
Virtual Machines and Containers
Before virtual machines
Long ago there were far more people than computers, computers were shared.
Then everyone had their own computer and servers for each application.
Each application environment (development, test, Q/A, production, demonstration, disaster recovery) might have its own server.
Since servers were rarely under constant load, they mostly sat idle.
One bank had some 29,000 servers mostly under little load, yet when everyone was asked to make savings only some 20 odd were decommissioned.
It was hard to use the idle capacity.
One server I looked after was a critical front office system, whose load went up dramatically over the years.
It had very high load at 9am, at 4am, and even higher every third Thursday and on other special days.
We did much optimization and worked with the client teams to optimize their usage.
There was no straightforward way to scale in and out according to actual load.
At another bank, failover for disaster recovery (DR) was tested one weekend each year.
When a data center did actually go offline during the working day,
we found some applications had their primary in the secondary center and their secondary in the primary.
Some had both primary and secondary servers in the primary center and some in the secondary.
My solution, to avoid weekend work, was to run primary and secondary in different data centers as a fault tolerant pair.
They exchanged messages, if one fell over, the other would notice and take over as primary.
We tested this every morning. We could also release a new version to the secondary, kill the primary, and let the secondary take over.
If there were issues we could easily rollback to the old version running on the primary.
If all went well we could upgrade both primary and secondary to the new version.
Release and disaster recovery are similar in that they both involve restarting software or failover to a new machine.
Still, we had a secondary box idle all of the time and a primary partially idle apart from peaks.
My server ran in London, conveniently between Tokyo and New York,
though ideally one might like versions in every region to handle regional traffic.
In summary, the issues with dedicated physical servers are: idleness and waste, difficulty in scaling to higher load,
and inflexibilities caused by physical location for disaster recovery and regional traffic.
The Cloud answers all these issues.
Elastic computing involves scaling-in to avoid idleness and scaling-out to handle load.
The cloud provides great flexibility in provisioning (the process of setting up IT infrastructure):
without any physical effort one can allocate and copy infrastructure
between physically separate availability zones (AZs)
and between regions that are physically, administratively and (to an extent) legally separate.
Virtual Machine
An operating system (OS) is software that abstracts hardware so that an application programmer just sees
processes, memory, disk, network without having to know any details about the physical devices that make up the computer.
Traditionally, a computer runs one instance of an operating system.
With virtual machines (VMs), one computer can run multiple instances of an operating system, sharing the physical resources.
One could have multiple versions of Linux and various versions of Windows all running on the same server (or laptop).
Virtualization
Virtualization means to create something that behaves just like something real.
Virtual memory
creates an illusion of a huge single expanse of memory, that is actually made up of multiple memory chips and storage devices.
A virtual address looks just like a physical address to
a program but is actually translated to a location on a physical memory chip (or storage device) by a
memory management unit (MMU).
Another example of virtualization is a virtual private cloud (VPC),
that creates an illusion of having your own physically secured and isolated network and machines (like your own data center),
which actually runs on shared hardware provided by Amazon.
Hypervisor
Hardware virtualization
refers to hiding the physical components of a machine and creating the illusion of an isolated CPU, memory and I/O devices.
Traditionally an operating system is designed to have full control of the hardware and so might run in the CPU's privileged mode.
A Hypervisor is a control program that allows multiple operating systems
to run on the same hardware.
The operating system kernel is also known as a supervisor and the term hypervisor refers to a supervisor of supervisors.
A hypervisor is also known as a virtual machine monitor (VMM).
When virtual machines talk to the hypervisor they might think they are really talking directly to the hardware.
A type 1 hypervisor (a native or “bare metal” hypervisor) runs directly on the system hardware.
A type 2 (or hosted) hypervisor, by contrast, is just a normal program running on the host operating system.
With type 2 the server is like any other, it runs a single instance of a host operating system.
All the virtual machines just run as though they were normal processes.
Virtual machines (virtualized operating systems) must be isolated from each other and so run non-privileged;
mostly they just run as normal, but calls to privileged instructions must be replaced either statically, by changing the source code,
or dynamically at run-time.
The CPU might trap such an instruction so that the hypervisor can emulate it safely.
It is important for performance that the dominant fraction of machine instructions executes without involving the hypervisor.
With para-virtualization, the operating system source is modified to call a special API (a hypercall) and then recompiled. So special binaries are required.
With full virtualization, the VM simulates a complete (or sufficient) hardware environment to run an unmodified guest OS (unmodified binaries) in isolation.
Full virtualization can be implemented by modifying the binary code (binary translation) at run time, which can be expensive.
With hardware-assisted virtualization (or accelerated virtualization), the CPU itself provides support for full virtualization at the chip level,
helping to avoid the costs of binary translation; examples are Intel VT-x and AMD-V.
There is also OS-level virtualization, i.e. containers, which are described in detail below.
It is worth noting that although virtual machines in UNIX/Linux might seem relatively recent they are really nothing new and have been around since the 1960s.
Examples of hypervisors include Xen, a type 1 hypervisor; Microsoft Hyper-V, also type 1; and VMware Workstation, a type 2 hypervisor.
The AWS Nitro Hypervisor builds on Linux Kernel-based Virtual Machine (KVM).
The KVM module converts the Linux kernel into a type-1 hypervisor.
In AWS virtual machines are known as EC2 instances, in Azure they are known as VMs.
EC2 instances launched into a VPC have a tenancy attribute.
The default is to run on shared hardware, where other customers might be using the same physical CPUs at the same time, although there is no legitimate means
for you to interfere with each other.
With 'dedicated' the virtual machine runs on single-tenant hardware, a single CPU with no one else, although the actual CPU might change.
With 'host' the virtual machine runs on a Dedicated Host, your own physical server.
AWS also provides various pricing models (on-demand, spot, reserved) which are largely unrelated to virtualization.
Hardware
A host machine is the hardware on which a VM is installed.
In the Cloud this is mostly hidden, indeed the hardware changes over time.
When a virtual machine restarts it will typically restart on different hardware.
Both Azure and AWS
provide Dedicated Hosts, dedicated physical servers, to run your software.
With AWS this might be done by setting the tenancy attribute to 'host'.
Images
An image is a snapshot of a virtual machine's storage: the operating system itself, installed programs and data.
One can start new virtual machines from an image,
and one can download images for many different versions of operating systems.
Auto scaling relies on images: when a new instance is provisioned it is initialized from an image.
One can also load data onto an image, which is useful if you have a lot of static data;
start-up is faster and there is no need to access a database.
I have worked with systems that loaded the operating system, application and 100GB of static data.
In AWS these are called Amazon Machine Images (AMI), VM images in Azure or VMware images in VMware.
Containers
Every UNIX user knows that the top-level directory is / and contains /bin, /usr, /home;
a single namespace shared by all users on the machine.
There is also a global list of devices, free memory and processes.
A program running in a container feels just like it is running on an isolated Linux box
because it has its own top-level root directory, list of devices, free memory and processes.
The resources a container can use are constrained so that it cannot, say, use up all the physical memory
or network capacity, as that would impact other processes and containers running on the same server.
Mechanisms that support containers in Linux
To support containers an operating system must address global namespaces and provide resource isolation.
The chroot(2) system call can change a process's (apparent) top-level directory
to some other directory, so that the process cannot see the shared namespace;
it is stuck in its own chroot jail.
This is nothing new, it has been around since 1979.
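As a sketch, Python's os module exposes the same system call; this must run as root on Linux, and /srv/jail is a hypothetical directory prepared with its own /bin and libraries.

import os

os.chroot("/srv/jail")   # the process's apparent root becomes /srv/jail
os.chdir("/")            # move inside the new root
# From here on, the process cannot see anything outside /srv/jail.
print(os.listdir("/"))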
cgroups (control groups) are a Linux feature that sets limits on the resource usage of a set of processes
and hence isolates their usage from other processes.
Resource usage includes CPU, memory, disk I/O and network.
The Linux kernel also provides namespaces.
These give process trees in which a parent process can see its child namespaces,
but the children cannot see their parents.
Network namespace provides a way for processes to see distinct and isolated sets of network interfaces.
There are namespaces for mount, user, IPC and UTS.
Each set of processes can see its own set of resources.
Since containers share the same kernel they are less isolated than virtual machines
and so are arguably less secure, should someone find a kernel vulnerability and an exploit.
Docker and overlay file systems
I worked with AMIs that included the operating system, application, 100GB of data and a small start-up script.
One tiny change to the script required recreating and redistributing the entire AMI, which was distributed to some 100 virtual machines.
Testing was very time consuming and expensive.
Docker gets around this problem by using a union capable (or overlay) file system.
Essentially an image is a linked list of layers, each layer linking to its predecessor, so if I only change the top layer containing the script
then nothing else need change or be copied.
A Dockerfile provides instructions to build an image layer by layer.
FROM ubuntu:18.04            # Base image layer
COPY . /app                  # Copies the build context (the local directory) into /app in the image
RUN make /app                # Runs make at build time, creating a new layer
CMD python3 /app/main.py     # Default command, or default arguments if an ENTRYPOINT is set
ENV VERSION 9.6              # Sets an environment variable
ENTRYPOINT ["/app/go.sh"]    # The executable run when the container starts
For the sake of illustration, the following creates an overlay file system in standard Linux without any containers:
$ cd /tmp
$ mkdir aa bb cc work merged
$ echo aaa > aa/a
$ echo bbb > bb/b
$ echo ccc > cc/c
$ sudo mount -t overlay overlay -o lowerdir=./aa:./bb,upperdir=./cc,workdir=/tmp/work ./merged
$ ls merged/
a b c
Volumes
Containers are ephemeral
Network
Orchestrators
Orchestrators manage the life cycle of containers:
provisioning, deployment, redundancy and availability, scaling and so on.
Orchestrators include Docker Swarm, Kubernetes and Amazon Elastic Container Service (ECS).
Docker Swarm
Docker Swarm is perhaps the easiest to understand because the commands largely mirror those for managing individual Docker containers.
A Docker swarm is a cluster of machines (nodes) that have been joined together to run containers.
A Docker service is a single program that might be run as multiple instances (replicas) on a swarm.
A Docker stack is a collection of services that might make up an entire application.
Docker stack is similar to Docker Compose (originally a Python program); both run a set of services (a stack)
described in a docker-compose.yml file.
$ docker stack deploy     # Deploy a new stack or update an existing one
$ docker stack ls         # List stacks
$ docker stack ps         # List the tasks in the stack
$ docker stack rm         # Remove a stack
$ docker stack services   # List the services that make up a stack
AWS ECS and Fargate
...
Kubernetes
...
Prolog and messaging
Prolog
My experience is that the word Prolog is rarely mentioned and draws a negative reaction.
I am perhaps unusual in that I actually got paid money to program it in a (theoretically) commercial environment,
having worked on an expert system for the UK Atomic Energy Authority.
Now I can understand that it is completely non-intuitive compared to most programming languages, having neither
explicit ifs, loops nor variables that bear any resemblance to those in other languages;
and the high-minded jargon (unification, predicate grammars, higher-order logic) doesn't serve to illuminate.
Once one gets past that, I personally found it a thing of beauty.
SQL
It's actually conceptually similar to SQL as commonly used in databases,
which at a basic level is highly intuitive, for example:
$ select name, address, email from Customer;
In Prolog:
?- customer(N, A, E).
One can think of traditional programming languages as scalar, that is working on one element at a time,
whereas SQL and Prolog can select, join and filter tables or sets of data.
Behind the scenes, one might think of them as processing streams of records.
While basic SQL queries can be very simple and intuitive, one can quickly create monsters
that I reckon could be much more elegantly expressed in Prolog.
Having said that, traditional SQL is based around the practicalities of performance,
and once optimized can be anything but elegant.
Then again, databases nowadays mostly fit into memory (rather than running from disk),
and so I notice over recent years that query optimization isn't always such an issue as it once was.
Messaging
In an entirely different context, working with publish-subscribe
systems that send streams of messages between publishers and subscribers,
I noticed that one could see a stream of messages as a table with an endless number of rows.
So one could in principle have a version of SQL that selects from streams rather than tables.
One could not join two streams together
but one could join a stream with a table.
This can be described as a subscribe with a where clause:
select * from stream where ...
I had built a framework for filtering messages based on a UNIX pipeline,
something like:
$ subscribe topic | message.type == "..." | ...
Of course one needs to have conditionals so that different things happen to different message types,
so what I did was allow each filter in a chain to either succeed or fail,
if it failed it would return immediately and the predecessor could try another branch.
So the pipeline became an and-or tree.
The notion of backtracking and the way I used variables were remarkably similar to the way that Prolog works;
the equivalent of a cut operator fitted in nicely.
Not that everyone loves backtracking; it was discarded from Erlang.
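A minimal sketch (not the original framework) of a filter pipeline as an and-or tree in Python: each filter returns a result on success or None on failure, an and-chain fails as soon as one filter fails, and an or-node backtracks to try the next branch. The message fields and handler names are purely illustrative.

def and_chain(message, *filters):
    for f in filters:
        message = f(message)
        if message is None:        # a filter failed: the whole chain fails
            return None
    return message

def or_node(*branches):
    def try_branches(message):
        for branch in branches:    # backtrack: try each branch until one succeeds
            result = branch(message)
            if result is not None:
                return result
        return None
    return try_branches

is_trade = lambda m: m if m.get("type") == "trade" else None
is_quote = lambda m: m if m.get("type") == "quote" else None
tag = lambda label: (lambda m: {**m, "handled_by": label})

pipeline = or_node(
    lambda m: and_chain(m, is_trade, tag("trade-handler")),
    lambda m: and_chain(m, is_quote, tag("quote-handler")),
)

print(pipeline({"type": "quote", "price": 101.5}))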
Race condition
A race condition arises where two or more threads can change shared data;
the threads can be scheduled in any order and so may change the shared data in any order.
Therefore the changes that are made depend on the thread scheduler rather than on what is required.
One can think of threads racing towards the shared data and one never knows in advance which will win.
The problem typically occurs when we have a check-then-act pattern.
For instance only perform a given action when a condition is true.
This is a problem if more than one thread passes the check before the action takes place,
in which case there is no control over which thread acts first.
The solution is to use a CPU-level instruction such as test-and-set
or compare-and-swap.
These combine the check and the action into a single operation that only one thread can execute at a time.
An atomic operation is one that cannot be
interrupted by other threads.
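A minimal sketch in Python of a check-then-act race and its fix: without the lock, two threads can both pass the 'if' check before either acts; with the lock, the check and the act become one atomic critical section. The withdrawal scenario is illustrative.

import threading

balance = 100
lock = threading.Lock()

def withdraw_unsafe(amount):
    global balance
    if balance >= amount:           # check
        balance = balance - amount  # act (another thread may have acted in between)

def withdraw_safe(amount):
    global balance
    with lock:                      # check and act become one atomic critical section
        if balance >= amount:
            balance = balance - amount

threads = [threading.Thread(target=withdraw_safe, args=(60,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(balance)   # 40: only one withdrawal succeeds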
Terms that should be understood in this context include: serializability, linearizability, consistency and coherence.
In the context of databases we can think of tasks or threads executing transactions against shared data,
and say that a transaction schedule is serializable if its outcome is the same as if its transactions had been executed serially, one after another.
A consistency model is a set of rules that, if followed (by a programmer), will result in predictable results from reads and writes,
for instance the Java memory model.
Coherence means that all processors see the same sequence of writes (a global order) to a single piece of data.
Consistency concerns the ordering of reads and writes to multiple pieces of data as seen by all processors.
TODO linearizable
DB serializable
Tenancy
With single-tenancy a single instance of an application is dedicated to a single customer together with the resources required to support it.
With multi-tenancy the application and resources are shared between customers.
So the former costs more than the latter and one would typically like to bill back the costs to the customer.
The distinction is not binary but a spectrum of possibilities that depends on how much a customer is prepared to pay
and how much isolation is possible or practical.
To create a unique environment for a customer on-premises would involve provisioning physical infrastructure in your own data center.
In the Cloud, infrastructure is just a text file and so in principle it can be straightforward to create a unique environment for specific customers,
or for groups of customers.
This is Infrastructure as code (IaC),
examples include Amazon CloudFormation,
Azure Resource Manager
and Terraform.
In AWS such a deployment is known as a stack and one can tag resources to bill a customer's usage.
Legal and regulatory
In Azure and AWS an availability zone (AZ) is one or more highly connected, physically separate data centers within a region.
A region in Azure and AWS is an isolated collection of availability zones.
Regions are meant to be largely independent and autonomous and will be subject to local laws and regulations.
AWS allows infrastructure to be placed in specific regions,
so European customers might use eu-west-1 (Ireland) or eu-central-1 (Frankfurt).
There are regions for the US and Asia.
There is also a special US Government region and regions for China.
There are many options as to how much isolation is required and how much sharing is allowed.
Virtual machines in AWS (EC2 Instances) are by default shared between customers but one can have instances dedicated to one customer
at a time or even pay for dedicated hardware known as dedicated hosts;
the latter is required by certain licenses and regulations.
At an extreme a customer might specify that your application runs on a particular Cloud
or even on the customer's own Cloud, and pay for the privilege.
Mostly, the regions look technically identical,
one uses the same Infrastructure as code approach to create
dedicated stacks within each region.
There may of course be database tables or back end systems that are shared by all applications which will require cross-region communication.
AWS Identity & Access Management (IAM) is a global service
and so accounts work across regions.
One could even create individual accounts for particular customers.
AWS Organizations
can be used to manage multiple accounts.
Database
One might not be able to allocate individual databases to each customer because
some tables are shared.
For example, all customers must access product tables and modify inventory tables.
A list of products is global to all customers while a list of orders is specific to a customer
although each order modifies the global inventory.
One can address sharing in a number of ways for example by creating shared services or table partitions.
One can partition tables horizontally, so there is logically one database but rows belonging to
individual customers are stored on separate physical infrastructure.
Postgres provides support for table partitions.
Creating shared services can help but also create cost and complexity particularly if there are
many dependencies.
A corporate customer might have a number of individual user accounts, but if the customers are individuals their data can be kept on the user's mobile device or laptop.
This approach also avoids issues with GDPR and managing PII.
It is wise to encrypt such data on the device.
Backend and external systems
Applications typically do not live in isolation but depend on or feed to multiple back end and external systems
over which a product owner would have no control.
As such single-tenancy might only be possible for the application under governance.
One has to be careful of complex interactions, if an application is made of many microservices and databases,
dependent on many external systems then one might find introducing isolation and single tenancy technically difficult.
Concurrency and transactions
In some ways single-tenancy makes things much easier in the same way that
having our own bathroom or kitchen avoids all the problems associated with
sharing the same kitchen and bathroom.
One might notice the mess that quickly appears with careless housemates or children.
At the other extreme imagine the strict protocols and discipline required on a submarine or the military in general.
Avoiding concurrency issues in software is more like the latter, requiring discipline and interesting protocols.
Concurrency is when multiple entities share some resource, and concurrency control is a highly technical field of computer science
in which correctness is difficult to achieve even when the practitioners know what they are doing.
Concurrency and parallelism can introduce non-determinism into programs, which can therefore exhibit unpredictable and difficult-to-test behaviour.
Weird things can happen under production load that are never caught by simple unit tests.
Summary
On the Cloud single-tenancy will not necessarily give you any better performance
because the Cloud will mostly allocate sufficient resources to handle increasing load.
It will give you isolation that improves security and might be necessary for regulatory reasons.
It might also suffer less from concurrency and transactional issues since there is less concurrency.
It is easy to release updates to a single tenant stack because no one else is impacted.
It is also easy not to release updates to individual stacks if a customer is wary of changes.
On a multi-tenant stack all customers get updates at the same time and so many customers might be impacted if there are any issues
with a release.
In principle one badly behaved customer or instance of an application can impact all the others.
Virtualization technology can address this by limiting resource usage,
for example Linux cgroups put limits on the resource usage of individual Docker containers.
Creating separate environments for development, test, Q/A, demonstration and production
is straightforward with Infrastructure as Code,
and the same approach can be taken to create extra production environments for particular customers or groups of customers.
A mix of single and multiple tenancy is achievable in the Cloud.
Feeds
A feed is a program that pulls in data from an external system and in some cases pushes data to external systems.
Typically the data is then written to a database and sometimes pushed out directly to consumers.
Development
Writing a feed requires getting to know the external system.
Reading documentation; modifying and extending examples;
sometimes reverse engineering existing code;
experimentation; proofs of concept; contacting their support people.
It means getting hold of credentials and licenses which typically involves management.
Feed APIs are by definition specialist and have their own idiosyncrasies,
so there is often not a lot of information available on search engines, and the documentation
and examples are often lacking.
In an organization there tends to be one or very few developers who spend enough time to really understand a feed,
which is a problem if they leave, even for a few days.
At the other extreme, throwing different developers at a feed each time there is an issue
is likely to cause confusion and not fix anything, because none of them knows the feed well: too many cooks.
So it might be best to have a team or small number of developers in charge of feeds so
there is always some specialist around.
There are a number of patterns or use cases.
Typically one does not want the expense and delay of uploading all data,
so a feed should support a delta, pulling just the changes since a particular timestamp.
You may also want to be able to pull all data to repopulate a database.
You will likely want to be able to pull the latest version of an individual or a list.
Some feeds will push updates asynchronously.
Feeds must be idempotent: you must be able to run and re-run them multiple times without any semantic failure.
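A minimal sketch in Python of an idempotent delta feed: fetch_changes() is a hypothetical stand-in for the external API call, and writing with an upsert keyed on the record id means the feed can be run and re-run for the same period without creating duplicates. SQLite is used here purely for illustration.

import sqlite3

def fetch_changes(since):
    # Placeholder for the external API call; returns records changed since 'since'.
    return [{"id": "ABC", "price": 101.5}, {"id": "XYZ", "price": 7.25}]

def run_feed(conn, since):
    conn.execute("CREATE TABLE IF NOT EXISTS prices (id TEXT PRIMARY KEY, price REAL)")
    for rec in fetch_changes(since):
        conn.execute(
            "INSERT INTO prices (id, price) VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET price = excluded.price",
            (rec["id"], rec["price"]),
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
run_feed(conn, "2024-01-01T00:00:00Z")
run_feed(conn, "2024-01-01T00:00:00Z")   # re-running makes no semantic difference
print(conn.execute("SELECT * FROM prices").fetchall())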
Random reasons to fail
Feeds can stop running without anyone noticing, sometimes for a long time, until the users start to notice that more and more data is stale.
In short feeds can require a lot of care.
Switched off
Feeds can be switched off without anyone noticing.
Our users in Japan said they wanted a different data provider because the current one was always out of date.
It turned out the old feed had been switched off six months before and the new one had not been set up in Japan.
I saw another example that took a year before the users complained loudly enough about missing data.
One can lose a lot of money trading on prices that are out of date because a feed fell over while someone was on holiday.
So it's really important to be able to monitor feeds, perhaps just a notification to operations if the number of updates each day suddenly falls to zero.
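A minimal sketch of such a check in Python: count_updates_today() is a hypothetical query against the feed's target table, and notify_operations() stands in for whatever alerting channel operations actually use.

def check_feed_health(count_updates_today, notify_operations, expected_minimum=1):
    updates = count_updates_today()
    if updates < expected_minimum:
        notify_operations(f"Feed alert: only {updates} updates today (expected >= {expected_minimum})")

# For illustration: a feed that produced nothing today triggers a printed alert.
check_feed_health(count_updates_today=lambda: 0, notify_operations=print)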
Semantic changes
Nothing technical changes but the semantics of the data does.
For instance, the day TomTom listed on the stock market, a data feed did not mention any dividends, which is utterly normal, but it broke
one system that expected dividends even when there were none.
One feed was written to handle firms that were a going concern.
On bankruptcy, one firm was split into two entities, a bankrupt entity and a going concern; unfortunately the identifier we used was for the bankrupt one.
Fixed protocol implementation
There is a maxim that servers should be very liberal in their implementation of a protocol while clients should be very conservative.
This applies to humans too, it saves a lot of grief to be accepting of others and be very strict with one's own behaviour.
A feed might publish a specification but implement it liberally, a subsequent release can implement it strictly.
It still meets it's specification but doesn't do what it used to and so a feed can suddenly just stop working.
Authentication, authorization and expiry
A feed can fail because authentication and authorization become invalid.
The supplier might roll over the authentication credentials,
or might switch off access briefly if they miss a payment.
Authorization for certain parts of the data set might change,
so one still gets updates but not for all the data you expect.
Many credentials have a built in expiry.
One production system fell over because the Azure connection string had a one-year expiry;
the same thing happened the following year because everyone had forgotten by then.
So I would avoid long expiry times and program to frequently request new credentials with relatively short expiry times.
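A minimal sketch of that idea in Python: request_new_credentials() is a hypothetical call to the provider's token endpoint, and the cache refreshes the token shortly before expiry so that expiry is handled routinely in code rather than remembered once a year.

import time

class CredentialCache:
    def __init__(self, request_new_credentials, refresh_margin=60):
        self._request = request_new_credentials
        self._margin = refresh_margin     # refresh this many seconds before expiry
        self._token = None
        self._expires_at = 0.0

    def token(self):
        if time.time() >= self._expires_at - self._margin:
            self._token, lifetime = self._request()
            self._expires_at = time.time() + lifetime
        return self._token

# Illustrative use: the provider hands back a token valid for 15 minutes.
creds = CredentialCache(lambda: ("example-token", 900))
print(creds.token())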
Server failure
Servers and applications will eventually fail due to hardware faults or bugs.
Even if a server comes back on line quickly it might have missed updates.
Feeds are typically just stateless functions, they take an input and write the output to a database,
and perhaps send out notifications.
So feeds present an excellent use case for serverless functions (AWS Lambda or Azure Functions)
because they can be invoked by a clock, or at any time by a service or an operator, without having to worry about infrastructure.
AWS Lambdas can only run for a limited time so one might need to set up AWS Step Functions.
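A minimal sketch of a feed as an AWS Lambda handler in Python: fetch_changes() and write_to_database() are hypothetical helpers standing in for the real API client and persistence code; the handler itself just glues them together and could be triggered by a schedule or invoked manually.

def fetch_changes(since):
    return []          # placeholder for the external API call

def write_to_database(records):
    pass               # placeholder for the upsert into the database

def handler(event, context):
    # 'event' might carry a timestamp from a scheduled trigger or a manual invocation.
    since = event.get("since", "1970-01-01T00:00:00Z")
    records = fetch_changes(since)
    write_to_database(records)
    return {"records_processed": len(records)}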
Feeds, Reports and Functions
Feeds are rather like the reverse of Reports.
A feed calls an external API, de-serializes the response and persists it to a database.
A report reads from a database, serializes the data and sends it to an external API.
A feed might send out notifications, a report might execute on such a notification.
As mentioned above this makes an excellent use case for serverless functions like
AWS Lambda or Azure Functions.