ARM assembler code

This is the code of a random simple function as produced by a disassembler. Apple was kind enough to leave most of the symbolic function names in the debugger ROM, which makes life comparatively easy.

SetResult__8TMonitorFP5TTaskl:                  @ 0x00120010: void TMonitor::SetResult(...)
        @ ARM R0 = type: 'TMonitor'*
        @ ARM R1 = type: 'TTask'*
        @ ARM R2 = type: 'long'
        ldr     r0, [r1, #108]
        tst     r0, #8388608
        beq     L00120024
        teq     r2, #0
        moveq   pc, lr
L00120024:
        str     r2, [r1, #16]!
        mov     pc, lr


Literal translation into C

The following code was generated with the Einstein JIT compiler. Instead of creating byte code, TJIT was modified to write out the C code that would otherwise have been invoked through its byte code (the ioUnits).

Even though it looks rather unusual, the code below is legal C and compiles just fine.

It also shows the main issue with transcoding an operating system (as opposed to a user-mode application): NewtonOS relies heavily on the Memory Management Unit (MMU) for memory access. The LDR and STR instructions must therefore be replaced by function calls (ManagedMemoryRead, etc.), which hurts performance. On the plus side, once everything is transcoded, we can remove the MMU dependency altogether. But that is another story.
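
To make that concrete, here is a minimal sketch of what such a memory wrapper conceptually has to do. The names and the trivial MMU below are purely illustrative and are not Einstein's actual ManagedMemoryRead implementation:

#include <cstdint>
#include <cstring>

// Illustration only -- not Einstein's real code. The point is that every
// emulated load has to go through the emulated MMU before the host can
// touch memory, which is why a single LDR becomes a function call.
struct MMUSketch {
	// Walk the emulated page tables; a trivial identity mapping stands in here.
	uint32_t Translate(uint32_t vaddr) const { return vaddr; }
};

static uint32_t ManagedMemoryReadSketch(const MMUSketch& mmu,
                                        const uint8_t* emulatedRAM,
                                        uint32_t vaddr)
{
	uint32_t paddr = mmu.Translate(vaddr);                  // virtual -> physical
	uint32_t word;
	std::memcpy(&word, emulatedRAM + paddr, sizeof(word));  // read from emulated RAM
	return word;                                            // fault handling omitted
}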

void Func_0x00120010(TARMProcessor* ioCPU)
{
L00120010: // 0xE591006C  ldr	r0, [r1, #0x06c]
	{
		KUInt32 offset = 0x0000006C;
		KUInt32 theAddress = ioCPU->mCurrentRegisters[1] + offset;
		KUInt32 theData = ioCPU->ManagedMemoryRead(theAddress);
		ioCPU->mCurrentRegisters[0] = theData;
	}
L00120014: // 0xE3100502  tst	r0, #0x00800000
	{
		KUInt32 Opnd2 = 0x00800000;
		KUInt32 Opnd1 = ioCPU->mCurrentRegisters[0];
		const KUInt32 theResult = Opnd1 & Opnd2;
		SetCPSRBitsForLogicalOp( ioCPU, theResult, Opnd2 & 0x80000000 );
	}
L00120018: // 0x0A000001  beq	00120024
	if (ioCPU->TestEQ()) {
		goto L00120024;
	}
L0012001C: // 0xE3320000  teq	r2, #0x00000000
	{
		KUInt32 Opnd2 = 0x00000000;
		KUInt32 Opnd1 = ioCPU->mCurrentRegisters[2];
		const KUInt32 theResult = Opnd1 ^ Opnd2;
		SetCPSRBitsForLogicalOpLeaveCarry( ioCPU, theResult );
	}
L00120020: // 0x01A0F00E  moveq	pc, lr
	if (ioCPU->TestEQ()) {
		KUInt32 Opnd2 = ioCPU->mCurrentRegisters[14];
		const KUInt32 theResult = Opnd2;
		SETPC(theResult + 4);
		return;
	}
L00120024: // 0xE5A12010  str	r2, [r1, #0x010]!
	{
		KUInt32 offset = 0x00000010;
		KUInt32 theAddress = ioCPU->mCurrentRegisters[1] + offset;
		KUInt32 theValue = ioCPU->mCurrentRegisters[2];
		ioCPU->ManagedMemoryWrite(theAddress, theValue);
		ioCPU->mCurrentRegisters[1] = theAddress;
	}
L00120028: // 0xE1A0F00E  mov	pc, lr
	{
		KUInt32 Opnd2 = ioCPU->mCurrentRegisters[14];
		const KUInt32 theResult = Opnd2;
		SETPC(theResult + 4);
		return;
	}
}
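
Note that each generated block ends with SETPC(...) followed by return, so something outside the function has to pick up the new program counter and call the next transcoded block. The Einstein/Albert driver is not shown in this article, so the loop below is only an illustrative guess, using a stand-in CPU struct instead of the real TARMProcessor and ignoring the PC offset conventions:

#include <cstdint>
#include <map>

// Stand-ins for illustration only; the real types live in Einstein.
struct CPUSketch { uint32_t mCurrentRegisters[16]; };   // [15] = program counter
typedef void (*TranscodedFunc)(CPUSketch*);

// Hypothetical driver: run transcoded blocks until we reach an address
// we have no function for, then hand control back to the emulator.
void RunTranscodedSketch(CPUSketch* cpu,
                         const std::map<uint32_t, TranscodedFunc>& funcTable)
{
	for (;;) {
		std::map<uint32_t, TranscodedFunc>::const_iterator it =
			funcTable.find(cpu->mCurrentRegisters[15]);
		if (it == funcTable.end())
			break;               // unknown address: fall back to the JIT/interpreter
		it->second(cpu);         // the block ends by setting R15 to the next address
	}
}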


Compiler optimization

After running the C code above through our "C" compiler (Xcode 5.1.1 for Intel), we get the following optimized code. Seven ARM instructions have become 44 Intel instructions. What looks like a bad ratio is actually not bad at all, considering that this was a fairly primitive automated translation without any human interaction.

Considering that half of the 8 MB ROM is ARM code, our Intel ROM would grow to roughly 27 MB. By today's standards, an entire graphical OS including applications in 27 MB would be incredibly small.
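
As a rough plausibility check (the average Intel instruction length of about 3.5 bytes is my assumption, not a measured figure):

4 MB of ARM code / 4 bytes per instruction  ≈ 1.0 million instructions
1.0 million instructions × 44/7             ≈ 6.6 million Intel instructions
6.6 million instructions × ~3.5 bytes       ≈ 23 MB of Intel code
23 MB of code + 4 MB of ROM data            ≈ 27 MB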

__Z15Func_0x00120010P13TARMProcessor:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%edi
	pushl	%esi
	subl	$16, %esp
	movl	8(%ebp), %esi
	movl	4(%esi), %eax
	addl	$108, %eax
	movl	%eax, 4(%esp)
	movl	%esi, (%esp)
	calll	__ZN13TARMProcessor17ManagedMemoryReadEm
	movl	%eax, (%esi)
	testl	$8388608, %eax
	je	LBB0_1
	movw	$0, 64(%esi)
	movb	$0, 66(%esi)
	movl	8(%esi), %eax
	testl	%eax, %eax
	movb	$0, 65(%esi)
	js	LBB0_4
	movb	$0, 64(%esi)
	jmp	LBB0_7
LBB0_1: 
	movw	$256, 64(%esi)
	movb	$0, 66(%esi)
	movl	8(%esi), %eax
	jmp	LBB0_7
LBB0_5:
	movw	$256, 64(%esi) 
	jmp	LBB0_8
LBB0_4:
	movb	$1, 64(%esi)
LBB0_7: 
	movl	4(%esi), %edi
	addl	$16, %edi
	movl	%eax, 8(%esp)
	movl	%edi, 4(%esp)
	movl	%esi, (%esp)
	calll	__ZN13TARMProcessor18ManagedMemoryWriteEmm
	movl	%edi, 4(%esi)
LBB0_8:
	movl	56(%esi), %eax
	addl	$4, %eax
	movl	%eax, 60(%esi)
	addl	$16, %esp
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
Ltmp28:
Lfunc_end0:


Handmade code

Automatic transcoding generates a lot of overhead. Let's hand-optimize the generated "C" code and see where we end up:

void Func_0x00120010(TARMProcessor* ioCPU)   // hand-optimized TMonitor::SetResult
{
	KUInt32 inTask   = ioCPU->mCurrentRegisters[1];   // R1: the TTask*
	KUInt32 inResult = ioCPU->mCurrentRegisters[2];   // R2: the long result
	KUInt32 state = ioCPU->ManagedMemoryRead(inTask + 0x0000006C);
	if (state==0x00800000)
		goto L00120024;
	if (inResult==0x00000000) {
		ioCPU->mCurrentRegisters[15] = ioCPU->mCurrentRegisters[14] + 4;
		return;
	}
L00120024:
	ioCPU->ManagedMemoryWrite(inTask+0x00000010, inResult);
	ioCPU->mCurrentRegisters[15] = ioCPU->mCurrentRegisters[14] + 4;
	return;
}


Compiler result

The compiled code is fascinating: after hand-optimizing a few lines of code into something that looks much cleaner to the human eye, the result is still 31 instructions long! This tells me two things: the automated conversion is not bad at all, thanks to highly optimizing modern compilers, and, as a consequence, hand-optimizing the transcoded output is largely a waste of time.

What we optimized away by hand was the superfluous calculation of the status flags in the comparison instructions. I have not looked at the TJIT implementation in that respect yet; maybe it can be improved further. The compiler cannot remove that code on its own, because the flags are stored in the emulated CPU state and may be read later (we know that they won't be, but the compiler can't know that).
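
To see why, consider what such a flag helper has to do. The struct and function below are a simplified stand-in, not Einstein's actual SetCPSRBitsForLogicalOp:

#include <cstdint>

// Simplified stand-in, not Einstein's real helper. The N/Z/C flags are
// written into fields of the CPU object. Since that object is reachable
// from outside the function, the compiler must assume a later reader and
// cannot delete these stores, even though this caller never tests them.
struct CPUFlagsSketch { bool mCPSR_N, mCPSR_Z, mCPSR_C; };

static void SetCPSRBitsForLogicalOpSketch(CPUFlagsSketch* ioCPU,
                                          uint32_t theResult,
                                          uint32_t carryOut)
{
	ioCPU->mCPSR_N = (theResult & 0x80000000) != 0;   // negative
	ioCPU->mCPSR_Z = (theResult == 0);                // zero
	ioCPU->mCPSR_C = (carryOut != 0);                 // carry from the shifter operand
}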

__Z15Func_0x00120010P13TARMProcessor:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	subl	$12, %esp
	movl	8(%ebp), %esi
	movl	4(%esi), %edi
	movl	8(%esi), %ebx
	leal	108(%edi), %eax
	movl	%eax, 4(%esp)
	movl	%esi, (%esp)
	calll	__ZN13TARMProcessor17ManagedMemoryReadEm
	cmpl	$8388608, %eax 
	je	LBB0_2
	testl	%ebx, %ebx
	je	LBB0_3
LBB0_2:
	addl	$16, %edi
	movl	%ebx, 8(%esp)
	movl	%edi, 4(%esp)
	movl	%esi, (%esp)
	calll	__ZN13TARMProcessor18ManagedMemoryWriteEmm
LBB0_3:
	movl	56(%esi), %eax
	addl	$4, %eax
	movl	%eax, 60(%esi)
	addl	$12, %esp
	popl	%esi
	popl	%edi
	popl	%ebx
	popl	%ebp
	ret


One more thing!

There is a way to make our transcoded app as slick and slim as the original. Here is some C++ code that may be close to what the original NewtonOS source could have looked like. What happens if we compile that?

void TMonitor::SetResult(TTask *task, long aResult)
{
    long state = task->pState;
    if ( state!=0x00800000 || aResult==0) {
        return;
    } else {
        task->pResult = aResult;
    }
}


So it does work in the end

Compiling the hand-written code generates only 12 assembler instructions, showing that Intel code need not be much longer than ARM code, and that rewriting NewtonOS in C++ would lead to the best code.

More importantly, this test shows that automated transcoding can lead to something quite usable. In this particular case, however, the MMU handling must be eliminated by hand-coding all memory allocation and task management functions.

__ZN8TMonitor9SetResultEP5TTaskl:
	pushl	%ebp
	movl	%esp, %ebp
	movl	16(%ebp), %eax
	movl	108(%eax), %ecx
	cmpl	$8388608, %ecx 
	jne	LBB0_3
	movl	16(%ebp), %ecx
	testl	%ecx, %ecx
	je	LBB0_3
	movl	%ecx, 16(%eax)
LBB0_3:
	popl	%ebp
	ret