ARM assembler code

This is the code of a simple, randomly chosen function as produced by a disassembler. Apple was kind enough to leave most of the symbolic function names in the debugger ROM, which makes life comparatively easy.

    SetResult__8TMonitorFP5TTaskl:
        @ 0x00120010: void TMonitor::SetResult(...)
        @ ARM R0 = type: 'TMonitor'*
        @ ARM R1 = type: 'TTask'*
        @ ARM R2 = type: 'long'
        ldr     r0, [r1, #108]
        tst     r0, #8388608
        beq     L00120024
        teq     r2, #0
        moveq   pc, lr
    L00120024:
        str     r2, [r1, #16]!
        mov     pc, lr

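The constant in the tst instruction is easier to read in hex: 8388608 is 0x00800000, a single bit (bit 23). ARM data-processing instructions store such a constant as an 8-bit value rotated right by twice a 4-bit rotate field. The small sketch below decodes that field; the function names are made up for this illustration and are not part of Einstein or NewtonOS. Fed with 0xE3100502, the encoding quoted for this tst in the generated C code further down, it gives back the same constant.

```cpp
#include <cstdint>

// Rotate a 32-bit value right by `amount` bits.
// A rotation by 0 is handled separately to avoid an undefined 32-bit shift.
static uint32_t RotateRight(uint32_t value, unsigned amount) {
    amount &= 31u;
    return amount == 0 ? value : (value >> amount) | (value << (32u - amount));
}

// Decode the immediate operand of an ARM data-processing instruction:
// bits 7..0 hold an 8-bit value, bits 11..8 hold half the rotation amount.
// (Hypothetical helper name, for illustration only.)
uint32_t DecodeArmImmediate(uint32_t instruction) {
    uint32_t imm8   = instruction & 0xFFu;
    uint32_t rotate = (instruction >> 8) & 0xFu;
    return RotateRight(imm8, 2u * rotate);
}
```

For 0xE3100502 the 8-bit value is 0x02 and the rotate field is 5, so the value is rotated right by 10 bits, which moves bit 1 to bit 23.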
Literal translation into C

The following code was generated with the Einstein JIT compiler. Instead of creating byte code, TJIT was modified to write out the C code that would otherwise have been called via byte code (ioUnits). Even though it looks unusual, the code below is legal C++ code and compiles just fine.

It also shows the main issue with transcoding an operating system (as opposed to a user-mode application): NewtonOS relies heavily on the Memory Management Unit (MMU) for memory access. The LDR and STR instructions must therefore be replaced by function calls (ManagedMemoryRead, etc.), which hurts performance. On the bright side, once everything is transcoded, the MMU dependency can be removed altogether. But that is another story.

    void Func_0x00120010(TARMProcessor* ioCPU)
    {
    L00120010: // 0xE591006C  ldr r0, [r1, #0x06c]
        {
            KUInt32 offset = 0x0000006C;
            KUInt32 theAddress = ioCPU->mCurrentRegisters[1] + offset;
            KUInt32 theData = ioCPU->ManagedMemoryRead(theAddress);
            ioCPU->mCurrentRegisters[0] = theData;
        }
    L00120014: // 0xE3100502  tst r0, #0x00800000
        {
            KUInt32 Opnd2 = 0x00800000;
            KUInt32 Opnd1 = ioCPU->mCurrentRegisters[0];
            const KUInt32 theResult = Opnd1 & Opnd2;
            SetCPSRBitsForLogicalOp( ioCPU, theResult, Opnd2 & 0x80000000 );
        }
    L00120018: // 0x0A000001  beq 00120024
        if (ioCPU->TestEQ()) {
            goto L00120024;
        }
    L0012001C: // 0xE3320000  teq r2, #0x00000000
        {
            KUInt32 Opnd2 = 0x00000000;
            KUInt32 Opnd1 = ioCPU->mCurrentRegisters[2];
            const KUInt32 theResult = Opnd1 ^ Opnd2;
            SetCPSRBitsForLogicalOpLeaveCarry( ioCPU, theResult );
        }
    L00120020: // 0x01A0F00E  moveq pc, lr
        if (ioCPU->TestEQ()) {
            KUInt32 Opnd2 = ioCPU->mCurrentRegisters[14];
            const KUInt32 theResult = Opnd2;
            SETPC(theResult + 4);
            return;
        }
    L00120024: // 0xE5A12010  str r2, [r1, #0x010]!
        {
            KUInt32 offset = 0x00000010;
            KUInt32 theAddress = ioCPU->mCurrentRegisters[1] + offset;
            KUInt32 theValue = ioCPU->mCurrentRegisters[2];
            ioCPU->ManagedMemoryWrite(theAddress, theValue);
            ioCPU->mCurrentRegisters[1] = theAddress;
        }
    L00120028: // 0xE1A0F00E  mov pc, lr
        {
            KUInt32 Opnd2 = ioCPU->mCurrentRegisters[14];
            const KUInt32 theResult = Opnd2;
            SETPC(theResult + 4);
            return;
        }
    }

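To make the generated code easier to follow, here is a minimal, self-contained sketch of the same control flow running on a toy CPU model. ToyCPU and ToySetResult are hypothetical stand-ins invented for this example: the real TARMProcessor is far larger, tracks all condition flags, and routes memory access through the emulated MMU. Only the names mCurrentRegisters, ManagedMemoryRead, ManagedMemoryWrite and TestEQ are borrowed from the listing above.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for the emulator's CPU object (not Einstein code).
struct ToyCPU {
    uint32_t mCurrentRegisters[16] = {};
    bool mZ = false;                               // only the ARM zero flag is modelled
    std::unordered_map<uint32_t, uint32_t> mMem;   // sparse toy memory, one word per address

    // In the real emulator, address translation by the MMU would happen here.
    uint32_t ManagedMemoryRead(uint32_t addr) const {
        auto it = mMem.find(addr);
        return it == mMem.end() ? 0u : it->second;
    }
    void ManagedMemoryWrite(uint32_t addr, uint32_t value) { mMem[addr] = value; }
    bool TestEQ() const { return mZ; }
};

// Same sequence of steps as the generated function, one ARM instruction per line.
void ToySetResult(ToyCPU* ioCPU) {
    uint32_t* r = ioCPU->mCurrentRegisters;
    r[0] = ioCPU->ManagedMemoryRead(r[1] + 0x6Cu);   // ldr r0, [r1, #108]
    ioCPU->mZ = (r[0] & 0x00800000u) == 0;            // tst r0, #0x00800000
    if (!ioCPU->TestEQ()) {                           // beq L00120024 (skip when Z is set)
        ioCPU->mZ = (r[2] ^ 0u) == 0;                 // teq r2, #0
        if (ioCPU->TestEQ()) {                        // moveq pc, lr
            r[15] = r[14] + 4;
            return;
        }
    }
    r[1] += 0x10u;                                    // str r2, [r1, #16]! updates r1 ...
    ioCPU->ManagedMemoryWrite(r[1], r[2]);            // ... and stores at the new address
    r[15] = r[14] + 4;                                // mov pc, lr
}
```

In this model the function returns without storing only when bit 23 of the word at offset 108 is set and r2 is zero; in every other case r2 is written to offset 16 and r1 is advanced by the pre-indexed store.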
Compiler optimization

After running the C code above through our "C" compiler (Xcode 5.1.1 for Intel), we get this optimized code. The 7 ARM instructions have become 44 Intel instructions. What looks like a bad ratio is not so bad at all, considering that this was a fairly primitive automated translation without human interaction. Considering that half of the 8MB ROM is ARM code, our Intel ROM would grow to 27MB. By today's standards, an entire graphical OS including apps in 27MB would be incredibly small.

    __Z15Func_0x00120010P13TARMProcessor:
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %edi
        pushl   %esi
        subl    $16, %esp
        movl    8(%ebp), %esi
        movl    4(%esi), %eax
        addl    $108, %eax
        movl    %eax, 4(%esp)
        movl    %esi, (%esp)
        calll   __ZN13TARMProcessor17ManagedMemoryReadEm
        movl    %eax, (%esi)
        testl   $8388608, %eax
        je      LBB0_1
        movw    $0, 64(%esi)
        movb    $0, 66(%esi)
        movl    8(%esi), %eax
        testl   %eax, %eax
        movb    $0, 65(%esi)
        js      LBB0_4
        movb    $0, 64(%esi)
        jmp     LBB0_7
    LBB0_1:
        movw    $256, 64(%esi)
        movb    $0, 66(%esi)
        movl    8(%esi), %eax
        jmp     LBB0_7
    LBB0_5:
        movw    $256, 64(%esi)
        jmp     LBB0_8
    LBB0_4:
        movb    $1, 64(%esi)
    LBB0_7:
        movl    4(%esi), %edi
        addl    $16, %edi
        movl    %eax, 8(%esp)
        movl    %edi, 4(%esp)
        movl    %esi, (%esp)
        calll   __ZN13TARMProcessor18ManagedMemoryWriteEmm
        movl    %edi, 4(%esi)
    LBB0_8:
        movl    56(%esi), %eax
        addl    $4, %eax
        movl    %eax, 60(%esi)
        addl    $16, %esp
        popl    %esi
        popl    %edi
        popl    %ebp
        ret
    Ltmp28:
    Lfunc_end0:

Handmade code

Automatic transcoding generates a lot of overhead. Let's hand-optimize the generated "C" code and see where we end up:

    void Func_0x00120010(TARMProcessor* ioCPU)
    {
        KUInt32 inTask   = ioCPU->mCurrentRegisters[1];   // r1: TTask*
        KUInt32 inResult = ioCPU->mCurrentRegisters[2];   // r2: the result value
        KUInt32 state    = ioCPU->ManagedMemoryRead(inTask + 0x0000006C);
        // Note: the ARM original tests a single bit (tst r0, #0x00800000; beq),
        // which corresponds to (state & 0x00800000) == 0. It also writes the
        // store address back into r1, which is dropped here.
        if (state==0x00800000) goto L00120024;
        if (inResult==0x00000000) {
            ioCPU->mCurrentRegisters[15] = ioCPU->mCurrentRegisters[14] + 4;
            return;
        }
    L00120024:
        ioCPU->ManagedMemoryWrite(inTask+0x00000010, inResult);
        ioCPU->mCurrentRegisters[15] = ioCPU->mCurrentRegisters[14] + 4;
        return;
    }

Compiler result

The compiled code is fascinating: after hand-optimizing a few lines of code into something that looks much cleaner to the human eye, the result is still 31 instructions long! This tells me two things: the automated conversion is not bad at all, thanks to highly optimizing modern compilers, and, as a result, hand-optimizing at this level is a complete waste of time.

What we optimized away by hand was the superfluous calculation of various flags in the comparison instructions. I have not looked at the TJIT implementation yet; maybe it can be improved further. The compiler cannot remove these calculations itself, because the flags are stored in a global location and may be used later (we know that they won't be, but the compiler can't know that).

    __Z15Func_0x00120010P13TARMProcessor:
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ebx
        pushl   %edi
        pushl   %esi
        subl    $12, %esp
        movl    8(%ebp), %esi
        movl    4(%esi), %edi
        movl    8(%esi), %ebx
        leal    108(%edi), %eax
        movl    %eax, 4(%esp)
        movl    %esi, (%esp)
        calll   __ZN13TARMProcessor17ManagedMemoryReadEm
        cmpl    $8388608, %eax
        je      LBB0_2
        testl   %ebx, %ebx
        je      LBB0_3
    LBB0_2:
        addl    $16, %edi
        movl    %ebx, 8(%esp)
        movl    %edi, 4(%esp)
        movl    %esi, (%esp)
        calll   __ZN13TARMProcessor18ManagedMemoryWriteEmm
    LBB0_3:
        movl    56(%esi), %eax
        addl    $4, %eax
        movl    %eax, 60(%esi)
        addl    $12, %esp
        popl    %esi
        popl    %edi
        popl    %ebx
        popl    %ebp
        ret

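The point about the flags can be illustrated with a small sketch. FlagState and both functions are hypothetical names made up for this example, not Einstein code. The first function writes its flags through a pointer into an object owned by the caller, so the stores are visible side effects that an optimizer generally has to keep. The second keeps the comparison result in a local variable, which leaves the compiler free to reduce it to a single compare.

```cpp
#include <cstdint>

// Hypothetical flag storage, standing in for the emulated CPU's status bits.
struct FlagState {
    bool z = false;   // zero flag
    bool n = false;   // negative flag
};

// Flags are written to memory the caller can inspect afterwards, so these
// stores cannot simply be dropped when the function is compiled on its own.
bool IsZeroKeepingFlags(FlagState* cpu, uint32_t value) {
    cpu->z = (value == 0);
    cpu->n = (value & 0x80000000u) != 0;
    return cpu->z;
}

// Nothing escapes this function except its return value.
bool IsZeroLocal(uint32_t value) {
    bool z = (value == 0);
    return z;
}
```

Both functions answer the same question; only the first one leaves state behind that later code might read, which is the situation the generated flag updates are in.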
One more thing!

There is a way to make our transcoded app as slick and slim as the original. Here is some C++ code that may resemble the original NewtonOS source code. What will happen if we translate that?

    void TMonitor::SetResult(TTask *task, long aResult)
    {
        long state = task->pState;
        if ( state!=0x00800000 || aResult==0) {
            return;
        } else {
            task->pResult = aResult;
        }
    }

So it does work in the end

Compiling the hand-written code generates only 12 lines of assembler, showing that Intel code is not much longer than ARM code, and that rewriting NewtonOS in C++ will lead to the best code. More importantly, this test shows that automated transcoding can lead to something quite usable. In this particular case, however, the MMU handling must be eliminated by hand-coding all memory allocation and task management functions.

    __ZN8TMonitor9SetResultEP5TTaskl:
        pushl   %ebp
        movl    %esp, %ebp
        movl    16(%ebp), %eax
        movl    108(%eax), %ecx
        cmpl    $8388608, %ecx
        jne     LBB0_3
        movl    16(%ebp), %ecx
        testl   %ecx, %ecx
        je      LBB0_3
        movl    %ecx, 16(%eax)
    LBB0_3:
        popl    %ebp
        ret

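The offsets 108 and 16 that appear in every listing are simply the positions of the two fields inside the task object. A rough sketch of such a layout, with the offsets checked at compile time, could look like this. ToyTask and its padding arrays are assumptions made for illustration; the real TTask layout is not shown in this article, and only the field names pState and pResult are taken from the C++ code above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout: only the two offsets are known from the disassembly.
struct ToyTask {
    uint8_t fUnknown0[16];   // bytes 0..15, contents not covered here
    int32_t pResult;         // offset 16, written by str r2, [r1, #16]!
    uint8_t fUnknown1[88];   // bytes 20..107, contents not covered here
    int32_t pState;          // offset 108, read by ldr r0, [r1, #108]
};

static_assert(offsetof(ToyTask, pResult) == 16,  "pResult must sit at offset 16");
static_assert(offsetof(ToyTask, pState)  == 108, "pState must sit at offset 108");
```

With a layout like this, an ordinary field access such as task->pState compiles to one load at a fixed offset, which is why the natively compiled version needs no ManagedMemoryRead call.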
(c) 2014 elektriktrick@matthiasm.com