Improving Android App Development's Efficiency and Quality through Machine Learning Techniques
刘世泽 Lau Shyh Tzer, David
Visiting Student, IIIS, Tsinghua University, Summer 2013
BSc. Computer Science, The Chinese University of Hong Kong
Background
• Problem: The growth of the Android API
• It is difficult for developers, especially inexperienced ones, to master the usage of the Android API
• API levels 1-18, over 20k API methods
• Aim: Adapt machine learning and reverse engineering techniques to analyze usage patterns
• Possibly develop a helper tool that suggests or fixes Android API usage during the development stage
Workflow
1. Adapt reverse engineering techniques to retrieve the needed data from packaged Android apps (.apk)
2. Perform data mining on the resulting raw data to dig out interesting API usage patterns and relationships
1 Reverse Engineering on Android App
• Need to retrieve information from the package (.apk) file, not from source code
• Perform static analysis, not dynamic analysis
• There are basically three options:
1. .apk -> Dalvik bytecode (retrieve the .dex file): low level, few analysis tools
2. .apk -> smali code (disassembly): good for hacking, but it takes time to become familiar with smali code
3. .apk -> .jar (.class) bytecode (decompile): becomes a Java problem, with lots of analysis tools
1 Reverse Engineering on Android App
.apk -> .jar (.class) bytecode (decompile)
• Use the dex2jar open source tool: https://code.google.com/p/dex2jar/
• It supports converting an .apk directly into a .jar file. On Linux: sh dex2jar/d2j-dex2jar.sh someApk.apk
• We can then treat it as a Java problem and focus on static analysis of Java bytecode
1 Reverse Engineering on Android App
• To understand the usage patterns of the Android API, we have to know the structure of the code
• An easy, abstract way to understand the structure of code is to look at its Abstract Syntax Tree (AST)
Generate AST from Java bytecode
• Parsing Java source code into an Abstract Syntax Tree is straightforward, but parsing bytecode is not
• Bytecode is a set of instructions that the JVM interprets, executing them on an operand stack to run the program
• Stack execution of bytecode differs from the program flow that we observe in source code
• Bytecode Outline Plugin for Eclipse: http://asm.ow2.org/eclipse/index.html
Generate AST from Java bytecode
• Intuitively, since the JVM interprets bytecode as stack execution, we can ‘recover’ the code structure and construct the AST by simulating the JVM stack operations
Example: Variable Assignment
Source code: int i = 1;
Bytecode: ICONST_1, ISTORE 0
Thread stack: 1
Abstract Syntax Tree: = (i, 1)
Example 2: From Previous Bytecode Example
Source code: public int method(int i1, int i2) { int i3 = i1 * i2; return i3 * 2; }
Bytecode: ILOAD 1, ILOAD 2, IMUL, ISTORE 3, ILOAD 3, ICONST_2, IMUL, IRETURN
Thread stack: i1, i2, *, i3, 2, *
Abstract Syntax Tree: method -> = (i3, * (i1, i2)), return (* (i3, 2))
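The stack simulation in the two examples can be sketched in a few lines of plain Java. This is only an illustrative sketch, not the ASTGenerator from this project: the class name StackSimSketch, the string-based instruction format, and the convention of naming local variable slot n as "in" are all assumptions made for the example, and only the handful of opcodes used above are supported.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: rebuilds expression trees by simulating the JVM operand stack.
class StackSimSketch {
    // Minimal tree node: a label plus ordered children.
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label, Node... kids) {
            this.label = label;
            children.addAll(Arrays.asList(kids));
        }
        @Override public String toString() {
            if (children.isEmpty()) return label;
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node c : children) sb.append(' ').append(c);
            return sb.append(')').toString();
        }
    }

    // Instructions are "OPCODE [operand]" strings; returns one tree per statement.
    static List<Node> simulate(List<String> code) {
        Deque<Node> stack = new ArrayDeque<>();
        List<Node> statements = new ArrayList<>();
        for (String insn : code) {
            String[] p = insn.split(" ");
            switch (p[0]) {
                case "ILOAD":    stack.push(new Node("i" + p[1])); break; // slot n is named "in" (assumption)
                case "ICONST_1": stack.push(new Node("1")); break;
                case "ICONST_2": stack.push(new Node("2")); break;
                case "IMUL": {
                    Node right = stack.pop(); // top of stack is the right operand
                    Node left = stack.pop();
                    stack.push(new Node("*", left, right));
                    break;
                }
                case "ISTORE":   statements.add(new Node("=", new Node("i" + p[1]), stack.pop())); break;
                case "IRETURN":  statements.add(new Node("return", stack.pop())); break;
                default: throw new IllegalArgumentException("unsupported opcode: " + p[0]);
            }
        }
        return statements;
    }

    public static void main(String[] args) {
        List<String> code = Arrays.asList("ILOAD 1", "ILOAD 2", "IMUL", "ISTORE 3",
                                          "ILOAD 3", "ICONST_2", "IMUL", "IRETURN");
        System.out.println(simulate(code)); // [(= i3 (* i1 i2)), (return (* i3 2))]
    }
}
```

Loads and constants push leaf nodes, an arithmetic opcode pops its operands and pushes a subtree, and a store or return pops the finished subtree into a statement.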
Generate AST from Java bytecode
• There are various kinds of AST structures, such as conditional statements, goto statements and compound statements, but they can all be ‘recovered’ from bytecode by using the previous technique of simulating the stack execution
• However, reading the .class file directly yields a binary format that is useless for our parsing
• So we need a systematic way to parse the bytecode
ASM - Bytecode Engineering Library
• ASM is an all-purpose Java bytecode manipulation and analysis framework: http://asm.ow2.org/
• It provides two powerful APIs: the Core API and the Tree API
• The Core API provides an interface for visiting bytecode
• The Tree API parses bytecode into objects
• Refer to http://download.forge.objectweb.org/asm/asm4-guide.pdf for complete usage
ASM - Bytecode Engineering Library
• The Tree API is particularly useful for generating the AST from bytecode
• It provides two important classes, ClassNode and MethodNode, which let the developer access the bytecode information directly
• The bytecode (opcodes) of each method is stored in the InsnList instructions field of its MethodNode
ASM - Bytecode Engineering Library
Usage:
• Read the class with ASM and let it fill a ClassNode: ASM parses the whole class and stores the respective objects in the ClassNode
• Access class information: all the information is stored in the properties of the ClassNode object, so simply retrieve it from them
• Access method information: each class contains several methods, so they are held as a List<MethodNode>; loop over the list to visit each method
• Access bytecode instructions: the instructions of each method are held in its InsnList; iterate over it to visit each instruction
ASM - Bytecode Engineering Library
• AbstractInsnNode: an abstract class that wraps the instructions. ASM separates instructions into 16 different kinds
• Each bytecode instruction (opcode) is defined as an integer constant in ASM. For example: 3 = ICONST_0, 21 = ILOAD, 54 = ISTORE
• The type of an instruction is also defined as an integer constant. For example: 4 = FIELD_INSN, 5 = METHOD_INSN
• Details of the instruction, such as which local variable it stores to, the field variable ID, or the invoked method’s signature, are stored as properties of this object
My Design of AST Generator
• With the support of the ASM Tree API, we can parse the bytecode and simulate each stack execution to construct the corresponding Abstract Syntax Tree systematically
• To produce the data needed for data mining, I designed a customized Abstract Syntax Tree structure
Abstract parent class: ASTNode
Inherited classes, each specifically designed to suit one code structure:
ASTArithmeticNode
ASTArrayNode
ASTArrayValueNode
ASTCastNode
ASTClassNode
ASTConstantNode
ASTFieldNode
ASTFunctionNode
ASTJumpNode
ASTLabelNode
ASTLocalVariableNode
ASTMethodNode
ASTObjectNode
ASTReturnNode
ASTSwitchNode
My Design of AST Generator
ASTNode
• getASTKind
• setName getName
• setSignature getSignature
• setCallBy getCallBy
• setUsedBy getUsedBy
• setUsedAsObject getUsedAsObject
Example:
StringBuilder sb = new StringBuilder();
String text = "Sample";
sb.append(text);
String result = sb.toString();
Nodes: sb, append, result, toString, text, "Sample", linked through setUsedBy, setUsedAsObject and setCallBy
• CallBy/UsedBy/UsedAsObject are stored as ArrayList<ASTNode> to handle multiple connections
• An object and its methods are doubly linked to allow bidirectional traversal
My Design of AST Generator
ASTMethodNode
• addParameter getPara
• Parameters added through addParameter are stored as ArrayList<ASTNode> to handle multiple connections
ASTLocalVariableNode
• setIndex getIndex
• setVariableType getVariableType
• setVariableValue getVariableValue
• Trick: Local variables are stored by index in the local variable array of each JVM stack frame. To track changes in local variable assignment (for example, one variable slot can be assigned several times), create a hash table that maps each index to the node currently held by that local variable. The assignment can then be updated easily while parsing the bytecode
ASTFieldNode
• setFieldValue getFieldValue
• Trick: As with local variables, create a hash table that maps each field variable to its node. Note that the local variable hash table is cleared on entering each method, whereas field values are tracked across the whole class
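The node links and the local variable table described above could look roughly like the following. This is a minimal sketch under assumptions, not the original classes: the names AstLinkSketch, Node, linkCall and store are hypothetical, and the reading of CallBy (the objects a method is invoked on) and UsedAsObject (the methods invoked on an object) is one plausible interpretation of the slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the node links and the local variable table.
class AstLinkSketch {
    static class Node {
        final String name;
        final List<Node> usedBy = new ArrayList<>();       // nodes that consume this node's value
        final List<Node> callBy = new ArrayList<>();       // for a method: objects it is invoked on (assumed meaning)
        final List<Node> usedAsObject = new ArrayList<>(); // for an object: methods invoked on it (assumed meaning)
        Node(String name) { this.name = name; }
    }

    // Links an object and a method in both directions so the tree can be walked either way.
    static void linkCall(Node object, Node method) {
        object.usedAsObject.add(method);
        method.callBy.add(object);
    }

    // Local variable slots keyed by index; meant to be cleared on entering each method.
    static final Map<Integer, Node> locals = new HashMap<>();

    // Records a store into a slot and returns the node previously held there (or null).
    static Node store(int index, Node value) {
        return locals.put(index, value);
    }

    public static void main(String[] args) {
        Node sb = new Node("sb"), append = new Node("append"), text = new Node("text");
        store(1, sb);
        store(2, text);
        linkCall(locals.get(1), append);   // sb.append(...)
        locals.get(2).usedBy.add(append);  // text is an argument of append
        System.out.println(append.callBy.get(0).name + " " + text.usedBy.get(0).name); // sb append
    }
}
```

Keeping each relation as a list on both ends is what allows a later traversal to start from either the object or the method.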
Data Flow Analysis on the AST
• A complete Abstract Syntax Tree of an Android app provides the detail needed to perform various kinds of static analysis
• My research mainly focuses on its data flow analysis. For an API invocation of the form returnValue = object.method(arg1, arg2, arg3);
1. Where does the return value go?
2. Where do the arguments come from?
3. What kind of object is this method called on?
Data Flow Analysis on the AST
• Collect data from Android API invocations
• Performed a depth-first search on the AST to trace the return value path, the argument path and the call-by path
• Trick: use a hash table or list to record the visited path and avoid infinite loops within the AST
Figure: AST generated for the QQ Android app (ASTClassNode nodes with their ASTMethodNode children)
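A depth-first trace with a visited set, as in the trick above, might be sketched like this. The class DfsTraceSketch and its Node type are hypothetical stand-ins for the real AST classes, and only the usedBy relation is followed here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: depth-first trace over usedBy links with cycle protection.
class DfsTraceSketch {
    static class Node {
        final String name;
        final List<Node> usedBy = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    // Collects every node reachable through usedBy links, visiting each node once.
    static List<String> trace(Node start) {
        List<String> path = new ArrayList<>();
        dfs(start, new HashSet<>(), path);
        return path;
    }

    private static void dfs(Node n, Set<Node> visited, List<String> path) {
        if (!visited.add(n)) return; // already seen: stop here, which breaks cycles
        path.add(n.name);
        for (Node next : n.usedBy) dfs(next, visited, path);
    }

    public static void main(String[] args) {
        Node ret = new Node("returnValue"), a = new Node("a"), b = new Node("b");
        ret.usedBy.add(a);
        a.usedBy.add(b);
        b.usedBy.add(a); // cycle between a and b
        System.out.println(trace(ret)); // [returnValue, a, b]
    }
}
```

The same routine could be run over the argument and call-by relations by swapping the list that is followed.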
2 Mining on the Analysis Data
Convert App into Jar format
Generate AST from Jar file
Define To-Do Analysis Format
Preparing Mining Raw Data
• Coded a web crawler to download free Android apps from the open Android market at http://apk.gfan.com/
• Successfully grabbed 10,266 valid Android apps and generated their ASTs and analysis data with the self-developed ASTGenerator, running on an Amazon Web Services High-Memory cluster
• Result data structure: one row per API invocation, one 0/1 column per related API method
Columns:
c1 = android/net/Uri buildUpon
c2 = android/net/Uri toString
c3 = android/net/Uri$Builder appendQueryParameter
c4 = android/net/Uri$Builder build
c5 = java/lang/String indexOf
c6 = java/lang/String substring
c7 = java/lang/StringBuilder <init>
c8 = java/lang/StringBuilder append
Row                                    c1 c2 c3 c4 c5 c6 c7 c8
appname-android/net/Uri buildUpon-0     1  1  1  0  0  1  1  0
appname-android/net/Uri parse-0         1  1  1  0  0  0  0  1
Mining on the Analysis Data
• Used Weka 3 to perform the data mining task: http://www.cs.waikato.ac.nz/ml/weka/
• It is convenient to convert the raw analysis data into Weka's ARFF input format, especially because ARFF supports a sparse matrix format
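As an illustration of the sparse ARFF format, the 0/1 matrix can be written so that each data row lists only its non-zero entries as "index value" pairs with 0-based attribute indices. The sketch below is an assumption about how such a conversion might look, not the converter used in this project; the class SparseArffSketch, the relation name and the choice of numeric attributes are all hypothetical, and row identifiers are left out.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: writes a 0/1 matrix as a sparse ARFF document.
class SparseArffSketch {
    static String toArff(String relation, List<String> columns, int[][] rows) {
        StringBuilder sb = new StringBuilder("@relation ").append(relation).append("\n\n");
        // Attribute names contain spaces, so they are quoted.
        for (String c : columns) sb.append("@attribute '").append(c).append("' numeric\n");
        sb.append("\n@data\n");
        for (int[] row : rows) {
            StringBuilder line = new StringBuilder("{");
            for (int i = 0; i < row.length; i++) {
                if (row[i] == 0) continue;              // sparse format: zeros are omitted
                if (line.length() > 1) line.append(", ");
                line.append(i).append(' ').append(row[i]);
            }
            sb.append(line).append("}\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("android/net/Uri buildUpon", "android/net/Uri toString");
        int[][] rows = { {1, 0}, {0, 1} };
        System.out.print(toArff("usage", columns, rows));
    }
}
```

With thousands of API columns and only a few related APIs per invocation, most entries are zero, which is why the sparse row syntax keeps the files manageable.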
Mining on the Analysis Data
• Applied Hierarchical Agglomerative Clustering to the result matrix to discover the relationships between the APIs and their usage patterns
Mining on the Analysis Data
• The analysis result data from the 10,266 apps is huge (~50GB of text), so mining it directly is time-consuming and unnecessary
• Designed a MapReduce task and ran it on Hadoop to group the result data method by method and compute statistics on their invocation counts
• Then performed hierarchical clustering only on methods with enough data to reveal a meaningful pattern (for example, methods whose invocation count reached a threshold)
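The grouping and counting step can be illustrated without Hadoop by a small in-memory version of the same map and reduce logic. This is a sketch under assumptions: the class InvocationCountSketch is hypothetical, and it assumes row keys have the form appname-method-index shown in the result data structure, with no "-" inside the app name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: in-memory stand-in for the per-method counting job.
class InvocationCountSketch {
    // "Map" step: extracts the API method from a row key of the form appname-method-index.
    static String methodOf(String rowKey) {
        return rowKey.substring(rowKey.indexOf('-') + 1, rowKey.lastIndexOf('-'));
    }

    // "Reduce" step: sums invocations per method, then keeps methods at or above the threshold.
    static Map<String, Integer> count(List<String> rowKeys, int threshold) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : rowKeys) counts.merge(methodOf(key), 1, Integer::sum);
        counts.values().removeIf(n -> n < threshold);
        return counts;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
            "app1-android/net/Uri parse-0",
            "app1-android/net/Uri parse-1",
            "app2-android/net/Uri buildUpon-0");
        System.out.println(count(keys, 2)); // {android/net/Uri parse=2}
    }
}
```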
Mining on the Analysis Data
• In total, 19,250 Android API methods were found to be invoked at least once among the 10,266 apps
• A high invocation count does not by itself make a method more likely to yield a meaningful pattern. Such methods are more likely to be very common usages such as UI elements and Log
• Methods with an average invocation count, especially those tied to a specific feature such as geo-location, network or database, might be the better targets for data mining
Clustered Result:
• Weka outputs the hierarchical clustering result in Newick format, which can be visualized with software such as Dendroscope: http://ab.inf.uni-tuebingen.de/software/dendroscope/
• android/net/Uri buildUpon: some clustering results have obvious clusters, which may imply an obvious usage pattern for this method
Clustered Result:
android/location/Location <init>
• Trick: Some analyses show many redundant column keys (related APIs). Filter out those with no obvious relation (such as APIs related only once) before sending the data for clustering
• Several trials may be needed to find the best cut-off branch
Clustered Result:
android/location/Geocoder <init>
• The resulting Newick format can be parsed back to get the list of related APIs for each cluster
• There may be many unique usage patterns like this one, each of which is probably either an error pattern or a special usage
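Parsing a Newick string back into per-cluster API lists can be sketched with a small recursive descent parser. This is a simplified illustration, not the parser used in this project: the class NewickLeavesSketch is hypothetical, labels are assumed to contain none of the characters "(),:;", and quoted labels and comments are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: collects the leaf labels under every internal node of a Newick tree.
class NewickLeavesSketch {
    private final String s;
    private int pos = 0;
    final List<List<String>> clusters = new ArrayList<>(); // one leaf list per internal node

    NewickLeavesSketch(String newick) { this.s = newick; }

    // Parses one subtree starting at pos and returns the leaf labels below it.
    List<String> parse() {
        List<String> leaves = new ArrayList<>();
        if (s.charAt(pos) == '(') {
            do {
                pos++;                    // skip '(' or ','
                leaves.addAll(parse());
            } while (s.charAt(pos) == ',');
            pos++;                        // skip ')'
            readLabel();                  // optional internal node label, ignored
            clusters.add(leaves);
        } else {
            leaves.add(readLabel());
        }
        if (pos < s.length() && s.charAt(pos) == ':') {
            pos++;
            readLabel();                  // branch length, ignored
        }
        return leaves;
    }

    private String readLabel() {
        int start = pos;
        while (pos < s.length() && "(),:;".indexOf(s.charAt(pos)) < 0) pos++;
        return s.substring(start, pos).trim();
    }

    static List<List<String>> clustersOf(String newick) {
        NewickLeavesSketch p = new NewickLeavesSketch(newick);
        p.parse();
        return p.clusters;
    }

    public static void main(String[] args) {
        System.out.println(clustersOf("((a:0.1,b:0.1):0.3,c:0.4);")); // [[a, b], [a, b, c]]
    }
}
```

Inner clusters are emitted before the clusters that contain them, so the tightest groups of related APIs come first in the list.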
Mining on the Analysis Data
• By tracing back through the resulting Newick tree, we obtained the related APIs for each method
• Interesting results:
Package Analysis
• The clustering result shows that the classes of the android/location package are strongly interconnected. Classes like GpsSatellite, Location and LocationManager rely heavily on each other
Identify Good and Error Usage Patterns
• With the app's name as the identifier, we can trace back to information about the nature of the app (its download count, rating, popularity) to determine the features of the resulting clusters and whether they are likely good or bad patterns
Complete APIs Relationship
• Perform clustering on all the useful data and, after identifying the usage patterns, retrieve the API relationships from the clusters. This can eventually produce a list of suggested ‘good’ usage patterns for when the respective API methods are called
Mining on the Analysis Data
Android API vs Java Library
• Many Android APIs have data flow relations with Java library methods. By digging into the methods’ clusters, we can discover some obvious usage patterns between Android APIs and Java library methods
App’s Permissions vs Method Invocation
• Android permission information is stored in AndroidManifest.xml, a binary-encoded XML file inside the .apk package. It can be decoded and read with a tool such as APK-tool (https://code.google.com/p/android-apktool/)
• On the API levels covered here, Android permissions are declared statically at compile time and cannot be granted dynamically (at run time)
• So with the AST generated from the bytecode, we can traverse the AST to check the correctness of an app’s declared permissions
• For example, an app may declare BroadcastReceiver permissions although no related methods or functions are found in its AST
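The permission check could be sketched as a comparison between the declared permissions and the API classes found in the AST. Everything in this sketch is an assumption made for illustration: the class PermissionCheckSketch is hypothetical, and the permission-to-class mapping in main is a hand-written example, not a mapping produced by this research. Permissions without a mapping entry are skipped rather than reported.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: flags declared permissions with no matching API usage in the AST.
class PermissionCheckSketch {
    static Set<String> unused(Set<String> declared, Set<String> invokedClasses,
                              Map<String, List<String>> permissionToClasses) {
        Set<String> result = new TreeSet<>();
        for (String perm : declared) {
            List<String> classes = permissionToClasses.getOrDefault(perm, Collections.emptyList());
            if (classes.isEmpty()) continue;   // no mapping known: cannot judge this permission
            boolean used = false;
            for (String c : classes) {
                if (invokedClasses.contains(c)) used = true;
            }
            if (!used) result.add(perm);
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative mapping only (assumption).
        Map<String, List<String>> mapping = new HashMap<>();
        mapping.put("android.permission.ACCESS_FINE_LOCATION",
                    Arrays.asList("android/location/LocationManager"));
        mapping.put("android.permission.CAMERA", Arrays.asList("android/hardware/Camera"));

        Set<String> declared = new HashSet<>(Arrays.asList(
            "android.permission.ACCESS_FINE_LOCATION", "android.permission.CAMERA"));
        Set<String> invoked = new HashSet<>(Arrays.asList("android/location/LocationManager"));

        System.out.println(unused(declared, invoked, mapping)); // [android.permission.CAMERA]
    }
}
```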