Basic Knowledge

This chapter introduces the basics of Android automation. Please read it in full, because these concepts are not repeated later. Android automation differs from conventional web automation in many ways, but the two also have a lot in common. In web automation, you inspect the page layout and element IDs with the browser’s F12 developer tools, then locate elements with XPath and perform operations such as clicking and waiting. Android follows similar logic: you select elements with a selector and then click them or check their state, so getting started is not difficult.

Commonalities and Differences Between Mobile and Web Automation

Mobile automation has a lot in common with web automation, but there are also important differences. Take Selenium as an example: to control a web page you typically need three things, a browser, a WebDriver, and Selenium itself. On mobile, the phone plays the role of the browser, FIRERPA plays the role of the WebDriver, and lamda, the Python library of FIRERPA, plays the role of Selenium. The goals are the same: simulate user operations for testing, data collection, or automated task execution. Both are driven by scripts that locate elements, click them, take screenshots, and check conditions. Seen this way, the two are quite similar.

The differences start with the setup. Mobile automation requires both a phone and a computer, while web automation can run entirely on your computer. The toolsets also differ: for web, common tools include Selenium, Puppeteer, and Playwright; for mobile, they are usually FIRERPA, AutoJS, Appium, uiautomator, and so on.

On the web, elements are usually located with XPath or CSS selectors based on the HTML DOM structure, and the element hierarchy is relatively intuitive. On mobile, elements are usually located with selectors. Android interfaces are also described as XML layouts, so you can select elements with XPath there as well. Web automation rarely has to deal with compatibility issues, and most device-related differences can be avoided by pinning the browser and setting a fixed window resolution at startup. On Android, the brand and model of the device, the screen size, and the system version can all affect the compatibility of automation code. Don’t worry, though: the impact exists, but it is not significant.

Differences Between Various Automation Tools

The common Android automation tools mentioned above differ from each other in many ways. Let us state our position first: among all automation tools, FIRERPA is the most stable, full-featured, and powerful, and the most suitable for managing and running real projects.

Note

Our position is not biased; it is based on 6 years of continuous exploration and optimization. We have walked most of the paths you have taken and hit most of the pitfalls you have encountered.

AutoJS

AutoJS and its derivative products are self-controlled tools: you install an APK on the device and write scripts in JS that run on the device itself. AutoJS can usually only perform automation-level operations. Its advantage is a low entry barrier, which makes it suitable for beginners and amateur use. Its design, however, does not suit large-scale script control, management, and updates: the devices run in a decentralized, unmanaged state, so precise control at scale is not possible.

Appium

Appium, commonly used by testers, has a C/S (client/server) architecture and is better suited to cluster control than AutoJS. It has an obvious shortcoming, though: because it targets most systems, supporting iOS as well as Android, it has become bulky and bloated, which makes it a poor fit for large-scale deployment.

u2

Finally, uiautomator2 also has a C/S architecture. Compared to Appium it is lean, with few redundant parts and a feature set that is just right. So why did we abandon it? The main reason is that it is not very stable in multi-device situations. In addition, its automatic installation logic helps beginners but is redundant and hard to control in professional cluster setups, and it is not maintained.

Of course, all of these tools are well suited to regular use. Our needs happen to be less regular, because real business tasks rarely involve automation alone. Suppose you need to test an app and record its requests, responses, request times, and so on. Think about how you would do it: most solutions either involve a lot of extra manual work or turn out to be unstable or to have compatibility issues. With FIRERPA, you can perform all of these operations in code. You only interact with code, and stability and compatibility are left to FIRERPA. A direct comparison is not really fair, because in terms of functionality FIRERPA is a superset of all the solutions above. It incorporates the lessons from all the pitfalls we have encountered and the paths we have taken.

Basic Automation Process

You usually start by deciding on a plan: whether to perform regular automation only, or to capture application runtime data while automating. For data capture you typically have two options: intercepting HTTP/HTTPS traffic with a man-in-the-middle proxy, or intercepting data with hooks. The man-in-the-middle method is relatively simple and suits regular use, but it may not work for some applications. The Hook approach requires substantial reverse engineering knowledge and is hard for beginners, but it covers the edge cases where a proxy does not work.

Man-in-the-Middle Capture

The man-in-the-middle approach is relatively simple. Find the sections on installing certificates and setting proxies in the documentation, and combine them with mitmproxy. If you are unsure where to start, refer to our official startmitm.py script, which already contains all the logic and can be copied or reused at any time.
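As a rough illustration of the kind of data a man-in-the-middle capture can record (requests, responses, and request times), here is a minimal mitmproxy addon sketch. It is not the logic of startmitm.py; the class name RequestRecorder and the record fields are made up for this example. It only relies on the standard mitmproxy response hook and flow attributes, and assumes the certificate and proxy are already set up on the device.

```python
class RequestRecorder:
    """Collect method, URL, status code and duration of each HTTP exchange."""

    def __init__(self):
        # In-memory list of captured exchanges (illustrative storage only).
        self.records = []

    def response(self, flow):
        # mitmproxy calls this hook once the full response has been received.
        started = flow.request.timestamp_start
        finished = flow.response.timestamp_end
        record = {
            "method": flow.request.method,
            "url": flow.request.pretty_url,
            "status": flow.response.status_code,
            "duration_ms": round((finished - started) * 1000, 1),
        }
        self.records.append(record)
        print(record)


# mitmproxy looks for this module-level list when a script is loaded with -s.
addons = [RequestRecorder()]
```

Saved under a file name of your choice, for example record_requests.py, the script can be loaded with mitmdump -s record_requests.py while the automation code drives the app.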

Hook Capture

The Hook approach requires at least junior-level reverse engineering skills. If you have not heard of it before, you can skip this approach for now. In general, it means writing frida scripts that are injected into the application, hook the relevant function calls, read their parameters or return values, and report them. A simple demo and usage instructions are in the “Using Frida to Report Data” section.

Automation Code

Automation code is also indispensable, because you need automation to trigger the relevant logic in the application. Writing it usually follows the process below. First, open the FIRERPA remote desktop. You should see the following interface.

Remote Desktop

Now open the app you want to automate, then click the eye icon in the upper right corner of the remote desktop, and you will see the following interface. Select the element you want to operate on and click it to view its information.

Tip

You can also open it from code. This is covered later.

Select Element

The element information, such as text and resourceId, appears on the right. To click this element, write the following code, which means “click the element whose text is ‘同意’ (Agree)”.

d(text="同意").click()

Note

This is just an example. There are many ways to write selectors. The example only introduces the simplest way of writing.

You now know the simplest form. Combine it with if-else and other control logic, plus interfaces such as exists, and you can implement a complete automation flow. See, it’s not that hard.
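The combination of a selector, exists, and if-else described above might look like the following sketch. The helper name accept_agreement is hypothetical, and the code assumes that exists is called as a method on the selected element, in the same way as click; d stands for an already connected device object.

```python
def accept_agreement(d, text="同意"):
    """Click the element with the given text if it is currently on screen.

    Returns True when a click was issued and False when the element
    was not found, so the caller can branch on the result.
    """
    element = d(text=text)
    # Assumption: exists() reports whether the selector matches anything.
    if element.exists():
        element.click()
        return True
    return False
```

A caller can then write if not accept_agreement(d): and handle the case where the dialog never appeared, for example by continuing with the next step.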

Interface Layout Inspection

Writing automation code normally depends on interface layout inspection, which is also the only way to obtain selector conditions. First, open the device’s remote desktop in a browser, then click the eye icon in the upper right corner of the remote desktop to enter layout inspection. You can now click the dotted boxes on the screen on the left to view the information of the corresponding elements, and use those properties as selector parameters. Clicking the eye icon again closes the layout inspection. The layout inspection does not refresh as the page changes; it always shows the layout captured when inspection was opened or last refreshed. To refresh it, press the shortcut key CTRL + R.

Inspect Element

Hint

You can also press the TAB key in the layout inspection interface to traverse and view all elements.

Interface Selector

The interface selector is used to operate on Android elements. You can think of it as similar to XPath rules: the two differ, but the general purpose is the same. In FIRERPA, the selector is the Selector class. In most cases you do not need to touch this class directly; you have already seen it used implicitly in the click example above. In full, it accepts the following optional parameters.

Match Type               Description
text                     Text exact match
textContains             Text contains match
textStartsWith           Text starts with match
className                Class name match
description              Description exact match
descriptionContains      Description contains match
descriptionStartsWith    Description starts with match
clickable                Can be clicked
longClickable            Can be long-pressed
scrollable               Can be scrolled
resourceId               Resource ID match

In most cases, only resourceId, text, description, textContains, and similar parameters are used. If the element has a normal resourceId, prefer it as the selector, for example d(resourceId="com.xxx:id/mobile_signal"). Otherwise use text, for example d(text="Click to enter"), or a looser match such as d(textContains="Click"). description works like text but is used less often.
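The preference order above (resourceId first, then text, then description) can be captured in a small helper. This is only a sketch: selector_kwargs is a hypothetical function, info is a plain dict of properties copied by hand from the layout inspection panel, and any non-empty resourceId is treated as usable.

```python
def selector_kwargs(info):
    """Pick selector arguments from inspected element properties.

    Returns a one-item dict suitable for d(**kwargs), choosing the first
    non-empty property in the order resourceId, text, description.
    """
    for key in ("resourceId", "text", "description"):
        value = info.get(key)
        if value:
            return {key: value}
    raise ValueError("element has no resourceId, text or description to select on")


# Example: an element without a resourceId falls back to its text.
print(selector_kwargs({"resourceId": "", "text": "Click to enter"}))
```

With a connected device, the result would be used as d(**selector_kwargs(info)).click().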

Hint

The selector is built from the main properties you obtain through the interface layout inspection described above.

Screen Coordinate Definition

During automation you will inevitably need to operate on exact coordinates or on regions of the screen. Because coordinates are a common source of confusion, this section introduces the Android screen coordinate system.

As we all know, images have resolution sizes, and so do screens. For Android screens, regardless of whether it’s in landscape, portrait, or auto-rotate mode, the top-left corner is uniformly set as the origin 0,0, and the coordinate system extends to the right and down, with X as the horizontal axis and Y as the vertical axis, as shown in the figure.

Screen Coordinates

In the figure above, which shows a 1080x1920 screen, the top-left corner is 0,0, the top-right corner is 1080,0, the bottom-left corner is 0,1920, and the bottom-right corner is 1080,1920. You can use these values to work out the coordinates of any point on the screen.

Note

Regardless of whether the screen is originally portrait or landscape, or auto-rotate, the top-left corner of the screen in its current orientation is uniformly used as the origin.
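One way to work out a point from the corner coordinates is to express it as a fraction of the screen width and height, which also keeps scripts portable across screen sizes. The helper below is a sketch; to_pixels is a hypothetical name and is not part of FIRERPA.

```python
def to_pixels(rel_x, rel_y, width=1080, height=1920):
    """Convert a relative position (0.0 to 1.0 on each axis) to pixels.

    The origin is the top-left corner of the screen in its current
    orientation; X grows to the right and Y grows downward.
    """
    if not (0.0 <= rel_x <= 1.0 and 0.0 <= rel_y <= 1.0):
        raise ValueError("relative coordinates must be between 0.0 and 1.0")
    return int(rel_x * width), int(rel_y * height)


print(to_pixels(0.5, 0.5))                          # (540, 960), screen centre
print(to_pixels(0.5, 0.5, width=1920, height=1080)) # (960, 540) in landscape
```

The resulting pair can be passed on as the x and y of a point.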

Points on the Screen

FIRERPA uses two screen-related definitions: points and regions. Some operations, such as clicking or taking screenshots, require a coordinate or a region. For a coordinate point, we use the following definition, which represents the point at 100,100 on the screen.

Point(x=100, y=100)

Definition of Regions

A region is a rectangular area on the screen. Its definition is slightly more complex, so please read carefully. We use Bound to represent a region, and it takes four parameters: top, left, bottom, and right. top is the pixel distance from the top edge of the rectangle to the top of the screen, left is the pixel distance from the left edge of the rectangle to the left of the screen, right is the pixel distance from the right edge of the rectangle to the left of the screen, and bottom is the pixel distance from the bottom edge of the rectangle to the top of the screen. In short, every value is measured from the origin: left and right along the X axis, top and bottom along the Y axis. The picture below illustrates this. The phone screen is still 1080x1920, and the device is currently in portrait mode.

Screen Region

Now let’s assume the screen is divided into four equal parts, and we need the definitions of the top-left region (region 1) and the bottom-right region (region 2) as shown. According to the rules, for region 1, the top of the rectangle is 0 pixels from the top of the screen, the left of the rectangle is 0 pixels from the left of the screen, the bottom of the rectangle is 960 pixels from the top of the screen (1920÷2), and the right of the rectangle is 540 pixels from the left of the screen (1080÷2). So its definition should be:

Bound(top=0, left=0, right=540, bottom=960)

Similarly, for region 2, the top of the rectangle is 960 pixels from the top of the screen, the left of the rectangle is 540 pixels from the left of the screen, the right of the rectangle is 1080 pixels from the left of the screen, and the bottom of the rectangle is 1920 pixels from the top of the screen. So the definition of the second rectangle is:

Bound(top=960, left=540, right=1080, bottom=1920)
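The arithmetic for both regions can be checked with a short script. The Region dataclass below is a local stand-in that only mirrors the four Bound parameters so the example runs without a device; it is not the FIRERPA Bound class, and quadrant is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Local stand-in with the same four fields as Bound."""
    top: int
    left: int
    right: int
    bottom: int

    @property
    def width(self):
        return self.right - self.left

    @property
    def height(self):
        return self.bottom - self.top


def quadrant(row, col, width=1080, height=1920):
    """Return one quarter of the screen; row and col are each 0 or 1."""
    half_w, half_h = width // 2, height // 2
    return Region(
        top=row * half_h,
        left=col * half_w,
        right=(col + 1) * half_w,
        bottom=(row + 1) * half_h,
    )


print(quadrant(0, 0))  # Region(top=0, left=0, right=540, bottom=960)
print(quadrant(1, 1))  # Region(top=960, left=540, right=1080, bottom=1920)
```

Note that right and bottom are positions measured from the origin, not sizes: the width of a region is right minus left, and its height is bottom minus top.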

Automation Auxiliary Measures

Not every application can be automated with selectors. Some interfaces, such as games, are rendered in real time and have no page layout at the Android level. For such applications, you can only operate and check state through OCR or image matching. We provide a complete OCR-assisted solution as well as built-in SIFT and template-based image matching to cover these cases. The relevant interfaces and their usage are described in the documentation.

Updating…