Opening the Blackbox of VirusTotal: Analyzing Online Phishing Scan Engines
[Figure 2: Screenshots of experiment phishing pages.]
[Figure 3: Illustration of the main experiment on a given phishing site. The timeline spans weeks 1–5 and marks when the phishing page gets online, the two submissions of the URL to a third-party vendor, the change of the phishing page to benign, and the periodic VirusTotal scans. The third-party vendor is one of the 18 vendors that provide their own scan APIs.]
more than 30% of phishing URLs on major blacklists target PayPal [34]. IRS, as a comparison baseline, is not commonly targeted.
We replicate the original sites of PayPal and IRS, and modify the
login form so that login information will be sent to our servers.
By default, we disable any form of cloaking for the phishing sites. Cloaking means that a phishing site hides itself by showing a benign page when it recognizes that an incoming request comes from a known security firm [21, 32]. The robots.txt is also set to allow web crawlers to access the phishing page.
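A permissive robots.txt of this kind is a two-line file (a minimal sketch; the paper does not reproduce its exact contents):

```
# Allow all crawlers to fetch every page, including the phishing page
User-agent: *
Disallow:
```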
Domain Names. We register fresh domain names for our phish-
ing sites. This is to make sure the domain names do not have any
past history that may interfere with the measurement. To prevent
innocent users from mistyping the domain names (i.e., accidentally visiting our websites), we register long random strings as domain
names (50 characters each) from NameSilo [5]. For example, one of
the domain names is “yzdfbltrok9m58cdl0lvjznzwjjcd2ihp5pgb295hfj5u42ff0.xyz”.
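As an aside, such 50-character random labels can be generated with a few lines of code. The paper does not specify its generator; the helper below is purely illustrative:

```python
import secrets
import string

def random_domain(length: int = 50, tld: str = "xyz") -> str:
    """Generate a long random domain label so that no user stumbles
    onto the site by mistyping a real domain (DNS labels allow up to
    63 characters, so 50 fits comfortably)."""
    alphabet = string.ascii_lowercase + string.digits
    label = "".join(secrets.choice(alphabet) for _ in range(length))
    return f"{label}.{tld}"

print(random_domain())
# e.g. 'yzdfbltrok9m58cdl0lvjznzwjjcd2ihp5pgb295hfj5u42ff0.xyz'
```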
Web Hosting. We host the phishing websites at a web hosting
service called Digital Ocean [1] on static IPs. Before the experiment, we made sure that all the IPs and domain names were publicly accessible and not blacklisted by any major blacklist. We informed Digital Ocean of our research and received their consent.
3.2 Experiment Design

The experiments were conducted from March to April 2019, including a main experiment and a baseline experiment.
Main Experiment. The main experiment is designed to mea-
sure (a) the phishing detection accuracy of VirusTotal and vendors;
(b) the potential inconsistency between VirusTotal API and the
vendors' APIs; and (c) the reaction of VirusTotal to changes in the phishing
sites. Recall that there are 18 vendors that have their own scan APIs.
To accurately capture their impact, we set up separate phishing
sites (1 PayPal and 1 IRS) for each vendor (36 sites in total).
For each phishing site, we conduct a 4-week experiment as il-
lustrated in Figure 3.

[Table 2: The number of incoming network requests to fetch the phishing URLs, and the number of unique IPs per phishing site over the 4-week period. We show the average number of total "malicious" labels from VirusTotal per phishing site (if a vendor once gave a malicious label and then changed it back later, we still count it).]

We periodically submit the phishing URL to VirusTotal's scan API. The VirusTotal scan API will trigger the
scanning of (some of) the third-party vendors. VirusTotal scanning
is conducted twice a week on Mondays and Thursdays. At the same
time, we schedule 4 external events (one on the Monday of each week):
(1) Week1: We put the phishing site online.
(2) Week2: We submit the phishing URL to one of the 18 ven-
dors who have their own scan APIs.
(3) Week3: We take down the phishing page, and replace it with
a benign page (i.e., a blank page).
(4) Week4: We submit the phishing URL to the same third-party
vendor as week2.
Note that (2) and (4) are designed to measure the consistency be-
tween VirusTotal scanning and the vendors’ own scanning. Each
phishing site is only submitted to one vendor API so that we can
measure the differences between vendors.
During the experiment, we collect two types of data. First, we collect the labels for all the phishing URLs using VirusTotal's querying API. Note that after a URL is submitted for scanning, the scanning
results (i.e., labels) might not be immediately available in the Virus-
Total database. So we crawl the labels every 60 minutes to track
the fine-grained dynamic changes. Second, we log the incoming
network traffic to all of the phishing servers.
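For concreteness, the sketch below shows such a scan-then-poll loop against the VirusTotal v2 REST API (the public API at the time of the study). The API key and phishing URL are placeholders, and error handling is omitted; this is an illustrative sketch, not the paper's measurement code:

```python
import time
import requests

API_KEY = "YOUR_VT_API_KEY"  # placeholder; not a real key
VT = "https://www.virustotal.com/vtapi/v2"

def submit_scan(url):
    """Submit a URL to the VirusTotal scan API (triggers vendor scanners)."""
    requests.post(VT + "/url/scan", data={"apikey": API_KEY, "url": url})

def fetch_labels(url):
    """Query the VirusTotal database for the current labels of a URL.
    This is a read-only lookup and does not trigger a new scan."""
    resp = requests.get(VT + "/url/report",
                        params={"apikey": API_KEY, "resource": url})
    report = resp.json()
    # 'scans' maps vendor name -> {"detected": bool, "result": str, ...}
    return report.get("scans", {})

# Crawl the labels every 60 minutes to track fine-grained changes.
url = "http://example-phishing-domain.xyz/index.php"  # hypothetical
while True:
    scans = fetch_labels(url)
    malicious = [v for v, r in scans.items() if r.get("detected")]
    print(time.ctime(), f"{len(malicious)} malicious labels:", malicious)
    time.sleep(60 * 60)
```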
Baseline Experiment. The baseline experiment measures the long-term reaction of VirusTotal after a single VirusTotal scan. We set up 2 additional phishing sites (PayPal and IRS) and submit the URLs to the VirusTotal scan API only once, at the beginning of the first week. Then we monitor incoming traffic to the phishing servers, and query the VirusTotal labels over the next 4 weeks.
Summary. In total, 38 websites are set up for our experiments
(36 for main, 2 for baseline). There are 19 PayPal sites and 19 IRS
sites. All the PayPal sites have identical web page content (hosted
under different domain names). All the IRS sites share the same
content (with different domain names).
4 MEASUREMENT RESULTS

Our measurements yield a number of important results.
(a) Incoming Network Traffic. Table 2 shows statistics of the incoming network requests that fetched the phishing URLs. Clearly,
PayPal sites have received significantly more network traffic than
IRS sites. On average, each PayPal site has received more than
12,000 requests while an IRS site has only received 335 requests.
[Figure 4: The number of incoming network requests per day per phishing site (main experiment), weeks 1–5, PayPal vs. IRS.]

[Figure 5: The number of incoming network requests per day per phishing site (baseline experiment), weeks 1–5, PayPal vs. IRS.]

[Figure 6: The average, maximum, and minimum number of malicious labels per site (main experiment), weeks 1–5, PayPal vs. IRS.]
As shown in Figure 4, IRS sites barely have any traffic in the first
week, and only start to receive more traffic in the second week.
Interestingly, the traffic volume is correlated with the “labels”
received by the sites. Figure 6 shows the number of VirusTotal
vendors that flagged a phishing site as malicious (i.e., number of
malicious labels per site). PayPal sites get flagged by some vendors
right away in the first week, while IRS sites are only detected at a
much later time (after the vendor API scan). Our hypothesis is that after a phishing site is flagged by some vendors, it is shared with other vendors for more in-depth scanning. Figure 5 further confirms this intuition. For the IRS site (baseline experiment), we submitted its URL for a VirusTotal scan only once, and the scan failed to detect it. There is almost no more traffic in the following weeks.
The PayPal site, since it got flagged after the scan, continues to
receive incoming traffic.
After looking into the traffic log, we notice that not all the requests target the submitted phishing URLs. Some scanners also attempted to retrieve resources under the root directory ("/") or non-existing pages such as "payload.php" or "shell.php".
For example, in the baseline experiment, the PayPal site has received
6,291 requests for the phishing URL (see Table 2), and 19,222 re-
quests for other URLs or resources. This indicates that the scanners
are looking for signs of malware hosting or website compromise.
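A minimal log-analysis sketch of the kind used to produce such counts follows, assuming a standard Apache/nginx "combined" access-log format; the log file name and phishing path are hypothetical:

```python
import re
from collections import Counter

PHISHING_PATH = "/index.php"  # hypothetical path of the submitted phishing URL

# Combined log format: IP ident user [time] "METHOD /path HTTP/x.x" ...
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[.*?\] "(\S+) (\S+) [^"]*"')

hits, probes = Counter(), Counter()
with open("access.log") as f:
    for line in f:
        m = LOG_RE.match(line)
        if not m:
            continue
        ip, method, path = m.groups()
        if path == PHISHING_PATH:
            hits[ip] += 1       # request for the submitted phishing URL
        else:
            probes[path] += 1   # e.g. "/", "/payload.php", "/shell.php"

print("requests for the phishing URL:", sum(hits.values()))
print("unique IPs:", len(hits))
print("most probed other paths:", probes.most_common(5))
```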
(b) Delay of Label Updating. A closer examination of Figure 6 shows that VirusTotal delays updating labels in its database. More specifically, the x-axis in Figure 6 is the label querying time (label crawling is done every hour). We observe that only after the second VirusTotal scan does the first scan's result get updated in the VirusTotal database.
For example, in the first week, we submit the PayPal URLs to
VirusTotal on day-1. The querying API returns “benign” labels since
these URLs were never scanned before by any vendor. Then after we
submit the URLs again on day-4, the querying API starts to return
"malicious" labels from some vendors. Based on the "scanning time" of the returned labels, we see that these "malicious" labels actually originated from the day-1 scan. This means that, although some vendors had already detected the phishing page on day-1, the results were not updated in the VirusTotal database until the next scan request on day-4.
The result shows that VirusTotal uses a "pull" (instead of "push") model to get scanning results from vendors. The pull is only triggered by VirusTotal's scan API, not the querying API. Our baseline experiment is consistent with this pull-based design.
[Table 3: Vendor name, brand, and the VirusTotal/vendor labels before (week-2) and after.]

[Table 4: A list of all the vendors that successfully detected the phishing pages (during the first 2 weeks).]
As shown in Table 3, there are in total 8 vendors that show inconsistent results. Most vendors have a "0-1-0" pattern for PayPal sites, including Forcepoint, Sucuri, Quttera, URLQuery, ZeroCERT, and Google Safe Browsing. This means that, through the VirusTotal scan, these vendors return the label "benign", even though their own scan APIs can detect the page as "malicious". A possible explanation is that these vendors did not give VirusTotal permission to trigger their scanners. Instead, VirusTotal runs stripped-down versions of the scanners [9, 27], which cannot detect the phishing page.
For IRS pages, we show that Fortinet, Google Safe Browsing, and Netcraft have detected these IRS pages via their own scan APIs. However, only Netcraft has shared this result with VirusTotal after the scan. It should be noted that we tried to analyze which scanners actually visited the phishing sites. This attempt failed because scanners actively hide their identity by using proxies and cloud services (see §5). Overall, the result shows that VirusTotal does not always reflect the best detection capability of a vendor. If possible, researchers should cross-check the results with individual vendors' APIs.
(e) Detection Accuracy of Vendors. In Table 4, we list all 15
vendors that detected at least one phishing site during the first two
weeks (we took down the phishing pages after week-2). We show
that even the best vendors cannot detect all phishing sites. The most
effective vendors such as Netcraft flagged 14 (out of 18) PayPal
pages and 12 (out of 18) IRS pages. It is not clear why some sites are not detected, given that all 18 PayPal (IRS) sites have identical content (except for using a different random string as the domain name). In addition, we observe that some of the vendors always flag the same subset of phishing sites. For example, Netcraft, Emsisoft, and Fortinet flagged the same 26 sites. Similarly, Malwarebytes, BitDefender, and ESET flagged the same 15 sites. This suggests that certain vendors may copy (or synchronize with) each other's blacklists. Validating this hypothesis requires more rigorous experiments in future work.
(f) Reaction to Phishing Take-down. We observe that ven-
dors do not quickly take a URL off the blacklist after the phishing
site is taken down. On the Monday of week-3, we took down all
the phishing pages and replaced them with benign pages. However,
[Figure 7: The number of detected sites per week (weeks 1–5) for Fortinet, Avira, CyRadar, and CLEAN MX; four vendors show a sign of reaction to the phishing take-down (PayPal sites).]
Figure 6 shows that the number of malicious labels does not drop even after multiple re-scans.
After examining the results for each vendor, we find 4 vendors
that flip some “malicious” labels to “benign” after the third week
(for PayPal sites only). Figure 7 shows these 4 vendors and the
number of phishing sites they flagged over time. CyRadar and CLEAN MX already started to flip their malicious labels in week-2 (before the phishing take-down), so this is not necessarily a reaction to the take-down. Fortinet flipped the label on one site in week-4. Avira is likely reacting to the take-down, since it changed all "malicious" labels to "benign" right after the event. Interestingly, the labels were quickly reversed to "malicious" in the next scan.
5 OTHER CONTROLLED EXPERIMENTS

Our experiments lead to new questions: which vendors have indeed visited the phishing sites? What would happen if a phishing site applies simple obfuscation techniques or sets robots.txt to prevent crawling? How well can VirusTotal detect benign pages? To answer these questions, we conduct additional controlled experiments by setting up 28 new sites.
Vendor Identification. Vendor identification based on the net-
work traffic is very difficult. On average each phishing site was
visited by more than 2000 unique IPs (PayPal, Table 2). Leverag-
ing the whois records, User-Agents, and the known IP ranges of
security vendors, we only successfully confirmed the identities of 5 vendors: Dr. Web, Forcepoint, Google Safe Browsing, Quttera, and ZeroCERT. We also tried more controlled experiments
by submitting URLs to each of the 18 vendors (one URL per vendor).
Even so, we cannot build a reliable identifier for all 18 vendors. The reason is that most vendors route their traffic via proxies or cloud services, and the IP set of each vendor also changes dynamically. 32.9%
of the traffic comes from known cloud services such as Amazon,
Digital Ocean, M247, Feral Hosting, and Linode. It is likely that
security vendors are trying to hide their identity to overcome the
cloaking of phishing sites [32].
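For illustration, a best-effort identifier of the kind we attempted might look like the sketch below. The fingerprints shown are assumptions, not verified vendor signatures, and, as discussed above, proxies and cloud IPs defeat this heuristic most of the time:

```python
import ipaddress
from typing import Optional

# Illustrative fingerprints only: real IP ranges and User-Agents must be
# collected from whois records and vendor documentation, and change often.
VENDOR_HINTS = {
    "Google Safe Browsing": {"ua_substr": "google", "cidrs": ["66.249.64.0/19"]},
    "Dr. Web":              {"ua_substr": "drweb",  "cidrs": []},
}

def guess_vendor(ip: str, user_agent: str) -> Optional[str]:
    """Best-effort attribution of a request to a security vendor."""
    for vendor, hints in VENDOR_HINTS.items():
        if hints["ua_substr"] in user_agent.lower():
            return vendor
        addr = ipaddress.ip_address(ip)
        if any(addr in ipaddress.ip_network(c) for c in hints["cidrs"]):
            return vendor
    return None  # proxied/cloud traffic falls through unattributed
```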
Additional Experiments on Label Updating. So far, our
main experiment shows that it takes two scan requests to push
the scanning results back to VirusTotal (§4(b)). However, the previ-
ous experiment is limited to new URLs that are never detected by
vendors before. A follow up question is, what if the URL is already
blacklisted by the third-party vendor? Do we still need two requests
to push the label to VirusTotal? To answer this question, we per-
formed a small controlled experiment. We set up three fresh PayPal
pages under three new domain names. Then we choose NetCraft, Forcepoint, and Fortinet, which were capable of detecting the PayPal page in the main experiment. We first submit the three URLs to the individual vendors for scanning (one URL per vendor). As before, the URLs get immediately blacklisted by the respective vendor. Then we submit the URLs to VirusTotal for the first scan. VirusTotal returns a "benign" label for all the URLs. After 4 days, we submit the URLs to VirusTotal for the second scan. Interestingly, the returned labels are still "benign". This indicates that NetCraft, Forcepoint, and Fortinet do not share their blacklists with VirusTotal; otherwise, the labels should have been "malicious" after the second VirusTotal scan. It is more likely that VirusTotal runs stripped-down versions of the scanners that fail to detect the phishing pages.

Obfuscation Method   # Sites (PayPal)   Malicious Labels Per Site
                                        Min.    Max.    Avg.
Redirection                 2            12      12      12
Image                       2             3       6      4.5
PHP Code                    2             1       3      2

Table 5: The number of "malicious" labels per site after applying different obfuscation methods.
Impact of Obfuscation. Obfuscation is used to deliberately
make it harder to understand the intent of the website. In this
case, the attacker can apply simple changes so that their website
still looks like the target website, but the underlying content (e.g., code) becomes harder to analyze. We examine the impact of three obfuscation methods: (1) Redirection: we use a URL shortener service to obfuscate the phishing URL. (2) Image-based Obfuscation: we take a screenshot of the PayPal website, and use the screenshot as the background image of the phishing site. Then we overlay the login form on top of the image. In this way, the phishing site still looks the same, but the HTML file is dramatically different. (3) PHP Code Obfuscation: within the original PHP code, we first replace all user-defined names with random strings (without affecting the functionality). Then we remove all the comments and whitespace, and encode the output in ASCII. For each of the obfuscation methods,
we build 2 new PayPal sites (6 sites in total). We submit the URLs
to VirusTotal for scan, wait for a week, submit again (to trigger
database update), and retrieve the labels.
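As an illustration of method (3), a naive renaming-and-stripping pass could look like the following sketch. This is a toy version: the actual transformation additionally encoded the output in ASCII, and a robust tool would need a real PHP parser rather than regular expressions:

```python
import random
import re
import string

def obfuscate_php(source: str) -> str:
    """Toy PHP obfuscation: rename user-defined variables, then strip
    comments and whitespace. A real tool needs a proper PHP parser
    (e.g., to avoid renaming $this or touching names inside strings)."""
    # Collect $variables, skipping superglobals such as $_POST / $_GET.
    names = set(re.findall(r"\$(?!_)([A-Za-z]\w*)", source))
    mapping = {n: "".join(random.choices(string.ascii_lowercase, k=8))
               for n in names}
    for old, new in mapping.items():
        source = re.sub(r"\$%s\b" % old, "$" + new, source)
    # Remove // and /* ... */ comments, then collapse whitespace.
    source = re.sub(r"//[^\n]*|/\*.*?\*/", "", source, flags=re.S)
    return re.sub(r"\s+", " ", source).strip()
```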
Table 5 shows the number of malicious labels per site. As a
comparison baseline, without obfuscation, the PayPal site in the
main experiment (§4) received 12.1 malicious labels on average. This number is calculated based on the first scan of week-2 in the main experiment (instead of all four weeks of results) to be consistent with the setting of the obfuscation experiment. We observe that
redirection does not help much. However, image and code-based
obfuscations are quite effective — the average number of malicious
labels drops from 12.1 to 4.5 and 2 respectively. This suggests that
these vendors are still unable to handle simple obfuscation schemes.
Robots.txt. To see the impact of robots.txt, we set up 18
new domains where the robots.txt disallows crawling. Then we
submit these 18 URLs to the 18 vendors' scan APIs. We find that the traffic volumes are still comparable to those in the previous experiment. The result indicates that most scanners simply ignore robots.txt.
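The restrictive robots.txt here is the mirror image of the permissive file sketched earlier (again a sketch of the standard directive, not the verbatim file we deployed):

```
# Disallow all crawlers from every path
User-agent: *
Disallow: /
```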
Detection of Benign Pages. All the experiments so far have focused on phishing pages. A quick follow-up question is how well VirusTotal handles benign pages. We did a quick experiment by setting up one benign page under a new domain name (a long random string, as before). The page is a personal blog, and it does not try to impersonate any brand. We submit the URL to the VirusTotal scan API twice, 3 days apart, and then monitor the label for a month. We find that the labels are always "benign". Given the limited scale of this experiment, it is not conclusive about VirusTotal's false positive rate. At the least, it shows that VirusTotal did not incorrectly label the website as "malicious" just because it has a long random domain name.
6 DISCUSSIONS & OPEN QUESTIONS

Our experiments in §4 and §5 collectively involve 66 (38+28) experimental websites. We show that vendors have an uneven detection
performance. In the main experiment, only 15 vendors have de-
tected at least one site. Even the best vendor only detected 26 out
of 36 sites. Given that vendors have an uneven capability, their
labels should not be treated equally when aggregating their results.
In addition, we show the delays of label updating due to the non-
proactive “pull” method of VirusTotal. We also illustrate the label
inconsistency between VirusTotal scan and the vendors’ own scans.
As a simple best practice, we suggest that future researchers scan URLs twice to obtain the updated labels, and cross-check the labels with the vendors' own APIs, as in the sketch below.
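In code, the scan-twice practice might look like the following sketch against the v2 API (placeholder key; the 4-day gap mirrors our day-1/day-4 observation and is not a VirusTotal requirement):

```python
import time
import requests

API_KEY = "YOUR_VT_API_KEY"  # placeholder
VT = "https://www.virustotal.com/vtapi/v2"

def scan_twice_then_query(url, gap_seconds=4 * 24 * 3600):
    """Scan a URL twice before trusting the queried labels.
    The first scan triggers the vendors; the second scan causes
    VirusTotal to pull the first scan's results into its database."""
    requests.post(VT + "/url/scan", data={"apikey": API_KEY, "url": url})
    time.sleep(gap_seconds)
    requests.post(VT + "/url/scan", data={"apikey": API_KEY, "url": url})
    time.sleep(15 * 60)  # allow some time for the pull to complete
    report = requests.get(VT + "/url/report",
                          params={"apikey": API_KEY, "resource": url}).json()
    return report.get("scans", {})
```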
Limitations. Our experiments have a few limitations. First, the
long domain names may affect the detection accuracy. However, we
argue that the long domain names actually make the websites look
suspicious, and thus make the detection easier. The fact that certain
scanners still fail to detect the phishing sites further confirms the
deficiency of scanners. Second, the use of “fresh” domain names
may also affect the detection performance of vendors, since certain vendors might use infection reports as features (e.g., reports from the victims of a phishing site). In practice, vendors might perform better on phishing sites that already have victims.
Future Work. During our experiments, we observe interesting
phenomena that lead to new open questions. First, the vendors’
models perform much better on PayPal pages than on IRS pages.
Future work can further investigate the “fairness” of vendors’ classi-
fiers regarding their performance on more popular and less popular
phishing brands. Second, we observe that some vendors always detect the same subset of phishing sites (Table 4). If these vendors indeed fully synchronize their labels, then their labels are essentially redundant information. As such, these vendors should not be treated as independent when aggregating their votes. Future work can further investigate the correlation of results between different vendors. Third, many vendors (e.g., Kaspersky, Bitdefender, Fortinet) also provide APIs for file scanning to detect malware. File scanning can be studied in a similar way, e.g., by submitting "ground-truth" malware and benign files to evaluate the quality of labels and the consistency between vendors and VirusTotal.
ACKNOWLEDGEMENT

We would like to thank our shepherd Gianluca Stringhini and the anonymous reviewers for their helpful feedback. This project was supported by NSF grants CNS-1750101 and CNS-1717028.
[20] … Y., Qian, Z., and Duan, H. How you get shot in the back: A systematical study about cryptojacking in the real world. In Proc. of CCS (2018).
[21] Invernizzi, L., Thomas, K., Kapravelos, A., Comanescu, O., Picod, J., and Bursztein, E. Cloak of visibility: Detecting when machines browse a different web. In Proc. of IEEE S&P (2016).
[22] Kantchelian, A., Tschantz, M. C., Afroz, S., Miller, B., Shankar, V., Bachwani, R., Joseph, A. D., and Tygar, J. D. Better malware ground truth: Techniques for weighting anti-virus vendor labels. In Proc. of AISec (2015).
[23] Kim, D., Kwon, B. J., and Dumitraş, T. Certified malware: Measuring breaches of trust in the Windows code-signing PKI. In Proc. of CCS (2017).
[24] Kim, D., Kwon, B. J., Kozák, K., Gates, C., and Dumitraş, T. The broken shield: Measuring revocation effectiveness in the Windows code-signing PKI. In Proc. of USENIX Security (2018).
[25] Kleitman, S., Law, M. K., and Kay, J. It's the deceiver and the receiver: Individual differences in phishing susceptibility and false positives with item profiling. PLOS One (2018).
[26] Korczynski, D., and Yin, H. Capturing malware propagations with code injections and code-reuse attacks. In Proc. of CCS (2017).
[27] Kwon, B. J., Mondal, J., Jang, J., Bilge, L., and Dumitraş, T. The dropper effect: Insights into malware distribution with downloader graph analytics. In Proc. of CCS (2015).
[28] Lever, C., Kotzias, P., Balzarotti, D., Caballero, J., and Antonakakis, M. A lustrum of malware network communication: Evolution and insights. In Proc. of IEEE S&P (2017).
[29] Li, B., Vadrevu, P., Lee, K. H., Perdisci, R., Liu, J., Rahbarinia, B., Li, K., and Antonakakis, M. JSgraph: Enabling reconstruction of web attacks via efficient tracking of live in-browser JavaScript executions. In Proc. of NDSS (2018).
[30] Miramirkhani, N., Barron, T., Ferdman, M., and Nikiforakis, N. Panning for gold.com: Understanding the dynamics of domain dropcatching. In Proc. of WWW (2018).
[31] Neupane, A., Saxena, N., Kuruvilla, K., Georgescu, M., and Kana, R. K. Neural signatures of user-centered security: An fMRI study of phishing, and malware warnings. In Proc. of NDSS (2014).
[32] Oest, A., Safaei, Y., Doupé, A., Ahn, G., Wardman, B., and Tyers, K. PhishFarm: A scalable framework for measuring the effectiveness of evasion techniques against browser phishing blacklists. In Proc. of IEEE S&P (2019).
[33] Oprea, A., Li, Z., Norris, R., and Bowers, K. MADE: Security analytics for enterprise threat detection. In Proc. of ACSAC (2018).
[34] Peng, P., Xu, C., Quinn, L., Hu, H., Viswanath, B., and Wang, G. What happens after you leak your password: Understanding credential sharing on phishing sites. In Proc. of AsiaCCS (2019).
[35] Razaghpanah, A., Nithyanand, R., Vallina-Rodriguez, N., Sundaresan, S., Allman, M., Kreibich, C., and Gill, P. Apps, trackers, privacy, and regulators: A global study of the mobile tracking ecosystem. In Proc. of NDSS (2018).
[36] Sarabi, A., and Liu, M. Characterizing the Internet host population using deep learning: A universal and lightweight numerical embedding. In Proc. of IMC (2018).
[37] Schwartz, E. J., Cohen, C. F., Duggan, M., Gennari, J., Havrilla, J. S., and Hines, C. Using logic programming to recover C++ classes and methods from compiled executables. In Proc. of CCS (2018).
[38] Sharif, M., Urakawa, J., Christin, N., Kubota, A., and Yamada, A. Predicting impending exposure to malicious content from user behavior. In Proc. of CCS (2018).
[39] Szurdi, J., and Christin, N. Email typosquatting. In Proc. of IMC (2017).
[40] Tian, K., Jan, S. T. K., Hu, H., Yao, D., and Wang, G. Needle in a haystack: Tracking down elite phishing domains in the wild. In Proc. of IMC (2018).