Why MySQL could be slow with large tables?
June 9, 2006 by Peter Zaitsev
If you've been reading enough database-related forums, mailing lists, or blogs, you have probably heard complaints about MySQL being unable to handle more than 1,000,000 (or pick any other number) rows. On the other hand, it is well known that with customers like Google, Yahoo, LiveJournal, and Technorati, MySQL has installations with many billions of rows and delivers great performance. What could be the reason?
The reason is normally table design and understanding the inner workings of MySQL. If you design your data wisely, considering what MySQL can and cannot do, you will get great performance; if not, you might become upset and become one of those bloggers. Note that any database management system is different in some respects, and what works well for Oracle, MS SQL, or PostgreSQL may not work well for MySQL, and the other way around. Even storage engines have very important differences which can affect performance dramatically.
The three main issues you should be concerned with when dealing with very large data sets are buffers, indexes, and joins.
Buffers
The first thing you need to take into account is that the situations when data fits in memory and when it does not are very different. If you started with an in-memory data size and expect a gradual performance decrease as the database grows, you may be surprised by a severe drop in performance. This especially applies to index lookups and joins, which we cover later. As everything usually slows down a lot once it no longer fits in memory, the good solution is to make sure your data fits in memory as well as possible. This can be done by data partitioning (i.e. old and rarely accessed data stored on different servers), multi-server partitioning to use combined memory, and a lot of other techniques which I should cover at some later time.
So you understand how much having data in memory changes things, here is a small example with numbers. If your data fits fully in memory, you could perform over 300,000 random lookups per second from a single thread, depending on system and table structure. If your data is fully on disk (both data and index), you would need 2+ IOs to retrieve each row, which means you get about 100 rows/sec. Note that multiple drives do not really help a lot here, as we're speaking about a single thread/query. So the difference is 3,000 times! That may be a bit extreme, as there are few completely uncached workloads, but a 100+ times difference is quite frequent.
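The back-of-envelope arithmetic above can be written out explicitly. This is a minimal sketch, and the constants (300,000 cached lookups/sec, 2 IOs per row, ~5 ms per random IO) are the article's illustrative assumptions, not measurements:

```python
# Model of the in-memory vs. on-disk lookup gap described above.
# All numbers are illustrative assumptions from the text, not benchmarks.

def disk_lookups_per_sec(ios_per_row=2, io_ms=5.0):
    """Rows/sec for a single thread when every lookup needs random disk IO."""
    return 1000.0 / (ios_per_row * io_ms)   # 2 IOs * 5 ms -> ~100 rows/sec

in_memory_rate = 300_000                    # assumed rows/sec, fully cached
on_disk_rate = disk_lookups_per_sec()       # ~100 rows/sec
gap = in_memory_rate / on_disk_rate         # the "3,000 times" difference
```

The exact per-IO latency depends on the drive, which is why the article hedges that a 100+ times gap is the more common real-world observation.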
Indexes
What everyone knows about indexes is that they are good for speeding up access to the database. Some people would also remember that whether indexes are helpful or not depends on index selectivity: how large a proportion of rows matches a particular index value or range. What is often forgotten is that the selectivity at which an index becomes a benefit depends on whether the workload is cached or not. In fact, even the MySQL optimizer currently does not take this into account. For an in-memory workload, index access might be faster even if 50% of rows are accessed, while for disk-IO-bound access we might be better off doing a full table scan even if only a few percent of rows are accessed.
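A toy cost model makes the cached vs. uncached crossover concrete. The cost units and constants here are assumptions for illustration (a page holds ~100 packed rows; a cached index lookup is far cheaper than a page read; an uncached lookup is a random IO costing roughly 100 sequential page reads):

```python
# Toy cost model for the index-vs-full-scan tradeoff described above.
# Costs are in "sequential page read" units; constants are assumptions.

def full_scan_cost(total_rows, rows_per_page=100):
    # one sequential page read per ~100 packed rows
    return total_rows / rows_per_page

def index_access_cost(total_rows, selectivity, cached):
    matched = total_rows * selectivity
    # cached lookup: tiny fraction of a page read; uncached: a random IO
    per_row = 0.01 if cached else 100.0
    return matched * per_row

rows = 1_000_000
# In memory, the index wins even at 50% selectivity...
cached_wins = index_access_cost(rows, 0.5, cached=True) < full_scan_cost(rows)
# ...while disk-bound, the full scan wins even at 1% selectivity.
scan_wins = full_scan_cost(rows) < index_access_cost(rows, 0.01, cached=False)
```

With these assumed constants, both comparisons come out the way the paragraph claims; the real crossover point depends on hardware and row size.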
Let's do some computations again. Consider a table which has 100-byte rows. With a decent SCSI drive we can get a 100MB/sec read speed, which gives us about 1,000,000 rows per second for fully sequential access of jam-packed rows, quite a possible scenario for MyISAM tables. If we take the same hard drive for a fully IO-bound workload, it will be able to provide just 100 row lookups by index per second. The difference is 10,000 times for our worst-case scenario. It might not be that bad in practice, but again, it is not hard to reach a 100 times difference.
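Spelled out, the sequential-vs-random arithmetic from the paragraph above is just division (assuming the stated 100 MB/sec drive, 100-byte rows, and ~100 random index lookups per second):

```python
# Sequential scan vs. random index lookup throughput on the same drive,
# using the assumed numbers from the text.

row_size_bytes = 100
seq_bytes_per_sec = 100 * 10**6                 # assumed 100 MB/sec drive
seq_rows_per_sec = seq_bytes_per_sec // row_size_bytes   # 1,000,000 rows/sec
random_rows_per_sec = 100                       # random index lookups/sec
ratio = seq_rows_per_sec // random_rows_per_sec # the "10,000 times" worst case
```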
Here is a little illustration. I've created a table with over 30 million rows. The "val" column in this table has 10,000 distinct values, so the range 1..100 selects about 1% of the table. The times for a full table scan vs. a range scan by index:
mysql> select count(pad) from large;
+------------+
| count(pad) |
+------------+
|   31457280 |
+------------+
1 row in set (4 min 58.63 sec)
mysql> select count(pad) from large where val between 1 and 100;
+------------+
| count(pad) |
+------------+
|     314008 |
+------------+
1 row in set (29 min 53.01 sec)
Also remember: not all indexes are created equal. Some indexes may be stored in sorted order while others have their pages placed in random locations, which can affect index scan/range scan speed dramatically. The rows referenced by an index could also be located sequentially or require random IO when index ranges are scanned. There are also clustered keys in InnoDB, which combine index access with data access, saving you IO for completely disk-bound workloads.
There are certain optimizations in the works which would improve the performance of index accesses/index scans. For example, retrieving index values first and then accessing rows in sorted order can help a lot for big scans. This will reduce the gap, but I doubt it will be closed.
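The "retrieve index values first, then access rows in sorted order" idea can be sketched simply. This is a hypothetical simulation, not MySQL code (a feature along these lines, Multi-Range Read, did appear in later MySQL versions): sorting the row positions before fetching turns scattered seeks into a mostly forward-moving pass over the table.

```python
# Sketch of the sorted-access optimization described above. The table is
# simulated as a dict keyed by on-disk row position; on real hardware,
# visiting positions in sorted order converts random IO into near-sequential IO.

def fetch_rows_sorted(table, row_ids):
    # sort the ids collected from the index, then fetch in on-disk order
    return [table[rid] for rid in sorted(row_ids)]
```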
Joins
Joins are used to compose a complex object which was previously normalized into several tables, or to perform complex queries finding relationships between objects. A normalized structure and a lot of joins is the right way to design your database, as the textbooks teach you... but when dealing with large data sets it can be a recipe for disaster. The problem is not the data size (normalized data normally becomes smaller) but the dramatically increased number of index lookups, which may be random accesses. This problem exists for all kinds of applications, though for OLTP applications with queries examining only a few rows it is less of a problem. Data retrieval, search, DSS, and business intelligence applications, which need to analyze a lot of rows and run aggregates, are where this problem is most dramatic.
Some joins are also better than others. For example, a star join with small dimension tables would not slow things down too much. On the other hand, a join of a few large tables, which is completely disk-bound, can be very slow.
One of the reasons elevating this problem in MySQL is the lack of advanced join methods at this point (the work is on the way): MySQL can't do hash joins or sort-merge joins; it can only do the nested loops method, which requires a lot of index lookups which may be random.
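The nested loops method is easy to sketch. This is a minimal illustration, not MySQL internals: the inner table's index is simulated with a dict, and the key point is that there is one index probe per outer row, each of which becomes a random IO when the inner table does not fit in memory.

```python
# Minimal sketch of a nested-loop join as described above: scan the outer
# table once, probing the inner table's "index" (a dict here) for each row.

def nested_loop_join(outer_rows, inner_index, key):
    result = []
    for row in outer_rows:                    # one pass over the outer table
        match = inner_index.get(row[key])     # one index probe per outer row
        if match is not None:
            result.append({**row, **match})   # merge matching rows
    return result
```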
Here is a good example. As we saw, my 30 million row (12GB) table was scanned in less than 5 minutes. Now if we do an equality join of that table to another 30 million row table, the access will be completely random. We'll need to perform 30 million random row reads, which gives us 300,000 seconds at a 100 rows/sec rate. So we would go from 5 minutes to almost 4 days if we need to do the join. Some people assume a join would be close to two full table scans (as 60 million rows need to be read); this is way wrong.
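The join arithmetic above, written out with the article's assumed 100 rows/sec random-read rate:

```python
# Cost of the disk-bound nested-loop join from the example above.

rows = 30_000_000
random_rate = 100                        # rows/sec, single disk-bound thread
join_seconds = rows / random_rate        # 300,000 seconds of random reads
join_days = join_seconds / 86_400        # roughly 3.5 days, vs. a 5-minute scan
```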
Do not take me as arguing against normalization or joins. It is a great principle and should be used when possible. Just do not forget about the performance implications when designing the system, and do not expect joins to be free.
Finally, I should mention one more MySQL limitation which requires you to be extra careful when working with large data sets. In MySQL, a single query runs as a single thread (with the exception of MySQL Cluster), and MySQL issues IO requests one by one during query execution, which means that if single-query execution time is your concern, many hard drives and a large number of CPUs will not help. Sometimes it is a good idea to manually split a query into several, run them in parallel, and aggregate the result sets.
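The manual split-and-aggregate idea can be sketched like this. The `run_chunk` function here is a hypothetical stand-in for executing something like `SELECT COUNT(*) FROM t WHERE id BETWEEN lo AND hi` over a real connection; the structure (split the key range, run chunks in parallel threads, sum the partial results) is the point.

```python
# Sketch of splitting one big query into id ranges run in parallel and
# merging the partial results. run_chunk is a placeholder for a real query.

from concurrent.futures import ThreadPoolExecutor

def run_chunk(lo, hi):
    # placeholder: pretend every id in [lo, hi] is one matching row
    return hi - lo + 1

def parallel_count(max_id, chunks=4):
    step = max_id // chunks
    ranges = [(i * step + 1, (i + 1) * step if i < chunks - 1 else max_id)
              for i in range(chunks)]
    with ThreadPoolExecutor(max_workers=chunks) as pool:
        partials = pool.map(lambda r: run_chunk(*r), ranges)
    return sum(partials)
```

Note that this only pays off when the server has spare IO capacity; each chunk still issues its IO requests one by one within its own thread.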
So if you're dealing with large data sets and complex queries, here are a few tips:
Try to fit the data set you're working with in memory. Processing in memory is so much faster, and you get a whole bunch of problems solved just by doing so. Use multiple servers to host portions of the data set. Store the portion of data you're going to work with in a temporary table, etc.
Prefer full table scans to index accesses. For large data sets, full table scans are often faster than range scans and other types of index lookups. Even if you look at 1% of rows or less, a full table scan may be faster.
Avoid joins to large tables. Joining large data sets using nested loops is very expensive. Try to avoid it. Joins to smaller tables are OK, but you might want to preload them into memory before the join so there is no random IO needed to populate the caches.
With proper application architecture and table design, you can build applications operating with very large data sets based on MySQL.
Filed Under: Insight for DBAs, Production, Tips
About Peter Zaitsev
Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.