WEBVTT

00:00.000 --> 00:11.400
Hello, I am Marianne, I work in Cust, and today I will present my work done in GCC, which

00:11.400 --> 00:19.000
is already merged into upstream, its CRC detection and optimization.

00:19.000 --> 00:28.000
CRC, or cyclic redundus, is used to detect errors in digital digital data transmission,

00:28.000 --> 00:35.000
and CRC generated checks based on the input data.

00:35.000 --> 00:46.520
There are many CRC implementations in software, for example, Bitwise method, which is very

00:46.520 --> 01:00.520
slow, careless multiplication based method, and table based method.

01:00.520 --> 01:05.520
And there are many projects that use Bitwise CRC implementation, which I already taught

01:05.520 --> 01:12.520
with work slowly, and here are some examples from real-world projects.

01:12.520 --> 01:18.520
And so, as I already told with Bitwise is slow, we want to detect Bitwise implementations

01:18.520 --> 01:25.240
in projects, and when replaced with faster ones, for example, if the hardware already has

01:25.240 --> 01:32.560
CRC instruction, we will use it, or not if it has careless multiplication instruction,

01:32.560 --> 01:38.520
use careless multiplication instruction, if neither of those instructions are supported.

01:38.520 --> 01:47.520
We generate table based CRC, and how we detect CRC's Bitwise CRC's.

01:47.520 --> 01:54.520
It consists of two main stages, first is initial checks, and the second is verification.

01:54.520 --> 02:02.520
So, firstly, we do some simple checks to filter out known CRC's, and afterwards, we

02:02.520 --> 02:08.520
verify those by using symbolic execution.

02:08.520 --> 02:17.520
Now, what we do in initial checks, here I will slightly say what we do, but actually we do more deep

02:17.520 --> 02:21.520
and complex analysis.

02:21.520 --> 02:33.520
We detect the loop, which iteration is 8, 16, 42, or 64, in this case it's 16, and it must be the

02:33.520 --> 02:34.520
innermost loop.

02:34.520 --> 02:39.520
Also, contain if condition and under it, accelerate operation.

02:39.520 --> 02:44.520
Also, there must be a shift operation, in this case shift left we have.

02:45.520 --> 02:55.520
So, we do some kind of checks, after what if two criteria are okay, we verify those.

02:55.520 --> 03:03.520
All these things are done on-game representation, for what I have added separate paths.

03:03.520 --> 03:13.520
So, here is broke the green presentation of this loop, what we have here, and afterwards comes verification.

03:13.520 --> 03:24.520
So, to verify what this is really CRC, we create a LFSR for the polynomial, symbolically execute the loop,

03:24.520 --> 03:32.520
and when check what LFSR and the output of the execution match.

03:32.520 --> 03:37.520
So, what is polynomial, XCRC has its polynomial.

03:37.520 --> 03:42.520
So, CRC is the remainder of a modular to polynomial division of the data.

03:42.520 --> 03:49.520
In this case, we have this polynomial, and polynomials are represented by binary numbers.

03:49.520 --> 03:57.520
Here we have the coefficients, where we have one, here we have one, and our values are zeros,

03:57.520 --> 04:02.520
and in the example above, leading one is omitted.

04:02.520 --> 04:08.520
So, there is use this value, but it is the polynomial.

04:09.520 --> 04:14.520
And what is LFSR? LFSR stands full in your feedback sheet register.

04:14.520 --> 04:24.520
LFSR is a register, where the input depends on the linear function of its previous text,

04:24.520 --> 04:28.520
and with LFSR we can calculate CRC.

04:28.520 --> 04:33.520
Here, this LFSR calculates CRC with the following polynomials.

04:33.520 --> 04:37.520
In the places where coefficients is one, we have XOR operation.

04:37.520 --> 04:47.520
Here, just this one is used only for first one, and the second.

04:47.520 --> 04:56.520
And how we create LFSR structure for us, we execute the loop, firstly we have some specific values,

04:56.520 --> 05:02.520
to extract the polynomial, because in CRC it is not always obvious,

05:02.520 --> 05:08.520
which is the polynomial, and when create LFSR, using that extracted information.

05:08.520 --> 05:14.520
Also, during initial checks, we extract some more information about CRC, its size,

05:14.520 --> 05:25.520
shift direction, and so on, and here is created LFSR structure for the above example.

05:26.520 --> 05:32.520
Here is used the following polynomial, it is in binary representation.

05:32.520 --> 05:41.520
And here we have shift left, because of that we got this LFSR, here we gave symbolic values,

05:41.520 --> 05:48.520
and in square braces, we have the fit position.

05:49.520 --> 06:01.520
So here we have CRC 15, if it shifts left, it will come to a front, and CRC 0 is moved one bit left.

06:01.520 --> 06:07.520
So, after doing all these things, we execute the loop, fifth symbolic values.

06:07.520 --> 06:15.520
For example, here CRC doesn't have any concrete value, because of that, we gave it some symbolic value,

06:15.520 --> 06:23.520
and execute the loop on a bit level, because of it, I gave a number of positions.

06:23.520 --> 06:29.520
And after that, we track each variable on each part.

06:29.520 --> 06:40.520
So, after this CRC got its symbolic values, after what we calculate CRC 9 variables value,

06:40.520 --> 06:49.520
which is shifted back one bit, so we got 0, and after what CRC 14, 0, and so on.

06:49.520 --> 06:59.520
When we calculate the value of variable 7, and here comes the conditional, if so, after what we have two branches,

06:59.520 --> 07:05.520
as we don't know the value of 7, we will execute both branches.

07:05.520 --> 07:14.520
So, we duplicate the state, create a new state for the false branch, and first we can go by true branch,

07:14.520 --> 07:17.520
and afterwards continue from the false branch.

07:17.520 --> 07:21.520
So, for the true branch, we keep the condition.

07:21.520 --> 07:31.520
So, if 7 is less when 0, it means that the last bit is 1, and in case it's positive, it will be 0.

07:31.520 --> 07:40.520
Because of it, we hear it, but last bit equals to 1, and for the false branch, last bit equals to 0.

07:40.520 --> 07:50.520
After what, for true branch, we will calculate CRC 10th value, and continue.

07:50.520 --> 08:03.520
And here, this CRC is for us interesting. We get its value, and we also have what CRC 15 equals to 1.

08:03.520 --> 08:12.520
Also, this is our final state for true branch, and we also continue for false branch.

08:12.520 --> 08:22.520
For the false branch, we will have that CRC 1, so while you is the following, and the condition is what CRC 15 equals to 0.

08:22.520 --> 08:30.520
After the execution, after one iteration, we have two states and variable values.

08:31.520 --> 08:37.520
After what, we check what those conditions will be with God, much the LFS are what we created.

08:37.520 --> 08:43.520
Above is the LFS are what we have created, and these two are those what we got.

08:43.520 --> 08:53.520
So, for true branch, we had what CRC 15 equals to 1, so if we put it in there, we will have here 1, CRC 0, etc.

08:53.520 --> 09:09.520
After what CRC 4X or 1, and so on, and as you can see, it marks. For the false branch, it marks 2, because if we put 0, instead of CRC 15, we will have 0, CRC 0, and so on.

09:09.520 --> 09:14.520
So, we got what the CRC what we found is real CRC.

09:15.520 --> 09:22.520
And, after what, we replace what we found, we foster CRC versions.

09:22.520 --> 09:39.520
For what, we instead of for loop, put internal function, it may be table based implementation, cell move based or direct CRC based on the hardware.

09:40.520 --> 09:53.520
Here is how we generate cell move based. It's in assembly, but here I show in some kind of our way, so it is more understandable.

09:53.520 --> 10:03.520
Here is how we generate table based, and about the result.

10:04.520 --> 10:19.520
We have verified free Bitcoin CRC implementation in Linux kernel. Here is one example, but also we verified 19 Bitwise implementations in Fedora packages.

10:19.520 --> 10:26.520
Here are also some examples.

10:26.520 --> 10:37.520
Also, we checked edit 24 CRC cases, and 12 not CRC cases, to check how good our detection works.

10:37.520 --> 10:48.520
So, the true positive rate is 21, false negative is free, but false positive is 0.

10:48.520 --> 11:06.520
So, we implemented our algorithm in such way that we don't get any false positives, because if we change wrong CRC, so if it's not CRC and we change it, the code will work not correctly.

11:06.520 --> 11:16.520
Because of what we have false negative, so we don't find all cases, but what we find is real CRC.

11:16.520 --> 11:28.520
Also, we have tested how fast works, those optimized versions. Here is test on x8664.

11:28.520 --> 11:57.520
We ran CRC FFF times to check how fast it works. So, we here gave out many different CRC with different sizes and different data sizes. Here is not all the cases, but in average it gave 54% improvement for the table based generation, and 43% for sale based generation.

11:57.520 --> 12:12.520
So, here is the time for bitwise how long it works, how long works table based, how long worked sale move based. For example, in this case we got 91% improvement.

12:12.520 --> 12:33.520
Also, we checked on risk 5. Here we got 66% improvement on table based generation on ARxistic for we got 50% improvement.

12:33.520 --> 12:52.520
And here are examples for CRC 42 instruction. In this example, we didn't have CRC 42 instruction how was working. Here are some other tests, but use CRC 42 instructions polynomial.

12:52.520 --> 13:19.520
And here are table based CRC 42 instructions. If you can see CRC 42 works the fast test. And here are results for ARxistic for, and here we got 93% improvement. But for table based it's slow down.

13:19.520 --> 13:32.520
And here is average improvement for free ARxitectures and the promise for free cases.

13:32.520 --> 13:42.520
Also, we have published an article. There is written more deeply how everything works.

13:42.520 --> 13:52.520
Thank you.

13:52.520 --> 13:58.520
Did you work on detecting other types of CRC algorithms such as the table based and?

13:58.520 --> 14:12.520
No, we worked only bitwise CRC. If there is a CRC instruction, the 9 bitwise implementation ends up being the best, which is the same.

14:12.520 --> 14:28.520
It's hardware instruction and it's implemented in better way. So if you replace whole for loop with one instruction, it works better. But for table based and CRC 2, more operations. So it takes more time.

14:28.520 --> 14:42.520
So hardware CRC instruction works faster and to swear software implementation, what I show you.

14:42.520 --> 14:54.520
So how is this enabled? This element would be just special. And the second is this on the jibble.

14:54.520 --> 15:03.520
Okay, first question. How it's enabled? There is minus f CRC flag, but enables. And also it's enabled by default.

15:03.520 --> 15:16.520
Starting from all two optimizations. And here are two parts, first part for the detection part. I added a pass, separate pass.

15:16.520 --> 15:45.520
It works on gimpon. And after the replacement part is vacant. So we put internal function and for x808664 and ARx64 and bit risk 5. I added expender. So with expender, I generate either table based, either cell-mull based or direct.

15:45.520 --> 16:11.520
CRC instruction. But CRC is only for 32 bit. Now hardware has support. So CRC 42 instruction, not always we can use, because it's polynomial is also specific. So we check if it is what CRC we use CRC in other ways, we use a mobile based or table based.

16:11.520 --> 16:35.520
Okay. So you show statistics by the IRC RL for this process. But the whole population is only 23 things. And just certainly not enough to have a pass like this. We need to act for proof that it will not do the whole thing.

16:35.520 --> 16:50.520
The proof is that the algorithm is written in such a way that there can be false positive, because it comes from the algorithm.

16:50.520 --> 16:57.520
So my question is, what are the invitations on the debug view for here?

16:58.520 --> 17:05.520
The debug information is lost.

17:05.520 --> 17:10.520
No.

17:10.520 --> 17:19.520
I'm curious why the detection not work for those three cases that we show. The false negatives.

17:19.520 --> 17:39.520
Yes, there are some cases where we don't find there is limitations. For example, if the implementation used the polynomial, where leading one is not emitted, we can detect, because for example, if you use CRC 8 to keep the polynomial with one.

17:39.520 --> 17:47.520
We didn't beat, you will use 9 bit. Yes, because what you will go into for example, 16 bit memory.

17:47.520 --> 17:57.520
And for what we can't know exactly, you wanted to calculate 16 bit CRC or 8 bit CRC with leading one polynomial, for example.

17:57.520 --> 18:03.520
And there are some other cases where we don't detect. Now that was one example.

18:03.520 --> 18:10.520
You want to live home for future invitations.

18:33.520 --> 19:02.520
In some cases, table based work slower when they are moved based. It comes from the implementation. Here we have good persons, but for example, in these cases we got better results.

19:02.520 --> 19:19.520
It comes because here CRC 42 is a bit backwards CRC. So in that case, we have to reflect input value and reflect output value and more times goes to reflect those values.

19:19.520 --> 19:32.520
Because of that, it worked slower. But if you use some, for example, instructions where we will reflect very fastly, better algorithm to reflect the value, it will work faster.

19:32.520 --> 19:40.520
Just now the implementation's use not what fast method.

19:40.520 --> 19:47.520
Thank you.

