Using Debug Dialog to Troubleshoot BOM Bug in Init Text

Desmanto · Post by **Desmanto** » 26 Feb 2018 15:55

Prologue
I am currently building my speech flow project, specifically to show off Automagic. The core of this project requires a large database that stores two layers of maps with a third layer of lists. So it is a nested map and list (map > map > list).

Code:

db = newMapFromValues(
"English", newMapFromValues(
"good (.*)", newList("Good %1", "%1 too", 'Hi, "Good" %1 too'),
"tell them the secret", newList("Your wish is my command!", "As you wish") ),
"Japanese", newMapFromValues(
"おはよう",newList("おはようございます")),
"Chinese", newMapFromValues(
"新年快樂", newList("新年快樂紅包拿來") ));

Choosing JSON as database format
I have proof-tested the concept: I made a test map/list and ran the flow against it, and it works well. Now I need to store the database somewhere, so I can initialize it. Storing it in a script is not efficient, as the structure becomes very ugly and unreadable after adding so many elements. The best medium so far is JSON, because it can deal with nested maps and lists. XML could also be used, but Automagic currently only supports basic XPath, and evaluateXPathAsString() is also quite slow when used in a loop. JSON is simpler and more effective for this use case.

Code:

js = toJSON(db, true)

Testing the JSON
I tested saving the toJSON() output of the database to a file and then reinitializing it to see if the result is the same. First I write the file, then I load it again with another Init Variable Text File and check the result: both are the same. The good news is that JSON supports non-alphabet characters as well. I typed some Chinese/Japanese characters and it still works well.
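For readers who want to try the same round-trip check outside Automagic, here is a minimal Python sketch. It is only an illustration: the dictionary is a hypothetical, shortened stand-in for my database, and Python's `json` module plays the role of toJSON()/fromJSON().

```python
import json

# Hypothetical, shortened stand-in for the map > map > list database.
db = {
    "English": {"good (.*)": ["Good %1", "%1 too"]},
    "Japanese": {"おはよう": ["おはようございます"]},
    "Chinese": {"新年快樂": ["新年快樂紅包拿來"]},
}

# ensure_ascii=False keeps the Chinese/Japanese characters readable in the text.
text = json.dumps(db, ensure_ascii=False, indent=2)

# Parse the text back and compare with the original structure.
restored = json.loads(text)
print(restored == db)
```

If the comparison is true, the nested structure and the non-alphabet characters survived the trip through JSON text.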

Building database in excel
After choosing the format and confirming it works properly, I proceeded to build my own JSON directly from Excel 2010. I made formulas that concatenate the cells in Excel and output valid JSON that can be used by Automagic. I even mimicked the indentation exactly as Automagic outputs it when saving with toJSON().

After the Excel sheet was done, I started inputting more responses and keywords into the database, then output the database to a text file (in JSON structure) to be initialized by Automagic later. As long as I didn't use any non-alphabet characters, the text saved fine using Notepad. But when I started to use Chinese/Japanese characters, Notepad warned me that I can't save them using ANSI encoding. If I ignore the warning, all Chinese/Japanese characters are converted to question marks (?), and the text is gone. So I have to choose to save as UTF-8, which is also the default encoding of Init Variable Text File in Automagic. The file saved fine, and after reopening it the characters are still there.
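The question mark effect is easy to reproduce. A small Python sketch, assuming code page cp1252 as the "ANSI" encoding (the actual ANSI code page depends on the Windows language settings):

```python
text = "新年快樂"

# Assumption: cp1252 stands in for Notepad's "ANSI" encoding.
# Characters the code page cannot represent are replaced by "?".
ansi_bytes = text.encode("cp1252", errors="replace")

# UTF-8 can represent every character, so the text survives a round trip.
utf8_bytes = text.encode("utf-8")

print(ansi_bytes)
print(utf8_bytes.decode("utf-8") == text)
```

Once the characters have been replaced, the original text cannot be recovered from the ANSI file, which is why saving as UTF-8 is the only option here.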

A problem arises
Now here comes the problem. When I load json notepad.txt, which was saved as UTF-8 in Notepad, using Init Variable Text File, it loads fine. I can see that file_text contains exactly the content of the txt. But when I use fromJSON(), it gives nothing. db is just a blank string, with no map or list inside.

Code:

db = fromJSON(file_text);

With the same content output by Automagic itself, it works fine. All maps and lists are populated properly, as they should be. Why does the JSON file from Automagic work well, while the one from Notepad and Excel doesn't? So here comes the troubleshooting. I am going to show you how I use the debug dialog to go through the whole troubleshooting process and find the culprit when something doesn't work.

Attachment: 1. Debug.png (181.06 KiB)

Test 1
I use Init Variable Text File on both files, Notepad (file_text_note) and Automagic (file_text_am), and check the result in the debug dialog. At a glance you can see that file_text_note and file_text_am have exactly the same content (so why didn't the Notepad one work?). I was very frustrated and curious at the same time, because I had spent several days creating all the formulas in Excel and now the output JSON didn't work. At first, I still thought something was wrong with the Excel sheet, because during the process there was a bug with double quotes, which can be solved by pasting the text into Microsoft Word first (I have sorted that out). But the output file appears to be exactly the same, so I didn't know what was wrong here.

Test 2
Now I think the two variables must differ somewhere, which is why one works and the other doesn't. I add a new line of script to check whether they are the same.

Code:

a = file_text_am == file_text_note;

The result in the debug dialog: a = false. So this confirms there must be some invisible character in one of the files. Now that I know they are different, I proceed to check their file sizes.

Test 3
The Notepad JSON is 328 bytes, while the AM JSON is 325 bytes, smaller by 3 bytes. So Notepad must have added something to the text file before saving. With ANSI encoding (before I added the Chinese/Japanese characters), the sizes were the same and the whole text file was identical (MD5 said so too). So the extra bytes must have something to do with the UTF-8 version. I try to confirm it in script.

Code:

b = length(file_text_am);
c = length(file_text_note);

AM version: b = 275. Notepad version: c = 276. What!!! The Notepad file is bigger by 3 bytes, so how can it be longer by only one character when checked using length()? Is it possible that a single character is 3 bytes long?
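The short answer is yes: in UTF-8, one character can take more than one byte. This also explains why the AM file is 325 bytes but only 275 characters long. A minimal Python sketch of the difference between counting characters and counting bytes (Python's `len()` on a string counts characters, like length() in my script):

```python
ascii_text = "good"
japanese_text = "おはよう"

# len() on a string counts characters.
ascii_chars = len(ascii_text)
japanese_chars = len(japanese_text)

# len() on the encoded bytes counts what actually ends up in the file.
ascii_bytes = len(ascii_text.encode("utf-8"))
japanese_bytes = len(japanese_text.encode("utf-8"))

print(ascii_chars, ascii_bytes)        # same count for plain ASCII
print(japanese_chars, japanese_bytes)  # 3 bytes per hiragana character
```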

Test 4
I know that the Notepad JSON is bigger by 3 bytes and longer by 1 character. If I use the replace() function on the Notepad version and replace the whole AM version with a blank string, then only the differing character will remain. The logic: if notepad = "x123456" and am = "123456", then replacing "123456" with a blank string leaves "x". That is how I can find the culprit, using this script:

Code:

d = replace(file_text_note, file_text_am, "");

Checking with the debug dialog, d = '' (it looks like a blank string, the content seems invisible). What! (again) OK, something just crossed my mind.

Test 5
Actually I am not that surprised anymore, since I know this mysterious character must be whitespace or some other invisible character. We can confirm it is not an empty string by using

Code:

d = replace(file_text_note, file_text_am, "");
e = isEmpty(d);

The result is e = false, which means it is not empty. Now comes the next problem: if it is not empty and I can't see the content (since it is invisible), how can I check the content? In the real world, I could throw some paint at it, so I can see who this "invisible man" is.
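Tests 4 and 5 can be replayed in a few lines of Python. This is only a sketch: the zero-width space is a stand-in I chose for "some invisible character", and the JSON text is a hypothetical one-liner.

```python
# Toy version of the idea: remove the known-good text, keep the leftover.
leftover_toy = "x123456".replace("123456", "")

# Assumption: a zero-width space stands in for the unknown invisible character.
invisible = "\u200b"
am_text = '{"key": ["value"]}'
note_text = invisible + am_text

# Same trick on the "files": whatever remains is the difference.
d = note_text.replace(am_text, "")

print(leftover_toy)
print(d == "")   # it looks blank when printed, but it is not an empty string
print(len(d))
```

The leftover prints as nothing visible, yet it has a length of one character, which matches what the debug dialog showed me.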

Test 6
I know I must look at the raw data, the hex representation. But Automagic doesn't have hex support yet. I could use Java to convert it, but that doesn't seem suitable here. I glance through all the available functions, searching for "hex", "char", "code" and, wait, I remember encodeURL(). It converts any illegal character to its %hh (hexadecimal) form. So I just use encodeURL() on the mysterious invisible character.

Code:

d = replace(file_text_note, file_text_am, "");
f = encodeURL(d);

The result is f = %EF%BB%BF. So I finally know it is a single character that is 3 bytes long: EF BB BF. I can confirm this by double-checking with a hex editor app: opening the text file from Notepad, I can see these 3 bytes at the beginning of the Notepad version, and they don't exist in the AM version.
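The same percent-encoding trick works in other environments too. A minimal Python sketch, where `urllib.parse.quote` plays the role of encodeURL() and the three bytes from the hex editor are decoded as UTF-8:

```python
from urllib.parse import quote

# The three bytes seen in the hex editor, decoded as UTF-8.
mystery = b"\xef\xbb\xbf".decode("utf-8")

# One character in the string, three bytes in the file.
char_count = len(mystery)
byte_count = len(mystery.encode("utf-8"))

# Percent-encoding makes the invisible character visible as %hh groups.
encoded = quote(mystery)
print(char_count, byte_count, encoded)
```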

Googling Time
Time to google. Using EF BB BF as the keyword leads directly to a Wikipedia article explaining the BOM (Byte Order Mark). After reading for a while, I know it is there to mark the file as UTF-8.
https://en.wikipedia.org/wiki/Byte_order_mark

The second article, from Stack Overflow, explains it in more detail, especially when dealing with JSON. But I haven't read it yet (before I started typing this thread).
https://stackoverflow.com/questions/222 ... ithout-bom

Now I know why Notepad puts the BOM at the start of the txt file when I choose to save as UTF-8. Notepad does this by default, just as Windows uses CR LF as the line separator. Android doesn't need the BOM to identify UTF-8, and having it may break scripts (as I experienced here). If saved in ANSI format, Notepad doesn't add the BOM. But I need UTF-8, since I have Chinese/Japanese characters in the text. So I need to remove the 3 bytes of the BOM if I choose to use the Notepad version. And of course I have to, since managing the database in Excel is much easier and better organized.
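Automagic is not the only place where a leading BOM breaks JSON parsing. As an illustration outside Automagic, Python's `json` module refuses a string that starts with the BOM character. The JSON text below is a hypothetical one-liner:

```python
import json

am_text = '{"greeting": ["おはよう"]}'

# Simulate the Notepad file: the same text with the BOM character in front.
note_text = "\ufeff" + am_text

# The clean text parses fine.
parsed_am = json.loads(am_text)

# The text with the BOM in front is rejected by the parser.
try:
    json.loads(note_text)
    bom_rejected = False
except json.JSONDecodeError:
    bom_rejected = True

print(parsed_am, bom_rejected)
```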

Test 7
Now I know I need to remove these 3 bytes. But how? I tried replace(), but I can only replace visible characters. I tried findAll() and replaceAll() with regex, but I can't find the EFBBBF bytes. Using \u00EF\u00BB\u00BF gives the Unicode characters 00EF 00BB 00BF, not the bytes EF BB BF. I tried other strings that might refer to EF BB BF, but comparing them against d still gives false. The only way I can confirm the content of d is encodeURL(). However, encodeURL() won't replace the original bytes in file_text. I could probably use encodeURL(), replace, then decode back to the original. That seems troublesome and could cause problems with the database after conversion.

Code:

d = replace(file_text_note, file_text_am, "");
f = encodeURL(d);

g1 = findAll(d, "EFBBBF");
g2 = replace(d, "EFBBBF", "");
g3 = "\u00EF\u00BB\u00BF";
g4 = d == "\u00EF\u00BB\u00BF";
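Why the \u00EF\u00BB\u00BF attempt fails can be shown in Python. EF BB BF are the UTF-8 bytes of one character, U+FEFF, while \u00EF\u00BB\u00BF is three separate characters. This sketch is a Python illustration only; I have not checked here whether an Automagic string literal accepts "\uFEFF".

```python
# Three separate characters: U+00EF, U+00BB, U+00BF.
literal = "\u00EF\u00BB\u00BF"

# The BOM as it arrives in the variable: bytes EF BB BF decoded as UTF-8.
bom = b"\xef\xbb\xbf".decode("utf-8")

print(len(literal), len(bom))    # different character counts
print(literal == bom)            # so the comparison fails
print(literal.encode("utf-8"))   # the three characters become six bytes in UTF-8
print(bom == "\ufeff")           # the BOM is the single character U+FEFF
```

So the escape that matches the invisible character is the single code point U+FEFF, not the three byte values written as code points.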

Test 8
After trying for a while (probably around 30 minutes), I realized that the 3 bytes are only one character long. So if I use substring() and start from the second character (index 1), I get the whole correct JSON without the BOM (skipping the first 3 bytes). Or I can use left() to take only one character, which gives me the first 3 bytes (invisible), which can then be used in a replace. So I have found a way to remove it. I simply choose substring(), since it is a single line. Then I convert the JSON back to a map. (From now on, I only init the text file from Notepad and save it to file_text.)

Code:

file_text = substring(file_text, 1);
db = fromJSON(file_text);

Finally, I got rid of the BOM and can convert the JSON from Excel + Notepad back to the original map > map > list in Automagic. Hooray! Oh, wait....

Test 9
Another problem arises. I plan to maintain the primary database in Excel, which means most of the time I should use substring() to skip the first 3 bytes. However, I also created another way to input into the database directly, for when I don't have access to my PC. The file created by Automagic won't have the BOM. So if I skip the first character using substring(), it will strip out the opening curly brace, breaking the whole JSON. So I need a way to check whether the current text file has the BOM (saved from Notepad) and then remove it. If not (saved from Automagic), do nothing.

Since the BOM itself is the first character of the file, I can simply use left() to get it, then use encodeURL() to convert it to %hh format. This converted form can now be checked against the BOM. If it matches, strip out the BOM. If not, do nothing.

Code:

if(encodeURL(left(file_text,1)) == "%EF%BB%BF")
  file_text = substring(file_text,1);

Now my text database file is safe. I can edit it in Excel, save it with Notepad and copy it to my phone, or I can edit it directly using another flow and save it. Neither will break my flow, as these 2 lines check for the BOM and strip it out when needed.
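The same "strip only if present" logic, sketched in Python for comparison. The function name `strip_bom` and the sample JSON are hypothetical, introduced only for this illustration:

```python
import json


def strip_bom(text: str) -> str:
    """Return text without a leading BOM character, if there is one."""
    # Only drop the first character when it really is the BOM (U+FEFF),
    # so a clean file keeps its opening curly brace.
    return text[1:] if text.startswith("\ufeff") else text


am_text = '{"Japanese": {"おはよう": ["おはようございます"]}}'
note_text = "\ufeff" + am_text

# Both versions end up as the same parsed structure.
db_from_am = json.loads(strip_bom(am_text))
db_from_note = json.loads(strip_bom(note_text))
print(db_from_am == db_from_note)
```

Checking before stripping is the important part: an unconditional "skip the first character" would damage every file that has no BOM.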

Debug Dialog in troubleshooting
As you can see, in every test I use the debug dialog to check the result. I really don't remember how many times the debug dialog has helped me through problems that seemed almost impossible to check. That's why, when something doesn't work, the debug dialog will be your best friend.

Attachment: 2. Example.png (363.08 KiB)

I actually encountered this problem last month, but could only fully document it now. The process described is not exactly what happened; I may have done other tests that I forgot to include here. Don't be scared off by the long process, as the whole troubleshooting actually took around 1 hour, and that is because I was doing other things at the same time. If I had focused, maybe 15 minutes would have been enough. In fact, documenting this took much longer (probably 5-6 hours of focus, not including reviewing). I also wanted to use this example to show how powerful the debug dialog is. That's why I could only finish it now.

In your daily debugging, when your flow is not working the way you want, your troubleshooting may only require 1 or 2 tests that take no longer than a minute to spot the problem. In this case I was dealing with a "Mysterious Invisible Character", so it took longer. I don't encounter this kind of long troubleshooting (1 hour) every day. That's why I decided to use it as the example case for troubleshooting with the debug dialog. Most of my other troubleshooting cases are very fast and too boring, probably not even worth mentioning here.

It should be the same with your daily flow troubleshooting: most problems can be solved easily using the debug dialog.

Test Flow to check

Attachment: Bug BOM.zip (2.74 KiB)

As usual, I have attached the example flow and the text files for the AM and Notepad versions, so you can follow the logic here. Since there are 3 files in total, I zipped them into a single attachment. Extract BOM bug.zip and you will have 3 files:
1. BOM bug.xml (the flow)
2. json AM.txt (proper JSON file exported from Automagic using Write File)
3. json notepad.txt (JSON from Notepad, after it was opened and resaved in Notepad, with the additional 3-byte BOM header).

Put json AM.txt and json notepad.txt in /storage/emulated/0/Automagic/Speech/, then import the flow_Bug_BOM.xml flow. I have separated and copied the elements and labeled them per step, so you only need to connect/disconnect the trigger from each branch to check the result in the debug dialog. This way you can see through my eyes what I saw during my troubleshooting.

BOM bug report
@Martin : this documentation also serves as the bug report. As far as I understand from the links, any text reader app should strip out the BOM properly when loading a possibly UTF-encoded text file. Whether the text has a BOM or not, when the app detects it is UTF, it should load only the text content into the variable, not the BOM. The BOM will break the JSON and probably a lot of other scripts as well, so it should be eliminated by the Init Variable Text File action.
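As an example of the behavior I am asking for, Python has a codec named `utf-8-sig` that decodes UTF-8 and drops a leading BOM if there is one. This sketch only shows how another environment handles it, not how Automagic is implemented; the sample JSON is hypothetical.

```python
am_bytes = '{"新年快樂": ["紅包拿來"]}'.encode("utf-8")

# Simulate the Notepad file: BOM bytes followed by the same content.
note_bytes = b"\xef\xbb\xbf" + am_bytes

# Plain UTF-8 decoding keeps the BOM as the first character.
plain = note_bytes.decode("utf-8")

# utf-8-sig drops the BOM when present and changes nothing when absent.
from_note = note_bytes.decode("utf-8-sig")
from_am = am_bytes.decode("utf-8-sig")

print(plain.startswith("\ufeff"))
print(from_note == from_am)
```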

Workaround
As the current workaround (AM 1.34.0), using my script from Test 9 will ensure the BOM is removed if it exists.

Code:

if(encodeURL(left(file_text,1)) == "%EF%BB%BF")
  file_text = substring(file_text,1);

Epilogue
Thanks for reading this far. I hope this documentation helps you when something goes wrong, especially in showing how to use the debug dialog to troubleshoot a peculiar case of a "Mysterious Invisible Character".