
Author Topic: Regex Help  (Read 4673 times)


Offline bubzuru

  • Knight
  • **
  • Posts: 395
  • Cookies: 21
  • everything is contained in the data
    • New School Tools
Regex Help
« on: August 13, 2012, 03:46:14 AM »
I'm trying to parse the hidemyass proxy list, but the markup is very strange (I guess it's so people can't parse it lol).

Here is an example:
Code: (html) [Select]
<tr class="altshade"  rel="12113224">
         <td class="leftborder timestamp" rel="1344818823"><span class="updatets ">
32 secs</span></td>
         <td><span><style>
.IoxS{display:none}
.tCrX{display:inline}
.kQk4{display:none}
.RUhO{display:inline}
</style><span class="165">84</span><span style="display:none">220</span><div style="display:none">47</div>.41<span class="kQk4">53</span><div style="display:none">252</div>.<span class="51">108</span><span style="display:none">136</span><span class="IoxS">246</span><span></span>.74</span></td>   
         <td>
8080</td>
         
         <td rel="si"><span class="country"><img src="http://static.hidemyass.com/flags/si.png" alt="flag" /> Slovenia</span></td>
         
         <td> <div class="speedbar response_time" rel="5205">
    <div class="medium" style="width:48%"> </div>
        </div>
         </td>
             <td> <div class="speedbar connection_time" rel="3041">
    <div class="medium" style="width:39%"> </div>
             
        </div>
             </td>
     
             <td>HTTP</td>
             <td class="rightborder">Low</td>
         
         </tr>

The output is something like this:
  32 secs 84.41.108.74 8080 flag Slovenia HTTP Low

I need to extract that info. Anyone got any ideas?
         
« Last Edit: August 13, 2012, 03:46:30 AM by bubzuru »
Damm it feels good to be gangsta
http://bubzuru.comule.com

Offline Deque

  • P.I.N.N.
  • Global Moderator
  • Overlord
  • *
  • Posts: 1203
  • Cookies: 518
  • Programmer, Malware Analyst
Re: Regex Help
« Reply #1 on: August 13, 2012, 10:25:46 AM »
Regex alone is not suitable for this. Use an HTML parser library to get the contents of the table.
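
For example, here is a minimal sketch of that approach in Python, using only the standard library's html.parser. The class name RowTextParser and the trimmed sample row are made up for illustration, and a real page would need error handling. Note that the IP column still needs extra work, because its cell mixes hidden decoy elements with the real digits.

```python
from html.parser import HTMLParser


class RowTextParser(HTMLParser):
    """Collects the text of every <td> in every <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per <tr>
        self._cell = None       # text pieces of the open <td>, or None
        self._in_style = False  # True while inside a <style> block

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._cell = []
        elif tag == "style":
            self._in_style = True

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Collapse runs of whitespace so "\n8080" becomes "8080"
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "style":
            self._in_style = False

    def handle_data(self, data):
        # Skip CSS rules; keep only text that belongs to a cell
        if self._cell is not None and not self._in_style:
            self._cell.append(data)


# Trimmed-down row based on the sample markup (IP column left out)
sample = (
    '<tr class="altshade"><td>\n8080</td>'
    '<td rel="si"><span class="country">'
    '<img src="si.png" alt="flag" /> Slovenia</span></td>'
    '<td>HTTP</td><td class="rightborder">Low</td></tr>'
)

parser = RowTextParser()
parser.feed(sample)
parser.close()
cells = parser.rows[0]
print(cells)
```

Because the parser reads elements rather than rendered text, the image's alt text ("flag") never ends up in the extracted cells.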

Offline Simba

  • Serf
  • *
  • Posts: 47
  • Cookies: 1335
  • programisiai.lt
Re: Regex Help
« Reply #2 on: August 13, 2012, 12:59:07 PM »
Do you need this done automatically?
I believe it's JavaScript that populates the table,
so you would need the generated source code to use regex.
On that page, paste this into the URL bar:
Code: [Select]
javascript:%20var%20win%20=%20window.open();%20win.document.write('<html><head><title>Generated%20HTML%20of%20%20'%20+%20location.href%20+%20'</title></head><pre>'%20+%20document.documentElement.innerHTML.replace(/&/g,%20'&amp;').replace(/</g,%20'&lt;')%20+%20'</pre></html>');%20win.document.close();%20void%200;
and you will get the generated source code.

Offline bubzuru

Re: Regex Help
« Reply #3 on: August 13, 2012, 03:52:00 PM »
I have the generated source, I just need to parse it.
Deque's idea sounds the best, I will look into it.
« Last Edit: August 13, 2012, 03:53:31 PM by bubzuru »

Offline bubzuru

Re: Regex Help
« Reply #4 on: August 14, 2012, 06:14:02 AM »
This is just way too hard, I'm going to need to find a different way.

Offline NeX

  • Peasant
  • *
  • Posts: 74
  • Cookies: 5
Re: Regex Help
« Reply #5 on: August 16, 2012, 02:27:03 AM »
If you're able to extract the contents from the page (i.e. get the data from the table), then the first thing I can come up with is:
Code: [Select]
^(.+)\s(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s(\d{1,5})\s(\w+)\s(\w+)\s(\w+)$
I haven't tested it, but it should work in your case. The extractions are:
1. The time
2. The IP address
3. The port
4. Country
5. Type
6. Speed/anonymity/whatever ?

Oh, and also, I assumed that the data would ALWAYS be valid, i.e. no 999.999.999.999-type IP addresses, no ports bigger than 65535, etc.
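
If that assumption turns out not to hold, the captured strings could be range-checked after matching. A rough Python sketch follows; the helper names is_valid_ipv4 and is_valid_port are made up here, and treating port 0 as invalid is my own assumption.

```python
def is_valid_ipv4(ip):
    """True if ip is four dot-separated decimal numbers, each 0..255."""
    parts = ip.split(".")
    return len(parts) == 4 and all(
        p.isascii() and p.isdigit() and int(p) <= 255 for p in parts
    )


def is_valid_port(port):
    """True if port is a decimal number in 1..65535 (0 rejected by assumption)."""
    return port.isascii() and port.isdigit() and 0 < int(port) <= 65535


print(is_valid_ipv4("84.41.108.74"), is_valid_ipv4("999.999.999.999"))
print(is_valid_port("8080"), is_valid_port("65536"))
```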
I've heard there's a tool for regex (RegexBuddy, if I'm right) to make your life easier XD
If you have any other questions you can ask here or PM me :)




EDIT:
Code: [Select]
http://regexpal.com/
This website says that I forgot a few +'s. Fixed regex:
Code: [Select]
^(.+)\s(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+(\d{1,5})\s+(\w+)\s(\w+)\s+(\w+)$
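
A quick way to try the fixed pattern is a small Python sketch like the one below (parse_row is a made-up helper name). One thing to watch: the sample line in the first post contains the image alt text "flag", which leaves four words after the port while the pattern expects three, so that line only matches once "flag" is stripped. Country names with more than one word would not fit the single `(\w+)` group either.

```python
import re

# The fixed pattern from above, unchanged
ROW_RE = re.compile(
    r"^(.+)\s(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+(\d{1,5})\s+(\w+)\s(\w+)\s+(\w+)$"
)


def parse_row(line):
    """Return (time, ip, port, country, type, level), or None if no match."""
    m = ROW_RE.match(line.strip())
    return m.groups() if m else None


with_flag = "32 secs 84.41.108.74 8080 flag Slovenia HTTP Low"
without_flag = "32 secs 84.41.108.74 8080 Slovenia HTTP Low"

print(parse_row(with_flag))     # None: four words follow the port
print(parse_row(without_flag))  # ('32 secs', '84.41.108.74', '8080', 'Slovenia', 'HTTP', 'Low')
```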
« Last Edit: August 16, 2012, 02:33:14 AM by NeX »

Offline bubzuru

Re: Regex Help
« Reply #6 on: August 16, 2012, 07:22:01 PM »
+1 for the help, but I don't think regex is the way to go.
Me and xzid are working on something. He is an excellent coder btw, he should get more props than he does.

Offline bubzuru

Re: Regex Help
« Reply #7 on: August 20, 2012, 03:13:38 AM »
OK, I got this to work (thanks to xzid). Here is the code for those interested
(C#, HtmlAgilityPack).

Get the column containing the IP info and pass it in as an HtmlNode. The method returns the IP as a string.
Code: (c) [Select]
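        // Requires: using System; using System.Collections.Generic;
        // using System.Text.RegularExpressions; using HtmlAgilityPack;
        public static string DecodeIp(HtmlNode html)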
        {
            string ip = ""; // Will hold our decoded ip
            List<string> DisplayInlineNames = new List<string>(); // Contains our good class names
            List<string> Bits = new List<string>(); // Contains all the bits of the IP

            // Save the class names declared {display:inline} in the cell's
            // style rules into a list (the leading '.' is kept on each name)
            string[] ClassNameList = html.InnerText.Split('}');
            foreach (string str in ClassNameList)
                if (str.Contains("inline")) DisplayInlineNames.Add(str.Substring(0, str.IndexOf("{")).Trim());

            // Store all nodes from column in HtmlNodeCollection
            HtmlNodeCollection IPInfo = html.SelectNodes("span/node()");
            if (IPInfo == null) return ip; // SelectNodes returns null when nothing matches

            // Loop through nodes and grab good ip bits
            foreach (HtmlNode node in IPInfo)
            {
                string classname = "." + node.GetAttributeValue("class", string.Empty); //classname of the node
                string style = node.GetAttributeValue("style", string.Empty); //style att of the node

                // If the style attribute contains "display:inline" (with or without a space) add to bits
                if (style.Replace(" ", "").Contains("display:inline")) Bits.Add(node.InnerText);

                // If the first char in class name is numeric add to bits
                foreach (char c in classname.Replace(".", ""))
                {
                    if (Char.IsNumber(c)) Bits.Add(node.InnerText);
                    break;
                }

                // If the class name is "good" add to bits
                for (int i = 0; i < DisplayInlineNames.Count; i++)
                    if (classname.Contains(DisplayInlineNames[i])) Bits.Add(node.InnerText);

                // If lone text add to bits
                if (!node.OuterHtml.Contains("<")) Bits.Add(node.InnerText);
            }
           
            //
            // Time to join all our bits into an ip
            //
            ip = string.Join(".", Bits); // no trailing '.', and safe when Bits is empty

            // Replace multiple periods with a single one: '...' becomes '.'
            Regex regex = new Regex(@"[.]{2,}", RegexOptions.None);
            ip = regex.Replace(ip, @".");

            return ip; //return decoded ip
        }
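
For anyone not on .NET, the same idea can be roughly sketched in Python with the standard library's html.parser. This is not a port of the C# above, just an illustration under a few assumptions: the names IpCellDecoder and decode_ip are made up, elements with no class are treated as visible unless their style says display:none, and a class counts as visible only if it starts with a digit or is declared display:inline in the cell's style block.

```python
import re
from html.parser import HTMLParser


class IpCellDecoder(HTMLParser):
    """Keeps only the text fragments that are meant to be displayed."""

    def __init__(self):
        super().__init__()
        self.inline_classes = set()  # class names declared display:inline
        self.parts = []              # visible text fragments, in order
        self._stack = []             # visibility flag for each open element
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self._in_style = True
            return
        a = dict(attrs)
        style = (a.get("style") or "").replace(" ", "")
        cls = a.get("class") or ""
        if "display:none" in style:
            visible = False
        elif "display:inline" in style or cls == "" or cls[0].isdigit():
            visible = True
        else:
            visible = cls in self.inline_classes
        self._stack.append(visible)

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False
        elif self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if self._in_style:
            # Remember every class whose rule is display:inline
            self.inline_classes.update(
                re.findall(r"\.([\w-]+)\s*\{\s*display\s*:\s*inline", data)
            )
        elif all(self._stack):
            self.parts.append(data.strip())


def decode_ip(cell_html):
    decoder = IpCellDecoder()
    decoder.feed(cell_html)
    decoder.close()
    return "".join(decoder.parts)


# Contents of the IP cell from the sample row in the first post
sample = (
    "<span><style>\n.IoxS{display:none}\n.tCrX{display:inline}\n"
    ".kQk4{display:none}\n.RUhO{display:inline}\n</style>"
    '<span class="165">84</span><span style="display:none">220</span>'
    '<div style="display:none">47</div>.41<span class="kQk4">53</span>'
    '<div style="display:none">252</div>.<span class="51">108</span>'
    '<span style="display:none">136</span><span class="IoxS">246</span>'
    "<span></span>.74</span>"
)

print(decode_ip(sample))  # 84.41.108.74
```

Since the dots are already part of the visible text nodes, the fragments can simply be concatenated, so no cleanup of doubled periods is needed in this version.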
« Last Edit: August 20, 2012, 02:59:31 PM by bubzuru »

 


